CN114639367B - Song editing method, device and equipment - Google Patents

Song editing method, device and equipment

Info

Publication number
CN114639367B
CN114639367B (application CN202210276635.9A)
Authority
CN
China
Prior art keywords
song
time point
edited
information
clipped
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210276635.9A
Other languages
Chinese (zh)
Other versions
CN114639367A (en)
Inventor
龚韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202210276635.9A
Publication of CN114639367A
Application granted
Publication of CN114639367B
Legal status: Active

Classifications

    • G10H 1/0008: Details of electrophonic musical instruments; associated control or indicating means
    • G06F 16/683: Audio retrieval using metadata automatically derived from the content, e.g. automatically derived transcripts of audio data such as lyrics
    • G06F 16/686: Audio retrieval using manually generated metadata, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G06F 40/30: Handling natural language data; semantic analysis
    • G10H 7/00: Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G10H 2210/031: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2240/005: Data structures for use in electrophonic musical devices; data structures including musical parameters derived from musical analysis
    • G10H 2240/121: Musical libraries, i.e. musical databases indexed by musical parameters
    • G10H 2240/311: MIDI transmission
    • G10H 2250/311: Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The embodiments of the present application provide a song editing method, device and equipment, wherein the method includes: processing the audio file of the song to be edited, extracting the audio feature information of the song to be edited; determining the first audio information of the song to be edited based on the audio feature information of the song to be edited; determining the first structure based on the first audio information of the song to be edited and the first text information of the song to be edited; performing time node correction processing on the second structure based on the specified lyrics file of the song to be edited and the first structure to obtain a third structure. Through this method, the semantic information of the structured segmentation of the song can be used to automatically align, calibrate and time-correct the music clips to be edited. Not only can the semantic structure integrity and coherence of the edited song clips be retained, but also the editing efficiency can be improved and the overhead can be reduced.

Description

Song editing method, device and equipment
Technical Field
The present application relates to the field of multimedia technologies, and in particular, to a song editing method, apparatus, and device.
Background
With the development of the multimedia industry, and especially the rise of short videos, the era of fragmented consumption has changed users' demand for music content. This has driven the development of song editing technology for applications such as ringtones, song previews, and the distribution of chorus segments (also called climax segments). The verse-chorus form is one of the most important and most commonly used musical forms, and each verse-chorus section can be divided into A/B subsections with semantically similar, repeated structure. Currently, song highlights are generally generated either by manual editing or by traditional automatic editing. However, because different people understand song structure differently, manual editing criteria are inconsistent and manual editing is inefficient. Traditional automatic editing can improve efficiency but cannot guarantee the semantic structural integrity of the edited song segments.
Disclosure of Invention
To address these technical problems, the present application provides a song editing method, apparatus, and device, which not only preserve the semantic structural integrity and coherence of the edited song segments, but also improve editing efficiency and reduce overhead.
In a first aspect, an embodiment of the present application provides a song clipping method. The method may be performed by a computer device (e.g., a terminal or server), and the specific method includes:
processing the audio file of the song to be clipped, and extracting the audio characteristic information of the song to be clipped;
determining first audio information of the song to be clipped according to the audio characteristic information of the song to be clipped;
determining a first structure body according to the first audio information of the song to be clipped and the first text information of the song to be clipped;
and carrying out time node correction processing on a second structure body according to the designated lyric file of the song to be clipped and the first structure body to obtain a third structure body, wherein the second structure body comprises part or all of the preset content of the song to be clipped, and the third structure body comprises that content after the time node correction.
By the method, the semantic information of the song structural segments can be utilized to automatically align, calibrate and correct time the music segments to be clipped. The method not only can keep the integrity and continuity of the semantic structure of the song segments after clipping, but also can improve the clipping efficiency and reduce the cost.
The computer device obtains the first audio information of the song to be clipped (for example, audio feature information of the chorus of the song to be clipped); because the chorus provides variability to the tune and is more memorable, this allows the segment to be clipped to be located more accurately.
In one possible implementation, the computer device inputs audio feature information of the song to be clipped into the neural network, and obtains a first audio information probability set of the song to be clipped, wherein the audio feature information comprises CQT local features and MIDI vocal melody features of the song to be clipped;
and determining first audio information of the song to be clipped according to the first audio information probability set, wherein the first audio information comprises the sub-song audio information of the song to be clipped.
It can be seen that, since the CQT local feature conforms to the perceived frequency response of the human ear and the MIDI vocal melody feature can more precisely locate the vocal time points at the boundaries of the first audio information, inputting the CQT local feature and the vocal melody feature of the song to be clipped into the neural network to determine the first audio information improves the accuracy of the result.
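The patent does not specify the network architecture. As a hedged illustration only, the sketch below (plain Python; a single linear layer with made-up random weights stands in for the trained network) shows the shape of the computation: per-frame CQT and vocal-melody feature vectors are concatenated and mapped to a per-frame chorus probability.

```python
import math
import random

def frame_probabilities(cqt_frames, melody_frames, weights, bias):
    """Map concatenated per-frame features to chorus probabilities.

    cqt_frames / melody_frames: lists of equal length, one feature
    vector per time frame. The single linear layer + sigmoid is a
    toy stand-in for the trained neural network described here.
    """
    probs = []
    for cqt, mel in zip(cqt_frames, melody_frames):
        x = cqt + mel                                 # feature concatenation
        z = sum(w * v for w, v in zip(weights, x)) + bias
        probs.append(1.0 / (1.0 + math.exp(-z)))      # sigmoid -> (0, 1)
    return probs

random.seed(0)
cqt = [[random.random() for _ in range(4)] for _ in range(5)]  # 5 frames, 4 CQT bins
mel = [[random.random() for _ in range(2)] for _ in range(5)]  # 5 frames, 2 melody dims
w = [random.uniform(-1, 1) for _ in range(6)]
p = frame_probabilities(cqt, mel, w, bias=0.0)
print(p)
```

In a real system the weights would come from training on labeled chorus/non-chorus frames; the resulting per-frame probabilities form the "first audio information probability set" discussed below.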
In one possible implementation, the computer device calculates an edit distance between any two sentences of lyrics in the song to be clipped based on a specified lyrics file of the song to be clipped;
and obtaining first text information of the song to be clipped according to the editing distance, wherein the first text information comprises a text similarity matrix of the song to be clipped.
Thus, by using the designated lyric file (the dedicated lyric file provided by the present application, which may be called a QRC lyric file) to obtain the first text information, the lyric text corresponding to each time node can be obtained more accurately based on the per-word timestamps of the QRC lyric file, improving editing accuracy.
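A minimal sketch of the similarity-matrix step described above, assuming a normalized similarity of 1 minus the edit distance divided by the longer string's length (the exact normalization is not given in the source):

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance, one rolling row
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def similarity_matrix(lyrics):
    """1.0 on the diagonal; values near 1 mark repeated (similar) lines."""
    n = len(lyrics)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            longest = max(len(lyrics[i]), len(lyrics[j]), 1)
            m[i][j] = 1.0 - edit_distance(lyrics[i], lyrics[j]) / longest
    return m

lines = ["hold me close", "hold me tight", "hold me close"]  # toy lyric lines
sim = similarity_matrix(lines)
```

Repeated lyric lines (here lines 0 and 2) score 1.0 against each other, which is what lets the matrix expose the repeated chorus structure of a song.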
In a possible implementation manner, the computer device divides paragraphs of the song to be clipped according to first text information of the song to be clipped to obtain first time information, wherein the first time information comprises time nodes corresponding to different paragraphs respectively;
performing fuzzy matching between the first time information and the time nodes corresponding to the first audio information of the song to be clipped, and determining the lyric text information corresponding to the first audio information and the lyric text information corresponding to second audio information, wherein the second audio information comprises the main song audio information of the song to be clipped;
and performing structural segmentation on the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information according to the degree of word overlap between lyric texts and the structural similarity of the lyrics in the song to be clipped, and determining a first structure body, which is the structural segmentation result of the song to be clipped.
In this way, by fuzzy matching the time nodes corresponding to the first audio information with the time nodes of the different paragraphs of the song to be clipped, the computer device can assign semantic information to the different paragraph structures, thereby separating the main song (verse) information and the sub-song (chorus) information of the song to be clipped. According to the degree of word overlap between the lyric texts and the structural similarity of the lyrics in the song, the main song and sub-song information can further be divided into structurally symmetric subsections, yielding the semantic structure information of the whole song.
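The fuzzy-matching step above can be sketched as follows. This is an illustrative guess at the mechanism, assuming a fixed tolerance in seconds (the tolerance value and function names are invented for the example): each lyric paragraph is labeled as chorus if its start time lies close to a chorus onset detected from the audio, and as verse otherwise.

```python
def label_paragraphs(paragraph_starts, chorus_starts, tolerance=3.0):
    """Label each lyric paragraph 'chorus' if its start time is within
    `tolerance` seconds of a chorus start detected from the audio,
    otherwise 'verse'. Times are in seconds."""
    labels = []
    for t in paragraph_starts:
        near = any(abs(t - c) <= tolerance for c in chorus_starts)
        labels.append("chorus" if near else "verse")
    return labels

# paragraph boundaries from the lyric file vs. chorus onsets from audio
paras = [0.0, 31.5, 62.0, 93.2]
chorus = [30.0, 92.0]
print(label_paragraphs(paras, chorus))  # → ['verse', 'chorus', 'verse', 'chorus']
```

The tolerance absorbs the small disagreement between lyric timestamps and audio-derived boundaries, which is why the match is "fuzzy" rather than exact.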
In one possible embodiment, the computer device obtains a preset starting point in time and a preset duration of the second structure;
calibrating the preset starting time point and ending time point of the second structure body according to the designated lyric file of the song to be clipped, to obtain the starting time point and ending time point of the second structure body after time node calibration;
and according to the first structural body, performing time node correction processing on the starting time point and the ending time point of the second structural body after the time node is calibrated to obtain a third structural body.
In the embodiment of the present application, the computer device performs time point correction on the calibrated starting and ending time points of the second structure body according to the first structure body, which improves the structural coherence and listening continuity of the second structure body and thus the user's listening experience.
In a possible implementation manner, the computer device updates the first text information according to the specified lyric file, and the updated first text information includes lyric text information corresponding to the first audio information and lyric text information corresponding to the second audio information;
and carrying out calibration processing on the preset starting time point and the ending time point of the second structure body according to the updated first text information and the time information corresponding to the updated first text information to obtain the starting time point and the ending time point of the second structure body after the time node is calibrated.
In the embodiment of the present application, the computer device calibrates the preset starting and ending time points of the second structure body using the QRC lyric file, so that accurate starting and ending times of the second structure body can be obtained, improving editing accuracy.
In one possible embodiment, the computer device obtains, from the first structure, a first time difference value corresponding to a start time point of the second structure after the calibration time node;
acquiring a second time difference value corresponding to the ending time point of a second structure body after the time node is calibrated from the first structure body;
performing time node correction processing on the starting time point of the second structure body after the time node is calibrated according to the starting time point of the second structure body after the time node is calibrated and the first time difference value;
performing time node correction processing on the ending time point of the second structure body after the time node is calibrated according to the ending time point of the second structure body after the time node is calibrated and the second time difference value;
The third structure includes a start time point and an end time point of the second structure after the time node correction processing.
In the embodiment of the present application, since the first structure body contains the semantic structure information of the whole song, performing time node correction on the calibrated starting and ending time points of the second structure body according to the first structure body improves the semantic structural integrity and listening continuity of the resulting third structure body.
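One plausible reading of the correction step above, sketched under stated assumptions: the first and second time difference values are the offsets from the calibrated points to the nearest structural boundaries of the first structure body, and applying them snaps the calibrated points onto those boundaries. The boundary values and function name are invented for illustration.

```python
def correct_time_point(calibrated, boundaries):
    """Snap a calibrated time point onto the nearest boundary of the
    first structure body; the offset applied is the 'time difference
    value' of the correction step. Times in seconds."""
    nearest = min(boundaries, key=lambda b: abs(b - calibrated))
    diff = nearest - calibrated          # first / second time difference
    return nearest, diff

boundaries = [0.0, 28.7, 57.4, 86.1]    # assumed paragraph boundaries (s)
start, d1 = correct_time_point(29.9, boundaries)  # corrected start point
end, d2 = correct_time_point(58.0, boundaries)    # corrected end point
print(start, end)
```

Because the corrected points coincide with paragraph boundaries of the song's semantic structure, a clip cut at these points does not begin or end mid-paragraph.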
In one possible embodiment, the computer device determines the corresponding time length from a starting time point and an ending time point of the second structure after the calibration time node;
And according to the time length, connecting different paragraphs in the first structural body end to obtain a third structural body.
In the embodiment of the present application, the computer device can connect different paragraphs of the first structure body end to end according to the clip duration, improving the coherence and continuity of the song while allowing paragraphs to be spliced freely.
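The end-to-end splicing step can be sketched as below. The greedy in-order selection is an assumption made for illustration; the source only states that paragraphs are chained according to the clip duration.

```python
def splice_paragraphs(paragraphs, target_length):
    """Chain paragraphs end to end, in order, until adding another
    paragraph would exceed the requested clip length.
    Each paragraph is a (name, start, end) triple in seconds."""
    chosen, total = [], 0.0
    for name, start, end in paragraphs:
        duration = end - start
        if total + duration > target_length:
            break                        # never cut a paragraph mid-way
        chosen.append(name)
        total += duration
    return chosen, total

paras = [("verse-A", 0.0, 20.0), ("chorus-A", 20.0, 50.0),
         ("verse-B", 50.0, 70.0), ("chorus-B", 70.0, 100.0)]
clip, length = splice_paragraphs(paras, target_length=60.0)
print(clip, length)
```

Because only whole paragraphs are chained, the spliced clip keeps each segment's semantic structure intact even when the requested duration is shorter than the song.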
In a second aspect, an embodiment of the present application provides a song clipping apparatus, including:
The preprocessing module is used for processing the audio files of the songs to be clipped and extracting the audio characteristic information of the songs to be clipped;
The determining module is used for determining first audio information of the song to be clipped according to the audio characteristic information of the song to be clipped;
The determining module is further used for determining a first structural body according to the first audio information of the song to be clipped and the first text information of the song to be clipped;
The processing module is used for carrying out time node correction processing on the second structure body according to the QRC lyric file of the song to be clipped and the first structure body to obtain a third structure body, wherein the second structure body comprises part or all of the content of the preset song to be clipped, and the third structure body comprises part or all of the content of the preset song to be clipped after the time node correction.
In a third aspect, the embodiment of the application also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the computer program realizes any one of the methods when being executed by the processor.
In a fourth aspect, the present application also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements any of the methods described above.
In a fifth aspect, embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by the embodiment of the application.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a song editing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a CQT local feature provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a MIDI vocal melody feature according to an embodiment of the present application;
FIG. 4 is a graph of short-term climax probability provided by an embodiment of the application;
FIG. 5 is a schematic diagram of filtering a short-time climax probability curve according to an embodiment of the present application;
FIG. 6 is a flow chart of another song clipping method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a lyric text similarity matrix, according to an embodiment of the application;
FIG. 8 is a schematic view of a first structure according to an embodiment of the present application;
FIG. 9 is a schematic illustration of a continuous clip provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a splice clip provided by an embodiment of the present application;
FIG. 11 is a block diagram of a song clipping method provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of a song clipping apparatus provided by an embodiment of the present application;
fig. 13 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
To facilitate an understanding of the disclosed embodiments of the application, some concepts to which embodiments of the application relate are first described. The description of these concepts includes, but is not limited to, the following.
1. QRC lyrics file
A lyric file in Extensible Markup Language (XML) format in which the timing can be precisely controlled down to the point in time of each word in the lyrics.
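The exact QRC schema is not given in this document. The sketch below assumes a simplified, hypothetical XML layout (the `line`/`word` tag names and `start`/`dur` attributes are invented) purely to show how per-word timestamps can be read out.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified QRC-like layout; real QRC files may differ.
QRC = """<lyrics>
  <line start="12000">
    <word start="12000" dur="300">hold</word>
    <word start="12300" dur="250">me</word>
  </line>
</lyrics>"""

def word_times(xml_text):
    """Return (word, start_ms, duration_ms) triples from the XML."""
    root = ET.fromstring(xml_text)
    return [(w.text, int(w.get("start")), int(w.get("dur")))
            for w in root.iter("word")]

print(word_times(QRC))  # → [('hold', 12000, 300), ('me', 12300, 250)]
```

Per-word timestamps like these are what make it possible to calibrate clip boundaries to the exact moment a lyric word begins or ends.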
2. CQT local features
The Constant-Q Transform (CQT) local feature is a nonlinear frequency-domain feature obtained by filtering the time-domain audio signal with a set of constant-Q filters; this feature conforms better to music theory.
3. MIDI vocal melody feature
The Musical Instrument Digital Interface (MIDI) vocal melody feature is a group of features describing the pitch, rhythm, and strength of the human voice, representing the rise and fall of the melody.
4. Main song
The main song (verse) is the main body of a song; its function is to push the melody gradually toward the climax while clearly conveying the story behind the song, giving it a strong narrative quality.
5. Chorus song
The chorus (refrain), which may also be called the climax, is a passage of lyrics that appears multiple times or repeatedly in a song, typically occurring between verse sections. The chorus contrasts with the verse in length, melody, rhythm, and emotion; it provides variability to the tune and is more memorable.
6. Edit distance
The edit distance is a quantitative measure of the difference between two strings: the minimum number of editing operations required to transform one string into the other. In general, the smaller the edit distance, the more similar the two strings.
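The definition above corresponds to the classic dynamic-programming computation; a self-contained sketch with the standard textbook example:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of a's prefix
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # → 3
```

Applied pairwise to lyric lines, small distances flag the repeated lines that mark a song's chorus.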
7. Structured segmentation
Structured segmentation carries rich semantic information and is one of the main forms of musical expression. From the perspective of song content, similar repeated lyrics are typically grouped into a set or paragraph, and the structure of a typical popular song can be divided into alternating verse and chorus paragraphs.
Currently, songs are edited mainly by manual editing or automatic editing. With manual editing, because different people understand song structure differently, the editing criteria are inconsistent and efficiency is very low. Existing automatic editing methods mainly clip by duration or by simple audio signal processing; they cannot identify the verse and chorus paragraphs of a song, so clips may be cut mid-paragraph and the semantic structural integrity of the edited segments cannot be guaranteed.
Based on the above, the embodiment of the application provides a song clipping method, device and equipment. The method utilizes semantic information of song structural segments to automatically align, calibrate and time correct the music segments to be clipped. The method not only can keep the integrity and continuity of the semantic structure of the song segments after clipping, but also can improve the clipping efficiency and reduce the cost.
It should be noted that, in a specific implementation, the above solution may be implemented by a computer device, where the computer device may be a terminal or a server, where the terminal mentioned herein may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart watch, a smart tv, a smart vehicle terminal, etc., where the server mentioned herein may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content distribution networks (Content Delivery Network, CDN), and big data and artificial intelligent platforms, etc., and is not limited herein.
In order to facilitate understanding of the embodiments of the present application, a specific implementation of the song clipping method will be described in detail below by taking a computer device as an example to execute the song clipping method.
Fig. 1 is a flowchart of a song clipping method according to an embodiment of the present application. As shown in fig. 1, the method may be performed by a computer device and includes the following steps S101-S102.
S101, processing the audio file of the song to be clipped, and extracting the audio characteristic information of the song to be clipped.
Where the song to be clipped refers to a piece of song (including audio files and lyrics files) specified by the user. For example, the song to be clipped includes a plurality of song segments (e.g., a main song segment and a sub song segment), and lyrics corresponding to each song segment, respectively. Optionally, the lyrics file also includes non-lyrics information (e.g., title, singer, make, mix, etc. information in the lyrics text).
Optionally, the file format of the audio file of the song to be clipped includes, but is not limited to, MP3, MP4, waveform audio file format (Wave Audio File Format, WAV), etc.
In an alternative embodiment, extracting the audio feature information of the song to be clipped by the computer device includes extracting the CQT local features and MIDI vocal melody features of the song. Fig. 2 is a schematic diagram of a CQT local feature according to an embodiment of the present application. As shown in fig. 2, the CQT local feature is a nonlinear time-frequency spectrum transformed on a base-2 logarithmic (log2) scale. In fig. 2, the abscissa represents time and the ordinate represents frequency; when pitches are distributed on a log2 logarithmic scale, the feature matches the perceived frequency response of the human ear (the higher the frequency, the lower the sensitivity). Fig. 3 is a schematic diagram of a MIDI vocal melody feature according to an embodiment of the present application; in fig. 3, the abscissa represents time and the ordinate represents the pitch of the voice. This feature has strong representational power for the human voice and can more accurately locate the vocal time points at sub-song segment boundaries.
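The base-2 logarithmic spacing described above can be made concrete. In a constant-Q filter bank the center frequencies are geometrically spaced, f_k = f_min * 2^(k/B), and every filter shares the same quality factor Q = 1/(2^(1/B) - 1). The sketch below computes this grid; f_min = 32.70 Hz (approximately the note C1) and B = 12 bins per octave are common choices assumed for illustration, not values from the patent.

```python
def cqt_center_frequencies(f_min=32.70, bins_per_octave=12, n_bins=48):
    """Center frequencies of a constant-Q filter bank: geometrically
    (log2) spaced, so every octave spans `bins_per_octave` bins."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def q_factor(bins_per_octave=12):
    """Quality factor shared by all filters (hence 'constant Q')."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

freqs = cqt_center_frequencies()
ratio = freqs[1] / freqs[0]   # constant ratio between neighbouring bins
```

Because the ratio between adjacent bins is a fixed 2^(1/12), one bin per semitone, the frequency axis of the resulting spectrum is exactly the log2 pitch scale shown in fig. 2.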
S102, determining first audio information of the song to be clipped according to the audio characteristic information of the song to be clipped.
Wherein the first audio information is a sub-song clip of the song to be clipped. For example, if the song to be clipped includes a plurality of sub-song segments, the first audio information may include audio information of each of the plurality of sub-song segments.
In an alternative embodiment, the computer device determining the first audio information of the song to be clipped may include the following steps: inputting the audio feature information of the song to be clipped into a neural network to obtain a first audio information probability set of the song to be clipped, and determining, according to the first audio information probability set, the first audio information of the song to be clipped, which includes the sub-song audio information of the song to be clipped.
Specifically, for example, the probability set includes {P1, P2, …, Pn}, where n indexes the time points and each probability represents the likelihood that the corresponding time point belongs to the climax part. The computer device may obtain a first audio information probability curve from the first audio information probability set. For example, fig. 4 is a graph of climax probability provided by an embodiment of the application; the climax probability curve is the first audio information probability curve. The abscissa represents the index of the time point and the ordinate represents the probability.
The computer device may determine the first audio information of the song to be clipped based on the first audio information probability curve, i.e., the computer device may determine the climax segments (i.e., the chorus segments) of the song to be clipped based on fig. 4. As shown in fig. 4, the probability is greatest between point a and point b, so the segment between point a and point b can be preliminarily considered a climax segment of the song to be clipped. Similarly, based on the probability values, the segment between point c and point d can be considered another climax segment of the song to be clipped.
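The reading of fig. 4 described above — high-probability stretches between (a, b) and (c, d) marking climax candidates — can be sketched as a thresholding pass over the probability set. The threshold value is a hypothetical choice; the patent selects segments from the curve rather than prescribing a rule.

```python
import numpy as np

def climax_segments(probs, threshold=0.5):
    """Return (start_index, end_index) pairs of contiguous runs where the
    per-frame climax probability exceeds `threshold` (a hypothetical cutoff).
    Each index corresponds to one time point of the probability set {P1..Pn}.
    """
    above = probs > threshold
    # Rising and falling edges of the boolean mask mark segment boundaries.
    edges = np.diff(above.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if above[0]:
        starts = np.r_[0, starts]
    if above[-1]:
        ends = np.r_[ends, len(probs)]
    return list(zip(starts, ends))

p = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.8, 0.3])
segs = climax_segments(p)  # two high-probability regions, analogous to (a, b) and (c, d)
```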
In an embodiment of determining the first audio information of the song to be clipped, the computer device may further input the audio feature information of the song to be clipped into the neural network to obtain a first audio information probability curve of the song to be clipped, and determine the first audio information of the song to be clipped according to the first audio information probability curve, wherein the first audio information comprises the sub-song audio information of the song to be clipped. For example, the computer device may determine a climax clip (i.e., a chorus clip) of a song to be clipped based on fig. 4.
It will be appreciated that in this embodiment the first audio information probability curve may be obtained directly, without first constructing a probability set.
In an embodiment of determining the first audio information of the song to be clipped according to the first audio information probability curve, the computer device may obtain a first audio filtering curve by filtering the first audio information probability curve, and determine the first audio information of the song to be clipped according to the first audio filtering curve.
In an embodiment of determining the first audio information of the song to be clipped according to the first audio filtering curve, the computer device may determine the time information and the confidence of the first audio information according to a maximum point in the filtering curve and the minimum point after that maximum point. The time information of the first audio information includes a start time point and an end time point, and the confidence of the first audio information includes a start confidence and an end confidence. The abscissa and ordinate of the maximum point in the filtering curve are respectively the start time point and the start confidence of the first audio information, and the abscissa and ordinate of the minimum point after the maximum point are respectively the end time point and the end confidence. The computer device then calculates the average confidence of the first audio information from the start confidence and the end confidence, and determines the first audio information of the song to be clipped accordingly.
In this embodiment, the computer device calculates the average confidence of the first audio information in the song to be clipped, and determines the first audio information of the song to be clipped according to the average confidence, so that reliability of the first audio information can be ensured.
Alternatively, the computer device may determine the first audio information by setting a threshold interval [a, b] for the average confidence. For example, if the average confidence of the first audio information is c and c ∈ [a, b], the first audio information with average confidence c may be used as the first audio information of the song to be clipped. For another example, if the average confidence of the first audio information is d and d ∉ [a, b], the first audio information with average confidence d may not be used as the first audio information of the song to be clipped.
Fig. 5 is a schematic diagram of filtering a short-time climax probability curve according to an embodiment of the present application. As shown in fig. 5, the computer device may obtain a filter curve v_f by filtering the short-time climax probability curve v_p with a Haar filter, and may obtain the first audio information of the song to be clipped according to the filter curve v_f. First, the computer device inputs the CQT local features and MIDI vocal melody features of the song to be clipped into the neural network, predicts a short-time climax probability once per 500 ms frame to obtain a short-time climax probability set of the song to be clipped, and obtains the climax probability curve v_p from that set. The computer device slides a Haar filter of width 20 s over the entire climax probability curve v_p, resulting in the filtered curve v_f. The time of the maximum point in the filter curve v_f (i.e., point A in fig. 5) is taken as the climax start time point t_c_start, and the value of v_f at t_c_start as the start confidence s_c_start; the time of the minimum point after the maximum point (i.e., point B in fig. 5) is taken as the climax end time point t_c_end, and the value of v_f at t_c_end as the end confidence s_c_end. The average confidence of the climax segment is then calculated from the start confidence s_c_start and the end confidence s_c_end.
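Under stated assumptions, the Haar filtering and extremum search just described might look as follows. The kernel shape (negative first half, positive second half) and the averaging of the two confidences as an arithmetic mean are assumptions; the patent shows the filter only schematically and does not reproduce its averaging formula in the text.

```python
import numpy as np

def haar_filter(v_p, width):
    """Slide a Haar step kernel (first half -1, second half +1, normalized)
    over the climax probability curve v_p to obtain the filter curve v_f.
    The response peaks where the probability rises sharply and dips where
    it falls back down."""
    half = width // 2
    kernel = np.concatenate([-np.ones(half), np.ones(half)]) / width
    # Reversing the kernel turns convolution into correlation; 'same'
    # keeps v_f time-aligned with v_p.
    return np.convolve(v_p, kernel[::-1], mode="same")

def climax_from_filter(v_f):
    """Start = global maximum of v_f; end = minimum after that maximum.
    The average confidence is taken here as the arithmetic mean of the two
    extremum values (an assumption; the patent's formula is not shown)."""
    t_start = int(np.argmax(v_f))
    t_end = t_start + int(np.argmin(v_f[t_start:]))
    s_start, s_end = v_f[t_start], v_f[t_end]
    return t_start, t_end, (s_start + s_end) / 2

# A toy curve: 20 low frames, 20 high frames (the "climax"), 20 low frames.
v_p = np.concatenate([np.zeros(20), np.ones(20), np.zeros(20)])
v_f = haar_filter(v_p, width=10)
t_start, t_end, avg = climax_from_filter(v_f)
```

On this toy curve the detected boundaries land on the edges of the high-probability plateau, mirroring points A and B in fig. 5.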
S103, determining a first structural body according to the first audio information of the song to be clipped and the first text information of the song to be clipped.
The first text information is lyric text corresponding to the song to be clipped. For example, the first text information includes lyrics text of a main song segment and lyrics text of a sub song segment in the song to be clipped. Optionally, the first text information may further comprise a lyric text similarity matrix of the song to be clipped.
The first structure is the structural segmentation result of the song to be clipped and contains the semantic structure information of the song. For example, the first structure includes the main song segments, the sub-song segments, and the character A and character B small paragraphs of the song to be clipped.
S104, performing time node correction processing on the second structure body according to the designated lyric file of the song to be clipped and the first structure body to obtain a third structure body.
Wherein the second structure includes part or all of the content of the preset song to be clipped, and may also be called the segment to be clipped. For example, the second structure may be one or more segments of the preset song to be clipped. The second structure has a preset start time point and a preset duration, from which its end time point can be obtained. For example, if the preset start time point is 00:30.21 in the song to be clipped and the preset duration is 20 s, the end time point is 00:50.21.
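The MM:SS.xx arithmetic in the example (00:30.21 plus a 20 s duration giving 00:50.21) can be sketched as follows; the helper names are illustrative, not from the patent.

```python
def parse_ts(ts):
    """Parse an 'MM:SS.xx' timestamp into seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

def format_ts(t):
    """Format seconds back into 'MM:SS.xx'."""
    minutes, seconds = divmod(t, 60)
    return f"{int(minutes):02d}:{seconds:05.2f}"

# End time point = preset start time point + preset duration.
end = format_ts(parse_ts("00:30.21") + 20)
```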
The third structure includes part or all of the content of the song to be clipped after time node correction; the second structure after time node correction may also be called the clipped audio. The third structure may further include the start time point and the end time point of the second structure after the time node correction processing.
In an alternative implementation, the computer device performs time node correction processing on the second structure according to the specified lyric file of the song to be clipped (for example, a QRC lyric file) and the first structure to obtain the third structure as follows: obtaining the preset start time point and preset duration of the second structure; calibrating the preset start time point and the preset end time point of the second structure according to the specified lyric file of the song to be clipped to obtain the start time point and end time point of the second structure after time node calibration; and performing time node correction processing on the calibrated start time point and end time point of the second structure according to the first structure to obtain the third structure.
By adopting the embodiment of the application, the music segment to be clipped is automatically aligned, calibrated, and time-corrected using the semantic information of the song's structural segments. This not only preserves the integrity and continuity of the semantic structure of the clipped song segment, but also improves clipping efficiency and reduces cost. In addition, since the lyric text in a QRC lyric file supports multiple languages, the embodiment of the application can support song clipping in multiple languages, including minor languages.
Fig. 6 is a flowchart of another song clipping method according to an embodiment of the present application. As shown in fig. 6, the method described in this embodiment includes steps S601a-S604. It should be noted that a step with suffix a (e.g., S601a and S602a) represents an operation performed on the audio information of the song to be clipped, and a step with suffix b (e.g., S601b and S602b) represents an operation performed on the second structure.
S601a, processing the audio file of the song to be clipped, and extracting the audio characteristic information of the song to be clipped.
S602a, determining first audio information of the song to be clipped according to the audio characteristic information of the song to be clipped.
The specific process of steps S601a and S602a may refer to the descriptions of S101 and S102, and will not be repeated here.
S603, determining a first structural body according to the first audio information of the song to be clipped and the first text information of the song to be clipped.
In an alternative embodiment, the computer device may include a preprocessing portion, a multimodal fusion portion, and a post-processing portion when determining the first structure based on the first audio information of the song to be clipped and the first text information of the song to be clipped.
First, the preprocessing part includes obtaining the first text information of the song to be clipped based on the QRC lyric file of the song to be clipped. Specifically, the computer device acquires the QRC lyric file of the song to be clipped, calculates the edit distance between any two lyric lines of the song to be clipped based on the QRC lyric file, and obtains the first text information of the song to be clipped according to the edit distances, wherein the first text information includes the text similarity matrix of the song to be clipped.
Alternatively, the computer device may define the file format of the QRC lyric file as the QRC format.
Fig. 7 is a schematic diagram of a lyric text similarity matrix according to an embodiment of the present application. In fig. 7, the abscissa and ordinate represent the indexes of the lyric lines; for example, (10, 35) represents the similarity between the 10th lyric line and the 35th lyric line.
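A minimal sketch of the pairwise edit-distance computation behind such a similarity matrix. The normalization of a distance into a similarity value in [0, 1] is an assumed choice; the patent only says the matrix is derived from edit distances between lyric lines.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two lyric lines (single-row DP)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def similarity_matrix(lines):
    """Pairwise similarity in [0, 1]: 1 - normalized edit distance.
    (The normalization by the longer line's length is one common choice,
    assumed here.)"""
    n = len(lines)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            d = edit_distance(lines[i], lines[j])
            sim[i, j] = sim[j, i] = 1 - d / max(len(lines[i]), len(lines[j]), 1)
    return sim
```

Repeated chorus lines produce high-similarity off-diagonal blocks, which is what the path-search step later merges into segments.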
Optionally, the preprocessing part further includes dividing the song to be clipped into paragraphs according to the first text information to obtain first time information, wherein the first time information includes the time nodes respectively corresponding to the different paragraphs;
and the multimodal fusion part includes fuzzily matching the first time information with the time nodes corresponding to the first audio information of the song to be clipped, and determining the lyric text corresponding to the first audio information and the lyric text corresponding to the second audio information, wherein the second audio information includes the main song audio information of the song to be clipped. Here, multimodal refers to information in multiple sources, media, or forms, such as text, audio, and images; multimodal fusion refers to fusing multiple kinds of such information.
In the fuzzy matching of the first time information with the time nodes corresponding to the first audio information, the first time information includes the time nodes corresponding to the different paragraphs of the song to be clipped, i.e., the start and end time points of the main song segments, the start and end time points of the sub-song segments, and so on. The time node corresponding to the first audio information includes the start and end time points of the chorus segment. Matching the time nodes of the different paragraphs against the start and end time points of the chorus segment is called fuzzy matching.
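The fuzzy matching described above can be sketched as a tolerance-based comparison between text-side paragraph boundaries and the audio-side chorus boundaries; the tolerance value and the tuple layout are illustrative assumptions, not from the patent.

```python
def fuzzy_match(paragraphs, chorus_start, chorus_end, tolerance=2.0):
    """Label the paragraphs whose start/end times fall within `tolerance`
    seconds (a hypothetical value) of the chorus boundaries detected from
    the audio; those paragraphs supply the chorus lyric text."""
    return [
        label
        for label, start, end in paragraphs
        if abs(start - chorus_start) <= tolerance
        and abs(end - chorus_end) <= tolerance
    ]

# Text-side paragraphs as (label, start_s, end_s); the audio-side detection
# put the chorus roughly between 42.1 s and 74.3 s.
paragraphs = [("V1", 10.0, 42.5), ("C1", 43.0, 75.0), ("V2", 75.5, 108.0)]
matched = fuzzy_match(paragraphs, 42.1, 74.3)
```

The match is "fuzzy" because the audio-derived and text-derived time nodes rarely agree exactly; a tolerance window absorbs the discrepancy.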
Finally, the post-processing part includes structurally segmenting the lyric text corresponding to the first audio information and the lyric text corresponding to the second audio information according to the word overlap of each lyric line and the structural similarity of the lyrics in the song to be clipped, and determining the first structure, i.e., the structural segmentation result of the song to be clipped.
The specific implementation flows of the preprocessing part, the multimodal fusion part, and the post-processing part are exemplified below, assuming the first audio information is the chorus segment of the song to be clipped.

When determining the first structure, the computer device obtains the QRC lyric file of the song to be clipped and calculates the edit distance between any two lyric lines based on the QRC lyric file to obtain the text similarity matrix of the song to be clipped. Based on the text similarity matrix, lyric segments whose similarity is greater than a preset threshold are merged using an optimal path search algorithm, yielding the segmented song to be clipped (i.e., main song segments and sub-song segments) and the time nodes corresponding to the different paragraphs.

The time nodes corresponding to the different paragraphs are then fuzzily matched with the start and end time points of the chorus segment determined in step S602a, to determine the lyric text corresponding to the chorus segment and the lyric text corresponding to the main song segment.

Finally, the lyric text corresponding to the sub-song segments and the lyric text corresponding to the main song segments are divided into structurally symmetric character A and character B small paragraphs according to the word overlap of each lyric line and the structural similarity of the lyrics, determining the first structure of the song to be clipped. The first structure contains the semantic structure information of the song to be clipped: the main song segments, the sub-song segments, and the character A and character B small paragraphs. The first structure (i.e., the semantic structure of the song to be clipped) can be written as
Sec = [V1, A1, B1, C1, A2, B2, C2, …, Vn, Cn, An, Bn], n ∈ [1, N], where V represents a main song paragraph, C represents a sub-song paragraph, A represents a character A paragraph, B represents a character B paragraph, and N represents the total number of lyric lines of the song to be clipped. Fig. 8 is a schematic diagram of a first structure provided in an embodiment of the present application; as shown in fig. 8, the song to be clipped includes two main song segments V1 and V2, two sub-song segments C1 and C2, four character A small paragraphs A1, A2, A3, and A4, and three character B small paragraphs B1, B2, and B4.
S601b, acquiring a preset starting time point and a preset duration of the second structure.
For example, the computer device may acquire the preset start time point t_u_start and the preset duration l_dur of the second structure from the user side.
S602b, performing calibration processing on a preset starting time point and an ending time point of the second structure body according to the designated lyric file of the song to be clipped, and obtaining the starting time point and the ending time point of the second structure body after the calibration time node.
In an alternative implementation, the computer device calibrates the preset start time point and end time point of the second structure according to the specified lyric file of the song to be clipped as follows: updating the first text information according to the specified lyric file, wherein the updated first text information includes the lyric text corresponding to the first audio information and the lyric text corresponding to the second audio information; and calibrating the preset start time point and end time point of the second structure according to the updated first text information and its corresponding time information, to obtain the start time point and end time point of the second structure after time node calibration.
Optionally, assuming the specified lyric file is a QRC lyric file, in an embodiment of updating the first text information according to the specified lyric file, the computer device uses the QRC lyric file and a filtering module to eliminate the prelude part of the song to be clipped and non-lyric song information in the lyric text, such as title, singer, production, and mixing information.
In this embodiment, calibrating the preset start time point and end time point of the second structure according to the updated first text information and its corresponding time information proceeds as follows: the preset start time point of the second structure is processed with the updated first text information to obtain a first start time point of the second structure; a first end time point of the second structure is obtained from the first start time point and the preset duration; and the first start time point and the first end time point are calibrated with the time information corresponding to the updated first text information to obtain a second start time point and a second end time point, which are the start time point and end time point of the second structure after time node calibration.
For example, assume that the preset start time point of the second structure obtained by the computer device from the user side is t_u_start, and the preset duration is l_dur. The computer device may use the lyric text content of the QRC lyric file and the filtering module to remove the prelude part of the song to be clipped and non-lyric song information in the lyric text (title, singer, production, mixing, etc.), obtaining updated lyric text. The preset start time point t_u_start is processed according to the updated lyric text to determine the first start time point t′_u_start of the second structure, and the preset duration l_dur is added to t′_u_start to obtain the estimated end time point t′_u_end of the second structure. Then, the accurate time information of the QRC lyric file is used to align the first start time point t′_u_start and the end time point t′_u_end, yielding t″_u_start and t″_u_end, which are the start time point and end time point of the second structure after time node calibration.
Optionally, the computer device may further directly perform calibration processing on a preset starting time point and an ending time point of the second structure body according to the updated first text information and the updated time information corresponding to the first text information, so as to obtain the starting time point and the ending time point of the second structure body after the calibration time node.
For example, assume that the preset start time point of the second structure obtained by the computer device from the user side is t_u_start and the preset duration is l_dur. The computer device may first remove the prelude part of the song to be clipped and non-lyric song information in the lyric text (title, singer, production, mixing, etc.) according to the lyric text content of the QRC lyric file and the filtering module to obtain updated lyric text; then process the preset start time point t_u_start according to the updated lyric text and its corresponding time information to determine the start time point t″_u_start of the second structure; and finally add the preset duration l_dur to t″_u_start to determine the estimated end time point t″_u_end of the second structure. Here t″_u_start and t″_u_end are the start time point and end time point of the second structure after time node calibration.
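One way to realize this calibration step — the patent does not spell out the snapping rule — is to move each time point to the nearest lyric-line start time taken from the QRC file's per-line timing:

```python
def calibrate(t, line_starts):
    """Snap a time point (seconds) to the nearest lyric-line start time
    from the QRC file. Nearest-neighbour snapping is an assumption; the
    patent only says the QRC timing is used to align the points."""
    return min(line_starts, key=lambda s: abs(s - t))

# Hypothetical lyric-line start times (seconds).
line_starts = [28.4, 31.0, 35.2, 39.8, 44.1]
t_start = calibrate(30.21, line_starts)     # nearest line start is 31.0
t_end = calibrate(30.21 + 20, line_starts)  # 50.21 snaps back to 44.1
```

Snapping to line starts ensures the calibrated clip never begins or ends in the middle of a sung line.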
S604, performing time node correction processing on the starting time point and the ending time point of the second structure body after the time node is calibrated according to the first structure body, so as to obtain a third structure body.
In an alternative embodiment, the computer device performs time node correction on the calibrated start time point and end time point of the second structure according to the first structure as follows: obtaining from the first structure a first time difference corresponding to the calibrated start time point of the second structure, and obtaining from the first structure a second time difference corresponding to the calibrated end time point of the second structure. Time node correction is then performed on the calibrated start time point according to the first time difference, and on the calibrated end time point according to the second time difference. The third structure includes the start time point and end time point of the second structure after time node correction. This way of obtaining the third structure may also be called continuous clipping, i.e., a clipping mode that maintains continuity from the start to the end of the segment.
The first time difference includes the differences between the calibrated start time point of the second structure and the boundary times of the two paragraphs adjacent to the paragraph in which it falls; the second time difference includes the corresponding differences for the calibrated end time point.
Fig. 9 is a schematic diagram of continuous clipping provided by an embodiment of the present application. As shown in fig. 9, assume that the first structure is the Sec described in step S603, and the calibrated start and end time points of the second structure are t″_u_start and t″_u_end, respectively. When performing time node correction on t″_u_start and t″_u_end according to the first structure, the computer device may obtain from the first structure Sec the differences Δup and Δdown between each time point and the boundary times of the two paragraphs adjacent to the paragraph in which it falls, and take the paragraph boundary point with the minimum difference as the corrected time; in formula (1), t‴_u_start denotes the start time point of the second structure after time node correction. For example, as shown in fig. 9, where t″_u_start is 00:51.68, the boundary time of the upper paragraph adjacent to the paragraph containing t″_u_start is 00:42.30, and the boundary time of the lower adjacent paragraph is 01:02.59, so Δup = 00:51.68 - 00:42.30 = 00:09.38 and Δdown = 01:02.59 - 00:51.68 = 00:10.91. Since Δup < Δdown, t″_u_start is corrected to the paragraph boundary time point where Δup lies, and by formula (1) t‴_u_start = 00:51.68 - 00:09.38 = 00:42.30. The corrected end time point t‴_u_end is obtained from the calibrated end time point t″_u_end in the same way. Finally, the actual duration of the second structure after correction according to the first structure is l‴_dur in formula (2) below, which is approximately equal to the preset duration.
t‴_u_start = t″_u_start ± min(Δup, Δdown) (1)
l‴_dur = t‴_u_end − t‴_u_start ≈ l_dur (2)
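Formula (1) can be exercised with the numbers from the worked example (t″_u_start = 00:51.68, adjacent paragraph boundaries 00:42.30 and 01:02.59). The function below is an illustrative sketch working in seconds; its name is not from the patent.

```python
def correct_boundary(t, boundaries):
    """Formula (1): shift a calibrated time point to the nearer of the two
    paragraph boundaries adjacent to it (t''' = t'' +/- min(dup, ddown)).
    Returning the chosen boundary is the same as applying the shift."""
    up = max(b for b in boundaries if b <= t)    # preceding boundary
    down = min(b for b in boundaries if b > t)   # following boundary
    return up if t - up < down - t else down

# 00:51.68 with adjacent boundaries 00:42.30 and 01:02.59 (62.59 s):
# delta_up = 9.38 < delta_down = 10.91, so the point snaps up to 42.30 s.
corrected = correct_boundary(51.68, [42.30, 62.59])
```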
In another alternative embodiment, the computer device performs time node correction on the calibrated start and end time points of the second structure according to the first structure as follows: determining the target duration between the calibrated start time point and end time point of the second structure, and joining different paragraphs of the first structure end to end according to the target duration to obtain the third structure. This way of obtaining the third structure may also be called splice clipping, i.e., a clipping mode in which different main song and sub-song segments, or character A and character B small paragraphs, are freely spliced together according to the user's needs while satisfying the clip duration.
Fig. 10 is a schematic diagram of splice clipping according to an embodiment of the present application. As shown in fig. 10, assuming the first structure is the Sec described in step S603, the computer device may freely splice together different main song and sub-song segments or character A and character B small paragraphs of the first structure while satisfying the clip duration, i.e., splice the V1, C1, and C2 segments of Sec end to end to obtain the third structure. At each splice point, the computer device may apply a fade-out and fade-in of a seconds to further reduce the abruptness of the splice, where a may be 0.5 s. This manner of clipping may be referred to as splice clipping.
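The fade at the splice point (a may be 0.5 s) might be sketched like this with numpy; the linear ramp shape is an assumption — the patent does not specify the fade curve.

```python
import numpy as np

def splice_with_fade(segments, sr=44100, fade_s=0.5):
    """Join audio segments end to end, fading out the tail of each segment
    and fading in the head of the next over `fade_s` seconds to soften the
    joint (the patent suggests a fade of about 0.5 s)."""
    n = int(fade_s * sr)
    out = segments[0].astype(float).copy()
    for seg in segments[1:]:
        seg = seg.astype(float).copy()
        ramp = np.linspace(1.0, 0.0, n)
        out[-n:] *= ramp        # fade out the previous segment's tail
        seg[:n] *= ramp[::-1]   # fade in the next segment's head
        out = np.concatenate([out, seg])
    return out

# Two 100-sample unit segments at a toy 40 Hz rate, 0.5 s (20-sample) fades.
mixed = splice_with_fade([np.ones(100), np.ones(100)], sr=40, fade_s=0.5)
```

A crossfade (overlapping the ramps) is a common alternative; the sequential fade-out/fade-in here follows the end-to-end splicing the text describes.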
Alternatively, after obtaining the third structure, the computer device may cut the third structure out of the audio file of the song to be clipped with an audio cutting tool, according to the start time point and end time point of the third structure (i.e., the start and end time points of the second structure after time node correction), and output it. The third structure is the clipped song segment.
By adopting the embodiment of the application, the music segment to be clipped is automatically aligned, calibrated, and time-corrected using the semantic information of the song's structural segments, so that the semantic structural integrity and auditory continuity of the clipped song segment are maintained, clipping efficiency is improved, and cost is reduced.
In addition, the embodiment of the application supports clipping a specified song from any start point for any duration (i.e., any continuous time region), and also supports freely splicing semantic paragraphs to a target duration, so that structurally complete and aurally continuous music segments can be clipped automatically and flexibly for different scenarios. For example, the application can be applied to scenarios such as short-video soundtracks, music games, ringtones, and choruses, without limitation.
Fig. 11 is a framework diagram of a song clipping method according to an embodiment of the present application, corresponding to steps S601a to S604 above. The computer device first extracts the CQT local features and MIDI vocal melody features of the song to be clipped, then inputs them into the neural network to determine the chorus segment. Combining the determined chorus segment with multimodal fusion, the structural segmentation result of the song to be clipped is obtained. Then, according to the preset start time and preset duration of the segment to be clipped, the preset start and end time points are aligned and calibrated based on the QRC lyric file, yielding the aligned and calibrated start and end time points of the segment to be clipped. These are then corrected using the structural segmentation result of the song, so that the segment to be clipped is adaptively corrected to the nearest-neighbor paragraph boundaries of the paragraphs containing its start and end time points. Finally, the clipped song segment is cut out by a song clipping tool based on the corrected time points. It can be seen that, by adopting the embodiment of the application, the segment to be clipped is automatically aligned, calibrated, and corrected using the semantic information of the song's structural segments and the QRC lyric file, so that the semantic structural integrity and consistency of the clipped segment can be maintained.
In addition, compared with manual clipping, the method saves time and labor and improves clipping efficiency, thereby reducing clipping overhead; compared with existing automatic clipping methods, it improves the success rate and coverage rate of song clipping.
Fig. 12 is a schematic diagram of a song clipping apparatus according to an embodiment of the present application. The song clipping apparatus described in this embodiment may include the following portions:
The preprocessing module 1201 is used for processing the audio file of the song to be clipped and extracting the audio characteristic information of the song to be clipped;
a determining module 1202, configured to determine first audio information of a song to be clipped according to audio feature information of the song to be clipped;
the determining module 1202 is further configured to determine a first structure according to the first audio information of the song to be clipped and the first text information of the song to be clipped;
The processing module 1203 is configured to perform time node correction processing on a second structure according to the specified lyric file of the song to be clipped and the first structure, to obtain a third structure, where the second structure includes a preset part or all of the content of the song to be clipped, and the third structure includes that part or all of the content after time node correction.
In an alternative embodiment, the determining module 1202, when configured to determine the first audio information of the song to be clipped according to the audio feature information of the song to be clipped, is specifically configured to:
inputting the audio characteristic information of the song to be clipped into a neural network to obtain a first audio information probability set of the song to be clipped, wherein the audio characteristic information comprises CQT local characteristics and MIDI vocal melody characteristics of the song to be clipped;
and determining first audio information of the song to be clipped according to the first audio information probability set, where the first audio information includes the chorus audio information of the song to be clipped.
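As a hedged illustration of how such a probability set might yield chorus segments, the sketch below thresholds a per-frame chorus-probability sequence and merges consecutive frames; the thresholding-and-merging scheme is an assumption, not the decoder specified by this application.

```python
from typing import List, Tuple


def probs_to_segments(probs: List[float], hop_sec: float = 1.0,
                      thresh: float = 0.5) -> List[Tuple[float, float]]:
    """Turn a per-frame chorus-probability sequence into (start_sec, end_sec)
    chorus segments by thresholding and merging consecutive frames."""
    segs, start = [], None
    for i, p in enumerate(probs + [0.0]):  # trailing sentinel flushes the last run
        if p >= thresh and start is None:
            start = i
        elif p < thresh and start is not None:
            segs.append((start * hop_sec, i * hop_sec))
            start = None
    return segs
```

For example, `probs_to_segments([0.1, 0.8, 0.9, 0.2, 0.7, 0.7])` merges frames 1–2 and 4–5 into two chorus segments.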
In an alternative embodiment, the processing module 1203 is further configured to calculate an edit distance between any two lyrics in the song to be clipped based on the specified lyrics file of the song to be clipped;
and obtaining first text information of the song to be clipped according to the edit distance, where the first text information includes a text similarity matrix of the song to be clipped.
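A minimal sketch of the edit-distance computation and the resulting text similarity matrix follows; normalizing by the longer line to get a similarity in [0, 1] is an assumed convention, not one fixed by this application.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two lyric lines (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]


def similarity_matrix(lyrics: List[str]) -> List[List[float]]:
    """Text similarity matrix: 1 minus edit distance normalized by the longer line."""
    n = len(lyrics)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            longest = max(len(lyrics[i]), len(lyrics[j]), 1)
            sim[i][j] = 1.0 - edit_distance(lyrics[i], lyrics[j]) / longest
    return sim
```

Identical lyric lines (as in repeated choruses) score 1.0 against each other, which is what makes the matrix useful for paragraph division.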
In an alternative embodiment, the determining module 1202, when configured to determine the first structure according to the first audio information of the song to be clipped and the first text information of the song to be clipped, is specifically configured to:
according to the first text information of the song to be clipped, dividing paragraphs of the song to be clipped to obtain first time information, wherein the first time information comprises time nodes corresponding to different paragraphs respectively;
fuzzy matching is carried out between the first time information and the time nodes corresponding to the first audio information of the song to be clipped, and the lyric text information corresponding to the first audio information and the lyric text information corresponding to second audio information are determined, where the second audio information includes the verse audio information of the song to be clipped;
and carrying out structured segmentation on the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information according to the word overlap degree of the lyric texts and the structural similarity of the lyric composition in the song to be clipped, and determining a first structure, where the first structure is the structured segmentation result of the song to be clipped.
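The word overlap degree used in this segmentation could, for instance, be measured as a Jaccard overlap of word sets; this particular formula is an assumption for illustration, since the application does not fix one.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two lyric lines; lines from the
    same structural section (e.g. repeated choruses) tend to score high."""
    wa, wb = set(a.split()), set(b.split())
    if not (wa and wb):
        return 0.0
    return len(wa & wb) / len(wa | wb)
```

For example, "la la love you" versus "love you too" shares the words {love, you} out of four distinct words, giving an overlap of 0.5.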
In an alternative embodiment, the processing module 1203 is specifically configured to, when performing the time node correction processing on the second structure according to the specified lyric file of the song to be clipped and the first structure to obtain the third structure:
performing calibration processing on the preset starting time point and ending time point of the second structure according to the specified lyric file of the song to be clipped, to obtain the starting time point and ending time point of the second structure after time-node calibration;
and according to the first structural body, performing time node correction processing on the starting time point and the ending time point of the second structural body after the time node is calibrated to obtain a third structural body.
In an optional implementation manner, the processing module 1203 is configured to, when performing calibration processing on a preset starting time point and an ending time point of the second structure according to a specified lyric file of a song to be clipped to obtain the starting time point and the ending time point of the second structure after the calibration time node, specifically:
updating the first text information according to the specified lyric file, where the updated first text information includes the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information;
and carrying out calibration processing on the preset starting time point and the ending time point of the second structure body according to the updated first text information and the time information corresponding to the updated first text information to obtain the starting time point and the ending time point of the second structure body after the time node is calibrated.
In an alternative embodiment, the processing module 1203 is configured to, when performing, according to the first structure, a time node correction process on a start time point and an end time point of the second structure after calibrating the time node, obtain a third structure, specifically:
acquiring a first time difference value corresponding to a starting time point of a second structure body after the time node is calibrated from the first structure body;
acquiring a second time difference value corresponding to the ending time point of the second structure body after the time node is calibrated from the first structure body;
Performing time node correction processing on the starting time point of the second structure body after the time node is calibrated according to the starting time point of the second structure body after the time node is calibrated and the first time difference value;
Performing time node correction processing on the ending time point of the second structure body after the time node is calibrated according to the ending time point of the second structure body after the time node is calibrated and the second time difference value;
The third structure includes a start time point and an end time point of the second structure after the time node correction processing.
In an alternative embodiment, the processing module 1203 is specifically configured to, when performing, according to the first structure, a time node correction process on a start time point and an end time point of the second structure after calibrating the time node, to obtain the third structure:
determining a target duration between a starting time point and an ending time point of the second structure after the calibration time node;
and according to the target duration, connecting different paragraphs in the first structural body end to obtain a third structural body.
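The end-to-end splicing step can be sketched as a greedy accumulation of paragraphs up to the target duration; this is a simplified reading, and the paragraph selection strategy is an assumption.

```python
from typing import List, Tuple


def splice_to_duration(paragraphs: List[Tuple[str, float]],
                       target: float) -> Tuple[List[str], float]:
    """Concatenate structural paragraphs end to end, stopping once the
    accumulated duration reaches the target duration."""
    picked, total = [], 0.0
    for label, dur in paragraphs:
        if total >= target:
            break
        picked.append(label)
        total += dur
    return picked, total
```

With paragraphs of 30 s, 33 s, and 15 s and a 60 s target, the first two paragraphs are spliced and the third is dropped.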
It may be appreciated that the specific implementation of each module and the beneficial effects that can be achieved in the song clipping apparatus according to the embodiments of the present application may refer to the descriptions of the foregoing related embodiments, which are not repeated herein.
Fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device described in the embodiments of the present application includes a processor 1301, a user interface 1302, a communication interface 1303, and a memory 1304. In the embodiment of the present application, the processor 1301, the user interface 1302, the communication interface 1303, and the memory 1304 may be connected by a bus or in other manners.
The processor 1301 (also called a central processing unit (Central Processing Unit, CPU)) is the computing core and control core of the computer device. It can parse various instructions in the computer device and process various data of the computer device; for example, the CPU can parse a power on/off instruction sent by a user to the computer device and control the computer device to perform the power on/off operation, or transmit various interaction data between the internal structures of the computer device. The user interface 1302 is the medium through which the user and the computer device interact and exchange information, and may specifically include a display (Display) for output and a keyboard (Keyboard) for input, where the keyboard may be a physical keyboard, a touch-screen virtual keyboard, or a combination of both. The communication interface 1303 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi or a mobile communication interface), and is controlled by the processor 1301 to transmit and receive data. The memory 1304 (Memory) is a storage device in the computer device for storing programs and data. It can be understood that the memory 1304 here may include a built-in memory of the computer device and may also include an extended memory supported by the computer device. The memory 1304 provides storage space that stores the operating system of the computer device, which may include, but is not limited to, an Android system, an iOS system, a Windows Phone system, and the like; the application is not limited in this regard.
In an embodiment of the present application, processor 1301 performs the following operations by executing executable program code in memory 1304:
processing the audio file of the song to be clipped, and extracting the audio characteristic information of the song to be clipped;
determining first audio information of the song to be clipped according to the audio characteristic information of the song to be clipped;
determining a first structure body according to the first audio information of the song to be clipped and the first text information of the song to be clipped;
and performing time node correction processing on a second structure according to the specified lyric file of the song to be clipped and the first structure, to obtain a third structure, where the second structure includes a preset part or all of the content of the song to be clipped, and the third structure includes that part or all of the content after time node correction.
In an alternative embodiment, the processor 1301, when configured to determine the first audio information of the song to be clipped according to the audio feature information of the song to be clipped, is specifically configured to:
inputting the audio characteristic information of the song to be clipped into a neural network to obtain a first audio information probability set of the song to be clipped, wherein the audio characteristic information comprises CQT local characteristics and MIDI vocal melody characteristics of the song to be clipped;
and determining first audio information of the song to be clipped according to the first audio information probability set, where the first audio information includes the chorus audio information of the song to be clipped.
In an alternative embodiment, the processor 1301 is further configured to calculate an edit distance between any two lyrics in the song to be clipped based on the specified lyrics file of the song to be clipped;
and obtaining first text information of the song to be clipped according to the edit distance, where the first text information includes a text similarity matrix of the song to be clipped.
In an alternative embodiment, processor 1301, when determining the first structure according to the first audio information of the song to be clipped and the first text information of the song to be clipped, is specifically configured to:
according to the first text information of the song to be clipped, dividing paragraphs of the song to be clipped to obtain first time information, wherein the first time information comprises time nodes corresponding to different paragraphs respectively;
fuzzy matching is carried out between the first time information and the time nodes corresponding to the first audio information of the song to be clipped, and the lyric text information corresponding to the first audio information and the lyric text information corresponding to second audio information are determined, where the second audio information includes the verse audio information of the song to be clipped;
and carrying out structured segmentation on the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information according to the word overlap degree of the lyric texts and the structural similarity of the lyric composition in the song to be clipped, and determining a first structure, where the first structure is the structured segmentation result of the song to be clipped.
In an alternative embodiment, when the processor 1301 is configured to perform the time node correction processing on the second structure according to the specified lyrics file of the song to be clipped and the first structure, to obtain the third structure, the processor is specifically configured to:
performing calibration processing on the preset starting time point and ending time point of the second structure according to the specified lyric file of the song to be clipped, to obtain the starting time point and ending time point of the second structure after time-node calibration;
and according to the first structural body, performing time node correction processing on the starting time point and the ending time point of the second structural body after the time node is calibrated to obtain a third structural body.
In an alternative embodiment, when the processor 1301 is configured to perform calibration processing on a preset starting time point and an ending time point of the second structure according to a specified lyric file of a song to be clipped, to obtain the starting time point and the ending time point of the second structure after the calibration time node, the processor is specifically configured to:
updating the first text information according to the specified lyric file, where the updated first text information includes the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information;
and carrying out calibration processing on the preset starting time point and the ending time point of the second structure body according to the updated first text information and the time information corresponding to the updated first text information to obtain the starting time point and the ending time point of the second structure body after the time node is calibrated.
In an alternative embodiment, when the processor 1301 is configured to perform, according to the first structure, a time node correction process on a start time point and an end time point of the second structure after calibrating the time node, to obtain the third structure, the processor is specifically configured to:
acquiring a first time difference value corresponding to a starting time point of a second structure body after the time node is calibrated from the first structure body;
acquiring a second time difference value corresponding to the ending time point of the second structure body after the time node is calibrated from the first structure body;
Performing time node correction processing on the starting time point of the second structure body after the time node is calibrated according to the starting time point of the second structure body after the time node is calibrated and the first time difference value;
Performing time node correction processing on the ending time point of the second structure body after the time node is calibrated according to the ending time point of the second structure body after the time node is calibrated and the second time difference value;
The third structure includes a start time point and an end time point of the second structure after the time node correction processing.
In an alternative embodiment, when the processor 1301 is configured to perform, according to the first structure, a time node correction process on a start time point and an end time point of the second structure after calibrating the time node, to obtain the third structure, the processor is specifically configured to:
determining a target duration between a starting time point and an ending time point of the second structure after the calibration time node;
and according to the target duration, connecting different paragraphs in the first structural body end to obtain a third structural body.
In a specific implementation, the processor 1301, the user interface 1302, the communication interface 1303 and the memory 1304 described in the embodiments of the present application may execute an implementation of the computer device described in the song clipping method provided in the embodiments of the present application, and may also execute an implementation described in the song clipping apparatus provided in the embodiments of the present application, which is not described herein again.
The embodiment of the present application further provides a computer-readable storage medium storing a computer program. The computer program includes program instructions which, when executed by a processor, implement the song clipping method provided by the embodiments of the present application. For details, reference may be made to the implementations of the foregoing steps, which are not repeated herein.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the method according to an embodiment of the application. The specific implementation manner may refer to the foregoing description, and will not be repeated here.
It should be noted that, for simplicity of description, the foregoing method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in another order or simultaneously. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing related hardware, and the program may be stored in a computer readable storage medium, where the storage medium may include a flash disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or the like.
The above disclosure is illustrative only of some embodiments of the application and is not intended to limit the scope of the application, which is defined by the claims and their equivalents.

Claims (9)

1. A song clipping method, characterized in that the method comprises:
processing an audio file of a song to be clipped, and extracting audio feature information of the song to be clipped;
determining first audio information of the song to be clipped according to the audio feature information of the song to be clipped, the first audio information comprising chorus audio information of the song to be clipped;
obtaining first text information of the song to be clipped, the first text information comprising lyric text information corresponding to verse audio information of the song to be clipped, lyric text information corresponding to the chorus audio information, and a text similarity matrix of the song to be clipped, the text similarity matrix being used to indicate the similarity between any two lyric lines in the song to be clipped;
dividing the song to be clipped into paragraphs according to the first text information of the song to be clipped and based on the text similarity matrix, to obtain first time information, the first time information comprising time nodes respectively corresponding to different paragraphs;
performing fuzzy matching between the first time information and time nodes corresponding to the first audio information of the song to be clipped, and determining lyric text information corresponding to the first audio information and lyric text information corresponding to second audio information, the second audio information comprising the verse audio information of the song to be clipped;
performing structured segmentation on the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information according to the word overlap degree of the lyric texts and the structural similarity of the lyric composition in the song to be clipped, and determining a first structure, the first structure being the structured segmentation result of the song to be clipped;
obtaining a preset starting time point and a preset duration of a second structure, the second structure comprising a preset part or all of the content of the song to be clipped, the preset part or all of the content being a preset segment to be clipped;
performing calibration processing on the preset starting time point and ending time point of the second structure according to a specified lyric file of the song to be clipped, to obtain the starting time point and ending time point of the second structure after time-node calibration;
performing time-node correction processing on the starting time point and ending time point of the calibrated second structure according to the first structure, to obtain a third structure, the third structure comprising the preset part or all of the content of the song to be clipped after time-node correction, and the starting time point and ending time point of the third structure being located at the boundaries of the nearest-neighbor paragraphs of the paragraphs in which the starting time point and ending time point of the calibrated second structure are respectively located.
2. The method according to claim 1, characterized in that determining the first audio information of the song to be clipped according to the audio feature information of the song to be clipped comprises:
inputting the audio feature information of the song to be clipped into a neural network to obtain a first audio information probability set of the song to be clipped, the audio feature information comprising CQT local features and MIDI vocal melody features of the song to be clipped;
determining the first audio information of the song to be clipped according to the first audio information probability set.
3. The method according to claim 1, characterized in that the method further comprises:
determining an edit distance between any two lyric lines in the song to be clipped based on the specified lyric file of the song to be clipped;
obtaining the first text information of the song to be clipped according to the edit distance.
4. The method according to claim 1, characterized in that performing calibration processing on the preset starting time point and ending time point of the second structure according to the specified lyric file of the song to be clipped, to obtain the starting time point and ending time point of the second structure after time-node calibration, comprises:
updating the first text information according to the specified lyric file, the updated first text information comprising the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information;
performing calibration processing on the preset starting time point and ending time point of the second structure according to the updated first text information and the time information corresponding to the updated first text information, to obtain the starting time point and ending time point of the second structure after time-node calibration.
5. The method according to claim 1 or 4, characterized in that performing time-node correction processing on the starting time point and ending time point of the calibrated second structure according to the first structure, to obtain the third structure, comprises:
obtaining, from the first structure, a first time difference corresponding to the starting time point of the calibrated second structure, the first time difference comprising the difference between the boundary times of the two paragraphs adjacent to the paragraph in which the starting time point of the calibrated second structure is located;
obtaining, from the first structure, a second time difference corresponding to the ending time point of the calibrated second structure, the second time difference comprising the difference between the boundary times of the two paragraphs adjacent to the paragraph in which the ending time point of the calibrated second structure is located;
performing time-node correction processing on the starting time point of the calibrated second structure according to the starting time point of the calibrated second structure and the first time difference;
performing time-node correction processing on the ending time point of the calibrated second structure according to the ending time point of the calibrated second structure and the second time difference;
the third structure comprising the starting time point and ending time point of the second structure after the time-node correction processing.
6. The method according to claim 1 or 4, characterized in that performing time-node correction processing on the starting time point and ending time point of the calibrated second structure according to the first structure, to obtain the third structure, comprises:
determining a target duration between the starting time point and the ending time point of the calibrated second structure;
connecting different paragraphs in the first structure end to end according to the target duration, to obtain the third structure.
7. A computer device, characterized in that it comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the song clipping method according to any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the song clipping method according to any one of claims 1 to 6.
9. A computer program product comprising computer instructions, characterized in that the computer instructions are stored in a computer-readable storage medium, and when read and executed by a processor of a computer device, cause the computer device to perform the steps of the song clipping method according to any one of claims 1 to 6.
CN202210276635.9A 2022-03-21 2022-03-21 Song editing method, device and equipment Active CN114639367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210276635.9A CN114639367B (en) 2022-03-21 2022-03-21 Song editing method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210276635.9A CN114639367B (en) 2022-03-21 2022-03-21 Song editing method, device and equipment

Publications (2)

Publication Number Publication Date
CN114639367A CN114639367A (en) 2022-06-17
CN114639367B true CN114639367B (en) 2025-08-22

Family

ID=81948985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210276635.9A Active CN114639367B (en) 2022-03-21 2022-03-21 Song editing method, device and equipment

Country Status (1)

Country Link
CN (1) CN114639367B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037764A (en) * 2020-08-06 2020-12-04 杭州网易云音乐科技有限公司 Music structure determination method, device, equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9230528B2 (en) * 2012-09-19 2016-01-05 Ujam Inc. Song length adjustment
CN105187936B (en) * 2015-06-15 2018-08-21 福建星网视易信息系统有限公司 Based on the method for broadcasting multimedia file and device for singing audio scoring
CN110415723B (en) * 2019-07-30 2021-12-03 广州酷狗计算机科技有限公司 Method, device, server and computer readable storage medium for audio segmentation


Also Published As

Publication number Publication date
CN114639367A (en) 2022-06-17

Similar Documents

Publication Publication Date Title
US11687315B2 (en) Audio content production, audio sequencing, and audio blending system and method
CN106652997A (en) Audio synthesis method and terminal
CN108806656B (en) Automatic generation of songs
CN103597543B (en) Semantic Track Mixer
US10971125B2 (en) Music synthesis method, system, terminal and computer-readable storage medium
CN109977255A (en) Model generating method, audio-frequency processing method, device, terminal and storage medium
CN107247769A (en) Method, device, terminal and storage medium for ordering songs by voice
CN107221323A (en) Method for ordering songs by voice, terminal and storage medium
CN114582306B (en) Audio adjustment method and computer device
CN108877766A (en) Song synthetic method, device, equipment and storage medium
CN110600004A (en) Voice synthesis playing method and device and storage medium
CN111554329A (en) Audio editing method, server and storage medium
CN114639367B (en) Song editing method, device and equipment
CN113903342B (en) Voice recognition error correction method and device
CN110619673B (en) Method for generating and playing sound chart, method, system and equipment for processing data
CN105630831B (en) Singing search method and system
CN114550690B (en) Song synthesis method and device
US9412395B1 (en) Narrator selection by comparison to preferred recording features
CN115410544B (en) Sound effect processing method and device and electronic equipment
CN107025902B (en) Data processing method and device
JP6589521B2 (en) Singing standard data correction device, karaoke system, program
CN116935817A (en) Music editing method, apparatus, electronic device, and computer-readable storage medium
CN114038481B (en) Lyric timestamp generation method, device, equipment and medium
Köküer et al. Curating and annotating a collection of traditional Irish flute recordings to facilitate stylistic analysis
KR20050041749A (en) Voice synthesis apparatus depending on domain and speaker by using broadcasting voice data, method for forming voice synthesis database and voice synthesis service system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant