Disclosure of Invention
To address the above technical problems, the present application provides a song clipping method, apparatus, and device that can not only preserve the semantic structural integrity and continuity of the clipped song segments, but also improve clipping efficiency and reduce cost.
In a first aspect, an embodiment of the present application provides a song clipping method. The method may be performed by a computer device (e.g., a terminal or a server) and includes:
processing an audio file of a song to be clipped, and extracting audio feature information of the song to be clipped;
determining first audio information of the song to be clipped according to the audio feature information of the song to be clipped;
determining a first structure according to the first audio information of the song to be clipped and first text information of the song to be clipped;
and performing time node correction processing on a second structure according to a specified lyric file of the song to be clipped and the first structure, to obtain a third structure, wherein the second structure includes part or all of the content of the preset song to be clipped, and the third structure includes that content after time node correction.
By this method, the semantic information of the song's structural segments is used to automatically align, calibrate, and time-correct the music segment to be clipped. This not only preserves the integrity and continuity of the semantic structure of the clipped song segment, but also improves clipping efficiency and reduces cost.
The computer device obtains the first audio information of the song to be clipped (for example, audio feature information of the chorus of the song to be clipped). Because the chorus adds variation to the tune and is more memorable, the segment to be clipped can be located more accurately.
In one possible implementation, the computer device inputs the audio feature information of the song to be clipped into a neural network to obtain a first audio information probability set of the song to be clipped, wherein the audio feature information includes CQT local features and MIDI vocal melody features of the song to be clipped;
and determines the first audio information of the song to be clipped according to the first audio information probability set, wherein the first audio information includes the chorus audio information of the song to be clipped.
It can be seen that, since the CQT local feature conforms to the frequency perception of the human ear and the MIDI vocal melody feature can precisely locate the vocal time points at the boundaries of the first audio information, inputting both features into the neural network to determine the first audio information improves the accuracy of the result.
In one possible implementation, the computer device calculates the edit distance between every two sentences of lyrics in the song to be clipped based on the specified lyric file of the song to be clipped;
and obtains the first text information of the song to be clipped according to the edit distances, wherein the first text information includes a text similarity matrix of the song to be clipped.
Thus, the method obtains the first text information from a specified lyric file (the particular lyric file used by the application may be called a QRC lyric file). Based on the timestamps in the QRC lyric file, the lyric text corresponding to each time node can be obtained more accurately, which improves clipping accuracy.
In a possible implementation, the computer device divides the song to be clipped into paragraphs according to the first text information of the song to be clipped, obtaining first time information that includes the time nodes corresponding to the different paragraphs;
performs fuzzy matching between the first time information and the time nodes corresponding to the first audio information of the song to be clipped, determining the lyric text information corresponding to the first audio information and the lyric text information corresponding to second audio information, wherein the second audio information includes the verse audio information of the song to be clipped;
and performs structural segmentation on the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information according to the word overlap between lyric texts and the structural similarity of the lyrics in the song to be clipped, determining the first structure, which is the structural segmentation result of the song to be clipped.
In this way, by fuzzy-matching the time nodes corresponding to the first audio information with the time nodes corresponding to the different paragraphs of the song to be clipped, the computer device can attach semantic information to the paragraph structures and thereby separate the verse information and the chorus information of the song. According to the word overlap between lyric texts and the structural similarity of the lyrics, the verse and chorus information can be further divided into structurally symmetric small paragraphs, yielding the semantic structure information of the whole song.
In one possible implementation, the computer device obtains a preset starting time point and a preset duration of the second structure;
calibrates the preset starting time point and the ending time point of the second structure according to the specified lyric file of the song to be clipped, obtaining the starting and ending time points of the second structure after time node calibration;
and performs time node correction processing on the calibrated starting and ending time points of the second structure according to the first structure, to obtain the third structure.
In this embodiment of the application, correcting the starting and ending time points of the calibrated second structure according to the first structure improves the structural cohesion and listening continuity of the second structure, and thus the user's listening experience.
In a possible implementation, the computer device updates the first text information according to the specified lyric file, the updated first text information including the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information;
and calibrates the preset starting and ending time points of the second structure according to the updated first text information and its corresponding time information, obtaining the starting and ending time points of the second structure after time node calibration.
In this embodiment of the application, calibrating the preset starting and ending time points of the second structure against the QRC lyric file yields accurate starting and ending times for the second structure, improving clipping accuracy.
In one possible implementation, the computer device obtains, from the first structure, a first time difference corresponding to the starting time point of the calibrated second structure;
obtains, from the first structure, a second time difference corresponding to the ending time point of the calibrated second structure;
performs time node correction on the calibrated starting time point according to that starting time point and the first time difference;
and performs time node correction on the calibrated ending time point according to that ending time point and the second time difference.
The third structure includes the starting and ending time points of the second structure after the time node correction processing.
In this embodiment of the application, since the first structure contains the semantic structure information of the whole song, correcting the calibrated starting and ending time points of the second structure against the first structure improves the semantic structural integrity and listening continuity of the resulting third structure.
In one possible implementation, the computer device determines the corresponding time length from the starting and ending time points of the calibrated second structure;
and, according to that time length, connects different paragraphs of the first structure end to end to obtain the third structure.
In this embodiment of the application, the computer device can connect different paragraphs of the first structure end to end according to the clip duration, improving the cohesion and continuity of the song while allowing paragraphs to be spliced freely.
In a second aspect, an embodiment of the present application provides a song clipping apparatus, including:
a preprocessing module, configured to process the audio file of the song to be clipped and extract the audio feature information of the song to be clipped;
a determining module, configured to determine the first audio information of the song to be clipped according to the audio feature information of the song to be clipped;
the determining module being further configured to determine the first structure according to the first audio information of the song to be clipped and the first text information of the song to be clipped;
and a processing module, configured to perform time node correction processing on the second structure according to the QRC lyric file of the song to be clipped and the first structure to obtain the third structure, wherein the second structure includes part or all of the content of the preset song to be clipped, and the third structure includes that content after time node correction.
In a third aspect, an embodiment of the present application further provides a computer device including a memory and a processor, the memory storing a computer program that, when executed by the processor, implements any of the above methods.
In a fourth aspect, the present application further provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements any of the above methods.
In a fifth aspect, an embodiment of the present application further provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the storage medium and executes them, causing the computer device to perform the method provided by the embodiments of the application.
Detailed Description
The following clearly and completely describes the embodiments of the present application with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
To facilitate understanding of the disclosed embodiments, some concepts involved in the embodiments of the application are described first. The description of these concepts includes, but is not limited to, the following.
1. QRC lyric file
A lyric file in Extensible Markup Language (XML) format whose timing is precise down to the time point of each word in the lyrics.
2. CQT local features
The Constant-Q Transform (CQT) local feature is a nonlinear frequency-domain feature obtained by filtering the time-domain audio signal with a bank of constant-Q filters; it conforms well to music theory.
3. MIDI vocal melody feature
A set of features, derived from the Musical Instrument Digital Interface (MIDI) representation, that describe the pitch, rhythm, and intensity of the human voice and capture the rise and fall of the melody.
4. Verse
The verse (also called the main song) gradually builds the melody toward the climax while clearly laying out the story behind the song; it is therefore strongly narrative.
5. Chorus
The chorus (refrain), which may also be called the climax, is a lyric passage that occurs several times or repeats within a song, typically between verses. The chorus contrasts with the verse in length, melody, rhythm, and emotion; it adds variation to the tune and is more memorable.
6. Edit distance
The edit distance is a quantitative measure of the difference between two strings: the minimum number of single-character edit operations (insertion, deletion, substitution) required to transform one string into the other. In general, the smaller the edit distance, the more similar the two strings.
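For illustration only, a minimal dynamic-programming sketch of edit distance (the application does not prescribe any particular implementation):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dp[j] holds the distance between a[:i] and b[:j] for the current row i.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # delete a[i-1]
                        dp[j - 1] + 1,                      # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))      # substitute
            prev = cur
    return dp[n]

print(edit_distance("hello world", "hello word"))  # 1
```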
7. Structured segmentation
A structured segment carries rich semantic information and is one of the main forms in which music is organized. In terms of song content, similar repeated lyrics are usually grouped into a section or paragraph, and the structure of a typical popular song can be divided into alternating verse and chorus paragraphs.
Currently, songs are clipped mainly either manually or automatically. With manual clipping, different people understand song structure differently, so the clipping criteria are inconsistent, and manual clipping is very inefficient. Existing automatic clipping methods mostly cut by fixed duration or by simple audio signal processing; they cannot identify verse and chorus paragraphs, so the clipped segments are incomplete and the semantic structural integrity of the clipped song segment cannot be guaranteed.
In view of this, the embodiments of the present application provide a song clipping method, apparatus, and device. The method uses the semantic information of the song's structural segments to automatically align, calibrate, and time-correct the music segment to be clipped. It not only preserves the integrity and continuity of the semantic structure of the clipped song segment, but also improves clipping efficiency and reduces cost.
It should be noted that, in a specific implementation, the above solution may be carried out by a computer device, which may be a terminal or a server. The terminal may include, but is not limited to, a smartphone, tablet computer, notebook computer, desktop computer, smart watch, smart TV, smart in-vehicle terminal, and the like. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms. No limitation is imposed here.
To facilitate understanding of the embodiments of the present application, a specific implementation of the song clipping method is described in detail below, with a computer device as the executing entity.
Fig. 1 is a flowchart of a song clipping method according to an embodiment of the present application. As shown in fig. 1, the method may be performed by a computer device and includes the following steps S101-S104.
S101, processing the audio file of the song to be clipped, and extracting the audio feature information of the song to be clipped.
Here, the song to be clipped is a song specified by the user (including an audio file and a lyric file). For example, the song to be clipped includes multiple song segments (e.g., verse segments and chorus segments) and the lyrics corresponding to each segment. Optionally, the lyric file also contains non-lyric information (e.g., the title, singer, production, and mixing credits in the lyric text).
Optionally, the file format of the audio file of the song to be clipped includes, but is not limited to, MP3, MP4, Waveform Audio File Format (WAV), and the like.
In an optional implementation, extracting the audio feature information of the song to be clipped includes extracting the CQT local features and the MIDI vocal melody features of the song to be clipped. Fig. 2 is a schematic diagram of a CQT local feature according to an embodiment of the present application. As shown in fig. 2, the CQT local feature is a nonlinear time-frequency spectrum whose frequency axis is spaced on a base-2 logarithmic (log2) scale; the abscissa represents time and the ordinate represents frequency. When pitches are distributed on a log2 scale, the representation matches the frequency perception of the human ear (the higher the frequency, the lower the sensitivity). Fig. 3 is a schematic diagram of a MIDI vocal melody feature according to an embodiment of the present application; in fig. 3, the abscissa represents time and the ordinate represents the pitch of the voice. This feature represents the human voice strongly and can more accurately pinpoint the vocal time points at chorus segment boundaries.
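As an illustrative sketch only (neither the toolchain nor the parameters below are mandated by the embodiment; the librosa audio library and the file path are assumptions), CQT local features of the kind shown in fig. 2 could be computed as follows:

```python
import librosa
import numpy as np

# Load the audio file of the song to be clipped (path is hypothetical).
y, sr = librosa.load("song_to_clip.mp3", sr=22050, mono=True)

# Constant-Q transform: bins spaced on a base-2 logarithmic frequency
# axis (12 bins per octave), matching human pitch perception.
cqt = librosa.cqt(y, sr=sr, hop_length=512,
                  n_bins=84, bins_per_octave=12)
cqt_db = librosa.amplitude_to_db(np.abs(cqt), ref=np.max)
print(cqt_db.shape)  # (84, n_frames): frequency bins x time frames
```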
S102, determining the first audio information of the song to be clipped according to the audio feature information of the song to be clipped.
Here, the first audio information comprises the chorus segments of the song to be clipped. For example, if the song to be clipped contains several chorus segments, the first audio information may include the audio information of each of them.
In an optional implementation, determining the first audio information of the song to be clipped may include the following steps: inputting the audio feature information of the song into a neural network to obtain a first audio information probability set of the song, and determining from that probability set the first audio information, which includes the chorus audio information of the song to be clipped.
Specifically, for example, the probability set is {P_1, P_2, …, P_n}, where n indexes the n-th time point and each probability represents the likelihood that the corresponding time point belongs to the climax part. The computer device may derive a first audio information probability curve from the probability set. For example, fig. 4 is a climax probability curve provided by an embodiment of the application; this climax probability curve is the first audio information probability curve, with the abscissa being the time-point index and the ordinate the probability.
The computer device may determine the first audio information of the song to be clipped from the probability curve, i.e., the climax parts (the chorus parts) of the song can be read off fig. 4. As shown in fig. 4, the probability is greatest between points a and b, so the span between a and b can preliminarily be taken as a climax segment of the song to be clipped. Similarly, based on the probability values, the span between points c and d can also be taken as a climax segment.
In another implementation of determining the first audio information, the computer device may instead input the audio feature information of the song to be clipped into the neural network to directly obtain a first audio information probability curve, and determine the first audio information (including the chorus audio information of the song) from that curve. For example, the computer device may determine the climax segments (i.e., chorus segments) of the song from fig. 4.
It will be appreciated that in this implementation the probability curve is obtained directly, without first constructing a probability set.
When determining the first audio information from the probability curve, the computer device may filter the first audio information probability curve to obtain a first audio filtering curve and determine the first audio information of the song to be clipped from the filtering curve.
When determining the first audio information from the first audio filtering curve, the computer device may determine the time information and confidence of the first audio information from a maximum point of the filtering curve and the minimum point following it. The time information of the first audio information includes a starting time point and an ending time point; its confidence includes a starting confidence and an ending confidence. The abscissa and ordinate of the maximum point are, respectively, the starting time point and starting confidence of the first audio information, and the abscissa and ordinate of the minimum point after the maximum are, respectively, the ending time point and ending confidence. The computer device then calculates the average confidence of the candidate first audio information from the starting and ending confidences, and determines the first audio information of the song to be clipped accordingly.
In this implementation, the computer device calculates the average confidence of the candidate first audio information in the song to be clipped and selects the first audio information by it, which ensures the reliability of the first audio information.
Optionally, the computer device may select the first audio information using an average confidence threshold interval [a, b]. For example, if the average confidence of a candidate is c and c ∈ [a, b], that candidate may be taken as the first audio information of the song to be clipped. Conversely, if the average confidence of a candidate is d and d ∉ [a, b], that candidate may be rejected as the first audio information of the song to be clipped.
Fig. 5 is a schematic diagram of filtering a short-time climax probability curve according to an embodiment of the present application. As shown in fig. 5, the computer device may filter the short-time climax probability curve v_p with a Haar filter to obtain a filtering curve v_f, and obtain the first audio information of the song to be clipped from v_f. First, the computer device inputs the CQT local features and MIDI vocal melody features of the song to be clipped into the neural network and predicts a short-time climax probability once per 500 ms frame, yielding a short-time climax probability set from which the climax probability curve v_p is formed. The computer device then slides a Haar filter of width 20 s over the whole curve v_p to obtain the filtering curve v_f. The time of the maximum point of v_f (point A in fig. 5) is taken as the climax starting time point t_c_start, with the value of v_f at t_c_start as the starting confidence s_c_start; the time of the minimum point after the maximum (point B in fig. 5) is taken as the climax ending time point t_c_end, with the value of v_f at t_c_end as the ending confidence s_c_end. The average confidence of the climax segment is then calculated from the starting confidence s_c_start and the ending confidence s_c_end.
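A minimal sketch of the smoothing and extremum search just described, assuming the 500 ms frame rate and 20 s Haar filter stated in the text; the kernel polarity, its normalization, and the sign convention for the ending confidence are illustrative assumptions:

```python
import numpy as np

FRAME_SEC = 0.5                      # one short-time climax probability per 500 ms
FILTER_WIDTH = int(20 / FRAME_SEC)   # 20 s Haar filter -> 40 frames

def haar_filter(v_p: np.ndarray) -> np.ndarray:
    """Correlate v_p with a Haar (step) kernel: -1 over the first half,
    +1 over the second. The response is maximal at rising edges (climax
    onset) and minimal at falling edges (climax offset)."""
    half = FILTER_WIDTH // 2
    kernel = np.concatenate([np.full(half, -1.0 / half),
                             np.full(half, 1.0 / half)])
    # np.convolve flips its kernel, so flip once more to correlate.
    return np.convolve(v_p, kernel[::-1], mode="same")

def detect_climax(v_p: np.ndarray):
    v_f = haar_filter(v_p)
    i_a = int(np.argmax(v_f))                 # maximum point A
    i_b = i_a + int(np.argmin(v_f[i_a:]))     # minimum point B after A
    t_c_start, s_c_start = i_a * FRAME_SEC, v_f[i_a]
    t_c_end, s_c_end = i_b * FRAME_SEC, v_f[i_b]
    # Average confidence of the climax segment; |s_c_end| is an assumed
    # convention, since the filter response is negative at point B.
    s_c = (s_c_start + abs(s_c_end)) / 2.0
    return t_c_start, t_c_end, s_c
```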
S103, determining a first structure according to the first audio information of the song to be clipped and the first text information of the song to be clipped.
Here, the first text information is the lyric text corresponding to the song to be clipped. For example, the first text information includes the lyric text of the verse segments and of the chorus segments of the song. Optionally, the first text information may also include a lyric text similarity matrix of the song to be clipped.
The first structure is the structural segmentation result of the song to be clipped and contains its semantic structure information. For example, the first structure includes the verse paragraphs, chorus paragraphs, and character-A/B small paragraphs of the song to be clipped.
S104, performing time node correction processing on the second structure according to the specified lyric file of the song to be clipped and the first structure, to obtain the third structure.
Here, the second structure includes part or all of the content of the preset song to be clipped and may also be called the segment to be clipped. For example, the second structure may be one or more segments of the preset song to be clipped. The second structure has a preset starting time point and a preset duration, from which its ending time point can be derived; for example, if the preset starting time point is 00:30.21 in the song and the preset duration is 20 s, the ending time point is 00:50.21.
The third structure includes part or all of the content of the song to be clipped after time node correction; the corrected second structure may also be called the clipped audio. The third structure may further include the starting and ending time points of the second structure after the time node correction processing.
In an optional implementation, the computer device performs the time node correction on the second structure according to the specified lyric file (for example, a QRC lyric file) and the first structure as follows: obtain the preset starting time point and preset duration of the second structure; calibrate the preset starting and ending time points of the second structure according to the specified lyric file of the song to be clipped, obtaining the starting and ending time points of the second structure after time node calibration; and perform time node correction on those calibrated time points according to the first structure, to obtain the third structure.
With this embodiment of the application, the music segment to be clipped is automatically aligned, calibrated, and time-corrected using the semantic information of the song's structural segments. This preserves the integrity and continuity of the semantic structure of the clipped segment while improving clipping efficiency and reducing cost. In addition, since the lyric text in a QRC lyric file supports multiple languages, the embodiment can clip songs in many languages, including less common ones.
Fig. 6 is a flowchart of another song clipping method according to an embodiment of the present application. As shown in fig. 6, the method includes steps S601a-S604. Note that the steps suffixed "a" (e.g., S601a and S602a) operate on the audio information of the song to be clipped, while the steps suffixed "b" (e.g., S601b and S602b) operate on the second structure.
S601a, processing the audio file of the song to be clipped, and extracting the audio feature information of the song to be clipped.
S602a, determining the first audio information of the song to be clipped according to the audio feature information of the song to be clipped.
The specific processes of steps S601a and S602a are as described for S101 and S102 above and are not repeated here.
S603, determining a first structure according to the first audio information of the song to be clipped and the first text information of the song to be clipped.
In an optional implementation, determining the first structure from the first audio information and the first text information may involve a preprocessing part, a multimodal fusion part, and a post-processing part.
First, the preprocessing part obtains the first text information of the song to be clipped based on its QRC lyric file. Specifically, the computer device obtains the QRC lyric file of the song to be clipped, calculates the edit distance between every two sentences of lyrics based on that file, and derives from the edit distances the first text information, which includes the text similarity matrix of the song to be clipped.
Optionally, the computer device may define the file format of the specified lyric file as the QRC format.
Fig. 7 is a schematic diagram of a lyric text similarity matrix according to an embodiment of the present application. In fig. 7, the abscissa and ordinate are the indices of the lyric sentences; for example, entry (10, 35) is the similarity between the 10th and the 35th sentence of lyrics.
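Building on the edit_distance sketch given in the concepts section, a similarity matrix like fig. 7 could be derived as follows; normalizing by the longer sentence length is an assumption, not prescribed by the text:

```python
import numpy as np

def lyric_similarity_matrix(lines: list[str]) -> np.ndarray:
    """Similarity between every pair of lyric sentences, in [0, 1]:
    1 - edit_distance / max_length, using edit_distance defined above."""
    n = len(lines)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            denom = max(len(lines[i]), len(lines[j])) or 1
            sim[i, j] = sim[j, i] = 1 - edit_distance(lines[i], lines[j]) / denom
    return sim

# Repeated chorus lines produce bright off-diagonal bands like fig. 7.
sim = lyric_similarity_matrix(["la la la", "verse line one", "la la la"])
print(sim[0, 2])  # 1.0 -- identical chorus lines
```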
Optionally, the preprocessing part further divides the song to be clipped into paragraphs according to the first text information, obtaining first time information that includes the time nodes corresponding to the different paragraphs.
Next, the multimodal fusion part performs fuzzy matching between the first time information and the time nodes corresponding to the first audio information of the song to be clipped, and determines the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information, where the second audio information includes the verse audio information of the song. Here, "multimodal" refers to multiple sources, media, or forms of information, such as text, audio, and images; multimodal fusion refers to fusing several kinds of such information.
In this fuzzy matching, the first time information includes the time nodes of the different paragraphs of the song to be clipped, i.e., the starting and ending time points of the verse paragraphs, of the chorus paragraphs, and so on, while the time nodes corresponding to the first audio information are the starting and ending time points of the detected chorus segments. Matching the paragraph time nodes against the chorus starting and ending time points is what is referred to as fuzzy matching, realized for example as in the sketch below.
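One plausible realization of this fuzzy matching is nearest-boundary matching within a tolerance; the tolerance value, function names, and data layout are illustrative assumptions:

```python
def fuzzy_match(paragraphs, chorus_spans, tol=3.0):
    """Label each paragraph (start, end) as chorus if both of its
    boundaries fall within `tol` seconds of a detected chorus span,
    otherwise as verse."""
    labels = []
    for p_start, p_end in paragraphs:
        is_chorus = any(abs(p_start - c_start) <= tol and
                        abs(p_end - c_end) <= tol
                        for c_start, c_end in chorus_spans)
        labels.append("chorus" if is_chorus else "verse")
    return labels

# Paragraph boundaries from the lyric text vs. chorus spans from S602a.
print(fuzzy_match([(10.0, 42.3), (42.3, 75.1)], [(41.0, 76.0)]))
# ['verse', 'chorus']
```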
Finally, the post-processing part performs structural segmentation on the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information, according to the word overlap between the lyric texts and the structural similarity of the lyrics in the song to be clipped, and determines the first structure, which is the structural segmentation result of the song to be clipped.
The specific flow of the preprocessing, multimodal fusion, and post-processing parts is illustrated below, taking the first audio information to be the chorus segments of the song to be clipped. To determine the first structure, the computer device obtains the QRC lyric file of the song and computes the edit distance between every two sentences of lyrics to build the text similarity matrix. Based on this matrix, an optimal path search algorithm merges lyric segments whose similarity exceeds a preset threshold, yielding the paragraph division of the song (verse paragraphs and chorus paragraphs) and the time nodes of each paragraph. These paragraph time nodes are then fuzzy-matched against the chorus starting and ending time points determined in step S602a, identifying the lyric text corresponding to the chorus segments and the lyric text corresponding to the verse segments. Finally, according to the word overlap between the lyric texts and the structural similarity of the lyrics in the song, the verse and chorus lyric text is further divided into structurally symmetric character-A/B small paragraphs, giving the first structure of the song to be clipped. The first structure contains the semantic structure information of the song: verse paragraphs, chorus paragraphs, and character-A/B small paragraphs. The first structure (also called the semantic structure of the song to be clipped) can be written as

Sec = [V_1, A_1, B_1, C_1, A_2, B_2, C_2, …, V_n, C_n, A_n, B_n], n ∈ [1, N],

where V denotes a verse paragraph, C a chorus paragraph, A a character-A paragraph, B a character-B paragraph, and N the total number of lyric sentences in the song to be clipped. Fig. 8 is a schematic diagram of a first structure according to an embodiment of the present application; as shown in fig. 8, the song to be clipped contains two verse paragraphs V_1 and V_2, two chorus paragraphs C_1 and C_2, four character-A small paragraphs A_1, A_2, A_3, A_4, and three character-B small paragraphs B_1, B_2, B_4.
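As a sketch, the first structure Sec could be represented as an ordered list of labeled, time-stamped segments; the field names and times below are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    label: str    # 'V' verse, 'C' chorus, 'A'/'B' character sub-paragraphs
    index: int    # V_1, V_2, ... within its label
    start: float  # paragraph boundary times in seconds
    end: float

# A first structure in the spirit of fig. 8: Sec = [V_1, A_1, B_1, C_1, ...]
sec = [
    Segment("V", 1, 0.0, 42.3),
    Segment("A", 1, 42.3, 51.7),
    Segment("B", 1, 51.7, 62.6),
    Segment("C", 1, 62.6, 95.0),
]
```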
S601b, acquiring a preset starting time point and a preset duration of the second structure.
For example, the computer device may obtain the preset starting time point t_u_start and the preset duration l_dur of the second structure from the user side.
S602b, calibrating the preset starting time point and ending time point of the second structure according to the specified lyric file of the song to be clipped, to obtain the starting and ending time points of the second structure after time node calibration.
In an optional implementation, this calibration proceeds as follows: the computer device updates the first text information according to the specified lyric file, the updated first text information including the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information; it then calibrates the preset starting and ending time points of the second structure according to the updated first text information and its corresponding time information, obtaining the starting and ending time points of the second structure after time node calibration.
Optionally, assuming the specified lyric file is a QRC lyric file, when updating the first text information the computer device uses the QRC lyric file and a filtering module to remove the prelude of the song to be clipped and non-lyric song information such as the title, singer, production, and mixing credits in the lyric text.
In one implementation of this calibration, the computer device processes the preset starting time point of the second structure with the updated first text information to obtain a first starting time point; derives a first ending time point from the first starting time point and the preset duration; and calibrates the first starting and ending time points with the time information corresponding to the updated first text information, obtaining a second starting time point and a second ending time point, which are the starting and ending time points of the second structure after time node calibration.
For example, assume the preset starting time point of the second structure obtained from the user side is t_u_start and the preset duration is l_dur. The computer device first removes the prelude of the song and the non-lyric information (title, singer, production, mixing, etc.) from the lyric text using the content of the QRC lyric file and the filtering module, obtaining updated lyric text information. It processes t_u_start against the updated lyric text to determine the first starting time point t′_u_start, and adds the preset duration l_dur to t′_u_start to obtain the estimated ending time point t′_u_end. Then, using the precise time information of the QRC lyric file, it aligns t′_u_start and t′_u_end to obtain t″_u_start and t″_u_end, which are the starting and ending time points of the second structure after time node calibration.
Optionally, the computer device may instead calibrate the preset starting and ending time points of the second structure directly against the updated first text information and its corresponding time information to obtain the calibrated starting and ending time points.
For example, assume the preset starting time point obtained from the user side is T_u_start and the preset duration is L_dur. The computer device first obtains updated lyric text information as above (removing the prelude and the non-lyric information such as title, singer, production, and mixing), then processes T_u_start against the updated lyric text and its time information to determine the starting time point T′_u_start, and finally adds L_dur to T′_u_start to determine the estimated ending time point T′_u_end. Here T′_u_start and T′_u_end are the starting and ending time points of the second structure after time node calibration.
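A minimal sketch of this calibration step, assuming the QRC timestamps are available as a list of lyric-line start times in seconds and that calibration snaps to the nearest line start (the snapping rule is an assumed design choice, not stated in the text):

```python
def calibrate(t_u_start: float, l_dur: float, line_starts: list[float]):
    """Align a preset start point to the nearest lyric-line start from
    the QRC timestamps, then derive the end point from the duration."""
    t_start = min(line_starts, key=lambda t: abs(t - t_u_start))
    t_end = min(line_starts, key=lambda t: abs(t - (t_start + l_dur)))
    return t_start, t_end

# Preset start 30.21 s, duration 20 s, lyric-line starts from the QRC file.
print(calibrate(30.21, 20.0, [28.9, 31.0, 35.5, 49.8, 52.1]))
# (31.0, 52.1)
```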
S604, performing time node correction processing on the calibrated starting and ending time points of the second structure according to the first structure, to obtain the third structure.
In an optional implementation, this correction proceeds as follows: the computer device obtains, from the first structure, a first time difference corresponding to the starting time point of the calibrated second structure and a second time difference corresponding to its ending time point; it then corrects the starting time point according to that starting time point and the first time difference, and corrects the ending time point according to that ending time point and the second time difference. The third structure includes the starting and ending time points of the second structure after the time node correction processing. This way of obtaining the third structure may also be called continuous clipping, i.e., clipping that keeps the segment continuous from start to end.
Here, the first time difference includes the time differences between the calibrated starting time point and the two adjacent paragraph boundaries (the upper and lower boundaries of the paragraph in which it falls), and the second time difference is defined analogously for the calibrated ending time point.
Fig. 9 is a schematic diagram of continuous clipping according to an embodiment of the present application. As shown in fig. 9, assume the first structure is the Sec described in step S603 and the calibrated starting and ending time points of the second structure are t″_u_start and t″_u_end. To correct t″_u_start and t″_u_end according to the first structure, the computer device obtains from Sec the differences Δup and Δdown between each time point and the boundary times of its two adjacent paragraph boundaries, and takes the paragraph boundary with the smaller difference as the corrected time; in formula (1), t‴_u_start denotes the starting time point of the second structure after time node correction. For example, in fig. 9, t″_u_start is 00:51.68, the boundary of the paragraph above it is 00:42.30, and the boundary of the paragraph below it is 01:02.59, so Δup = 00:51.68 − 00:42.30 = 00:09.38 and Δdown = 01:02.59 − 00:51.68 = 00:10.91. Since Δup < Δdown, t″_u_start is corrected to the paragraph boundary on the Δup side, and by formula (1) t‴_u_start = 00:51.68 − 00:09.38 = 00:42.30. The corrected ending time point t‴_u_end is obtained from t″_u_end in the same way. The actual duration of the second structure after correction according to the first structure is l‴_dur in formula (2), which is approximately equal to the preset duration.

t‴_u_start = t″_u_start ± min(Δup, Δdown)    (1)

l‴_dur = t‴_u_end − t‴_u_start ≈ l_dur    (2)
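Formula (1) expressed as code, choosing whichever adjacent paragraph boundary is nearer; the function name and data layout are illustrative:

```python
def correct_to_boundary(t: float, upper_boundary: float,
                        lower_boundary: float) -> float:
    """Formula (1): move t to the adjacent paragraph boundary with the
    smaller time difference (delta_up above t, delta_down below t)."""
    delta_up = t - upper_boundary
    delta_down = lower_boundary - t
    return upper_boundary if delta_up < delta_down else lower_boundary

# Example from fig. 9 (times in seconds): 00:51.68 with neighboring
# boundaries 00:42.30 and 01:02.59 snaps up to 00:42.30.
print(correct_to_boundary(51.68, 42.30, 62.59))  # 42.3
```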
In another optional implementation, the computer device performs the time node correction as follows: it determines the target duration between the calibrated starting and ending time points of the second structure, and, according to that target duration, connects different paragraphs of the first structure end to end to obtain the third structure. This way of obtaining the third structure may also be called splice clipping, i.e., freely splicing together different verse and chorus paragraphs or character-A/B small paragraphs according to the user's needs, subject to the clip duration.
Fig. 10 is a schematic diagram of splice clipping according to an embodiment of the present application. As shown in fig. 10, assuming the first structure is the Sec described in step S603, the computer device may, subject to the clip duration, freely splice together different verse and chorus paragraphs or character-A/B small paragraphs of the first structure, i.e., splice the V_1, C_1, and C_2 segments of Sec end to end to obtain the third structure. At each splice point the computer device may apply a fade-out and fade-in of a seconds of audio to further reduce the abruptness of the joint, where a may be 0.5 s.
Optionally, after obtaining the third structure, the computer device may use an audio cutting tool to cut the third structure out of the audio file of the song to be clipped according to the starting and ending time points of the third structure (i.e., the starting and ending time points of the second structure after time node correction) and output it. The third structure is the clipped song segment.
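Assuming the pydub library as the audio cutting tool (the text does not name one) and illustrative segment times, the splice clip of fig. 10 followed by the final cut and export could be sketched as:

```python
from pydub import AudioSegment

song = AudioSegment.from_file("song_to_clip.mp3")  # hypothetical path

def cut_s(seg: AudioSegment, start_s: float, end_s: float) -> AudioSegment:
    """pydub slices by milliseconds, so convert seconds to ms."""
    return seg[int(start_s * 1000):int(end_s * 1000)]

# Splice V_1, C_1, C_2 end to end with a 0.5 s crossfade at each joint
# to soften the transition (fig. 10); the times are illustrative.
v1 = cut_s(song, 0.0, 42.3)
c1 = cut_s(song, 62.6, 95.0)
c2 = cut_s(song, 150.2, 182.4)
clip = v1.append(c1, crossfade=500).append(c2, crossfade=500)
clip.export("clipped_song.mp3", format="mp3")
```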
With this embodiment of the application, the music segment to be clipped is automatically aligned, calibrated, and time-corrected using the semantic information of the song's structural segments, so the semantic structural integrity and the listening continuity of the clipped segment are preserved, clipping efficiency is improved, and cost is reduced.
In addition, the embodiments of the application support clipping a specified song from any starting point for any duration (any continuous time region), as well as clipping by freely splicing semantic paragraphs to a target duration. The application can therefore automatically and flexibly clip music segments with complete structure and continuous listening feel for different scenario requirements, e.g., short-video soundtracks, music games, ringtones, and chorus extraction, without limitation.
Fig. 11 is a framework diagram of the song clipping method according to an embodiment of the present application, corresponding to steps S601a-S604 above. The computer device first extracts the CQT local features and MIDI vocal melody features of the song to be clipped and inputs them into the neural network to determine the chorus segments. Combining the detected chorus segments with the multimodal fusion technique yields the structural segmentation result of the song. Then, based on the preset starting time and preset duration of the segment to be clipped, the preset starting and ending time points are aligned and calibrated against the QRC lyric file, giving the aligned and calibrated starting and ending time points of the segment. Next, these time points are corrected using the structural segmentation result of the song, so that each is adaptively moved to the nearest boundary of the paragraph in which it falls. Finally, the clipped song segment is cut out by a song cutting tool based on the corrected time points. It can be seen that, by using the semantic information of the song's structural segments and the QRC lyric file to automatically align, calibrate, and correct the segment to be clipped, this embodiment preserves the semantic structural integrity and continuity of the clipped segment.
In addition, compared with manual clipping, the method saves time and labor and improves clipping efficiency, thereby reducing clipping cost; compared with existing automatic clipping methods, it improves the success rate and coverage of song clipping.
Fig. 12 is a schematic diagram of a song clipping apparatus according to an embodiment of the present application. The song clipping apparatus described in this embodiment may include the following modules:
a preprocessing module 1201, configured to process the audio file of the song to be clipped and extract the audio feature information of the song to be clipped;
a determining module 1202, configured to determine the first audio information of the song to be clipped according to the audio feature information of the song to be clipped;
the determining module 1202 being further configured to determine the first structure according to the first audio information of the song to be clipped and the first text information of the song to be clipped;
and a processing module 1203, configured to perform time node correction processing on the second structure according to the specified lyric file of the song to be clipped and the first structure to obtain the third structure, wherein the second structure includes part or all of the content of the preset song to be clipped and the third structure includes that content after time node correction.
In an optional implementation, when determining the first audio information of the song to be clipped according to its audio feature information, the determining module 1202 is specifically configured to:
input the audio feature information of the song to be clipped into a neural network to obtain a first audio information probability set of the song, wherein the audio feature information includes the CQT local features and MIDI vocal melody features of the song to be clipped;
and determine the first audio information of the song from the probability set, the first audio information including the chorus audio information of the song to be clipped.
In an optional implementation, the processing module 1203 is further configured to calculate the edit distance between every two sentences of lyrics in the song to be clipped based on the specified lyric file of the song;
and obtain the first text information of the song, including its text similarity matrix, from the edit distances.
In an optional implementation, when determining the first structure according to the first audio information of the song to be clipped and the first text information of the song to be clipped, the determining module 1202 is specifically configured to:
divide the song to be clipped into paragraphs according to the first text information to obtain first time information, which includes the time nodes corresponding to the different paragraphs;
perform fuzzy matching between the first time information and the time nodes corresponding to the first audio information of the song, determining the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information, where the second audio information includes the verse audio information of the song to be clipped;
and perform structural segmentation on the two sets of lyric text information according to the word overlap between the lyric texts and the structural similarity of the lyrics in the song, determining the first structure, which is the structural segmentation result of the song to be clipped.
In an alternative embodiment, the processing module 1203 is specifically configured to, when performing the time node correction processing on the second structure according to the specified lyric file of the song to be clipped and the first structure to obtain the third structure:
According to the appointed lyric file of the song to be clipped, calibrating the preset starting time point and the ending time point of the second structure to obtain the starting time point and the ending time point of the second structure after calibrating the time node;
and according to the first structural body, performing time node correction processing on the starting time point and the ending time point of the second structural body after the time node is calibrated to obtain a third structural body.
In an alternative embodiment, the processing module 1203, when configured to calibrate the preset starting time point and ending time point of the second structure body according to the specified lyric file of the song to be clipped to obtain the starting time point and the ending time point of the second structure body after time node calibration, is specifically configured to:
updating the first text information according to the specified lyric file, wherein the updated first text information comprises the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information;
and calibrating the preset starting time point and ending time point of the second structure body according to the updated first text information and the time information corresponding to the updated first text information to obtain the starting time point and the ending time point of the second structure body after time node calibration.
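One way to read the calibration step is as snapping the preset clip boundaries to the nearest lyric-line timestamps, so a cut never lands in the middle of a sung line. The sketch below assumes the specified lyric file has been parsed into a list of line start times in seconds; this snapping rule is an interpretation, not a quoted detail of the application.

```python
def calibrate(preset_start, preset_end, lyric_times):
    # Snap each preset time point to the nearest lyric-line timestamp.
    snap = lambda t: min(lyric_times, key=lambda lt: abs(lt - t))
    return snap(preset_start), snap(preset_end)

# e.g., calibrate(42.7, 71.3, [40.1, 44.9, 49.6, 70.2, 74.8]) -> (44.9, 70.2)
```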
In an alternative embodiment, the processing module 1203, when configured to perform, according to the first structure body, the time node correction processing on the starting time point and the ending time point of the second structure body after time node calibration to obtain the third structure body, is specifically configured to:
acquiring, from the first structure body, a first time difference value corresponding to the calibrated starting time point of the second structure body;
acquiring, from the first structure body, a second time difference value corresponding to the calibrated ending time point of the second structure body;
performing time node correction processing on the calibrated starting time point of the second structure body according to that starting time point and the first time difference value;
and performing time node correction processing on the calibrated ending time point of the second structure body according to that ending time point and the second time difference value.
The third structure body comprises the starting time point and the ending time point of the second structure body after the time node correction processing.
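A plausible reading of the time difference values is as signed offsets from each calibrated time point to the nearest structural boundary recorded in the first structure body; the sketch below implements that reading and should be taken as an assumption rather than the application's exact rule.

```python
def correct(calib_start, calib_end, boundaries):
    # `boundaries`: paragraph boundary times (seconds) from the first structure body.
    nearest = lambda t: min(boundaries, key=lambda b: abs(b - t))
    d_start = nearest(calib_start) - calib_start  # first time difference value
    d_end = nearest(calib_end) - calib_end        # second time difference value
    # The corrected bounds give the time nodes of the third structure body.
    return calib_start + d_start, calib_end + d_end
```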
In an alternative embodiment, the processing module 1203, when configured to perform, according to the first structure body, the time node correction processing on the starting time point and the ending time point of the second structure body after time node calibration to obtain the third structure body, is specifically configured to:
determining a target duration between the starting time point and the ending time point of the second structure body after time node calibration;
and connecting different paragraphs in the first structure body end to end according to the target duration to obtain the third structure body.
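This alternative path can be sketched as follows, assuming the paragraphs of the first structure body are (start, end) pairs in song order; the greedy fill-to-target strategy is an illustrative assumption.

```python
def assemble(calib_start, calib_end, paragraphs):
    # Target duration spanned by the calibrated time points.
    target = calib_end - calib_start
    clip, total = [], 0.0
    for start, end in paragraphs:        # paragraphs in song order
        if total >= target:
            break
        clip.append((start, end))        # rendered back to back, end to end
        total += end - start
    return clip  # paragraph list backing the third structure body
```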
It may be appreciated that, for the specific implementation of each module of the song clipping apparatus according to the embodiments of the present application and the beneficial effects that can be achieved, reference may be made to the descriptions of the foregoing related embodiments, which are not repeated herein.
Fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device described in the embodiments of the present application includes a processor 1301, a user interface 1302, a communication interface 1303, and a memory 1304. The processor 1301, the user interface 1302, the communication interface 1303 and the memory 1304 may be connected by a bus or in other manners; in the embodiments of the present application, connection by a bus is taken as an example.
The processor 1301 (also called a central processing unit (Central Processing Unit, CPU)) is the computing core and control core of the computer device, and may parse various instructions in the computer device and process various data of the computer device. For example, the CPU may parse a power on/off instruction sent by a user to the computer device and control the computer device to perform a power on/off operation; as another example, the CPU may transmit various interaction data between the internal structures of the computer device. The user interface 1302 is a medium for implementing interaction and information exchange between a user and the computer device, and may specifically include a display screen (Display) for output, a keyboard (Keyboard) for input, and the like, where the keyboard may be a physical keyboard, a touch-screen virtual keyboard, or a keyboard combining a physical keyboard and a touch-screen virtual keyboard. The communication interface 1303 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.), and is controlled by the processor 1301 to transmit and receive data. The memory 1304 (Memory) is a storage device in the computer device for storing programs and data. It will be appreciated that the memory 1304 herein may include both the built-in memory of the computer device and the extended memory supported by the computer device. The memory 1304 provides storage space that stores the operating system of the computer device, which may include, but is not limited to, an Android system, an iOS system, a Windows Phone system, and the like; the present application is not limited in this regard.
In an embodiment of the present application, the processor 1301 performs the following operations by executing the executable program code in the memory 1304:
processing the audio file of the song to be clipped, and extracting the audio feature information of the song to be clipped;
determining first audio information of the song to be clipped according to the audio feature information of the song to be clipped;
determining a first structure body according to the first audio information of the song to be clipped and the first text information of the song to be clipped;
and performing time node correction processing on the second structure body according to the specified lyric file of the song to be clipped and the first structure body to obtain a third structure body, wherein the second structure body comprises a preset part or all of the content of the song to be clipped, and the third structure body comprises that part or all of the content of the song to be clipped after the time node correction.
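Putting the four operations together, a high-level sketch of the processor's flow might look like the following; every helper not defined in the earlier sketches (segments_from_probs, load_lyric_lines, lyric_timestamps, segment_structure) is a hypothetical placeholder.

```python
def clip_song(audio_path, lyric_file, preset_start, preset_end, chorus_net):
    # Steps 1-2: audio features -> chorus probabilities -> first audio information.
    probs = chorus_probabilities(audio_path, chorus_net)
    chorus_segments = segments_from_probs(probs)           # hypothetical
    # Step 3: first text information + first audio information -> first structure body.
    sim = similarity_matrix(load_lyric_lines(lyric_file))  # hypothetical loader
    boundaries = segment_structure(sim, chorus_segments)   # hypothetical: boundary times
    # Step 4: calibrate the preset bounds to lyric lines, then correct to structure.
    s, e = calibrate(preset_start, preset_end, lyric_timestamps(lyric_file))
    return correct(s, e, boundaries)  # time bounds of the third structure body
```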
In an alternative embodiment, the processor 1301, when configured to determine the first audio information of the song to be clipped according to the audio feature information of the song to be clipped, is specifically configured to:
inputting the audio feature information of the song to be clipped into a neural network to obtain a first audio information probability set of the song to be clipped, wherein the audio feature information comprises the CQT local features and the MIDI vocal melody features of the song to be clipped;
and determining the first audio information of the song to be clipped according to the first audio information probability set, wherein the first audio information comprises the sub-song audio information of the song to be clipped.
In an alternative embodiment, the processor 1301 is further configured to calculate an edit distance between any two sentences of lyrics in the song to be clipped based on the specified lyric file of the song to be clipped;
and to obtain the first text information of the song to be clipped according to the edit distance, wherein the first text information comprises a text similarity matrix of the song to be clipped.
In an alternative embodiment, the processor 1301, when determining the first structure body according to the first audio information of the song to be clipped and the first text information of the song to be clipped, is specifically configured to:
performing paragraph division on the song to be clipped according to the first text information of the song to be clipped to obtain first time information, wherein the first time information comprises the time nodes corresponding to the different paragraphs respectively;
performing fuzzy matching between the first time information and the time node corresponding to the first audio information of the song to be clipped, and determining lyric text information corresponding to the first audio information and lyric text information corresponding to second audio information, wherein the second audio information comprises the main song audio information of the song to be clipped;
and performing structural segmentation on the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information according to the word overlap degree of each lyric text and the structural similarity of each sentence of lyrics in the song to be clipped, and determining the first structure body, which is the structural segmentation result of the song to be clipped.
In an alternative embodiment, the processor 1301, when configured to perform the time node correction processing on the second structure body according to the specified lyric file of the song to be clipped and the first structure body to obtain the third structure body, is specifically configured to:
calibrating the preset starting time point and ending time point of the second structure body according to the specified lyric file of the song to be clipped to obtain the starting time point and the ending time point of the second structure body after time node calibration;
and performing, according to the first structure body, time node correction processing on the starting time point and the ending time point of the second structure body after time node calibration to obtain the third structure body.
In an alternative embodiment, the processor 1301, when configured to calibrate the preset starting time point and ending time point of the second structure body according to the specified lyric file of the song to be clipped to obtain the starting time point and the ending time point of the second structure body after time node calibration, is specifically configured to:
updating the first text information according to the specified lyric file, wherein the updated first text information comprises the lyric text information corresponding to the first audio information and the lyric text information corresponding to the second audio information;
and calibrating the preset starting time point and ending time point of the second structure body according to the updated first text information and the time information corresponding to the updated first text information to obtain the starting time point and the ending time point of the second structure body after time node calibration.
In an alternative embodiment, the processor 1301, when configured to perform, according to the first structure body, the time node correction processing on the starting time point and the ending time point of the second structure body after time node calibration to obtain the third structure body, is specifically configured to:
acquiring, from the first structure body, a first time difference value corresponding to the calibrated starting time point of the second structure body;
acquiring, from the first structure body, a second time difference value corresponding to the calibrated ending time point of the second structure body;
performing time node correction processing on the calibrated starting time point of the second structure body according to that starting time point and the first time difference value;
and performing time node correction processing on the calibrated ending time point of the second structure body according to that ending time point and the second time difference value.
The third structure body comprises the starting time point and the ending time point of the second structure body after the time node correction processing.
In an alternative embodiment, the processor 1301, when configured to perform, according to the first structure body, the time node correction processing on the starting time point and the ending time point of the second structure body after time node calibration to obtain the third structure body, is specifically configured to:
determining a target duration between the starting time point and the ending time point of the second structure body after time node calibration;
and connecting different paragraphs in the first structure body end to end according to the target duration to obtain the third structure body.
In a specific implementation, the processor 1301, the user interface 1302, the communication interface 1303 and the memory 1304 described in the embodiments of the present application may perform the implementations of the computer device described in the song clipping method provided by the embodiments of the present application, and may also perform the implementations described for the song clipping apparatus provided by the embodiments of the present application, which are not described herein again.
An embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program includes program instructions which, when executed by a processor, implement the song clipping method provided by the embodiments of the present application. For details, reference may be made to the implementations provided for each of the foregoing steps, which are not repeated herein.
Embodiments of the present application further provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer readable storage medium and executes them, so as to cause the computer device to perform the song clipping method according to the embodiments of the present application. For the specific implementation, reference may be made to the foregoing description, which is not repeated here.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may, according to the present application, be performed in another order or simultaneously. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer readable storage medium, where the storage medium may include a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or the like.
The above disclosure is illustrative only of some embodiments of the application and is not intended to limit the scope of the application, which is defined by the claims and their equivalents.