CN115579018B - A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals - Google Patents
- Publication number
- CN115579018B (Application CN202211120049.1A)
- Authority
- CN
- China
- Prior art keywords
- graph
- frame
- melody
- amplitude spectrum
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The invention discloses a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals. First, the constant-Q transform of the mixed audio signal is computed to obtain a log-frequency magnitude spectrum. Next, a graph structure is derived from the frequency-bin relationships between the fundamental and the harmonic components of the same musical source. Then the constant-Q magnitude spectrum is fed as input to a graph convolutional network whose output is the melody pitch encoded as a one-hot vector; the network learns the complex input-output mapping, and for each frame the frequency corresponding to the maximum value among the output nodes is taken as the preliminary melody pitch estimate of that frame. Finally, a post-processing step constructs a salience spectrogram and fine-tunes the melody pitch estimates. The invention builds a log-frequency-domain graph structure for singing melody extraction and learns the parameters of the graph convolutional network automatically in a data-driven manner, achieving singing melody extraction with a lightweight parameter set.
Description
Technical Field
The invention relates to the technical field of audio signal processing, and in particular to a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals.
Background
Polyphonic music is a mixture of singing voice and accompaniment in which two or more sources may sound simultaneously, so the voice and the accompaniment overlap in both the time and frequency domains, making accurate extraction of the vocal melody difficult. At present, the perceptual attributes by which the human auditory system identifies a singing melody cannot be described precisely, so melody extraction modeling still lacks a rigorous theoretical foundation. Energy salience and temporal continuity are the two fundamental cues for melody extraction, and existing methods model them in different ways. Existing approaches fall into salience-based, source-separation-based, and machine-learning-based methods. Salience-based methods comprise spectral analysis, multi-pitch estimation, melody contour tracking, and similar steps; the energy salience of the melody is modeled by hand-crafted salience functions whose scientific soundness and rationality are hard to guarantee. Source-separation methods first separate or enhance the singing-voice component of the mixed signal and then estimate the melody pitch with a single-pitch estimator; source separation is an underdetermined problem for which satisfactory results are still lacking, which limits the performance of such methods. Machine-learning methods include conventional machine learning and deep learning. Because the melody is sometimes buried in interference, conventional machine-learning methods are not robust, while deep-learning methods suffer from large parameter counts and poor interpretability.
Disclosure of Invention
To address the problems in the prior art, the invention discloses a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals, which specifically comprises the following steps:
performing a constant-Q transform on the audio signal to obtain a log-frequency magnitude spectrum, truncating the magnitude spectrum to a given frequency range, and splicing the magnitude spectra of an odd number of consecutive frames centered on the i-th frame to obtain a spliced magnitude spectrum, which serves as the input feature of the i-th frame and is denoted X_i;
constructing the adjacency matrix corresponding to the spliced magnitude spectrum;
taking each frequency bin of the spliced magnitude spectrum as a node of the graph structure and determining the edges, i.e. the connections between nodes, from the adjacency matrix, so that the frequency components of the musical tone signal are represented by the graph structure;
discretizing the melody pitch frequency corresponding to the i-th frame signal to obtain a one-hot label vector, which serves as the output of the graph convolutional network, thereby obtaining the output label Y_i corresponding to the i-th frame input feature X_i;
training the graph convolutional network to obtain the optimal parameters;
performing melody pitch prediction on the test set with the trained network parameters, taking the frequency corresponding to the maximum value among the output nodes of the graph convolutional network as the preliminary melody pitch estimate;
applying median filtering to the preliminary melody pitch sequence produced by the graph convolutional network to obtain a smooth melody pitch track;
framing the audio signal, then zero-padding each frame and applying the short-time Fourier transform to obtain an STFT magnitude spectrum;
correcting the instantaneous amplitude and instantaneous frequency of the short-time Fourier transform with a phase vocoder;
computing a salience value frame by frame according to a salience function;
taking a band of a given frequency range centered on the smooth melody pitch track as the candidate range for the final singing melody output, searching for the maximum salience value within the candidate range, and, for non-zero frames, taking the frequency corresponding to the maximum salience value as the final singing melody output; zero-valued outputs of the graph convolutional network are left uncorrected.
The salience function is:

$$S(b)=\sum_{h=1}^{H}\sum_{i=1}^{I}\mathrm{Tr}(a_i)\,w(b,h,f_i)\,a_i$$

where $a_i$ is the amplitude of the $i$-th spectral peak, $f_i$ is its frequency, $\mathrm{Tr}(a_i)$ is an amplitude threshold function, $w(b,h,f_i)$ is a weight function, and the sums run over the $H$ harmonics considered and the $I$ spectral peaks of the frame.
The invention provides a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals. A constant-Q transform of the mixed audio signal yields a log-frequency magnitude spectrum; an adjacency matrix is built from the frequency-bin relationships between the fundamental and the harmonic components of the same musical source, giving a graph structure; a graph convolutional network learns the complex input-output mapping, and for each frame the frequency corresponding to the maximum value among the output nodes is taken as the preliminary melody pitch estimate of that frame; finally, a post-processing step constructs a salience spectrogram and fine-tunes the pitch estimates, so that the method achieves high accuracy and robustness.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is evident that the drawings described below are only some embodiments of the present application, and that other drawings can be derived from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a time-domain waveform of a music signal according to the present invention;
FIG. 3 is the constant-Q transform magnitude spectrum of the music signal according to the present invention;
FIG. 4 is a schematic diagram of the adjacency matrix according to the present invention;
FIG. 5 shows the preliminary melody sequence estimate according to the present invention;
FIG. 6 is the post-processing salience spectrogram of the present invention;
FIG. 7 shows the final singing melody extraction result according to the present invention.
Detailed Description
To make the technical solution and advantages of the present invention clearer, the technical solution in the embodiments of the present invention is described clearly and completely below with reference to the accompanying drawings:
The singing melody extraction method provided by the invention is shown in FIG. 1. First, the constant-Q transform of the mixed audio signal is computed to obtain a log-frequency magnitude spectrum. Next, a graph structure is derived from the frequency-bin relationships between the fundamental and the harmonic components of the same musical source. Then the constant-Q magnitude spectrum is fed as input to a graph convolutional network whose output is the melody pitch encoded as a one-hot vector; the network learns the complex input-output mapping, and for each frame the frequency corresponding to the maximum value among the output nodes is taken as the preliminary melody pitch estimate of that frame. Finally, a post-processing step constructs a salience spectrogram and fine-tunes the melody pitch estimates.
Examples:
Considering that melodies have a typical harmonic structure, each frequency bin of the music-signal spectrum is represented by a node and the harmonic relations of musical tones are represented by edges, so the internal connections among the harmonics of a given musical source can be represented by a graph structure, enabling graph-based singing melody extraction. To improve the smoothness of the melody pitch track and to reduce quantization error, the invention fine-tunes the preliminary melody pitch estimate with a pitch salience function. The specific scheme comprises the following steps:
S1, a piece of audio signal is given; its time-domain waveform is shown in FIG. 2. A constant-Q transform with a frequency resolution of 12 bins per octave is applied to the audio signal to obtain a log-frequency magnitude spectrum, as shown in FIG. 3. The magnitude spectrum is truncated to the frequency range 47.65-8141.46 Hz, so that the magnitude spectrum of each audio frame has 90 frequency bins. For each frame, the magnitude spectra of the 3 consecutive frames from the previous frame to the next frame are spliced to obtain a spliced magnitude spectrum of length 270, which serves as the i-th frame input feature, denoted X_i.
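For illustration only, a minimal Python sketch of this feature-extraction step using librosa; the file name, sampling rate, and hop length are assumptions, while the 12 bins/octave resolution, the 47.65 Hz origin, the 90 bins, and the 3-frame splice follow S1:

```python
import numpy as np
import librosa

# Load the mixed audio signal (file name, sample rate, hop are assumptions).
y, sr = librosa.load("mixture.wav", sr=44100)

# Constant-Q transform: 12 bins/octave starting at 47.65 Hz, 90 bins
# (covering roughly 47.65-8141.46 Hz as in S1).
C = np.abs(librosa.cqt(y, sr=sr, fmin=47.65, n_bins=90,
                       bins_per_octave=12, hop_length=512))

# Log-frequency magnitude spectrum with frames along axis 0.
S = C.T  # shape: (num_frames, 90)

# Splice previous, current and next frame into a 270-dim feature X_i.
S_pad = np.pad(S, ((1, 1), (0, 0)), mode="edge")
X = np.concatenate([S_pad[:-2], S_pad[1:-1], S_pad[2:]], axis=1)
print(X.shape)  # (num_frames, 270)
```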
S2, the adjacency matrix corresponding to the splice of 3 magnitude-spectrum frames is constructed, as shown in FIG. 4; the specific calculation formula is:

$$A_{ij}=\begin{cases}1, & \left(|i-j| \bmod n\right)=\operatorname{round}\!\left(12\log_{2}h\right),\ h=1,\dots,5\\ 0, & \text{otherwise}\end{cases}\tag{1}$$

where $n=90$, $h=1,\dots,5$, and $i,j=1,\dots,270$.
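The sketch below shows one plausible reading of formula (1), under the assumption that two nodes are connected when their bin distance (modulo the frame length n) equals the log-frequency offset round(12·log2 h) of a harmonic h = 1..5; this interpretation, and the helper name build_adjacency, are illustrative rather than fixed by the patent:

```python
import numpy as np

def build_adjacency(n=90, frames=3, max_h=5):
    """Adjacency over the 270 spliced bins: nodes i, j are connected when
    their bin distance (mod n) equals the log-frequency offset of a
    harmonic h = 1..max_h at 12 bins/octave (an interpretation of Eq. 1)."""
    size = n * frames
    offsets = {int(round(12 * np.log2(h))) for h in range(1, max_h + 1)}
    # offsets = {0, 12, 19, 24, 28}: unison, octave, octave + fifth, ...
    A = np.zeros((size, size), dtype=np.float32)
    for i in range(size):
        for j in range(size):
            if abs(i - j) % n in offsets:
                A[i, j] = 1.0
    return A

A = build_adjacency()
print(A.shape, int(A.sum()))  # (270, 270) and the number of edges
```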
S3, each frequency bin of the spliced magnitude spectrum is taken as a node of the graph structure, and the edges, i.e. the connections between the nodes, are determined by the adjacency matrix defined in formula (1), so that the components of the musical tone signal are represented by the graph structure.
S4, the melody pitch frequency corresponding to the i-th frame signal is discretized at a resolution of 12 bins per octave to obtain a one-hot label vector, which serves as the output of the graph convolutional network; the output label Y_i corresponding to the i-th frame input feature X_i is thus obtained.
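A sketch of this label encoding, assuming bin 0 is reserved for unvoiced (pitch 0) frames and the pitch grid starts at 47.65 Hz as in S1; both conventions are assumptions:

```python
import numpy as np

FMIN, N_BINS = 47.65, 90

def pitch_to_one_hot(f0):
    """One-hot label: bin 0 = unvoiced, bins 1..N_BINS = quantized pitch."""
    y = np.zeros(N_BINS + 1, dtype=np.float32)
    if f0 <= 0:                      # unvoiced frame
        y[0] = 1.0
    else:
        b = int(round(12 * np.log2(f0 / FMIN)))
        y[1 + np.clip(b, 0, N_BINS - 1)] = 1.0
    return y

print(np.argmax(pitch_to_one_hot(220.0)))  # 220 Hz lands near bin 27
```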
S5, parameter training is performed on the training set: binary cross-entropy is chosen as the loss function, Adam as the optimizer, the learning rate is set to 0.001, 1000 epochs are trained with a batch size of 256, and the optimal graph convolutional network parameters are obtained by iteration.
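A minimal PyTorch sketch of such a training setup; the two-layer graph-convolution architecture, hidden width, and renormalized propagation matrix are assumptions (it reuses build_adjacency from the S2 sketch), while the binary cross-entropy loss, Adam optimizer, learning rate 0.001, 1000 epochs, and batch size 256 follow S5:

```python
import torch
import torch.nn as nn

class GCN(nn.Module):
    def __init__(self, A, hidden=64, n_out=91):
        super().__init__()
        A_hat = A + torch.eye(A.size(0))           # add self-loops
        d = A_hat.sum(1).rsqrt()                   # D^{-1/2}
        self.register_buffer("A_norm", d[:, None] * A_hat * d[None, :])
        self.w1 = nn.Linear(1, hidden)             # per-node feature lift
        self.w2 = nn.Linear(hidden, 1)
        self.out = nn.Linear(A.size(0), n_out)     # readout to pitch bins

    def forward(self, x):                          # x: (batch, 270)
        h = torch.relu(self.w1(self.A_norm @ x.unsqueeze(-1)))
        h = (self.A_norm @ self.w2(h)).squeeze(-1)
        return self.out(h)                         # logits over pitch bins

A = torch.tensor(build_adjacency())                # from the S2 sketch
model = GCN(A)
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCEWithLogitsLoss()                   # binary cross-entropy

# X_train: (N, 270) spliced spectra, Y_train: (N, 91) one-hot labels.
# for epoch in range(1000):
#     for xb, yb in loader:                       # batch size 256
#         opt.zero_grad()
#         loss = loss_fn(model(xb), yb)
#         loss.backward(); opt.step()
```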
S6, melody pitch is predicted on the test set with the trained network parameters, and the frequency corresponding to the maximum value among the output nodes of the graph convolutional network is taken as the preliminary melody pitch estimate, as shown in FIG. 5.
S7, median filtering (filter window width 7) is applied to the preliminary melody pitch sequence produced by the graph convolutional network to obtain a smooth melody pitch track.
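A sketch of the decoding and smoothing of S6-S7, keeping the assumed bin-0 = unvoiced convention from the label sketch; scipy's median filter provides the width-7 smoothing:

```python
import numpy as np
from scipy.signal import medfilt

def decode_pitch(logits):
    """Argmax over output nodes -> frequency in Hz (0 for unvoiced)."""
    bins = np.argmax(logits, axis=1)               # (num_frames,)
    return np.where(bins == 0, 0.0,
                    47.65 * 2.0 ** ((bins - 1) / 12.0))

# f0 = decode_pitch(model(X_test).detach().numpy())
# f0_smooth = medfilt(f0, kernel_size=7)           # width-7 median filter
```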
S8, the audio signal is framed with 2048 points per frame; each frame is zero-padded and an 8192-point short-time Fourier transform is performed, yielding the STFT magnitude spectrum.
S9, a phase vocoder is used to correct the instantaneous amplitude and instantaneous frequency of the short-time Fourier transform.
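A sketch of S8-S9, with y and sr carried over from the S1 sketch: a 2048-sample window zero-padded to an 8192-point FFT, and instantaneous frequency recovered from frame-to-frame phase differences in the standard phase-vocoder manner (the hop length is an assumption):

```python
import numpy as np
import librosa

hop = 512                                          # assumed hop length
D = librosa.stft(y, n_fft=8192, win_length=2048, hop_length=hop)
mag = np.abs(D)                                    # instantaneous amplitude

# Phase-vocoder instantaneous frequency: unwrap the phase advance
# against the expected advance of each bin over one hop.
phase = np.angle(D)
bin_freq = 2 * np.pi * np.arange(D.shape[0]) * hop / 8192
dphi = np.diff(phase, axis=1) - bin_freq[:, None]
dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi     # principal value
inst_freq = (bin_freq[:, None] + dphi) * sr / (2 * np.pi * hop)  # Hz
```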
S10, a salience value is computed frame by frame according to the salience function, as shown in FIG. 6. The salience function is:

$$S(b)=\sum_{h=1}^{H}\sum_{i=1}^{I}\mathrm{Tr}(a_i)\,w(b,h,f_i)\,a_i$$

where $a_i$ is the amplitude of the $i$-th spectral peak, $f_i$ is its frequency, $\mathrm{Tr}(a_i)$ is an amplitude threshold function, and $w(b,h,f_i)$ is a weight function.
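For illustration, a sketch of one common harmonic-summation instantiation of such a salience function (in the style of Salamon and Gómez); the 40 dB threshold standing in for Tr, the harmonic decay 0.8^(h-1) standing in for w, and the 10-cent salience grid are assumptions, not values fixed by the patent:

```python
import numpy as np

N_H, FMIN = 5, 47.65                   # harmonic count and grid origin (assumed)

def salience_frame(peak_freqs, peak_amps, n_bins=600, bins_per_oct=120):
    """Harmonic-summation salience over a 10-cent grid (assumed layout)."""
    S = np.zeros(n_bins)
    a_max = peak_amps.max() if peak_amps.size else 0.0
    for f, a in zip(peak_freqs, peak_amps):
        if f <= 0 or a < a_max * 10 ** (-40 / 20):    # Tr: 40 dB threshold
            continue
        for h in range(1, N_H + 1):
            # The peak at f contributes to the f0 candidate f / h.
            b = bins_per_oct * np.log2(f / (h * FMIN))
            b0 = int(round(b))
            if 0 <= b0 < n_bins:
                S[b0] += (0.8 ** (h - 1)) * a         # decaying weight w
    return S
```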
S11, a band of ±1.5 semitones centered on the smooth melody pitch track is taken as the candidate range for the final singing melody output; the maximum salience value is searched within the candidate range, the frequency corresponding to the maximum salience value replaces the non-zero outputs of the graph convolutional network, and zero-valued outputs of the network are left uncorrected. The result is shown in FIG. 7.
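A sketch of this refinement step, reusing the assumed 10-cent salience grid of the previous sketch (salience stacked as an array of shape (n_bins, num_frames)):

```python
import numpy as np

def refine(f0_smooth, salience, bins_per_oct=120, fmin=47.65):
    """Pick the most salient bin within +/-1.5 semitones of each smoothed
    pitch; leave unvoiced (0 Hz) frames untouched."""
    half_band = int(1.5 * bins_per_oct / 12)       # 1.5 semitones in bins
    out = f0_smooth.copy()
    for t, f in enumerate(f0_smooth):
        if f <= 0:                                 # 0-valued output: keep
            continue
        c = int(round(bins_per_oct * np.log2(f / fmin)))
        lo = max(c - half_band, 0)
        hi = min(c + half_band + 1, salience.shape[0])
        b = lo + int(np.argmax(salience[lo:hi, t]))
        out[t] = fmin * 2.0 ** (b / bins_per_oct)
    return out
```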
Considering that the spacing between the fundamental and its harmonics in the log-frequency domain is invariant to the fundamental frequency, the invention constructs a log-frequency-domain graph structure for singing melody extraction and learns the parameters of the graph convolutional network automatically in a data-driven manner, thereby achieving singing melody extraction with a lightweight parameter set.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention, according to the technical solution and the inventive concept thereof, shall be covered by the scope of protection of the present invention.
Claims (1)
1. A method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals, characterized by comprising:
performing a constant-Q transform on the audio signal to obtain a log-frequency magnitude spectrum, truncating the magnitude spectrum to a given frequency range, and splicing the magnitude spectra of the 3 consecutive frames running from the frame before the i-th frame to the frame after it to obtain a spliced magnitude spectrum, which serves as the input feature of the i-th frame and is denoted X_i;
constructing the adjacency matrix corresponding to the splice of 3 magnitude-spectrum frames, the specific calculation formula being:

$$A_{ij}=\begin{cases}1, & \left(|i-j| \bmod n\right)=\operatorname{round}\!\left(12\log_{2}h\right),\ h=1,\dots,5\\ 0, & \text{otherwise}\end{cases}$$

where $n=90$, $h=1,\dots,5$, and $i,j=1,\dots,270$;
taking each frequency bin of the spliced magnitude spectrum as a node of the graph structure and determining the edges, i.e. the connections between nodes, from the adjacency matrix, so that the frequency components of the musical tone signal are represented by the graph structure;
discretizing the melody pitch frequency corresponding to the i-th frame signal to obtain a one-hot label vector, which serves as the output of the graph convolutional network, thereby obtaining the output label Y_i corresponding to the i-th frame input feature X_i;
training the graph convolutional network to obtain the optimal parameters;
performing melody pitch prediction on the test set with the trained network parameters, taking the frequency corresponding to the maximum value among the output nodes of the graph convolutional network as the preliminary melody pitch estimate;
applying median filtering to the preliminary melody pitch sequence produced by the graph convolutional network to obtain a smooth melody pitch track;
framing the audio signal, then zero-padding each frame and applying the short-time Fourier transform to obtain an STFT magnitude spectrum;
correcting the instantaneous amplitude and instantaneous frequency of the short-time Fourier transform with a phase vocoder;
computing a salience value frame by frame according to a salience function,
the salience function being:

$$S(b)=\sum_{h=1}^{H}\sum_{i=1}^{I}\mathrm{Tr}(a_i)\,w(b,h,f_i)\,a_i$$

where $a_i$ is the amplitude of the $i$-th spectral peak, $f_i$ is its frequency, $\mathrm{Tr}(a_i)$ is an amplitude threshold function, and $w(b,h,f_i)$ is a weight function;
taking a band of ±1.5 semitones centered on the smooth melody pitch track as the candidate range for the final singing melody output, searching for the maximum salience value within the candidate range, correcting the non-zero outputs of the graph convolutional network with the frequency corresponding to the maximum salience value, and leaving the zero-valued outputs of the graph convolutional network uncorrected.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211120049.1A CN115579018B (en) | 2022-09-14 | 2022-09-14 | A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211120049.1A CN115579018B (en) | 2022-09-14 | 2022-09-14 | A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115579018A CN115579018A (en) | 2023-01-06 |
| CN115579018B true CN115579018B (en) | 2025-05-30 |
Family
ID=84580870
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211120049.1A Active CN115579018B (en) | 2022-09-14 | 2022-09-14 | A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115579018B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119541540A (en) * | 2024-11-14 | 2025-02-28 | 西北大学 | A pitch estimation method based on compressed excitation block |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113111786A (en) * | 2021-04-15 | 2021-07-13 | 西安电子科技大学 | Underwater target identification method based on small sample training image convolutional network |
| CN114158004A (en) * | 2021-12-09 | 2022-03-08 | 重庆邮电大学 | Indoor passive moving target detection method based on graph convolution neural network |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| FR3031225B1 (en) * | 2014-12-31 | 2018-02-02 | Audionamix | IMPROVED SEPARATION METHOD AND COMPUTER PROGRAM PRODUCT |
| CN113593606B (en) * | 2021-09-30 | 2022-02-15 | 清华大学 | Audio recognition method and apparatus, computer equipment, computer readable storage medium |
- 2022-09-14: Application CN202211120049.1A filed in China; granted as CN115579018B (status: Active)
Also Published As
| Publication number | Publication date |
|---|---|
| CN115579018A (en) | 2023-01-06 |
Similar Documents
| Publication | Title |
|---|---|
| CN111128213A (en) | Noise suppression method and system for processing in different frequency bands | |
| Fernandez et al. | Classical and novel discriminant features for affect recognition from speech. | |
| Durrieu et al. | Singer melody extraction in polyphonic signals using source separation methods | |
| CN108417228A (en) | Human voice timbre similarity measurement method under musical instrument timbre transfer | |
| CN110648684B (en) | Bone conduction voice enhancement waveform generation method based on WaveNet | |
| CN110310621A (en) | Singing synthesis method, device, equipment and computer-readable storage medium | |
| CN114302301B (en) | Frequency response correction method and related product | |
| Shi et al. | Music genre classification based on chroma features and deep learning | |
| CN109979488A (en) | Voice based on stress analysis turns music notation system | |
| CN105845149A (en) | Predominant pitch acquisition method in acoustical signal and system thereof | |
| Sebastian et al. | Group delay based music source separation using deep recurrent neural networks | |
| CN118298845B (en) | Training method, training device, training medium and training equipment for pitch recognition model of complex tone audio | |
| CN115579018B (en) | A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals | |
| CN114627892A (en) | A method for extracting the main melody of multi-voice music based on deep learning | |
| Rajan et al. | Group delay based melody monopitch extraction from music | |
| CN114299918A (en) | Acoustic model training and speech synthesis method, device and system, and storage medium | |
| CN119864047B (en) | Audio separation method, system and related device | |
| CN110675845A (en) | Human voice humming accurate recognition algorithm and digital notation method | |
| JP4217616B2 (en) | Two-stage pitch judgment method and apparatus | |
| Kawahara et al. | Higher order waveform symmetry measure and its application to periodicity detectors for speech and singing with fine temporal resolution | |
| CN109697985B (en) | Voice signal processing method and device and terminal | |
| Vincent et al. | Single-channel mixture decomposition using Bayesian harmonic models | |
| Gong et al. | Monaural musical octave sound separation using relaxed extended common amplitude modulation | |
| Gu et al. | A discrete-cepstrum based spectrum-envelope estimation scheme and its example application of voice transformation | |
| Voinov et al. | Implementation and Analysis of Algorithms for Pitch Estimation in Musical Fragments |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||