
CN115579018B - A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals - Google Patents

A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals

Info

Publication number
CN115579018B
Authority
CN
China
Prior art keywords
graph
frame
melody
amplitude spectrum
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211120049.1A
Other languages
Chinese (zh)
Other versions
CN115579018A (en)
Inventor
张维维
闫凌宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University
Priority to CN202211120049.1A
Publication of CN115579018A
Application granted
Publication of CN115579018B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention discloses a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals. First, a logarithmic-frequency magnitude spectrum is obtained by computing the constant-Q transform of the mixed audio signal. Next, a graph structure is derived from the positional relationship between the frequency bins of the fundamental and the harmonic components of the same musical source. Then, the constant-Q magnitude spectrum is used as the input of a graph convolutional network, the melody pitch is converted into a one-hot vector used as the network output, the graph convolutional network learns the complex input-output mapping function, and the frequency corresponding to the maximum value among the output nodes in each frame is taken as the preliminary melody pitch estimate for that frame. Finally, a post-processing step constructs a salience spectrogram and fine-tunes the melody pitch estimates. The invention constructs a logarithmic-frequency-domain graph structure for singing melody extraction and learns the graph convolutional network parameters automatically in a data-driven manner, achieving singing melody extraction with a lightweight parameter set.

Description

Method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals
Technical Field
The invention relates to the technical field of audio signal processing, and in particular to a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals.
Background
Polyphonic music is a mixture of singing voice and accompaniment, possibly with two or more simultaneously sounding sources, so the voice and accompaniment overlap in both the time and frequency domains, making it difficult to extract the vocal melody accurately. At present, the perceptual attributes of human hearing with respect to the singing melody cannot be described precisely, so melody extraction modeling still lacks exact theoretical support. Energy salience and temporal continuity are the two fundamental cues for melody extraction, and existing methods model them in different ways. Existing melody extraction methods fall into salience-based, source-separation-based, and machine-learning-based categories. Salience-based methods comprise spectral analysis, multi-pitch estimation, and melody track tracking steps; melody energy salience is modeled with manually designed salience functions, whose scientific soundness and rationality are difficult to guarantee. Source-separation methods separate or enhance the singing voice component of the mixed signal and then estimate the melody pitch with a single-pitch estimator; source separation is an underdetermined problem for which satisfactory results are still lacking, which limits the performance of such methods. Machine-learning methods include conventional machine learning and deep learning. Because the melody is sometimes buried in noise, conventional machine-learning methods are not robust, while deep-learning methods suffer from large parameter scale and poor interpretability.
Disclosure of Invention
In view of the problems in the prior art, the invention discloses a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals, which specifically comprises the following steps:
performing a constant-Q transform on the audio signal to obtain a logarithmic-frequency magnitude spectrum, cropping the magnitude spectrum to a certain frequency range, and splicing the magnitude spectra of an odd number of consecutive frames centered on the i-th frame to obtain a spliced magnitude spectrum, which is taken as the input feature of the i-th frame and denoted X_i;
constructing an adjacency matrix corresponding to the spliced magnitude spectrum;
taking each frequency bin of the spliced magnitude spectrum as a node of the graph structure and determining the edges, i.e. the connections between the nodes, according to the adjacency matrix, so that every frequency component of the musical tone signal is represented by the graph structure;
discretizing the melody pitch frequency corresponding to the i-th frame signal to obtain a one-hot output label, which is taken as the output of the graph convolutional network, yielding the output label Y_i corresponding to the i-th frame input feature X_i;
training the graph convolutional network to obtain the optimal parameters;
performing melody pitch prediction on the test set with the trained network parameters, and taking the frequency corresponding to the maximum value among the output nodes of the graph convolutional network as the preliminary melody pitch estimate;
median filtering the preliminary melody pitch sequence produced by the graph convolutional network to obtain a smooth melody pitch track;
framing the audio signal, then zero-padding each frame and applying a short-time Fourier transform to obtain the short-time Fourier magnitude spectrum;
correcting the instantaneous amplitude and instantaneous frequency of the short-time Fourier magnitude spectrum with a phase vocoder;
computing a salience value frame by frame according to a salience function;
and taking a band formed by a certain frequency range centered on the smooth melody pitch track as the candidate range for the final singing melody output, searching for the maximum salience value within the candidate range, taking the frequency corresponding to the maximum salience value as the final singing melody output in the non-zero frequency range, and leaving the zero-valued outputs of the graph convolutional network uncorrected.
The salience function is

S(b) = Σ_{h=1}^{N_h} Σ_{i=1}^{I} Tr(a_i) · w(b, h, f_i) · a_i,

where a_i is the amplitude of the i-th spectral peak, f_i its frequency, Tr(a_i) is an amplitude threshold function, w(b, h, f_i) is a weight function, and the sums run over the I spectral peaks of the frame and the first N_h harmonics.
The invention provides a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals: a constant-Q transform of the mixed audio signal yields a logarithmic-frequency magnitude spectrum; an adjacency matrix built from the bin relationships between the fundamental frequency and the harmonic components of the same musical source yields a graph structure; a graph convolutional network learns the complex input-output mapping, with the frequency corresponding to the maximum value among each frame's output nodes taken as that frame's preliminary melody pitch estimate; and a post-processing step constructs a salience spectrogram and fine-tunes the melody pitch estimates, so that the method achieves high accuracy and robustness.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and other drawings can be derived from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a time-domain waveform of a music signal according to the present invention;
FIG. 3 is the constant-Q transform magnitude spectrum of the music signal according to the present invention;
FIG. 4 is a schematic diagram of the adjacency matrix according to the present invention;
FIG. 5 shows the preliminary melody sequence estimate according to the present invention;
FIG. 6 is the post-processing salience spectrogram of the present invention;
FIG. 7 shows the final singing melody extraction result according to the present invention.
Detailed Description
To make the technical solution and advantages of the present invention clearer, the technical solution in the embodiments of the present invention is described below clearly and completely with reference to the accompanying drawings:
The singing melody extraction method provided by the invention is shown in FIG. 1. First, the constant-Q transform of the mixed audio signal is computed to obtain a logarithmic-frequency magnitude spectrum. Next, a graph structure is derived from the positional relationship between the frequency bins of the fundamental and the harmonic components of the same musical source. Then, the constant-Q magnitude spectrum is used as the input of a graph convolutional network, the melody pitch is converted into a one-hot vector used as the network output, the graph convolutional network learns the complex input-output mapping function, and the frequency corresponding to the maximum value among the output nodes in each frame is taken as the preliminary melody pitch estimate for that frame. Finally, a post-processing step constructs a salience spectrogram and fine-tunes the melody pitch estimates.
Examples:
Considering that melodies have typical harmonic structure, each frequency bin of the music signal spectrum is represented by a node and the harmonic relationships of musical tones are represented by edges, so that the internal connections among the harmonics of a given musical source can be represented by a graph structure, enabling graph-based singing melody extraction. To improve the smoothness of the melody pitch track and reduce quantization error, the invention fine-tunes the preliminary melody pitch estimates with a pitch salience function. The specific scheme comprises the following steps:
S1. Given a segment of audio signal whose time-domain waveform is shown in FIG. 2, a constant-Q transform with a frequency resolution of 12 bins per octave is applied to obtain a logarithmic-frequency magnitude spectrum, as shown in FIG. 3. The magnitude spectrum is cropped to the frequency range 47.65-8141.46 Hz, so that each frame of the audio signal has 90 frequency bins. For each frame, the magnitude spectra of 3 consecutive frames (the previous, current, and next frame) are spliced into a spliced magnitude spectrum of length 270, which serves as the i-th frame input feature X_i.
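As an illustration, the following Python sketch reproduces this step under stated assumptions: librosa is assumed to supply the constant-Q transform, and the sampling rate and hop length are illustrative choices; only the 12 bins/octave resolution, the 47.65 Hz lower bound, the 90 bins, and the 3-frame splicing come from the step above.

```python
# Sketch of step S1 (assumptions: librosa for the CQT; sr and hop_length are
# illustrative and not specified in the patent).
import numpy as np
import librosa

def cqt_features(y, sr=22050, hop_length=512):
    # Log-frequency magnitude spectrum: 90 bins at 12 bins/octave from 47.65 Hz.
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                           fmin=47.65, n_bins=90, bins_per_octave=12)).T
    # Splice previous, current, and next frame into a length-270 feature X_i.
    padded = np.pad(C, ((1, 1), (0, 0)), mode='edge')  # replicate edge frames
    return np.concatenate([padded[:-2], padded[1:-1], padded[2:]], axis=1)
```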
S2. An adjacency matrix corresponding to the 3-frame spliced magnitude spectrum is constructed, as shown in FIG. 4, with the specific calculation formula:
where n = 90, h = 1, …, 5, and i, j = 1, …, 270.
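The formula itself is reproduced only as an image in the original document. As a hedged illustration, the sketch below builds one plausible adjacency under the assumption that two bins of the same frame are connected when their log-frequency offset equals the interval between a fundamental and its h-th harmonic, i.e. round(12·log2 h) bins for h = 1, …, 5 (offsets 0, 12, 19, 24, 28 at 12 bins/octave); the patent's exact formula (1) may differ.

```python
# Hedged sketch of step S2: one plausible harmonic adjacency for the 270-node
# spliced spectrum (the patent's exact formula (1) appears only in its figures).
import math
import numpy as np

def build_adjacency(n_bins=90, n_frames=3, n_harmonics=5):
    # Bin offsets between a fundamental and its h-th harmonic at 12 bins/octave.
    offsets = sorted({int(round(12 * math.log2(h))) for h in range(1, n_harmonics + 1)})
    n = n_bins * n_frames                      # 270 nodes in total
    A = np.zeros((n, n))
    for f in range(n_frames):                  # harmonic links stay within a frame
        base = f * n_bins
        for i in range(n_bins):
            for d in offsets:
                if i + d < n_bins:
                    A[base + i, base + i + d] = 1
                    A[base + i + d, base + i] = 1  # undirected; d = 0 adds self-loops
    return A
```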
S3. Each frequency bin of the spliced magnitude spectrum serves as a node of the graph structure, and the edges, i.e. the connections between the nodes, are determined by the adjacency matrix defined by formula (1), so that every component of the musical tone signal is represented by the graph structure.
S4. The melody pitch frequency corresponding to the i-th frame signal is discretized at a resolution of 12 bins per octave to obtain a one-hot output label, which is used as the output of the graph convolutional network; this yields the output label Y_i corresponding to the i-th frame input feature X_i.
S5. Parameters are trained on the training set, with binary cross-entropy as the loss function, Adam as the optimizer, a learning rate of 0.001, 1000 training epochs, and a batch size of 256; iteration yields the optimal graph convolutional network parameters.
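For concreteness, a minimal PyTorch training sketch follows. The network depth, hidden widths, output dimension, and graph propagation rule are illustrative assumptions; the patent fixes only the loss, optimizer, learning rate, epoch count, and batch size.

```python
# Minimal training sketch for step S5 (architecture details are assumed).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, A_hat):
        super().__init__()
        self.A_hat = A_hat                      # (270, 270) normalized adjacency
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                       # x: (batch, 270, in_dim)
        return torch.relu(self.lin(self.A_hat @ x))

def make_model(A, n_classes=91):                # 90 pitch bins + a no-melody class (assumed)
    A_hat = A + torch.eye(A.size(0))            # add self-loops
    d = A_hat.sum(1).rsqrt()
    A_hat = d[:, None] * A_hat * d[None, :]     # symmetric degree normalization
    return nn.Sequential(
        GCNLayer(1, 16, A_hat),
        GCNLayer(16, 4, A_hat),
        nn.Flatten(),                           # (batch, 270 * 4)
        nn.Linear(270 * 4, n_classes),
        nn.Sigmoid(),                           # pairs with binary cross-entropy
    )

def train(model, loader, epochs=1000, lr=0.001):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for X, Y in loader:                     # X: (256, 270, 1), Y: one-hot (256, 91)
            opt.zero_grad()
            loss_fn(model(X), Y).backward()
            opt.step()
```

Under these assumptions the model would be created with make_model(torch.tensor(build_adjacency(), dtype=torch.float32)) and trained on the (X_i, Y_i) pairs from steps S1 and S4.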
S6. Melody pitch is predicted on the test set with the trained network parameters, and the frequency corresponding to the maximum value among the graph convolutional network output nodes is taken as the preliminary melody pitch estimate, as shown in FIG. 5.
S7. The preliminary melody pitch sequence produced by the graph convolutional network is median filtered (filter window width 7) to obtain a smooth melody pitch track.
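A sketch of steps S6 and S7 follows; the array names and shapes (including model and X_test from the previous sketch) are assumed for illustration.

```python
# Sketch of steps S6-S7 (assumed: probs has shape (n_frames, n_classes)).
import numpy as np
from scipy.signal import medfilt

probs = model(X_test).detach().numpy()          # network outputs on the test set
pitch_bins = probs.argmax(axis=1)               # preliminary melody pitch estimate (S6)
smooth_bins = medfilt(pitch_bins.astype(float), kernel_size=7)  # width-7 median (S7)
```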
S8. The audio signal is framed with 2048 samples per frame; each frame is zero-padded and an 8192-point short-time Fourier transform is applied, yielding the short-time Fourier magnitude spectrum.
S9. A phase vocoder is used to correct the instantaneous amplitude and instantaneous frequency of the short-time Fourier transform spectrum.
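The sketch below illustrates steps S8 and S9. The hop length is an assumption (the patent fixes only the 2048-sample frames and the 8192-point zero-padded FFT), and the instantaneous-frequency correction follows the standard phase-vocoder formulation rather than the patent's exact expressions.

```python
# Sketch of steps S8-S9 (hop length assumed; scipy supplies the STFT).
import numpy as np
from scipy.signal import stft

hop = 512
f, t, Z = stft(y, fs=sr, nperseg=2048, noverlap=2048 - hop, nfft=8192)
mag = np.abs(Z)                                     # short-time Fourier magnitude

# Instantaneous frequency from the frame-to-frame phase advance (phase vocoder).
dphi = np.diff(np.angle(Z), axis=1)
expected = 2 * np.pi * hop * f[:, None] / sr        # nominal advance per hop, per bin
dev = np.mod(dphi - expected + np.pi, 2 * np.pi) - np.pi  # wrap deviation to (-pi, pi]
inst_freq = f[:, None] + dev * sr / (2 * np.pi * hop)     # refined bin frequencies
```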
S10. A salience value is computed frame by frame according to the salience function, as shown in FIG. 6:

S(b) = Σ_{h=1}^{N_h} Σ_{i=1}^{I} Tr(a_i) · w(b, h, f_i) · a_i,

where a_i is the amplitude of the i-th spectral peak, f_i its frequency, Tr(a_i) is an amplitude threshold function, w(b, h, f_i) is a weight function, and the sums run over the I spectral peaks of the frame and the first N_h harmonics.
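Because Tr and w are given only in the patent's figures, the sketch below substitutes one common choice in the same spirit: Tr gates out peaks more than 40 dB below the frame maximum, and w applies a per-harmonic decay to salience bins within half a semitone of the implied fundamental. Both choices are assumptions.

```python
# Hedged sketch of a harmonic-summation salience for step S10 (Tr and w are
# assumed; the patent's exact definitions are not reproduced here).
import numpy as np

def salience(bins_hz, peak_freqs, peak_amps, n_harm=5, alpha=0.8, gamma_db=40.0):
    S = np.zeros(len(bins_hz))
    a_max = peak_amps.max() if len(peak_amps) else 0.0
    for f_i, a_i in zip(peak_freqs, peak_amps):
        if a_i <= 0 or (a_max and 20 * np.log10(a_i / a_max) < -gamma_db):
            continue                            # Tr(a_i): discard weak peaks
        for h in range(1, n_harm + 1):
            f0 = f_i / h                        # fundamental implied by harmonic h
            b = int(np.argmin(np.abs(bins_hz - f0)))
            if abs(12 * np.log2(bins_hz[b] / f0)) <= 0.5:  # within half a semitone
                S[b] += (alpha ** (h - 1)) * a_i  # w(b, h, f_i): harmonic decay
    return S
```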
S11. A band extending 1.5 semitones above and below the smooth melody pitch track is taken as the candidate range for the final singing melody output; the maximum salience value is searched within the candidate range, the frequency corresponding to it replaces each non-zero output of the graph convolutional network, and zero-valued outputs are left uncorrected. The result is shown in FIG. 7.
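A sketch of this refinement follows; the array shapes and helper names are assumed.

```python
# Sketch of step S11 (assumed: S_frames is (n_frames, n_bins) salience,
# smooth_f0 the smoothed pitch track in Hz, bins_hz the salience bin centers).
import numpy as np

def refine(smooth_f0, S_frames, bins_hz):
    out = smooth_f0.copy()
    for n, f0 in enumerate(smooth_f0):
        if f0 <= 0:
            continue                            # zero-valued outputs stay uncorrected
        band = np.abs(12 * np.log2(bins_hz / f0)) <= 1.5   # +/- 1.5 semitone band
        if band.any():
            idx = np.flatnonzero(band)
            out[n] = bins_hz[idx[S_frames[n, idx].argmax()]]
    return out
```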
Considering that the spacing between the fundamental and its harmonics in the logarithmic frequency domain is invariant to the fundamental frequency, the invention constructs a logarithmic-frequency-domain graph structure for singing melody extraction and learns the graph convolutional network parameters automatically in a data-driven manner, achieving singing melody extraction with a lightweight parameter set.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention, according to its technical solution and inventive concept, shall be covered by the scope of protection of the present invention.

Claims (1)

1. A method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals, characterized by comprising:
performing a constant-Q transform on the audio signal to obtain a logarithmic-frequency magnitude spectrum, cropping the magnitude spectrum to a certain frequency range, and splicing the magnitude spectra of 3 consecutive frames, namely the previous, current, and next frame of the i-th frame, to obtain a spliced magnitude spectrum,
which is taken as the i-th frame input feature and denoted X_i;
constructing an adjacency matrix corresponding to the 3-frame spliced magnitude spectrum, with the specific calculation formula:
where n = 90, h = 1, …, 5, and i, j = 1, …, 270;
taking each frequency bin of the spliced magnitude spectrum as a node of the graph structure and determining the edges, i.e. the connections between the nodes, according to the adjacency matrix, so that every frequency component of the musical tone signal is represented by the graph structure;
discretizing the melody pitch frequency corresponding to the i-th frame signal to obtain a one-hot output label, which is taken as the output of the graph convolutional network, yielding the output label Y_i corresponding to the i-th frame input feature X_i;
training the graph convolutional network to obtain the optimal parameters;
performing melody pitch prediction on the test set with the trained network parameters, and taking the frequency corresponding to the maximum value among the output nodes of the graph convolutional network as the preliminary melody pitch estimate;
median filtering the preliminary melody pitch sequence produced by the graph convolutional network to obtain a smooth melody pitch track;
framing the audio signal, then zero-padding each frame and applying a short-time Fourier transform to obtain the short-time Fourier magnitude spectrum;
correcting the instantaneous amplitude and instantaneous frequency of the short-time Fourier magnitude spectrum with a phase vocoder;
computing a salience value frame by frame according to a salience function;
the salience function being

S(b) = Σ_{h=1}^{N_h} Σ_{i=1}^{I} Tr(a_i) · w(b, h, f_i) · a_i,

where a_i is the amplitude of the i-th spectral peak, f_i its frequency, Tr(a_i) is an amplitude threshold function, and w(b, h, f_i) is a weight function; and
taking a band extending 1.5 semitones above and below the smooth melody pitch track as the candidate range for the final singing melody output, searching for the maximum salience value within the candidate range, correcting each non-zero output of the graph convolutional network with the frequency corresponding to the maximum salience value, and leaving the zero-valued outputs of the graph convolutional network uncorrected.
CN202211120049.1A 2022-09-14 2022-09-14 A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals Active CN115579018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211120049.1A CN115579018B (en) 2022-09-14 2022-09-14 A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211120049.1A CN115579018B (en) 2022-09-14 2022-09-14 A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals

Publications (2)

Publication Number Publication Date
CN115579018A CN115579018A (en) 2023-01-06
CN115579018B (en) 2025-05-30

Family

ID=84580870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211120049.1A Active CN115579018B (en) 2022-09-14 2022-09-14 A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals

Country Status (1)

Country Link
CN (1) CN115579018B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119541540A (en) * 2024-11-14 2025-02-28 西北大学 A pitch estimation method based on squeeze-and-excitation blocks

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111786A (en) * 2021-04-15 2021-07-13 西安电子科技大学 Underwater target identification method based on small sample training image convolutional network
CN114158004A (en) * 2021-12-09 2022-03-08 重庆邮电大学 Indoor passive moving target detection method based on graph convolution neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3031225B1 (en) * 2014-12-31 2018-02-02 Audionamix IMPROVED SEPARATION METHOD AND COMPUTER PROGRAM PRODUCT
CN113593606B (en) * 2021-09-30 2022-02-15 清华大学 Audio recognition method and apparatus, computer equipment, computer readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111786A (en) * 2021-04-15 2021-07-13 西安电子科技大学 Underwater target identification method based on small sample training image convolutional network
CN114158004A (en) * 2021-12-09 2022-03-08 重庆邮电大学 Indoor passive moving target detection method based on graph convolution neural network

Also Published As

Publication number Publication date
CN115579018A (en) 2023-01-06

Similar Documents

Publication Publication Date Title
CN111128213A (en) Noise suppression method and system for processing in different frequency bands
Fernandez et al. Classical and novel discriminant features for affect recognition from speech.
Durrieu et al. Singer melody extraction in polyphonic signals using source separation methods
CN108417228A (en) Human voice timbre similarity measurement method under musical instrument timbre transfer
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN110310621A (en) Singing synthesis method, device, equipment and computer-readable storage medium
CN114302301B (en) Frequency response correction method and related product
Shi et al. Music genre classification based on chroma features and deep learning
CN109979488A (en) Voice based on stress analysis turns music notation system
CN105845149A (en) Predominant pitch acquisition method in acoustical signal and system thereof
Sebastian et al. Group delay based music source separation using deep recurrent neural networks
CN118298845B (en) Training method, training device, training medium and training equipment for pitch recognition model of complex tone audio
CN115579018B (en) A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals
CN114627892A (en) A method for extracting the main melody of multi-voice music based on deep learning
Rajan et al. Group delay based melody monopitch extraction from music
CN114299918A (en) Acoustic model training and speech synthesis method, device and system, and storage medium
CN119864047B (en) Audio separation method, system and related device
CN110675845A (en) Human voice humming accurate recognition algorithm and digital notation method
JP4217616B2 (en) Two-stage pitch judgment method and apparatus
Kawahara et al. Higher order waveform symmetry measure and its application to periodicity detectors for speech and singing with fine temporal resolution
CN109697985B (en) Voice signal processing method and device and terminal
Vincent et al. Single-channel mixture decomposition using Bayesian harmonic models
Gong et al. Monaural musical octave sound separation using relaxed extended common amplitude modulation
Gu et al. A discrete-cepstrum based spectrum-envelope estimation scheme and its example application of voice transformation
Voinov et al. Implementation and Analysis of Algorithms for Pitch Estimation in Musical Fragments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant