
CN115579018B - A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals - Google Patents

A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals

Info

Publication number
CN115579018B
Authority
CN
China
Prior art keywords
graph
frame
melody
amplitude spectrum
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211120049.1A
Other languages
Chinese (zh)
Other versions
CN115579018A (en)
Inventor
张维维
闫凌宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University
Priority to CN202211120049.1A
Publication of CN115579018A
Application granted
Publication of CN115579018B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention discloses a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals. First, a logarithmic-frequency magnitude spectrum is obtained by computing the constant-Q transform of the mixed audio signal. Next, a graph structure is derived from the positional relationship between the frequency bins of the fundamental and the harmonic components of the same musical source. Then, the constant-Q magnitude spectrum is used as the input of a graph convolutional network, the melody pitch is converted into a one-hot vector used as the network output, the graph convolutional network learns the complex input-output mapping function, and the frequency corresponding to the maximum value among the output nodes in each frame is taken as the preliminary melody pitch estimate for that frame. Finally, a post-processing step constructs a salience spectrogram and fine-tunes the melody pitch estimates. The invention constructs a logarithmic-frequency-domain graph structure for singing melody extraction and learns the graph convolutional network parameters automatically in a data-driven manner, achieving singing melody extraction with a lightweight parameter set.

Description

Method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals
Technical Field
The invention relates to the technical field of audio signal processing, and in particular to a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals.
Background
Polyphonic music is a mixture of singing voice and accompaniment, possibly with two or more simultaneously sounding sources, so the voice and accompaniment overlap in both the time and frequency domains, making it difficult to extract the vocal melody accurately. At present, the perceptual attributes of human hearing with respect to the singing melody cannot be described precisely, so melody extraction modeling still lacks exact theoretical support. Energy salience and temporal continuity are the two fundamental cues for melody extraction, and existing methods model them in different ways. Existing melody extraction methods fall into salience-based, source-separation-based, and machine-learning-based categories. Salience-based methods comprise spectral analysis, multi-pitch estimation, and melody track tracking steps; melody energy salience is modeled with manually designed salience functions, whose scientific soundness and rationality are difficult to guarantee. Source-separation methods separate or enhance the singing voice component of the mixed signal and then estimate the melody pitch with a single-pitch estimator; source separation is an underdetermined problem for which satisfactory results are still lacking, which limits the performance of such methods. Machine-learning methods include conventional machine learning and deep learning. Because the melody is sometimes buried in noise, conventional machine-learning methods are not robust, while deep-learning methods suffer from large parameter scale and poor interpretability.
Disclosure of Invention
In view of the problems in the prior art, the invention discloses a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals, which specifically comprises the following steps:
performing a constant-Q transform on the audio signal to obtain a logarithmic-frequency magnitude spectrum, cropping the magnitude spectrum to a certain frequency range, and splicing the magnitude spectra of an odd number of consecutive frames centered on the i-th frame to obtain a spliced magnitude spectrum, which is taken as the input feature of the i-th frame and denoted X_i;
constructing an adjacency matrix corresponding to the spliced magnitude spectrum;
taking each frequency bin of the spliced magnitude spectrum as a node of the graph structure and determining the edges, i.e. the connections between the nodes, according to the adjacency matrix, so that every frequency component of the musical tone signal is represented by the graph structure;
discretizing the melody pitch frequency corresponding to the i-th frame signal to obtain a one-hot output label, which is taken as the output of the graph convolutional network, yielding the output label Y_i corresponding to the i-th frame input feature X_i;
training the graph convolutional network to obtain the optimal parameters;
performing melody pitch prediction on the test set with the trained network parameters, and taking the frequency corresponding to the maximum value among the output nodes of the graph convolutional network as the preliminary melody pitch estimate;
median filtering the preliminary melody pitch sequence produced by the graph convolutional network to obtain a smooth melody pitch track;
framing the audio signal, then zero-padding each frame and applying a short-time Fourier transform to obtain the short-time Fourier magnitude spectrum;
correcting the instantaneous amplitude and instantaneous frequency of the short-time Fourier magnitude spectrum with a phase vocoder;
computing a salience value frame by frame according to a salience function;
and taking a band formed by a certain frequency range centered on the smooth melody pitch track as the candidate range for the final singing melody output, searching for the maximum salience value within the candidate range, taking the frequency corresponding to the maximum salience value as the final singing melody output in the non-zero frequency range, and leaving the zero-valued outputs of the graph convolutional network uncorrected.
The salience function is

S(b) = Σ_{h=1}^{N_h} Σ_{i=1}^{I} Tr(a_i) · w(b, h, f_i) · a_i,

where a_i is the amplitude of the i-th spectral peak, f_i its frequency, Tr(a_i) is an amplitude threshold function, w(b, h, f_i) is a weight function, and the sums run over the I spectral peaks of the frame and the first N_h harmonics.
The invention provides a method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals: a constant-Q transform of the mixed audio signal yields a logarithmic-frequency magnitude spectrum; an adjacency matrix built from the bin relationships between the fundamental frequency and the harmonic components of the same musical source yields a graph structure; a graph convolutional network learns the complex input-output mapping, with the frequency corresponding to the maximum value among each frame's output nodes taken as that frame's preliminary melody pitch estimate; and a post-processing step constructs a salience spectrogram and fine-tunes the melody pitch estimates, so that the method achieves high accuracy and robustness.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and other drawings can be derived from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a time-domain waveform of a music signal according to the present invention;
FIG. 3 is the constant-Q transform magnitude spectrum of the music signal according to the present invention;
FIG. 4 is a schematic diagram of the adjacency matrix according to the present invention;
FIG. 5 shows the preliminary melody sequence estimate according to the present invention;
FIG. 6 is the post-processing salience spectrogram of the present invention;
FIG. 7 shows the final singing melody extraction result according to the present invention.
Detailed Description
To make the technical solution and advantages of the present invention clearer, the technical solution in the embodiments of the present invention is described below clearly and completely with reference to the accompanying drawings:
The singing melody extraction method provided by the invention is shown in FIG. 1. First, the constant-Q transform of the mixed audio signal is computed to obtain a logarithmic-frequency magnitude spectrum. Next, a graph structure is derived from the positional relationship between the frequency bins of the fundamental and the harmonic components of the same musical source. Then, the constant-Q magnitude spectrum is used as the input of a graph convolutional network, the melody pitch is converted into a one-hot vector used as the network output, the graph convolutional network learns the complex input-output mapping function, and the frequency corresponding to the maximum value among the output nodes in each frame is taken as the preliminary melody pitch estimate for that frame. Finally, a post-processing step constructs a salience spectrogram and fine-tunes the melody pitch estimates.
Examples:
Considering that melodies have typical harmonic structure, each frequency bin of the music signal spectrum is represented by a node and the harmonic relationships of musical tones are represented by edges, so that the internal connections among the harmonics of a given musical source can be represented by a graph structure, enabling graph-based singing melody extraction. To improve the smoothness of the melody pitch track and reduce quantization error, the invention fine-tunes the preliminary melody pitch estimates with a pitch salience function. The specific scheme comprises the following steps:
S1. Given a segment of audio signal whose time-domain waveform is shown in FIG. 2, a constant-Q transform with a frequency resolution of 12 bins per octave is applied to obtain a logarithmic-frequency magnitude spectrum, as shown in FIG. 3. The magnitude spectrum is cropped to the frequency range 47.65-8141.46 Hz, so that each frame of the audio signal has 90 frequency bins. For each frame, the magnitude spectra of 3 consecutive frames (the previous, current, and next frame) are spliced into a spliced magnitude spectrum of length 270, which serves as the i-th frame input feature X_i.
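As an illustration, the following Python sketch reproduces this step under stated assumptions: librosa is assumed to supply the constant-Q transform, and the sampling rate and hop length are illustrative choices; only the 12 bins/octave resolution, the 47.65 Hz lower bound, the 90 bins, and the 3-frame splicing come from the step above.

```python
# Sketch of step S1 (assumptions: librosa for the CQT; sr and hop_length are
# illustrative and not specified in the patent).
import numpy as np
import librosa

def cqt_features(y, sr=22050, hop_length=512):
    # Log-frequency magnitude spectrum: 90 bins at 12 bins/octave from 47.65 Hz.
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                           fmin=47.65, n_bins=90, bins_per_octave=12)).T
    # Splice previous, current, and next frame into a length-270 feature X_i.
    padded = np.pad(C, ((1, 1), (0, 0)), mode='edge')  # replicate edge frames
    return np.concatenate([padded[:-2], padded[1:-1], padded[2:]], axis=1)
```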
S2. An adjacency matrix corresponding to the 3-frame spliced magnitude spectrum is constructed, as shown in FIG. 4, with the specific calculation formula:
where n = 90, h = 1, …, 5, and i, j = 1, …, 270.
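The formula itself is reproduced only as an image in the original document. As a hedged illustration, the sketch below builds one plausible adjacency under the assumption that two bins of the same frame are connected when their log-frequency offset equals the interval between a fundamental and its h-th harmonic, i.e. round(12·log2 h) bins for h = 1, …, 5 (offsets 0, 12, 19, 24, 28 at 12 bins/octave); the patent's exact formula (1) may differ.

```python
# Hedged sketch of step S2: one plausible harmonic adjacency for the 270-node
# spliced spectrum (the patent's exact formula (1) appears only in its figures).
import math
import numpy as np

def build_adjacency(n_bins=90, n_frames=3, n_harmonics=5):
    # Bin offsets between a fundamental and its h-th harmonic at 12 bins/octave.
    offsets = sorted({int(round(12 * math.log2(h))) for h in range(1, n_harmonics + 1)})
    n = n_bins * n_frames                      # 270 nodes in total
    A = np.zeros((n, n))
    for f in range(n_frames):                  # harmonic links stay within a frame
        base = f * n_bins
        for i in range(n_bins):
            for d in offsets:
                if i + d < n_bins:
                    A[base + i, base + i + d] = 1
                    A[base + i + d, base + i] = 1  # undirected; d = 0 adds self-loops
    return A
```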
S3. Each frequency bin of the spliced magnitude spectrum serves as a node of the graph structure, and the edges, i.e. the connections between the nodes, are determined by the adjacency matrix defined by formula (1), so that every component of the musical tone signal is represented by the graph structure.
S4. The melody pitch frequency corresponding to the i-th frame signal is discretized at a resolution of 12 bins per octave to obtain a one-hot output label, which is used as the output of the graph convolutional network; this yields the output label Y_i corresponding to the i-th frame input feature X_i.
S5. Parameters are trained on the training set, with binary cross-entropy as the loss function, Adam as the optimizer, a learning rate of 0.001, 1000 training epochs, and a batch size of 256; iteration yields the optimal graph convolutional network parameters.
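For concreteness, a minimal PyTorch training sketch follows. The network depth, hidden widths, output dimension, and graph propagation rule are illustrative assumptions; the patent fixes only the loss, optimizer, learning rate, epoch count, and batch size.

```python
# Minimal training sketch for step S5 (architecture details are assumed).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, A_hat):
        super().__init__()
        self.A_hat = A_hat                      # (270, 270) normalized adjacency
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                       # x: (batch, 270, in_dim)
        return torch.relu(self.lin(self.A_hat @ x))

def make_model(A, n_classes=91):                # 90 pitch bins + a no-melody class (assumed)
    A_hat = A + torch.eye(A.size(0))            # add self-loops
    d = A_hat.sum(1).rsqrt()
    A_hat = d[:, None] * A_hat * d[None, :]     # symmetric degree normalization
    return nn.Sequential(
        GCNLayer(1, 16, A_hat),
        GCNLayer(16, 4, A_hat),
        nn.Flatten(),                           # (batch, 270 * 4)
        nn.Linear(270 * 4, n_classes),
        nn.Sigmoid(),                           # pairs with binary cross-entropy
    )

def train(model, loader, epochs=1000, lr=0.001):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for X, Y in loader:                     # X: (256, 270, 1), Y: one-hot (256, 91)
            opt.zero_grad()
            loss_fn(model(X), Y).backward()
            opt.step()
```

Under these assumptions the model would be created with make_model(torch.tensor(build_adjacency(), dtype=torch.float32)) and trained on the (X_i, Y_i) pairs from steps S1 and S4.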
S6. Melody pitch is predicted on the test set with the trained network parameters, and the frequency corresponding to the maximum value among the graph convolutional network output nodes is taken as the preliminary melody pitch estimate, as shown in FIG. 5.
S7. The preliminary melody pitch sequence produced by the graph convolutional network is median filtered (filter window width 7) to obtain a smooth melody pitch track.
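A sketch of steps S6 and S7 follows; the array names and shapes (including model and X_test from the previous sketch) are assumed for illustration.

```python
# Sketch of steps S6-S7 (assumed: probs has shape (n_frames, n_classes)).
import numpy as np
from scipy.signal import medfilt

probs = model(X_test).detach().numpy()          # network outputs on the test set
pitch_bins = probs.argmax(axis=1)               # preliminary melody pitch estimate (S6)
smooth_bins = medfilt(pitch_bins.astype(float), kernel_size=7)  # width-7 median (S7)
```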
S8. The audio signal is framed with 2048 samples per frame; each frame is zero-padded and an 8192-point short-time Fourier transform is applied, yielding the short-time Fourier magnitude spectrum.
S9. A phase vocoder is used to correct the instantaneous amplitude and instantaneous frequency of the short-time Fourier transform spectrum.
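The sketch below illustrates steps S8 and S9. The hop length is an assumption (the patent fixes only the 2048-sample frames and the 8192-point zero-padded FFT), and the instantaneous-frequency correction follows the standard phase-vocoder formulation rather than the patent's exact expressions.

```python
# Sketch of steps S8-S9 (hop length assumed; scipy supplies the STFT).
import numpy as np
from scipy.signal import stft

hop = 512
f, t, Z = stft(y, fs=sr, nperseg=2048, noverlap=2048 - hop, nfft=8192)
mag = np.abs(Z)                                     # short-time Fourier magnitude

# Instantaneous frequency from the frame-to-frame phase advance (phase vocoder).
dphi = np.diff(np.angle(Z), axis=1)
expected = 2 * np.pi * hop * f[:, None] / sr        # nominal advance per hop, per bin
dev = np.mod(dphi - expected + np.pi, 2 * np.pi) - np.pi  # wrap deviation to (-pi, pi]
inst_freq = f[:, None] + dev * sr / (2 * np.pi * hop)     # refined bin frequencies
```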
S10. A salience value is computed frame by frame according to the salience function, as shown in FIG. 6:

S(b) = Σ_{h=1}^{N_h} Σ_{i=1}^{I} Tr(a_i) · w(b, h, f_i) · a_i,

where a_i is the amplitude of the i-th spectral peak, f_i its frequency, Tr(a_i) is an amplitude threshold function, w(b, h, f_i) is a weight function, and the sums run over the I spectral peaks of the frame and the first N_h harmonics.
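Because Tr and w are given only in the patent's figures, the sketch below substitutes one common choice in the same spirit: Tr gates out peaks more than 40 dB below the frame maximum, and w applies a per-harmonic decay to salience bins within half a semitone of the implied fundamental. Both choices are assumptions.

```python
# Hedged sketch of a harmonic-summation salience for step S10 (Tr and w are
# assumed; the patent's exact definitions are not reproduced here).
import numpy as np

def salience(bins_hz, peak_freqs, peak_amps, n_harm=5, alpha=0.8, gamma_db=40.0):
    S = np.zeros(len(bins_hz))
    a_max = peak_amps.max() if len(peak_amps) else 0.0
    for f_i, a_i in zip(peak_freqs, peak_amps):
        if a_i <= 0 or (a_max and 20 * np.log10(a_i / a_max) < -gamma_db):
            continue                            # Tr(a_i): discard weak peaks
        for h in range(1, n_harm + 1):
            f0 = f_i / h                        # fundamental implied by harmonic h
            b = int(np.argmin(np.abs(bins_hz - f0)))
            if abs(12 * np.log2(bins_hz[b] / f0)) <= 0.5:  # within half a semitone
                S[b] += (alpha ** (h - 1)) * a_i  # w(b, h, f_i): harmonic decay
    return S
```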
S11. A band extending 1.5 semitones above and below the smooth melody pitch track is taken as the candidate range for the final singing melody output; the maximum salience value is searched within the candidate range, the frequency corresponding to it replaces each non-zero output of the graph convolutional network, and zero-valued outputs are left uncorrected. The result is shown in FIG. 7.
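A sketch of this refinement follows; the array shapes and helper names are assumed.

```python
# Sketch of step S11 (assumed: S_frames is (n_frames, n_bins) salience,
# smooth_f0 the smoothed pitch track in Hz, bins_hz the salience bin centers).
import numpy as np

def refine(smooth_f0, S_frames, bins_hz):
    out = smooth_f0.copy()
    for n, f0 in enumerate(smooth_f0):
        if f0 <= 0:
            continue                            # zero-valued outputs stay uncorrected
        band = np.abs(12 * np.log2(bins_hz / f0)) <= 1.5   # +/- 1.5 semitone band
        if band.any():
            idx = np.flatnonzero(band)
            out[n] = bins_hz[idx[S_frames[n, idx].argmax()]]
    return out
```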
Considering that the spacing between the fundamental and its harmonics in the logarithmic frequency domain is invariant to the fundamental frequency, the invention constructs a logarithmic-frequency-domain graph structure for singing melody extraction and learns the graph convolutional network parameters automatically in a data-driven manner, achieving singing melody extraction with a lightweight parameter set.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention, according to its technical solution and inventive concept, shall be covered by the scope of protection of the present invention.

Claims (1)

1. A method for extracting the singing melody from polyphonic music based on spectrogram modeling of musical tone signals, characterized by comprising:
performing a constant-Q transform on the audio signal to obtain a logarithmic-frequency magnitude spectrum, cropping the magnitude spectrum to a certain frequency range, and splicing the magnitude spectra of 3 consecutive frames, namely the previous, current, and next frame of the i-th frame, to obtain a spliced magnitude spectrum,
which is taken as the i-th frame input feature and denoted X_i;
constructing an adjacency matrix corresponding to the 3-frame spliced magnitude spectrum, with the specific calculation formula:
where n = 90, h = 1, …, 5, and i, j = 1, …, 270;
taking each frequency bin of the spliced magnitude spectrum as a node of the graph structure and determining the edges, i.e. the connections between the nodes, according to the adjacency matrix, so that every frequency component of the musical tone signal is represented by the graph structure;
discretizing the melody pitch frequency corresponding to the i-th frame signal to obtain a one-hot output label, which is taken as the output of the graph convolutional network, yielding the output label Y_i corresponding to the i-th frame input feature X_i;
training the graph convolutional network to obtain the optimal parameters;
performing melody pitch prediction on the test set with the trained network parameters, and taking the frequency corresponding to the maximum value among the output nodes of the graph convolutional network as the preliminary melody pitch estimate;
median filtering the preliminary melody pitch sequence produced by the graph convolutional network to obtain a smooth melody pitch track;
framing the audio signal, then zero-padding each frame and applying a short-time Fourier transform to obtain the short-time Fourier magnitude spectrum;
correcting the instantaneous amplitude and instantaneous frequency of the short-time Fourier magnitude spectrum with a phase vocoder;
computing a salience value frame by frame according to a salience function;
the salience function being

S(b) = Σ_{h=1}^{N_h} Σ_{i=1}^{I} Tr(a_i) · w(b, h, f_i) · a_i,

where a_i is the amplitude of the i-th spectral peak, f_i its frequency, Tr(a_i) is an amplitude threshold function, and w(b, h, f_i) is a weight function; and
taking a band extending 1.5 semitones above and below the smooth melody pitch track as the candidate range for the final singing melody output, searching for the maximum salience value within the candidate range, correcting each non-zero output of the graph convolutional network with the frequency corresponding to the maximum salience value, and leaving the zero-valued outputs of the graph convolutional network uncorrected.
CN202211120049.1A 2022-09-14 2022-09-14 A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals Active CN115579018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211120049.1A CN115579018B (en) 2022-09-14 2022-09-14 A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211120049.1A CN115579018B (en) 2022-09-14 2022-09-14 A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals

Publications (2)

Publication Number Publication Date
CN115579018A CN115579018A (en) 2023-01-06
CN115579018B (en) 2025-05-30

Family

ID=84580870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211120049.1A Active CN115579018B (en) 2022-09-14 2022-09-14 A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals

Country Status (1)

Country Link
CN (1) CN115579018B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119541540A (en) * 2024-11-14 2025-02-28 西北大学 A pitch estimation method based on squeeze-and-excitation blocks

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111786A (en) * 2021-04-15 2021-07-13 西安电子科技大学 Underwater target identification method based on small sample training image convolutional network
CN114158004A (en) * 2021-12-09 2022-03-08 重庆邮电大学 Indoor passive moving target detection method based on graph convolution neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3031225B1 (en) * 2014-12-31 2018-02-02 Audionamix IMPROVED SEPARATION METHOD AND COMPUTER PROGRAM PRODUCT
CN113593606B (en) * 2021-09-30 2022-02-15 清华大学 Audio recognition method and apparatus, computer equipment, computer readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111786A (en) * 2021-04-15 2021-07-13 西安电子科技大学 Underwater target identification method based on small sample training image convolutional network
CN114158004A (en) * 2021-12-09 2022-03-08 重庆邮电大学 Indoor passive moving target detection method based on graph convolution neural network

Also Published As

Publication number Publication date
CN115579018A (en) 2023-01-06

Similar Documents

Publication Publication Date Title
CN111128213A (en) Noise suppression method and system for processing in different frequency bands
Fernandez et al. Classical and novel discriminant features for affect recognition from speech.
Durrieu et al. Singer melody extraction in polyphonic signals using source separation methods
CN108417228A (en) Human voice timbre similarity measurement method under musical instrument timbre transfer
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN110310621A (en) Singing synthesis method, device, equipment and computer-readable storage medium
CN114302301B (en) Frequency response correction method and related product
Shi et al. Music genre classification based on chroma features and deep learning
CN109979488A (en) Voice based on stress analysis turns music notation system
CN105845149A (en) Predominant pitch acquisition method in acoustical signal and system thereof
Sebastian et al. Group delay based music source separation using deep recurrent neural networks
CN118298845B (en) Training method, training device, training medium and training equipment for pitch recognition model of complex tone audio
CN115579018B (en) A method for extracting melody from polyphonic music based on spectrum modeling of musical sound signals
CN114627892A (en) A method for extracting the main melody of multi-voice music based on deep learning
Rajan et al. Group delay based melody monopitch extraction from music
CN114299918A (en) Acoustic model training and speech synthesis method, device and system, and storage medium
CN119864047B (en) Audio separation method, system and related device
CN110675845A (en) Human voice humming accurate recognition algorithm and digital notation method
JP4217616B2 (en) Two-stage pitch judgment method and apparatus
Kawahara et al. Higher order waveform symmetry measure and its application to periodicity detectors for speech and singing with fine temporal resolution
CN109697985B (en) Voice signal processing method and device and terminal
Vincent et al. Single-channel mixture decomposition using Bayesian harmonic models
Gong et al. Monaural musical octave sound separation using relaxed extended common amplitude modulation
Gu et al. A discrete-cepstrum based spectrum-envelope estimation scheme and its example application of voice transformation
Voinov et al. Implementation and Analysis of Algorithms for Pitch Estimation in Musical Fragments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant