
CN113205831A - Method for extracting pitch and duration values in musical instrument sound based on data set - Google Patents


Info

Publication number
CN113205831A
Authority
CN
China
Prior art keywords
pitch
sound
frame
extracting
playing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110634335.9A
Other languages
Chinese (zh)
Inventor
李惠子
曹琛
冯亚星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Average Law Technology Co ltd
Original Assignee
Shenzhen Average Law Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Average Law Technology Co ltd filed Critical Shenzhen Average Law Technology Co ltd
Priority to CN202110634335.9A
Publication of CN113205831A
Current legal status: Withdrawn


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 — Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/90 — Pitch determination of speech signals
    • G10L2025/906 — Pitch tracking

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention discloses a method for extracting pitch and duration values from musical instrument sounds based on a data set, which mainly addresses the problem that traditional methods perform well in monophonic recognition but relatively poorly in polyphonic recognition. The method includes registering digital music scores in a digital score library; having performers play the scores and registering the resulting digital sound signals in a digital sound library; matching category label sequences with sound features to construct a sound feature-music element data set; and a method and system for extracting pitch and duration values from instrument sounds based on this data set. Through this scheme, the invention extracts duration and pitch information from the digital sound signal produced by an instrument performance using a supervised machine learning model, extracts musical elements from polyphonic sound more effectively, and has high practical and promotional value.

Description

Method for extracting pitch and duration values in musical instrument sound based on data set
Technical Field
The present invention relates to the field of computer technology, and more particularly, to a method, system and apparatus for extracting pitch and duration information from the sound of musical instruments.
Background
Pitch and duration are two fundamental and important elements of music, and performance analysis or automatic transcription of any musical instrument must be based on them. The traditional approach applies digital signal processing to music analysis, mainly using hand-crafted rules to make logical judgments and extract the relevant musical elements.
For example, pitch analysis typically relies on fundamental frequency estimation. The traditional approach performs well in the subfield of monophonic recognition but is relatively weak in polyphonic recognition, and cannot extract pitch and duration from the sound well.
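For illustration only, a minimal sketch of this kind of conventional fundamental-frequency estimation, using the open-source librosa library; the file name and parameter values are assumptions and are not part of the invention:

```python
# Illustrative sketch of conventional fundamental-frequency (F0) estimation,
# the kind of manual-rule pipeline referred to above.
# Assumes librosa is installed; "performance.wav" is a placeholder file name.
import librosa

y, sr = librosa.load("performance.wav", sr=None, mono=True)

# pYIN estimates one F0 per frame; it works reasonably for monophonic audio
# but cannot separate simultaneous notes, which is the weakness noted above.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
    sr=sr,
)

for frame_idx, (hz, voiced) in enumerate(zip(f0, voiced_flag)):
    if voiced:
        print(frame_idx, librosa.hz_to_note(hz))
```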
Disclosure of Invention
The invention aims to provide a method for extracting pitch and duration values from musical instrument sounds based on a data set, so as to solve the problem that traditional methods perform well in monophonic recognition but relatively poorly in polyphonic recognition.
In order to solve the above problems, the present invention provides the following technical solutions:
a method of constructing a musical instrument music analysis data set includes the steps of:
(A1) selecting a digital music score according to the acoustic characteristics, the playing skill requirement and the playing skill level requirement of the musical instrument, and registering the digital music score into a digital music score library;
(A2) having a performer with the corresponding performance skill level play the music score to generate a digital sound signal, and registering the digital sound signal in a digital sound library;
(A3) correspondingly matching the digital music score in the step (A1) with the digital sound signal in the step (A2) to form a music score-sound data set;
(A4) performing feature extraction on the performance sound in the score-sound data set of step (A3);
(A5) extracting music element information from the music score file in extensible markup (XML) format in the score-sound data set of step (A3), and generating a category label sequence according to the classification label rule;
(A6) matching the category label sequence with the sound features according to the data initialization rule and the matching rule to form a sound feature-music element data set.
Specifically, the features in step (A4) include amplitude-related features, frequency-related features, distribution features of the sound spectral energy in the time dimension, and distribution features of the sound spectral energy in the frequency dimension; the music elements in step (A5) include pitch, duration, beat and intensity; the classification label rule converts pitch, duration, beat and intensity into Boolean or numerical vectors; the initialization rule in step (A6) determines an initial frame index, and the matching rule matches the label frame index with the sound feature index.
Specifically, the amplitude-related features include the per-frame maximum, mean and higher-order derivatives of the amplitude; the frequency-related features include the per-frame maximum, mean and higher-order derivatives of the frequency; the distribution features of the sound spectral energy in the time dimension include its maximum, mean and higher-order derivatives in the time dimension; and the distribution features of the sound spectral energy in the frequency dimension include its maximum, mean and higher-order derivatives in the frequency dimension.
Specifically, the category label sequence in step (A5) includes a triggered pitch sequence label generated by extracting pitch sequence information from the score file, and a trigger frame sequence label generated by extracting duration sequence information from the score file and converting it based on the performance start time and performance speed.
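As an illustration of the classification label rule described above, a minimal sketch that turns a note list taken from a score into per-frame Boolean label vectors; the frame rate, tempo handling, array layout and variable names are assumptions for the example, not the patent's exact rule:

```python
# Minimal sketch: a note list from a score (onset, duration, pitch) is converted
# into Boolean vectors, one trigger-frame label per analysis frame and one
# triggered-pitch label per semitone band per frame. All values are illustrative.
import numpy as np

FRAME_RATE = 100          # analysis frames per second (assumed)
N_SEMITONE_BANDS = 128    # matches the 128 semitone bands mentioned later

# (onset in beats, duration in beats, MIDI pitch) extracted from the score file
score_notes = [(0.0, 1.0, 60), (1.0, 0.5, 64), (1.5, 0.5, 67)]

def make_labels(notes, start_time_s, tempo_bpm, n_frames):
    sec_per_beat = 60.0 / tempo_bpm
    trigger_frame = np.zeros(n_frames, dtype=bool)
    trigger_pitch = np.zeros((n_frames, N_SEMITONE_BANDS), dtype=bool)
    for onset_beats, dur_beats, midi in notes:
        onset_s = start_time_s + onset_beats * sec_per_beat
        start_f = int(round(onset_s * FRAME_RATE))      # frame where the note is triggered
        end_f = int(round((onset_s + dur_beats * sec_per_beat) * FRAME_RATE))
        if start_f < n_frames:
            trigger_frame[start_f] = True                           # trigger frame = True
            trigger_pitch[start_f:min(end_f, n_frames), midi] = True  # triggered pitch band = True
    return trigger_frame, trigger_pitch

frames, pitches = make_labels(score_notes, start_time_s=0.5, tempo_bpm=90, n_frames=400)
```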
A method of extracting pitch and duration values from musical instrument sound based on a data set comprises the following steps:
(B1) receiving the sound of a player, and sequentially carrying out noise reduction and head and tail silent section removal processing;
(B2) extracting a plurality of kinds of characteristic information in the sound processed in the step (B1);
(B3) inputting the plurality of kinds of feature information extracted in the step (B2) into a pre-trained playing trigger frame classifier, and classifying the playing trigger frames in the sound signals in the step (B1); the trigger frame is marked as true, and the non-trigger frame is marked as false;
(B4) extracting the frame index sequence with the true prediction type in the step (B3), correcting and converting the systematic deviation which does not conform to the playing rule, and outputting a time value sequence;
(B5) inputting the plurality of kinds of feature information extracted in the step (B2) into a pre-trained playing triggered pitch classifier, and classifying the triggered pitch in the sound signal in the step (B1); the triggered pitch is true and the non-triggered pitch is false;
(B6) extracting the pitch index predicted as the true category in step (B5), correcting the systematic deviation that does not conform to the playing rules, and outputting pitch sequence information.
Specifically, the characteristic information in the step (B2) includes an amplitude-related characteristic, a frequency-related characteristic, a distribution characteristic of the spectral energy of the sound in a time dimension, and a distribution characteristic of the spectral energy of the sound in a frequency dimension.
Specifically, the process of pre-training the playing trigger frame classifier in step (B3) is as follows:
(B31) converting the sound signal in the data set into a plurality of frames;
(B32) extracting amplitude-related features, frequency-related features, distribution features of sound spectrum energy in a time dimension, and distribution features of sound spectrum energy in a frequency dimension from the sound signal in (B31) frame by frame;
(B33) marking the frames of step (B32) as playing trigger frames or non-playing trigger frames via the playing trigger frame automatic category labeling module, wherein a playing trigger frame is true and a non-playing trigger frame is false;
(B34) performing artificial feature filter filtering on each frame in the step (B32): selecting a filtering threshold range of amplitude and energy by calculating the distribution of the observation samples, wherein the frame types exceeding the threshold range are non-performance frames;
(B35) taking the samples within the threshold range after the processing of step (B33) as training samples to train a binary classifier.
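A minimal sketch of steps (B31)-(B35) under stated assumptions: per-frame features and automatic labels are already available as NumPy arrays, the amplitude/energy threshold range is taken from percentiles of the observed distribution, and a random forest stands in for the unspecified binary classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_trigger_frame_classifier(features, labels, amplitude, energy):
    # (B34) artificial feature filter: keep frames whose amplitude/energy fall
    # inside a threshold range derived from the observed sample distribution.
    amp_lo, amp_hi = np.percentile(amplitude, [1, 99])
    en_lo, en_hi = np.percentile(energy, [1, 99])
    keep = (amplitude >= amp_lo) & (amplitude <= amp_hi) & \
           (energy >= en_lo) & (energy <= en_hi)
    # frames outside the range are treated as non-performance frames and dropped
    X, y = features[keep], labels[keep]
    # (B35) train a binary classifier on the remaining samples
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)
    return clf, (amp_lo, amp_hi, en_lo, en_hi)
```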
Specifically, the pre-training process of the triggered pitch classifier in step (B5) is as follows:
(B51) converting the sound signal in the data set into a plurality of frames;
(B52) extracting, frame by frame, a distribution feature of spectral energy in a frequency dimension for the sound signal in (B51);
(B53) marking the frames of step (B52) as triggered pitch or non-triggered pitch via the playing-triggered-pitch automatic category labeling module, wherein a triggered pitch is true and a non-triggered pitch is false;
(B54) training a one-class classifier using the samples whose category is false in step (B53) as training samples, and detecting outliers;
(B55) merging the samples predicted to be false in step (B54) with the samples whose category is true in step (B53) as training samples to train a binary classifier.
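A minimal sketch of steps (B51)-(B55); a one-class SVM and a random forest are illustrative stand-ins, since the patent does not name specific models:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier

def train_triggered_pitch_classifier(band_features, band_labels):
    # band_labels is a Boolean array: True = triggered pitch band, False = non-triggered
    X_false = band_features[~band_labels]
    X_true = band_features[band_labels]

    # (B54) one-class classifier trained on the false class, used to flag outliers
    occ = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_false)
    inliers = occ.predict(X_false) == 1       # +1 = still predicted "false" (inlier)

    # (B55) merge the retained false samples with the true samples, train a binary classifier
    X = np.vstack([X_false[inliers], X_true])
    y = np.concatenate([np.zeros(int(inliers.sum())), np.ones(len(X_true))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)
    return occ, clf
```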
A system for extracting pitch and duration values from musical instrument sounds based on a data set includes an audio receiver, an audio processor and a feature extractor connected in series, and a time value extractor and a pitch extractor correspondingly connected with the automatic category labeling modules;
an audio receiver for converting physical sound into a digital sound signal;
the audio processor is used for carrying out noise reduction and head-to-tail silence removal on the input digital sound signals;
the characteristic extractor is used for extracting the relevant characteristics of the processed digital sound signals;
a time value extractor for extracting time value information of the prediction data;
and a pitch extractor for extracting pitch information of the prediction data.
Specifically, the audio processor includes a noise reduction module and a silence processing module sequentially connected to the audio receiver.
Specifically, the feature extractor comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module, each connected to the silence processing module; the first feature extraction module extracts sound amplitude-related features; the second extracts sound frequency-related features; the third extracts the distribution features of the sound spectral energy in the time dimension; and the fourth extracts the distribution features of the sound spectral energy in the frequency dimension.
Specifically, the third feature extraction module and the fourth feature extraction module are both linear semitone-spaced infinite impulse response (IIR) filter banks, which filter the sound signal to obtain sound spectral energy information.
Specifically, the playing trigger frame automatic category labeling module is respectively connected with the first feature extraction module, the second feature extraction module and the third feature extraction module; and the playing triggered pitch automatic category labeling module is connected with the fourth feature extraction module.
Specifically, the time value extractor comprises a playing trigger frame classifier, a deviation correction module and a time value conversion module connected in sequence; the playing trigger frame classifier comprises a playing trigger frame automatic category labeling module, an artificial feature filter and a binary classifier connected in sequence; the playing trigger frame automatic category labeling module is connected to the first, second and third feature extraction modules and to the classifier.
Specifically, the pitch extractor includes a playing-triggered-pitch classifier and a deviation correction module connected to each other; the playing-triggered-pitch classifier comprises a playing-triggered-pitch automatic category labeling module, a one-class classifier and a binary classifier connected in sequence to the fourth feature extraction module.
A computer device comprises a processor and a memory, the memory storing a computer program which, when loaded and executed by the processor, implements the construction method or the extraction method.
A computer-readable medium has stored therein a computer program that is loaded and executed by a processor to implement a construction method or an extraction method.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method constructs a music score-sound data set and a sound feature-music element data set. After the extracted features are automatically labeled by the playing trigger frame automatic category labeling module, they are filtered by an artificial feature filter, and the samples within the threshold range are used as training samples to train a binary classifier; after the extracted features are automatically labeled by the playing-triggered-pitch automatic category labeling module, the samples whose category is false are used to train a one-class classifier, and the retained false samples are combined with the true samples as training samples to train a binary classifier. Through this process, the availability of data for applying supervised machine learning techniques is improved and the data cost is reduced.
(2) On the basis of a reasonably constructed training data set, machine learning replaces the hand-crafted rules of the traditional method, and the approach performs well in both monophonic and polyphonic recognition.
(3) The invention can extract the pitch and duration of any musical instrument, providing a standard for instrument performances used as learning and examination materials.
(4) The invention can quickly and efficiently extract pitch and duration information of the playing sound of any musical instrument, and provides data input support for music analysis or automatic music transcription.
(5) The invention can also be applied to extracting the pitch and duration of singing voice, facilitating analysis of singing performance and the like.
Drawings
Fig. 1 is a block diagram showing the structure of an extraction system.
Fig. 2 is a block diagram showing the structure of a performance trigger frame classifier.
Fig. 3 is a block diagram showing the structure of a performance triggered pitch classifier.
Detailed Description
The present invention is further illustrated below with reference to the figures and embodiments; the invention includes, but is not limited to, the following embodiments.
A method of constructing a musical instrument music analysis data set includes the steps of:
(A1) selecting a digital music score according to the acoustic characteristics, the playing skill requirement and the playing skill level requirement of the musical instrument, and registering the digital music score into a digital music score library;
(A2) having a performer with the corresponding performance skill level play the music score to generate a digital sound signal, and registering the digital sound signal in a digital sound library;
(A3) correspondingly matching the digital music score in the step (A1) with the digital sound signal in the step (A2) to form a music score-sound data set;
(A4) performing feature extraction on the performance sound in the score-sound data set of step (A3);
(A5) extracting music element information from the music score file in extensible markup (XML) format in the score-sound data set of step (A3), and generating a category label sequence according to the classification label rule;
(A6) matching the category label sequence with the sound features according to the data initialization rule and the matching rule to form a sound feature-music element data set.
Wherein, the features in step (A4) include amplitude-related features, frequency-related features, distribution features of the sound spectral energy in the time dimension, and distribution features of the sound spectral energy in the frequency dimension; the music elements in step (A5) include pitch, duration, beat and intensity; the classification label rule converts pitch, duration, beat and intensity into Boolean or numerical vectors; the initialization rule in step (A6) determines an initial frame index, and the matching rule matches the label frame index with the sound feature index; other music elements may further be included.
The amplitude-related features include the per-frame maximum, mean and higher-order derivatives of the amplitude; the frequency-related features include the per-frame maximum, mean and higher-order derivatives of the frequency; the distribution features of the sound spectral energy in the time dimension include its maximum, mean and higher-order derivatives in the time dimension; and the distribution features of the sound spectral energy in the frequency dimension include its maximum, mean and higher-order derivatives in the frequency dimension.
In a preferred embodiment of the present invention, the category label sequence in step (A5) includes a triggered pitch sequence label generated by extracting pitch sequence information from the music score file, and a trigger frame sequence label generated by extracting duration sequence information from the music score file and converting it based on the performance start time and performance speed; other sequence labels, such as beat sequence labels, may also be included.
The pitch data set and the duration data set of musical instruments can be constructed through the construction method, and data sets of other music elements can also be constructed.
A sound feature-music element data set is constructed by the above construction method.
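A minimal sketch of the initialization and matching rule of step (A6), assuming the initial frame index is found from an energy threshold and labels are aligned to feature frames by that offset; both the threshold rule and the names are assumptions for illustration:

```python
import numpy as np

def match_labels_to_features(frame_features, frame_energy, trigger_labels, energy_threshold):
    # initialization rule: the first frame whose energy exceeds the threshold is
    # taken as the initial frame index (frame 0 of the score)
    initial = int(np.argmax(frame_energy > energy_threshold))
    # matching rule: label frame index i is matched to feature frame index initial + i
    n = min(len(trigger_labels), len(frame_features) - initial)
    X = frame_features[initial:initial + n]
    y = trigger_labels[:n]
    return X, y   # one (feature vector, label) pair per aligned frame
```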
A method of extracting pitch and duration values from musical instrument sound based on a data set comprises the following steps:
(B1) receiving the sound of a player, and sequentially carrying out noise reduction and head and tail silent section removal processing;
(B2) extracting a plurality of kinds of characteristic information in the sound processed in the step (B1);
(B3) inputting the plurality of kinds of feature information extracted in the step (B2) into a pre-trained playing trigger frame classifier, and classifying the playing trigger frames in the sound signals in the step (B1); the trigger frame is marked as true, and the non-trigger frame is marked as false;
(B4) extracting the frame index sequence with the true prediction type in the step (B3), correcting and converting the systematic deviation which does not conform to the playing rule, and outputting a time value sequence;
(B5) inputting the plurality of kinds of feature information extracted in the step (B2) into a pre-trained playing triggered pitch classifier, and classifying the triggered pitch in the sound signal in the step (B1); the triggered pitch is true and the non-triggered pitch is false;
(B6) extracting the pitch index predicted as the true category in step (B5), correcting the systematic deviation that does not conform to the playing rules, and outputting pitch sequence information.
Wherein, the characteristic information in the step (B2) includes an amplitude-related characteristic, a frequency-related characteristic, a distribution characteristic of the sound spectrum energy in a time dimension, and a distribution characteristic of the sound spectrum energy in a frequency dimension.
In a preferred embodiment of the present invention, the pre-training of the performance trigger frame classifier in step (B3) includes the following steps:
(B31) converting the sound signal in the data set into a plurality of frames;
(B32) extracting amplitude-related features, frequency-related features, distribution features of the sound spectral energy in the time dimension, and distribution features of the sound spectral energy in the frequency dimension from the sound signal in step (B31), frame by frame;
(B33) marking the frames of step (B32) as playing trigger frames or non-playing trigger frames via the playing trigger frame automatic category labeling module, wherein a playing trigger frame is true and a non-playing trigger frame is false;
(B34) performing artificial feature filter filtering on each frame in the step (B32): selecting a filtering threshold range of amplitude and energy by calculating the distribution of the observation samples, wherein the frame types exceeding the threshold range are non-performance frames;
(B35) taking the samples within the threshold range after the processing of step (B33) as training samples to train a binary classifier.
In a preferred embodiment of the present invention, the pre-training process of triggering pitch classification in step (B5) is as follows:
(B51) converting the sound signal in the data set into a plurality of frames;
(B52) extracting, frame by frame, a distribution feature of spectral energy in a frequency dimension for the sound signal in (B51);
(B53) marking the frames of step (B52) as triggered pitch or non-triggered pitch via the playing-triggered-pitch automatic category labeling module, wherein a triggered pitch is true and a non-triggered pitch is false;
(B54) training a one-class classifier using the samples whose category is false in step (B53) as training samples, and detecting outliers;
(B55) merging the samples predicted to be false in step (B54) with the samples whose category is true in step (B53) as training samples to train a binary classifier.
A system for extracting pitch and duration values from musical instrument sounds based on a data set includes an audio receiver, an audio processor and a feature extractor connected in series, and a time value extractor and a pitch extractor correspondingly connected with the automatic category labeling modules;
an audio receiver for converting physical sound into a digital sound signal;
the audio processor is used for carrying out noise reduction and head-to-tail silence removal on the input digital sound signals;
the characteristic extractor is used for extracting the relevant characteristics of the processed digital sound signals;
a time value extractor for extracting time value information of the prediction data;
and a pitch extractor for extracting pitch information of the prediction data.
In a preferred embodiment of the present invention, the audio processor comprises a noise reduction module and a silence processing module sequentially connected to the audio receiver.
In a preferred embodiment of the present invention, the feature extractor includes a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module, each connected to the silence processing module; the first feature extraction module extracts sound amplitude-related features; the second extracts sound frequency-related features; the third extracts the distribution features of the sound spectral energy in the time dimension; and the fourth extracts the distribution features of the sound spectral energy in the frequency dimension.
In a preferred embodiment of the present invention, the third feature extraction module and the fourth feature extraction module are both linear semitone-spaced infinite impulse response (IIR) filter banks, which filter the sound signal to obtain sound spectral energy information.
In a preferred embodiment of the invention, the playing trigger frame automatic category labeling module is connected to the first, second and third feature extraction modules respectively; the playing-triggered-pitch automatic category labeling module is connected to the fourth feature extraction module; and the playing trigger frame automatic category labeling module labels the corresponding feature sequences according to the trigger frame sequence labels in the sound feature-music element data set.
In a preferred embodiment of the present invention, the time value extractor comprises a playing trigger frame classifier, a deviation correction module and a time value conversion module connected in sequence; the playing trigger frame classifier comprises a playing trigger frame automatic category labeling module, an artificial feature filter and a binary classifier connected in sequence; the playing trigger frame automatic category labeling module is connected to the first, second and third feature extraction modules and to the classifier.
In a preferred embodiment of the present invention, the pitch extractor comprises a playing-triggered-pitch classifier and a deviation correction module connected to each other; the playing-triggered-pitch classifier comprises a playing-triggered-pitch automatic category labeling module, a one-class classifier and a binary classifier connected in sequence to the fourth feature extraction module; and the playing-triggered-pitch classifier labels the corresponding feature sequences according to the triggered pitch sequence labels in the sound feature-music element data set.
A computer device comprises a processor and a memory, the memory storing a computer program which, when loaded and executed by the processor, implements the construction method or the extraction method.
A computer-readable medium has stored therein a computer program that is loaded and executed by a processor to implement a construction method or an extraction method.
Embodiments of the present invention relate to extracting duration information and pitch information from a digital sound signal generated by a musical instrument performance, using a supervised machine learning model.
On the basis of a reasonably constructed training data set, machine learning replaces the hand-crafted rules of the traditional method, and the approach performs well in both monophonic and polyphonic recognition.
As shown in figs. 1 to 3, the extraction process of the system for extracting pitch and duration values from musical instrument sounds based on a data set, and the purpose of each step, are as follows:
(IP01) A player plays a musical instrument.
M10, converting the physical sound signal into a digital sound signal according to a preset sampling rate and a digital sound format by an audio receiver to obtain a digital sound signal sequence.
M20, preprocessing the digital sound signal sequence obtained in the previous step.
M21, intercepting a sound signal sequence with a proper time length from the beginning of the digital sound sequence as an environment noise sample, and performing noise reduction processing on the whole digital sound signal sequence.
M22, defining signal segments whose amplitude is below a threshold at the beginning and end of the noise-reduced digital sound signal sequence as silence, and cutting off the digital sound signal at the corresponding sequence positions.
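A minimal sketch of the pre-processing in M21-M22, using librosa; the simple spectral-subtraction denoiser and the top_db trimming threshold are assumptions standing in for whatever noise-reduction and silence rules the system actually uses:

```python
import numpy as np
import librosa

def preprocess(y, sr, noise_seconds=0.5, top_db=40):
    # M21: estimate an average noise magnitude spectrum from the opening segment
    noise = y[: int(noise_seconds * sr)]
    S = librosa.stft(y)
    N = np.mean(np.abs(librosa.stft(noise)), axis=1, keepdims=True)
    S_clean = np.maximum(np.abs(S) - N, 0.0) * np.exp(1j * np.angle(S))
    y_denoised = librosa.istft(S_clean, length=len(y))
    # M22: drop leading/trailing segments whose level is more than top_db below the peak
    y_trimmed, _ = librosa.effects.trim(y_denoised, top_db=top_db)
    return y_trimmed
```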
M30, framing the digital sound signal subjected to silence removal according to a preset window size, a preset sliding window size and a preset window alignment mode to generate a digital sound signal of a frame sequence, and performing feature extraction.
M31 the digital sound signal of the sequence of frames is passed through a first feature extractor to extract amplitude-related features including, but not limited to, amplitude maxima, averages, higher order derivatives, etc. per frame.
M32, the digital sound signal of the frame sequence is passed through a second feature extractor to extract frequency-related features including, but not limited to, the per-frame maximum, mean, higher-order derivatives of the zero-crossing rate, etc.
M33, the digital sound signal of the frame sequence is passed through a third feature extractor to extract the distribution features of the sound spectral energy in the time dimension, including but not limited to the per-frame maximum, mean, higher-order derivatives of the energy, etc. It should be noted that, in this embodiment, the distribution features of the sound spectral energy in the time dimension are extracted based on infinite impulse response (IIR) filtering, that is, the digital sound signal of the frame sequence is passed through a linear semitone IIR filter bank to generate the energy values of each frame of the sound signal in 128 semitone frequency bands.
M34, the digital sound signal of the frame sequence is passed through a fourth feature extractor to extract the distribution features of the sound spectral energy in the frequency dimension, including but not limited to the maximum, mean, higher-order derivatives, etc. of the energy in each frequency band. Similarly, this embodiment uses the linear semitone IIR filter bank to extract the distribution features of the sound spectral energy in the frequency dimension.
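A minimal sketch of M30-M34 under stated assumptions: librosa framing, simple per-frame amplitude statistics standing in for the first two extractors, and a bank of second-order Butterworth band-pass IIR filters centred on semitone frequencies standing in for the linear semitone IIR filter bank (filter order, bandwidth and band range are illustrative):

```python
import numpy as np
import librosa
from scipy.signal import butter, sosfilt

def frame_signal(y, frame_length=2048, hop_length=512):
    # M30: each column of the result is one analysis frame
    return librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)

def frame_stats(frames):
    # M31/M32-style statistics: per-frame maximum and mean of the magnitude,
    # plus a first-order difference as a simple stand-in for higher-order derivatives
    mag = np.abs(frames)
    feats = np.stack([mag.max(axis=0), mag.mean(axis=0)], axis=1)
    deriv = np.vstack([np.zeros((1, feats.shape[1])), np.diff(feats, axis=0)])
    return np.hstack([feats, deriv])

def semitone_band_energies(y, sr, midi_lo=0, midi_hi=128, hop_length=512):
    # M33/M34: per-frame energy of the signal in each semitone band
    n_frames = 1 + len(y) // hop_length
    energies = np.zeros((midi_hi - midi_lo, n_frames))
    for i, midi in enumerate(range(midi_lo, midi_hi)):
        f_center = 440.0 * 2.0 ** ((midi - 69) / 12.0)
        lo, hi = f_center * 2 ** (-1 / 24), f_center * 2 ** (1 / 24)
        if hi >= sr / 2:
            break
        sos = butter(2, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfilt(sos, y)
        for t in range(n_frames):
            seg = band[t * hop_length:(t + 1) * hop_length]
            energies[i, t] = float(np.sum(seg ** 2))
    return energies
```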
The features extracted by the first, second and third feature extractors undergo automatic class labeling for the playing trigger frame: the duration information of each performed note is extracted from the score file in extensible markup (XML) format, converted into a trigger frame sequence for each performed note according to the performance start time and performance speed, and the trigger frames in all time frame sequences are marked as true while non-trigger frames are marked as false.
A filtering threshold range for amplitude and energy is selected by calculating the observed sample distribution; frames beyond the threshold range are classed as non-performance frames.
(M415) Samples within the threshold range are used as training samples to train a binary classifier.
(M42) The prediction sample data is input into the playing trigger frame classifier, the frame index sequence predicted as true is extracted, systematic deviations that do not conform to the playing rules are corrected and converted, and a time value sequence is output.
M43, the corrected frame index sequence is converted into a duration sequence, and duration sequence information is output.
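A minimal sketch of M42-M43: predicted trigger-frame indices are cleaned with a simple minimum-gap rule (an assumed stand-in for the deviation correction) and converted to onset times and a duration sequence in seconds:

```python
import numpy as np

def frames_to_durations(trigger_frame_idx, hop_length, sr, min_gap_frames=3):
    idx = np.sort(np.asarray(trigger_frame_idx))
    if len(idx) == 0:
        return np.array([]), np.array([])
    # merge spurious triggers that are implausibly close together
    kept = [idx[0]]
    for f in idx[1:]:
        if f - kept[-1] >= min_gap_frames:
            kept.append(f)
    onsets_s = np.array(kept) * hop_length / sr
    durations_s = np.diff(onsets_s)          # M43: inter-onset intervals as durations
    return onsets_s, durations_s
```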
The features extracted by the fourth feature extractor are labeled by the playing-triggered-pitch automatic category labeling module: the note name information of each performed tone is extracted from the score file in extensible markup (XML) format, converted into a triggered pitch sequence for each performed tone according to the relation between note names and pitch frequency bands, and the triggered pitch bands in each tone's band sequence are marked as true while non-triggered pitch bands are marked as false.
(M513) A one-class classifier is trained using the non-triggered pitch band samples marked false as training samples, and outliers are detected.
(M514) The non-triggered pitch band samples still predicted as false are combined with the triggered pitch band samples marked true as training samples to train a binary classifier.
M52 inputs the prediction sample data into a performance triggered pitch classifier, extracts a pitch index predicted as true in the trigger classifier, corrects systematic deviations that do not conform to the performance rules, and outputs pitch sequence information.
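A minimal sketch of the last part of M52, mapping predicted semitone-band indices to note names with librosa; treating the band index directly as a MIDI note number is an assumption:

```python
import librosa

def bands_to_note_names(predicted_band_idx):
    # each predicted "true" band index is interpreted as a MIDI note number
    return [librosa.midi_to_note(int(b)) for b in predicted_band_idx]

# e.g. bands_to_note_names([60, 64, 67]) -> ['C4', 'E4', 'G4']
```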
The invention is well implemented in accordance with the above-described embodiments. It should be noted that, based on the above structural design, even if insubstantial modifications or embellishments are made to the present invention to solve the same technical problems, the technical solution adopted is in essence the same as that of the present invention and therefore also falls within the protection scope of the present invention.

Claims (6)

1. A method of extracting pitch and duration values from musical instrument sound based on a data set, comprising the steps of:
(B1) receiving the sound of a player, and sequentially carrying out noise reduction and head and tail silent section removal processing;
(B2) extracting a plurality of kinds of characteristic information in the sound processed in the step (B1);
(B3) inputting the plurality of kinds of feature information extracted in the step (B2) into a pre-trained playing trigger frame classifier, and classifying the playing trigger frames in the sound signals in the step (B1); the trigger frame is marked as true, and the non-trigger frame is marked as false;
(B4) extracting the frame index sequence with the true prediction type in the step (B3), correcting and converting the systematic deviation which does not conform to the playing rule, and outputting a time value sequence;
(B5) inputting the plurality of kinds of feature information extracted in the step (B2) into a pre-trained playing triggered pitch classifier, and classifying the triggered pitch in the sound signal in the step (B1); the triggered pitch is true and the non-triggered pitch is false;
(B6) extracting the pitch index predicted as the true category in step (B5), correcting the systematic deviation that does not conform to the playing rules, and outputting pitch sequence information.
2. The extraction method according to claim 1, wherein the feature information in step (B2) includes an amplitude-related feature, a frequency-related feature, a distribution feature of the spectral energy of the sound in a time dimension, and a distribution feature of the spectral energy of the sound in a frequency dimension.
3. The extraction method according to claim 1, wherein the pre-training of the performance trigger frame classifier in step (B3) is performed by:
(B31) converting the sound signal in the data set into a plurality of frames;
(B32) extracting amplitude-related features, frequency-related features and distribution features of sound spectrum energy in a time dimension from the sound signal in the step (B31) frame by frame;
(B33) marking the frames of step (B32) as playing trigger frames or non-playing trigger frames via the playing trigger frame automatic category labeling module, wherein a playing trigger frame is true and a non-playing trigger frame is false;
(B34) performing artificial feature filter filtering on each frame in the step (B32): selecting a filtering threshold range of amplitude and energy by calculating the distribution of the observation samples, wherein the frame types exceeding the threshold range are non-performance frames;
(B35) taking the samples within the threshold range after the processing of step (B33) as training samples to train a binary classifier.
4. The extraction method according to claim 1, wherein the pre-training process of the triggered pitch classifier in step (B5) is as follows:
(B51) converting the sound signal in the data set into a plurality of frames;
(B52) extracting, frame by frame, a distribution feature of spectral energy in a frequency dimension for the sound signal in (B51);
(B53) marking the frames of step (B52) as triggered pitch or non-triggered pitch via the playing-triggered-pitch automatic category labeling module, wherein a triggered pitch is true and a non-triggered pitch is false;
(B54) training a one-class classifier using the samples whose category is false in step (B53) as training samples, and detecting outliers;
(B55) merging the samples predicted to be false in step (B54) with the samples whose category is true in step (B53) as training samples to train a binary classifier.
5. A computer device, characterized in that it comprises a processor and a memory in which a computer program is stored, the computer program, when loaded and executed by the processor, carrying out the extraction method according to any one of claims 1 to 4.
6. A computer-readable medium, in which a computer program is stored, the computer program being loaded and executed by a processor to implement the extraction method according to any one of claims 1 to 4.
CN202110634335.9A 2019-07-25 2019-07-25 Method for extracting pitch and duration values in musical instrument sound based on data set Withdrawn CN113205831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110634335.9A CN113205831A (en) 2019-07-25 2019-07-25 Method for extracting pitch and duration values in musical instrument sound based on data set

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910669985.XA CN110415730B (en) 2019-07-25 2019-07-25 A music analysis data set construction method and a pitch and time value extraction method based thereon
CN202110634335.9A CN113205831A (en) 2019-07-25 2019-07-25 Method for extracting pitch and duration values in musical instrument sound based on data set

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910669985.XA Division CN110415730B (en) 2019-07-25 2019-07-25 A music analysis data set construction method and a pitch and time value extraction method based thereon

Publications (1)

Publication Number Publication Date
CN113205831A true CN113205831A (en) 2021-08-03

Family

ID=68362807

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202110634335.9A Withdrawn CN113205831A (en) 2019-07-25 2019-07-25 Method for extracting pitch and duration values in musical instrument sound based on data set
CN202110634456.3A Withdrawn CN113205832A (en) 2019-07-25 2019-07-25 Data set-based extraction system for pitch and duration values in musical instrument sounds
CN201910669985.XA Active CN110415730B (en) 2019-07-25 2019-07-25 A music analysis data set construction method and a pitch and time value extraction method based thereon

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN202110634456.3A Withdrawn CN113205832A (en) 2019-07-25 2019-07-25 Data set-based extraction system for pitch and duration values in musical instrument sounds
CN201910669985.XA Active CN110415730B (en) 2019-07-25 2019-07-25 A music analysis data set construction method and a pitch and time value extraction method based thereon

Country Status (1)

Country Link
CN (3) CN113205831A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210841B (en) * 2020-01-13 2022-07-29 杭州矩阵之声科技有限公司 Musical instrument phoneme recognition model establishing method and musical instrument phoneme recognition method
CN111863026B (en) * 2020-07-27 2024-05-03 北京世纪好未来教育科技有限公司 Keyboard instrument playing music processing method and device and electronic device
CN112667844B (en) * 2020-12-23 2025-01-14 腾讯音乐娱乐科技(深圳)有限公司 Audio retrieval method, device, equipment and storage medium
CN113436591B (en) * 2021-06-24 2023-11-17 广州酷狗计算机科技有限公司 Pitch information generation method, device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8889976B2 (en) * 2009-08-14 2014-11-18 Honda Motor Co., Ltd. Musical score position estimating device, musical score position estimating method, and musical score position estimating robot
JP2011221133A (en) * 2010-04-06 2011-11-04 Sony Corp Information processing device, client device, server device, list generating method, list retrieving method, list providing method, and program
JP6047985B2 (en) * 2012-07-31 2016-12-21 ヤマハ株式会社 Accompaniment progression generator and program
CN108363769A (en) * 2018-02-07 2018-08-03 大连大学 The method for building up of semantic-based music retrieval data set
CN109065008B (en) * 2018-05-28 2020-10-27 森兰信息科技(上海)有限公司 Music performance music score matching method, storage medium and intelligent musical instrument

Also Published As

Publication number Publication date
CN110415730B (en) 2021-08-31
CN110415730A (en) 2019-11-05
CN113205832A (en) 2021-08-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210803