CN113851113B - Model training method and device, voice wake-up method and device
- Publication number
- CN113851113B (application number CN202111137419.8A)
- Authority
- CN
- China
- Prior art keywords
- audio
- information
- feature information
- model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The application discloses a model training method and device, a voice wake-up method and device, an electronic device, and a readable storage medium, and belongs to the technical field of data processing. The model training method comprises: obtaining first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio; outputting phoneme information and semantic information of the audio training data through an acoustic model to be trained, a generative adversarial network model to be trained, and the first feature information; outputting second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information; and training the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
Description
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a model training method and device, a voice awakening method and device, electronic equipment and a readable storage medium.
Background
At present, voice interaction has become an important form of human-machine interaction. The voice wake-up function serves as the entry point of voice interaction and has been successfully applied to many different types of electronic devices, including smart speakers, smartphones, smart home devices, smart in-vehicle devices, and the like.
For example, a user can wake up a smart speaker with a designated wake-up word and then control it by voice to play audio; similarly, a user can wake up a mobile phone with a designated wake-up word and then place a call by voice.
In the prior art, wake-up failures or false wake-ups often occur because the speech is judged inaccurately.
Disclosure of Invention
Embodiments of the present application aim to provide a model training method that can solve the prior-art problem of wake-up failures or false wake-ups caused by inaccurate speech judgment.
In a first aspect, an embodiment of the present application provides a model training method. The method includes: obtaining first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio; outputting phoneme information and semantic information of the audio training data through an acoustic model to be trained, a generative adversarial network model to be trained, and the first feature information; outputting second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information; and training the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
In a second aspect, an embodiment of the present application provides a voice wake-up method. The method includes: obtaining third feature information of a first audio; outputting first phoneme information of the first audio through an acoustic model and the third feature information; and outputting a wake-up instruction when the first phoneme information matches preset phoneme information of wake-up audio, where the acoustic model is trained by the model training method of the first aspect.
In a third aspect, an embodiment of the present application provides a model training device. The device includes: a first obtaining module configured to obtain first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio; a first output module configured to output phoneme information and semantic information of the audio training data through an acoustic model to be trained, a generative adversarial network model to be trained, and the first feature information; a second output module configured to output second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information; and a training module configured to train the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
In a fourth aspect, an embodiment of the present application provides a voice wake-up device. The device includes: a second obtaining module configured to obtain third feature information of a first audio; a third output module configured to output first phoneme information of the first audio through an acoustic model and the third feature information; and a fourth output module configured to output a wake-up instruction when the first phoneme information matches preset phoneme information of wake-up audio, where the acoustic model is trained by the model training method of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, the program or instruction implementing the steps of the method according to the first or second aspect when executed by the processor.
In a sixth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor implement the steps of the method according to the first or second aspect.
In a seventh aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect or the second aspect.
Thus, in the embodiments of the present application, the acoustic model used by the voice wake-up function needs to be trained so that it judges audio with high accuracy. First, a large amount of audio, including wake-up audio and non-wake-up audio, is used as audio training data, first feature information is extracted from it, and the first feature information is input into the acoustic model, which outputs phoneme information of the audio training data. Second, the first feature information is input into the generative adversarial network model, which outputs semantic information of the audio training data. Then, the phoneme information is input into the generative adversarial network model, which combines the semantic information and the phoneme information and outputs second feature information of the audio training data. Further, the acoustic model and the generative adversarial network model are trained based on the output second feature information and the first feature information, so that the difference between the second feature information and the first feature information is minimized. Therefore, in the embodiments of the present application, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. This achieves the training goal of the acoustic model: the acoustic model judges audio more accurately, the accuracy of recognizing wake-up audio is improved, and wake-up failures and false wake-ups are avoided.
Drawings
FIG. 1 is a flow chart of a model training method of an embodiment of the present application;
FIG. 2 is a schematic illustration of a model training method of an embodiment of the present application;
FIG. 3 is a schematic diagram of a network architecture of a model training method according to an embodiment of the present application;
FIG. 4 is a flow chart of a voice wakeup method according to an embodiment of the present application;
FIG. 5 is a block diagram of a model training apparatus of an embodiment of the present application;
FIG. 6 is a block diagram of a voice wake apparatus of an embodiment of the application;
FIG. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
FIG. 8 is a second schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms first, second, and the like in the description and in the claims are used to distinguish between similar objects and are not necessarily used to describe a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that embodiments of the present application can be implemented in orders other than those illustrated or described herein. The objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The model training method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, a flow chart of a model training method of one embodiment of the present application is shown, the method being applied to an electronic device, comprising:
Step 110, obtaining first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio.
Wake-up audio is audio that triggers the output of a wake-up instruction; all other audio is non-wake-up audio.
In this embodiment, the model in the wake-up function is trained based on audio training data including wake-up audio and non-wake-up audio, so as to improve the accuracy of judgment of the audio in the wake-up function.
The first feature information is the collection of feature information obtained from the large amount of audio in the audio training data.
In this embodiment, the first feature information is used to represent Fbank features of the audio training data.
Optionally, Fbank feature extraction is performed on the training corpus of the audio training data; typically, 80-dimensional features are extracted at a sampling rate of 16 kHz.
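Purely as an illustration (not part of the patent), a minimal sketch of such Fbank extraction, assuming librosa is available, 80 mel bins, and a 25 ms window with a 10 ms hop:

```python
# Hypothetical sketch of 80-dimensional Fbank (log-mel) feature extraction.
# Library choice (librosa) and framing parameters are assumptions, not from the patent.
import numpy as np
import librosa

def extract_fbank(path, sr=16000, n_mels=80):
    # Load the audio resampled to 16 kHz.
    y, sr = librosa.load(path, sr=sr)
    # 25 ms window (400 samples) and 10 ms hop (160 samples) at 16 kHz.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, win_length=400, hop_length=160, n_mels=n_mels)
    # Log compression yields the Fbank features; shape: (frames, 80).
    return np.log(mel + 1e-6).T
```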
Step 120, outputting the phoneme information and the semantic information of the audio training data through the acoustic model to be trained, the generative adversarial network model to be trained, and the first feature information.
The phoneme information of the audio training data is output through the acoustic model to be trained.
The acoustic model is a model for recognizing sound in speech recognition or speech wake-up.
In this step, the first feature information is taken as an input, and phoneme information of the audio training data is output through an acoustic model.
Optionally, the phoneme information comprises a phoneme probability matrix, where each frame of each audio in the audio training data corresponds to a phoneme probability sequence.
In addition, the semantic information of the audio training data is output through the generative adversarial network model to be trained.
Optionally, the generative adversarial network model in this embodiment is based on a conditional variational auto-encoder (C-VAE), and the generative adversarial network model can be considered to include an encoder.
Therefore, in this step, the first feature information is taken as an input, and the semantic information of the audio training data is output by the encoder.
The semantic information is the collection of semantic information obtained from the large amount of audio in the audio training data.
Illustratively, through the encoder, the semantic information corresponding to each frame of each audio in the audio training data can be obtained.
The semantic information in this embodiment is a latent variable representing the semantics (the hidden variable z output by the encoder).
Step 130, outputting second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information.
In this embodiment, a VAWGAN network (a generative adversarial network based on a variational auto-encoder) is used. A generative adversarial network is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions in recent years. The model produces good output through a game played between at least two modules in the framework: a generation module (the generative model) and a discrimination module (the discriminative model).
Optionally, the generation module comprises a generator, and the discrimination module comprises a discriminator.
In this step, the output of the acoustic model (i.e., the phoneme information of the audio training data) and the output of the encoder (i.e., the semantic information of the audio training data) are concatenated and input to the generator, which outputs the second feature information of the audio training data.
In this embodiment, the second feature information is used to represent the fake features of the audio training data.
Wherein the first feature information is a real feature derived based on the audio training data and the second feature information is a synthesized feature derived based on the model output.
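As a rough illustration (not from the patent), concatenating the per-frame phoneme posteriors A(x) with the semantic latent z before the generator might look like the following PyTorch sketch; the tensor sizes and the commented generator call are assumptions:

```python
# Hypothetical sketch: combine phoneme posteriors and semantic latents per frame.
import torch

frames, n_phones, z_dim = 200, 218, 64    # assumed sizes
phoneme_post = torch.softmax(torch.randn(frames, n_phones), dim=-1)  # A(x)
z = torch.randn(frames, z_dim)                                       # encoder output

# Concatenate along the feature dimension and feed the generator.
gen_input = torch.cat([phoneme_post, z], dim=-1)   # (frames, n_phones + z_dim)
# fake_fbank = generator(gen_input)                # -> second feature information x'
```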
Referring to FIG. 2, the encoder output z represents a semantic latent variable, and the acoustic model output A(x) represents a phoneme posterior probability matrix, where the horizontal axis is the time dimension and the vertical axis is the phoneme posterior probability sequence. Since both represent the semantics of the audio, combining them better enhances the representation of the semantic feature information of the audio, so that the acoustic model obtained by this modeling better represents the phoneme probabilities of each frame of the audio.
Step 140, training the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
In this step, the acoustic model and VAWGAN network are trained to adjust the parameters in each model, resulting in a trained acoustic model and VAWGAN network.
In the training process, the acoustic model parameters A, the encoder parameters φ, the generator parameters θ, and the discriminator parameters ψ are each optimized.
In this step, the purpose of training is to minimize the difference between the synthesized second feature information and the actual first feature information, so that the audio identified by the acoustic model is closest to the actual audio, and thus the accuracy of determining the audio can be improved.
Thus, in the embodiments of the present application, the acoustic model used by the voice wake-up function needs to be trained so that it judges audio with high accuracy. First, a large amount of audio, including wake-up audio and non-wake-up audio, is used as audio training data, first feature information is extracted from it, and the first feature information is input into the acoustic model, which outputs phoneme information of the audio training data. Second, the first feature information is input into the generative adversarial network model, which outputs semantic information of the audio training data. Then, the phoneme information is input into the generative adversarial network model, which combines the semantic information and the phoneme information and outputs second feature information of the audio training data. Further, the acoustic model and the generative adversarial network model are trained based on the output second feature information and the first feature information, so that the difference between the second feature information and the first feature information is minimized. Therefore, in the embodiments of the present application, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. This achieves the training goal of the acoustic model: the acoustic model judges audio more accurately, the accuracy of recognizing wake-up audio is improved, and wake-up failures and false wake-ups are avoided.
In the flow of the model training method according to another embodiment of the present application, step 140 includes:
Step A1, training the acoustic model and the generative adversarial network model until a first error rate between the first feature information and the second feature information meets a first preset condition.
In this step, the first feature information and the second feature information are input to the discriminator, which outputs the difference between them.
Optionally, the difference between the first feature information and the second feature information is represented by a first error rate.
In one interpretation of this embodiment, the final training goal of the present application is that the first error rate between the first feature information and the second feature information is smaller than a certain threshold; the first preset condition is therefore that the first error rate is smaller than that threshold.
In another interpretation, the final training goal is to reach a preset number of iterations, at which point the first error rate between the first feature information and the second feature information no longer changes and reaches its minimum; the first preset condition is therefore that the error rate at the preset number of iterations has been reached.
Illustratively, in one experiment, the number of iterations was selected to be 200000.
In this embodiment, based on the first preset condition, a final training effect is achieved, so that the difference between the first feature information and the second feature information is minimized, and the accuracy of the acoustic model in judging the audio is improved.
In the flow of the model training method according to another embodiment of the present application, before step 120, the method further includes:
Step B1, training the acoustic model until the matching rate between the phoneme information of the audio training data and the preset phoneme information meets a fourth preset condition.
Optionally, the audio training data further comprises text labels corresponding to the respective audio.
Optionally, a trained, high-accuracy speech recognition network is used, together with the text labels corresponding to the audio, to align the audio training data and obtain the phoneme label corresponding to each frame of each audio in the audio training data; all the phoneme labels together form the preset phoneme information of this embodiment.
The preset phoneme information comprises a phoneme label corresponding to each frame of each audio.
The matching rate in this embodiment is the overall matching rate obtained by matching the phoneme probability sequence of each frame in the audio training data against the phoneme label of the corresponding frame.
Optionally, the training process of this embodiment is:
A mapping is established between the Fbank features x of the audio training data and the preset phoneme information.
In the first step, the phoneme probability sequence of each frame is output through the acoustic model and the first feature information, and the phoneme label of each frame is obtained through the speech recognition network.
In the second step, the cross-entropy loss function is used to obtain the error loss between the phoneme probability sequence $z_p$ and the phoneme labels:

$$L = -\frac{1}{M}\sum_{i}\sum_{c} y_{ic}\,\log(p_{ic}) \tag{1}$$

$$z_p = [p_{i1}, p_{i2}, \ldots, p_{ic}] \tag{2}$$

where M is the total number of phoneme labels, $y_{ic}$ is the indicator function (0 or 1) of the phoneme label, taking 1 if the phoneme label of the i-th frame equals c and 0 otherwise, $p_{ic}$ is the predicted probability that the i-th frame belongs to c, and $z_p$ is the phoneme probability sequence.
$z_p$ is obtained by acoustic model inference on the input Fbank features:

$$z_p = A(x) \tag{3}$$

where A denotes the acoustic model and its parameters. In training the acoustic model, the cross-entropy loss in (1) is minimized through continuous iteration so that the acoustic model keeps converging.
The matching rate meeting the fourth preset condition corresponds to the error L between the phoneme probability sequence $z_p$ and the phoneme labels being minimized.
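A minimal PyTorch sketch of this cross-entropy pre-training step (illustrative only; the stand-in model, the number of phoneme classes, and the hyperparameters are assumptions, not taken from the patent):

```python
# Hypothetical sketch of acoustic-model pre-training with the cross-entropy of eq. (1).
import torch
import torch.nn as nn

# Stand-in acoustic model; the patent's actual model is a 2-D CNN (see FIG. 3).
acoustic_model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 218))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(acoustic_model.parameters(), lr=1e-3)

fbank = torch.randn(100, 80)                  # 100 frames of 80-dim Fbank features
phone_labels = torch.randint(0, 218, (100,))  # aligned per-frame phoneme labels

for _ in range(10):                           # a few iterations for illustration
    logits = acoustic_model(fbank)            # per-frame phoneme scores, z_p = A(x)
    loss = criterion(logits, phone_labels)    # cross-entropy against aligned labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```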
In this embodiment, before the output phoneme information of the audio training data is used as an input on the generator side, the acoustic model may be preliminarily trained according to the training method described above, so as to minimize the difference between the phoneme information obtained by the acoustic model and the preset phoneme information. On the basis of the training method provided in this embodiment, combined with the training method of the previous embodiment, the acoustic model can therefore be trained more finely, ensuring that its accuracy in judging audio is as high as possible.
In the flow of the model training method according to another embodiment of the present application, the generative adversarial network model includes a discrimination module and a generation module, and step 140 includes:
Step C1, training the generation module and the acoustic model until the second feature information output by the generation module meets a second preset condition.
Step C2, training the discrimination module until a second error rate between the first feature information and the second feature information output by the generation module meets a third preset condition.
As is apparent from the foregoing embodiments, the present application, on the basis of a variational auto-encoder, incorporates a VAWGAN network into the decoder to enhance the VAE effect. The VAWGAN includes two parts: a generator for producing a synthesized spectrum, and a discriminator for judging whether the synthesized spectrum is a real spectrum. It can be understood that the decoder comprises the generator and the discriminator.
In the VAWGAN network, the objective function is:

$$J_{vawgan} = L(x;\phi,\theta) + \alpha J_{wgan} \tag{4}$$
where $L(x;\phi,\theta)$ is the objective function of the encoder portion:

$$L(x;\phi,\theta) = -D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) + E_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \tag{5}$$

where $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z))$ denotes the relative entropy (Kullback-Leibler divergence, KL divergence) term, and the prior probability $p_\theta(z)$ is a standard multidimensional Gaussian distribution. $q_\phi(z|x)$ and $p_\theta(x|z)$ are the encoder and the decoder, respectively, each following a multidimensional Gaussian distribution with mean vector and covariance $(\mu_\phi(z),\sigma_\phi(z))$ and $(\mu_\theta(x),\sigma_\theta(x))$, respectively. Thus, the two terms on the right can be simplified as:

$$-D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) = \frac{1}{2}\sum_{k=1}^{K}\big(1 + \log\sigma_k^{2} - \mu_k^{2} - \sigma_k^{2}\big)$$

$$E_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \approx \frac{1}{L}\sum_{l=1}^{L}\log p_\theta\big(x \mid z^{(l)}\big)$$

where K is the dimension of the intermediate variable z and L is the number of times $q_\phi(z|x)$ is sampled. Since the sampling process is a discontinuous operation and cannot be differentiated, the network parameters of the encoder and decoder cannot be updated by back-propagation. Another random variable ε is therefore introduced to re-parameterize the hidden variable z, such that $z^{(l)} = \mu_\theta(x) + \varepsilon^{(l)}\,\sigma_\theta(x)$ with $\varepsilon^{(l)} \sim N(0,I)$, where D is the number of samples of x.
Thus, the objective loss function of the optimized VAWGAN network can be obtained.
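As an illustration of the re-parameterization trick mentioned above (not the patent's code), a short PyTorch sketch; the tensor sizes are assumptions:

```python
# Hypothetical sketch of the re-parameterization trick: z = mu + eps * sigma.
import torch

frames, z_dim = 200, 64                  # assumed sizes
mu = torch.randn(frames, z_dim)          # mean predicted by the encoder
log_var = torch.randn(frames, z_dim)     # log-variance predicted by the encoder

eps = torch.randn_like(mu)               # eps ~ N(0, I)
z = mu + eps * torch.exp(0.5 * log_var)  # differentiable sample of the latent z

# Closed-form KL term against a standard Gaussian prior (summed over dimensions).
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
```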
The parameters of the acoustic model A(x) change dynamically with the training process according to the loss function, so that the model keeps converging, and the encoder output z changes dynamically accordingly.
Based on the above, the following explains how the acoustic model A(x) and the VAWGAN are trained simultaneously so that the acoustic model A(x) achieves a better effect.
$J_{wgan}$ denotes the objective function of the WGAN part of the VAWGAN:
where α is the loss coefficient of the VAWGAN, and $D_\psi$ is the discriminator's decision output on whether a feature is real or fake. A(x) is combined with z, fed into the generator, and the result is then judged by the discriminator. The latter half of the above expression is the loss function of the generator's two-dimensional convolutional neural network:
Since the acoustic model A(x) also requires continual parameter optimization in this process, the objective function for optimizing the generator becomes:
where min indicates minimizing the generator and acoustic-model loss and solving for the optimal parameters of the generator and the acoustic model A; the latter half is the acoustic-model loss function, which is combined with the generator loss function so that the overall loss reaches an optimal value.
Because the acoustic-model loss is added to the generator optimization, the loss function of the discriminator's two-dimensional convolutional neural network becomes:
The objective function for optimizing the discriminator is:
where max indicates maximizing the discriminator's loss function, i.e., the discriminator's objective is to maximize the gap between real features and fake features, thereby continually optimizing the discriminator's model parameters.
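For orientation only, a standard WGAN-style form consistent with the surrounding description (an assumption, not necessarily the patent's exact equations; $L_{CE}(A)$ stands for the acoustic model's cross-entropy loss from equation (1)) is:

$$J_{wgan} = E_{x}\big[D_\psi(x)\big] - E_{z}\big[D_\psi\big(G_\theta(z, A(x))\big)\big]$$

$$\min_{\theta,\,A}\; L_G = -E_{z}\big[D_\psi\big(G_\theta(z, A(x))\big)\big] + L_{CE}(A)$$

$$\max_{\psi}\; L_D = E_{x}\big[D_\psi(x)\big] - E_{z}\big[D_\psi\big(G_\theta(z, A(x))\big)\big]$$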
In this embodiment, the decoder is composed of the generator and the discriminator. In the training process, the parameters of the discriminator are first fixed and the generator and the acoustic model are trained so that the overall generator loss function $L_G$ is as small as possible, i.e., the second feature information meets the second preset condition, yielding the Fbank feature x' (that is, the second feature information). Then the parameters of the generator and the acoustic model are fixed and the discriminator is trained so that the discriminator loss function $L_D$ is as large as possible, i.e., $-L_D$ is minimized, which means the second error rate between the first feature information and the second feature information meets the third preset condition.
In one interpretation, the two steps in this embodiment are repeated alternately. For example, in the first round, step C1 is performed to train the generator and the acoustic model so that the overall generator loss $L_G$ is as small as possible given the discriminator's initial parameters, and step C2 is then performed to train the discriminator based on the generator and acoustic-model parameters obtained in this round. In the second round, because the discriminator's parameters were adjusted in step C2 of the first round, step C1 is performed again so that the generator loss $L_G$ is as small as possible given the adjusted parameters, followed by step C2 again, and so on until the number of iterations is completed.
Correspondingly, the second error rate is used to represent the error rate obtained in one repetition step, and the first error rate is used to represent the error rate obtained in the final training.
In another interpretation, the two steps in this embodiment each represent a class of steps. For example, step C1 represents all steps of training the generator and the acoustic model and may summarize multiple repeated steps, and step C2 represents all steps of training the discriminator and may likewise summarize multiple repeated steps.
Correspondingly, the first error rate and the second error rate are used to represent the error rate resulting from the final training.
In this explanation, the order is not limited between the steps C1 and C2.
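A condensed sketch of this alternating scheme (illustrative only; the tiny stand-in modules, sizes, and hyperparameters are assumptions rather than the patent's implementation):

```python
# Hypothetical sketch of the alternating VAWGAN-style training described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_mels, n_phones, z_dim = 80, 218, 64            # assumed sizes
encoder = nn.Linear(n_mels, z_dim)               # stand-in for the C-VAE encoder
acoustic_model = nn.Linear(n_mels, n_phones)     # stand-in for A(x)
generator = nn.Linear(n_phones + z_dim, n_mels)  # stand-in generator
discriminator = nn.Linear(n_mels, 1)             # stand-in discriminator

opt_g = torch.optim.SGD(list(generator.parameters()) + list(encoder.parameters())
                        + list(acoustic_model.parameters()), lr=1e-4)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=1e-4)

fbank = torch.randn(200, n_mels)                 # 200 frames of Fbank features
labels = torch.randint(0, n_phones, (200,))      # aligned phoneme labels

for _ in range(5):
    # Step C1: keep the discriminator fixed, train generator + acoustic model (+ encoder).
    z = encoder(fbank)                           # semantic latent per frame
    post = F.softmax(acoustic_model(fbank), -1)  # phoneme posteriors A(x)
    fake = generator(torch.cat([post, z], -1))   # second feature information x'
    loss_g = -discriminator(fake).mean() + F.cross_entropy(acoustic_model(fbank), labels)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Step C2: keep generator/acoustic model fixed, train the discriminator.
    loss_d = -(discriminator(fbank).mean() - discriminator(fake.detach()).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```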
The generator adopts a two-dimensional convolutional neural network comprising 4 convolutional layers. The filter sizes of the 4 convolutional layers are 9×1, 7×1, 7×1, and 1025×1, the strides are 3, 3, and 1, the filter depths are 32, 16, 8, and 1, and the activation function is LReLU.
The discriminator employs a two-dimensional convolutional neural network comprising 3 convolutional layers and 1 fully-connected layer. The filter sizes of the 3 convolutional layers are 7×1, 7×1, and 115×1, the strides are all 3, the filter depths are 16, 32, and 64, and the activation function is LReLU.
The acoustic model has the same structure as the encoder and adopts a two-dimensional convolutional neural network comprising 5 convolutional layers and 1 fully-connected layer. The filter sizes of the 5 convolutional layers are all 7×1, the strides are all 3, the filter depths are 16, 32, 64, 128, and 256, and the activation function is LReLU. The network structure is shown in FIG. 3. Stochastic gradient descent (SGD) is used to update the network model parameters during training.
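For readers who prefer code, a sketch of the acoustic-model topology just described (the 1025-point spectral input layout, matching the 1025×1 generator filter, and the number of output phonemes are assumptions not stated here):

```python
# Hypothetical sketch of the acoustic model topology described above:
# 5 conv layers with 7x1 filters, stride 3, depths 16/32/64/128/256, LReLU, then 1 FC layer.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_phones=218):          # number of phoneme classes is assumed
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in (16, 32, 64, 128, 256):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=(7, 1), stride=(3, 1)),
                       nn.LeakyReLU(0.2)]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.fc = nn.LazyLinear(n_phones)       # fully-connected output layer

    def forward(self, feats):                   # feats: (batch, 1, 1025, frames)
        h = self.conv(feats)                    # -> (batch, 256, 2, frames)
        h = h.permute(0, 3, 1, 2).flatten(2)    # -> (batch, frames, 256 * 2)
        return self.fc(h)                       # per-frame phoneme scores

scores = AcousticModel()(torch.randn(1, 1, 1025, 50))   # -> (1, 50, 218)
```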
In this embodiment, a method for training a voice wake-up acoustic model based on a generative adversarial network is provided to improve the modeling effect of the acoustic model. The training of the voice wake-up system is realized by combining a generative adversarial network based on a variational auto-encoder with the acoustic model. By combining the VAWGAN network with the acoustic model, the modeling quality of the acoustic model can be improved substantially, realizing high-quality voice wake-up.
In the flow of the model training method according to another embodiment of the present application, step 120 includes:
Step D1, outputting phoneme information corresponding to the target frames of each audio in the audio training data through the acoustic model to be trained and the first feature information.
Step D2, outputting semantic information corresponding to the target frames of each audio in the audio training data through the generative adversarial network model to be trained and the first feature information.
Optionally, the target frame comprises each frame of each audio in the audio training data.
Optionally, the target frames comprise a subset of the frames of each audio in the audio training data. These frames can be sampled at a certain interval to ensure that the target frames are evenly distributed over the audio training data.
Correspondingly, based on the phoneme information obtained for a target frame, the semantic information of the corresponding frame is obtained, so that the phoneme information and the semantic information of any given frame can be combined to generate a fake feature for that frame, which is compared with the Fbank feature of the same frame.
Further, feature comparison is performed on each frame in the target frames in sequence, so that feature comparison of the whole audio training data is completed.
This embodiment provides a method of acquiring the phoneme information and semantic information of the audio training data in more detail. For the audio training data, phoneme information and semantic information are acquired at regular intervals for the target frames so as to generate a synthesized feature for each such frame, which is then compared with the real feature of that frame. The overall situation of the audio training data can therefore be inferred from the feature comparison on the target frames and used for model training in this application.
Referring to fig. 4, a flowchart of a voice wake-up method according to another embodiment of the present application is shown, and the method is applied to an electronic device, and includes:
and 150, acquiring third characteristic information of the first audio.
The model training method and the voice wake-up method provided by the application correspond to two stages: the training stage of the foregoing embodiments and the wake-up stage of this embodiment.
In this step, the third feature information is used to represent Fbank features of the first audio.
In this embodiment, the first audio may be an audio stream. A segment of audio is streamed into a memory buffer with a frame length of 10 ms; frames can be skipped (for example, 1 frame is sent out of every 3 frames) to reduce the amount of computation, and features are then extracted.
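A toy sketch of the buffering and frame-skipping idea (the 10 ms frame length and 1-in-3 skip come from the example above; everything else is assumed):

```python
# Hypothetical sketch: buffer 10 ms frames and forward only 1 out of every 3
# to the feature extractor to reduce computation.
import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN = SAMPLE_RATE // 100          # 10 ms -> 160 samples
SKIP = 3                                # send 1 frame out of every 3

def frames_to_process(audio_buffer):
    """Yield every third 10 ms frame from the buffered audio stream."""
    n_frames = len(audio_buffer) // FRAME_LEN
    for i in range(0, n_frames, SKIP):
        yield audio_buffer[i * FRAME_LEN:(i + 1) * FRAME_LEN]

stream = np.random.randn(SAMPLE_RATE)          # 1 s of fake buffered audio
selected = list(frames_to_process(stream))     # frames sent on for Fbank extraction
```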
Step 160, outputting the first phoneme information of the first audio through the acoustic model and the third characteristic information.
Wherein the acoustic model is trained by the model training method in any of the above embodiments.
The extracted Fbank features are input into the acoustic model trained in the foregoing embodiments for inference, obtaining the corresponding first phoneme information of the first audio.
Wherein the first phoneme information comprises a phoneme probability matrix.
Step 170, outputting a wake-up instruction under the condition that the first phoneme information is matched with the preset phoneme information of the wake-up audio.
The wake-up instruction is used for waking up the terminal equipment and is applied to a voice wake-up function.
This step corresponds to Viterbi decoding.
In this step, the phoneme probability matrix obtained in step 160 is fed into the decoding graph of the wake-up audio and decoded with the Viterbi algorithm to obtain a score. If the score is greater than a certain threshold, the device is woken up; otherwise, the next frame of data continues to be sent.
The score can be understood as the degree of association between the first phoneme information and the preset phoneme information of the wake-up audio; if this degree of association is greater than a certain threshold, the first phoneme information matches the preset phoneme information of the wake-up audio.
Illustratively, based on the first phoneme information, the most probable phoneme label for each frame of the incoming audio stream can be obtained through decoding and compared with the preset phoneme label of the corresponding frame of the wake-up audio; if the similarity is greater than a set value, the degree of association between the first phoneme information and the preset phoneme information of the wake-up audio is greater than the threshold.
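The following toy sketch mirrors only the simplified per-frame comparison just described, not a full Viterbi decode over a decoding graph; the array shapes and threshold are assumptions:

```python
# Hypothetical, simplified wake-up decision: compare per-frame argmax phonemes
# against the wake word's preset phoneme labels (not a real Viterbi decode).
import numpy as np

def should_wake(phoneme_posteriors, wake_labels, threshold=0.8):
    """phoneme_posteriors: (frames, n_phones); wake_labels: (frames,) preset labels."""
    best = phoneme_posteriors.argmax(axis=1)          # most probable phoneme per frame
    similarity = float(np.mean(best == wake_labels))  # fraction of matching frames
    return similarity > threshold                     # wake only above the threshold

posts = np.random.rand(40, 218)           # fake posteriors for 40 frames
wake_labels = posts.argmax(axis=1)        # pretend these are the preset labels
print(should_wake(posts, wake_labels))    # -> True in this contrived example
```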
Thus, based on the foregoing embodiments, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. Therefore, in the wake-up stage, inference by the trained acoustic model on the received first audio yields more accurate phoneme information of the first audio, so that, after comparison with the wake-up audio, a terminal device such as a mobile phone can be woken up accurately and in a timely manner.
In summary, the modeling process for voice wake-up typically trains an acoustic model to build a mapping between speech features and phonemes, and then decodes the speech features using an optimal-path algorithm. However, because of low-power-consumption and fast-response requirements, the resources available to the acoustic model are limited, which often leads to inaccurate judgments and therefore to wake-up failures or false wake-ups. For this reason, in the training stage of the voice wake-up acoustic model, a generative adversarial network based on a variational auto-encoder is used in combination with the acoustic model, which better improves the modeling quality of the acoustic model, makes phoneme inference more accurate, reduces false wake-ups, and improves the wake-up rate.
It should be noted that, in the model training method provided by the embodiment of the present application, the execution subject may be a model training device, or a control module in the model training device for executing the model training method. In the embodiment of the application, a model training device is taken as an example to execute a model training method, and the model training device provided by the embodiment of the application is described.
Fig. 5 shows a block diagram of a model training apparatus according to another embodiment of the present application, the apparatus comprising:
A first obtaining module 10, configured to obtain first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio;
A first output module 20, configured to output phoneme information and semantic information of the audio training data through the acoustic model to be trained, the generative adversarial network model to be trained, and the first feature information;
a second output module 30, configured to output second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information;
The training module 40 is configured to train the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
Thus, in the embodiments of the present application, the acoustic model used by the voice wake-up function needs to be trained so that it judges audio with high accuracy. First, a large amount of audio, including wake-up audio and non-wake-up audio, is used as audio training data, first feature information is extracted from it, and the first feature information is input into the acoustic model, which outputs phoneme information of the audio training data. Second, the first feature information is input into the generative adversarial network model, which outputs semantic information of the audio training data. Then, the phoneme information is input into the generative adversarial network model, which combines the semantic information and the phoneme information and outputs second feature information of the audio training data. Further, the acoustic model and the generative adversarial network model are trained based on the output second feature information and the first feature information, so that the difference between the second feature information and the first feature information is minimized. Therefore, in the embodiments of the present application, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. This achieves the training goal of the acoustic model: the acoustic model judges audio more accurately, the accuracy of recognizing wake-up audio is improved, and wake-up failures and false wake-ups are avoided.
Optionally, training module 40 includes:
The first training unit is used for training the acoustic model and the generative adversarial network model until a first error rate between the first feature information and the second feature information meets a first preset condition.
Optionally, the generative adversarial network model includes a discrimination module and a generation module, and the training module 40 includes:
a second training unit, used for training the generation module and the acoustic model until the second feature information output by the generation module meets a second preset condition;
and a third training unit, used for training the discrimination module until a second error rate between the first feature information and the second feature information output by the generation module meets a third preset condition.
Optionally, the first output module 20 includes:
The first output unit is used for outputting phoneme information corresponding to the target frames of each audio in the audio training data through the acoustic model to be trained and the first feature information;
the second output unit is used for outputting semantic information corresponding to the target frames of each audio in the audio training data through the generative adversarial network model to be trained and the first feature information.
It should be noted that, in the voice wake-up method provided by the embodiment of the present application, the execution body may be a voice wake-up device, or a control module in the voice wake-up device for executing the voice wake-up method. In the embodiment of the application, a voice wake-up device executes a voice wake-up method as an example, and the voice wake-up device provided by the embodiment of the application is described.
FIG. 6 shows a block diagram of a voice wake-up apparatus according to another embodiment of the present application, the apparatus comprising:
a second obtaining module 50, configured to obtain third feature information of the first audio;
A third output module 60 for outputting first phoneme information of the first audio through the acoustic model and the third characteristic information;
A fourth output module 70, configured to output a wake-up instruction if the first phoneme information matches with preset phoneme information of the wake-up audio;
Wherein the acoustic model is trained by the model training method of any of the foregoing embodiments.
Thus, based on the foregoing embodiments, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. Therefore, in the wake-up stage, inference by the trained acoustic model on the received first audio yields more accurate phoneme information of the first audio, so that, after comparison with the wake-up audio, a terminal device such as a mobile phone can be woken up accurately and in a timely manner.
The model training device/voice wake-up device in the embodiments of the application can be a device, or a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), etc., and the non-mobile electronic device may be a server, a network attached storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, etc.; the embodiments of the present application are not particularly limited in this respect.
The model training device/voice wake-up device in the embodiments of the application can be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the application.
The model training device/voice wake-up device provided by the embodiment of the application can realize each process realized by the corresponding method embodiment, and in order to avoid repetition, the description is omitted here.
Optionally, as shown in fig. 7, the embodiment of the present application further provides an electronic device 100, including a processor 101, a memory 102, and a program or an instruction stored in the memory 102 and capable of running on the processor 101, where the program or the instruction implements each process of any one of the foregoing model training method/the voice wake-up method embodiments when executed by the processor 101, and the process can achieve the same technical effect, so that repetition is avoided and redundant description is omitted herein.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1000 includes, but is not limited to, a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, and a processor 1010.
Those skilled in the art will appreciate that the electronic device 1000 may also include a power source (e.g., a battery) for powering the various components, which may be logically connected to the processor 1010 by a power management system to perform functions such as managing charge, discharge, and power consumption by the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.
The processor 1010 is configured to obtain first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio; output phoneme information and semantic information of the audio training data through an acoustic model to be trained, a generative adversarial network model to be trained, and the first feature information; output second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information; and train the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
Thus, in the embodiments of the present application, the acoustic model used by the voice wake-up function needs to be trained so that it judges audio with high accuracy. First, a large amount of audio, including wake-up audio and non-wake-up audio, is used as audio training data, first feature information is extracted from it, and the first feature information is input into the acoustic model, which outputs phoneme information of the audio training data. Second, the first feature information is input into the generative adversarial network model, which outputs semantic information of the audio training data. Then, the phoneme information is input into the generative adversarial network model, which combines the semantic information and the phoneme information and outputs second feature information of the audio training data. Further, the acoustic model and the generative adversarial network model are trained based on the output second feature information and the first feature information, so that the difference between the second feature information and the first feature information is minimized. Therefore, in the embodiments of the present application, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. This achieves the training goal of the acoustic model: the acoustic model judges audio more accurately, the accuracy of recognizing wake-up audio is improved, and wake-up failures and false wake-ups are avoided.
Optionally, the processor 1010 is further configured to train the acoustic model and the generative adversarial network model until a first error rate between the first feature information and the second feature information meets a first preset condition.
Optionally, the generative adversarial network model includes a discrimination module and a generation module, and the processor 1010 is further configured to train the generation module and the acoustic model until the second feature information output by the generation module meets a second preset condition, and to train the discrimination module until a second error rate between the first feature information and the second feature information output by the generation module meets a third preset condition.
Optionally, the processor 1010 is further configured to output phoneme information corresponding to the target frames of each audio in the audio training data through the acoustic model to be trained and the first feature information, and to output semantic information corresponding to the target frames of each audio in the audio training data through the generative adversarial network model to be trained and the first feature information.
The processor 1010 is configured to obtain third feature information of a first audio, output first phoneme information of the first audio through an acoustic model and the third feature information, and output a wake-up instruction when the first phoneme information matches preset phoneme information of the wake-up audio, where the acoustic model is obtained by training in the foregoing scenario.
Thus, based on the foregoing embodiments, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. Therefore, in the wake-up stage, inference by the trained acoustic model on the received first audio yields more accurate phoneme information of the first audio, so that, after comparison with the wake-up audio, a terminal device such as a mobile phone can be woken up accurately and in a timely manner.
In summary, the modeling process for voice wake-up typically trains an acoustic model to build a mapping between speech features and phonemes, and then decodes the speech features using an optimal-path algorithm. However, because of low-power-consumption and fast-response requirements, the resources available to the acoustic model are limited, which often leads to inaccurate judgments and therefore to wake-up failures or false wake-ups. For this reason, in the training stage of the voice wake-up acoustic model, a generative adversarial network based on a variational auto-encoder is used in combination with the acoustic model, which better improves the modeling quality of the acoustic model, makes phoneme inference more accurate, reduces false wake-ups, and improves the wake-up rate.
It should be appreciated that, in embodiments of the present application, the input unit 1004 may include a graphics processing unit (GPU) 10041 and a microphone 10042, where the GPU 10041 processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The display unit 1006 may include a display panel 10061, and the display panel 10061 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 1007 includes a touch panel 10071 and other input devices 10072. The touch panel 10071 is also referred to as a touch screen and may include two parts: a touch detection device and a touch controller. The other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail here. The memory 1009 may be used to store software programs and various data, including but not limited to application programs and an operating system. The processor 1010 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, and the like, with a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor may not be integrated into the processor 1010.
The embodiment of the application also provides a readable storage medium. The readable storage medium stores a program or instructions which, when executed by a processor, implement each process of the above model training method / voice wake-up method embodiments and can achieve the same technical effect; to avoid repetition, the details are not described again here.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled with the processor, and the processor is configured to run a program or instructions to implement each process of the above model training method / voice wake-up method embodiments and achieve the same technical effect; to avoid repetition, the details are not described again here.
It should be understood that the chip referred to in the embodiments of the present application may also be called a system-level chip, a chip system, or a system-on-chip, etc.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, although in many cases the former is preferred. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and including instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art may devise many further forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.
Claims (14)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111137419.8A CN113851113B (en) | 2021-09-27 | 2021-09-27 | Model training method and device, voice wake-up method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111137419.8A CN113851113B (en) | 2021-09-27 | 2021-09-27 | Model training method and device, voice wake-up method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113851113A CN113851113A (en) | 2021-12-28 |
| CN113851113B true CN113851113B (en) | 2025-06-20 |
Family
ID=78980147
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111137419.8A Active CN113851113B (en) | 2021-09-27 | 2021-09-27 | Model training method and device, voice wake-up method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113851113B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114842855B (en) * | 2022-04-06 | 2025-09-05 | 北京百度网讯科技有限公司 | Voice wake-up model training, wake-up method, device, equipment and storage medium |
| CN114822576B (en) * | 2022-04-30 | 2024-08-13 | 中国人民解放军总医院第一医学中心 | Communication system voice enhancement method based on magnetic resonance pulse sequence noise estimation |
| CN115132185A (en) * | 2022-06-29 | 2022-09-30 | 中国银行股份有限公司 | Speech recognition model training method, speech recognition method and related equipment |
| CN115936091B (en) * | 2022-11-24 | 2024-03-08 | 北京百度网讯科技有限公司 | Training method and device for deep learning model, electronic equipment and storage medium |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102204979B1 (en) * | 2018-08-24 | 2021-01-19 | 네이버 주식회사 | Method and system for generating multi-turn conversation response using deep learing generation model and multi-modal distribution |
| TW202029181A (en) * | 2019-01-28 | 2020-08-01 | 正崴精密工業股份有限公司 | Method and apparatus for specific user to wake up by speech recognition |
| CN112908317B (en) * | 2019-12-04 | 2023-04-07 | 中国科学院深圳先进技术研究院 | Voice recognition system for cognitive impairment |
| CN111354374A (en) * | 2020-03-13 | 2020-06-30 | 北京声智科技有限公司 | Speech processing method, model training method and electronic device |
| CN112420050B (en) * | 2020-11-18 | 2021-06-18 | 北京帝派智能科技有限公司 | Voice recognition method and device and electronic equipment |
| CN113284485B (en) * | 2021-07-09 | 2021-11-09 | 中国科学院自动化研究所 | End-to-end system for unified Chinese and English mixed text generation and voice recognition |
- 2021-09-27 CN CN202111137419.8A patent/CN113851113B/en active Active
Non-Patent Citations (1)
| Title |
|---|
| Research on Transformer Acoustic Models Based on Gated Generative Adversarial Networks; Lyu Xudong; China Master's Theses Full-text Database; 2023-08-15; full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113851113A (en) | 2021-12-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113851113B (en) | Model training method and device, voice wake-up method and device | |
| US10332507B2 (en) | Method and device for waking up via speech based on artificial intelligence | |
| US11657799B2 (en) | Pre-training with alignments for recurrent neural network transducer based end-to-end speech recognition | |
| EP4336490A1 (en) | Voice processing method and related device | |
| CN107134279B (en) | Voice awakening method, device, terminal and storage medium | |
| US7529671B2 (en) | Block synchronous decoding | |
| CN111833845A (en) | Multi-language speech recognition model training method, device, equipment and storage medium | |
| CN114127849B (en) | Speech emotion recognition method and device | |
| CN113450771B (en) | Awakening method, model training method and device | |
| US10762417B2 (en) | Efficient connectionist temporal classification for binary classification | |
| CN112466288A (en) | Voice recognition method and device, electronic equipment and storage medium | |
| CN111312222A (en) | Awakening and voice recognition model training method and device | |
| CN111653274B (en) | Wake-up word recognition method, device and storage medium | |
| CN112652306A (en) | Voice wake-up method and device, computer equipment and storage medium | |
| CN111508493A (en) | Voice wake-up method, device, electronic device and storage medium | |
| CN114360510B (en) | A speech recognition method and related device | |
| EP4629232A1 (en) | Interaction method and apparatus, device, and storage medium | |
| KR20230141932A (en) | Adaptive visual speech recognition | |
| CN116310983A (en) | Multi-mode emotion recognition method and device | |
| WO2021139182A1 (en) | Effective intelligent voice detection method and apparatus, device and computer-readable storage medium | |
| CN118644596B (en) | Face key point moving image generation method and related equipment | |
| CN114842855A (en) | Training of voice wake-up model, wake-up method, device, equipment and storage medium | |
| CN114519999A (en) | Speech recognition method, device, equipment and storage medium based on bimodal model | |
| CN116705013B (en) | Voice wake-up word detection method and device, storage medium and electronic equipment | |
| CN112261321A (en) | Subtitle processing method and device and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |