CN113851113B - Model training method and device, voice wake-up method and device
- Publication number
- CN113851113B (application number CN202111137419.8A)
- Authority
- CN
- China
- Prior art keywords
- audio
- information
- feature information
- model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The application discloses a model training method and device, a voice wake-up method and device, an electronic device, and a readable storage medium, and belongs to the technical field of data processing. The model training method comprises: obtaining first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio; outputting phoneme information and semantic information of the audio training data through an acoustic model to be trained, a generative adversarial network model to be trained, and the first feature information; outputting second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information; and training the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
Description
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a model training method and device, a voice awakening method and device, electronic equipment and a readable storage medium.
Background
At present, voice interaction has become an important form of human-machine interaction. The voice wake-up function serves as the entry point of voice interaction and has been successfully applied to many different types of electronic devices, including smart speakers, smartphones, smart home devices, smart in-vehicle devices, and the like.
For example, a user can wake up a smart speaker with a designated wake-up word and then control it by voice to play audio; similarly, a user can wake up a mobile phone with a designated wake-up word and then place a call by voice.
In the prior art, wake-up failures or false wake-ups often occur because the speech is judged inaccurately.
Disclosure of Invention
Embodiments of the present application aim to provide a model training method that can solve the prior-art problem of wake-up failures or false wake-ups caused by inaccurate speech judgment.
In a first aspect, an embodiment of the present application provides a model training method. The method includes: obtaining first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio; outputting phoneme information and semantic information of the audio training data through an acoustic model to be trained, a generative adversarial network model to be trained, and the first feature information; outputting second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information; and training the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
In a second aspect, an embodiment of the present application provides a voice wake-up method. The method includes: obtaining third feature information of a first audio; outputting first phoneme information of the first audio through an acoustic model and the third feature information; and outputting a wake-up instruction when the first phoneme information matches preset phoneme information of wake-up audio, where the acoustic model is trained by the model training method of the first aspect.
In a third aspect, an embodiment of the present application provides a model training device. The device includes: a first obtaining module configured to obtain first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio; a first output module configured to output phoneme information and semantic information of the audio training data through an acoustic model to be trained, a generative adversarial network model to be trained, and the first feature information; a second output module configured to output second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information; and a training module configured to train the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
In a fourth aspect, an embodiment of the present application provides a voice wake-up device. The device includes: a second obtaining module configured to obtain third feature information of a first audio; a third output module configured to output first phoneme information of the first audio through an acoustic model and the third feature information; and a fourth output module configured to output a wake-up instruction when the first phoneme information matches preset phoneme information of wake-up audio, where the acoustic model is trained by the model training method of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, the program or instruction implementing the steps of the method according to the first or second aspect when executed by the processor.
In a sixth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor implement the steps of the method according to the first or second aspect.
In a seventh aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect or the second aspect.
Thus, in the embodiments of the present application, the acoustic model used by the voice wake-up function needs to be trained so that it judges audio with high accuracy. First, a large amount of audio, including wake-up audio and non-wake-up audio, is used as audio training data, first feature information is extracted from it, and the first feature information is input into the acoustic model, which outputs phoneme information of the audio training data. Second, the first feature information is input into the generative adversarial network model, which outputs semantic information of the audio training data. Then, the phoneme information is input into the generative adversarial network model, which combines the semantic information and the phoneme information and outputs second feature information of the audio training data. Further, the acoustic model and the generative adversarial network model are trained based on the output second feature information and the first feature information, so that the difference between the second feature information and the first feature information is minimized. Therefore, in the embodiments of the present application, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. This achieves the training goal of the acoustic model: the acoustic model judges audio more accurately, the accuracy of recognizing wake-up audio is improved, and wake-up failures and false wake-ups are avoided.
Drawings
FIG. 1 is a flow chart of a model training method of an embodiment of the present application;
FIG. 2 is a schematic illustration of a model training method of an embodiment of the present application;
FIG. 3 is a schematic diagram of a network architecture of a model training method according to an embodiment of the present application;
FIG. 4 is a flow chart of a voice wakeup method according to an embodiment of the present application;
FIG. 5 is a block diagram of a model training apparatus of an embodiment of the present application;
FIG. 6 is a block diagram of a voice wake apparatus of an embodiment of the application;
FIG. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
FIG. 8 is a second schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms first, second, and the like in the description and in the claims are used to distinguish between similar objects and are not necessarily used to describe a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that embodiments of the present application can be implemented in orders other than those illustrated or described herein. The objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The model training method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, a flow chart of a model training method of one embodiment of the present application is shown, the method being applied to an electronic device, comprising:
Step 110, obtaining first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio.
Wake-up audio is audio that triggers the output of a wake-up instruction; all other audio is non-wake-up audio.
In this embodiment, the model in the wake-up function is trained based on audio training data including wake-up audio and non-wake-up audio, so as to improve the accuracy of judgment of the audio in the wake-up function.
The first feature information is the collection of feature information obtained from the large amount of audio in the audio training data.
In this embodiment, the first feature information is used to represent Fbank features of the audio training data.
Optionally, Fbank feature extraction is performed on the training corpus of the audio training data; typically, 80-dimensional features are extracted at a sampling rate of 16 kHz.
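Purely as an illustration (not part of the patent), a minimal sketch of such Fbank extraction, assuming librosa is available, 80 mel bins, and a 25 ms window with a 10 ms hop:

```python
# Hypothetical sketch of 80-dimensional Fbank (log-mel) feature extraction.
# Library choice (librosa) and framing parameters are assumptions, not from the patent.
import numpy as np
import librosa

def extract_fbank(path, sr=16000, n_mels=80):
    # Load the audio resampled to 16 kHz.
    y, sr = librosa.load(path, sr=sr)
    # 25 ms window (400 samples) and 10 ms hop (160 samples) at 16 kHz.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, win_length=400, hop_length=160, n_mels=n_mels)
    # Log compression yields the Fbank features; shape: (frames, 80).
    return np.log(mel + 1e-6).T
```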
Step 120, outputting the phoneme information and the semantic information of the audio training data through the acoustic model to be trained, the generative adversarial network model to be trained, and the first feature information.
The phoneme information of the audio training data is output through the acoustic model to be trained.
The acoustic model is a model for recognizing sound in speech recognition or speech wake-up.
In this step, the first feature information is taken as an input, and phoneme information of the audio training data is output through an acoustic model.
Optionally, the phoneme information comprises a phoneme probability matrix, where each frame of each audio in the audio training data corresponds to a phoneme probability sequence.
In addition, the semantic information of the audio training data is output through the generative adversarial network model to be trained.
Optionally, the generative adversarial network model in this embodiment is based on a conditional variational auto-encoder (C-VAE), and the generative adversarial network model can be considered to include an encoder.
Therefore, in this step, the first feature information is taken as an input, and the semantic information of the audio training data is output by the encoder.
The semantic information is the collection of semantic information obtained from the large amount of audio in the audio training data.
Illustratively, through the encoder, the semantic information corresponding to each frame of each audio in the audio training data can be obtained.
The semantic information in this embodiment is a latent variable representing the semantics (the hidden variable z output by the encoder).
Step 130, outputting second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information.
In this embodiment, a VAWGAN network (a generative adversarial network based on a variational auto-encoder) is used. A generative adversarial network is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions in recent years. The model produces good output through a game played between at least two modules in the framework: a generation module (the generative model) and a discrimination module (the discriminative model).
Optionally, the generation module comprises a generator, and the discrimination module comprises a discriminator.
In this step, the output of the acoustic model (i.e., the phoneme information of the audio training data) and the output of the encoder (i.e., the semantic information of the audio training data) are concatenated and input to the generator, which outputs the second feature information of the audio training data.
In this embodiment, the second feature information is used to represent the fake features of the audio training data.
Wherein the first feature information is a real feature derived based on the audio training data and the second feature information is a synthesized feature derived based on the model output.
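As a rough illustration (not from the patent), concatenating the per-frame phoneme posteriors A(x) with the semantic latent z before the generator might look like the following PyTorch sketch; the tensor sizes and the commented generator call are assumptions:

```python
# Hypothetical sketch: combine phoneme posteriors and semantic latents per frame.
import torch

frames, n_phones, z_dim = 200, 218, 64    # assumed sizes
phoneme_post = torch.softmax(torch.randn(frames, n_phones), dim=-1)  # A(x)
z = torch.randn(frames, z_dim)                                       # encoder output

# Concatenate along the feature dimension and feed the generator.
gen_input = torch.cat([phoneme_post, z], dim=-1)   # (frames, n_phones + z_dim)
# fake_fbank = generator(gen_input)                # -> second feature information x'
```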
Referring to FIG. 2, the encoder output z represents a semantic latent variable, and the acoustic model output A(x) represents a phoneme posterior probability matrix, where the horizontal axis is the time dimension and the vertical axis is the phoneme posterior probability sequence. Since both represent the semantics of the audio, combining them better enhances the representation of the semantic feature information of the audio, so that the acoustic model obtained by this modeling better represents the phoneme probabilities of each frame of the audio.
Step 140, training the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
In this step, the acoustic model and VAWGAN network are trained to adjust the parameters in each model, resulting in a trained acoustic model and VAWGAN network.
In the training process, the acoustic model parameters A, the encoder parameters φ, the generator parameters θ, and the discriminator parameters ψ are each optimized.
In this step, the purpose of training is to minimize the difference between the synthesized second feature information and the actual first feature information, so that the audio identified by the acoustic model is closest to the actual audio, and thus the accuracy of determining the audio can be improved.
Thus, in the embodiments of the present application, the acoustic model used by the voice wake-up function needs to be trained so that it judges audio with high accuracy. First, a large amount of audio, including wake-up audio and non-wake-up audio, is used as audio training data, first feature information is extracted from it, and the first feature information is input into the acoustic model, which outputs phoneme information of the audio training data. Second, the first feature information is input into the generative adversarial network model, which outputs semantic information of the audio training data. Then, the phoneme information is input into the generative adversarial network model, which combines the semantic information and the phoneme information and outputs second feature information of the audio training data. Further, the acoustic model and the generative adversarial network model are trained based on the output second feature information and the first feature information, so that the difference between the second feature information and the first feature information is minimized. Therefore, in the embodiments of the present application, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. This achieves the training goal of the acoustic model: the acoustic model judges audio more accurately, the accuracy of recognizing wake-up audio is improved, and wake-up failures and false wake-ups are avoided.
In the flow of the model training method according to another embodiment of the present application, step 140 includes:
Step A1, training the acoustic model and the generative adversarial network model until a first error rate between the first feature information and the second feature information meets a first preset condition.
In this step, the first feature information and the second feature information are input to the discriminator, which outputs the difference between them.
Optionally, the difference between the first feature information and the second feature information is represented by a first error rate.
In one interpretation of this embodiment, the final training goal of the present application is that the first error rate between the first feature information and the second feature information is smaller than a certain threshold; the first preset condition is therefore that the first error rate is smaller than that threshold.
In another interpretation, the final training goal is to reach a preset number of iterations, at which point the first error rate between the first feature information and the second feature information no longer changes and reaches its minimum; the first preset condition is therefore that the error rate at the preset number of iterations has been reached.
Illustratively, in one experiment, the number of iterations was selected to be 200000.
In this embodiment, based on the first preset condition, a final training effect is achieved, so that the difference between the first feature information and the second feature information is minimized, and the accuracy of the acoustic model in judging the audio is improved.
In the flow of the model training method according to another embodiment of the present application, before step 120, the method further includes:
Step B1, training the acoustic model until the matching rate between the phoneme information of the audio training data and the preset phoneme information meets a fourth preset condition.
Optionally, the audio training data further comprises text labels corresponding to the respective audio.
Optionally, a trained, high-accuracy speech recognition network is used, together with the text labels corresponding to the audio, to align the audio training data and obtain the phoneme label corresponding to each frame of each audio in the audio training data; all the phoneme labels together form the preset phoneme information of this embodiment.
The preset phoneme information comprises a phoneme label corresponding to each frame of each audio.
The matching rate in this embodiment is the overall matching rate obtained by matching the phoneme probability sequence of each frame in the audio training data against the phoneme label of the corresponding frame.
Optionally, the training process of this embodiment is:
A mapping is established between the Fbank features x of the audio training data and the preset phoneme information.
In the first step, the phoneme probability sequence of each frame is output through the acoustic model and the first feature information, and the phoneme label of each frame is obtained through the speech recognition network.
In the second step, the cross-entropy loss function is used to obtain the error loss between the phoneme probability sequence $z_p$ and the phoneme labels:

$$L = -\frac{1}{M}\sum_{i}\sum_{c} y_{ic}\,\log(p_{ic}) \tag{1}$$

$$z_p = [p_{i1}, p_{i2}, \ldots, p_{ic}] \tag{2}$$

where M is the total number of phoneme labels, $y_{ic}$ is the indicator function (0 or 1) of the phoneme label, taking 1 if the phoneme label of the i-th frame equals c and 0 otherwise, $p_{ic}$ is the predicted probability that the i-th frame belongs to c, and $z_p$ is the phoneme probability sequence.
$z_p$ is obtained by acoustic model inference on the input Fbank features:

$$z_p = A(x) \tag{3}$$

where A denotes the acoustic model and its parameters. In training the acoustic model, the cross-entropy loss in (1) is minimized through continuous iteration so that the acoustic model keeps converging.
The matching rate meeting the fourth preset condition corresponds to the error L between the phoneme probability sequence $z_p$ and the phoneme labels being minimized.
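A minimal PyTorch sketch of this cross-entropy pre-training step (illustrative only; the stand-in model, the number of phoneme classes, and the hyperparameters are assumptions, not taken from the patent):

```python
# Hypothetical sketch of acoustic-model pre-training with the cross-entropy of eq. (1).
import torch
import torch.nn as nn

# Stand-in acoustic model; the patent's actual model is a 2-D CNN (see FIG. 3).
acoustic_model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 218))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(acoustic_model.parameters(), lr=1e-3)

fbank = torch.randn(100, 80)                  # 100 frames of 80-dim Fbank features
phone_labels = torch.randint(0, 218, (100,))  # aligned per-frame phoneme labels

for _ in range(10):                           # a few iterations for illustration
    logits = acoustic_model(fbank)            # per-frame phoneme scores, z_p = A(x)
    loss = criterion(logits, phone_labels)    # cross-entropy against aligned labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```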
In this embodiment, before the output phoneme information of the audio training data is used as an input on the generator side, the acoustic model may be preliminarily trained according to the training method described above, so as to minimize the difference between the phoneme information obtained by the acoustic model and the preset phoneme information. On the basis of the training method provided in this embodiment, combined with the training method of the previous embodiment, the acoustic model can therefore be trained more finely, ensuring that its accuracy in judging audio is as high as possible.
In the flow of the model training method according to another embodiment of the present application, the generative adversarial network model includes a discrimination module and a generation module, and step 140 includes:
Step C1, training the generation module and the acoustic model until the second feature information output by the generation module meets a second preset condition.
Step C2, training the discrimination module until a second error rate between the first feature information and the second feature information output by the generation module meets a third preset condition.
As is apparent from the foregoing embodiments, the present application, on the basis of a variational auto-encoder, incorporates a VAWGAN network into the decoder to enhance the VAE effect. The VAWGAN includes two parts: a generator for producing a synthesized spectrum, and a discriminator for judging whether the synthesized spectrum is a real spectrum. It can be understood that the decoder comprises the generator and the discriminator.
In the VAWGAN network, the objective function is:

$$J_{vawgan} = L(x;\phi,\theta) + \alpha J_{wgan} \tag{4}$$
where $L(x;\phi,\theta)$ is the objective function of the encoder portion:

$$L(x;\phi,\theta) = -D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) + E_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \tag{5}$$

where $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z))$ denotes the relative entropy (Kullback-Leibler divergence, KL divergence) term, and the prior probability $p_\theta(z)$ is a standard multidimensional Gaussian distribution. $q_\phi(z|x)$ and $p_\theta(x|z)$ are the encoder and the decoder, respectively, each following a multidimensional Gaussian distribution with mean vector and covariance $(\mu_\phi(z),\sigma_\phi(z))$ and $(\mu_\theta(x),\sigma_\theta(x))$, respectively. Thus, the two terms on the right can be simplified as:

$$-D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) = \frac{1}{2}\sum_{k=1}^{K}\big(1 + \log\sigma_k^{2} - \mu_k^{2} - \sigma_k^{2}\big)$$

$$E_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \approx \frac{1}{L}\sum_{l=1}^{L}\log p_\theta\big(x \mid z^{(l)}\big)$$

where K is the dimension of the intermediate variable z and L is the number of times $q_\phi(z|x)$ is sampled. Since the sampling process is a discontinuous operation and cannot be differentiated, the network parameters of the encoder and decoder cannot be updated by back-propagation. Another random variable ε is therefore introduced to re-parameterize the hidden variable z, such that $z^{(l)} = \mu_\theta(x) + \varepsilon^{(l)}\,\sigma_\theta(x)$ with $\varepsilon^{(l)} \sim N(0,I)$, where D is the number of samples of x.
Thus, the objective loss function of the optimized VAWGAN network can be obtained.
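As an illustration of the re-parameterization trick mentioned above (not the patent's code), a short PyTorch sketch; the tensor sizes are assumptions:

```python
# Hypothetical sketch of the re-parameterization trick: z = mu + eps * sigma.
import torch

frames, z_dim = 200, 64                  # assumed sizes
mu = torch.randn(frames, z_dim)          # mean predicted by the encoder
log_var = torch.randn(frames, z_dim)     # log-variance predicted by the encoder

eps = torch.randn_like(mu)               # eps ~ N(0, I)
z = mu + eps * torch.exp(0.5 * log_var)  # differentiable sample of the latent z

# Closed-form KL term against a standard Gaussian prior (summed over dimensions).
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
```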
The parameters of the acoustic model A(x) change dynamically with the training process according to the loss function, so that the model keeps converging, and the encoder output z changes dynamically accordingly.
Based on the above, the following explains how the acoustic model A(x) and the VAWGAN are trained simultaneously so that the acoustic model A(x) achieves a better effect.
$J_{wgan}$ denotes the objective function of the WGAN part of the VAWGAN:
where α is the loss coefficient of the VAWGAN, and $D_\psi$ is the discriminator's decision output on whether a feature is real or fake. A(x) is combined with z, fed into the generator, and the result is then judged by the discriminator. The latter half of the above expression is the loss function of the generator's two-dimensional convolutional neural network:
Since the acoustic model A(x) also requires continual parameter optimization in this process, the objective function for optimizing the generator becomes:
where min indicates minimizing the generator and acoustic-model loss and solving for the optimal parameters of the generator and the acoustic model A; the latter half is the acoustic-model loss function, which is combined with the generator loss function so that the overall loss reaches an optimal value.
Because the acoustic-model loss is added to the generator optimization, the loss function of the discriminator's two-dimensional convolutional neural network becomes:
The objective function for optimizing the discriminator is:
where max indicates maximizing the discriminator's loss function, i.e., the discriminator's objective is to maximize the gap between real features and fake features, thereby continually optimizing the discriminator's model parameters.
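For orientation only, a standard WGAN-style form consistent with the surrounding description (an assumption, not necessarily the patent's exact equations; $L_{CE}(A)$ stands for the acoustic model's cross-entropy loss from equation (1)) is:

$$J_{wgan} = E_{x}\big[D_\psi(x)\big] - E_{z}\big[D_\psi\big(G_\theta(z, A(x))\big)\big]$$

$$\min_{\theta,\,A}\; L_G = -E_{z}\big[D_\psi\big(G_\theta(z, A(x))\big)\big] + L_{CE}(A)$$

$$\max_{\psi}\; L_D = E_{x}\big[D_\psi(x)\big] - E_{z}\big[D_\psi\big(G_\theta(z, A(x))\big)\big]$$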
In this embodiment, the decoder is composed of the generator and the discriminator. In the training process, the parameters of the discriminator are first fixed and the generator and the acoustic model are trained so that the overall generator loss function $L_G$ is as small as possible, i.e., the second feature information meets the second preset condition, yielding the Fbank feature x' (that is, the second feature information). Then the parameters of the generator and the acoustic model are fixed and the discriminator is trained so that the discriminator loss function $L_D$ is as large as possible, i.e., $-L_D$ is minimized, which means the second error rate between the first feature information and the second feature information meets the third preset condition.
In one interpretation, the two steps in this embodiment are repeated alternately. For example, in the first round, step C1 is performed to train the generator and the acoustic model so that the overall generator loss $L_G$ is as small as possible given the discriminator's initial parameters, and step C2 is then performed to train the discriminator based on the generator and acoustic-model parameters obtained in this round. In the second round, because the discriminator's parameters were adjusted in step C2 of the first round, step C1 is performed again so that the generator loss $L_G$ is as small as possible given the adjusted parameters, followed by step C2 again, and so on until the number of iterations is completed.
Correspondingly, the second error rate is used to represent the error rate obtained in one repetition step, and the first error rate is used to represent the error rate obtained in the final training.
In another interpretation, the two steps in this embodiment each represent a class of steps. For example, step C1 represents all steps of training the generator and the acoustic model and may summarize multiple repeated steps, and step C2 represents all steps of training the discriminator and may likewise summarize multiple repeated steps.
Correspondingly, the first error rate and the second error rate are used to represent the error rate resulting from the final training.
In this explanation, the order is not limited between the steps C1 and C2.
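A condensed sketch of this alternating scheme (illustrative only; the tiny stand-in modules, sizes, and hyperparameters are assumptions rather than the patent's implementation):

```python
# Hypothetical sketch of the alternating VAWGAN-style training described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_mels, n_phones, z_dim = 80, 218, 64            # assumed sizes
encoder = nn.Linear(n_mels, z_dim)               # stand-in for the C-VAE encoder
acoustic_model = nn.Linear(n_mels, n_phones)     # stand-in for A(x)
generator = nn.Linear(n_phones + z_dim, n_mels)  # stand-in generator
discriminator = nn.Linear(n_mels, 1)             # stand-in discriminator

opt_g = torch.optim.SGD(list(generator.parameters()) + list(encoder.parameters())
                        + list(acoustic_model.parameters()), lr=1e-4)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=1e-4)

fbank = torch.randn(200, n_mels)                 # 200 frames of Fbank features
labels = torch.randint(0, n_phones, (200,))      # aligned phoneme labels

for _ in range(5):
    # Step C1: keep the discriminator fixed, train generator + acoustic model (+ encoder).
    z = encoder(fbank)                           # semantic latent per frame
    post = F.softmax(acoustic_model(fbank), -1)  # phoneme posteriors A(x)
    fake = generator(torch.cat([post, z], -1))   # second feature information x'
    loss_g = -discriminator(fake).mean() + F.cross_entropy(acoustic_model(fbank), labels)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Step C2: keep generator/acoustic model fixed, train the discriminator.
    loss_d = -(discriminator(fbank).mean() - discriminator(fake.detach()).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```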
The generator adopts a two-dimensional convolutional neural network comprising 4 convolutional layers. The filter sizes of the 4 convolutional layers are 9×1, 7×1, 7×1, and 1025×1, the strides are 3, 3, and 1, the filter depths are 32, 16, 8, and 1, and the activation function is LReLU.
The discriminator employs a two-dimensional convolutional neural network comprising 3 convolutional layers and 1 fully-connected layer. The filter sizes of the 3 convolutional layers are 7×1, 7×1, and 115×1, the strides are all 3, the filter depths are 16, 32, and 64, and the activation function is LReLU.
The acoustic model has the same structure as the encoder and adopts a two-dimensional convolutional neural network comprising 5 convolutional layers and 1 fully-connected layer. The filter sizes of the 5 convolutional layers are all 7×1, the strides are all 3, the filter depths are 16, 32, 64, 128, and 256, and the activation function is LReLU. The network structure is shown in FIG. 3. Stochastic gradient descent (SGD) is used to update the network model parameters during training.
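For readers who prefer code, a sketch of the acoustic-model topology just described (the 1025-point spectral input layout, matching the 1025×1 generator filter, and the number of output phonemes are assumptions not stated here):

```python
# Hypothetical sketch of the acoustic model topology described above:
# 5 conv layers with 7x1 filters, stride 3, depths 16/32/64/128/256, LReLU, then 1 FC layer.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_phones=218):          # number of phoneme classes is assumed
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in (16, 32, 64, 128, 256):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=(7, 1), stride=(3, 1)),
                       nn.LeakyReLU(0.2)]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.fc = nn.LazyLinear(n_phones)       # fully-connected output layer

    def forward(self, feats):                   # feats: (batch, 1, 1025, frames)
        h = self.conv(feats)                    # -> (batch, 256, 2, frames)
        h = h.permute(0, 3, 1, 2).flatten(2)    # -> (batch, frames, 256 * 2)
        return self.fc(h)                       # per-frame phoneme scores

scores = AcousticModel()(torch.randn(1, 1, 1025, 50))   # -> (1, 50, 218)
```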
In this embodiment, a method for training a voice wake-up acoustic model based on a generative adversarial network is provided to improve the modeling effect of the acoustic model. The training of the voice wake-up system is realized by combining a generative adversarial network based on a variational auto-encoder with the acoustic model. By combining the VAWGAN network with the acoustic model, the modeling quality of the acoustic model can be improved substantially, realizing high-quality voice wake-up.
In the flow of the model training method according to another embodiment of the present application, step 120 includes:
Step D1, outputting phoneme information corresponding to the target frames of each audio in the audio training data through the acoustic model to be trained and the first feature information.
Step D2, outputting semantic information corresponding to the target frames of each audio in the audio training data through the generative adversarial network model to be trained and the first feature information.
Optionally, the target frame comprises each frame of each audio in the audio training data.
Optionally, the target frames comprise a subset of the frames of each audio in the audio training data. These frames can be sampled at a certain interval to ensure that the target frames are evenly distributed over the audio training data.
Correspondingly, based on the phoneme information obtained for a target frame, the semantic information of the corresponding frame is obtained, so that the phoneme information and the semantic information of any given frame can be combined to generate a fake feature for that frame, which is compared with the Fbank feature of the same frame.
Further, feature comparison is performed on each frame in the target frames in sequence, so that feature comparison of the whole audio training data is completed.
This embodiment provides a method of acquiring the phoneme information and semantic information of the audio training data in more detail. For the audio training data, phoneme information and semantic information are acquired at regular intervals for the target frames so as to generate a synthesized feature for each such frame, which is then compared with the real feature of that frame. The overall situation of the audio training data can therefore be inferred from the feature comparison on the target frames and used for model training in this application.
Referring to fig. 4, a flowchart of a voice wake-up method according to another embodiment of the present application is shown, and the method is applied to an electronic device, and includes:
and 150, acquiring third characteristic information of the first audio.
The model training method and the voice wake-up method provided by the application correspond to two stages: the training stage of the foregoing embodiments and the wake-up stage of this embodiment.
In this step, the third feature information is used to represent Fbank features of the first audio.
In this embodiment, the first audio may be an audio stream. A segment of audio is streamed into a memory buffer with a frame length of 10 ms; frames can be skipped (for example, 1 frame is sent out of every 3 frames) to reduce the amount of computation, and features are then extracted.
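A toy sketch of the buffering and frame-skipping idea (the 10 ms frame length and 1-in-3 skip come from the example above; everything else is assumed):

```python
# Hypothetical sketch: buffer 10 ms frames and forward only 1 out of every 3
# to the feature extractor to reduce computation.
import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN = SAMPLE_RATE // 100          # 10 ms -> 160 samples
SKIP = 3                                # send 1 frame out of every 3

def frames_to_process(audio_buffer):
    """Yield every third 10 ms frame from the buffered audio stream."""
    n_frames = len(audio_buffer) // FRAME_LEN
    for i in range(0, n_frames, SKIP):
        yield audio_buffer[i * FRAME_LEN:(i + 1) * FRAME_LEN]

stream = np.random.randn(SAMPLE_RATE)          # 1 s of fake buffered audio
selected = list(frames_to_process(stream))     # frames sent on for Fbank extraction
```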
Step 160, outputting the first phoneme information of the first audio through the acoustic model and the third characteristic information.
Wherein the acoustic model is trained by the model training method in any of the above embodiments.
The extracted Fbank features are input into the acoustic model trained in the foregoing embodiments for inference, obtaining the corresponding first phoneme information of the first audio.
Wherein the first phoneme information comprises a phoneme probability matrix.
Step 170, outputting a wake-up instruction under the condition that the first phoneme information is matched with the preset phoneme information of the wake-up audio.
The wake-up instruction is used for waking up the terminal equipment and is applied to a voice wake-up function.
This step corresponds to Viterbi decoding.
In this step, the phoneme probability matrix obtained in step 160 is fed into the decoding graph of the wake-up audio and decoded with the Viterbi algorithm to obtain a score. If the score is greater than a certain threshold, the device is woken up; otherwise, the next frame of data continues to be sent.
The score can be understood as the degree of association between the first phoneme information and the preset phoneme information of the wake-up audio; if this degree of association is greater than a certain threshold, the first phoneme information matches the preset phoneme information of the wake-up audio.
Illustratively, based on the first phoneme information, the most probable phoneme label for each frame of the incoming audio stream can be obtained through decoding and compared with the preset phoneme label of the corresponding frame of the wake-up audio; if the similarity is greater than a set value, the degree of association between the first phoneme information and the preset phoneme information of the wake-up audio is greater than the threshold.
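The following toy sketch mirrors only the simplified per-frame comparison just described, not a full Viterbi decode over a decoding graph; the array shapes and threshold are assumptions:

```python
# Hypothetical, simplified wake-up decision: compare per-frame argmax phonemes
# against the wake word's preset phoneme labels (not a real Viterbi decode).
import numpy as np

def should_wake(phoneme_posteriors, wake_labels, threshold=0.8):
    """phoneme_posteriors: (frames, n_phones); wake_labels: (frames,) preset labels."""
    best = phoneme_posteriors.argmax(axis=1)          # most probable phoneme per frame
    similarity = float(np.mean(best == wake_labels))  # fraction of matching frames
    return similarity > threshold                     # wake only above the threshold

posts = np.random.rand(40, 218)           # fake posteriors for 40 frames
wake_labels = posts.argmax(axis=1)        # pretend these are the preset labels
print(should_wake(posts, wake_labels))    # -> True in this contrived example
```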
Thus, based on the foregoing embodiments, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. Therefore, in the wake-up stage, inference by the trained acoustic model on the received first audio yields more accurate phoneme information of the first audio, so that, after comparison with the wake-up audio, a terminal device such as a mobile phone can be woken up accurately and in a timely manner.
In summary, the modeling process for voice wake-up typically trains an acoustic model to build a mapping between speech features and phonemes, and then decodes the speech features using an optimal-path algorithm. However, because of low-power-consumption and fast-response requirements, the resources available to the acoustic model are limited, which often leads to inaccurate judgments and therefore to wake-up failures or false wake-ups. For this reason, in the training stage of the voice wake-up acoustic model, a generative adversarial network based on a variational auto-encoder is used in combination with the acoustic model, which better improves the modeling quality of the acoustic model, makes phoneme inference more accurate, reduces false wake-ups, and improves the wake-up rate.
It should be noted that, in the model training method provided by the embodiment of the present application, the execution subject may be a model training device, or a control module in the model training device for executing the model training method. In the embodiment of the application, a model training device is taken as an example to execute a model training method, and the model training device provided by the embodiment of the application is described.
Fig. 5 shows a block diagram of a model training apparatus according to another embodiment of the present application, the apparatus comprising:
A first obtaining module 10, configured to obtain first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio;
A first output module 20, configured to output phoneme information and semantic information of the audio training data through the acoustic model to be trained, the generative adversarial network model to be trained, and the first feature information;
a second output module 30, configured to output second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information;
The training module 40 is configured to train the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
Thus, in the embodiments of the present application, the acoustic model used by the voice wake-up function needs to be trained so that it judges audio with high accuracy. First, a large amount of audio, including wake-up audio and non-wake-up audio, is used as audio training data, first feature information is extracted from it, and the first feature information is input into the acoustic model, which outputs phoneme information of the audio training data. Second, the first feature information is input into the generative adversarial network model, which outputs semantic information of the audio training data. Then, the phoneme information is input into the generative adversarial network model, which combines the semantic information and the phoneme information and outputs second feature information of the audio training data. Further, the acoustic model and the generative adversarial network model are trained based on the output second feature information and the first feature information, so that the difference between the second feature information and the first feature information is minimized. Therefore, in the embodiments of the present application, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. This achieves the training goal of the acoustic model: the acoustic model judges audio more accurately, the accuracy of recognizing wake-up audio is improved, and wake-up failures and false wake-ups are avoided.
Optionally, training module 40 includes:
The first training unit is used for training the acoustic model and the generative adversarial network model until a first error rate between the first feature information and the second feature information meets a first preset condition.
Optionally, the generative adversarial network model includes a discrimination module and a generation module, and the training module 40 includes:
a second training unit, used for training the generation module and the acoustic model until the second feature information output by the generation module meets a second preset condition;
and a third training unit, used for training the discrimination module until a second error rate between the first feature information and the second feature information output by the generation module meets a third preset condition.
Optionally, the first output module 20 includes:
The first output unit is used for outputting phoneme information corresponding to the target frames of each audio in the audio training data through the acoustic model to be trained and the first feature information;
the second output unit is used for outputting semantic information corresponding to the target frames of each audio in the audio training data through the generative adversarial network model to be trained and the first feature information.
It should be noted that, in the voice wake-up method provided by the embodiment of the present application, the execution body may be a voice wake-up device, or a control module in the voice wake-up device for executing the voice wake-up method. In the embodiment of the application, a voice wake-up device executes a voice wake-up method as an example, and the voice wake-up device provided by the embodiment of the application is described.
FIG. 6 shows a block diagram of a voice wake-up apparatus according to another embodiment of the present application, the apparatus comprising:
a second obtaining module 50, configured to obtain third feature information of the first audio;
A third output module 60 for outputting first phoneme information of the first audio through the acoustic model and the third characteristic information;
A fourth output module 70, configured to output a wake-up instruction if the first phoneme information matches with preset phoneme information of the wake-up audio;
Wherein the acoustic model is trained by the model training method of any of the foregoing embodiments.
Thus, based on the foregoing embodiments, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. Therefore, in the wake-up stage, inference by the trained acoustic model on the received first audio yields more accurate phoneme information of the first audio, so that, after comparison with the wake-up audio, a terminal device such as a mobile phone can be woken up accurately and in a timely manner.
The model training device/voice wake-up device in the embodiments of the application can be a device, or a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), etc., and the non-mobile electronic device may be a server, a network attached storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, etc.; the embodiments of the present application are not particularly limited in this respect.
The model training device/voice wake-up device in the embodiments of the application can be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the application.
The model training device/voice wake-up device provided by the embodiment of the application can realize each process realized by the corresponding method embodiment, and in order to avoid repetition, the description is omitted here.
Optionally, as shown in fig. 7, the embodiment of the present application further provides an electronic device 100, including a processor 101, a memory 102, and a program or an instruction stored in the memory 102 and capable of running on the processor 101, where the program or the instruction implements each process of any one of the foregoing model training method/the voice wake-up method embodiments when executed by the processor 101, and the process can achieve the same technical effect, so that repetition is avoided and redundant description is omitted herein.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1000 includes, but is not limited to, a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, and a processor 1010.
Those skilled in the art will appreciate that the electronic device 1000 may also include a power source (e.g., a battery) for powering the various components, which may be logically connected to the processor 1010 by a power management system to perform functions such as managing charge, discharge, and power consumption by the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.
The processor 1010 is configured to obtain first feature information of audio training data, where the audio training data includes wake-up audio and non-wake-up audio; output phoneme information and semantic information of the audio training data through an acoustic model to be trained, a generative adversarial network model to be trained, and the first feature information; output second feature information of the audio training data through the generative adversarial network model to be trained, the phoneme information, and the semantic information; and train the acoustic model and the generative adversarial network model according to the first feature information and the second feature information.
Thus, in the embodiments of the present application, the acoustic model used by the voice wake-up function needs to be trained so that it judges audio with high accuracy. First, a large amount of audio, including wake-up audio and non-wake-up audio, is used as audio training data, first feature information is extracted from it, and the first feature information is input into the acoustic model, which outputs phoneme information of the audio training data. Second, the first feature information is input into the generative adversarial network model, which outputs semantic information of the audio training data. Then, the phoneme information is input into the generative adversarial network model, which combines the semantic information and the phoneme information and outputs second feature information of the audio training data. Further, the acoustic model and the generative adversarial network model are trained based on the output second feature information and the first feature information, so that the difference between the second feature information and the first feature information is minimized. Therefore, in the embodiments of the present application, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. This achieves the training goal of the acoustic model: the acoustic model judges audio more accurately, the accuracy of recognizing wake-up audio is improved, and wake-up failures and false wake-ups are avoided.
Optionally, the processor 1010 is further configured to train the acoustic model and the generative adversarial network model until a first error rate between the first feature information and the second feature information meets a first preset condition.
Optionally, the generative adversarial network model includes a discrimination module and a generation module, and the processor 1010 is further configured to train the generation module and the acoustic model until the second feature information output by the generation module meets a second preset condition, and to train the discrimination module until a second error rate between the first feature information and the second feature information output by the generation module meets a third preset condition.
Optionally, the processor 1010 is further configured to output phoneme information corresponding to the target frames of each audio in the audio training data through the acoustic model to be trained and the first feature information, and to output semantic information corresponding to the target frames of each audio in the audio training data through the generative adversarial network model to be trained and the first feature information.
The processor 1010 is configured to obtain third feature information of a first audio, output first phoneme information of the first audio through an acoustic model and the third feature information, and output a wake-up instruction when the first phoneme information matches preset phoneme information of the wake-up audio, where the acoustic model is obtained by training in the foregoing scenario.
Thus, based on the foregoing embodiments, the two audio features, phoneme information and semantic information, are combined to enhance the representation of the semantic feature information of the audio and thereby train the models of the whole function. Therefore, in the wake-up stage, inference by the trained acoustic model on the received first audio yields more accurate phoneme information of the first audio, so that, after comparison with the wake-up audio, a terminal device such as a mobile phone can be woken up accurately and in a timely manner.
In summary, the modeling process for voice wake-up typically trains an acoustic model to build a mapping between speech features and phonemes, and then decodes the speech features using an optimal-path algorithm. However, because of low-power-consumption and fast-response requirements, the resources available to the acoustic model are limited, which often leads to inaccurate judgments and therefore to wake-up failures or false wake-ups. For this reason, in the training stage of the voice wake-up acoustic model, a generative adversarial network based on a variational auto-encoder is used in combination with the acoustic model, which better improves the modeling quality of the acoustic model, makes phoneme inference more accurate, reduces false wake-ups, and improves the wake-up rate.
It should be appreciated that, in embodiments of the present application, the input unit 1004 may include a graphics processing unit (GPU) 10041 and a microphone 10042, where the GPU 10041 processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The display unit 1006 may include a display panel 10061, and the display panel 10061 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 1007 includes a touch panel 10071 and other input devices 10072. The touch panel 10071 is also referred to as a touch screen and may include two parts: a touch detection device and a touch controller. The other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail here. The memory 1009 may be used to store software programs and various data, including but not limited to application programs and an operating system. The processor 1010 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, and the like, with a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor may not be integrated into the processor 1010.
The embodiment of the application also provides a readable storage medium. The readable storage medium stores a program or instructions which, when executed by a processor, implement each process of the above model training method / voice wake-up method embodiments and can achieve the same technical effect; to avoid repetition, the details are not described again here.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled with the processor, and the processor is configured to run a program or instructions to implement each process of the above model training method / voice wake-up method embodiments and achieve the same technical effect; to avoid repetition, the details are not described again here.
It should be understood that the chip referred to in the embodiments of the present application may also be called a system-level chip, a chip system, or a system-on-chip, etc.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, although in many cases the former is preferred. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and including instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art may devise many further forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.
Claims (14)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111137419.8A CN113851113B (en) | 2021-09-27 | 2021-09-27 | Model training method and device, voice wake-up method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111137419.8A CN113851113B (en) | 2021-09-27 | 2021-09-27 | Model training method and device, voice wake-up method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113851113A CN113851113A (en) | 2021-12-28 |
| CN113851113B true CN113851113B (en) | 2025-06-20 |
Family
ID=78980147
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111137419.8A Active CN113851113B (en) | 2021-09-27 | 2021-09-27 | Model training method and device, voice wake-up method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113851113B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114842855B (en) * | 2022-04-06 | 2025-09-05 | 北京百度网讯科技有限公司 | Voice wake-up model training, wake-up method, device, equipment and storage medium |
| CN114822576B (en) * | 2022-04-30 | 2024-08-13 | 中国人民解放军总医院第一医学中心 | Communication system voice enhancement method based on magnetic resonance pulse sequence noise estimation |
| CN115132185A (en) * | 2022-06-29 | 2022-09-30 | 中国银行股份有限公司 | Speech recognition model training method, speech recognition method and related equipment |
| CN115936091B (en) * | 2022-11-24 | 2024-03-08 | 北京百度网讯科技有限公司 | Training method and device for deep learning model, electronic equipment and storage medium |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102204979B1 (en) * | 2018-08-24 | 2021-01-19 | 네이버 주식회사 | Method and system for generating multi-turn conversation response using deep learing generation model and multi-modal distribution |
| TW202029181A (en) * | 2019-01-28 | 2020-08-01 | 正崴精密工業股份有限公司 | Method and apparatus for specific user to wake up by speech recognition |
| CN112908317B (en) * | 2019-12-04 | 2023-04-07 | 中国科学院深圳先进技术研究院 | Voice recognition system for cognitive impairment |
| CN111354374A (en) * | 2020-03-13 | 2020-06-30 | 北京声智科技有限公司 | Speech processing method, model training method and electronic device |
| CN112420050B (en) * | 2020-11-18 | 2021-06-18 | 北京帝派智能科技有限公司 | Voice recognition method and device and electronic equipment |
| CN113284485B (en) * | 2021-07-09 | 2021-11-09 | 中国科学院自动化研究所 | End-to-end system for unified Chinese and English mixed text generation and voice recognition |
- 2021-09-27 CN CN202111137419.8A patent/CN113851113B/en active Active
Non-Patent Citations (1)
| Title |
|---|
| Research on Transformer Acoustic Models Based on Gated Generative Adversarial Networks; Lyu Xudong; China Master's Theses Full-text Database; 2023-08-15; full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113851113A (en) | 2021-12-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113851113B (en) | Model training method and device, voice wake-up method and device | |
| US10332507B2 (en) | Method and device for waking up via speech based on artificial intelligence | |
| US11657799B2 (en) | Pre-training with alignments for recurrent neural network transducer based end-to-end speech recognition | |
| EP4336490A1 (en) | Voice processing method and related device | |
| CN107134279B (en) | Voice awakening method, device, terminal and storage medium | |
| US7529671B2 (en) | Block synchronous decoding | |
| CN111833845A (en) | Multi-language speech recognition model training method, device, equipment and storage medium | |
| CN114127849B (en) | Speech emotion recognition method and device | |
| CN113450771B (en) | Awakening method, model training method and device | |
| US10762417B2 (en) | Efficient connectionist temporal classification for binary classification | |
| CN112466288A (en) | Voice recognition method and device, electronic equipment and storage medium | |
| CN111312222A (en) | Awakening and voice recognition model training method and device | |
| CN111653274B (en) | Wake-up word recognition method, device and storage medium | |
| CN112652306A (en) | Voice wake-up method and device, computer equipment and storage medium | |
| CN111508493A (en) | Voice wake-up method, device, electronic device and storage medium | |
| CN114360510B (en) | A speech recognition method and related device | |
| EP4629232A1 (en) | Interaction method and apparatus, device, and storage medium | |
| KR20230141932A (en) | Adaptive visual speech recognition | |
| CN116310983A (en) | Multi-mode emotion recognition method and device | |
| WO2021139182A1 (en) | Effective intelligent voice detection method and apparatus, device and computer-readable storage medium | |
| CN118644596B (en) | Face key point moving image generation method and related equipment | |
| CN114842855A (en) | Training of voice wake-up model, wake-up method, device, equipment and storage medium | |
| CN114519999A (en) | Speech recognition method, device, equipment and storage medium based on bimodal model | |
| CN116705013B (en) | Voice wake-up word detection method and device, storage medium and electronic equipment | |
| CN112261321A (en) | Subtitle processing method and device and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |