Voiceprint recognition method, apparatus, electronic device, and medium
Technical Field
The present application belongs to the technical field of identity authentication, and in particular relates to a voiceprint recognition method, apparatus, electronic device, and medium.
Background
Voiceprint recognition, also known as speaker recognition, is used to determine which of several persons spoke a given segment of speech, or to confirm whether a given segment of speech was spoken by a designated person. It is a technology that automatically identifies a speaker's identity from speech parameters in the speech waveform that reflect the speaker's physiological and behavioral characteristics. At present, voiceprint recognition is widely applied in the Internet, banking systems, public security, and judicial fields. A voiceprint is the sound-wave spectrum, displayed by electro-acoustic instruments, that carries speech information. Each person's acoustic speech characteristics are relatively stable yet variable, rather than absolute and immutable; such variation may arise from physiology, pathology, psychology, imitation, or disguise, and is also related to environmental interference.
Mainstream voiceprint recognition methods in the industry generally require the speaker's voiceprint to be modeled first, typically by pre-training a universal background model. Existing voiceprint models mainly use a Gaussian mixture model to train the universal background model. Because a Gaussian mixture background model obtained by unsupervised training contains no class information about the sample data, it merely represents the characteristics of all speakers in the speaker space and is a single, speaker-independent background model. It is therefore difficult to accurately discriminate the distinguishing characteristics of individual speakers, which ultimately leads to low accuracy when recognizing a speaker's voiceprint.
Technical Problem
Embodiments of the present invention provide a voiceprint recognition method, apparatus, electronic device, and medium, so as to solve the problem in the prior art that the distinguishing characteristics of speakers are difficult to discriminate accurately, resulting in low voiceprint recognition accuracy.
Technical Solution
A first aspect of the embodiments of the present invention provides a voiceprint recognition method, including:
preprocessing each of K input speech segments to obtain the valid speech in each speech segment, the speech segments including training speech and speech to be recognized;
extracting Mel-frequency cepstral coefficient (MFCC) acoustic features from the valid speech of each training speech segment, and outputting a first feature matrix containing the MFCC dimension and the number of frames of each training speech segment;
constructing a long short-term memory (LSTM) recurrent neural network model, and inputting the first feature matrix into the neural network model to obtain output parameters of the neural network model;
training, using the output parameters of the neural network model and the speaker feature corresponding to each training speech segment, N feature extraction matrices for the N training speech segments, each feature extraction matrix corresponding to a speaker model of one training speech segment;
extracting MFCC acoustic features from the valid speech of the speech to be recognized, and outputting a second feature matrix containing the MFCC dimension and the number of frames of the speech to be recognized;
selecting, from the N speaker models and according to a preset similarity measurement algorithm, the speaker model that matches the second feature matrix, and outputting the speaker corresponding to the selected speaker model as the voiceprint recognition result of the speech to be recognized;
wherein K and N are integers greater than zero, and K is greater than N.
A second aspect of the embodiments of the present invention provides a voiceprint recognition apparatus, including:
a preprocessing module, configured to preprocess each of K input speech segments to obtain the valid speech in each speech segment, the speech segments including training speech and speech to be recognized;
a first extraction module, configured to extract MFCC acoustic features from the valid speech of each training speech segment, and to output a first feature matrix containing the MFCC dimension and the number of frames of each training speech segment;
a construction module, configured to construct an LSTM recurrent neural network model, and to input the first feature matrix into the neural network model to obtain output parameters of the neural network model;
a training module, configured to train, using the output parameters of the neural network model and the speaker feature corresponding to each training speech segment, N feature extraction matrices for the N training speech segments, each feature extraction matrix corresponding to a speaker model of one training speech segment;
a second extraction module, configured to extract MFCC acoustic features from the valid speech of the speech to be recognized, and to output a second feature matrix containing the MFCC dimension and the number of frames of the speech to be recognized;
a recognition module, configured to select, from the N speaker models and according to a preset similarity measurement algorithm, the speaker model that matches the second feature matrix, and to output the speaker corresponding to the selected speaker model as the voiceprint recognition result of the speech to be recognized;
wherein K and N are integers greater than zero, and K is greater than N.
A third aspect of the embodiments of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
preprocessing each of K input speech segments to obtain the valid speech in each speech segment, the speech segments including training speech and speech to be recognized;
extracting MFCC acoustic features from the valid speech of each training speech segment, and outputting a first feature matrix containing the MFCC dimension and the number of frames of each training speech segment;
constructing an LSTM recurrent neural network model, and inputting the first feature matrix into the neural network model to obtain output parameters of the neural network model;
training, using the output parameters of the neural network model and the speaker feature corresponding to each training speech segment, N feature extraction matrices for the N training speech segments, each feature extraction matrix corresponding to a speaker model of one training speech segment;
extracting MFCC acoustic features from the valid speech of the speech to be recognized, and outputting a second feature matrix containing the MFCC dimension and the number of frames of the speech to be recognized;
selecting, from the N speaker models and according to a preset similarity measurement algorithm, the speaker model that matches the second feature matrix, and outputting the speaker corresponding to the selected speaker model as the voiceprint recognition result of the speech to be recognized;
wherein K and N are integers greater than zero, and K is greater than N.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program, the computer program implementing the following steps when executed by at least one processor:
preprocessing each of K input speech segments to obtain the valid speech in each speech segment, the speech segments including training speech and speech to be recognized;
extracting MFCC acoustic features from the valid speech of each training speech segment, and outputting a first feature matrix containing the MFCC dimension and the number of frames of each training speech segment;
constructing an LSTM recurrent neural network model, and inputting the first feature matrix into the neural network model to obtain output parameters of the neural network model;
training, using the output parameters of the neural network model and the speaker feature corresponding to each training speech segment, N feature extraction matrices for the N training speech segments, each feature extraction matrix corresponding to a speaker model of one training speech segment;
extracting MFCC acoustic features from the valid speech of the speech to be recognized, and outputting a second feature matrix containing the MFCC dimension and the number of frames of the speech to be recognized;
selecting, from the N speaker models and according to a preset similarity measurement algorithm, the speaker model that matches the second feature matrix, and outputting the speaker corresponding to the selected speaker model as the voiceprint recognition result of the speech to be recognized;
wherein K and N are integers greater than zero, and K is greater than N.
Beneficial Effects
In the embodiments of the present invention, the voiceprint background model is trained in a supervised manner. By incorporating speaker features, a more suitable set of acoustic features can be mined from the original training speech data, so that the distinguishing characteristics of speakers can be discriminated more accurately, and better voiceprint recognition results can be obtained in scenarios with overlapping speech. Because the main recognition process is implemented on the basis of a deep neural network model, a more robust speaker model can be learned, solving the problem of low recognition accuracy in existing voiceprint recognition methods.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a flowchart of an implementation of a voiceprint recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a specific implementation of step S101 of the voiceprint recognition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a specific implementation of step S102 of the voiceprint recognition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a specific implementation of step S103 of the voiceprint recognition method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a specific implementation of step S104 of the voiceprint recognition method according to an embodiment of the present invention;
FIG. 6 is a structural block diagram of a voiceprint recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Embodiments of the Invention
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, in order to provide a thorough understanding of the embodiments of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
In order to explain the technical solutions of the present invention, specific embodiments are described below.
Embodiments of the present invention are implemented on the basis of a time-recurrent deep neural network. The training of the speaker models relies on the acoustic features of the training speech to estimate and optimize the model parameters, with different speaker models representing the individual characteristics of different speakers. After the feature extraction matrix of the speech to be recognized is obtained, it is matched against multiple speaker models in turn, speaker models that do not satisfy the matching condition are eliminated, and finally the speaker corresponding to the speaker model that satisfies the matching condition is accepted as the voiceprint recognition result.
FIG. 1 shows an implementation flow of the voiceprint recognition method provided by an embodiment of the present invention, detailed as follows:
In S101, each of K input speech segments is preprocessed to obtain the valid speech in each speech segment, the speech segments including training speech and speech to be recognized.
In this embodiment, different speaker models are established by inputting a sufficiently large number of training speech segments. The training speech consists of labeled speech samples of known speaker identity, used to adjust the parameters of the speaker models so that, on the basis of supervised learning, the models achieve the required recognition performance in practical applications.
When it is necessary to determine which of several persons spoke a given segment of speech, or to confirm whether a given segment of speech was spoken by a designated person, that segment is the speech to be recognized. The training speech and the speech to be recognized serve different purposes and may be different or identical speech data. When they are identical, the speech to be recognized can be used to verify the performance of the finally obtained speaker models, testing whether they can accurately identify the speaker of the speech to be recognized.
The speech is preprocessed to reduce the background noise level in each continuous speech signal and to output valid speech of practical analytical significance. This provides a training set with a high signal-to-noise ratio for subsequent speaker model training, increases the speed of model training, and achieves a more accurate training result.
As another embodiment of the present invention, FIG. 2 shows a specific implementation flow of step S101 of the voiceprint recognition method provided by an embodiment of the present invention, detailed as follows:
In S201, pre-emphasis is applied to each of the K input speech segments to boost the high-frequency band of each speech signal.
In this embodiment, in order to reduce the influence of lip radiation and emphasize the high-frequency formants, each speech signal is passed through a high-pass filter that emphasizes the high-frequency portion of the speech, making the spectrum of the speech signal smoother.
In S202, a framing and windowing algorithm is used to convert each pre-emphasized speech signal into short-time stationary signals.
An appropriate number of sampling points is selected to divide each pre-emphasized speech signal into frames, so that each speech segment is converted into multiple frames of short-time speech signals. Each frame of the signal can be regarded as a stationary process, i.e., its statistical characteristics are stable.
In this embodiment, windowing means taking the original short-time speech signal as an integrand and multiplying it by a specific window function. A window function is a real-valued function that is zero everywhere outside a given interval; window functions include, but are not limited to, the rectangular, triangular, Hanning, and Hamming windows.
Preferably, in this embodiment the window function is a Hanning window.
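By way of illustration only, the pre-emphasis of S201 and the framing and windowing of S202 may be sketched as follows. The pre-emphasis coefficient 0.97 and the 400-sample frame with a 160-sample hop (25 ms / 10 ms at a 16 kHz sampling rate) are common choices assumed here; the embodiment does not fix specific values.

```python
import numpy as np

def preemphasis(signal, alpha=0.97):
    """High-pass pre-emphasis filter: y[n] = x[n] - alpha * x[n-1] (S201)."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames and apply a Hanning window (S202)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * window

x = preemphasis(np.random.randn(16000))   # 1 s of audio at an assumed 16 kHz
frames = frame_and_window(x)              # 25 ms frames, 10 ms hop
print(frames.shape)                       # (98, 400)
```

Each row of `frames` is one short-time stationary signal; the Hanning window tapers the frame edges to zero, which reduces spectral leakage in the later FFT step.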
In S203, noise and speech in the short-time stationary signals are distinguished on the basis of an endpoint detection algorithm, and the speech in the short-time stationary signals is output as the valid speech of each speech segment.
First, a relatively high short-time energy decision threshold is selected on the short-time power spectrum profile corresponding to the short-time speech signals, and a first coarse decision is made: the start and end points of the valid speech signal lie outside the time interval bounded by the intersections of this threshold with the short-time energy envelope.
Then, according to the average energy of the background noise, a lower short-time energy decision threshold is selected, and the two points where the short-time energy envelope intersects this threshold are taken as the start and end points of the valid speech signal, so that the valid speech can be extracted and output.
By applying pre-emphasis to the multiple input speech segments, the embodiments of the present invention avoid a marked drop in the output signal-to-noise ratio in the high-frequency band. By extracting the valid speech from the speech signals and filtering the noise out of the short-time stationary signals, the amount of computation during speaker model training is reduced, the speech processing time of the subsequent steps is shortened, noise interference from silent segments is eliminated, and the accuracy of speech recognition is improved.
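A minimal sketch of the double-threshold endpoint detection of S203 follows. The two energy thresholds are expressed here as illustrative fractions of the peak frame energy; the embodiment ties the lower threshold to the average background-noise energy but does not fix numeric values.

```python
import numpy as np

def endpoint_detect(frames, high_ratio=0.5, low_ratio=0.1):
    """Double-threshold endpoint detection on per-frame short-time energy.

    A coarse pass with a high threshold locates the speech core; a lower
    threshold then widens the boundaries outward to the final start/end
    frames of the valid speech.
    """
    energy = np.sum(frames ** 2, axis=1)      # short-time energy envelope
    high = energy.max() * high_ratio          # assumed coarse threshold
    low = energy.max() * low_ratio            # assumed fine threshold
    core = np.where(energy > high)[0]
    if core.size == 0:
        return None                           # no speech detected
    start, end = int(core[0]), int(core[-1])
    while start > 0 and energy[start - 1] > low:            # extend left
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > low:  # extend right
        end += 1
    return start, end

frames = np.zeros((100, 400))
frames[40:60] = 1.0                           # synthetic "speech" burst
print(endpoint_detect(frames))                # (40, 59)
```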
In S102, Mel-frequency cepstral coefficient (MFCC) acoustic features are extracted from the valid speech of each training speech segment, and a first feature matrix containing the MFCC dimension and the number of frames of each training speech segment is output.
The Mel frequency, proposed on the basis of the auditory characteristics of the human ear, has a nonlinear correspondence with frequency in Hz; using this nonlinear relationship, the Hz spectral features are computed.
The conversion formula between Hz and Mel frequency is: F_mel = 2595 * lg(1 + f_Hz / 700)
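The Hz-to-Mel conversion above, and its inverse, can be computed directly:

```python
import math

def hz_to_mel(f_hz):
    """F_mel = 2595 * lg(1 + f_Hz / 700), per the formula above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping from the Mel scale back to Hz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))               # ~1000 Mel near 1 kHz
print(round(mel_to_hz(hz_to_mel(4000))))    # round-trip recovers 4000 Hz
```

The near-linear behavior below 1 kHz and logarithmic compression above it is what lets the filter bank in S302 allocate finer resolution to the perceptually important low frequencies.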
As another embodiment of the present invention, FIG. 3 shows a specific implementation flow of step S102 of the voiceprint recognition method provided by an embodiment of the present invention, as follows:
In S301, the valid speech of each training speech segment is analyzed by fast Fourier transform to obtain the power spectrum of the valid speech.
After the valid speech extracted in the above embodiment undergoes a fast Fourier transform, the spectrum of each frame of valid speech is obtained; the magnitude of the spectrum is taken and then squared to obtain the power spectrum of each frame of valid speech. The different energy distributions exhibited in the power spectrum represent different characteristics of the speech.
In S302, the power spectrum is filtered by a Mel-scale filter bank containing M triangular filters, and the log energy output by each triangular filter is obtained.
The center frequencies of the M triangular filters are f(m), m = 1, 2, ..., M, where M preferably takes a value from 22 to 26.
In S303, the log energies are subjected to a discrete cosine transform, and the MFCC acoustic features of the valid speech are output.
In S304, according to the MFCC acoustic features, a first feature matrix containing the MFCC dimension and the number of frames of each training speech segment is output.
The energy of each frame of the valid speech signal, together with the log energies, forms a two-dimensional MFCC acoustic feature. Additional acoustic features such as pitch, zero-crossing rate, and formants are incorporated in this process, so that the output first feature matrix can be expressed as "MFCC dimension × number of frames", where the number of frames is the number of frames into which each originally input speech signal was divided during framing and windowing.
In the embodiments of the present invention, filtering the power spectrum of the valid speech through the triangular filters smooths the spectrum of each frame of valid speech, eliminates the effect of harmonics, and highlights the formants of the original speech signal corresponding to each frame of valid speech. Taking the first feature matrix containing the MFCC acoustic feature dimension as the input of the neural network model prevents the training of the neural network model from being affected by the pitch of the input speech, and reduces the amount of computation.
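As a sketch of S301 through S304 under assumed parameter values (a 512-point FFT, 26 triangular filters, 13 cepstral coefficients, 16 kHz sampling rate — none of which are fixed by the embodiment), the MFCC pipeline may look like:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters with centers evenly spaced on the Mel scale (S302)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                     # rising edge
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                     # falling edge
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frames, n_fft=512, n_filters=26, n_ceps=13):
    """S301-S304: FFT power spectrum -> Mel filter bank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft   # power spectrum
    logE = np.log(spec @ mel_filterbank(n_filters, n_fft).T + 1e-10)
    # Type-II DCT of the log filter-bank energies, keeping the first n_ceps
    n = logE.shape[1]
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * np.arange(n) + 1) / (2 * n))
    return logE @ basis.T

frames = np.random.randn(98, 400)      # windowed frames from S202
features = mfcc(frames)
print(features.shape)                  # (98, 13): frames x MFCC dimension
```

Transposed, `features` is exactly the "MFCC dimension × number of frames" first feature matrix described above.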
In S103, a long short-term memory recurrent neural network model is constructed, and the first feature matrix is input into the neural network model to obtain output parameters of the neural network model.
As another embodiment of the present invention, FIG. 4 shows a specific implementation flow of step S103 of the voiceprint recognition method provided by an embodiment of the present invention, detailed as follows:
In S401, an LSTM recurrent neural network model is initialized, the neural network model including an input layer, recurrent layers containing long short-term memory units, and an output layer.
In this embodiment, the neural network model contains multiple layers, and different layers serve different roles. Here, a five-layer network is taken as an example to describe the network structure of the LSTM recurrent neural network; it should be understood that in a practically applied network structure, the number of layers of the neural network is not limited to five.
In this embodiment, the open-source deep learning toolkit CNTK is used to initialize a five-layer LSTM recurrent neural network model. The network structure of this deep neural network (DNN) is: one input layer, three recurrent layers containing long short-term memory (LSTM) units, and one output layer. Each recurrent layer contains 1024 nodes and has a two-level structure, one level of which is a projection layer with 512 nodes.
The input to the LSTM recurrent layers is an 83-dimensional speech feature vector. Based on the context of the current frame, the preceding five frames, and the following five frames of valid speech, the input window advances by only one frame of valid speech per iteration, so a 913-dimensional feature vector in total (11 frames × 83 dimensions) serves as the input to the LSTM. After entering the LSTM recurrent layers, the 913-dimensional feature vector passes in turn through the 1024 hidden-layer memory units. The input and output feature vector dimensions of an LSTM recurrent layer are therefore the same.
The neural network structure can be trained using a stochastic gradient descent optimization method.
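The splicing of the current frame with its five left and five right neighbors into the 913-dimensional LSTM input can be sketched as follows; repeating the first and last frames at the segment edges is an assumption made here for illustration, as the embodiment does not specify its edge handling.

```python
import numpy as np

def stack_context(features, left=5, right=5):
    """Splice each 83-dim frame with its 5 left and 5 right neighbours
    (11 frames x 83 dims = 913 dims), advancing one frame at a time.
    Segment edges are padded by repeating the first/last frame (an
    illustrative assumption)."""
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    n = len(features)
    win = left + right + 1
    return np.stack([padded[i : i + win].reshape(-1) for i in range(n)])

feats = np.random.randn(200, 83)     # 200 frames of 83-dim features
x = stack_context(feats)
print(x.shape)                       # (200, 913)
```

The center slice of each spliced vector (columns 415 to 497) is the original frame itself, so the window slides exactly one frame per step as described above.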
In S402, the first feature matrix is input into the neural network model.
In S403, a Softmax classifier is used to classify the frame feature vectors in the first feature matrix, and state clustering is performed according to the classification result to obtain multiple classes of frame feature vectors.
In S404, the posterior probability of each class of frame feature vectors is calculated; the posterior probabilities of the classes of frame feature vectors are the output parameters of the neural network model.
The DNN output parameter is the posterior probability γ_i(k) = p(k | f_i, θ), where i denotes the i-th frame of valid speech, θ denotes the text information corresponding to the speech, f_i denotes the first feature matrix input to the deep neural network, and k denotes the k-th output class, corresponding to the number of Gaussian components in a conventional Gaussian mixture model.
In S104, using the output parameters of the neural network model and the speaker feature corresponding to each training speech segment, N feature extraction matrices for the N training speech segments are trained, each feature extraction matrix corresponding to a speaker model of one training speech segment.
As another embodiment of the present invention, FIG. 5 shows a specific implementation flow of step S104 of the voiceprint recognition method provided by an embodiment of the present invention, detailed as follows:
In S501, training parameters of the neural network model are obtained, the training parameters being the mixture weights, means, and variances of the output parameters.
Based on the DNN output parameters in the above embodiment, where γ_i(k) = p(k | f_i, θ) is the posterior for frame i and class k, the three training parameters are calculated as follows:
mixture weight: w_k = Σ_i γ_i(k) / Σ_j Σ_i γ_i(j)
mean: μ_k = Σ_i γ_i(k) · f_i / Σ_i γ_i(k)
variance: σ_k² = Σ_i γ_i(k) · f_i² / Σ_i γ_i(k) − μ_k²
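One way the weight, mean, and variance statistics of S501 could be accumulated from DNN posteriors is sketched below; the array shapes and random inputs are illustrative only, and the function is an assumed vectorized form of the standard soft-count statistics, not code from the embodiment.

```python
import numpy as np

def mixture_stats(gamma, feats):
    """Per-class mixture weight, mean, and variance from DNN posteriors.

    gamma: (n_frames, K) posterior gamma_i(k) per frame
    feats: (n_frames, D) acoustic feature vector per frame
    """
    occ = gamma.sum(axis=0)                        # soft occupation counts N_k
    weights = occ / occ.sum()                      # w_k
    means = (gamma.T @ feats) / occ[:, None]       # mu_k
    second = (gamma.T @ feats ** 2) / occ[:, None]
    variances = second - means ** 2                # sigma_k^2
    return weights, means, variances

rng = np.random.default_rng(0)
gamma = rng.random((100, 4))
gamma /= gamma.sum(axis=1, keepdims=True)          # rows sum to 1 (Softmax output)
feats = rng.standard_normal((100, 13))
w, mu, var = mixture_stats(gamma, feats)
print(w.shape, mu.shape, var.shape)                # (4,) (4, 13) (4, 13)
print(round(w.sum(), 6))                           # 1.0
```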
In S502, according to the training parameters and the speaker feature corresponding to each training speech segment, the feature vector of the speaker corresponding to each training speech segment is calculated using a forward-backward algorithm.
In this embodiment, the speaker feature corresponding to a training speech segment represents the speaker identity label information of that training speech. According to the mixture weights, means, and variances of the DNN output parameters above, together with the label information of the training speech, the Baum-Welch algorithm, which is based on the forward-backward algorithm, is used to iteratively estimate the feature vector of the speaker corresponding to each training speech segment.
In S503, the training parameters of the neural network model and the feature vector of the speaker corresponding to each training speech segment are iterated to convergence, yielding the feature extraction matrix of each training speech segment.
In S105, the Mel-frequency cepstral coefficient (MFCC) acoustic features of the valid speech in the speech to be recognized are extracted, and a second feature matrix containing the dimension of the MFCCs and the number of frames of the speech to be recognized is output.
The specific embodiments described herein for S102 apply equally to S105; the difference is that the original speech signal processed in this step is the speech to be recognized, whereas the signal processed in S102 is the training speech. The remaining implementation principles are the same and are not repeated here.
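For illustration, the feature extraction pipeline shared by S102 and S105 (framing, power spectrum via FFT, Mel-scale triangular filter bank, log energies, discrete cosine transform) can be sketched in numpy as follows. The function name `mfcc_matrix` and all default parameter values are assumptions for this sketch, not values fixed by the embodiment:

```python
import numpy as np

def mfcc_matrix(signal, sr=16000, n_fft=512, hop=160, win=400,
                n_mels=26, n_ceps=13):
    """MFCC sketch. Returns an (n_ceps, n_frames) feature matrix, i.e.
    MFCC dimension by number of frames, matching the text's feature matrix."""
    # frame the signal with a Hamming window
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    # power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel-scale triangular filter bank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # log filter-bank energies, then DCT-II to decorrelate
    log_e = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return (log_e @ dct.T).T
```

With a one-second 16 kHz signal and the assumed defaults this yields a 13-by-98 matrix: 13 cepstral coefficients over 98 frames.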
In S106, among the N speaker models, the speaker model matching the second feature matrix is selected according to a preset similarity measurement algorithm, and the speaker corresponding to the selected speaker model is output as the voiceprint recognition result of the speech to be recognized.
The similarity measurement algorithm includes, but is not limited to, distance measures, similarity measures and matching measures, which quantify how close the second feature matrix and a speaker model are in their objective feature representations.
As another embodiment of the present invention, the speaker model matching the second feature matrix is obtained by the cosine measure among the similarity measures.
In this embodiment, the cosine of the angle between two vectors in the vector space measures the difference between the second feature matrix and each of the N speaker models. The similarity of the two vectors (the second feature matrix representing the speech to be recognized, and a speaker model) is judged by comparing the cosine distance between the two input low-dimensional i-vectors against a set threshold. The lines connecting each feature point to the origin intersect at the origin; the smaller the angle between them, the more similar the two features, and the larger the angle, the less similar they are.
Among the N speaker models, the speaker model with the greatest similarity is selected; the original speaker of that model is taken as the speaker of the speech to be recognized, yielding the voiceprint recognition result of the speech to be recognized.
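The cosine-measure matching and maximum-similarity selection described above can be sketched as follows. This is a simplified illustration; the function name, argument names and the threshold value are assumptions, not part of the embodiment:

```python
import numpy as np

def best_speaker(test_vec, speaker_vecs, threshold=0.5):
    """Score a test i-vector against N speaker model vectors by cosine
    similarity; return (index, score) of the closest model, or
    (None, score) if no model clears the decision threshold."""
    test = test_vec / np.linalg.norm(test_vec)
    models = speaker_vecs / np.linalg.norm(speaker_vecs, axis=1, keepdims=True)
    scores = models @ test          # cosine of the angle to each model
    k = int(np.argmax(scores))      # model with the greatest similarity
    return (k, scores[k]) if scores[k] >= threshold else (None, scores[k])
```

A smaller angle gives a cosine closer to 1, i.e. a higher score; the argmax therefore implements the "greatest similarity" selection of S106.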
In the embodiment of the present invention, the voiceprint background model is trained by supervised learning. By incorporating speaker features, a more suitable acoustic feature set can be mined from the original training speech data, so that the differentiating characteristics of speakers can be distinguished more accurately and a better voiceprint recognition effect is obtained in scenarios with overlapping speech. Since the main recognition process is implemented on a deep neural network model, a more robust speaker model can be learned, solving the problem of low recognition accuracy in existing voiceprint recognition methods.
It should be understood that the magnitudes of the step numbers in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
Corresponding to the voiceprint recognition method described in the above embodiments, FIG. 6 shows a structural block diagram of a voiceprint recognition apparatus provided by an embodiment of the present invention. The apparatus may be a software module, a hardware module, or a module combining software and hardware. For convenience of explanation, only the parts related to the present embodiment are shown.
Referring to FIG. 6, the apparatus includes:
a pre-processing module 61, configured to pre-process each of the K input speeches to obtain the valid speech in each speech, where the speeches include training speeches and the speech to be recognized;

a first extraction module 62, configured to extract the MFCC acoustic features of the valid speech in each training speech, and to output a first feature matrix containing the dimension of the MFCCs and the number of frames of each training speech;

a construction module 63, configured to construct a long short-term memory recurrent neural network model and to input the first feature matrix into the neural network model to obtain the output parameters of the neural network model;

a training module 64, configured to train, using the output parameters of the neural network model and the speaker feature corresponding to each training speech, the N feature extraction matrices of the N training speeches, where each feature extraction matrix corresponds to the speaker model of one training speech;

a second extraction module 65, configured to extract the MFCC acoustic features of the valid speech in the speech to be recognized, and to output a second feature matrix containing the dimension of the MFCCs and the number of frames of the speech to be recognized;

an identification module 66, configured to select, among the N speaker models and according to a preset similarity measurement algorithm, the speaker model matching the second feature matrix, and to output the speaker corresponding to the selected speaker model as the voiceprint recognition result of the speech to be recognized.

K and N are integers greater than zero, and K is greater than N.
Optionally, the pre-processing module 61 includes:
a pre-emphasis submodule, configured to pre-emphasize each of the K input speeches to boost the high-frequency band of each speech;

a conversion submodule, configured to convert each pre-emphasized speech into short-time stationary signals using a framing-and-windowing algorithm;

a detection submodule, configured to distinguish noise from speech in the short-time stationary signals based on an endpoint detection algorithm, and to output the speech portion of the short-time stationary signals as the valid speech of each input speech.
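A minimal sketch of the pre-emphasis and framing-and-windowing steps performed by the first two submodules above. The pre-emphasis coefficient 0.97 and the frame sizes are common defaults assumed for illustration, not values fixed by the embodiment:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1],
    boosting the high-frequency band of the speech signal."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, win=400, hop=160):
    """Split the signal into overlapping frames and apply a Hamming
    window, so each frame can be treated as short-time stationary."""
    n_frames = 1 + max(len(signal) - win, 0) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(win)
```

At 16 kHz the assumed defaults correspond to 25 ms frames with a 10 ms shift, a common configuration for short-time speech analysis.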
Optionally, the first extraction module 62 includes:
an acquisition submodule, configured to analyze the valid speech in each training speech by fast Fourier transform to obtain the power spectrum of the valid speech;

a filtering submodule, configured to filter the power spectrum with a Mel-scale filter bank containing M triangular filters, and to obtain the log energy output by each triangular filter, where M is an integer greater than zero;

a transform submodule, configured to apply a discrete cosine transform to the log energies and to output the MFCC acoustic features of the valid speech;

an output submodule, configured to output, from the MFCC acoustic features, the first feature matrix containing the dimension of the MFCCs and the number of frames of each training speech.
Optionally, the construction module 63 includes:
an initialization submodule, configured to initialize a long short-term memory recurrent neural network model comprising an input layer, a recurrent layer containing long short-term memory units, and an output layer;

an input submodule, configured to input the first feature matrix into the neural network model;

a classification submodule, configured to classify the frame feature vectors in the first feature matrix with a Softmax classifier, and to perform state clustering on the classification results to obtain multiple classes of frame feature vectors;

a calculation submodule, configured to calculate the posterior probability of each class of frame feature vectors, the posterior probabilities of the classes of frame feature vectors being the output parameters of the neural network model.
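The posterior probabilities produced by the calculation submodule correspond to a Softmax over the model's output layer. A minimal, numerically stable numpy sketch (illustrative only; the function name is an assumption):

```python
import numpy as np

def softmax_posteriors(logits):
    """Row-wise Softmax: each row of logits (one frame's output-layer
    activations) becomes a posterior distribution over the classes."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```

Each output row is non-negative and sums to one, as required of the frame-level posteriors used as training parameters in S501.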
Optionally, the training module 64 includes:
a parameter acquisition submodule, configured to acquire the training parameters of the neural network model, the training parameters being the mixture weights, means and variances of the output parameters;

a feature acquisition submodule, configured to calculate, with a forward-backward algorithm and according to the training parameters and the speaker feature corresponding to the training speech, the feature vector of the speaker corresponding to each training speech;

an iteration submodule, configured to iterate the training parameters of the neural network model and the feature vector of the speaker corresponding to each training speech to convergence, yielding the feature extraction matrix of each training speech.
In the embodiment of the present invention, the voiceprint background model is trained by supervised learning. By incorporating speaker features, a more suitable acoustic feature set can be mined from the original training speech data, so that the differentiating characteristics of speakers can be distinguished more accurately and a better voiceprint recognition effect is obtained in scenarios with overlapping speech. Since the main recognition process is implemented on a deep neural network model, a more robust speaker model can be learned, solving the problem of low recognition accuracy in existing voiceprint recognition methods.
FIG. 7 is a schematic diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 7, the electronic device 7 of this embodiment includes a processor 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processor 70, such as a voiceprint recognition program. When executing the computer program 72, the processor 70 implements the steps in the above voiceprint recognition method embodiments, such as steps 101 to 106 shown in FIG. 1. Alternatively, when executing the computer program 72, the processor 70 implements the functions of the modules/units in the above apparatus embodiments, such as the functions of modules 61 to 66 shown in FIG. 6.
Illustratively, the computer program 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing particular functions, the instruction segments describing the execution of the computer program 72 in the electronic device 7.
The electronic device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The electronic device 7 may include, but is not limited to, the processor 70 and the memory 71. Those skilled in the art will understand that FIG. 7 is merely an example of the electronic device 7 and does not constitute a limitation on it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 7 may further include input/output devices, network access devices, buses, and the like.
The processor 70 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 71 may be an internal storage unit of the electronic device 7, such as a hard disk or memory of the electronic device 7. The memory 71 may also be an external storage device of the electronic device 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the electronic device 7. The memory 71 is used to store the computer program and other programs and data required by the electronic device 7, and may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the division of the functional modules described above is exemplified. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. The functional modules in the embodiments may be integrated into one processing module, each module may exist physically alone, or two or more modules may be integrated into one module; the integrated module may be implemented in the form of hardware or in the form of software functional modules. In addition, the specific names of the functional modules are only for convenience of distinguishing them from one another and are not intended to limit the protection scope of the present application. For the specific working processes of the modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
Those of ordinary skill in the art will appreciate that the modules and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or modules, and may be electrical, mechanical or in other forms.
The modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of software functional modules.
If the integrated module is implemented in the form of software functional modules and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.