
CN118471234A - Voice wakeup method and device, AR equipment and storage medium - Google Patents

Voice wakeup method and device, AR equipment and storage medium

Info

Publication number
CN118471234A
Authority
CN
China
Prior art keywords
voice
wake
network
data
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410546249.6A
Other languages
Chinese (zh)
Inventor
李晶晶
袁斌
毛婷婷
侯天峰
刘兵兵
蒋超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goertek Inc
Original Assignee
Goertek Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Goertek Inc
Priority to CN202410546249.6A
Publication of CN118471234A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L17/24 Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electric Clocks (AREA)

Abstract

The application discloses a voice wake-up method and apparatus, an AR device, and a storage medium, relating to the technical field of AR. The voice wake-up method includes the following steps: acquiring at least one frame of voice data collected by a bone conduction microphone, and determining a time-series MFCC matrix from the at least one frame of voice data; inputting the time-series MFCC matrix into a pre-trained voice feature extraction network and obtaining a relevant feature matrix as output; determining, according to a pre-trained voice discrimination sub-network and the relevant feature matrix, whether user voice exists in the at least one frame of voice data; and, if user voice exists, inputting the time-series voice feature matrix corresponding to the relevant feature matrix into a pre-trained wake-up word voice detection sub-network for detection, and waking up the device corresponding to the wake-up word voice when wake-up word voice is detected. The application enables the AR device to recognize voice data effectively and accurately, thereby improving the accuracy of the wake-up result of the AR device.

Description

Voice wakeup method and device, AR equipment and storage medium
Technical Field
The present application relates to the field of AR technologies, and in particular, to a voice wake-up method, apparatus, AR device, and storage medium.
Background
AR (Augmented Reality) products, such as AR glasses, are developing rapidly, new optical schemes keep emerging, and various ways of waking up an AR product can be configured, such as voice wake-up.
A current voice wake-up approach for AR devices is to perform voice recognition on voice data collected by an air conduction microphone and then wake up the AR device accordingly. However, this approach has drawbacks: the air conduction microphone has no noise immunity, so the collected voice data contains noise, which makes voice recognition difficult; as a result, the voice data cannot be recognized effectively and accurately, and the wake-up result of the AR device is affected.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The application mainly aims to provide a voice awakening method, a voice awakening device, AR equipment and a storage medium, and aims to solve the technical problem that the AR equipment cannot effectively and accurately recognize voice data so as to influence the awakening result of the AR equipment.
In order to achieve the above object, the present application provides a voice wake-up method applied to an AR device, in which a bone conduction microphone is disposed, the method comprising:
acquiring at least one frame of voice data acquired by a bone conduction microphone, and determining a time sequence MFCC matrix according to the at least one frame of voice data;
inputting the time sequence MFCC matrix to a pre-trained voice feature extraction network, and outputting to obtain a relevant feature matrix;
Determining whether user voice exists in the at least one frame of voice data according to the pre-trained voice discrimination sub-network and the related feature matrix;
if the user voice exists, inputting the time sequence voice feature matrix corresponding to the relevant feature matrix into a pre-trained wake-up word voice detection sub-network to detect, and waking up equipment corresponding to the wake-up word voice when the wake-up word voice is detected.
In one embodiment, the step of determining the time sequence MFCC matrix according to the at least one frame of voice data includes:
determining mel cepstrum coefficients corresponding to the at least one frame of voice data;
And arranging the mel cepstrum coefficients according to time sequence, and determining a time sequence MFCC matrix according to an arrangement result.
In an embodiment, after the step of inputting the time sequence MFCC matrix to the pre-trained speech feature extraction network and outputting to obtain the relevant feature matrix, the method further includes:
determining a time dimension corresponding to the related feature matrix, determining a target bit in a preset time sequence voice feature matrix according to the time dimension, and deleting data in front of the target bit in the preset time sequence voice feature matrix;
And advancing the data of the target bit in the preset time sequence voice feature matrix according to the target bit, and accessing the relevant feature matrix at the tail end of the preset time sequence voice feature matrix to obtain the time sequence voice feature matrix.
In an embodiment, the step of inputting the time sequence voice feature matrix corresponding to the relevant feature matrix to the pre-trained wake-up word voice detection sub-network for detection includes:
Determining each phoneme in a time sequence voice feature matrix corresponding to the related feature matrix according to a pre-trained wake-up word voice detection sub-network, and combining different phonemes to obtain a phoneme combination;
And sequentially matching each phoneme combination with preset wake-up word voice, and if the phoneme combination matched with the wake-up word voice exists, determining that the wake-up word voice exists.
In one embodiment, the step of determining whether the user voice exists in the at least one frame of voice data according to the pre-trained voice discrimination sub-network and the correlation feature matrix includes:
Inputting the relevant feature matrix into a pre-trained voice discrimination sub-network, and outputting to obtain a prediction probability value;
And if the predicted probability value is greater than a preset threshold value, determining that user voice exists in the at least one frame of voice data.
In an embodiment, the AR device is provided with an air conduction microphone, and the method further includes:
Constructing a training set according to training data recorded by the air conduction microphone, and constructing a verification set according to verification voice data recorded by the bone conduction microphone;
Performing repeated iterative training on a preset voice feature extraction network, a preset wake-up word voice detection sub-network and a preset voice discrimination sub-network according to the training set, wherein in each iterative training process, the preset voice feature extraction network and the preset voice discrimination sub-network are jointly trained, and the jointly trained voice feature extraction network and the preset wake-up word voice detection sub-network are then jointly trained;
After each time of iterative training is completed, evaluating a trained voice feature extraction network, a trained wake-up word voice detection sub-network and a trained voice judgment sub-network according to the verification set to obtain an evaluation performance score;
Selecting the maximum evaluation performance score from the evaluation performance scores corresponding to each iteration training, taking a voice feature extraction network corresponding to the maximum evaluation performance score as a pre-trained voice feature extraction network, taking a voice discrimination sub-network corresponding to the maximum evaluation performance score as a pre-trained voice discrimination sub-network, and taking a wake-up word voice detection sub-network corresponding to the maximum evaluation performance score as a pre-trained wake-up word voice detection sub-network.
In one embodiment, the step of constructing the training set according to training data recorded by the air conduction microphone includes:
different types of noise data recorded by the air conduction microphone are acquired, and a plurality of different acoustic environments are created based on the different types of noise data;
Constructing different background sound data according to different acoustic environments;
acquiring first voice data recorded by the air conduction microphone in a quiet environment, and performing low-pass filtering processing on the first voice data to obtain low-pass filtering voice data;
and fusing the background sound data and the low-pass filtered voice data aiming at each piece of background sound data to obtain training data, and constructing a training data set according to the training data.
In addition, in order to achieve the above object, the present application also provides a voice wake-up device, which is disposed in an AR device, in which a bone conduction microphone is disposed, the device comprising:
the acquisition module is used for acquiring at least one frame of voice data acquired by the bone conduction microphone and determining a time sequence MFCC matrix according to the at least one frame of voice data;
the feature extraction module is used for inputting the time sequence MFCC matrix into a pre-trained voice feature extraction network and outputting the time sequence MFCC matrix to obtain a relevant feature matrix;
The determining module is used for determining whether user voice exists in the at least one frame of voice data according to the pre-trained voice discrimination sub-network and the related feature matrix;
And the wake-up module is used for inputting the time sequence voice feature matrix corresponding to the related feature matrix into a pre-trained wake-up word voice detection sub-network for detection if the user voice exists, and waking up equipment corresponding to the wake-up word voice when the wake-up word voice is detected.
In addition, to achieve the above object, the present application also proposes an AR device comprising: a bone conduction microphone, a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program configured to implement the steps of the voice wakeup method as described above.
In addition, to achieve the above object, the present application also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the voice wake-up method as described above.
The application determines a time sequence MFCC matrix through at least one frame of voice data collected by a bone conduction microphone in the AR equipment, inputs the MFCC matrix into a pre-trained voice feature extraction network, and outputs the MFCC matrix to obtain a relevant feature matrix. Because the bone conduction microphone is adopted for voice data acquisition, the superior anti-noise capability and privacy protection capability of the bone conduction microphone can be effectively utilized, and the acquired voice data is more accurate. And because the data quantity of the time sequence sampling points corresponding to at least one frame of voice data acquired by the bone conduction microphone is very large, the time sequence MFCC matrix is input into the voice feature extraction network by compressing at least one frame of voice into the time sequence MFCC matrix, so that the power consumption can be reduced while the voice recognition precision of the subsequent wake-up words is ensured, and the phenomenon of resource waste is avoided.
In addition, whether user voice exists in the at least one frame of voice data is determined according to the pre-trained voice discrimination sub-network and the relevant feature matrix; if user voice exists, the time-series voice feature matrix corresponding to the relevant feature matrix is input into the pre-trained wake-up word voice detection sub-network for detection, and when wake-up word voice is detected, the device corresponding to the wake-up word voice is woken up. That is, wake-up word voice detection requires the dual verification of user voice detection followed by wake-up word voice detection, so wake-up word voice detection is performed only when it is determined that user voice exists. This avoids running wake-up word voice detection continuously, reduces power consumption as much as possible, improves the precision of wake-up voice control, and eliminates misjudgments caused by unknown voice. The AR device can thus recognize voice data effectively and accurately, improving the accuracy of the wake-up result of the AR device.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of a voice wake-up method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a network architecture of a wake-up word voice detection sub-network in an embodiment of a voice wake-up method according to the present application;
fig. 3 is a schematic flow chart of a voice wake-up method according to a second embodiment of the present application;
FIG. 4 is a schematic overall flow chart corresponding to the voice wake-up method of the present application;
FIG. 5 is a schematic block diagram of a voice wake-up device according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a device structure of a hardware operating environment related to a voice wake-up method in an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.
For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.
In this embodiment, for convenience of description, the following description will be made with an AR device (such as AR glasses) as an execution subject.
Since the surrounding noise environment is very complex in most cases, the AR device performs noise reduction on microphone data to obtain enough user speech information, so that the predefined wake-up word speech segments in the time-series audio stream can be detected more accurately. However, since storage and computing resources on the Bluetooth chip end in the AR device are usually very limited and low power consumption must be satisfied, noise reduction of the input data in advance is generally not considered; this, in turn, results in a large number of false alarms at low signal-to-noise ratios, which greatly interferes with the associated devices. In addition, when in a very quiet environment, or to avoid social embarrassment in everyday life, a user may involuntarily lower his or her speaking volume; in that case, if an air conduction microphone is used for voice data acquisition, the device associated with it may fail to wake up.
A bone conduction microphone conducts sound through the wearer's skull, strengthening the speech energy of the wearer while attenuating the energy of non-wearers and ambient noise, which also protects the privacy of the user of the wearable device; for example, when the user wears AR glasses and speaks in a whisper, the user's voice can still be captured accurately. However, the voice data collected by a bone conduction microphone is concentrated mainly in the low-frequency part, so high-frequency information is almost lost, which makes voice recognition more difficult; on the other hand, its noise immunity is strong and most noise can be blocked.
Deep learning algorithms are difficult to deploy on chips with low computing resources because of their high computational complexity; only a very small model can be run, yet model scale directly affects accuracy: in theory, the larger the model, the stronger its ability to fit complex data and the better its inference results. To deploy a deep learning model on the Bluetooth chip end and process audio data in real time, the computational load of the model is usually reduced at the expense of its effectiveness. In addition, deep learning requires a large amount of data, wake-up word voice data is expensive to acquire, and the definition of the wake-up word changes frequently, all of which make it difficult to develop a high-precision voice wake-up algorithm on the Bluetooth chip end.
Optionally, in the present embodiment, deep-learning voice wake-up is performed based on real-time data frames from the bone conduction microphone of the AR device. The method performs transfer learning for bone conduction microphone data on the large amount of voice data collected by air conduction microphones, and performs dual verification on the collected voice data, specifically: determining whether the wearer is speaking, and specifically recognizing the wake-up word voice. A self-designed loss function and evaluation function are used to complete the joint learning and screening of the multi-output results of the voice wake-up algorithm, so as to meet the requirement of high-precision voice wake-up under low power consumption and low signal-to-noise ratio.
Optionally, when the device to be woken by voice is an AR device such as AR glasses, voice features may be learned from the bone conduction microphone data stream at the nose pad of the AR glasses, so as to analyze whether the user needs to wake up the AR glasses in different complex application scenarios. To reduce power consumption and improve accuracy, each frame of data collected by the bone conduction microphone is processed accordingly, whether the user, i.e. the wearer, has spoken a wake-up word is finally output, and the corresponding device, such as the AR device, is woken up when it is determined that the user has spoken the wake-up word.
Optionally, in the AR device, the bluetooth chip end realizes high-precision and low-power consumption voice wake-up based on the bone conduction microphone at the nose pad. The method solves the problems that the low-complexity algorithm cannot realize high-precision voice awakening, the existing deep learning algorithm is difficult to operate rapidly at a low-computing resource chip end, the voice awakening algorithm is high in power consumption because of continuous operation, and the deep learning algorithm has insufficient generalization capability because of too small data quantity of a bone conduction microphone at the nose pad of the AR glasses.
Optionally, in order to meet the requirements of real-time performance, low power consumption and high precision of the Bluetooth chip, double verification is performed in a voice wake-up network of the AR equipment, meanwhile, fine setting is performed on a loss function and an evaluation function which participate in training, and fixed-point calculation can reduce memory occupation and calculation amount of the Bluetooth chip platform end during model reasoning after training, so that low power consumption and time delay are further reduced.
Optionally, in order to exclude a large amount of influence of irrelevant sound, for example, the wearer wakes up the AR device by mistake when not speaking, and meanwhile, the real-time performance and the low power consumption requirement of the chip end are considered, so that the background data can be processed according to priori knowledge, for example, divided into noise and unknown voice, and output in advance in the wake-up word voice control network to exclude a large amount of interference, thereby reducing the calculation amount of the wake-up word voice control model.
Optionally, the enhanced data of the air conduction microphone data is used as training data, and the truly recorded bone conduction microphone data at the nose pad of the AR glasses is subjected to multi-round verification, so that a model with strong adaptability to the voice data collected by the bone conduction microphone at the nose pad of the AR glasses is screened.
Optionally, in the wake-up word speech recognition network in this embodiment, a dual-verification network structure is designed at a low computing resource product end, so that a better balance can be provided for a low rejection rate, a low false alarm rate and low power consumption, and in order to balance the accuracy between wake-up word speech and irrelevant sounds, a self-design is performed on a loss function and an evaluation function of a training wake-up word speech control model.
Optionally, the irrelevant sound data can be processed in the voice time sequence flow, so that the accuracy of wake-up word voice recognition is improved, and the risk that the irrelevant sound data is misjudged as wake-up word voice data and a false wake-up operation is executed is avoided.
Optionally, a deep learning algorithm is trained with only a small amount of voice data collected by the bone conduction microphone: a small amount of bone conduction microphone data recorded at the AR glasses nose pad is taken as the target, and the large amount of existing air conduction microphone data is mined, so that the ability to fit the voice characteristics of bone conduction wake-up word data is fully improved.
Optionally, this embodiment provides an innovative solution to the difficulty that low-parameter models have in analyzing audio data in noisy environments: it learns and updates the parameters of the deep-learning wake-up word voice control model, excludes a large amount of unknown-voice interference in advance through a dedicated design, reduces power consumption, and further improves the accuracy of wake-up word voice detection under high noise; it also offers a new approach to training a deep learning algorithm on a small amount of bone conduction microphone data.
Based on this, an embodiment of the present application provides a voice wake-up method, and referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the voice wake-up method of the present application.
In this embodiment, the voice wake-up method is applied to an AR device, in which a bone conduction microphone is disposed, and includes steps S10 to S40:
step S10, at least one frame of voice data acquired by a bone conduction microphone is acquired, and a time sequence MFCC matrix is determined according to the at least one frame of voice data;
It should be noted that the at least one frame of voice data may include at least one frame of sampling point data. The time-series MFCC matrix (Temporal MFCC Matrix) refers to a matrix composed of MFCC eigenvalues of a series of frames extracted by MFCC (Mel Frequency Cepstral Coefficients, mel-frequency cepstral coefficient) algorithm in speech signal processing. This matrix reflects the acoustic properties of the speech signal over time. For example, for a continuous speech signal, it is first divided into a plurality of overlapping frames according to a certain time window (e.g. 25 ms), and after the signal in each frame is subjected to pre-emphasis, windowing, discrete Fourier Transform (DFT), and filtering by a mel filter bank, a corresponding mel spectrum energy distribution is calculated. Next, redundant information is removed and the most representative cepstral coefficients are extracted by performing a Discrete Cosine Transform (DCT) on the mel spectrum, typically taking the first 13 or more coefficients as MFCC feature vectors. Alternatively, each row of the time-series MFCC matrix is the MFCC feature vector of one frame, the number of rows is equal to the effective number of frames obtained by processing the speech signal, and the number of columns is the MFCC feature dimension (e.g., 13 dimensions).
Optionally, an AR device is exemplified as AR glasses, and a bone conduction microphone is provided at a nose pad of the AR glasses. Optionally, an air conduction microphone may also be provided on the AR glasses. Alternatively, an air conduction microphone, also called a sound wave microphone or an air conduction microphone, is a device that converts and records sound signals by capturing sound vibrations in the air. The working principle is that the pressure change generated when sound waves propagate in the air is utilized, the change acts on the vibrating diaphragm in the microphone, so that the vibrating diaphragm vibrates along with the sound waves, and then the mechanical vibrations are converted into electric signals through electromagnetic conversion (moving coil type) or capacitive charge-discharge effect (capacitive type) and other modes.
Alternatively, the bone conduction microphone in the AR device may directly contact the bones of the head or face of the user, and convert the vibration signals into electrical signals by sensing the vocal cord vibrations generated when the user speaks, thereby obtaining at least one frame of voice data collected by the bone conduction microphone.
Optionally, after determining at least one frame of voice data, mel cepstrum coefficients of the at least one frame of voice data may be calculated, and the calculated mel cepstrum coefficients are converted into the time-series MFCC matrix.
In one possible embodiment, step S10 may include steps A11 to A12:
Step A11, determining mel cepstrum coefficients corresponding to the at least one frame of voice data;
it should be noted that mel cepstrum coefficient (Mel Frequency Cepstral Coefficients, abbreviated as MFCCs) is a characteristic parameter for sound processing.
And step A12, arranging mel cepstrum coefficients according to time sequence, and determining a time sequence MFCC matrix according to the arrangement result.
Optionally, the number of time-series sampling points captured by the bone conduction microphone is very large, and directly inputting them into the deep learning model for calculation would severely challenge the real-time performance and low power consumption of the chip platform. Alternatively, a time-series MFCC matrix may be used as the input of the deep learning model instead of the directly acquired time-series sampling points. By calculating the mel cepstrum coefficients of one frame of sampling point data (e.g. 480 points, or 30 milliseconds at a 16 kHz sampling rate), the input data of the wake-up word voice control model is compressed (e.g. to 12 feature points); that is, while the recognition precision of the wake-up word voice is ensured, it becomes possible to build a low-parameter deep learning model for the chip platform.
Optionally, the sampling point data of the at least one frame of voice data may be converted into corresponding mel cepstrum coefficients, and each mel cepstrum coefficient is then arranged according to the time order in which each frame of voice data was collected. The mel cepstrum coefficients of different frames, arranged in time order, form a time-series matrix. Alternatively, if there is no unprocessed time-series MFCC matrix in the AR device, the newly formed time-series matrix can be used directly as the time-series MFCC matrix, which is input into the voice feature extraction network for processing, and its state is changed from unprocessed to processed. Optionally, if an unprocessed time-series MFCC matrix exists, the newly formed time-series matrix is appended to the unprocessed time-series MFCC matrix to obtain the time-series MFCC matrix to be input into the voice feature extraction network.
Optionally, if the bone conduction microphone has collected 10 frames of data and the first 5 frames were already converted into a time-series MFCC matrix before the current moment, the mel cepstrum coefficients corresponding to the later 5 frames may be calculated and converted to form a time-series MFCC matrix for those 5 frames, which is then appended to the time-series MFCC matrix of the first 5 frames, i.e. placed at its end. A time-series MFCC matrix corresponding to all 10 frames of data is thus obtained and input into the voice feature extraction network for processing.
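The construction of the time-series MFCC matrix can be pictured with a short sketch. The following Python code is only a minimal illustration of the framing and stacking idea under stated assumptions: the frame length of 480 samples at 16 kHz, the 12 coefficients per frame, and the use of the librosa library are choices made for the example and are not mandated by the application.

```python
import numpy as np
import librosa  # assumed helper library for MFCC computation; not specified by the application

SAMPLE_RATE = 16000      # 16 kHz sampling rate (example value from the description)
FRAME_LEN = 480          # 30 ms frame, i.e. 480 samples per frame (example value)
N_MFCC = 12              # 12 MFCC feature points per frame (example value)

def frame_to_mfcc(frame: np.ndarray) -> np.ndarray:
    """Compress one frame of bone-conduction samples into a single MFCC vector."""
    mfcc = librosa.feature.mfcc(y=frame, sr=SAMPLE_RATE, n_mfcc=N_MFCC,
                                n_fft=FRAME_LEN, hop_length=FRAME_LEN)
    return mfcc.mean(axis=1)                  # one 12-dimensional vector per frame

def build_mfcc_matrix(frames: list[np.ndarray],
                      pending: np.ndarray | None = None) -> np.ndarray:
    """Arrange per-frame MFCC vectors in time order; append to any unprocessed matrix."""
    rows = [frame_to_mfcc(f) for f in frames]   # chronological order of collected frames
    new_rows = np.stack(rows)                   # shape: (num_frames, N_MFCC)
    if pending is None:                         # no unprocessed time-series MFCC matrix yet
        return new_rows
    return np.concatenate([pending, new_rows], axis=0)   # place new rows at the end
```

Used this way, 5 newly collected frames appended to a pending 5-frame matrix yield the 10-frame time-series MFCC matrix of the example above.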
Step S20, inputting the time sequence MFCC matrix into a pre-trained voice feature extraction network, and outputting to obtain a relevant feature matrix;
Alternatively, the speech feature extraction network may be a deep learning-based technique for automatically extracting key features from the original speech signal that facilitate subsequent processing and recognition. Such networks typically include multiple hierarchies that simulate the complex processes of human ears and brain processing of sound signals and automatically discover and encode the inherent patterns of speech signals through learning. Optionally, the network architecture of the speech feature extraction network may include at least one of a convolutional neural network, a recurrent neural network, and a 1-D convolutional neural network. Optionally, the pre-trained voice feature extraction network is a voice feature extraction network with better performance, which is obtained by completing corresponding model training before the actual application of the AR equipment.
Optionally, the relevant feature matrix may be a feature matrix formed by performing feature extraction on the feature extracted by the pre-trained voice feature extraction network and arranging the features according to time sequence.
Optionally, the AR device may input the time sequence MFCC matrix to the pre-trained speech feature extraction network to perform model training processing, that is, perform a compression process on the time sequence MFCC matrix through the pre-trained speech feature extraction network, and output the time sequence MFCC matrix to obtain the relevant feature matrix.
Optionally, since the size of the input data of a deep learning model strongly affects memory occupation and calculation amount when the AR device (such as an AR glasses device) is used, the input and the size of the model are reduced in order to lower memory occupation and power consumption while the algorithm runs. In this embodiment, the input of the deep learning model is compressed twice. For example, the time-series sampling points captured by the bone conduction microphone at the nose pad of the AR glasses are first compressed into time-series mel cepstrum coefficients to obtain the time-series MFCC matrix. The time-series MFCC matrix is then input into the pre-trained voice feature extraction network. In the voice feature extraction network, a convolution structure extracts local time-series voice features from the time-series MFCC matrix, i.e. voice features within a local time window, thereby compressing the time-series mel cepstrum coefficients into local voice features. Optionally, the outputs of the voice feature extraction network may be combined in time order to obtain global voice features, which serve as the relevant feature matrix. Alternatively, in one embodiment scenario, the local voice features output by the voice feature extraction network may be used as the relevant feature matrix, and the global voice features may be used as the input to the wake-up word voice detection model, such as the time-series voice feature matrix. Optionally, the time-series mel cepstrum coefficients comprise at least one mel cepstrum coefficient.
Optionally, the data input to the pre-trained voice feature extraction network may be, for example, a time-series MFCC matrix corresponding to 5 frames of data, and the output of the pre-trained voice feature extraction network may be a feature matrix corresponding to 3 frames of data.
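As a rough illustration of this local-feature compression, the sketch below builds a small 1-D convolutional feature extraction network. The use of PyTorch, the layer widths, and the 3-frame receptive field are assumptions chosen only so that 5 input frames yield a 3-step relevant feature matrix as in the example; the application does not prescribe a specific implementation.

```python
import torch
import torch.nn as nn

class SpeechFeatureExtractor(nn.Module):
    """Minimal 1-D convolutional sketch: compresses a time-series MFCC matrix
    into local speech features (the 'relevant feature matrix')."""

    def __init__(self, n_mfcc: int = 12, n_feat: int = 16):
        super().__init__()
        # Convolve along the time axis; kernel_size=3 means each output step
        # summarises roughly 3 input frames.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, n_feat, kernel_size=3, stride=1),
            nn.ReLU(),
        )

    def forward(self, mfcc_matrix: torch.Tensor) -> torch.Tensor:
        # mfcc_matrix: (batch, time, n_mfcc) -> (batch, n_mfcc, time) for Conv1d
        x = mfcc_matrix.transpose(1, 2)
        feat = self.conv(x)                  # (batch, n_feat, time - 2)
        return feat.transpose(1, 2)          # back to (batch, time', n_feat)

# Example: a 5-frame MFCC matrix yields a 3-step relevant feature matrix.
extractor = SpeechFeatureExtractor()
out = extractor(torch.randn(1, 5, 12))       # out.shape == (1, 3, 16)
```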
Step S30, determining whether user voice exists in at least one frame of voice data according to the pre-trained voice discrimination sub-network and the related feature matrix;
Alternatively, the speech discrimination subnetwork may be a type of deep learning network structure dedicated to extracting and enhancing discriminating characteristics associated with a particular task in a speech signal during speech recognition, speech enhancement, speaker recognition, or other speech processing tasks. Optionally, the pre-trained voice discrimination sub-network is a voice discrimination sub-network with better performance, wherein the corresponding model training is completed before the AR equipment is actually applied.
Optionally, after obtaining the relevant feature matrix output by the voice feature extraction network, the relevant feature matrix may be input into the voice discrimination sub-network, and in the voice discrimination sub-network, the relevant feature matrix is judged to determine whether the voice feature corresponding to the user voice exists. If the voice features corresponding to the user voices exist, determining that the user voices exist in the voice data corresponding to the relevant feature matrix. If the voice characteristics corresponding to the user voices do not exist, determining that the voice data corresponding to the relevant characteristic matrix does not exist.
In one possible embodiment, step S30 may include steps C11-C12:
Step C11, inputting the relevant feature matrix into a pre-trained voice discrimination sub-network, and outputting to obtain a prediction probability value;
and step C12, if the predicted probability value is larger than a preset threshold value, determining that user voice exists in at least one frame of voice data.
Optionally, the preset threshold may be a threshold set by the user in advance, and may be a value between 0 and 1.
Alternatively, the MFCC feature matrix of the current fixed time length, after being processed by the voice feature extraction network, may be input into the voice discrimination sub-network, i.e. passed through a max pooling layer, a fully connected layer, and a hard-sigmoid activation function, which outputs a prediction probability value between 0 and 1. For example, when the preset threshold is 0.5: if the predicted probability value is greater than 0.5, it is determined that the user is speaking, and therefore user voice exists in the at least one frame of voice data; if the predicted probability value is less than or equal to 0.5, it is determined that the user is not speaking, i.e. no user voice exists.
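The discrimination step can be sketched as the following small PyTorch module: max pooling over time, a fully connected layer, and a hard-sigmoid activation producing a probability that is compared against the preset threshold. The layer width and the 0.5 threshold are example values taken from the description; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class VoiceDiscriminator(nn.Module):
    """Sketch of the voice discrimination sub-network: relevant feature matrix in,
    prediction probability (0..1) out."""

    def __init__(self, n_feat: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool1d(1)      # max pooling over the time axis
        self.fc = nn.Linear(n_feat, 1)           # fully connected layer
        self.act = nn.Hardsigmoid()              # hard-sigmoid activation

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, time, n_feat)
        pooled = self.pool(feat.transpose(1, 2)).squeeze(-1)   # (batch, n_feat)
        return self.act(self.fc(pooled)).squeeze(-1)           # (batch,)

def user_voice_present(prob: torch.Tensor, threshold: float = 0.5) -> bool:
    """Greater than the preset threshold -> user voice is considered present."""
    return bool(prob.item() > threshold)
```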
Step S40, if the user voice exists, inputting the time sequence voice feature matrix corresponding to the relevant feature matrix into a pre-trained wake-up word voice detection sub-network to detect, and waking up equipment corresponding to the wake-up word voice when the wake-up word voice is detected.
Alternatively, the wake word speech detection sub-network may be used to monitor in real-time whether the audio stream contains preset wake words (e.g., "hey, siri", "xx fairy", etc.). When the user speaks the wake-up word, the wake-up word voice detection sub-network can quickly and accurately identify and activate the subsequent voice processing flow, so that interaction with the user is realized. Optionally, the pre-trained wake-up word voice detection sub-network is a wake-up word voice detection sub-network with better performance, which has completed corresponding model training before the AR equipment is actually applied.
Alternatively, the network architecture of the wake-up word voice detection sub-network may be as shown in FIG. 2, including a GRU (Gated Recurrent Unit), max pooling, an FC (fully connected) layer, and an output part comprising softmax-o and softmax-t. Optionally, softmax-o is used on the AR device, such as on the AR glasses device side, while softmax-t is used to train the wake-up word voice detection sub-network. Optionally, softmax-o outputs the probability that wake-up word voice occurs within a fixed length of time and the probability of unknown sounds. Whether wake-up word voice exists is determined according to the probability of wake-up word voice; for example, when the probability of wake-up word voice is greater than 0.9, it can be determined that wake-up word voice exists, and the device corresponding to the wake-up word voice, such as the AR glasses or another household device, is woken up.
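A minimal PyTorch sketch of the architecture of FIG. 2 is given below: a GRU over the time-series voice feature matrix, max pooling, a fully connected layer, and a softmax output corresponding to softmax-o (the softmax-t head used only during training is omitted). The hidden size, the number of output classes, and the 0.9 wake threshold in the helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WakeWordDetector(nn.Module):
    """Sketch of the wake-up word voice detection sub-network of FIG. 2:
    GRU -> max pooling -> FC -> softmax."""

    def __init__(self, n_feat: int = 16, hidden: int = 32, n_classes: int = 2):
        super().__init__()
        self.gru = nn.GRU(n_feat, hidden, batch_first=True)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(hidden, n_classes)   # e.g. {wake-word voice, unknown sound}

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        out, _ = self.gru(seq)                               # (batch, time, hidden)
        pooled = self.pool(out.transpose(1, 2)).squeeze(-1)  # (batch, hidden)
        # softmax-o: probabilities used on the AR device at inference time.
        return torch.softmax(self.fc(pooled), dim=-1)

def should_wake(probs: torch.Tensor, wake_idx: int = 0, thr: float = 0.9) -> bool:
    """Wake the device when the wake-word probability exceeds e.g. 0.9 (batch of 1)."""
    return bool(probs[0, wake_idx].item() > thr)
```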
Alternatively, if it is determined that user voice exists, the relevant feature matrix may be converted into the time-series voice feature matrix. Optionally, the time-series voice feature matrix may be input into the pre-trained wake-up word voice detection sub-network to detect wake-up word voice, and if the detection result indicates that wake-up word voice exists, a wake-up operation is performed on the device corresponding to the wake-up word voice. Alternatively, if it is determined that no user voice exists, the operation of step S10 is performed again: a new time-series MFCC matrix is obtained and input into the pre-trained voice feature extraction network, the relevant feature matrix is obtained as output, and the relevant feature matrix is input into the pre-trained voice discrimination sub-network for judgment, so as to determine whether user voice exists.
Optionally, in one embodiment scenario, the calculation of the voice feature extraction network and the voice discrimination sub-network determines whether to continue with the calculation of the wake-up word voice detection sub-network. Since a real application scenario may contain a large amount of background sound that is not the wearer's voice, and the bone conduction microphone in the AR device has excellent noise immunity, a large amount of background sound in high-noise or quiet environments can be filtered out merely by computing, on the short time-series MFCC matrix, the voice feature extraction network and the voice discrimination sub-network. The wake-up word voice detection sub-network, which operates on the time-series voice feature matrix, can then further judge whether the user has spoken the wake-up word, and wake up the device corresponding to the wake-up word, such as the AR glasses device, when it is judged that the user has spoken the wake-up word.
In one possible embodiment, step S40 may include steps D11-D12:
Step D11, determining each phoneme in the time sequence voice feature matrix corresponding to the relevant feature matrix according to the pre-trained wake-up word voice detection sub-network, and combining different phonemes to obtain a phoneme combination;
And step D12, sequentially matching each phoneme combination with preset wake-up word voice, and if the phoneme combination matched with the wake-up word voice exists, determining that the wake-up word voice exists.
Alternatively, the relevant feature matrix may be input to a pre-trained wake-up word-speech detection sub-network and the individual speech segments detected in the wake-up word-speech detection sub-network. For example, each frame corresponds to a voice segment, and phonemes corresponding to the voice segment are detected and identified. Alternatively, a plurality of speech segments may be determined according to the time-series speech feature matrix, and a phoneme corresponding to each speech segment may be determined. And combining different phonemes to obtain a plurality of phoneme combinations. For example, if 5 phonemes are determined according to the time-series speech feature matrix, the 5 phonemes may be randomly combined according to the time sequence, so as to obtain a plurality of phoneme combinations.
Optionally, the wake-up word voice set in advance is acquired, and phonemes corresponding to the wake-up word voice and arrangement modes thereof are determined. And performing similarity calculation on the phonemes and the arrangement modes thereof corresponding to the wake-up word sounds, wherein if the target phoneme combination exists in each phoneme combination, the similarity between the phonemes and the arrangement modes of the target phoneme combination and the phonemes and the arrangement modes thereof corresponding to the wake-up word sounds is larger than a preset similarity threshold, such as larger than 90%. The target phone combination may be determined to be a phone combination that matches the wake word speech, i.e., the presence of the wake word speech may be determined to wake up the device corresponding to the wake up word speech.
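The matching in steps D11 to D12 can be sketched as follows. The use of Python's SequenceMatcher as the similarity measure and the representation of phonemes as strings are illustrative assumptions; only the idea of combining detected phonemes in time order and comparing each combination against the preset wake-up word phonemes with a 90% threshold comes from the description.

```python
from difflib import SequenceMatcher
from itertools import combinations

def phoneme_combinations(phonemes: list[str]) -> list[tuple[str, ...]]:
    """Combine detected phonemes, preserving time order, into candidate combinations."""
    combos = []
    for n in range(2, len(phonemes) + 1):
        combos.extend(combinations(phonemes, n))   # index order preserves time order
    return combos

def matches_wake_word(combo: tuple[str, ...], wake_phonemes: list[str],
                      threshold: float = 0.9) -> bool:
    """Similarity between a candidate combination and the preset wake-word phonemes."""
    sim = SequenceMatcher(None, list(combo), wake_phonemes).ratio()
    return sim > threshold

def wake_word_detected(phonemes: list[str], wake_phonemes: list[str]) -> bool:
    """True if any phoneme combination matches the preset wake-up word voice."""
    return any(matches_wake_word(c, wake_phonemes)
               for c in phoneme_combinations(phonemes))
```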
Alternatively, in an embodiment scenario, after step S20, step S21 may be performed, namely:
Step S21, inputting the time sequence voice feature matrix corresponding to the relevant feature matrix into a pre-trained wake-up word voice detection sub-network to detect, and carrying out wake-up operation on equipment corresponding to the wake-up word voice when the wake-up word voice is detected.
Optionally, after feature extraction is performed on the time sequence MFCC matrix through the pre-trained voice feature extraction network, a relevant feature matrix is obtained, and the relevant feature matrix is blended into a preset time sequence voice feature matrix within a certain time sequence range, so as to obtain a time sequence voice feature matrix corresponding to the relevant feature matrix. The time sequence voice feature matrix is directly input into a pre-trained wake-up word voice detection sub-network to detect wake-up word voice, if the probability of detecting the wake-up word voice is larger, for example, the probability is larger than a certain value, the wake-up word voice is determined to be detected, and wake-up operation is carried out on equipment corresponding to the wake-up word voice. If the probability of detecting the wake-up word voice is smaller, determining that the wake-up word voice is not detected, and not executing corresponding wake-up operation at the moment.
The embodiment provides a voice awakening method, which comprises the steps of determining a time sequence MFCC matrix through at least one frame of voice data acquired by a bone conduction microphone in AR equipment, inputting the MFCC matrix into a pre-trained voice feature extraction network, and outputting the MFCC matrix to obtain a relevant feature matrix. Because the bone conduction microphone is adopted for voice data acquisition, the superior anti-noise capability and privacy protection capability of the bone conduction microphone can be effectively utilized, and the acquired voice data is more accurate. And because the data quantity of the time sequence sampling points corresponding to at least one frame of voice data acquired by the bone conduction microphone is very large, the time sequence MFCC matrix is input into the voice feature extraction network by compressing at least one frame of voice into the time sequence MFCC matrix, so that the power consumption can be reduced while the voice recognition precision of the subsequent wake-up words is ensured, and the phenomenon of resource waste is avoided.
In addition, whether user voice exists in the at least one frame of voice data is determined according to the pre-trained voice discrimination sub-network and the relevant feature matrix; if user voice exists, the time-series voice feature matrix corresponding to the relevant feature matrix is input into the pre-trained wake-up word voice detection sub-network for detection, and when wake-up word voice is detected, the device corresponding to the wake-up word voice is woken up. That is, wake-up word voice detection requires the dual verification of user voice detection followed by wake-up word voice detection, so wake-up word voice detection is performed only when it is determined that user voice exists. This avoids running wake-up word voice detection continuously, reduces power consumption as much as possible, improves the precision of wake-up voice control, and eliminates misjudgments caused by unknown voice. The AR device can thus recognize voice data effectively and accurately, improving the accuracy of the wake-up result of the AR device.
In the second embodiment of the present application, the same or similar content as in the first embodiment of the present application may be referred to the above description, and will not be repeated. On this basis, referring to fig. 3, after step S20, the voice wake-up method further includes steps S50 to S60:
Step S50, determining a time dimension corresponding to the relevant feature matrix, determining a target bit in a preset time sequence voice feature matrix according to the time dimension, and deleting data in front of the target bit in the preset time sequence voice feature matrix;
Alternatively, the time dimension may be determined according to the order of the acquired data frames. The target bit may be a data bit representing the position of the data in the matrix.
Alternatively, the AR device may determine the time dimension corresponding to the relevant feature matrix, such as the corresponding number of frames. After the number of frames corresponding to the relevant feature matrix is determined, the data position corresponding to that number of frames can be located in the preset time-series voice feature matrix and used as the target bit. For example, if it is determined from the time dimension that the relevant feature matrix was obtained by processing three frames of data, then, starting from the first data bit of the preset time-series voice feature matrix, the data bit covering three frames' worth of data is counted off and used as the target bit. Optionally, all buffered data before the target bit in the preset time-series voice feature matrix is deleted.
Step S60, advancing the data after the target bit in the preset time sequence voice feature matrix according to the target bit, and accessing the relevant feature matrix at the tail end of the preset time sequence voice feature matrix to obtain the time sequence voice feature matrix.
Optionally, after all the data before the target bit has been deleted, all the data after the target bit in the preset time-series voice feature matrix in the buffer may be moved forward by the target bit. For example, if the target bit is the third bit of the preset time-series voice feature matrix, then after the data of the first three bits is deleted, all the data after the third bit is moved forward by three bits in its original time order. The last three bits of the preset time-series voice feature matrix are then empty, and the relevant feature matrix is placed into those last three bits, i.e. the relevant feature matrix is appended at the end of the preset time-series voice feature matrix. The preset time-series voice feature matrix containing the relevant feature matrix is taken as the time-series voice feature matrix.
Optionally, the features sequentially output by the voice feature extraction network according to the time sequence are stored in a preset time sequence voice feature matrix. The relevant feature matrix obtained through current calculation can be updated to a preset time sequence voice feature matrix, and the time sequence voice feature matrix corresponding to the relevant feature matrix is obtained.
Optionally, the first M bits of data in the buffer of the preset time-series voice feature matrix may be deleted. M is related to the time dimension output by the voice feature extraction network; for example, if the time dimension covers 3 frames of data, the length from the first bit to the M-th bit of the preset time-series voice feature matrix must be greater than or equal to the data length of those 3 frames. That is, if the data corresponding to the relevant feature matrix is 3 frames of data, the first 3 frames of data in the preset time-series voice feature matrix need to be deleted. The remaining data in the buffer is then moved forward by M bits in its original time order, and the relevant feature matrix is placed at the end of the preset time-series voice feature matrix to obtain the time-series voice feature matrix. The time-series voice feature matrix thus stores the voice feature sequence over a fixed past duration ending at the current relevant feature matrix.
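Steps S50 to S60 amount to a fixed-length FIFO update of the feature buffer: drop the oldest M feature rows, shift the rest forward, and append the new relevant feature matrix at the end. A minimal NumPy sketch, assuming the buffer and the relevant feature matrix are stored as 2-D arrays with one row per time step:

```python
import numpy as np

def update_feature_buffer(buffer: np.ndarray, relevant_feat: np.ndarray) -> np.ndarray:
    """Update the preset time-series voice feature matrix.

    buffer:        (T, n_feat) fixed-length feature matrix holding past features.
    relevant_feat: (M, n_feat) newly computed relevant feature matrix.
    The oldest M rows (before the target bit) are deleted, the remaining rows
    move forward, and the new features are appended at the end, keeping length T.
    """
    m = relevant_feat.shape[0]                # target bit = number of new time steps
    shifted = buffer[m:]                      # delete data before the target bit, shift the rest
    return np.concatenate([shifted, relevant_feat], axis=0)   # append at the end

# Example: a 10-step buffer updated with a 3-step relevant feature matrix
buf = np.zeros((10, 16))
new = np.ones((3, 16))
buf = update_feature_buffer(buf, new)         # still shape (10, 16); last 3 rows are new
```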
In this embodiment, a target bit in a preset time sequence voice feature matrix is determined according to a time dimension corresponding to the relevant feature matrix, data before the target bit is deleted, data after the target bit is moved forward by the target bit, then the relevant feature matrix is accessed to the tail end of the preset time sequence voice feature matrix to obtain the time sequence voice feature matrix, so that the time sequence voice feature matrix contains more voice features as much as possible, and the accuracy of subsequent wake-up word voice recognition is improved.
In the third embodiment of the present application, content that is the same as or similar to the above embodiments can be found in the description above and is not repeated. On this basis, the voice wake-up method further includes the following steps:
step S100, a training set is constructed according to training data recorded by an air conduction microphone, and a verification set is constructed according to verification voice data recorded by a bone conduction microphone;
Optionally, because the amount of data required to train a deep learning model is huge, and recording a large amount of background sound, wake-up word voice and non-wake-up word voice with the bone conduction microphone is costly and resource-consuming, transfer learning of the wake-up word voice control model can be performed based on data collected by the air conduction microphone. Optionally, the spectrum of the voice data recorded by the bone conduction microphone at the AR glasses nose pad has its energy concentrated mainly in the mid-to-low frequency part, whereas the frequency range of voice data recorded by the air conduction microphone is very complete; therefore, low-pass filtering can be applied to the air conduction microphone voice data to simulate the voice data collected by the bone conduction microphone and use it in training the wake-up word voice control model, i.e. a training set is constructed from the voice data recorded by the air conduction microphone.
Optionally, a verification set may also be constructed from the voice data recorded by the bone conduction microphone in the AR device; the voice data recorded by the bone conduction microphone serve as the verification voice data in the training and verification stage.
In one possible embodiment, step S100 may include steps E11-E14:
Step E11, acquiring different types of noise data recorded by the air conduction microphone, and creating a plurality of different acoustic environments based on the different types of noise data;
Optionally, during the training phase, different kinds of noise data may be recorded by the air conduction microphone in the AR device, such as traffic noise, industrial noise, construction noise, and the like.
Optionally, when creating an acoustic environment, one or more kinds of noise data may be selected; for example, an environment containing only traffic noise may be constructed as one acoustic environment, and an environment containing both traffic noise and construction noise as another.
Step E12, constructing different background sound data according to different acoustic environments;
Optionally, a bone conduction microphone in the AR device may record these created acoustic environments, i.e., each acoustic environment may correspond to one piece of background sound data. For example, if the acoustic environment is an environment containing traffic noise, the corresponding background sound data contains traffic noise.
Step E13, acquiring first voice data recorded by the air conduction microphone in a quiet environment, and performing low-pass filtering processing on the first voice data to obtain low-pass filtered voice data;
Optionally, in a quiet environment, the air conduction microphone in the AR device records voice data, and uses the recorded voice data as the first voice data. Alternatively, the first voice data may be subjected to low-pass filtering processing, so as to obtain low-pass filtered voice data.
Alternatively, the low pass filtering may remove or attenuate high frequency components in the signal, preserving and highlighting the main content of the speech signal, mainly the fundamental frequency and envelope information of the speech. The speech signal is mainly concentrated in the lower frequency range, so low-pass filtering helps to eliminate noise (e.g. hissing, high frequency howling, etc.) and other unwanted high frequency disturbances, while maintaining speech intelligibility.
Optionally, various parameter values and filter types may be selected for the low-pass filtering process, and the low-pass filtered data may then be used as a stand-in for voice data recorded by the bone conduction microphone.
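As a rough illustration of step E13, the sketch below low-pass filters an air-conduction recording so that, like a bone conduction signal, its energy is concentrated in the mid-to-low frequency range; the cutoff frequency, filter order and function name are assumptions chosen for illustration rather than values specified by the embodiment.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def simulate_bone_conduction(air_mic_signal: np.ndarray,
                             sample_rate: int = 16000,
                             cutoff_hz: float = 2000.0,
                             order: int = 6) -> np.ndarray:
    """Low-pass filter an air-conduction recording so that its energy, like a
    bone conduction microphone signal, is concentrated at mid-to-low frequencies."""
    nyquist = 0.5 * sample_rate
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    return filtfilt(b, a, air_mic_signal)    # zero-phase filtering, no added delay
```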
Step E14, for each piece of background sound data, fusing the background sound data and the low-pass filtered voice data to obtain training data, and constructing a training data set from the training data.
Optionally, the same operation may be applied to each piece of background sound data. For example, the background sound data and the low-pass filtered voice data are mixed at a random signal-to-noise ratio to obtain one training sample, and multiple such training samples are then aggregated into the training data set.
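A minimal sketch of the random signal-to-noise-ratio fusion in step E14 is given below; the SNR range and helper name are hypothetical.

```python
import numpy as np

def mix_at_random_snr(speech: np.ndarray, background: np.ndarray,
                      snr_db_range=(0.0, 20.0), rng=None) -> np.ndarray:
    """Scale the background sound so that the mixture has a randomly drawn
    signal-to-noise ratio, then add it to the low-pass filtered speech."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    noise = background[: len(speech)]                        # align lengths
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```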
Step S200, performing repeated iterative training on a preset voice feature extraction network, a preset wake-up word voice detection sub-network and a preset voice discrimination sub-network according to a training set, wherein in each iterative training process, the preset voice feature extraction network and the preset voice discrimination sub-network are subjected to joint training, and the voice feature extraction network and the preset wake-up word voice detection sub-network after joint training are subjected to joint training;
optionally, a wake word voice control model may be provided, where the wake word voice control model includes a voice feature extraction network, a voice discrimination sub-network, and a wake word voice detection sub-network.
Optionally, the network structure of the wake-up word voice control model is carefully designed. In the voice feature extraction network, a batch normalization layer follows the convolution calculation, which reduces the training difficulty of the wake-up word voice control model without adding any computation cost on the chip side. Adding a batch normalization layer after the time-domain convolution constrains the range of the output values. The numbers of rows and columns of the audio time-series MFCC matrix differ greatly, that is, there is a large amount of redundant information in the time dimension, so max pooling can be performed along the time dimension to remove this redundancy. A gated recurrent unit (GRU) layer is added to the wake-up word voice detection sub-network to capture the associations between wake-up word phonemes along the time dimension.
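Purely to make the structure described above concrete, the following PyTorch sketch arranges the named layers in one possible way (time-domain convolution followed by batch normalization, max pooling along the time dimension, and a GRU layer in the wake-up word detection branch with the softmax-t and softmax-o heads); all layer sizes and module names are invented for illustration and do not reproduce the actual model of the embodiment.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Convolution over the MFCC time axis, followed by batch normalization
    and max pooling along the time dimension to remove redundancy."""
    def __init__(self, n_mfcc=40, channels=64):
        super().__init__()
        self.conv = nn.Conv1d(n_mfcc, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(channels)       # constrains the output value range
        self.pool = nn.MaxPool1d(kernel_size=2)  # max pooling in the time dimension

    def forward(self, x):                         # x: (batch, n_mfcc, time)
        return self.pool(torch.relu(self.bn(self.conv(x))))

class WakeWordDetector(nn.Module):
    """GRU layer capturing associations between wake-up word phonemes in time,
    with a per-time-step head (softmax-t) and a clip-level head (softmax-o)."""
    def __init__(self, channels=64, hidden=64, n_phoneme_classes=3, n_clip_classes=2):
        super().__init__()
        self.gru = nn.GRU(channels, hidden, batch_first=True)
        self.head_t = nn.Linear(hidden, n_phoneme_classes)  # blank / unknown / wake-word phoneme
        self.head_o = nn.Linear(hidden, n_clip_classes)     # unknown sound / wake-up word speech

    def forward(self, feats):                     # feats: (batch, time, channels)
        h, _ = self.gru(feats)
        return self.head_t(h), self.head_o(h[:, -1])

class SpeechDiscriminator(nn.Module):
    """Small branch that outputs the probability that user speech is present."""
    def __init__(self, channels=64):
        super().__init__()
        self.fc = nn.Linear(channels, 1)

    def forward(self, feats):                     # feats: (batch, time, channels)
        return torch.sigmoid(self.fc(feats.mean(dim=1))).squeeze(-1)
```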
Optionally, the preset voice feature extraction network, the preset wake-up word voice detection sub-network and the preset voice discrimination sub-network are iteratively trained multiple times according to the training set. In each iteration, the voice feature extraction network and the wake-up word voice detection sub-network may first be jointly trained; after this joint training is completed, the relevant parameters of the two networks are adjusted accordingly, giving an adjusted voice feature extraction network and an adjusted wake-up word voice detection sub-network. The voice feature extraction network and the voice discrimination sub-network are then jointly trained, where the voice feature extraction network used is the adjusted voice feature extraction network from the preceding joint training of the feature extraction network and the wake-up word voice detection sub-network.
Optionally, when the voice feature extraction network and the wake-up word voice detection sub-network are jointly trained, the time-series data in the training set may be truncated or padded to a fixed duration of sampling points, then split into overlapping, equally sized frames to obtain voice data with a fixed number of frames. The mel cepstrum coefficients of each frame are computed and arranged in chronological order to form a training-stage time-series MFCC matrix, which serves as the first training time-series MFCC matrix. Optionally, the first training time-series MFCC matrix is grouped in chronological order, for example with every M frames (1 ≤ M ≤ 5) forming one group, i.e., each group of the first training time-series MFCC matrix has M frames in its time dimension. Each group of data is input into the preset voice feature extraction network separately, the outputs of the voice feature extraction network are combined in chronological order into the training-stage time-series voice feature matrix, and the preset wake-up word voice detection sub-network finally outputs the wake-up word voice probability and the unknown sound probability. Combining these outputs with the labels of the input data in the training set and the loss function, the parameters of the voice feature extraction network and the wake-up word voice detection sub-network are adjusted after back propagation.
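A rough sketch of how the training-stage time-series MFCC matrix could be built and grouped, assuming librosa is available; the fixed duration, frame sizes, MFCC dimensionality and group size M are illustrative values only.

```python
import numpy as np
import librosa

def build_training_mfcc_groups(audio: np.ndarray, sample_rate: int = 16000,
                               fixed_len_s: float = 2.0, n_mfcc: int = 40,
                               m_frames: int = 3) -> np.ndarray:
    """Truncate or pad to a fixed length, compute per-frame MFCCs in time order,
    and split the resulting time-series MFCC matrix into groups of M frames."""
    target = int(fixed_len_s * sample_rate)
    audio = np.pad(audio, (0, max(0, target - len(audio))))[:target]
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)     # (n_mfcc, time)
    mfcc = mfcc.T                                              # (time, n_mfcc)
    usable = (mfcc.shape[0] // m_frames) * m_frames
    return mfcc[:usable].reshape(-1, m_frames, n_mfcc)         # groups of M frames
```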
Optionally, when the voice feature extraction network and the voice discrimination sub-network are jointly trained, the time-series data in the training set may likewise be truncated or padded to a fixed duration of sampling points and split into overlapping, equally sized frames to obtain voice data with a fixed number of frames. The mel cepstrum coefficients of each frame are computed and arranged in chronological order to form a training-stage time-series MFCC matrix, which serves as the second training time-series MFCC matrix. Optionally, the second training time-series MFCC matrix is grouped in chronological order, for example with every M frames (1 ≤ M ≤ 5) forming one group, i.e., each group of the second training time-series MFCC matrix has M frames in its time dimension. The second training time-series MFCC matrices of different data are input into the preset voice feature extraction network, the output of the preset voice feature extraction network is taken as the input of the preset voice discrimination sub-network, and the probability that user voice is present is finally determined from the output of the preset voice discrimination sub-network. Combining this output with the labels of the input data in the training set and the loss function, the parameters of the voice feature extraction network and the voice discrimination sub-network are adjusted after back propagation.
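The alternating joint training described above could be organized roughly as sketched below, reusing the hypothetical modules from the earlier architecture sketch; the data loaders, optimizers and loss helpers are placeholders rather than the embodiment's actual training code.

```python
def train_one_iteration(feature_net, wake_net, discriminator,
                        wake_loader, vad_loader, opt_wake, opt_vad,
                        wake_loss_fn, vad_loss_fn):
    # 1) Joint training of the voice feature extraction network and the
    #    wake-up word voice detection sub-network.
    for mfcc, phoneme_labels, clip_labels in wake_loader:
        feats = feature_net(mfcc).transpose(1, 2)       # (batch, time, channels)
        out_t, out_o = wake_net(feats)
        loss = wake_loss_fn(out_t, out_o, phoneme_labels, clip_labels)
        opt_wake.zero_grad()
        loss.backward()
        opt_wake.step()

    # 2) Joint training of the (now adjusted) voice feature extraction network
    #    and the voice discrimination sub-network.
    for mfcc, speech_labels in vad_loader:
        feats = feature_net(mfcc).transpose(1, 2)
        prob = discriminator(feats)
        loss = vad_loss_fn(prob, speech_labels.float())
        opt_vad.zero_grad()
        loss.backward()
        opt_vad.step()
```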
Optionally, based on a purpose-built loss function for the wake-up word voice control model, the error between the output value computed by forward inference and the true label value is calculated and back-propagated to the wake-up word voice detection sub-network, the voice feature extraction network and the voice discrimination sub-network to update the related weight parameters, thereby training the wake-up word voice control model.
Optionally, the label value of the voice discrimination sub-network may be set to 0 or 1: when the MFCC matrix stores background sound, the label value of the voice discrimination sub-network is 0; when the MFCC matrix stores noisy speech or clean speech, the label value of the voice discrimination sub-network is 1. Optionally, the loss function of the voice discrimination sub-network may be:
Lsd = -(1/N) * Σ_{n=1..N} γn * [yn*log(ρn) + (1-yn)*log(1-ρn)];
where N represents the number of training samples in a batch, yn is the label value of the n-th sample, and γn is a weight factor used to address the imbalance in data volume between the speech and environmental-sound categories, γn ∈ [0,1]: when the MFCC matrix stores noisy speech or clean speech, γn = γ, and when the MFCC matrix stores environmental sound, γn = 1 - γ; ρn is the output value of the forward calculation of the voice discrimination sub-network.
Optionally, the wake-up word voice detection sub-network has two outputs, softmax-t and softmax-o. The softmax-o output is used both in training and in application; under the input of the current time-series voice feature matrix, it gives the probabilities of belonging to wake-up word speech and to unknown sound respectively, and its label is a class label, for example 0 for unknown sound and 1 for wake-up word speech. After one-hot encoding, these labels participate in the training of the wake-up word voice detection sub-network and the voice feature extraction network. The softmax-t output is used only in training; its label is text data, for example 0 for the blank class, 1 for a phoneme of an unknown word and 2 for a phoneme of the wake-up word. After one-hot encoding, these labels also participate in the training of the wake-up word voice detection sub-network and the voice feature extraction network. Optionally, the wake-up word voice detection sub-network has the following loss function:
Lawaken = Lctc + Lce, with Lctc = -log Σ_{π ∈ B⁻¹(l)} Π_t q_t(π_t) and Lce = -Σ y*log(p);
where q_t is the output probability of the wake-up word voice detection sub-network softmax-t at time step t, l represents the text label, and B⁻¹(l) represents the set of all possible alignment paths corresponding to the text label l; p is the output of the wake-up word voice detection sub-network softmax-o, and y is the class label corresponding to softmax-o.
Alternatively, the overall loss function of the wake word speech control model may be as follows:
Ltotal = (α*Lsd + β*Lawaken);
Here α and β are the weights of the voice discrimination sub-network loss and the wake-up word voice detection sub-network loss, respectively, in the total loss. The value Ltotal calculated by the total loss function is back-propagated to every layer in the wake-up word voice control network, and the parameters in each network layer are learned and updated, so that the network outputs gradually approach the labels and a correct judgment result is finally produced.
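To make the total loss concrete, the sketch below combines a weighted binary cross-entropy for the voice discrimination branch with a CTC term and a cross-entropy term for the wake-up word branch, following the reconstructed formulas above; it is one illustrative reading of those formulas, with assumed tensor shapes, rather than the embodiment's exact implementation.

```python
import torch.nn.functional as F

def total_loss(vad_prob, vad_label, gamma,
               out_t_logits, text_targets, input_lengths, target_lengths,
               out_o_logits, clip_label, alpha=0.5, beta=0.5):
    # Lsd: weighted binary cross-entropy of the voice discrimination sub-network,
    # with gamma weighting the speech class and (1 - gamma) the environmental-sound class.
    weight = vad_label * gamma + (1.0 - vad_label) * (1.0 - gamma)
    l_sd = F.binary_cross_entropy(vad_prob, vad_label, weight=weight)

    # Lawaken: CTC loss on the softmax-t output (blank class = 0) plus
    # cross-entropy on the softmax-o output.
    log_probs = F.log_softmax(out_t_logits, dim=-1).transpose(0, 1)  # (time, batch, classes)
    l_ctc = F.ctc_loss(log_probs, text_targets, input_lengths, target_lengths, blank=0)
    l_ce = F.cross_entropy(out_o_logits, clip_label)
    l_awaken = l_ctc + l_ce

    return alpha * l_sd + beta * l_awaken
```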
Step S300, after each time of iterative training is completed, evaluating the trained voice feature extraction network, the trained wake-up word voice detection sub-network and the trained voice discrimination sub-network according to the verification set to obtain an evaluation performance score;
Optionally, after one round of joint iterative training of the preset voice feature extraction network and the preset voice discrimination sub-network and one round of joint iterative training of the preset voice feature extraction network and the preset wake-up word voice detection sub-network, at least one verification operation can be performed using the verification set. The voice data in the verification set are input to the voice feature extraction network obtained after the iterative training for feature extraction, the voice discrimination sub-network detects and judges whether user voice is present, and, when user voice is determined to be present, the wake-up word voice detection sub-network performs wake-up word voice detection. The performance of this verification operation is scored to obtain an evaluation performance score, i.e., each iteration of training has at least one evaluation performance score corresponding to it.
Optionally, the best-performing model may be selected across verification operations. The model parameters (e.g., weight parameters) of the model that performs best on voice data recorded by the bone conduction microphone at the AR glasses nose pad are saved and deployed on the AR glasses device.
Optionally, the evaluation performance score may be calculated as follows:
totalacc=-posacc*log10(1-negacc+eps);
Here totalacc represents the scored performance of the trained model on the verification set; the larger the value, the better. posacc is the accuracy on wake-up word speech in the verification set, negacc is the accuracy on unknown classes of sound in the verification set, and eps is a small constant used to prevent the totalacc value from becoming excessively large.
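For reference, the evaluation score above can be computed directly; the function name and the eps value are assumptions.

```python
import math

def evaluation_score(pos_acc: float, neg_acc: float, eps: float = 1e-6) -> float:
    """totalacc = -posacc * log10(1 - negacc + eps); larger is better."""
    return -pos_acc * math.log10(1.0 - neg_acc + eps)

# Example: 95% accuracy on wake-up word speech, 99% on unknown sounds.
print(evaluation_score(0.95, 0.99))   # ≈ 1.90
```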
Step S400, selecting the maximum evaluation performance score from the evaluation performance scores corresponding to each iteration training, taking the voice feature extraction network corresponding to the maximum evaluation performance score as a pre-trained voice feature extraction network, taking the voice discrimination sub-network corresponding to the maximum evaluation performance score as a pre-trained voice discrimination sub-network, and taking the wake-up word voice detection sub-network corresponding to the maximum evaluation performance score as a pre-trained wake-up word voice detection sub-network.
Optionally, under the condition that iterative training reaches a preset number of times or the overall performance of the model is detected to be optimal, selecting the maximum evaluation performance score from the evaluation performance scores corresponding to each iterative training, taking a voice feature extraction network corresponding to the maximum evaluation performance score as a pre-trained voice feature extraction network, taking a voice discrimination sub-network corresponding to the maximum evaluation performance score as a pre-trained voice discrimination sub-network, and taking a wake-up word voice detection sub-network corresponding to the maximum evaluation performance score as a pre-trained wake-up word voice detection sub-network.
Optionally, in the model training stage, in order to obtain a better ability to distinguish wake-up word speech, transfer learning is performed on the phonemes that make up the wake-up word, based on the pronunciations corresponding to the text labels of a large-scale speech recognition dataset: the label value at a blank is 0, the label value at an unknown phoneme is 1, and the label value at a wake-up word phoneme is 2. In this way, a strong distinction between wake-up word speech, unknown speech and background sound is achieved at the phoneme level based on the large-scale dataset, further improving the precision of wake-up word voice control. In addition, the data volume requirements for wake-up word speech and unknown speech are reduced, the deep-learning wake-up word voice control model can learn well from a small amount of voice data, and the wake-up word voice control algorithm of this embodiment performs excellently in both precision and generalization ability.
During the training of the voice discrimination sub-network, background sound and human voice are distinguished based on irrelevant voice data with random noise, a small amount of wake-up word voice data and noise data, completing the first round of verification of the wake-up word speech; during the training of the wake-up word voice detection sub-network, irrelevant sounds and combinations of wake-up word phonemes are distinguished based on speech recognition data, a small amount of wake-up word voice data and noise data, completing the second round of verification of the wake-up word speech. Optionally, the voice discrimination sub-network can reduce power consumption to a certain extent when the wearer is not speaking, because the computation it requires is very small compared with the whole wake-up word voice control network, and it can filter out large amounts of background sound when wake-up word voice control is not in use. It also improves the accuracy of wake-up word voice control to a certain extent, and the wake-up word voice detection sub-network additionally learns, at the phoneme level, from a large number of negative examples that differ from the wake-up word speech as well as from the correct combinations of wake-up word phonemes.
In this embodiment, the voice data collected by the bone conduction microphone are simulated from training data recorded by the air conduction microphone, a training set is constructed from these data, and the training set is used to perform multiple rounds of joint iterative training on the preset voice feature extraction network, the preset wake-up word voice detection sub-network and the preset voice discrimination sub-network. A verification set is constructed from the verification voice data recorded by the bone conduction microphone, verification is performed on this set to determine evaluation performance scores, and the pre-trained voice feature extraction network, the pre-trained voice discrimination sub-network and the pre-trained wake-up word voice detection sub-network are determined according to the maximum evaluation performance score. In this way, with a small amount of voice data recorded by the bone conduction microphone as the target, the large amount of existing training data recorded by the air conduction microphone is mined, which substantially improves how well the model fits the voice characteristics of the bone conduction microphone wake-up word data. Consequently, when voice wake-up is subsequently performed according to the pre-trained voice discrimination sub-network, the pre-trained voice feature extraction network and the pre-trained wake-up word voice detection sub-network, voice data can be recognized effectively and accurately, improving the accuracy of the wake-up result of the AR device.
In addition, referring to fig. 4, the overall flow of voice wake-up can be divided into three parts: updating the time-series MFCC matrix and the time-series voice feature matrix, judging whether the user is speaking, and detecting the wake-up word speech.
Optionally, the bone conduction microphone frame data are read, the mel cepstrum coefficients of the data frame are calculated and used to update the time-series MFCC matrix, i.e., the original time-series MFCC matrix is refreshed. The updated time-series MFCC matrix is input to the voice feature extraction network to obtain a feature matrix related to the voice information; this feature matrix is input to the voice discrimination sub-network and is simultaneously used to update the time-series voice feature matrix. The voice discrimination sub-network judges the relevant feature matrix over a fixed short time range to determine whether user voice exists. If it does, it is determined that the user is speaking, and the time-series voice feature matrix that includes the relevant feature matrix is input to the wake-up word voice detection sub-network to detect wake-up word speech; that is, the time-series voice feature matrix over a fixed long time range is examined to determine whether wake-up word speech segments exist. If the relevant phonemes are successfully combined into the wake-up word speech, the corresponding device is awakened; if they are not, the method returns and waits for a new task. Likewise, if no user speech is detected, it is determined that the user is not speaking, and the method returns and waits for a new task.
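The runtime flow of fig. 4 can be summarized as the loop sketched below; the frame reader, MFCC routine, buffer sizes and threshold are placeholders standing in for the embodiment's actual components.

```python
import numpy as np

def wake_word_loop(read_frame, compute_mfcc, feature_net, discriminator,
                   wake_detector, wake_device, speech_threshold=0.5):
    mfcc_buffer = np.zeros((30, 40), dtype=np.float32)     # time-series MFCC matrix
    feat_buffer = np.zeros((60, 64), dtype=np.float32)     # time-series voice feature matrix
    while True:
        frame = read_frame()                                # bone conduction microphone frame
        mfcc = compute_mfcc(frame)                          # MFCC of the new frame(s)
        mfcc_buffer = np.concatenate([mfcc_buffer[len(mfcc):], mfcc], axis=0)

        feats = feature_net(mfcc_buffer)                    # relevant feature matrix
        feat_buffer = np.concatenate([feat_buffer[len(feats):], feats], axis=0)

        if discriminator(feats) <= speech_threshold:        # no user speech: wait for new data
            continue
        if wake_detector(feat_buffer):                      # wake-up word speech detected
            wake_device()
```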
The application also provides a voice wake-up device, referring to fig. 5, the voice wake-up device comprises:
An acquisition module a100, configured to acquire at least one frame of voice data acquired by the bone conduction microphone, and determine a time sequence MFCC matrix according to the at least one frame of voice data;
The feature extraction module A200 is used for inputting the time sequence MFCC matrix into a pre-trained voice feature extraction network and outputting to obtain a relevant feature matrix;
the determining module A300 is used for determining whether user voice exists in at least one frame of voice data according to the pre-trained voice discrimination sub-network and the related feature matrix;
And the wake-up module A400 is used for inputting the time sequence voice feature matrix corresponding to the relevant feature matrix into the pre-trained wake-up word voice detection sub-network for detection if the user voice exists, and waking up equipment corresponding to the wake-up word voice when the wake-up word voice is detected.
In an embodiment, the obtaining module a100 is configured to:
Determining mel cepstrum coefficients corresponding to at least one frame of voice data;
and arranging mel cepstrum coefficients according to time sequence, and determining a time sequence MFCC matrix according to the arrangement result.
In an embodiment, the feature extraction module a200 is configured to:
Determining a time dimension corresponding to the related characteristic matrix, determining a target bit in a preset time sequence voice characteristic matrix according to the time dimension, and deleting data in front of the target bit in the preset time sequence voice characteristic matrix;
And advancing the data after the target bit in the preset time sequence voice feature matrix according to the target bit, and accessing the relevant feature matrix at the tail end of the preset time sequence voice feature matrix to obtain the time sequence voice feature matrix.
In one embodiment, wake module a400 is configured to:
Determining each phoneme in the time sequence voice feature matrix corresponding to the relevant feature matrix according to the pre-trained wake-up word voice detection sub-network, and combining different phonemes to obtain a phoneme combination;
and sequentially matching each phoneme combination with preset wake-up word voice, and if the phoneme combination matched with the wake-up word voice exists, determining that the wake-up word voice exists.
In one embodiment, the determining module a300 is configured to:
Inputting the relevant feature matrix into a pre-trained voice discrimination sub-network, and outputting to obtain a prediction probability value;
if the predicted probability value is larger than the preset threshold value, determining that user voice exists in at least one frame of voice data.
In an embodiment, an air conduction microphone is disposed in the AR device, and the voice wake-up device includes a training module configured to:
constructing a training set according to training data recorded by the air conduction microphone, and constructing a verification set according to verification voice data recorded by the bone conduction microphone;
Performing repeated iterative training on a preset voice feature extraction network, a preset wake-up word voice detection sub-network and a preset voice judgment sub-network according to a training set, wherein in each iterative training process, the preset voice feature extraction network and the preset voice judgment sub-network are subjected to joint training, and the voice feature extraction network and the preset wake-up word voice detection sub-network after joint training are subjected to joint training;
After each time of iterative training is completed, evaluating the trained voice feature extraction network, the trained wake-up word voice detection sub-network and the trained voice judgment sub-network according to the verification set to obtain an evaluation performance score;
Selecting the maximum evaluation performance score from the evaluation performance scores corresponding to each iteration training, taking a voice feature extraction network corresponding to the maximum evaluation performance score as a pre-trained voice feature extraction network, taking a voice discrimination sub-network corresponding to the maximum evaluation performance score as a pre-trained voice discrimination sub-network, and taking a wake-up word voice detection sub-network corresponding to the maximum evaluation performance score as a pre-trained wake-up word voice detection sub-network.
In one embodiment, the training module is configured to:
different types of noise data recorded by the air conduction microphone are acquired, and a plurality of different acoustic environments are created based on the different types of noise data;
Constructing different background sound data according to different acoustic environments;
acquiring first voice data recorded by an air conduction microphone in a quiet environment, and performing low-pass filtering processing on the first voice data to obtain low-pass filtering voice data;
And fusing the background sound data and the low-pass filtered voice data aiming at each background sound data to obtain training data, and constructing a training data set according to the training data.
By adopting the voice wake-up method of the above embodiment, the voice wake-up device provided by the application can solve the technical problem that an AR device cannot effectively and accurately recognize voice data, which affects the wake-up result of the AR device. Compared with the prior art, the voice wake-up device has the same beneficial effects as the voice wake-up method provided by the above embodiment, and the other technical features of the voice wake-up device are the same as those disclosed by the method of the above embodiment, and are not repeated here.
The present application provides an AR device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the voice wake-up method in the first embodiment.
Referring now to fig. 6, a schematic diagram of an AR device suitable for implementing embodiments of the present application is shown. The AR device in the embodiments of the present application may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), an in-vehicle terminal (e.g., an in-vehicle navigation terminal), and the like, as well as fixed terminals such as a digital TV, a desktop computer, and the like. The AR device shown in fig. 6 is only one example and should not impose any limitation on the functionality and scope of use of embodiments of the present application.
As shown in fig. 6, the AR device may include a processing means 1001 (e.g., a central processing unit, a graphics processor, etc.), which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage means 1003 into a random access memory (RAM) 1004. Various programs and data required for the operation of the AR device are also stored in the RAM 1004. The processing means 1001, the ROM 1002 and the RAM 1004 are connected to each other by a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus. In general, the following may be connected to the I/O interface 1006: input means 1007 including, for example, a touch screen, touchpad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, and the like; output means 1008 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage means 1003 including, for example, a magnetic tape, a hard disk, and the like; and communication means 1009. The communication means 1009 may allow the AR device to communicate wirelessly or by wire with other devices to exchange data. While an AR device having various components is shown in the figure, it should be understood that not all of the illustrated components are required to be implemented or provided; more or fewer components may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication device, or installed from the storage device 1003, or installed from the ROM 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.
The AR equipment provided by the application can solve the technical problem that the AR equipment cannot effectively and accurately recognize voice data by adopting the voice awakening method in the embodiment, thereby influencing the awakening result of the AR equipment. Compared with the prior art, the beneficial effects of the AR device provided by the application are the same as those of the voice wake-up method provided by the above embodiment, and other technical features of the AR device are the same as those disclosed by the method of the above embodiment, and are not described in detail herein.
It is to be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The present application provides a computer readable storage medium having computer readable program instructions (i.e., a computer program) stored thereon for performing the voice wakeup method of the above embodiments.
The computer-readable storage medium provided by the present application may be, for example, a USB flash drive, but is not limited to this; it may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device. Program code embodied on a computer-readable storage medium may be transmitted using any appropriate medium, including but not limited to: wire, optical fiber cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The above computer readable storage medium may be contained in an AR device; or may exist alone without being assembled into an AR device.
The computer-readable storage medium carries one or more programs that, when executed by the AR device, enable the AR device to perform the flow of steps in the voice wakeup method described above.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware. The name of a module does not, in some cases, constitute a limitation of the module itself.
The readable storage medium provided by the application is a computer readable storage medium, and the computer readable storage medium stores computer readable program instructions (namely computer program) for executing the voice wake-up method, so that the technical problem that the AR equipment cannot effectively and accurately recognize voice data and further influence the wake-up result of the AR equipment can be solved. Compared with the prior art, the beneficial effects of the computer readable storage medium provided by the application are the same as those of the voice wake-up method provided by the above embodiment, and are not described herein.
The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the voice wakeup method as described above.
The computer program product provided by the application can solve the technical problem that the AR equipment cannot effectively and accurately recognize voice data, thereby influencing the wake-up result of the AR equipment. Compared with the prior art, the beneficial effects of the computer program product provided by the application are the same as those of the voice wake-up method provided by the above embodiment, and are not described herein.
The foregoing description is only a partial embodiment of the present application, and is not intended to limit the scope of the present application, and all the equivalent structural changes made by the description and the accompanying drawings under the technical concept of the present application, or the direct/indirect application in other related technical fields are included in the scope of the present application.

Claims (10)

1. A method of waking up speech, the method being applied to an AR device in which a bone conduction microphone is provided, the method comprising:
acquiring at least one frame of voice data acquired by a bone conduction microphone, and determining a time sequence MFCC matrix according to the at least one frame of voice data;
inputting the time sequence MFCC matrix to a pre-trained voice feature extraction network, and outputting to obtain a relevant feature matrix;
Determining whether user voice exists in the at least one frame of voice data according to the pre-trained voice discrimination sub-network and the related feature matrix;
if the user voice exists, inputting the time sequence voice feature matrix corresponding to the relevant feature matrix into a pre-trained wake-up word voice detection sub-network to detect, and waking up equipment corresponding to the wake-up word voice when the wake-up word voice is detected.
2. The method of claim 1, wherein the step of determining a time-ordered MFCC matrix from the at least one frame of speech data comprises:
determining mel cepstrum coefficients corresponding to the at least one frame of voice data;
And arranging the mel cepstrum coefficients according to time sequence, and determining a time sequence MFCC matrix according to an arrangement result.
3. The method of claim 1, wherein after the step of inputting the time-series MFCC matrix to a pre-trained speech feature extraction network and outputting a correlation feature matrix, further comprising:
determining a time dimension corresponding to the related feature matrix, determining a target bit in a preset time sequence voice feature matrix according to the time dimension, and deleting data in front of the target bit in the preset time sequence voice feature matrix;
And advancing the data after the target bit in the preset time sequence voice feature matrix according to the target bit, and accessing the relevant feature matrix at the tail end of the preset time sequence voice feature matrix to obtain the time sequence voice feature matrix.
4. The method of claim 1, wherein the step of inputting the time-series speech feature matrix corresponding to the correlation feature matrix to a pre-trained wake-up word speech detection sub-network for detection comprises:
Determining each phoneme in a time sequence voice feature matrix corresponding to the related feature matrix according to a pre-trained wake-up word voice detection sub-network, and combining different phonemes to obtain a phoneme combination;
And sequentially matching each phoneme combination with preset wake-up word voice, and if the phoneme combination matched with the wake-up word voice exists, determining that the wake-up word voice exists.
5. The method of claim 1, wherein said step of determining whether user speech is present in said at least one frame of speech data based on a pre-trained speech discrimination sub-network and said correlation feature matrix comprises:
Inputting the relevant feature matrix into a pre-trained voice discrimination sub-network, and outputting to obtain a prediction probability value;
And if the predicted probability value is greater than a preset threshold value, determining that user voice exists in the at least one frame of voice data.
6. The method of any one of claims 1 to 5, wherein an air conduction microphone is disposed in the AR device, the method further comprising:
Constructing a training set according to training data recorded by the air conduction microphone, and constructing a verification set according to verification voice data recorded by the bone conduction microphone;
Performing repeated iterative training on a preset voice feature extraction network, a preset wake-up word voice detection sub-network and a preset voice judgment sub-network according to the training set, wherein in each iterative training process, the preset voice feature extraction network and the preset voice judgment sub-network are subjected to joint training, and the voice feature extraction network and the preset wake-up word voice detection sub-network after joint training are subjected to joint training;
After each time of iterative training is completed, evaluating a trained voice feature extraction network, a trained wake-up word voice detection sub-network and a trained voice judgment sub-network according to the verification set to obtain an evaluation performance score;
Selecting the maximum evaluation performance score from the evaluation performance scores corresponding to each iteration training, taking a voice feature extraction network corresponding to the maximum evaluation performance score as a pre-trained voice feature extraction network, taking a voice discrimination sub-network corresponding to the maximum evaluation performance score as a pre-trained voice discrimination sub-network, and taking a wake-up word voice detection sub-network corresponding to the maximum evaluation performance score as a pre-trained wake-up word voice detection sub-network.
7. The method of claim 6, wherein the step of constructing a training set from training data recorded by the air conduction microphone comprises:
different types of noise data recorded by the air conduction microphone are acquired, and a plurality of different acoustic environments are created based on the different types of noise data;
Constructing different background sound data according to different acoustic environments;
acquiring first voice data recorded by the air conduction microphone in a quiet environment, and performing low-pass filtering processing on the first voice data to obtain low-pass filtering voice data;
and fusing the background sound data and the low-pass filtered voice data aiming at each piece of background sound data to obtain training data, and constructing a training data set according to the training data.
8. A voice wake-up device, the device being disposed in an AR apparatus, in which a bone conduction microphone is disposed, the device comprising:
the acquisition module is used for acquiring at least one frame of voice data acquired by the bone conduction microphone and determining a time sequence MFCC matrix according to the at least one frame of voice data;
the feature extraction module is used for inputting the time sequence MFCC matrix into a pre-trained voice feature extraction network and outputting the time sequence MFCC matrix to obtain a relevant feature matrix;
The determining module is used for determining whether user voice exists in the at least one frame of voice data according to the pre-trained voice discrimination sub-network and the related feature matrix;
And the wake-up module is used for inputting the time sequence voice feature matrix corresponding to the related feature matrix into a pre-trained wake-up word voice detection sub-network for detection if the user voice exists, and waking up equipment corresponding to the wake-up word voice when the wake-up word voice is detected.
9. An AR device, the device comprising: bone conduction microphone, memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the voice wake-up method of any one of claims 1 to 7.
10. A storage medium, characterized in that the storage medium is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the voice wake-up method according to any one of claims 1 to 7.