
CN117995177A - Beam selection method of microphone array, electronic equipment and storage medium

Info

Publication number
CN117995177A
Authority
CN
China
Prior art keywords
channel
audio data
frame
microphone array
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410108418.8A
Other languages
Chinese (zh)
Inventor
马永保 (Ma Yongbao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudminds Shanghai Robotics Co Ltd
Original Assignee
Cloudminds Shanghai Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudminds Shanghai Robotics Co Ltd filed Critical Cloudminds Shanghai Robotics Co Ltd
Priority to CN202410108418.8A
Publication of CN117995177A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the present application relate to the field of voice processing and disclose a beam selection method for a microphone array, comprising the following steps: processing each frame of audio data received by the microphone array into one frame of multichannel first audio data; determining, from the first audio data of the current frame, a first channel corresponding to the audio data with the highest voice quality; and selecting the audio data of the first channel or of a second channel from the first audio data of the current frame and sending it to a voice recognition system for recognition, where the second channel is the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed. The application can select the voice beam fed into the voice recognition system in any application scenario, has no application-scenario restrictions, and has low complexity.

Description

Beam selection method of microphone array, electronic equipment and storage medium
Technical Field
Embodiments of the present invention relate to the field of voice recognition, and in particular to a beam selection method for a microphone array, an electronic device, and a storage medium.
Background
In voice-interaction products, a microphone array is often used for sound pickup to improve far-field pickup quality; for example, beamforming and blind source separation algorithms remove noise, reverberation, and the like from the voice, improving the quality of the picked-up voice and hence the accuracy of voice recognition. However, because the speaker's position and angle are not fixed, multi-beam or blind-source-separation processing must be used to output voice on multiple channels, after which a selection mechanism picks the voice of only one channel to send to the voice recognition system for recognition.
Blind source separation and multi-beam algorithms separate the voices, noise, and interference arriving at the microphone array from different directions; the separated multichannel data are then fed to a wake-up module, which sends the signal channel with the highest wake-up confidence to the recognition system. However, algorithms that rely on a wake-up module are limited in application scenario: the wake-up module cannot be used in wake-free interaction or conference scenarios.
Disclosure of Invention
The invention aims to provide a beam selection method for a microphone array, an electronic device, and a storage medium that can select the voice beam fed into a voice recognition system in any application scenario, without scenario restrictions and with low complexity.
To solve the above technical problem, embodiments of the present invention provide a beam selection method for a microphone array, an electronic device, and a storage medium. The method includes:
processing each frame of audio data received by the microphone array into one frame of multichannel first audio data;
determining, from the first audio data of the current frame, a first channel corresponding to the audio data with the highest voice quality;
selecting the audio data of the first channel or of a second channel from the first audio data of the current frame and sending it to a voice recognition system for recognition, where the second channel is the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed.
Embodiments of the present invention also provide an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the beam selection method of a microphone array described above.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the beam selection method of a microphone array described above.
Compared with the prior art, the embodiments of the present application process each frame of audio data received by the microphone array into one frame of multichannel first audio data; determine, from the first audio data of the current frame, a first channel corresponding to the audio data with the highest voice quality; and select the audio data of the first channel or of a second channel from the first audio data of the current frame to send to a voice recognition system for recognition, where the second channel is the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed. The voice beam (the audio data of a channel) fed into the voice recognition system can thus be selected in any application scenario, without scenario restrictions and with low complexity.
Drawings
One or more embodiments are illustrated by way of example, and not limitation, in the accompanying drawings, in which like reference numerals indicate similar elements; the figures are not to be taken as limiting unless otherwise indicated.
Fig. 1 is a flowchart of a beam selection method for a microphone array according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a beam selection method for a microphone array according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of the neural network selection module in the beam selection method of a microphone array according to the second embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand, however, that numerous technical details are set forth in the various embodiments in order to provide a better understanding of the present application; the claimed application may be practiced without these specific details, and with various changes and modifications based on the following embodiments.
In voice-interaction products, a microphone array is often used for sound pickup to improve far-field pickup quality; for example, beamforming and blind source separation algorithms remove noise, reverberation, and the like from the voice, improving the quality of the picked-up voice and hence the accuracy of voice recognition. However, because the speaker's position and angle are not fixed, multi-beam or blind-source-separation processing must be used to output voice on multiple channels, after which a selection mechanism picks the voice of only one channel to send to the voice recognition system for recognition.
For the above cases, common selection mechanisms include:
1) A blind source separation algorithm is used: the voices, noise, and interference arriving at the microphone array from different directions are separated, the separated multichannel data are fed to a wake-up module to obtain the wake-up confidence of each channel, and the highest confidence among the channels is selected; if that confidence exceeds the wake-up threshold, the corresponding signal channel is sent to the recognition system.
2) The blind source separation algorithm in 1) is replaced with a multi-beam algorithm: beamforming is performed in several different directions to obtain multichannel data, and the wake-up module selects the channel with the highest wake-up confidence and sends it to the recognition system.
3) The wake-up module in 1) and 2) is replaced with a method that selects the higher-energy channel after noise reduction: the signals produced by multi-beam beamforming or blind source separation are first denoised, and the channel with the higher energy is then selected for output. The effect of this method depends heavily on the noise reduction algorithm. A traditional noise reduction algorithm cannot remove non-stationary and point-source interference; a deep-learning noise reduction algorithm damages the voice at low signal-to-noise ratios and therefore causes wrong selections, and because each channel needs its own independent deep-learning noise reduction, a large amount of computation is required.
4) The direction of arrival (DOA) of the sound is estimated, and a beamforming algorithm is then applied in the estimated DOA direction. The effect of this method depends heavily on the accuracy of the DOA estimate; under strong reverberation, strong noise, and interference, the DOA is difficult to estimate accurately, and DOA accuracy is poor toward the two ends of a linear array, so this method carries considerable risk.
In summary, among current beam selection schemes, schemes 1) and 2) are preferred, but they require a wake-up module and therefore cannot be adopted in wake-free interaction or conference scenarios. To address this problem, the present application provides a beam selection method for a microphone array that has no application-scenario restrictions and low complexity.
A first embodiment of the present invention relates to a beam selection method for a microphone array. As shown in fig. 1, the method includes the following steps.
Step 101: each frame of audio data received by the microphone array is processed into first audio data of one frame of multiple channels.
A microphone array is a multi-microphone system in which a number of acoustic sensors (microphones) are arranged according to a certain rule to sample and filter the spatial characteristics of a sound field. The frequency responses of all microphones in the array are consistent, and their sampling clocks are synchronized.
Microphone arrays can generally be divided into linear arrays, planar arrays, and volumetric arrays. A typical linear array is formed by two microphones; almost all mobile phones and earphones currently adopt dual-microphone noise reduction to improve call quality, and some smart speakers adopt the same scheme. The greatest advantages of a two-microphone linear array are low cost and low power consumption, but its drawbacks are also obvious: its noise reduction is limited compared with arrays of more microphones, so it performs poorly in far-field interaction. Planar arrays come in more varied configurations, commonly 4-mic or 6-mic arrays, and are often found on smart speakers and voice-interaction robots; a planar array can achieve equivalent 360-degree pickup in its plane, and the more microphones it has, the finer its spatial division and the better its far-field recognition, at the cost of higher power consumption and a more complex industrial design. Volumetric arrays are spherical or cylindrical and can achieve true full-space 360-degree lossless pickup, solving the planar array's poor response to high-pitch-angle signals; they perform best and are common in professional voice recognition applications.
Specifically, the microphone array receives audio data through the microphones it contains, of which there are at least two. The array processes the received audio data frame by frame; each frame of audio data can be processed with a generalized sidelobe cancellation beamforming algorithm into first audio data of multiple channels, i.e., one frame of multichannel first audio data.
Step 102: and determining a first channel corresponding to the audio data with the highest voice quality from the first audio data of the current frame.
Specifically, after the microphone array processes the received audio data of the current frame, one frame of multichannel first audio data corresponding to the current frame is obtained, and the channel corresponding to the audio data with the highest voice quality among the multichannel first audio data of the current frame is selected as the first channel. In practice, voice quality can be measured by quality-related parameters such as signal-to-noise ratio and reverberation, which the application does not specifically limit: for example, the channel whose audio data has the highest signal-to-noise ratio may be selected as the first channel; or the channel whose audio data has the least reverberation; or the channel whose audio data has both the highest signal-to-noise ratio and the least reverberation.
Step 103: selecting the audio data of a first channel or a second channel from the first audio data of the current frame, and sending the audio data of the first channel or the second channel to a voice recognition system for recognition;
The second channel is a channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame is processed.
Specifically, after the first channel is determined, the audio data of the first channel or of the second channel can be selected from the first audio data of the current frame according to a preset condition and sent to the voice recognition system for recognition. The preset condition may be defined by the practitioner; for example, it may be: first judge whether the audio data of the first channel meets the input requirement of the voice recognition system, and if it does, send the audio data of the first channel to the voice recognition system, otherwise send the audio data of the second channel. For instance, the audio data of the first channel may be fed to the voice recognition system only if it is voice data rather than noise, with the audio data of the second channel fed in otherwise; or the first channel's audio data may be fed in only when some parameter of it satisfies a preset threshold, with the second channel's audio data fed in otherwise.
When the audio data of the first channel does not meet the input requirement of the voice recognition system, the audio data of the second channel is selected from the first audio data of the current frame and sent to the voice recognition system for recognition. The second channel is the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed; concretely, it may be the first channel of the previous frame (the channel with the highest voice quality in the previous frame's first audio data), or the channel whose audio data was sent to the voice recognition system two frames before the current frame. Note that when no second channel can be determined for the current frame (the current frame is the first frame of audio data that needs to be sent to the voice recognition system), the audio data of any one channel may be sent to the voice recognition system for recognition. A minimal sketch of this per-frame decision is given below.
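For illustration only, the following Python sketch combines steps 101 to 103 into one per-frame decision. The function and argument names are assumptions, and the per-channel VAD flag standing in for the "input requirement" test is just one of the example conditions above, not the only possible one:

    import numpy as np

    def select_channel(first_audio, quality_scores, is_speech, prev_channel):
        # first_audio   : (n_channels, frame_len) one frame of multichannel first audio data
        # quality_scores: per-channel voice-quality scores for the current frame
        # is_speech     : per-channel VAD flags (assumed "input requirement" test)
        # prev_channel  : channel sent to the recognizer for the previous frame, or None
        first = int(np.argmax(quality_scores))   # first channel: highest voice quality
        if is_speech[first] or prev_channel is None:
            chosen = first                       # requirement met (or no history yet)
        else:
            chosen = prev_channel                # fall back to the second channel
        return chosen, first_audio[chosen]       # feed this channel to the recognizer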
Compared with the related art, the embodiment of the present application processes each frame of audio data received by the microphone array into one frame of multichannel first audio data; determines, from the first audio data of the current frame, a first channel corresponding to the audio data with the highest voice quality; and selects the audio data of the first channel or of the second channel from the first audio data of the current frame to send to the voice recognition system for recognition, where the second channel is the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed. The voice beam (the audio data of a channel) fed into the voice recognition system can thus be selected in any application scenario, without scenario restrictions and with low complexity.
A second embodiment of the present invention relates to a beam selection method for a microphone array and is a refinement of the foregoing embodiment, specifically including the following.
In some embodiments, as shown in fig. 2, step 101 may include the steps of:
step 1011: performing generalized sidelobe cancellation beamforming processing in multiple directions on each frame of audio data received by the microphone array to obtain one frame of multichannel complex subband-domain signals.
Specifically, generalized sidelobe canceller (GSC) beamforming is a digital signal processing technique used to enhance the quality of a desired signal and suppress unwanted interference in harsh environments with noise, clutter, and the like. Its basic principle is to exploit the phase differences with which the same signal reaches different array elements to form a beam, focusing on the signal from a certain direction, while an adaptive filter in the same array suppresses interference from directions outside the beam, thereby improving the effectiveness of the signal. GSC beamforming may be performed in the time domain, in the frequency domain, or in the complex subband domain; the application does not specifically limit this, as long as one frame of multichannel complex subband-domain signals can be obtained from each frame of audio data received by the microphone array.
Specifically, in Fourier analysis, the variation of the amplitude |Fn| (or Cn) of each component with the frequency nω1 is called the amplitude spectrum of the signal, and the variation of the phase φn of each component with the angular frequency nω1 is called the phase spectrum; together they are referred to as the spectrum of the signal. The complex subband domain refers to subband signals that are complex-valued: converting audio data into complex subband signals yields complex subband-domain signals. Subbands are obtained by subband coding, a technique that converts the original signal from the time domain into the frequency domain, divides it into several subbands, and digitally encodes each subband separately, using a bank of band-pass filters (BPF) to divide the original signal into a number of (for example, m) subbands.
In one example, step 1011 may include the following: performing generalized sidelobe cancellation (GSC) beamforming in multiple directions in the complex subband domain on each frame of audio data received by the microphone array, to obtain one frame of multichannel complex subband-domain signals;
where the GSC beamforming result for each direction corresponds to the complex subband-domain signal of one channel within the frame of multichannel complex subband-domain signals.
Specifically, GSC beamforming may be performed in the time domain, the frequency domain, or the complex subband domain. When GSC beamforming in multiple directions is performed directly in the complex subband domain on each frame of audio data received by the microphone array, one frame of multichannel complex subband-domain signals is obtained directly. The GSC beamforming result for each direction corresponds to one channel's complex subband-domain signal: performing GSC beamforming in N directions yields one frame of N-channel complex subband-domain signals, with N not less than 2. For example, when the microphone array is a 6-mic linear array, GSC beamforming in 5 directions (30, 60, 90, 120, and 150 degrees) may be performed in the complex subband domain on each frame of audio data, yielding one frame of 5-channel complex subband-domain signals.
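As a rough illustration, the sketch below computes only the fixed (delay-and-sum) path of a GSC beamformer on one complex subband-domain frame, for the five example look directions. A full GSC additionally contains a blocking matrix and an adaptive sidelobe-cancelling path, which are omitted here, and the array geometry, sampling rate, and all names are assumptions:

    import numpy as np

    def steering_vector(freqs, mic_pos, angle_deg, c=343.0):
        # Far-field steering for a linear array along the x axis (delays in seconds).
        tau = mic_pos * np.cos(np.deg2rad(angle_deg)) / c
        return np.exp(-2j * np.pi * np.outer(freqs, tau))     # (n_bins, n_mics)

    def fixed_beams(X, freqs, mic_pos, angles=(30, 60, 90, 120, 150)):
        # X: (n_bins, n_mics) complex subband-domain frame from the array.
        # Returns (n_angles, n_bins): one complex subband channel per look direction.
        out = []
        for ang in angles:
            d = steering_vector(freqs, mic_pos, ang)
            out.append((np.conj(d) * X).mean(axis=1))         # align phases, average
        return np.stack(out)

    # Example shapes: freqs = np.fft.rfftfreq(512, 1/16000); mic_pos = 0.03*np.arange(6)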
Alternatively, in another example, step 1011 may include the following: performing generalized sidelobe cancellation (GSC) beamforming in multiple directions in the time domain on each frame of audio data received by the microphone array, to obtain one frame of multichannel time-domain signals;
applying a subband analysis window to the frame of multichannel time-domain signals and performing a Fourier transform to obtain one frame of multichannel complex subband-domain signals;
where the GSC beamforming result for each direction corresponds to the time-domain signal of one channel within the frame of multichannel time-domain signals.
Specifically, when GSC beamforming in multiple directions is performed in the time domain on each frame of audio data received by the microphone array, one frame of multichannel time-domain signals is obtained. A subband analysis window is applied to this frame of multichannel time-domain signals, which are then Fourier-transformed into one frame of multichannel complex subband-domain signals. The GSC beamforming result for each direction corresponds to one channel's time-domain signal: performing GSC beamforming in N directions yields one frame of N-channel time-domain signals, with N not less than 2. For example, for a 6-mic linear array, GSC beamforming in 5 directions (30, 60, 90, 120, and 150 degrees) can be performed in the time domain on each frame of audio data, yielding one frame of 5-channel time-domain signals, which are windowed with a subband analysis window and then Fourier-transformed into complex subband-domain signals.
Correspondingly, GSC beamforming in multiple directions can also be performed in the frequency domain on each frame of audio data received by the microphone array, yielding one frame of multichannel frequency-domain signals. These must then be converted into one frame of multichannel time-domain signals, which are in turn converted into one frame of multichannel complex subband-domain signals by the method of this example.
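A minimal sketch of the subband analysis step described above, assuming a simple per-frame window (a true subband implementation would use a longer polyphase prototype filter):

    import numpy as np

    def to_complex_subbands(frames_td, window):
        # frames_td: (n_channels, frame_len) time-domain GSC beam outputs for one frame.
        # Window each channel, then FFT into the complex subband domain
        # (rfft keeps the non-redundant positive-frequency bins).
        return np.stack([np.fft.rfft(ch * window) for ch in frames_td])

    # e.g. window = np.hanning(frame_len)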
Step 1012: obtaining the magnitude spectrum of the frame of multichannel complex subband-domain signals and then performing mel-spectrum compression to obtain one frame of multichannel mel spectra.
Specifically, audio data is converted into the complex subband domain because most models process the audio signal frame by frame with 50% or 75% overlap between frames, window each frame, and then FFT it into the frequency domain. At a 50% overlap rate, spectral leakage in the frequency domain is severe whether a Hanning or a Hamming window is used, and the leakage shows up in the magnitude spectrum; with heavy leakage, the harmonic components of the magnitude spectrum become blurred, especially for male voices with a low fundamental frequency, making it hard for the neural network selection module to accurately learn the harmonic structure from the blurred spectrogram during deep learning. A 75% overlap rate leaks less, but doubles the number of frames to be processed. This embodiment therefore adopts a complex-domain subband transform: its frequency-domain leakage is clearly lower than that of a short-time Fourier transform at 50% overlap and close to that at 75% overlap, while the number of frequency points after transformation matches the 50%-overlap case and does not increase. The mel spectrum is used to effectively compress the number of frequency points, reducing the feature dimension fed into the network and the amount of computation.
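For illustration, the sketch below builds a standard triangular mel filterbank and compresses the magnitude spectrum of one multichannel subband-domain frame; all sizes are assumptions:

    import numpy as np

    def mel_filterbank(n_mels, n_bins, sr):
        # Triangular mel filters over rfft bins (standard construction).
        hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        mels = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
        bins = np.floor((n_bins - 1) * mel2hz(mels) / (sr / 2.0)).astype(int)
        fb = np.zeros((n_mels, n_bins))
        for i in range(n_mels):
            l, c, r = bins[i], bins[i + 1], bins[i + 2]
            if c > l:
                fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
            if r > c:
                fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
        return fb

    def mel_compress(subband_frame, fb):
        # subband_frame: (n_channels, n_bins) complex subband-domain frame.
        mag = np.abs(subband_frame)     # magnitude spectrum
        return mag @ fb.T               # (n_channels, n_mels) mel spectrum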
In some embodiments, step 102 may include the following: inputting the current frame's one frame of multichannel mel spectra into a neural network selection module to obtain the posterior probability of each channel's mel spectrum, selecting the channel with the highest posterior probability as the first channel containing the audio data with the highest voice quality, and generating the channel index corresponding to the first channel.
Specifically, a posterior probability is the probability, given that an event has occurred, that the event was caused by a particular factor. In this embodiment the posterior probability can be understood as the probability that the current channel's voice data is the audio data with the highest voice quality, and it is computed by the neural network selection module. After the posterior probability of each channel's voice quality is obtained, the channel with the highest posterior probability is selected as the first channel containing the audio data with the highest voice quality, and a channel index corresponding to the first channel is generated at the same time; the channel index uniquely identifies the first channel.
In one example, in the process of inputting the current frame's one frame of multichannel mel spectra into the neural network selection module to obtain the posterior probability of each channel's mel spectrum, the method further includes: the neural network selection module generating a voice activation detection (VAD) result for each channel from the current frame's one frame of multichannel mel spectra.
Specifically, voice activation detection (VAD) algorithms are mainly used to detect whether a human voice signal is present in the current sound signal, distinguishing voice segments from various background-noise segments in the input. Many feature-extraction methods exist for VAD; the decision can be made, for example, from the short-time energy (STE) and the short-time zero-crossing rate (zero cross counter, ZCC), i.e., energy-based features. The short-time energy is the energy of one frame of the voice signal, and the zero-crossing rate is the number of times one frame's time-domain signal crosses 0 (the time axis). Generally, high-accuracy VAD algorithms extract and judge several features based on energy, frequency-domain features, cepstral features, harmonic features, long-term information features, and so on. The application does not specifically limit the feature-extraction method by which the neural network selection module generates the voice activation detection (VAD) result for each channel from the current frame's one frame of multichannel mel spectra.
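By way of example, a classical energy-plus-zero-crossing VAD over one time-domain frame might look as follows; note that in this application the VAD result is actually predicted by the neural network selection module from mel spectra, and both thresholds below are assumptions:

    import numpy as np

    def simple_vad(frame, energy_thresh=1e-3, zcc_thresh=0.3):
        # Classical energy + zero-crossing decision on one time-domain frame.
        ste = float(np.mean(frame ** 2))                           # short-time energy
        zcc = float(np.mean(np.abs(np.diff(np.sign(frame))) / 2))  # zero-crossing rate
        return ste > energy_thresh and zcc < zcc_thresh            # True = voice-like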
In one example, the training process of the neural network selection module includes the following:
1. Data preparation
(1) Generate the raw signals acquired by the microphone array.
This signal comprises two parts: directional voice (including directional interference) and background noise. Directional voice, which contains both the target sound source and directional interference (noise or another person's voice), is generated to simulate sounds from different directions and distances as collected by the microphone array. A multichannel room impulse response (RIR) tool generates multichannel impulse responses for different room sizes, microphone (mic) array placements, angles relative to the mic array, and speaker-source distances. The impulse responses are convolved with clean voice or noise respectively, and the convolution results are summed to obtain the directional voice; the signal-to-noise ratio for the sum is chosen at random in the interval 0 dB to 30 dB. The process considers at most two sound sources in different directions occurring simultaneously: if both are human voices, the voice with greater energy is chosen as the target source, and if one source is a human voice and the other is noise, the human-voice source is chosen as the target source.
Background noise, by contrast, has no apparent directionality. Its impulse response can be generated by summing 8 impulse responses placed at the 8 corners of the room (the 4 floor corners plus the 4 ceiling corners), and convolving the result with a noise signal yields the background noise. After the background noise is generated, it is superimposed on the generated directional voice at a signal-to-noise ratio chosen at random in the interval 5 dB to 30 dB, finally producing the multichannel voice signal. A sketch of this simulation is given below.
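A sketch of this simulation step, assuming the RIRs have already been generated and using the SNR ranges from the text (helper names and shapes are assumptions):

    import numpy as np
    from scipy.signal import fftconvolve

    def conv_mc(sig, rirs):
        # Convolve a mono signal with a (n_mics, rir_len) multichannel RIR.
        return np.stack([fftconvolve(sig, h)[: len(sig)] for h in rirs])

    def simulate_array_signal(speech, interferer, noise, rir_tgt, rir_int, rirs_bg):
        # rir_tgt/rir_int: (n_mics, L) RIRs for target and directional interferer;
        # rirs_bg: list of 8 (n_mics, L) RIRs at the room corners.
        target = conv_mc(speech, rir_tgt)
        interf = conv_mc(interferer, rir_int)
        snr = np.random.uniform(0.0, 30.0)                 # 0-30 dB mixing SNR
        g = 10 ** (-snr / 20) * np.std(target) / (np.std(interf) + 1e-8)
        directional = target + g * interf                  # directional voice
        bg = sum(conv_mc(noise, h) for h in rirs_bg)       # non-directional background
        snr_bg = np.random.uniform(5.0, 30.0)              # 5-30 dB background SNR
        g_bg = 10 ** (-snr_bg / 20) * np.std(directional) / (np.std(bg) + 1e-8)
        return directional + g_bg * bg                     # (n_mics, len(speech))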
(2) A multi-beam signal of a training model is generated.
GSC beamforming is performed in 5 directions (30, 60, 90, 120, and 150 degrees) on the array signals simulated in step (1), yielding multi-beam outputs of 5 channels.
(3) Generate the optimal channel index label and the voice activation detection (VAD) label.
The target source's direction-of-arrival (DOA) angle for the data of step (1) is differenced against the 5 directions (30, 60, 90, 120, and 150 degrees); the channel corresponding to the direction with the smallest difference is determined to be the first channel, and its channel index is output. For a direction exactly midway between two beams (equal differences), the channel corresponding to one of the two directions is chosen at random as the first channel and its channel index is output. In addition, the VAD label can be determined by thresholding the energy of the clean voice signal: when the energy exceeds the threshold, the VAD label is "voice present"; when it does not, the VAD label is "voice absent". A label-generation sketch is given below.
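A sketch of this label generation, with the energy threshold as an assumption:

    import numpy as np

    BEAM_ANGLES = np.array([30.0, 60.0, 90.0, 120.0, 150.0])

    def make_labels(target_doa_deg, clean_frame, energy_thresh=1e-4):
        # Channel label: beam angle closest to the target DOA, ties broken at random.
        diff = np.abs(BEAM_ANGLES - target_doa_deg)
        channel_label = int(np.random.choice(np.flatnonzero(diff == diff.min())))
        # VAD label: thresholded energy of the clean voice frame.
        vad_label = int(np.mean(clean_frame ** 2) > energy_thresh)
        return channel_label, vad_label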
2. Model training
(1) The generated multi-beam signals are transformed into the subband domain, and their mel spectra are computed.
(2) A transformed mel spectrum is selected at random and input into the network, producing a predicted first-channel index and a predicted VAD result.
(3) The predictions, together with the corresponding optimal channel index label and VAD label, are fed into a loss function and the loss value is computed; this scheme may use a cross-entropy loss function.
(4) The loss is passed to the Adam optimizer, and the parameters of the neural network selection module are updated by back-propagation.
(5) The remaining data are fed to the neural network selection module in turn, and the process is repeated until training over all the data is complete. A sketch of one training step follows this list.
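One training step might look as follows in PyTorch; the equal weighting of the two losses and the use of binary cross entropy for the VAD head are assumptions:

    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, mel_batch, channel_labels, vad_labels):
        # Cross entropy on the channel-index head, binary cross entropy on the
        # VAD head; Adam updates the parameters via back-propagation.
        channel_logits, vad_logits = model(mel_batch)
        loss = F.cross_entropy(channel_logits, channel_labels) \
             + F.binary_cross_entropy_with_logits(vad_logits.squeeze(-1),
                                                  vad_labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)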
The current frame's one frame of multichannel mel spectra is input into the trained neural network selection module to obtain the posterior probability of each channel's mel spectrum; the trained module also generates the voice activation detection (VAD) result for each channel from the current frame's mel spectra, or it may generate only the VAD result of the first channel once the first channel has been determined.
The neural network selection module of this embodiment is shown schematically in fig. 3. The CrossConv module is formed from a group of 2*M sub-conv convolution modules; each sub-conv consists of conv2d + batchnorm + prelu, and the input of each sub-conv is the mel spectra of 2 different beam channels. This module lets the network actively learn the difference features between each pair of beam channels, increasing the accuracy of the finally predicted first channel (the channel with the highest posterior probability). The outputs of the sub-conv modules are spliced together, and the network layers below fuse them with different weights; the network learns the fusion weights automatically, which highlights the feature differences between channels.
conv block1 through conv block4 fuse the feature differences of the different channels and progressively compress and extract high-dimensional features. Each conv block consists of a depthwise-separable convolution + batchnorm + prelu, which reduces the amount of computation.
The TGRU1 + fc + sigmoid module predicts the VAD result. TGRU1 is a unidirectional GRU along the time axis, letting the network fuse the features of historical frames with those of the current frame. The fully connected layer fc and the sigmoid activation function map the output of TGRU1 to the VAD prediction.
In the FGRU + TGRU2 + fc + softmax module, FGRU is a bidirectional GRU along the frequency feature axis, which enlarges the receptive field across frequency points; it is bidirectional so as to sense both high-to-low and low-to-high frequency changes. TGRU2 serves the same purpose as TGRU1: the network perceives historical input and output states along the time axis. The VAD branch does not use an FGRU because the VAD task is relatively simple and omitting the FGRU reduces computation. The fully connected layer fc and the softmax activation function map the output of TGRU2 to the posterior probability that each channel is the optimal channel (highest voice quality), and the channel with the largest posterior probability is output as the optimal channel.
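The following PyTorch sketch condenses this architecture. All layer sizes are assumptions, and the conv blocks and the FGRU/TGRU stack are collapsed into a single GRU for brevity, so it illustrates the CrossConv pairing and the two output heads rather than reproducing fig. 3 exactly:

    import itertools
    import torch
    import torch.nn as nn

    class SubConv(nn.Module):
        # conv2d + batchnorm + prelu over one pair of beam-channel mel spectra.
        def __init__(self, out_ch=8):
            super().__init__()
            self.net = nn.Sequential(nn.Conv2d(2, out_ch, 3, padding=1),
                                     nn.BatchNorm2d(out_ch), nn.PReLU())

        def forward(self, x):                 # x: (B, 2, T, n_mels)
            return self.net(x)

    class BeamSelector(nn.Module):
        def __init__(self, n_channels=5, n_mels=40, hidden=64):
            super().__init__()
            # One SubConv per pair of the 5 beam channels (CrossConv).
            self.pairs = list(itertools.combinations(range(n_channels), 2))
            self.cross = nn.ModuleList(SubConv() for _ in self.pairs)
            self.tgru = nn.GRU(8 * len(self.pairs) * n_mels, hidden, batch_first=True)
            self.channel_head = nn.Linear(hidden, n_channels)   # + softmax in the loss
            self.vad_head = nn.Linear(hidden, 1)                # + sigmoid in the loss

        def forward(self, mel):               # mel: (B, n_channels, T, n_mels)
            feats = [conv(mel[:, list(p)]) for conv, p in zip(self.cross, self.pairs)]
            x = torch.cat(feats, dim=1)       # splice the sub-conv outputs
            b, c, t, m = x.shape
            x, _ = self.tgru(x.permute(0, 2, 1, 3).reshape(b, t, c * m))
            h = x[:, -1]                      # state after the last frame
            return self.channel_head(h), self.vad_head(h)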
In some embodiments, in step 103, among the first audio data of the current frame, the audio data of the first channel or the second channel is selected to be sent to the speech recognition system for recognition, including:
when the voice activation detection (VAD) result of the first channel is that a voice signal is detected, smoothing the channel index of the first channel to obtain a smoothed channel index, and sending the audio data of the channel corresponding to the smoothed channel index, within the first audio data of the current frame, to the voice recognition system for recognition.
Specifically, after determining the first channel corresponding to the audio data with the highest voice quality in the current frame's first audio data, the neural network selection module generates the channel index of the first channel and the voice activation detection (VAD) result of the first channel. Both are input into a post-processing module at the same time; when the VAD result of the first channel is that a voice signal is detected, the channel index of the first channel is smoothed to obtain a smoothed channel index, and the audio data of the corresponding channel, within the first audio data of the current frame, is sent to the voice recognition system for recognition. The audio data input to the voice recognition system is then the human voice with the highest voice quality.
In other embodiments, in step 103, among the first audio data of the current frame, the audio data of the first channel or the second channel is selected to be sent to the speech recognition system for recognition, including:
when the voice activation detection (VAD) result of the first channel is that no voice signal is detected, smoothing the channel index of the second channel to obtain a smoothed channel index, and sending the audio data of the channel corresponding to the smoothed channel index, within the first audio data of the current frame, to the voice recognition system for recognition.
Specifically, the neural network selection module generates the channel index and the VAD result of the first channel as above, and both may be input into the post-processing module at the same time. The post-processing module records, for each frame's first audio data, the channel index of the audio data fed to the voice recognition system. When the VAD result of the first channel is that no voice signal is detected, i.e., the first channel has the highest voice quality but does not carry a human voice, the channel whose index was most recently recorded with a voice-detected VAD result is taken as the second channel; its channel index is smoothed to obtain a smoothed channel index, and the current frame's audio data of the corresponding channel, within the first audio data of the current frame, is sent to the voice recognition system for recognition. Note that when the post-processing module holds no previous voice-detected record before the current frame, the audio data of any one channel may be sent to the voice recognition system; the application does not limit this. A post-processing sketch is given below.
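A sketch of such post-processing, where the smoothing is implemented as a majority vote over a short history (the window length and majority rule are assumptions; the text only requires "smoothing"):

    from collections import deque

    class IndexSmoother:
        def __init__(self, window=15):
            self.history = deque(maxlen=window)
            self.last_voice_index = None      # index last sent with voice detected

        def smooth(self, first_index, voice_detected):
            if voice_detected or self.last_voice_index is None:
                idx = first_index             # use the first channel
            else:
                idx = self.last_voice_index   # fall back to the second channel
            self.history.append(idx)
            hist = list(self.history)
            smoothed = max(set(hist), key=hist.count)   # majority vote
            if voice_detected:
                self.last_voice_index = smoothed
            return smoothed                   # channel whose audio goes to the recognizer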
Compared with the related art, this embodiment inputs the current frame's one frame of multichannel mel spectra into the neural network selection module to obtain the posterior probability of each channel's mel spectrum, and at the same time generates the voice activation detection (VAD) result for each channel from those mel spectra. By jointly considering the posterior probability of the current frame's first channel and whether its data is a human voice, it ensures that the audio data with the largest posterior probability is sent to the voice recognition system only when the first channel contains human-voice data, guaranteeing the recognition quality of the voice recognition system and improving the accuracy of voice recognition; meanwhile, the deep-learning-based neural network selection module requires little computation and has low latency.
A third embodiment of the invention relates to an electronic device, as shown in fig. 4, comprising at least one processor 202 and a memory 201 communicatively connected to the at least one processor 202; the memory 201 stores instructions executable by the at least one processor 202, and the instructions are executed by the at least one processor 202 to enable the at least one processor 202 to perform any of the beam selection method embodiments described above.
The memory 201 and the processor 202 are connected by a bus, which may comprise any number of interconnected buses and bridges linking together the various circuits of the one or more processors 202 and the memory 201. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or several, e.g., multiple receivers and transmitters, providing a unit for communicating with various other apparatus over a transmission medium. Data processed by the processor 202 is transmitted over a wireless medium via an antenna, which also receives data and forwards it to the processor 202.
The processor 202 is responsible for managing the bus and general processing and may also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory 201 may be used to store data used by the processor 202 in performing operations.
A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements any of the beam selection method embodiments described above.
That is, those skilled in the art will understand that all or part of the steps of the methods of the above embodiments may be implemented by a program stored in a storage medium; the program includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1. A method of beam selection for a microphone array, comprising:
Processing each frame of audio data received by the microphone array into first audio data of one frame of multiple channels;
determining a first channel corresponding to the audio data with highest voice quality from the first audio data of the current frame;
Selecting the audio data of the first channel or the second channel from the first audio data of the current frame, and sending the audio data of the first channel or the second channel to a voice recognition system for recognition; the second channel is a channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame is processed.
2. The method for beam selection of a microphone array according to claim 1, wherein processing each frame of audio data received by the microphone array into a frame of multichannel first audio data comprises:
performing generalized side lobe cancellation beam forming processing on each frame of audio data received by the microphone array in multiple directions to obtain a frame of multi-channel complex subband domain signal;
And obtaining the magnitude spectrum of the complex subband domain signals of the frame of multichannel, and then carrying out Mel spectrum compression to obtain the Mel spectrum of the frame of multichannel.
3. The method for selecting a beam of a microphone array according to claim 2, wherein the performing generalized sidelobe canceling beam forming processing in multiple directions on each frame of audio data received by the microphone array to obtain a frame of multi-channel complex subband domain signal comprises:
Performing generalized sidelobe cancellation beam forming on each frame of audio data received by the microphone array in a plurality of directions in a complex sub-band domain to obtain a complex sub-band domain signal of one frame of multichannel;
the generalized sidelobe canceling beam forming result in each direction corresponds to a complex subband domain signal of one channel in the complex subband domain signals of one frame of multiple channels.
4. The method for selecting a beam of a microphone array according to claim 2, wherein the performing generalized sidelobe canceling beam forming processing in multiple directions on each frame of audio data received by the microphone array to obtain a frame of multi-channel complex subband domain signal comprises:
performing generalized side lobe cancellation beam forming of multiple directions on each frame of audio data received by the microphone array in a time domain to obtain a frame of multichannel time domain signal;
Adding a subband analysis window to the one-frame multichannel time domain signal, and performing Fourier transform to obtain a one-frame multichannel complex subband domain signal;
The generalized sidelobe canceling beam forming result in each direction corresponds to one channel time domain signal in the one-frame multi-channel time domain signal.
5. The method for selecting a beam of a microphone array according to claim 2, wherein the determining a first channel corresponding to the audio data with the highest voice quality from the first audio data of the current frame includes:
inputting the Mel frequency spectrum of the one frame of the current frame into a neural network selection module to obtain the posterior probability of the Mel frequency spectrum of each channel, selecting the channel with the highest posterior probability as the first channel where the audio data with the highest voice quality is located, and generating the channel index corresponding to the first channel.
6. The method for beam selection of a microphone array according to claim 5, wherein inputting the mel spectrum of the one frame multi-channel of the current frame into the neural network selection module, in the process of obtaining the posterior probability of the mel spectrum of each channel, further comprises: the neural network selection module generates a voice activation detection result corresponding to each channel according to the Mel spectrum of the one-frame multi-channel of the current frame.
7. The method for beam selection of a microphone array according to claim 6, wherein selecting the audio data of the first channel or the second channel from the first audio data of the current frame is sent to a speech recognition system for recognition, comprising:
when the voice activation detection result of the first channel is that a voice signal is detected, smoothing the channel index of the first channel to obtain a smoothed channel index;
and sending the audio data of the channel corresponding to the smoothed channel index into a voice recognition system for recognition in the first audio data of the current frame.
8. The method for beam selection of a microphone array according to claim 6, wherein selecting the audio data of the first channel or the second channel from the first audio data of the current frame is sent to a speech recognition system for recognition, comprising:
When the voice activation detection result of the first channel is that the voice signal is not detected, smoothing the channel index of the second channel to obtain a smoothed channel index;
and sending the audio data of the channel corresponding to the smoothed channel index into a voice recognition system for recognition in the first audio data of the current frame.
9. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the beam selection method of the microphone array of any of claims 1-8.
10. A computer readable storage medium storing a computer program, which when executed by a processor implements the beam selection method of a microphone array according to any of claims 1-8.
CN202410108418.8A 2024-01-25 2024-01-25 Beam selection method of microphone array, electronic equipment and storage medium Pending CN117995177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410108418.8A CN117995177A (en) 2024-01-25 2024-01-25 Beam selection method of microphone array, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117995177A (en)

Family

ID=90900784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410108418.8A Pending CN117995177A (en) 2024-01-25 2024-01-25 Beam selection method of microphone array, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117995177A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination