CN117995177A - Beam selection method of microphone array, electronic equipment and storage medium
Beam selection method of microphone array, electronic equipment and storage medium
- Publication number
- CN117995177A (application CN202410108418.8A)
- Authority
- CN
- China
- Prior art keywords
- channel
- audio data
- frame
- microphone array
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208 — Noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L2021/02082 — Noise filtering, the noise being echo, reverberation of the speech
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; Beamforming
Abstract
Embodiments of the application relate to the field of voice processing and disclose a beam selection method for a microphone array, comprising the following steps: processing each frame of audio data received by the microphone array into one frame of multi-channel first audio data; determining, from the first audio data of the current frame, a first channel corresponding to the audio data with the highest voice quality; and selecting, from the first audio data of the current frame, the audio data of either the first channel or a second channel and sending it to a voice recognition system for recognition, the second channel being the channel whose audio data was sent to the voice recognition system when the first audio data of the previous frame was processed. The application can select the voice beam fed into the voice recognition system in any application scenario, is free of application-scenario limitations, and has low complexity.
Description
Technical Field
Embodiments of the invention relate to the field of voice recognition, and in particular to a beam selection method for a microphone array, an electronic device, and a storage medium.
Background
In voice-interaction products, a microphone array is often used for sound pickup to improve far-field pickup quality; for example, beamforming and blind source separation algorithms remove noise, reverberation, and the like from the speech, improving the quality of the picked-up voice and thus the accuracy of voice recognition. However, because the speaker's position and angle are not fixed, multi-beam or blind-source-separation processing must be used to output speech on multiple channels, after which a selection mechanism picks the speech of just one channel to send to the voice recognition system for recognition.
In existing schemes, a blind source separation algorithm or multi-beam algorithm separates the speech, noise, and interference arriving from different directions at the microphone array; the separated data of the several channels are then each fed to a wake-up module, and the wake-up module sends the signal channel with the highest confidence to the recognition system. However, algorithms that rely on a wake-up module are limited in application scenario: the wake-up module cannot be used in wake-free interaction or conference scenarios.
Disclosure of Invention
The invention aims to provide a beam selection method for a microphone array, an electronic device, and a storage medium that can select the voice beam fed into a voice recognition system under any conditions, free of application-scenario limitations and with low complexity.
To solve the above technical problem, embodiments of the present invention provide a beam selection method for a microphone array, an electronic device, and a storage medium, the method comprising:
processing each frame of audio data received by the microphone array into one frame of multi-channel first audio data;
determining, from the first audio data of the current frame, a first channel corresponding to the audio data with the highest voice quality;
selecting, from the first audio data of the current frame, the audio data of the first channel or the second channel and sending it to a voice recognition system for recognition; the second channel being the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed.
An embodiment of the invention also provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the beam selection method of a microphone array described above.
Embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements a beam selection method of a microphone array as described above.
Compared with the prior art, embodiments of the application process each frame of audio data received by the microphone array into one frame of multi-channel first audio data; determine, from the first audio data of the current frame, a first channel corresponding to the audio data with the highest voice quality; and select, from the first audio data of the current frame, the audio data of the first channel or the second channel to send to a voice recognition system for recognition, the second channel being the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed. The voice beam (the audio data in the selected channel) fed into the voice recognition system can thus be selected in any application scenario, without scenario limitations and with low complexity.
Drawings
One or more embodiments are illustrated by way of example, and not limitation, in the figures of the accompanying drawings, in which like reference numerals indicate similar elements; the figures are not to be taken as limiting unless otherwise indicated.
Fig. 1 is a specific flowchart of a beam selection method of a microphone array according to a first embodiment of the present invention;
fig. 2 is a specific flowchart of a beam selection method of a microphone array according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of the neural network selection module in the beam selection method of a microphone array according to a second embodiment of the present invention;
Fig. 4 is a schematic structural view of an electronic device according to a third embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand, however, that numerous technical details are set forth in the various embodiments merely to help the reader better understand the application; the claimed application may be practiced without these specific details and with various changes and modifications based on the following embodiments.
In voice-interaction products, a microphone array is often used for sound pickup to improve far-field pickup quality; for example, beamforming and blind source separation algorithms remove noise, reverberation, and the like from the speech, improving the quality of the picked-up voice and thus the accuracy of voice recognition. However, because the speaker's position and angle are not fixed, multi-beam or blind-source-separation processing must be used to output speech on multiple channels, after which a selection mechanism picks the speech of only one channel to send to the voice recognition system for recognition.
For the above cases, common selection mechanisms include:
1) Using a blind source separation algorithm: the speech, noise, and interference arriving from different directions at the microphone array are separated; the separated data of the several channels are each fed to a wake-up module to obtain a wake-up confidence per channel; the largest confidence among the channels is selected, and if it exceeds the wake-up threshold, the signal channel with the largest confidence is sent to the recognition system.
2) Replacing the blind source separation algorithm of 1) with a multi-beam algorithm: beamforming is performed in several different directions to obtain multi-channel data, and the wake-up module selects the channel with the highest wake-up confidence and sends it to the recognition system.
3) Replacing the wake-up module of 1) and 2) with a method that selects the higher-energy channel after noise reduction: the signals produced by multi-beamforming or blind source separation are first denoised, and the channel with the larger energy is selected for output. The effectiveness of this method depends heavily on the noise reduction algorithm. A traditional noise reduction algorithm cannot remove the influence of non-stationary and point-source interference, while a deep-learning noise reduction algorithm can damage the speech at low signal-to-noise ratios and thus cause wrong selections; moreover, since each channel needs its own deep-learning noise reduction, a large amount of computation is required.
4) Estimating the direction of arrival (DOA) of the sound and then running a beamforming algorithm in the estimated DOA direction. The effectiveness of this method depends heavily on DOA accuracy: under strong reverberation, strong noise, and interference, the DOA is difficult to estimate accurately, and DOA accuracy toward the two ends of a linear array is poor, so the method carries considerable risk.
In summary, among current beam selection schemes, schemes 1) and 2) are preferred but require a wake-up module and therefore cannot be used in wake-free interaction or conference scenarios. To address this problem, the application provides a beam selection method for a microphone array that is free of application-scenario limitations and has low complexity.
A first embodiment of the present invention relates to a beam selection method of a microphone array, as shown in fig. 1, specifically including the following steps.
Step 101: each frame of audio data received by the microphone array is processed into first audio data of one frame of multiple channels.
A microphone array is a multi-microphone system in which a number of acoustic sensors (microphones) are arranged in a defined geometry to sample and filter the spatial characteristics of the sound field. All microphones in the array have matched frequency responses, and their sampling clocks are synchronized.
Microphone arrays can generally be divided into linear, planar, and volumetric arrays. The simplest linear array is formed by two microphones; almost all current mobile phones and earphones use dual-microphone noise reduction to improve call quality, and some smart speakers adopt this scheme as well. The greatest advantages of a two-microphone linear array are low cost and low power consumption, but its drawbacks are also obvious: compared with arrays of more microphones its noise reduction is limited, and it performs poorly in far-field interaction. Planar arrays come in more varied configurations, commonly 4-mic and 6-mic arrays, and are often found on smart speakers and voice-interaction robots. A planar array can achieve equivalent 360-degree pickup in the plane; the more microphones, the finer the spatial division and the better the far-field recognition, at the cost of higher power consumption and a more complex industrial design. Volumetric arrays are spherical or cylindrical and can achieve true full-space 360-degree lossless pickup, solving the planar array's poor response to high-pitch-angle signals; they perform best and are common in professional voice recognition applications.
Specifically, the microphone array receives audio data through the multiple microphones it contains, the number of which is at least two. The array processes the received audio data frame by frame; each frame of audio data can be processed with a generalized sidelobe cancellation beamforming algorithm into first audio data on several channels, i.e., one frame of multi-channel first audio data.
Step 102: determining a first channel corresponding to the audio data with the highest voice quality from the first audio data of the current frame.
Specifically, after the microphone array processes the received audio data of the current frame, one frame of multi-channel first audio data corresponding to the current frame is obtained, and the channel carrying the audio data with the highest voice quality is selected from the multi-channel first audio data of the current frame as the first channel. In practice, voice quality can be measured by quality-related parameters such as signal-to-noise ratio and reverberation; the application places no particular limit on this. For example, the channel corresponding to the audio data with the highest signal-to-noise ratio may be selected from the multi-channel first audio data of the current frame as the first channel; or the channel corresponding to the audio data with the least reverberation; or the channel corresponding to the audio data with both the highest signal-to-noise ratio and the least reverberation.
Step 103: selecting the audio data of a first channel or a second channel from the first audio data of the current frame, and sending the audio data of the first channel or the second channel to a voice recognition system for recognition;
The second channel is a channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame is processed.
Specifically, after the first channel is determined, the audio data of the first channel or the second channel can be selected from the first audio data of the current frame according to a preset condition and sent to the voice recognition system for recognition. The preset condition may be defined by the practitioner. For example, the condition may be: first judge whether the audio data of the first channel meets the requirements for input to the voice recognition system; if so, send the audio data of the first channel to the voice recognition system, otherwise send the audio data of the second channel. Concretely, the audio data of the first channel might be sent to the voice recognition system only if it is speech data rather than noise, with the audio data of the second channel sent otherwise; or only if a certain parameter of the first channel's audio data is at or below a preset threshold, and so on.
When the audio data of the first channel does not meet the requirements for input to the voice recognition system, the audio data of the second channel is selected from the first audio data of the current frame and sent to the voice recognition system for recognition. The second channel is the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed; specifically, it may be the first channel corresponding to the audio data with the highest voice quality in the previous frame's first audio data, or the channel whose audio data was sent to the voice recognition system when the frame two before the current frame was processed. It should be noted that when no second channel can be determined for the current frame (i.e., the current frame is the first frame of audio data to be sent to the voice recognition system), the audio data of any one channel may be sent to the voice recognition system for recognition.
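By way of illustration only, the following Python sketch shows one possible shape of this per-frame decision rule. The callables `quality` and `meets_requirement` stand in for the unspecified voice-quality measure and input-requirement check; all names here are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def select_channel(frame_multichannel, quality, meets_requirement, prev_channel):
    """Pick the channel whose audio is sent to the recognizer for this frame.

    frame_multichannel: (n_channels, frame_len) first audio data of one frame.
    quality / meets_requirement: assumed external callables (illustrative).
    prev_channel: channel used for the previous frame, or None for the first frame.
    """
    # Step 102: the first channel is the one with the highest voice quality.
    scores = [quality(ch) for ch in frame_multichannel]
    first_channel = int(np.argmax(scores))

    # Step 103: prefer the first channel if it meets the recognizer's input
    # requirement; otherwise fall back to the previous frame's channel
    # (the "second channel").
    if meets_requirement(frame_multichannel[first_channel]):
        chosen = first_channel
    elif prev_channel is not None:
        chosen = prev_channel
    else:
        chosen = first_channel  # no history yet: any channel is acceptable
    return chosen, frame_multichannel[chosen]
```

In use, the caller invokes this once per frame and carries the returned channel forward as `prev_channel` for the next frame.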
Compared with the related art, this embodiment processes each frame of audio data received by the microphone array into one frame of multi-channel first audio data; determines, from the first audio data of the current frame, a first channel corresponding to the audio data with the highest voice quality; and selects, from the first audio data of the current frame, the audio data of the first channel or the second channel to send to a voice recognition system for recognition, the second channel being the channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame was processed. The voice beam (the audio data in the selected channel) fed into the voice recognition system can thus be selected in any application scenario, without scenario limitations and with low complexity.
A second embodiment of the present invention relates to a beam selection method for a microphone array, which is a refinement of the foregoing embodiment, and specifically includes the following.
In some embodiments, as shown in fig. 2, step 101 may include the steps of:
Step 1011: perform generalized sidelobe cancellation beamforming in multiple directions on each frame of audio data received by the microphone array to obtain one frame of multi-channel complex subband-domain signals.
Specifically, generalized sidelobe canceller (GSC) beamforming is a digital signal processing technique used to enhance a desired communication signal and suppress unwanted interference in adverse environments with noise, clutter, and the like. Its basic principle is to use the phase differences with which the same signal arrives at different array elements to form a beam, focusing the signal in a given direction, while an adaptive filter within the same array suppresses interference from non-beam directions, improving communication effectiveness. GSC beamforming may be performed in the time domain, the frequency domain, or the complex subband domain; the application places no particular limit on this, as long as one frame of multi-channel complex subband-domain signals can be obtained from each frame of audio data received by the microphone array. The magnitude spectrum of the one-frame multi-channel complex subband-domain signal is then obtained and compressed into a one-frame multi-channel mel spectrum.
Specifically, in Fourier analysis, the variation of each component's amplitude |Fn| or Cn with the frequency nω1 is called the signal's magnitude spectrum, and the variation of each component's phase φn with the angular frequency nω1 is called the signal's phase spectrum; together, the magnitude spectrum and phase spectrum are referred to as the signal's spectrum. The complex subband domain refers to subband signals that take complex values; converting the audio data into complex subband signals yields the complex subband-domain signals. Subbands are obtained by subband coding, a technique that converts the original signal from the time domain to the frequency domain, divides it into several subbands, and digitally codes each subband separately, using a bank of band-pass filters (BPFs) to split the original signal into a number of (e.g., m) subbands.
In one example, step 1011 may include the following: performing generalized sidelobe cancellation (GSC) beamforming in multiple directions in the complex subband domain on each frame of audio data received by the microphone array, to obtain one frame of multi-channel complex subband-domain signals;
wherein the GSC beamforming result for each direction corresponds to one channel of the one-frame multi-channel complex subband-domain signal.
Specifically, GSC beamforming may be performed in the time domain, the frequency domain, or the complex subband domain. When GSC beamforming is performed in multiple directions in the complex subband domain on each frame of audio data received by the microphone array, one frame of multi-channel complex subband-domain signals is obtained directly. The GSC beamforming result for each direction corresponds to one channel of the one-frame multi-channel complex subband-domain signal; that is, performing GSC beamforming in N directions yields one frame of N-channel complex subband-domain signals, where N ≥ 2. For example, when the microphone array is a 6-mic linear array, GSC beamforming in 5 directions (30, 60, 90, 120, and 150 degrees) can be performed in the complex subband domain on each frame of audio data received by the array, yielding one frame of 5-channel complex subband-domain signals.
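As a rough illustration of this example only, the sketch below runs a heavily simplified subband-domain GSC — a fixed align-and-sum beamformer, an adjacent-difference blocking matrix, and a single NLMS-adapted canceller — once per look direction. The steering vectors, step size, and blocking-matrix form are all assumptions; a practical GSC would add adaptation control and a properly designed blocking matrix.

```python
import numpy as np

def gsc_one_direction(X, steer, W_a, mu=0.05, eps=1e-6):
    """One simplified GSC step for one look direction, one subband frame.

    X: (n_mics, n_bins) complex subband-domain microphone signals.
    steer: (n_mics, n_bins) steering vectors for this direction (assumed given).
    W_a: (n_mics - 1, n_bins) adaptive canceller weights, updated in place.
    Returns the (n_bins,) beam output for this direction.
    """
    n_mics = X.shape[0]
    aligned = np.conj(steer) * X            # phase-align toward the look direction
    d = aligned.sum(axis=0) / n_mics        # fixed beamformer output
    B = aligned[:-1] - aligned[1:]          # blocking matrix: target is cancelled
    y = d - np.sum(np.conj(W_a) * B, axis=0)   # subtract adaptive noise estimate
    norm = np.sum(np.abs(B) ** 2, axis=0) + eps
    W_a += mu * B * np.conj(y) / norm       # one complex NLMS update per subband
    return y

def gsc_multibeam(X, steering_per_dir, states):
    """Each look direction's GSC output becomes one channel of the frame.
    states: one (n_mics - 1, n_bins) weight array per direction, kept across frames."""
    return np.stack([gsc_one_direction(X, s, w)
                     for s, w in zip(steering_per_dir, states)])
```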
Alternatively, in another example, step 1011 may include the following: performing generalized sidelobe cancellation (GSC) beamforming in multiple directions in the time domain on each frame of audio data received by the microphone array, to obtain one frame of multi-channel time-domain signals;
applying a subband analysis window to the one-frame multi-channel time-domain signal and performing a Fourier transform to obtain one frame of multi-channel complex subband-domain signals;
wherein the GSC beamforming result for each direction corresponds to one channel of the one-frame multi-channel time-domain signal.
Specifically, GSC beamforming may be performed in the time domain, the frequency domain, or the complex subband domain. When GSC beamforming is performed in multiple directions in the time domain on each frame of audio data received by the microphone array, one frame of multi-channel time-domain signals is obtained. A subband analysis window is applied to the one-frame multi-channel time-domain signal, followed by a Fourier transform, yielding one frame of multi-channel complex subband-domain signals. The GSC beamforming result for each direction corresponds to one channel of the one-frame multi-channel time-domain signal; that is, performing GSC beamforming in N directions yields one frame of N-channel time-domain signals, where N ≥ 2. For example, when the microphone array is a 6-mic linear array, GSC beamforming in 5 directions (30, 60, 90, 120, and 150 degrees) can be performed in the time domain on each frame of audio data, yielding one frame of 5-channel time-domain signals; the frame of 5-channel time-domain signals is windowed with the subband analysis window and then Fourier-transformed into complex subband-domain signals.
Correspondingly, GSC beamforming in multiple directions can also be performed in the frequency domain on each frame of audio data received by the microphone array, yielding one frame of multi-channel frequency-domain signals. These must first be converted into one frame of multi-channel time-domain signals and then into one frame of multi-channel complex subband-domain signals by the method of this example.
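A minimal sketch of the window-plus-FFT conversion in this example follows; the square-root Hann analysis window is an assumption, since the embodiment does not specify the subband prototype filter.

```python
import numpy as np

def to_complex_subbands(frames, win=None):
    """frames: (n_channels, frame_len) time-domain GSC beams for one frame.
    Returns (n_channels, frame_len // 2 + 1) complex subband-domain signals."""
    if win is None:
        win = np.sqrt(np.hanning(frames.shape[-1]))  # assumed analysis window
    return np.fft.rfft(frames * win, axis=-1)
```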
Step 1012: obtain the magnitude spectrum of the one-frame multi-channel complex subband-domain signal, then perform mel spectrum compression to obtain a one-frame multi-channel mel spectrum.
Specifically, conversion of the audio data into the complex subband domain is chosen because most models process the audio signal frame by frame with 50% or 75% overlap between frames, windowing each frame and then applying an FFT to transform the signal into the frequency domain. At 50% overlap, spectral leakage in the frequency domain is severe whether a Hann or a Hamming window is applied, and the leakage is reflected in the magnitude spectrum. When the leakage is large — in particular with a male voice of low fundamental frequency — the harmonic components of the magnitude spectrum become blurred, and it is difficult for the neural network selection module to accurately learn the harmonic structure from the blurred spectrogram during deep learning. A 75% overlap leaks less, but doubles the number of frames, and hence frequency points, to process. This embodiment therefore adopts a complex-domain subband transform: its frequency-domain leakage after transformation is clearly lower than that of a short-time Fourier transform at 50% overlap and close to that at 75% overlap, while the number of frequency points after transformation matches the 50%-overlap case rather than increasing. The mel spectrum then effectively compresses the number of frequency points, reducing the feature dimension fed into the network and the amount of computation.
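The magnitude-spectrum and mel-compression step can be sketched as follows, assuming a 16 kHz sampling rate, a 512-point transform, and 64 mel bands; none of these values are fixed by the embodiment.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=64):
    """Triangular filters mapping n_fft // 2 + 1 bins down to n_mels bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

def mel_compress(subbands, fb):
    """subbands: (n_channels, n_bins) complex -> (n_channels, n_mels) mel spectrum."""
    return np.abs(subbands) @ fb.T
```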
In some embodiments, step 102 may include the following: inputting the one-frame multi-channel mel spectrum of the current frame into a neural network selection module to obtain the posterior probability of each channel's mel spectrum, selecting the channel with the highest posterior probability as the first channel carrying the audio data with the highest voice quality, and at the same time generating the channel index corresponding to the first channel.
Specifically, a posterior probability is the probability, given that an event has occurred, that the event was caused by a particular factor. In this embodiment the posterior probability can be understood as the probability that the audio data of the current channel is the audio data with the highest voice quality; here it is computed by the neural network selection module. After the posterior probability of each channel's voice quality is obtained, the channel with the highest posterior probability is selected as the first channel carrying the audio data with the highest voice quality, and the channel index corresponding to the first channel is generated at the same time. The channel index uniquely identifies the first channel.
In one example, while the one-frame multi-channel mel spectrum of the current frame is input into the neural network selection module to obtain the posterior probability of each channel's mel spectrum, the module further generates a voice activity detection (VAD) result corresponding to each channel from the current frame's one-frame multi-channel mel spectrum.
In particular, voice activity detection (VAD) algorithms are mainly used to detect whether a human voice signal is present in the current sound signal, distinguishing speech segments from various background-noise segments in the input. Many feature-extraction methods exist for VAD; in particular, the decision can be made from the short-time energy (STE) and the short-time zero-crossing rate (ZCC), i.e., energy-based features. The short-time energy is the energy of one frame of the speech signal, and the zero-crossing rate is the number of times the frame's time-domain signal crosses 0 (the time axis). Generally, high-accuracy VAD algorithms extract and judge multiple features based on energy, frequency-domain features, cepstral features, harmonic features, long-term information features, and so on. The application places no particular limit on the feature-extraction method by which the neural network selection module generates the VAD result corresponding to each channel from the one-frame multi-channel mel spectrum of the current frame.
In one example, the training process of the neural network selection module includes the following:
1. Data preparation
(1) Generate the signals as acquired by the raw microphone array.
This signal comprises two parts: directional speech (including directional interference) and background noise. Directional speech, which contains both the target sound source and directional interference (noise or another person's voice), is generated to simulate sounds collected by the microphone array from different directions and distances. A multichannel room impulse response (RIR) tool is used to generate multichannel impulse responses for different room sizes, different microphone (mic) array placements, and different angles and distances of the speaker sources relative to the array. The impulse responses are convolved with clean speech or noise, and the convolution results are summed to obtain the directional speech; the signal-to-noise ratio used in the summation is chosen at random from the interval 0 dB-30 dB. The process considers at most two simultaneous sound sources in different directions: if both sources are voices, the voice with the larger energy is taken as the target source; if one source is a voice and the other is noise, the voice is taken as the target source.
The background noise is generated without obvious directionality. Its impulse response can be produced by summing 8 impulse responses placed at the 8 corners of the room (the 4 floor corners plus the 4 ceiling corners); convolving this summed impulse response with a noise signal yields the background noise. After the background noise is generated, it is superimposed on the generated directional speech at a signal-to-noise ratio chosen at random from the interval 5 dB-30 dB, finally producing the multichannel speech signal. (A code sketch of this simulation, and of the label generation below, follows step (3).)
(2) Generate the multi-beam signals for training the model.
GSC beamforming is performed in 5 directions (30, 60, 90, 120, and 150 degrees) on the simulated microphone-array signals from step (1), yielding 5-channel multi-beam outputs.
(3) Generate the optimal channel index label and the voice activity detection (VAD) label.
The target source's direction-of-arrival (DOA) angle for the data of step (1) is differenced against the 5 directions (30, 60, 90, 120, and 150 degrees); the channel corresponding to the direction with the smallest difference is determined to be the first channel, and its channel index is output. For a direction exactly midway between two beams (equal differences), the channel corresponding to one of the two directions is chosen at random as the first channel and its channel index output. The VAD label can be determined by thresholding the energy of the clean speech signal: when the threshold is exceeded, the VAD label is "speech present"; when it is not exceeded, the VAD label is "speech absent".
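The simulation recipe of step (1) and the labeling rules of step (3) can be sketched in Python as follows. Generation of the multichannel impulse responses is assumed to come from an external RIR tool, and the energy threshold and frame length are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(dry, rir):
    """dry: (n_samples,) mono source; rir: (n_mics, rir_len) -> (n_mics, n_samples)."""
    return np.stack([fftconvolve(dry, h)[: len(dry)] for h in rir])

def mix_at_snr(target, interf, snr_db):
    """Scale `interf` so the target-to-interference power ratio equals snr_db."""
    g = np.sqrt(np.mean(target ** 2) /
                (np.mean(interf ** 2) * 10.0 ** (snr_db / 10.0) + 1e-12))
    return target + g * interf

def make_training_mixture(speech, interferer, noise, rir_spk, rir_int, rirs_corner):
    """rirs_corner: (8, n_mics, rir_len), one RIR per room corner."""
    rng = np.random.default_rng()
    directional = mix_at_snr(spatialize(speech, rir_spk),
                             spatialize(interferer, rir_int),
                             rng.uniform(0.0, 30.0))         # random 0-30 dB
    background = spatialize(noise, rirs_corner.sum(axis=0))  # summed corner RIRs
    return mix_at_snr(directional, background, rng.uniform(5.0, 30.0))  # 5-30 dB
```

And for the labels of step (3):

```python
import numpy as np

BEAM_DIRS = np.array([30.0, 60.0, 90.0, 120.0, 150.0])  # beam directions, degrees

def channel_label(target_doa_deg, rng=None):
    """Index of the beam direction closest to the target DOA; exact ties broken at random."""
    rng = np.random.default_rng() if rng is None else rng
    diffs = np.abs(BEAM_DIRS - target_doa_deg)
    return int(rng.choice(np.flatnonzero(diffs == diffs.min())))

def vad_labels(clean_speech, frame_len=256, energy_thresh=1e-4):
    """Per-frame speech-present (1) / speech-absent (0) labels from clean energy."""
    n = len(clean_speech) // frame_len
    frames = clean_speech[: n * frame_len].reshape(n, frame_len)
    return (np.mean(frames ** 2, axis=1) > energy_thresh).astype(np.int64)
```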
2. Model training
(1) Apply the subband-domain transform to the generated multi-beam signals, then compute their mel spectra.
(2) Randomly select a transformed mel spectrum and input it into the network to obtain the predicted first-channel index and the VAD prediction.
(3) Input these predictions, together with the corresponding optimal channel index label and VAD label, into the loss function and compute the loss value. This scheme can use the cross-entropy loss function.
(4) Pass the loss to the Adam optimizer and update the parameters of the neural network selection module through backpropagation.
(5) Continue feeding the remaining data to the neural network selection module, repeating the process until training on all the data is complete.
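Steps (2)-(5) amount to a standard supervised training loop. A condensed PyTorch sketch follows, assuming the model returns per-frame channel logits and a VAD logit, and that the data loader yields (mel, channel label, VAD label) batches; the shapes and the equal weighting of the two losses are assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # step (4): Adam optimizer
    for _ in range(epochs):
        for mel, ch_label, vad_label in loader:         # step (2): mel into network
            ch_logits, vad_logit = model(mel)
            # Step (3): cross-entropy against the optimal channel index label,
            # plus (binary) cross-entropy against the VAD label.
            loss = (F.cross_entropy(ch_logits, ch_label)
                    + F.binary_cross_entropy_with_logits(vad_logit,
                                                         vad_label.float()))
            opt.zero_grad()
            loss.backward()     # step (4): backpropagation
            opt.step()
```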
The one-frame multi-channel mel spectrum of the current frame is input into the trained neural network selection module to obtain the posterior probability of each channel's mel spectrum. The trained module also generates the voice activity detection (VAD) result corresponding to each channel from the current frame's one-frame multi-channel mel spectrum, or, after the first channel has been determined, may generate only the VAD result corresponding to the first channel.
The neural network selection module of this embodiment is shown schematically in fig. 3. The Crossconv module is formed from a group of 2×M sub-conv convolution modules; each sub-conv consists of conv2d + batchnorm + prelu, and the input of each sub-conv is the mel spectra of 2 different beam channels. This module lets the neural network actively learn the difference features between each pair of beam channels, increasing the accuracy of the finally predicted first channel (the channel with the highest posterior probability). The results processed by the sub-conv modules are concatenated, and the network layers below fuse them with different weights; the network thus automatically learns the fusion weights and can highlight the feature differences between channels.
Conv block1-conv block4 fuse the feature differences of the different channels and progressively compress and extract high-dimensional features. Each conv block consists of a depthwise separable convolution + batchnorm + prelu, which reduces the amount of computation.
The TGRU1 + fc + sigmoid module predicts the VAD result. TGRU1 is a unidirectional GRU along the time axis, letting the network fuse the features of historical frames with those of the current frame when producing its output. The fully connected layer fc and the sigmoid activation function map TGRU1's output to the VAD prediction.
The FGRU + TGRU2 + fc + softmax module predicts the channel. FGRU is a bidirectional GRU along the frequency-feature axis; it increases the receptive field across frequency points and is bidirectional so as to sense both high-to-low and low-to-high frequency changes. TGRU2 works in the same way as TGRU1, letting the network perceive historical input and output states along the time axis. FGRU is not used in the VAD branch because the VAD task is relatively simple and omitting FGRU reduces computation. The fully connected layer fc and the softmax activation function map TGRU2's output to the posterior probability that each channel is the optimal channel (has the highest voice quality), and the channel with the largest posterior probability is selected as the optimal channel.
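A condensed PyTorch sketch of this topology is given below. Only the overall wiring — pairwise Crossconv sub-conv modules, depthwise-separable conv blocks, a TGRU1 + fc + sigmoid VAD head, and an FGRU + TGRU2 + fc + softmax channel head — follows the description above; all widths, kernel sizes, and the exact channel-pairing scheme are illustrative assumptions.

```python
import itertools
import torch
import torch.nn as nn

class SubConv(nn.Module):
    """conv2d + batchnorm + prelu over the mel spectra of one beam-channel pair."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.PReLU())
    def forward(self, x):
        return self.net(x)

class DWConvBlock(nn.Module):
    """Depthwise separable convolution + batchnorm + prelu (one conv block)."""
    def __init__(self, c):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                                 nn.Conv2d(c, c, 1),
                                 nn.BatchNorm2d(c), nn.PReLU())
    def forward(self, x):
        return self.net(x)

class BeamSelector(nn.Module):
    def __init__(self, n_beams=5, n_mels=64, width=16):
        super().__init__()
        self.pairs = [list(p) for p in itertools.combinations(range(n_beams), 2)]
        self.cross = nn.ModuleList(SubConv(2, width) for _ in self.pairs)
        self.fuse = nn.Conv2d(width * len(self.pairs), width, 1)  # learned fusion
        self.blocks = nn.Sequential(*[DWConvBlock(width) for _ in range(4)])
        self.tgru1 = nn.GRU(width * n_mels, 64, batch_first=True)     # VAD branch
        self.vad_fc = nn.Linear(64, 1)
        self.fgru = nn.GRU(width, 32, batch_first=True, bidirectional=True)
        self.tgru2 = nn.GRU(64 * n_mels, 64, batch_first=True)    # channel branch
        self.ch_fc = nn.Linear(64, n_beams)

    def forward(self, mel):                       # mel: (B, n_beams, T, n_mels)
        x = torch.cat([m(mel[:, p]) for m, p in zip(self.cross, self.pairs)], dim=1)
        x = self.blocks(self.fuse(x))             # (B, width, T, n_mels)
        B, C, T, Fq = x.shape
        seq = x.permute(0, 2, 1, 3).reshape(B, T, C * Fq)
        vad = torch.sigmoid(self.vad_fc(self.tgru1(seq)[0]))       # (B, T, 1)
        f = x.permute(0, 2, 3, 1).reshape(B * T, Fq, C)            # GRU over freq
        f = self.fgru(f)[0].reshape(B, T, Fq * 64)
        ch = torch.softmax(self.ch_fc(self.tgru2(f)[0]), dim=-1)   # (B, T, n_beams)
        return ch, vad            # per-channel posterior and per-frame VAD
```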
In some embodiments, in step 103, among the first audio data of the current frame, the audio data of the first channel or the second channel is selected to be sent to the speech recognition system for recognition, including:
when the voice activity detection (VAD) result of the first channel is that a voice signal is detected, smoothing the channel index of the first channel to obtain a smoothed channel index; and sending, from the first audio data of the current frame, the audio data of the channel corresponding to the smoothed channel index to the voice recognition system for recognition.
Specifically, after determining the first channel corresponding to the audio data with the highest voice quality in the first audio data of the current frame, the neural network selection module generates the channel index of the first channel and the VAD result of the first channel. Both are input simultaneously into a post-processing module. When the first channel's VAD result indicates that a voice signal is detected, the post-processing module smooths the first channel's channel index to obtain a smoothed channel index, and, from the first audio data of the current frame, the audio data of the channel corresponding to the smoothed channel index is sent to the voice recognition system for recognition. The audio data fed into the voice recognition system is then the human voice with the highest voice quality.
In other embodiments, in step 103, among the first audio data of the current frame, the audio data of the first channel or the second channel is selected to be sent to the speech recognition system for recognition, including:
when the voice activity detection (VAD) result of the first channel is that no voice signal is detected, smoothing the channel index of the second channel to obtain a smoothed channel index; and sending, from the first audio data of the current frame, the audio data of the channel corresponding to the smoothed channel index to the voice recognition system for recognition.
Specifically, after determining the first channel corresponding to the audio data with the highest voice quality in the first audio data of the current frame, the neural network selection module generates the channel index of the first channel and the VAD result of the first channel, which may be input simultaneously into the post-processing module. The post-processing module records, for each frame of first audio data, the channel index of the audio data fed into the voice recognition system. When the first channel's VAD result is that no voice signal is detected — i.e., the first channel has the highest voice quality but does not carry speech — the channel corresponding to the most recently recorded channel index whose VAD result indicated a detected voice signal is taken as the second channel; the channel index of the second channel is smoothed to obtain a smoothed channel index, and, from the first audio data of the current frame, the current frame's audio data of the channel corresponding to the smoothed channel index is sent to the voice recognition system for recognition. It should be noted that when the post-processing module holds no such record — i.e., before the current frame no channel index of audio data with a speech-positive VAD result has been sent to the voice recognition system — the audio data of any one channel may be sent to the voice recognition system; the application places no limit on this.
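One plausible form of this post-processing is sketched below. The patent does not fix the smoothing method, so majority voting over a short index history is an assumption.

```python
from collections import Counter, deque

class ChannelSmoother:
    """Per-frame smoothing of the selected channel index, driven by the VAD."""
    def __init__(self, history=15):
        self.history = deque(maxlen=history)   # recent speech-frame indices
        self.last_speech_channel = None        # the "second channel" fallback

    def update(self, first_channel, vad_is_speech):
        if vad_is_speech:
            # Speech present: smooth the first channel's index over recent frames.
            self.history.append(first_channel)
            smoothed = Counter(self.history).most_common(1)[0][0]
            self.last_speech_channel = smoothed
            return smoothed
        if self.last_speech_channel is not None:
            # No speech: hold the last index that was sent with speech present.
            return self.last_speech_channel
        return first_channel                   # no history yet: any channel
```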
Compared with the related art, this embodiment inputs the one-frame multi-channel mel spectrum of the current frame into the neural network selection module to obtain the posterior probability of each channel's mel spectrum, while generating the voice activity detection (VAD) result corresponding to each channel from that mel spectrum. By jointly considering the posterior probability of the current frame's first channel and whether its audio is human voice, the embodiment ensures that the audio data with the largest posterior probability is sent to the voice recognition system only when the first channel's audio contains voice data, safeguarding the recognition quality of the voice recognition system and improving the accuracy of voice recognition; at the same time, because the selection module is a deep-learning neural network, the required computation is small and the latency is low.
A third embodiment of the invention relates to an electronic device, as shown in fig. 4, comprising at least one processor 202; and a memory 201 communicatively coupled to the at least one processor 202; the memory 201 stores instructions executable by the at least one processor 202, and the instructions are executed by the at least one processor 202 to enable the at least one processor 202 to perform any of the embodiments of the beam selection method of a microphone array described above.
The memory 201 and the processor 202 are connected by a bus, which may comprise any number of interconnected buses and bridges linking together various circuits of the one or more processors 202 and the memory 201. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or a plurality of elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 202 is transmitted over a wireless medium via an antenna, which further receives incoming data and passes it to the processor 202.
The processor 202 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 201 may be used to store data used by processor 202 in performing operations.
A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements any of the embodiments of the beam selection method of a microphone array described above.
That is, those skilled in the art will understand that all or part of the steps of the methods in the embodiments described above may be implemented by a program stored in a storage medium and comprising several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
Claims (10)
1. A method of beam selection for a microphone array, comprising:
Processing each frame of audio data received by the microphone array into first audio data of one frame of multiple channels;
determining a first channel corresponding to the audio data with highest voice quality from the first audio data of the current frame;
Selecting the audio data of the first channel or the second channel from the first audio data of the current frame, and sending the audio data of the first channel or the second channel to a voice recognition system for recognition; the second channel is a channel corresponding to the audio data sent to the voice recognition system when the first audio data of the previous frame is processed.
2. The method for beam selection of a microphone array according to claim 1, wherein processing each frame of audio data received by the microphone array into a frame of multichannel first audio data comprises:
performing generalized side lobe cancellation beam forming processing on each frame of audio data received by the microphone array in multiple directions to obtain a frame of multi-channel complex subband domain signal;
And obtaining the magnitude spectrum of the complex subband domain signals of the frame of multichannel, and then carrying out Mel spectrum compression to obtain the Mel spectrum of the frame of multichannel.
3. The method for selecting a beam of a microphone array according to claim 2, wherein the performing generalized sidelobe canceling beam forming processing in multiple directions on each frame of audio data received by the microphone array to obtain a frame of multi-channel complex subband domain signal comprises:
Performing generalized side lobe cancellation beam forming on each frame of audio data received by the microphone array in a plurality of directions in a complex sub-band domain to obtain a complex sub-band domain signal of one frame of multichannel;
the generalized sidelobe canceling beam forming result in each direction corresponds to a complex subband domain signal of one channel in the complex subband domain signals of one frame of multiple channels.
4. The method for selecting a beam of a microphone array according to claim 2, wherein the performing generalized sidelobe canceling beam forming processing in multiple directions on each frame of audio data received by the microphone array to obtain a frame of multi-channel complex subband domain signal comprises:
performing generalized side lobe cancellation beam forming of multiple directions on each frame of audio data received by the microphone array in a time domain to obtain a frame of multichannel time domain signal;
Adding a subband analysis window to the one-frame multichannel time domain signal, and performing Fourier transform to obtain a one-frame multichannel complex subband domain signal;
The generalized sidelobe canceling beam forming result in each direction corresponds to one channel time domain signal in the one-frame multi-channel time domain signal.
5. The method for selecting a beam of a microphone array according to claim 2, wherein the determining a first channel corresponding to the audio data with the highest voice quality from the first audio data of the current frame includes:
inputting the Mel frequency spectrum of the one frame of the current frame into a neural network selection module to obtain the posterior probability of the Mel frequency spectrum of each channel, selecting the channel with the highest posterior probability as the first channel where the audio data with the highest voice quality is located, and generating the channel index corresponding to the first channel.
6. The method for beam selection of a microphone array according to claim 5, wherein inputting the mel spectrum of the one frame multi-channel of the current frame into the neural network selection module, in the process of obtaining the posterior probability of the mel spectrum of each channel, further comprises: the neural network selection module generates a voice activation detection result corresponding to each channel according to the Mel spectrum of the one-frame multi-channel of the current frame.
7. The method for beam selection of a microphone array according to claim 6, wherein selecting the audio data of the first channel or the second channel from the first audio data of the current frame is sent to a speech recognition system for recognition, comprising:
when the voice activation detection result of the first channel is that a voice signal is detected, smoothing the channel index of the first channel to obtain a smoothed channel index;
and sending the audio data of the channel corresponding to the smoothed channel index into a voice recognition system for recognition in the first audio data of the current frame.
8. The method for beam selection of a microphone array according to claim 6, wherein selecting the audio data of the first channel or the second channel from the first audio data of the current frame is sent to a speech recognition system for recognition, comprising:
When the voice activation detection result of the first channel is that the voice signal is not detected, smoothing the channel index of the second channel to obtain a smoothed channel index;
and sending the audio data of the channel corresponding to the smoothed channel index into a voice recognition system for recognition in the first audio data of the current frame.
9. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the beam selection method of the microphone array of any of claims 1-8.
10. A computer readable storage medium storing a computer program, which when executed by a processor implements the beam selection method of a microphone array according to any of claims 1-8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410108418.8A CN117995177A (en) | 2024-01-25 | 2024-01-25 | Beam selection method of microphone array, electronic equipment and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410108418.8A CN117995177A (en) | 2024-01-25 | 2024-01-25 | Beam selection method of microphone array, electronic equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117995177A true CN117995177A (en) | 2024-05-07 |
Family
ID=90900784
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410108418.8A Pending CN117995177A (en) | 2024-01-25 | 2024-01-25 | Beam selection method of microphone array, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117995177A (en) |
- 2024-01-25: application CN202410108418.8A filed in China; published as CN117995177A; status: active, Pending
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |