CN114236469A - A method and system for robot speech recognition and localization - Google Patents
- Publication number
- CN114236469A (application CN202111361624.2A)
- Authority
- CN
- China
- Prior art keywords
- microphone
- robot
- time delay
- speech signal
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S5/00—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
- G01S5/186—Determination of attitude
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J13/00—Controls for manipulators
- B25J13/003—Controls for manipulators by means of an audio-responsive input
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Multimedia (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
Abstract
The invention discloses a robot speech recognition and localization method, which comprises the steps of: S1, constructing a microphone array model; S2, collecting the voice signals of the microphone elements and preprocessing the voice signals; S3, carrying out matching recognition on the preprocessed voice signals; S4, respectively calculating the relative time delays of the matched voice signals between the microphone elements; and S5, estimating the position and posture of the robot according to the calculated relative time delays. The invention can acquire the voice signals in the environment in real time and accurately determine the position of the robot through sound source localization, thereby ensuring localization accuracy while improving the computational efficiency of the robot localization algorithm.
Description
Technical Field
The invention relates to the technical field of robot self-perception and localization, and in particular to a robot speech recognition and localization method and system.
Background
In daily life, the interaction modes among people mainly comprise voice, vision, gestures and other forms, wherein the voice is the simplest and most efficient interaction mode and is also most consistent with the communication habits of people. The voice recognition technology is a research hotspot in recent years, has made great progress, and is applied to many fields, such as vehicle-mounted equipment, games, intelligent household appliances, and the like. The voice recognition technology enables the machine to understand the content spoken by the user, frees both hands of the user, and improves human-computer interaction experience.
The emphasis of speech recognition differs across applications. Some cases need only certain keywords to be recognized, such as motion control based on speech keywords; some scenes require that all the characters contained in the speech be recognized as accurately as possible, such as voice input; still other situations require not only complete recognition of the text but also insight into the speaker's emotional state. For a good human-computer interaction experience, sound source localization technology is as indispensable as speech recognition: only when the machine knows the direction of the speaker can it respond with targeted actions, and combining localization information with vision and other modalities opens up further functional scenarios. Although speech technology has been widely applied in many fields, it has not yet been fully adopted in the robot industry, and some technical problems remain to be solved.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a robot voice recognition positioning method and system.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
in a first aspect, the present invention provides a robot voice recognition positioning method, including the following steps:
S1, constructing a microphone array model;
S2, collecting voice signals of the microphone elements and preprocessing the voice signals;
S3, carrying out matching recognition on the preprocessed voice signals;
S4, respectively calculating the relative time delay of the voice signals matched between the microphone elements;
and S5, estimating the position and the posture of the robot according to the calculated relative time delay.
Further, the step S1 specifically includes:
Four microphones in a world coordinate system form a quaternary cross-shaped microphone array: the center of the array is located at the origin of the world coordinate system, each microphone lies on a coordinate axis, and all microphones are equidistant from the origin.
Further, the step S2 specifically includes the following sub-steps:
S2-1, collecting voice signals of the microphone elements;
S2-2, sampling the collected voice signals by adopting a set sampling frequency;
S2-3, carrying out high-frequency boosting processing on the sampled voice signal;
S2-4, framing the processed voice signal;
S2-5, windowing the processed voice signal;
and S2-6, carrying out endpoint detection on the processed voice signal by adopting a short-time energy and short-time average zero-crossing rate method.
Further, the step S3 specifically includes the following sub-steps:
S3-1, respectively extracting linear prediction coefficient characteristics and frequency cepstrum coefficient characteristics from the preprocessed voice signals, and establishing a speech recognition feature vector sequence;
S3-2, calculating a frame matching distance matrix of each speech recognition feature vector sequence and a known speech recognition feature vector sequence;
and S3-3, recursively searching for the speech signal with the minimum matching distance in the frame matching distance matrix as the recognition result.
Further, the step S3-3 of recursively searching for the speech signal with the minimum matching distance in the frame matching distance matrix specifically includes the following sub-steps:
S3-1-1, constructing a search objective function:
D(i,j)=|t(i)-r(j)|+min{D(i-1,j),D(i-1,j-1),D(i,j-1)}
in the formula, D(i,j) represents the matching distance between the ith feature in the speech recognition feature vector sequence and the jth feature in the known speech recognition feature vector sequence, t(i) represents the ith feature value in the speech recognition feature vector sequence, and r(j) represents the jth feature value in the known speech recognition feature vector sequence;
the constraint condition is:
D(1,1)=|t(1)-r(1)|
and S3-1-2, starting from D(1,1), calculating the values of D(i,j) row by row or column by column, and finally comparing all the calculated values to select the speech signal with the minimum matching distance.
Further, the calculation formula of the relative time delay in step S4 is as follows:
τ₁₂ = argmax E(α·s(n−τ₁)·s(n−τ₁−τ₂))
in the formula, τ₁₂ represents the relative time delay of the speech signal between microphone 1 and microphone 2, E represents the mathematical expectation, α represents the attenuation coefficient of the speech signal, s(n) represents the speech signal, and τ₁, τ₂ represent the times of arrival of the sound signal at microphone 1 and microphone 2.
Further, in step S5, the expression for estimating the distance between the robot and the sound source according to the calculated relative time delay is as follows:
wherein c represents the speed of sound, τ₁₂ represents the relative time delay of the speech signal between microphone 1 and microphone 2, τ₁₃ the relative time delay between microphone 1 and microphone 3, and τ₁₄ the relative time delay between microphone 1 and microphone 4.
Further, the expression for estimating the robot azimuth angle according to the calculated relative time delay in step S5 is as follows:
in the formula, τ₁₂ represents the relative time delay of the speech signal between microphone 1 and microphone 2, τ₁₃ the relative time delay between microphone 1 and microphone 3, and τ₁₄ the relative time delay between microphone 1 and microphone 4.
Further, the expression for estimating the pitch angle of the robot according to the calculated relative time delay in step S5 is as follows:
wherein c represents the speed of sound, τ₁₂ represents the relative time delay of the speech signal between microphone 1 and microphone 2, τ₁₃ the relative time delay between microphone 1 and microphone 3, τ₁₄ the relative time delay between microphone 1 and microphone 4, and d represents the distance from each microphone element to the origin of the world coordinate system.
In a second aspect, the present invention further provides a robot voice recognition positioning system, including:
the constructing module is used for constructing a microphone array model;
the acquisition module is used for collecting the voice signals of the microphone elements and preprocessing them;
the recognition module is used for carrying out matching recognition on the preprocessed voice signals;
and the estimation module is used for respectively calculating the relative time delay of the voice signals matched between the microphone elements and estimating the pose of the robot according to the calculated relative time delay.
The invention has the following beneficial effects:
according to the method, the microphone array model is built, the voice signals of the microphone elements are preprocessed and then are matched and identified, and finally the relative time delay of the matched voice signals among the microphone elements is calculated respectively to estimate the pose of the robot, so that the voice signals in the environment are acquired in real time, the accurate position of the robot is accurately determined through sound source positioning, and the calculation efficiency of a robot positioning algorithm can be improved while the positioning accuracy is ensured.
Drawings
FIG. 1 is a schematic flow chart of a robot speech recognition positioning method according to the present invention;
fig. 2 is a schematic structural diagram of a robot speech recognition positioning system of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined in the appended claims, and all matter produced using the inventive concept is protected.
As shown in fig. 1, an embodiment of the present invention provides a robot speech recognition positioning method, including the following steps S1 to S5:
S1, constructing a microphone array model;
in an optional embodiment of the present invention, step S1 specifically includes:
Four microphones in a world coordinate system form a quaternary cross-shaped microphone array: the center of the array is located at the origin of the world coordinate system, each microphone lies on a coordinate axis, and all microphones are equidistant from the origin.
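To make the array geometry of step S1 concrete, the four element positions can be written out directly. This is a sketch under stated assumptions: the assignment of microphone indices to axes (1 and 2 on the x axis, 3 and 4 on the y axis) and the 5 cm arm length are illustrative choices, not taken from the patent text.

```python
import numpy as np

# Sketch of the quaternary cross-shaped microphone array of step S1.
# The index-to-axis assignment (mics 1/2 on x, mics 3/4 on y) is an assumption.
def cross_array(d):
    """Positions (x, y) of the four microphones, each at distance d from the origin."""
    return np.array([
        [ d,  0.0],   # microphone 1
        [-d,  0.0],   # microphone 2
        [0.0,   d],   # microphone 3
        [0.0,  -d],   # microphone 4
    ])

mics = cross_array(0.05)  # 5 cm arm length (illustrative)
```

Any arm length d works; only the equidistance of the elements and the centering of the array at the world-frame origin matter for the construction described above.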
S2, collecting voice signals of the microphone elements and preprocessing the voice signals;
in an optional embodiment of the present invention, step S2 specifically includes the following sub-steps:
S2-1, collecting the voice signals received by the microphone elements;
Specifically, each microphone element in the quaternary cross-shaped microphone array constructed in step S1 collects and receives the voice signal emitted by the sound source.
S2-2, sampling the collected voice signals by adopting a set sampling frequency;
specifically, the present invention samples the collected voice signal with a sampling frequency equal to or higher than twice the voice signal frequency, and converts the sampled voice signal from an analog signal to a digital signal.
S2-3, carrying out high-frequency lifting processing on the sampled voice signal;
specifically, the invention firstly performs pre-filtering on the sampled voice signal by adopting an anti-aliasing filter, and then performs high-frequency boosting processing on the voice signal, thereby smoothing the frequency spectrum of the voice signal.
The high-frequency boosting formula adopted by the invention takes the standard first-order pre-emphasis form:
y(n) = x(n) − a·x(n−1)
in the formula, y(n) is the output signal after high-frequency boosting, a is the boosting coefficient, and x(n), x(n−1) are the input sample values at the current moment and the previous moment respectively.
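The pre-emphasis step above can be sketched as a one-line filter. This assumes the standard first-order form y(n) = x(n) − a·x(n−1); the coefficient value a = 0.95 used as a default here is a typical choice and is not taken from the patent.

```python
import numpy as np

# Hedged sketch of the high-frequency boosting (pre-emphasis) step S2-3.
# The default coefficient a = 0.95 is a common value, assumed for illustration.
def pre_emphasis(x, a=0.95):
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                  # no previous sample exists at n = 0
    y[1:] = x[1:] - a * x[:-1]   # y(n) = x(n) - a*x(n-1)
    return y
```

Subtracting a scaled copy of the previous sample attenuates slowly varying (low-frequency) content, which boosts the relative level of the high frequencies as the step intends.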
S2-4, framing the processed voice signal;
specifically, the invention divides the voice signal into a plurality of time segments, namely voice signal frames, by adopting a set time interval.
S2-5, windowing the processed voice signal;
specifically, the invention adopts a Hamming window to carry out windowing processing on the voice signal frame, thereby enabling the voice signal frame to be more stable.
The Hamming window expression adopted by the invention is the standard form:
w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1
where N is the frame length.
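Framing (S2-4) and Hamming windowing (S2-5) can be sketched together. The frame length and frame shift below are illustrative assumptions, not values from the patent; the window uses the standard Hamming coefficients 0.54 and 0.46.

```python
import numpy as np

# Sketch of framing and Hamming windowing (steps S2-4 and S2-5).
# frame_len and frame_shift are illustrative assumptions.
def frame_and_window(x, frame_len=256, frame_shift=128):
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    # Standard Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * w   # taper each frame toward its edges
```

The 50% overlap between consecutive frames (shift = half the frame length) is a common choice that keeps the tapered edges of one frame covered by the center of the next.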
and S2-6, carrying out end point detection on the processed voice signal by adopting a short-time energy and short-time average zero crossing rate method.
Specifically, the invention first determines a high threshold parameter and a low threshold parameter from the short-time energy and the short-time average zero-crossing rate. An initial endpoint of the voice signal is then judged according to the high threshold parameter. Near the determined initial endpoint, a secondary endpoint is searched for at the point where the short-time average amplitude falls to the low threshold parameter. Finally, a segmentation threshold is set according to the mean of the short-time average zero-crossing rate, and the point near the secondary endpoint where the short-time average zero-crossing rate falls to a set multiple of the segmentation threshold is taken as the finally detected endpoint.
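A minimal sketch of the double-threshold idea behind step S2-6, using short-time energy only; the zero-crossing-rate refinement described above is omitted, and the threshold ratios are simplified illustrative assumptions.

```python
import numpy as np

# Simplified sketch of double-threshold endpoint detection (step S2-6):
# a high energy threshold finds the initial endpoints, and a low threshold
# extends them outward. Threshold ratios are illustrative assumptions.
def short_time_energy(frames):
    return np.sum(frames ** 2, axis=1)

def detect_endpoints(frames, high_ratio=0.5, low_ratio=0.1):
    e = short_time_energy(frames)
    high, low = high_ratio * e.max(), low_ratio * e.max()
    active = np.where(e > high)[0]          # frames above the high threshold
    start, end = active[0], active[-1]      # initial endpoints
    while start > 0 and e[start - 1] > low:  # extend back to the low threshold
        start -= 1
    while end < len(e) - 1 and e[end + 1] > low:
        end += 1
    return start, end
```

In the full method of the text, the zero-crossing rate is used to extend the detected segment further into low-energy but high-frequency regions such as unvoiced consonants.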
S3, carrying out matching recognition on the preprocessed voice signals;
in an optional embodiment of the present invention, step S3 specifically includes the following sub-steps:
s3-1, respectively extracting linear prediction coefficient characteristics and frequency cepstrum coefficient characteristics from the preprocessed voice signals, and establishing a voice recognition characteristic vector sequence;
s3-2, calculating a frame matching distance matrix of each speech recognition feature vector sequence and a known speech recognition feature vector sequence;
and S3-3, recursively searching the speech signal with the minimum matching distance in the frame matching distance matrix as a recognition result.
Specifically, the recursive search for the speech signal with the minimum matching distance in the frame matching distance matrix comprises the following sub-steps:
S3-1-1, constructing a search objective function:
D(i,j)=|t(i)-r(j)|+min{D(i-1,j),D(i-1,j-1),D(i,j-1)}
in the formula, D (i, j) represents the matching distance between the ith feature in the speech recognition feature vector sequence and the jth feature in the known speech recognition feature vector sequence, t (i) represents the ith feature value in the speech recognition feature vector sequence, and r (j) represents the jth feature value in the known speech recognition feature vector sequence;
the constraint conditions are as follows:
D(1,1)=|t(1)-r(1)|
and S3-1-2, starting from D(1,1), calculating the values of D(i,j) row by row or column by column, and finally comparing all the calculated values to select the speech signal with the minimum matching distance.
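The recursion of steps S3-1-1 and S3-1-2 is a dynamic-time-warping fill of the frame matching distance matrix. The following sketch works over one-dimensional feature sequences for brevity; real feature vectors would replace the scalar absolute difference with a vector distance.

```python
import numpy as np

# DTW frame-matching distance of steps S3-1-1 / S3-1-2:
# D(i,j) = |t(i)-r(j)| + min(D(i-1,j), D(i-1,j-1), D(i,j-1)),
# with the constraint D(1,1) = |t(1)-r(1)| (0-based indices below).
def dtw_distance(t, r):
    n, m = len(t), len(r)
    D = np.full((n, m), np.inf)
    D[0, 0] = abs(t[0] - r[0])              # the constraint condition
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i > 0 else np.inf,
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                       D[i, j - 1] if j > 0 else np.inf)
            D[i, j] = abs(t[i] - r[j]) + prev
    return D[n - 1, m - 1]

# Recognition: the known template with the smallest matching distance wins.
def recognize(t, templates):
    return min(templates, key=lambda name: dtw_distance(t, templates[name]))
```

Filling the matrix row by row (or column by column) from D(1,1), as S3-1-2 states, guarantees that the three neighbors needed at each cell have already been computed.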
S4, respectively calculating the relative time delay of the voice signals matched between the microphone elements;
in an alternative embodiment of the present invention, the calculation formula of the relative time delay in step S4 is:
τ₁₂ = argmax E(α·s(n−τ₁)·s(n−τ₁−τ₂))
in the formula, τ₁₂ represents the relative time delay of the speech signal between microphone 1 and microphone 2, E represents the mathematical expectation, α represents the attenuation coefficient of the speech signal, s(n) represents the speech signal, and τ₁, τ₂ represent the times of arrival of the sound signal at microphone 1 and microphone 2.
In the present invention, microphone 1 is used as the reference microphone; the relative time delays τ₁₃ and τ₁₄ of microphone 3 and microphone 4 with respect to microphone 1 can be calculated in the same way from the above formula, which is not repeated here.
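In practice the mathematical expectation in step S4 is replaced by a sample cross-correlation over a finite recording, a common practical substitute; the sketch below uses that substitution and is not claimed to be the patent's exact estimator. The sign convention (positive delay means the second signal arrives later) is an assumption.

```python
import numpy as np

# Cross-correlation estimate of the relative delay of step S4: the lag that
# maximizes the sample cross-correlation stands in for the expectation
# E(alpha * s(n - tau1) * s(n - tau1 - tau2)) of the text.
def relative_delay(s1, s2):
    """Return the delay of s2 relative to s1, in samples (positive = s2 later)."""
    corr = np.correlate(s2, s1, mode="full")   # lags -(N-1) .. N-1
    return np.argmax(corr) - (len(s1) - 1)
```

With microphone 1 as the reference, calling `relative_delay` on the pairs (mic 1, mic 2), (mic 1, mic 3), and (mic 1, mic 4) yields the sample-domain counterparts of τ₁₂, τ₁₃, and τ₁₄; dividing by the sampling frequency converts them to seconds.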
And S5, estimating the position and the posture of the robot according to the calculated relative time delay.
In an alternative embodiment of the invention, the robot pose to be taken into account by the invention comprises the robot-to-sound source distance, the robot azimuth angle and the robot pitch angle.
The expression of estimating the distance between the robot and the sound source according to the calculated relative time delay is as follows:
wherein c represents the speed of sound, τ₁₂ represents the relative time delay of the speech signal between microphone 1 and microphone 2, τ₁₃ the relative time delay between microphone 1 and microphone 3, and τ₁₄ the relative time delay between microphone 1 and microphone 4.
The expression of estimating the azimuth angle of the robot according to the calculated relative time delay is as follows:
in the formula, τ₁₂ represents the relative time delay of the speech signal between microphone 1 and microphone 2, τ₁₃ the relative time delay between microphone 1 and microphone 3, and τ₁₄ the relative time delay between microphone 1 and microphone 4.
The expression of estimating the pitch angle of the robot according to the calculated relative time delay is as follows:
wherein c represents the speed of sound, τ₁₂ represents the relative time delay of the speech signal between microphone 1 and microphone 2, τ₁₃ the relative time delay between microphone 1 and microphone 3, τ₁₄ the relative time delay between microphone 1 and microphone 4, and d represents the distance from each microphone element to the origin of the world coordinate system.
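The patent's closed-form pose expressions are images not reproduced in this text, so the following is not the patent's formula. It is the standard far-field azimuth relation for an assumed cross-array layout (microphone 1 at (d, 0), 2 at (−d, 0), 3 at (0, d), 4 at (0, −d)) with delays τ₁ₖ defined as arrival time at microphone 1 minus arrival time at microphone k; under those assumptions the unknowns d and c cancel in the ratio.

```python
import numpy as np

# Hedged far-field azimuth sketch for step S5 (NOT the patent's expression).
# Assumed layout: mic 1 at (d,0), mic 2 at (-d,0), mic 3 at (0,d), mic 4 at (0,-d);
# tau_1k = (arrival time at mic 1) - (arrival time at mic k).
# Far field: tau12 = -2*d*cos(phi)/c and tau13 - tau14 = 2*d*sin(phi)/c,
# so d and c cancel in the ratio.
def azimuth_from_delays(tau12, tau13, tau14):
    """Source azimuth phi (radians) from the three delays relative to mic 1."""
    return np.arctan2(tau13 - tau14, -tau12)
```

Using `arctan2` rather than `arctan` of the plain ratio keeps the correct quadrant over the full circle of possible azimuths.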
As shown in fig. 2, the present invention further provides a robot voice recognition positioning system, including:
the constructing module is used for constructing a microphone array model;
the acquisition module is used for collecting the voice signals of the microphone elements and preprocessing them;
the recognition module is used for carrying out matching recognition on the preprocessed voice signals;
and the estimation module is used for respectively calculating the relative time delay of the voice signals matched between the microphone elements and estimating the pose of the robot according to the calculated relative time delay.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and implementation of the invention have been explained herein through specific embodiments; the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111361624.2A CN114236469A (en) | 2021-11-17 | 2021-11-17 | A method and system for robot speech recognition and localization |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111361624.2A CN114236469A (en) | 2021-11-17 | 2021-11-17 | A method and system for robot speech recognition and localization |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114236469A true CN114236469A (en) | 2022-03-25 |
Family
ID=80749832
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111361624.2A Pending CN114236469A (en) | 2021-11-17 | 2021-11-17 | A method and system for robot speech recognition and localization |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114236469A (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102982803A (en) * | 2012-12-11 | 2013-03-20 | 华南师范大学 | Isolated word speech recognition method based on HRSF and improved DTW algorithm |
| CN103308889A (en) * | 2013-05-13 | 2013-09-18 | 辽宁工业大学 | Passive sound source two-dimensional DOA (direction of arrival) estimation method under complex environment |
| CN106162431A (en) * | 2015-04-02 | 2016-11-23 | 钰太芯微电子科技(上海)有限公司 | The beam positioning system of giant-screen mobile terminal |
| WO2016183791A1 (en) * | 2015-05-19 | 2016-11-24 | 华为技术有限公司 | Voice signal processing method and device |
| CN104991573A (en) * | 2015-06-25 | 2015-10-21 | 北京品创汇通科技有限公司 | Locating and tracking method and apparatus based on sound source array |
Non-Patent Citations (6)
| Title |
|---|
| 杜鹃: "基于支持向量机的说话人识别", 中国优秀硕士学位论文全文数据库信息科技辑, no. 03, 15 September 2007 (2007-09-15), pages 11 - 15 * |
| 段生全: "分频匹配在语音识别与控制中的应用", 声学与电子工程, no. 3, 31 December 2005 (2005-12-31), pages 25 - 27 * |
| 谢迎春, 于湘珍, 刘建平, 张卫华: "基于多特征有效组合的说话人识别", 现代电子技术, no. 09, 1 September 2005 (2005-09-01) * |
| 陈克兴 等: "设备状态监测与故障诊断技术", 31 August 1991, 科学技术文献出版社, pages: 53 * |
| 靳晓强 等: "移动机器人语音定向算法及其实现", 计算机仿真, vol. 29, no. 11, 30 November 2012 (2012-11-30), pages 223 - 226 * |
| 韩鸿鸾 等: "工业机器人的组成一体化教程", 31 December 2020, 西安电子科技大学出版社, pages: 390 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115972231A (en) * | 2023-01-06 | 2023-04-18 | 西北工业大学 | A swarm robot system and method based on acoustic perception processing |
| CN115972231B (en) * | 2023-01-06 | 2024-08-30 | 西北工业大学 | Cluster robot system and method based on acoustic perception processing |
| CN117768816A (en) * | 2023-11-15 | 2024-03-26 | 兴科迪科技(泰州)有限公司 | Method and device for realizing sound collection based on small-size PCBA |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11508366B2 (en) | Whispering voice recovery method, apparatus and device, and readable storage medium | |
| CN110310623B (en) | Sample generation method, model training method, device, medium, and electronic apparatus | |
| CN103943107B (en) | A kind of audio frequency and video keyword recognition method based on Decision-level fusion | |
| CN110910891B (en) | Speaker segmentation labeling method based on long-time and short-time memory deep neural network | |
| CN106653056B (en) | Fundamental frequency extraction model and training method based on LSTM recurrent neural network | |
| CN110838289A (en) | Awakening word detection method, device, equipment and medium based on artificial intelligence | |
| CN114387997B (en) | Voice emotion recognition method based on deep learning | |
| CN110600017A (en) | Training method of voice processing model, voice recognition method, system and device | |
| CN106601230B (en) | Logistics sorting place name voice recognition method and system based on continuous Gaussian mixture HMM model and logistics sorting system | |
| CN102707806B (en) | Motion recognition method based on acceleration sensor | |
| EP4310838B1 (en) | Speech wakeup method and apparatus, and storage medium and system | |
| CN103258533B (en) | Novel model domain compensation method in remote voice recognition | |
| CN108766459A (en) | Target speaker method of estimation and system in a kind of mixing of multi-person speech | |
| US9799333B2 (en) | System and method for processing speech to identify keywords or other information | |
| JP2011186351A (en) | Information processor, information processing method, and program | |
| CN110534133A (en) | A kind of speech emotion recognition system and speech-emotion recognition method | |
| CN101452529A (en) | Information processing apparatus and information processing method, and computer program | |
| CN109697978B (en) | Method and apparatus for generating a model | |
| CN105161092A (en) | Speech recognition method and device | |
| CN114236469A (en) | A method and system for robot speech recognition and localization | |
| CN113450771A (en) | Awakening method, model training method and device | |
| CN109688271A (en) | The method, apparatus and terminal device of contact information input | |
| CN104103280A (en) | Dynamic time warping algorithm based voice activity detection method and device | |
| Marti et al. | Real time speaker localization and detection system for camera steering in multiparticipant videoconferencing environments | |
| CN117995187A (en) | Customer service robot and dialogue processing system and method based on deep learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||