
CN106875949B - Correction method and device for voice recognition - Google Patents


Info

Publication number
CN106875949B
CN106875949B (granted publication of application CN201710291330.4A; earlier publication CN106875949A)
Authority
CN
China
Prior art keywords
application scene
voice recognition
corpus
result
current application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710291330.4A
Other languages
Chinese (zh)
Other versions
CN106875949A (en)
Inventor
石日俭
贺磊
刘旭
吕晓霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Szbroad Technology Co ltd
Original Assignee
Szbroad Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Szbroad Technology Co ltd filed Critical Szbroad Technology Co ltd
Priority to CN201710291330.4A priority Critical patent/CN106875949B/en
Publication of CN106875949A publication Critical patent/CN106875949A/en
Application granted granted Critical
Publication of CN106875949B publication Critical patent/CN106875949B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention disclose a method and a device for correcting voice recognition, wherein the method comprises the following steps: determining the current application scene where the user is located according to the detection data of a configured detection device; performing voice recognition on the sound detected in the current application scene; performing deep learning on the corpus obtained by voice recognition, based on the deep learning model corresponding to the current application scene, to obtain a learning result; and correcting the voice recognition result according to the learning result. The embodiments can meet the voice recognition requirements of specific application scenes and perform voice recognition targeted to each scene, which greatly improves recognition accuracy, further promotes man-machine interaction, and gives the method a wide range of applications.

Description

Correction method and device for voice recognition
Technical Field
The present invention relates to speech processing technologies, and in particular, to a method and an apparatus for correcting speech recognition.
Background
With the development of science and technology, humanity has entered the era of artificial intelligence, which extends human intelligence and capability, simulates human thinking and intelligent behavior, and enables machines to perform complex work that would normally require human intelligence. Important branches of artificial intelligence include voice recognition, text translation, and speech synthesis. Voice recognition technology lets a machine convert an input speech signal into corresponding text through a process of recognition and understanding, realizing communication between humans and machines; text translation technology turns the words recognized from speech into grammatically correct sentences; Text To Speech (TTS) converts text generated by a machine or supplied from outside into speech resembling human expression and outputs it.
At present, voice recognition technologies developed by companies such as iFLYTEK, Microsoft, and Google are computed on big-data platforms with enormous cloud data-processing capacity. The data are large in volume and broad in coverage, so general man-machine language interaction is basically achievable, but the recognition and translation of domain-specific sentences in specific application scenes are often not accurate enough.
In the prior art, a correction set is obtained by step-by-step filtering with statistical or machine learning methods. However, because this process lacks pertinence, the correction applied to each user's input is essentially the same, and the result is inaccurate. For example, when the speech "lihua" is received from different users, the initial recognition yields the text "li hua", which may then be corrected to "pear flower", "physicochemical", or "fireworks display"; that is, the correction is not chosen according to the different application scenes.
Disclosure of Invention
The embodiment of the invention provides a method and a device for correcting voice recognition, which aim to solve the problem of inaccurate correction of a voice recognition result in the prior art.
In a first aspect, an embodiment of the present invention provides a method for correcting speech recognition, including:
determining the current application scene of the user according to the detection data of the set detection equipment;
performing voice recognition on the detected sound in the current application scene;
performing deep learning on the linguistic data obtained by voice recognition based on the deep learning model corresponding to the current application scene to obtain a learning result;
and correcting the voice recognition result according to the learning result.
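The four steps above can be sketched as a small pipeline. This is a minimal, illustrative sketch: all function names, the scene-model layout, and the sample phrases are assumptions, since the patent does not specify an implementation.

```python
# Hypothetical sketch of the four-step correction flow. Function names and
# data shapes are illustrative only, not from the patent.

def determine_scene(detection_data):
    """Step 1: map detector readings to an application-scene label."""
    return detection_data.get("scene", "generic")

def recognize(audio):
    """Step 2: run the base speech engine; here a stand-in lookup."""
    return audio["engine_transcript"]

def deep_learn(corpus_text, scene, scene_models):
    """Step 3: check the transcript against the scene's model."""
    model = scene_models[scene]
    matched = all(tok in model["vocabulary"] for tok in corpus_text.split())
    return {"matched": matched,
            "suggestion": model["corrections"].get(corpus_text, corpus_text)}

def correct(transcript, learning_result):
    """Step 4: replace a mismatched transcript with the scene-specific result."""
    return transcript if learning_result["matched"] else learning_result["suggestion"]

# Invented scene model for a restaurant scene.
scene_models = {
    "restaurant": {
        "vocabulary": {"order", "the", "set", "menu"},
        "corrections": {"order the set manual": "order the set menu"},
    }
}

audio = {"engine_transcript": "order the set manual"}
scene = determine_scene({"scene": "restaurant"})
result = deep_learn(recognize(audio), scene, scene_models)
print(correct(recognize(audio), result))  # -> order the set menu
```

The sketch only fixes the interfaces between the four steps; in the patent, step 3 is a trained deep learning model per scene rather than a vocabulary lookup.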
Further, determining the current application scene where the user is located according to the detection data of the configured detection device includes at least one of the following:
performing voice recognition on the detected sound, and identifying the application scene mapped to the corpus set to which the recognized corpus belongs;
detecting the position of the mobile terminal through a positioning module, and obtaining the current application scene where the user is located;
detecting characteristics of the application scene through a Bluetooth digital signal processing device, and determining the current application scene according to the characteristics.
Further, before determining the current application scenario where the user is located according to the detection data of the setting detection device, the method further includes:
clustering a corpus under each application scene by using a clustering algorithm, and extracting corpus features according to the clustering result;
and training the corpus features, and creating deep learning models corresponding to each application scene.
Further, the correcting the result of the speech recognition according to the learning result includes:
and if the learning result is that the voice recognition result is not matched with the current application scene, correcting the voice recognition result into a corresponding result in the current application scene.
Further, the corpus comprises: stored user-entered corpus, screened corpus, and/or corpus obtained by correcting speech recognition results.
In a second aspect, an embodiment of the present invention further provides a device for correcting speech recognition, including:
the scene determining module is used for determining the current application scene of the user according to the detection data of the set detection equipment;
the voice recognition module is used for performing voice recognition on the detected sound in the current application scene;
the deep learning module is used for carrying out deep learning on the linguistic data obtained by voice recognition based on a deep learning model corresponding to the current application scene to obtain a learning result;
and the correction module is used for correcting the voice recognition result according to the learning result.
Further, the scene determination module includes:
the first determining unit is used for performing voice recognition on the detected sound and identifying the application scene mapped to the corpus set to which the recognized corpus belongs;
the second determining unit is used for detecting the position of the mobile terminal through the positioning module and acquiring the current application scene of the user;
and the third determining unit is used for detecting the characteristics of the application scene through the Bluetooth digital signal processing equipment and determining the current application scene according to the characteristics.
Further, the apparatus further comprises:
the characteristic extraction unit is used for grouping the corpus under each application scene by using a clustering algorithm and extracting corpus characteristics according to the grouping result;
and the model creating unit is used for training the corpus features and creating deep learning models corresponding to each application scene.
Further, the correction module includes:
and the correcting unit is used for correcting the voice recognition result into a corresponding result in the current application scene if the learning result is that the voice recognition result is not matched with the current application scene.
Further, the corpus comprises:
stored user-entered corpus, screened corpus, and/or corpus obtained by correcting speech recognition results.
Embodiments of the invention provide a method and a device for correcting voice recognition: the current application scene is determined from acquired detection data; the corpus obtained by voice recognition is deeply learned in the deep learning model corresponding to that scene; and a voice recognition result that does not match the current scene is corrected and replaced with the correct text translation. This meets the voice recognition requirements of specific application scenes, performs targeted recognition for each scene, greatly improves recognition accuracy, and further promotes man-machine interaction, so that people and machines can communicate effectively, the user experience is improved, and the method applies to a wide range of uses.
Drawings
FIG. 1 is a flowchart illustrating a method for correcting speech recognition according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a method for correcting speech recognition according to a second embodiment of the present invention;
FIG. 3a is a flowchart of a method for correcting speech recognition according to a third embodiment of the present invention;
FIG. 3b is a diagram illustrating a method for correcting speech recognition according to a third embodiment of the present invention;
FIG. 4 is a flowchart of a method for correcting speech recognition according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech recognition correction apparatus according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a method for correcting speech recognition according to an embodiment of the present invention, where the embodiment is applicable to a case where a result of speech recognition is corrected according to a current application scenario, and the method may be executed by a speech recognition correction apparatus, which may be implemented in a software and/or hardware manner and is generally integrated in a device with a speech recognition function.
The method of the first embodiment of the invention specifically comprises the following steps:
s101, determining the current application scene of the user according to the detection data of the setting detection device.
The Chinese language is rich and subtle: utterances that differ by only a single tone, or even share exactly the same pronunciation, can carry entirely different meanings, which makes Chinese voice recognition difficult. The current application scene where the user is located therefore needs to be detected, and the corpus the user produces should be recognized and interpreted with respect to that specific scene, so that the final voice recognition result is more accurate. A configured detection device can probe the current environment and thereby determine the current application scene where the user is located.
And S102, performing voice recognition on the detected sound in the current application scene.
Specifically, after the current application scene where the user is located is determined, voice recognition is performed on the detected sound, and a voice recognition result, that is, a corpus obtained through the voice recognition, is obtained.
S103, deep learning is carried out on the linguistic data obtained by voice recognition based on the deep learning model corresponding to the current application scene, and a learning result is obtained.
Specifically, a deep learning model corresponding to each application scene is created, building a neural network that imitates the analysis and learning of the human brain. Deep learning and analysis of the corpus obtained by voice recognition covers semantics, speech, intonation, context, grammar, and so on, to judge whether the initial recognition result matches the current application scene, that is, whether the recognized corpus is accurate.
And S104, correcting the voice recognition result according to the learning result.
Specifically, if deep learning finds the corpus obtained by voice recognition to be inaccurate, the recognition result is corrected: it is translated into the correct text, which replaces the previous recognition result.
In this embodiment, the current application scene where the user is located is determined first, and the corpus obtained by voice recognition is deeply learned in combination with that scene; if the corpus is inaccurate, the recognition result is corrected for the current scene according to the deep learning result. For example, the user says "the programmer writes code in front of the computer", but because of a non-standard accent, overly fast speech, or similar causes, the big-data speech engine recognizes "the programmer writes capitals in front of the computer". From words such as "programmer" and "computer", the current application scene can be identified as a programmer's work scene, and deep learning of the engine's output in the corresponding deep learning model corrects "writes capitals" to "writes code", yielding the correct voice recognition result.
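The programmer example can be mimicked with a scene-conditioned substitution table: keywords in the raw transcript select a scene, and that scene's confusion table supplies the fix. All table entries and phrases below are invented stand-ins, not the patent's data.

```python
# Illustrative sketch: infer the scene from keywords in the raw transcript,
# then apply that scene's confusion table. All entries are made up.

SCENE_KEYWORDS = {"programmer": "coding", "computer": "coding"}
CONFUSIONS = {"coding": {"writes capitals": "writes code"}}

def correct_transcript(raw):
    # First keyword hit decides the scene; no hit means no scene-specific fix.
    scene = next((SCENE_KEYWORDS[w] for w in raw.split() if w in SCENE_KEYWORDS), None)
    if scene is None:
        return raw
    fixed = raw
    for wrong, right in CONFUSIONS[scene].items():
        fixed = fixed.replace(wrong, right)
    return fixed

print(correct_transcript("the programmer writes capitals in front of the computer"))
# -> the programmer writes code in front of the computer
```

In the patent the substitution is produced by a per-scene deep learning model rather than a fixed table; the table merely makes the data flow concrete.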
The voice recognition correction method provided by the embodiment of the invention can meet the voice recognition requirements of specific application scenes, can perform voice recognition on each application scene in a targeted manner, greatly improves the accuracy of voice recognition, further promotes man-machine interaction, enables people to effectively communicate with machines, improves the user experience, and has a wide application range.
Example two
Fig. 2 is a flowchart of a speech recognition correction method according to a second embodiment of the present invention, where the second embodiment of the present invention is optimized based on the first embodiment, specifically, the operation of determining the current application scenario where the user is located according to the detection data of the detection device is further optimized, as shown in fig. 2, the second embodiment of the present invention specifically includes:
s201, carrying out voice recognition on the detected sound, and judging an application scene corresponding to a corpus to which the corpus belongs by the voice recognition.
Specifically, corpus sets that each map to an application scene are collected and stored; a corpus set is the collection of all corpora gathered for that scene. Voice recognition is performed on the detected sound according to the corpus input by the user, the recognized corpus is compared against the contents of the stored corpus sets, and the application scene mapped to the matching corpus set is taken as the current scene. The mapping between keywords and application scenes can be established by collecting keywords specific to each scene. For example, all corpora of a restaurant scene, such as common phrases and menu names, are collected, and a mapping between these corpora and the restaurant application scene is established.
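S201 can be sketched as matching a recognized utterance against per-scene keyword sets and picking the scene with the most hits. The keyword sets are invented for illustration.

```python
# Sketch of S201: per-scene keyword sets; the scene with the largest
# overlap with the utterance's tokens wins. Keyword sets are invented.

SCENE_CORPORA = {
    "restaurant": {"menu", "order", "dish", "waiter", "bill"},
    "office": {"meeting", "report", "deadline", "email"},
}

def scene_from_corpus(utterance):
    tokens = set(utterance.lower().split())
    best, hits = None, 0
    for scene, keywords in SCENE_CORPORA.items():
        n = len(tokens & keywords)
        if n > hits:
            best, hits = scene, n
    return best  # None when nothing matches

print(scene_from_corpus("could I order the dish on the menu"))  # -> restaurant
```

A real system would match against the full stored corpus sets rather than a handful of keywords, but the mapping logic is the same.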
S202, detecting the position of the mobile terminal through a positioning module, and acquiring the current application scene of the user.
Specifically, the position of the user can be detected through a module with a positioning function in the mobile terminal the user carries, and the current application scene is determined from the detection result. The positioning module may use the Global Positioning System (GPS), Bluetooth positioning technology, or positioning through map software over a mobile data or wireless local area network connection.
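A positioning fix can be turned into a scene label with a nearest-venue lookup, as in this sketch. The venue coordinates, the 100 m threshold, and the flat-earth distance approximation are all assumptions for illustration.

```python
import math

# Sketch of S202: map a GPS fix to the nearest known venue within a radius.
# Venue coordinates and the 100 m threshold are illustrative.

VENUES = [
    ("restaurant", 39.9087, 116.3975),
    ("office", 39.9150, 116.4040),
]

def scene_from_position(lat, lon, max_m=100.0):
    def meters(lat1, lon1, lat2, lon2):
        # Small-distance approximation: 1 degree of latitude is ~111.32 km.
        dy = (lat2 - lat1) * 111_320.0
        dx = (lon2 - lon1) * 111_320.0 * math.cos(math.radians(lat1))
        return math.hypot(dx, dy)

    best = min(VENUES, key=lambda v: meters(lat, lon, v[1], v[2]))
    return best[0] if meters(lat, lon, best[1], best[2]) <= max_m else None

print(scene_from_position(39.9088, 116.3976))  # -> restaurant
```

For city-scale thresholds this approximation is adequate; a production system would query map software as the text describes.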
S203, detecting the characteristics of the application scene through the Bluetooth digital signal processing equipment, and determining the current application scene according to the characteristics.
Specifically, a sensor in the Bluetooth digital signal processing device collects signals from the current environment, and characteristics of the application scene are detected from the collected signals; for example, a temperature sensor can measure the ambient temperature to decide whether the environment is indoor or outdoor, and thereby determine the current application scene where the user is located.
In this embodiment, the Global Positioning System may be used to locate the user; for example, if the user is located in a restaurant, the current application scene is judged to be a restaurant, and the voice recognition result is interpreted in relation to the restaurant scene.
It should be noted that all three methods above determine the current application scene; any one, any two, or all three may be selected according to the actual application.
And S204, performing voice recognition on the detected sound in the current application scene.
S205, deep learning is carried out on the linguistic data obtained by voice recognition based on the deep learning model corresponding to the current application scene, and a learning result is obtained.
And S206, correcting the voice recognition result according to the learning result.
The correction method for voice recognition provided by the embodiment of the invention can accurately acquire the current application scene where the user is located, and performs voice recognition according to the current application scene in a targeted manner, so that the accuracy of voice recognition is improved, and the actual interactive experience between the user and a product is improved.
EXAMPLE III
Fig. 3a is a flowchart of a correction method for speech recognition according to a third embodiment of the present invention, which is optimized and improved based on the above embodiments, and further illustrates an operation before determining a current application scenario where a user is located according to detection data of a set detection device, as shown in fig. 3a, the method according to the third embodiment of the present invention specifically includes:
s301, clustering the corpus under each application scene by using a clustering algorithm, and extracting corpus features according to the clustering result.
Preferably, the corpus comprises: stored user-entered corpus, screened corpus, and/or corpus obtained by correcting speech recognition results.
Specifically, the corpus serves as the basic data of the deep learning model. It may be stored corpus entered by users, and/or corpus screened by professional speech technologists according to various topics, and/or corpus obtained by performing speech synthesis on the voice recognition result and then analyzing and correcting the synthesis. The corpus is grouped with a clustering algorithm, such as a partitioning method (for example, k-means) or a hierarchical method, and the features of each group of corpora are extracted.
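S301's grouping and feature extraction can be sketched with bag-of-words vectors and a tiny k-means. The four-sentence corpus, the fixed seed choice, and the "top words per cluster" notion of a feature are all illustrative assumptions.

```python
from collections import Counter

# Sketch of S301: bag-of-words vectors plus a minimal k-means (k=2) to group
# an invented corpus, then take each group's most frequent words as features.

CORPUS = [
    "order the dish from the menu",
    "the menu has a new dish",
    "send the report before the meeting",
    "the meeting needs a report",
]

VOCAB = sorted({w for doc in CORPUS for w in doc.split()})

def vectorize(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in VOCAB]

def kmeans(vectors, seeds, rounds=5):
    # Deterministic seeding so the grouping is reproducible in this sketch.
    centroids = [vectors[i] for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(rounds):
        labels = [min(range(len(centroids)),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(v, centroids[c])))
                  for v in vectors]
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans([vectorize(d) for d in CORPUS], seeds=[0, 2])

def features(cluster, top=2):
    # Most frequent non-stopword tokens of a cluster stand in for its features.
    words = Counter(w for d, l in zip(CORPUS, labels) if l == cluster
                    for w in d.split() if w != "the")
    return [w for w, _ in words.most_common(top)]

print(labels)       # -> [0, 0, 1, 1]
print(features(0))  # -> ['dish', 'menu']
```

The patent's pipeline would feed such extracted features into per-scene model training; libraries like scikit-learn provide production-grade clustering.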
S302, training the corpus features, and creating deep learning models corresponding to the application scenes.
Specifically, the corpus is fed into the model and its features are trained through a neural network that imitates the thinking of the human brain, creating a deep learning model for each application scene. The accuracy of the voice recognition result for each corpus is then judged in combination with the application scene.
And S303, determining the current application scene of the user according to the detection data of the setting detection device.
S304, performing voice recognition on the detected sound in the current application scene.
S305, deep learning is carried out on the linguistic data obtained by voice recognition based on the deep learning model corresponding to the current application scene, and a learning result is obtained.
S306, correcting the voice recognition result according to the learning result.
In this embodiment, fig. 3b is a schematic diagram of the correction method for voice recognition according to the third embodiment of the present invention. Referring to fig. 3b, the current application scene where the user is located can be determined through the positioning function of the user's mobile terminal, through the Bluetooth digital signal processing device, and by searching for the application scene that matches the input corpus. The stored user corpus, the classified corpus provided by speech technologists, and the corpus corrected from speech synthesis results are input into the model for training, creating a deep learning model corresponding to each application scene. The voice recognition result of the big-data speech engine is then input into the deep learning model, which corrects the result according to the current application scene, predicts error-prone points, corrects wrong recognition results, and replaces the original wrong translation with the correct one.
According to the correction method for voice recognition provided by the third embodiment of the invention, the current application scene recognition is more accurate by creating the deep learning model, so that the accuracy of the voice recognition result is judged, the inaccurate voice recognition result is corrected, and the accuracy of the voice recognition is improved.
Example four
Fig. 4 is a flowchart of a speech recognition correction method according to a fourth embodiment of the present invention, which is optimized and improved based on the foregoing embodiments, and further describes an operation of correcting a speech recognition result according to the learning result, as shown in fig. 4, the method according to the fourth embodiment of the present invention specifically includes:
s401, determining the current application scene of the user according to the detection data of the setting detection device.
S402, performing voice recognition on the detected sound in the current application scene.
And S403, performing deep learning on the linguistic data obtained by voice recognition based on the deep learning model corresponding to the current application scene to obtain a learning result.
S404, if the learning result is that the voice recognition result is not matched with the current application scene, correcting the voice recognition result into a corresponding result in the current application scene.
Specifically, whether the voice recognition result output by the big-data speech engine matches the current application scene is verified; if not, the result is corrected to the one matching the current application scene and translated into the correct text, which replaces the original erroneous result.
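The match check and substitution of S404 can be sketched as comparing the engine transcript against scene-specific candidate phrases under a simple scene score. The vocabulary, candidates, and keyword-overlap scoring are invented stand-ins for the patent's learned model.

```python
# Sketch of S404: score the engine transcript and scene-specific candidate
# phrases by keyword overlap with the scene vocabulary, and substitute the
# candidate when the transcript scores lower. All data is illustrative.

SCENE_VOCAB = {"restaurant": {"menu", "order", "dish", "set"}}
SCENE_CANDIDATES = {"restaurant": ["order the set menu"]}

def scene_score(text, scene):
    return len(set(text.split()) & SCENE_VOCAB[scene])

def verify_and_correct(transcript, scene):
    best = max(SCENE_CANDIDATES[scene], key=lambda c: scene_score(c, scene))
    if scene_score(transcript, scene) >= scene_score(best, scene):
        return transcript   # matches the scene: keep it
    return best             # mismatch: replace with the scene-specific result

print(verify_and_correct("order the set manual", "restaurant"))
# -> order the set menu
```

In the patent the mismatch decision comes from the per-scene deep learning model, not a keyword count; the sketch only fixes the verify-then-replace control flow.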
The voice recognition correction method provided by the fourth embodiment of the invention corrects the voice recognition result which is not matched with the application scene, improves the accuracy of voice recognition and translation in the specific application scene, and optimizes the system logic.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a speech recognition correction apparatus according to a fifth embodiment of the present invention, which is applied to correct a speech recognition result that does not match with an application scenario. As shown in fig. 5, the apparatus includes: a scene determination module 501, a speech recognition module 502, a deep learning module 503, and a correction module 504.
A scene determining module 501, configured to determine a current application scene where a user is located according to detection data of a set detection device;
a speech recognition module 502, configured to perform speech recognition on the detected sound in the current application scenario;
a deep learning module 503, configured to perform deep learning on the corpus obtained by speech recognition based on the deep learning model corresponding to the current application scenario, and obtain a learning result;
and a correcting module 504, configured to correct the result of speech recognition according to the learning result.
The embodiment of the invention determines the current application scene from acquired detection data, deeply learns the corpus obtained by voice recognition in the deep learning model corresponding to that scene, corrects a voice recognition result that does not match the current scene, and replaces it with the correct text translation. This meets the voice recognition requirements of specific application scenes, performs targeted recognition for each scene, greatly improves recognition accuracy, and further promotes man-machine interaction, so that people and machines can communicate effectively, the user experience is improved, and the device applies to a wide range of uses.
On the basis of the foregoing embodiments, the scene determining module 501 may include:
the first determining unit is used for performing voice recognition on the detected sound and identifying the application scene mapped to the corpus set to which the recognized corpus belongs;
the second determining unit is used for detecting the position of the mobile terminal through the positioning module and acquiring the current application scene of the user;
and the third determining unit is used for detecting the characteristics of the application scene through the Bluetooth digital signal processing equipment and determining the current application scene according to the characteristics.
On the basis of the above embodiments, the apparatus may further include:
the characteristic extraction unit is used for grouping the corpus under each application scene by using a clustering algorithm and extracting corpus characteristics according to the grouping result;
and the model creating unit is used for training the corpus features and creating deep learning models corresponding to each application scene.
On the basis of the foregoing embodiments, the correction module 504 may include:
and the correcting unit is used for correcting the voice recognition result into a corresponding result in the current application scene if the learning result is that the voice recognition result is not matched with the current application scene.
On the basis of the foregoing embodiments, the corpus may include:
stored user-entered corpus, screened corpus, and/or corpus obtained by correcting speech recognition results.
In this embodiment, the scene determination module determines the current application scene where the user is located by searching for the application scene matching the input corpus (first determining unit), locating the geographic position of the user (second determining unit), and detecting application-scene characteristics (third determining unit); the voice recognition module then recognizes the sound detected in the current scene to obtain a recognition result. Stored corpus input by users, corpus screened by professional speech technologists according to various topics, and/or corpus obtained by performing speech synthesis on recognition results and then analyzing and correcting the synthesis are input into the model as the basic corpus data for training, creating the deep learning model corresponding to each application scene. The deep learning module performs deep learning on the recognized corpus with the model of the current application scene; if the learning result shows that the recognition result does not match the current scene, the correcting unit of the correction module corrects it, translates it into the correct text, and replaces the original translation result.
The voice recognition correction device provided by the fifth embodiment of the invention improves the accuracy of voice recognition, promotes effective human-computer interaction, strengthens the logical consistency of the voice recognition system, and has a wide application range.
The voice recognition correction device provided by the embodiment of the invention can execute the voice recognition correction method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (6)

1. A method for correcting speech recognition, comprising:
determining a current application scene where a user is located according to detection data of a set detection device, wherein the determining the current application scene where the user is located according to the detection data of the set detection device comprises: performing voice recognition on the detected sound, and determining the application scene corresponding to the corpus to which the recognized speech belongs; detecting the position of the mobile terminal through a positioning module to acquire the current application scene of the user; and detecting the characteristics of the application scene through a Bluetooth digital signal processing device, and determining the current application scene according to the characteristics;
performing voice recognition on the detected sound in the current application scene;
performing deep learning on the linguistic data obtained by voice recognition based on the deep learning model corresponding to the current application scene to obtain a learning result;
correcting the result of the voice recognition according to the learning result;
the correcting the result of the voice recognition according to the learning result comprises:
and if the learning result is that the voice recognition result is not matched with the current application scene, correcting the voice recognition result into a corresponding result in the current application scene.
2. The method according to claim 1, wherein before determining the current application scenario where the user is located according to the detection data of the setting detection device, the method further comprises:
clustering a corpus under each application scene by using a clustering algorithm, and extracting corpus features according to the clustering result;
and training on the corpus features, and creating a deep learning model corresponding to each application scene.
3. The method of claim 2, wherein the corpus comprises: stored user-entered corpus, screened corpus, and/or corpus obtained by correcting speech recognition results.
4. A correction device for speech recognition, comprising:
a scene determining module, configured to determine a current application scene where a user is located according to detection data of a set detection device, where the scene determining module includes: a first determining unit, configured to perform speech recognition on the detected sound and determine the application scene corresponding to the corpus to which the recognized speech belongs; a second determining unit, configured to detect the position of the mobile terminal through a positioning module and acquire the current application scene of the user; and a third determining unit, configured to detect the characteristics of the application scene through a Bluetooth digital signal processing device and determine the current application scene according to the characteristics;
the voice recognition module is used for performing voice recognition on the detected sound in the current application scene;
the deep learning module is used for carrying out deep learning on the linguistic data obtained by voice recognition based on a deep learning model corresponding to the current application scene to obtain a learning result;
the correction module is used for correcting the result of the voice recognition according to the learning result;
the correction module includes:
and the correcting unit is used for correcting the voice recognition result into a corresponding result in the current application scene if the learning result is that the voice recognition result is not matched with the current application scene.
5. The apparatus of claim 4, further comprising:
the feature extraction unit is used for clustering the corpus under each application scene by using a clustering algorithm, and extracting corpus features according to the clustering result;
and the model creating unit is used for training on the corpus features and creating a deep learning model corresponding to each application scene.
6. The apparatus of claim 5, wherein the corpus comprises:
stored user-entered corpus, screened corpus, and/or corpus obtained by correcting speech recognition results.
CN201710291330.4A 2017-04-28 2017-04-28 Correction method and device for voice recognition Active CN106875949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710291330.4A CN106875949B (en) 2017-04-28 2017-04-28 Correction method and device for voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710291330.4A CN106875949B (en) 2017-04-28 2017-04-28 Correction method and device for voice recognition

Publications (2)

Publication Number Publication Date
CN106875949A CN106875949A (en) 2017-06-20
CN106875949B true CN106875949B (en) 2020-09-22

Family

ID=59161656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710291330.4A Active CN106875949B (en) 2017-04-28 2017-04-28 Correction method and device for voice recognition

Country Status (1)

Country Link
CN (1) CN106875949B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293296B (en) * 2017-06-28 2020-11-20 百度在线网络技术(北京)有限公司 Voice recognition result correction method, device, equipment and storage medium
CN107680600B (en) * 2017-09-11 2019-03-19 平安科技(深圳)有限公司 Sound-groove model training method, audio recognition method, device, equipment and medium
CN108831505B (en) * 2018-05-30 2020-01-21 百度在线网络技术(北京)有限公司 Method and device for identifying use scenes of application
CN109104534A (en) * 2018-10-22 2018-12-28 北京智合大方科技有限公司 A kind of system for improving outgoing call robot and being intended to Detection accuracy, recall rate
CN109410913B (en) * 2018-12-13 2022-08-05 百度在线网络技术(北京)有限公司 Voice synthesis method, device, equipment and storage medium
CN111368145A (en) * 2018-12-26 2020-07-03 沈阳新松机器人自动化股份有限公司 Knowledge graph creating method and system and terminal equipment
CN111951626A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Language learning apparatus, method, medium, and computing device
CN110544234A (en) * 2019-07-30 2019-12-06 北京达佳互联信息技术有限公司 Image noise detection method, image noise detection device, electronic equipment and storage medium
CN110556127B (en) * 2019-09-24 2021-01-01 北京声智科技有限公司 Method, device, equipment and medium for detecting voice recognition result
CN111104546B (en) * 2019-12-03 2021-08-27 珠海格力电器股份有限公司 Method and device for constructing corpus, computing equipment and storage medium
CN113660501A (en) * 2021-08-11 2021-11-16 云知声(上海)智能科技有限公司 Method and device for matching subtitles
CN114155841B (en) * 2021-11-15 2025-06-10 安徽听见科技有限公司 Speech recognition method, device, equipment and storage medium
CN114842857A (en) * 2022-03-25 2022-08-02 阿里巴巴(中国)有限公司 Speech processing method, apparatus, system, device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447019A (en) * 2014-08-20 2016-03-30 北京羽扇智信息科技有限公司 User usage scene based input identification result calibration method and system
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5524169A (en) * 1993-12-30 1996-06-04 International Business Machines Incorporated Method and system for location-specific speech recognition
CN1207664C (en) * 1999-07-27 2005-06-22 国际商业机器公司 Error correcting method for voice identification result and voice identification system
US7200555B1 (en) * 2000-07-05 2007-04-03 International Business Machines Corporation Speech recognition correction for devices having limited or no display
ATE311650T1 (en) * 2001-09-17 2005-12-15 Koninkl Philips Electronics Nv CORRECTION OF A TEXT RECOGNIZED BY A VOICE RECOGNITION BY COMPARING THE PHONE SEQUENCES OF THE RECOGNIZED TEXT WITH A PHONETIC TRANSCRIPTION OF A MANUALLY ENTRED CORRECTION WORD
CN102324233B (en) * 2011-08-03 2014-05-07 中国科学院计算技术研究所 Method for automatically correcting identification error of repeated words in Chinese pronunciation identification
CN103903619B (en) * 2012-12-28 2016-12-28 科大讯飞股份有限公司 A kind of method and system improving speech recognition accuracy
CN103645876B (en) * 2013-12-06 2017-01-18 百度在线网络技术(北京)有限公司 Voice inputting method and device
CN105786880A (en) * 2014-12-24 2016-07-20 中兴通讯股份有限公司 Voice recognition method, client and terminal device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
CN105447019A (en) * 2014-08-20 2016-03-30 北京羽扇智信息科技有限公司 User usage scene based input identification result calibration method and system

Also Published As

Publication number Publication date
CN106875949A (en) 2017-06-20

Similar Documents

Publication Publication Date Title
CN106875949B (en) Correction method and device for voice recognition
US11322153B2 (en) Conversation interaction method, apparatus and computer readable storage medium
CN107240398B (en) Intelligent voice interaction method and device
CN103956169B (en) A kind of pronunciation inputting method, device and system
CN103065630B (en) User personalized information voice recognition method and user personalized information voice recognition system
CN111341305B (en) Audio data labeling method, device and system
EP3153978B1 (en) Address search method and device
CN105895103B (en) Voice recognition method and device
CN111292751B (en) Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
CN110415679A (en) Speech error correction method, device, equipment and storage medium
CN111488468B (en) Geographic information knowledge point extraction method and device, storage medium and computer equipment
JP2005084681A (en) Method and system for semantic language modeling and reliability measurement
WO2021103712A1 (en) Neural network-based voice keyword detection method and device, and system
CN109256125B (en) Off-line voice recognition method and device and storage medium
CN104407834A (en) Message input method and device
CN106570180A (en) Artificial intelligence based voice searching method and device
CN109213856A (en) Semantic recognition method and system
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment
CN103632668B (en) A kind of method and apparatus for training English speech model based on Chinese voice information
CN106649253B (en) Auxiliary control method and system based on rear verifying
CN112069833B (en) Log analysis method, log analysis device and electronic equipment
CN109710949A (en) A kind of interpretation method and translator
KR102017229B1 (en) A text sentence automatic generating system based deep learning for improving infinity of speech pattern
CN111916062A (en) Voice recognition method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant