
CN113016029A - Method and apparatus for providing context-based speech recognition service - Google Patents

Method and apparatus for providing context-based speech recognition service

Info

Publication number
CN113016029A
CN113016029A
Authority
CN
China
Prior art keywords
speech recognition, voice, speech, recognition result, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880099155.1A
Other languages
Chinese (zh)
Inventor
黄铭振
姜敏虎
池昌真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Saisteran International Co ltd
Original Assignee
Saisteran International Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Saisteran International Co ltd filed Critical Saisteran International Co ltd
Publication of CN113016029A publication Critical patent/CN113016029A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/221 Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and an apparatus for recognizing speech. In more detail, the speech recognition apparatus according to the present invention acquires voice information from a user and converts the acquired voice information into voice data. The apparatus generates a first speech recognition result by recognizing the converted voice data using a first speech recognition model, generates a second speech recognition result by recognizing the converted voice data using a second speech recognition model, and selects a specific speech recognition result from the first speech recognition result and the second speech recognition result through a specific determination process.

Description

Method and apparatus for providing context-based speech recognition service
Technical Field
The invention relates to a method and an apparatus for recognizing a user's voice. More particularly, it relates to a method and apparatus for improving speech recognition accuracy based on context when recognizing speech acquired from a user.
Background
Automatic speech recognition (hereinafter referred to as speech recognition) is a technique of converting speech into text using a computer. In recent years, such speech recognition has achieved a rapid increase in recognition rate.
However, although the recognition rate has generally improved, performance differences occur depending on the data structure or model structure used when the language model or acoustic model is trained.
Disclosure of Invention
Technical problem to be solved
The present invention has been made in view of the above problems, and an object of the present invention is to provide a method for selecting a highly accurate speech recognition result from a plurality of speech recognition results when performing speech recognition using a plurality of speech recognition models.
It is another object of the present invention to provide a method for selecting a speech recognition model for speech recognition using context information.
Technical problems to be achieved in the present invention are not limited to the above technical problems, and other technical problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the following description.
Technical solution
To achieve the object, a method of recognizing speech according to the present invention includes: a step of acquiring voice information from a user; converting the acquired voice information into voice data; a step of generating a first voice recognition result by recognizing the converted voice data using a first voice recognition model; a step of generating a second speech recognition result by recognizing the converted speech data using a second speech recognition model; and a step of selecting a specific voice recognition result from the first voice recognition result and the second voice recognition result by a specific determination process.
In the present invention, the specific determination process includes: extracting context information from the first speech recognition result and the second speech recognition result; comparing the context information with a first feature of the first voice recognition model and a second feature of the second voice recognition model which are preset respectively; and a step of selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.
In the present invention, the context information includes at least one of a part of the speech information, information obtained from the first speech recognition result and the second speech recognition result, or information related to the user who uttered the speech.
In the present invention, the first speech recognition model and the second speech recognition model are one of a plurality of speech recognition models for recognizing the speech information obtained from the user.
Also, the present invention further includes a step of generating a plurality of voice recognition results by recognizing the converted voice data using a plurality of voice recognition models, the specific voice recognition result being one selected from among the first voice recognition result, the second voice recognition result, and the plurality of voice recognition results.
Also, in the present invention, the specific determination process is a process for determining a speech recognition result based on a context included in context information.
Further, the present invention provides a method for recognizing speech, including: a step of acquiring voice information from a user; converting the acquired voice information into voice data; a step of generating a first speech recognition result by recognizing the speech data using the first speech recognition model; a step of selecting a second speech recognition model for recognizing the speech data from a plurality of speech recognition models based on the first speech recognition result; and a step of generating a second speech recognition result by recognizing the speech data using the second speech recognition model.
The present invention further includes: a step of extracting context information from the first speech recognition result; and a step of comparing the context information with preset details of the plurality of speech recognition models and selecting the second speech recognition model according to the comparison result.
Also, in the present invention, the first speech recognition model is a speech recognition model for extracting the context information.
Further, the present invention provides a method for recognizing speech, including: a step of acquiring voice information from a user; converting the acquired voice information into voice data; and a step of generating a speech recognition result by recognizing the speech data using a specific speech recognition model selected from the plurality of speech recognition models.
Also, the present invention further includes: a step of setting context information for speech recognition; and a step of selecting the specific speech recognition model whose features are most suitable for the context information from among the plurality of speech recognition models.
Advantageous effects
According to an embodiment of the present invention, when a plurality of results are generated using a plurality of speech recognition models at the time of recognizing a speech input, the accuracy of speech recognition can be improved by selecting the more accurate of the recognition results of the speech recognition models.
In addition, by selecting a speech recognition model according to context information, each of a plurality of speech recognition models can be used according to the purpose.
Furthermore, an appropriate speech recognition model can be selected even in a service for a large-scale user or in an environment in which the physical and contextual environment in which the user is located changes from time to time.
In addition, since an appropriate speech recognition model can be selected, it is possible to reduce misrecognition due to similar words that may occur when a large language model is used, and misrecognition due to unregistered words that may occur when a small language model is applied.
Drawings
The accompanying drawings, which are included as a part of the detailed description to assist in understanding the invention, provide embodiments of the invention and, together with the detailed description, describe the technical features of the invention.
Fig. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
Fig. 2 and 3 are diagrams showing examples of a voice recognition apparatus according to an embodiment of the present invention.
Fig. 4 and 5 are diagrams showing another example of a voice recognition apparatus according to an embodiment of the present invention.
Fig. 6 is a diagram showing another example of a voice recognition apparatus according to an embodiment of the present invention.
Fig. 7 is a flowchart illustrating an example of a voice recognition method according to an embodiment of the present invention.
Fig. 8 is a flowchart illustrating another example of a voice recognition method according to an embodiment of the present invention.
Fig. 9 is a flowchart illustrating another example of a voice recognition method according to an embodiment of the present invention.
Detailed Description
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The detailed description, which will be disclosed below in connection with the appended drawings, is intended to describe exemplary embodiments of the invention, and is not intended to represent the only embodiments in which the invention may be practiced. The following detailed description includes specific details in order to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
In some cases, well-known structures and devices may be omitted or may be shown in block diagram form centering on the core function of each structure and device in order to avoid obscuring the concepts of the present invention.
Fig. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
Referring to fig. 1, a voice recognition apparatus 100 for recognizing a user's voice includes an input unit 110, a storage unit 120, a control unit 130, and/or an output unit 140, and the like.
Since the components shown in fig. 1 are not all essential, an electronic device having more components or fewer components may be implemented.
Hereinafter, the above components will be described in order.
The input unit 110 may receive an audio signal, a video signal, voice information, or data from a user.
The input unit 110 may include a camera and a microphone to receive an audio signal or a video signal. The camera processes image frames, such as still images or moving images, acquired by the image sensor in a video call mode or a photographing mode.
The image frames processed by the camera may be stored in the storage unit 120.
The microphone receives external sound signals in a call mode, a recording mode, or a voice recognition mode, and processes them into electric voice data. Various noise removal algorithms may be implemented in the microphone to remove noise generated in the process of receiving the external sound signal.
When a voice spoken by a user is input through the microphone, the input unit 110 converts the voice into an electric signal and transmits it to the control unit 130.
The control unit 130 may obtain the user's voice data by applying a voice recognition algorithm or a voice recognition engine to the signal received from the input unit 110.
At this time, the signal input to the control unit 130 may be converted into a form more useful for voice recognition: the control unit 130 converts the input signal from analog to digital form and detects the start point and end point of the actual voice portion included in the voice data. This is called end point detection (EPD).
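As an illustration of this step, a minimal energy-based end point detector is sketched below in Python; the 25 ms frame length and the 10% peak-energy threshold are assumptions made for the example, and practical detectors typically add zero-crossing analysis and hangover smoothing.

```python
import numpy as np

def detect_endpoints(samples: np.ndarray, sr: int,
                     frame_ms: int = 25, energy_ratio: float = 0.1):
    """Return (start, end) sample indices of the voiced region (EPD sketch)."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Short-term energy per frame
    energy = (frames.astype(np.float64) ** 2).sum(axis=1)
    # Frames louder than a fraction of the peak energy count as speech
    voiced = np.flatnonzero(energy > energy.max() * energy_ratio)
    if voiced.size == 0:
        return 0, 0
    return int(voiced[0] * frame_len), int((voiced[-1] + 1) * frame_len)
```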
Also, the control unit 130 may extract a feature vector of the signal within the detected interval by applying a feature vector extraction technique such as Cepstrum, Linear Predictive Coding (LPC), Mel-Frequency Cepstral Coefficients (MFCC), or filter bank energies.
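For example, MFCC feature vectors for the detected interval can be computed with an off-the-shelf library; the sketch below uses librosa, and the 16 kHz sampling rate and 13 coefficients are illustrative choices, not values from the patent.

```python
import librosa

def extract_features(wav_path: str, n_mfcc: int = 13):
    # Load audio resampled to 16 kHz and compute MFCCs
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return mfcc.T  # one feature vector per frame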
The storage unit 120 may store a program for the operation of the control unit 130, and may temporarily store input/output data.
The storage unit 120 may store various data related to the recognized voice, and particularly, may store information about an end point of voice data processed by the control unit 130 and a feature vector.
The storage unit 120 includes at least one of a flash memory, a hard disk, a memory card, a ROM (read-only memory), a RAM (random access memory), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, and an optical disk.
In addition, the control unit 130 may obtain a recognition result by comparing the extracted feature vector with trained reference patterns. For this purpose, a speech recognition model for modeling and comparing the signal characteristics of speech, and a language model for modeling the linguistic sequential relationship of the words or syllables that make up the recognized vocabulary, may be used.
The voice recognition model can be classified into a direct comparison method, which sets the recognition object as a feature vector model and compares it with the feature vectors of the voice data, and a statistical model method, which statistically processes the feature vectors of the recognition object.
The direct comparison method is a method of setting units such as words and phonemes as a feature vector model and comparing the degree of similarity with the input speech, and representatively, there is a vector quantization method. According to the vector quantization method, a feature vector of input voice data is mapped with a codebook as a reference model and encoded as a representative value, thereby comparing code values with each other.
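A minimal sketch of the vector quantization comparison follows, assuming a codebook of representative vectors has already been trained (for example with k-means); the Euclidean distance measure and the array shapes are assumptions of the example. Comparing the distortion against each reference codebook then plays the role of the code-value comparison described above.

```python
import numpy as np

def vq_encode(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector (row of `features`) to its nearest codeword index."""
    # Pairwise squared Euclidean distances, shape (n_frames, n_codes)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def vq_distortion(features: np.ndarray, codebook: np.ndarray) -> float:
    # Average distance to the nearest codeword; lower means a better match
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return float(dists.min(axis=1).mean())
```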
The statistical model method is a method of configuring the units of the recognition object as state sequences and using the relationships between the state sequences. A state sequence may be composed of a plurality of nodes. Methods using the relationship between state sequences include Dynamic Time Warping (DTW), Hidden Markov Models (HMM), and methods using neural networks.
Dynamic time warping is a method of compensating for differences on the time axis when comparing the input with a reference model, considering the dynamic characteristics of speech, whose signal length varies with time even when the same person utters the same word. The hidden Markov model is a recognition technique that assumes speech to be a Markov process with state transition probabilities and, in each state, observation probabilities of nodes (output symbols); it estimates the state transition probabilities and node observation probabilities from training data and then calculates the likelihood that the input speech was generated from the estimated model.
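The time-axis compensation performed by DTW can be written as a short dynamic program; the sketch below is a textbook formulation with Euclidean local distance, not the patent's implementation.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between two feature sequences (rows are frame vectors)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch sequence a
                                 cost[i, j - 1],      # stretch sequence b
                                 cost[i - 1, j - 1])  # aligned step
    return float(cost[n, m])
```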
On the other hand, a language model that models language sequential relationships such as words or syllables can reduce acoustic ambiguity and reduce recognition errors by applying sequential relationships between units that make up the language to units obtained from speech recognition. Language models include statistical language models that use chain probabilities of words, such as Unigram, Bigram, and Trigram, and Finite State Automaton (FSA) -based models.
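As an illustration of the chain-probability idea, a minimal bigram language model is sketched below; the add-one smoothing and the sentence boundary markers are assumptions of the example.

```python
from collections import Counter

class BigramLM:
    """Minimal bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for words in sentences:
            padded = ["<s>"] + list(words) + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab = len(self.unigrams)

    def prob(self, prev: str, word: str) -> float:
        # P(word | prev) with add-one smoothing
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab)

    def sentence_prob(self, words) -> float:
        # Chain probability over the padded word sequence
        p = 1.0
        padded = ["<s>"] + list(words) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            p *= self.prob(prev, word)
        return p
```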
The control unit 130 may use any of the above methods in recognizing speech. For example, a speech recognition model to which a hidden markov model is applied may be used, or an N-best search method in which a speech recognition model and a language model are integrated may be used. The N-best search method may improve recognition performance by selecting a maximum of N recognition result candidates using a speech recognition model and a language model, and then re-evaluating the ranking of the candidates.
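A sketch of the re-evaluation step: given N candidate transcripts with acoustic log-probabilities, combine each with a language model score and re-rank. The interpolation weight and the reuse of the BigramLM sketched above are assumptions, not the patent's method.

```python
import math

def rescore_nbest(candidates, lm, lm_weight: float = 0.5):
    """Re-rank (words, acoustic_log_prob) pairs by a combined AM + LM score."""
    def combined(cand):
        words, am_log_prob = cand
        lm_log_prob = math.log(lm.sentence_prob(words))
        return (1 - lm_weight) * am_log_prob + lm_weight * lm_log_prob
    # Highest combined score first; the top entry becomes the recognition result
    return sorted(candidates, key=combined, reverse=True)
```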
The control unit 130 may calculate a confidence score (or may be abbreviated as "confidence") to ensure the reliability of the recognition result.
The confidence score is a measure of how reliable the speech recognition result is; it may be defined as the relative value of the probability that the speech was actually uttered as the recognized phoneme or word rather than as another phoneme or word. Accordingly, the confidence score may be represented as a value between 0 and 1 or between 0 and 100. If the confidence score is greater than a preset threshold, the recognition result is accepted; if it is below the threshold, the recognition result may be rejected.
In addition, the confidence score may be obtained according to various existing confidence score acquisition algorithms.
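One simple way to realize the accept/reject decision is to take the confidence as the probability of the best hypothesis relative to its strongest competitor; the 0.7 threshold below is an illustrative assumption.

```python
def accept_result(best_prob: float, competitor_prob: float,
                  threshold: float = 0.7) -> bool:
    # Confidence in [0, 1]: relative weight of the best hypothesis
    confidence = best_prob / (best_prob + competitor_prob)
    return confidence >= threshold
```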
The control unit 130 may be implemented in a computer-readable recording medium using software, hardware, or a combination thereof. In a hardware implementation, it may use at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a processor, a micro control unit, a microprocessor, or other electrical units for performing functions.
In a software implementation, it may be implemented together with a separate software module that performs at least one function or operation, and the software code may be implemented by a software application written in a suitable programming language.
The control unit 130 implements the functions, processes, and/or methods set forth in fig. 2 to 6 described later; hereinafter, for convenience of explanation, the control unit 130 is described as being equivalent to the voice recognition apparatus 100.
The output unit 140 is used to generate outputs related to vision, hearing, and the like, and output information processed by the apparatus 100.
For example, the output unit 140 may output a recognition result of a voice signal processed by the control unit 130 so that a user can recognize through a visual or auditory function.
The voice recognition models described below can recognize voice information input from a user by the same method as the voice recognition model described in fig. 1.
Fig. 2 and 3 are diagrams showing examples of a voice recognition apparatus according to an embodiment of the present invention.
Referring to fig. 2 and 3, the voice recognition apparatus recognizes voice data acquired from a user using a plurality of voice recognition models, and selects one of the recognition results from the plurality of voice recognition models based on context information to provide a voice recognition service.
Specifically, the voice recognition apparatus may generate voice data by converting the voice information input from the user into an electric signal and then converting that analog signal into a digital signal.
The speech recognition apparatus may then recognize the speech data using the first speech recognition model (2010) and the second speech recognition model (2020), respectively.
That is, the speech recognition apparatus acquires two speech recognition results (speech recognition result 1 (2030) and speech recognition result 2 (2040)) from the speech data converted from the speech signal input by the user, using a basic speech recognition model and the user's speech recognition model, respectively.
The speech recognition apparatus applies the first and second speech recognition results to the first specific determination process (e.g., an appropriate speech recognition model determination technique based on a first context) to select and output the more appropriate of the first and second speech recognition results (2050).
That is, the voice recognition apparatus may select a voice recognition result more suitable for the purpose of voice recognition from the first voice recognition result and the second voice recognition result through the first specific determination process, and may output the selected voice recognition result.
For example, when the context information extracted from the first and second speech recognition results is related to an address search, the speech recognition model more suitable for address search is selected from the first and second speech recognition models, and the speech recognition result of the selected speech recognition model is provided as the speech recognition service.
Hereinafter, a specific determination process will be described with reference to fig. 3.
Fig. 3 is a flowchart illustrating an example of the first specific determination process, which determines an appropriate speech recognition model based on context from the first speech recognition result and the second speech recognition result.
As shown in fig. 3, when the first speech recognition result 3010 and the second speech recognition result 3020 are generated from the first speech recognition model and the second speech recognition model, respectively, the first specific determination process selects the speech recognition model more suitable for the purpose of speech recognition based on the context 3032 extracted from the first speech recognition result 3010 and the second speech recognition result 3020 (3034).
Then, the speech recognition apparatus selects (3036) and outputs (3040) a speech recognition result generated from the selected speech recognition model.
For example, in fig. 3, given the first speech recognition result "tell me the address of Li Jun-gi" and the second speech recognition result "tell me the address of Li Ji-dong", the speech recognition apparatus determines "tell me the address" as the context information.
Specifically, the speech recognition apparatus may extract the context information "tell me the address" from "tell me the address of Li Jun-gi" and "tell me the address of Li Ji-dong" (3032).
The speech recognition apparatus may then compare the extracted context information with the features of the first speech recognition model (first features) and the features of the second speech recognition model (second features), and may select the first speech recognition model as the model more suitable for the purpose of speech recognition (3034).
The speech recognition apparatus may then select the first speech recognition result of the selected first speech recognition model (3036), and may output the selected first speech recognition result, "tell me the address of Li Jun-gi" (3040).
In this case, in addition to a part of the sentence recognized from the speech data, any information that can be inferred from the recognition result can be used as the context information.
For example, information related to the user, such as at least one of the user's location, the weather where the user is located, the user's habits, the user's previous utterance context, the user's occupation, the user's duties, the user's financial status, the current time, and the user's language, may be used as the context information.
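Pulling the pieces of figs. 2 and 3 together, a minimal sketch of the first specific determination process might look as follows; the `RecognitionModel` type, the keyword-overlap scoring, and the feature sets are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass

@dataclass
class RecognitionModel:
    name: str
    features: set  # preset context keywords, e.g. {"address", "location"}

def select_result(results: list, models: list, context: set) -> str:
    """Return the result of the model whose preset features best match the context."""
    best = max(range(len(models)),
               key=lambda i: len(context & models[i].features))
    return results[best]

# Illustrative use, mirroring the address-search example above
models = [RecognitionModel("address", {"address", "location"}),
          RecognitionModel("general", {"weather", "music"})]
results = ["tell me the address of Li Jun-gi",
           "tell me the address of Li Ji-dong"]
print(select_result(results, models, context={"address"}))
# -> "tell me the address of Li Jun-gi"
```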
Fig. 4 and 5 are diagrams showing another example of a voice recognition apparatus according to an embodiment of the present invention.
Referring to fig. 4 and 5, the voice recognition apparatus recognizes voice data obtained from a user using a plurality of voice recognition models, and selects one of the plurality of voice recognition models based on context information to provide a voice recognition service.
Specifically, the speech recognition apparatus may generate the first speech recognition result (4020) by recognizing the speech information input from the user using the first speech recognition model (4010).
In this case, the first speech recognition model is a speech recognition model for extracting context from the speech information obtained from the user, and may be configured to use only small resources in accordance with that purpose.
The speech recognition apparatus may select a specific speech recognition model (4030) most suitable for recognizing speech information input from the user from a plurality of preset speech recognition models using a second specific determination process (e.g., an appropriate speech recognition model determination technique of a second context).
That is, the speech recognition apparatus can select a specific speech recognition model from the plurality of speech recognition models based on the first speech recognition result according to the purpose and use of the speech recognition.
For example, when the context information extracted from the first speech recognition result is related to an address search, a speech recognition model most suitable for the address search may be selected as the specific speech recognition model among the plurality of candidate speech recognition models.
In this case, the second specific determination process includes a process of extracting context information from the first speech recognition result to select a specific speech recognition model, and selecting the specific speech recognition model using the extracted context information.
The speech recognition apparatus may then re-recognize the speech data converted from the speech information input by the user using the selected specific speech recognition model, and finally generate a speech recognition result (4040).
Hereinafter, the second specific determination process will be described with reference to fig. 5.
Fig. 5 is a flowchart illustrating an example of the second specific determination process for determining an appropriate speech recognition model based on the context of the first speech recognition result.
Specifically, the voice recognition apparatus generates (or receives input of) a first voice recognition result 5010 that recognizes voice information of the user by the first voice recognition model described in fig. 4, and selects a specific voice recognition model that is most suitable for the purpose of voice recognition from a plurality of (e.g., N) voice recognition models by a second specific determination process based on the generated first voice recognition result (5020).
In this case, the second specific determination process includes a process of extracting context information from the first speech recognition result to select a specific speech recognition model, and selecting the specific speech recognition model using the extracted context information.
For example, from the first speech recognition result "tell me the address of Li Ji-dong" recognized by the first speech recognition model, "tell me the address" may be extracted as the context information.
In this case, the first speech recognition model is, as described above, a speech recognition model for extracting context from the speech information obtained from the user, and is configured to use only a small amount of resources according to that purpose.
As for the context information, in addition to a part of the sentence recognized by the speech recognition model, any information inferred from the recognition result may be used as the context information.
For example, information related to the user, such as at least one of a location of the user, a weather in which the user is located, a habit of the user, a previous utterance context of the user, an occupation of the user, a responsibility of the user, a financial status of the user, a current time, and a language of the user, etc., may be used as the context information.
Then, the speech recognition apparatus can select, using the extracted context information "tell me the address", the specific speech recognition model most suitable for the purpose of speech recognition from the plurality of speech recognition models.
By this method, the speech recognition apparatus can extract the context information through the speech recognition model for obtaining the context information, thereby selecting a specific speech recognition model most suitable for the purpose of speech recognition.
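The two-pass flow of figs. 4 and 5 can be summarized in a short sketch; the `recognize` interface, the keyword set, and the stub model are assumptions made for illustration, not APIs defined by the patent.

```python
CONTEXT_KEYWORDS = {"address", "weather", "music"}  # illustrative cues

def extract_context(text: str) -> set:
    # Keep only words that are known context cues
    return {w for w in text.lower().split() if w in CONTEXT_KEYWORDS}

def two_stage_recognize(voice_data, light_model, candidate_models):
    """First pass extracts context; second pass uses the best-matching model."""
    first_result = light_model.recognize(voice_data)         # rough transcript
    context = extract_context(first_result)                  # e.g. {"address"}
    specific = max(candidate_models,
                   key=lambda m: len(context & m.features))  # best-matching model
    return specific.recognize(voice_data)                    # second, final pass

class StubModel:
    """Stand-in recognizer for the sketch, not an API from the patent."""
    def __init__(self, features, transcript):
        self.features, self.transcript = features, transcript
    def recognize(self, voice_data):
        return self.transcript
```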
Fig. 6 is a diagram showing another example of a voice recognition apparatus according to an embodiment of the present invention.
Referring to fig. 6, the voice recognition apparatus may select a specific voice recognition model from a plurality of voice recognition models in advance by setting context information for voice recognition, and provide a voice recognition service using a voice recognition result recognized by the selected voice recognition model.
Specifically, the speech recognition apparatus may select a specific speech recognition model determined to be most suitable for speech recognition from among a plurality of speech recognition models according to preset context information (6010).
For example, when the purpose and use of the voice recognition service is address search, the voice recognition apparatus may select a voice recognition model preset for address search among a plurality of voice recognition models as a specific voice recognition model.
Then, the voice recognition apparatus generates a voice recognition result by recognizing the voice data acquired from the user through the selected specific voice recognition model (6020).
In this case, the voice data refers to data obtained by converting the voice information acquired from the user into an electric signal and then converting that analog signal into a digital signal.
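Because the context is fixed before any speech arrives, the selection of fig. 6 reduces to a lookup; the registry keys and model names below are assumptions made for illustration.

```python
# Preset context -> speech recognition model (names are hypothetical)
MODEL_REGISTRY = {
    "address_search": "asr-address-v1",
    "media_control": "asr-media-v1",
    "general": "asr-general-v1",
}

def select_model_for_service(preset_context: str) -> str:
    # Fall back to the general model when no specialized model is registered
    return MODEL_REGISTRY.get(preset_context, MODEL_REGISTRY["general"])

model_name = select_model_for_service("address_search")  # "asr-address-v1"
```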
Fig. 7 is a flowchart illustrating an example of a voice recognition method according to an embodiment of the present invention.
Referring to fig. 7, as shown in fig. 2 and 3, the voice recognition apparatus generates voice recognition results using a plurality of voice recognition models and selects the most suitable voice recognition result from among the generated voice recognition results to provide a voice recognition service.
Specifically, the voice recognition apparatus may acquire voice information from the user and convert the acquired voice information into voice data (S7010).
For example, the voice recognition apparatus may convert the voice information acquired from the user into an electric signal, and convert that analog signal into voice data in digital form.
Then, the speech recognition apparatus may recognize the speech data using the first speech recognition model and the second speech recognition model, respectively, to generate a first speech recognition result and a second speech recognition result (S7020 and S7030).
Then, the speech recognition apparatus selects a speech recognition result more suitable for the purpose of speech recognition from the first speech recognition result and the second speech recognition result through the first specific determination process described in fig. 2 and 3, thereby providing the speech recognition service (S7040).
For example, the speech recognition device extracts context information from the first speech recognition result and the second speech recognition result, and compares the extracted context information with a first feature of the first speech recognition model and a second feature of the second speech recognition model, which are preset, respectively.
Then, the speech recognition apparatus may select a speech recognition model suitable for the purpose and/or use of the speech recognition from the first speech recognition model and the second speech recognition model based on the comparison result.
For example, if the second speech recognition model is selected, the speech recognition apparatus may take the second speech recognition result generated by the second speech recognition model as the final speech recognition result, and provide the speech recognition service based on the selected second speech recognition result.
Fig. 8 is a flowchart illustrating another example of a voice recognition method according to an embodiment of the present invention.
Referring to fig. 8, the speech recognition apparatus may extract context information from speech data and provide a speech recognition service based on the extracted context information.
First, step S8010 is the same as step S7010 of fig. 7, and thus description thereof will be omitted.
Then, the voice recognition apparatus generates a first voice recognition result by recognizing voice data using the first voice recognition model (S8020).
At this time, the first speech recognition model is a speech recognition model for extracting context from the speech information obtained from the user, as shown in fig. 4, and is configured to use only a small amount of resources according to its purpose.
The voice recognition apparatus may extract context information from the first voice recognition result (S8030).
The context information refers to all information that can be inferred from a recognition result or the like, in addition to a part of a sentence recognized by a speech recognition model.
For example, information related to the user, such as at least one of a location of the user, a weather in which the user is located, a habit of the user, a context of previous speech of the user, an occupation of the user, a duty of the user, a financial status of the user, a current time, and a language of the user, etc., may be used as the context information.
Then, the voice recognition apparatus may select a specific voice recognition model most suitable for recognizing the voice information input from the user from among a plurality of preset voice recognition models using the second specific determination process described in fig. 4 and 5 (S8040).
That is, the speech recognition apparatus can select a specific speech recognition model from the plurality of speech recognition models based on the first speech recognition result according to the purpose and use of the speech recognition.
For example, when the context information extracted from the first speech recognition result is related to an address search, the speech recognition model most suitable for the address search may be selected as the specific speech recognition model from among the plurality of candidate speech recognition models.
In this case, the second specific determination process includes a process of extracting context information from the first speech recognition result to select a specific speech recognition model, and selecting the specific speech recognition model using the extracted context information.
Then, the voice recognition apparatus re-recognizes the voice data converted from the voice information input by the user using the selected specific voice recognition model to finally generate a voice recognition result (S8050).
Then, the voice recognition apparatus may provide a voice recognition service based on a voice recognition result of recognizing voice data through a specific voice recognition model.
Fig. 9 is a flowchart illustrating another example of a voice recognition method according to an embodiment of the present invention.
Referring to fig. 9, the voice recognition apparatus may select a specific voice recognition model from a plurality of voice recognition models based on context information before receiving voice information from a user, and recognize the voice information input from the user through the selected voice recognition model.
Specifically, the speech recognition apparatus may preset context information for speech recognition.
The context information refers to all information that can be inferred from a recognition result or the like, in addition to a part of a sentence recognized by the speech recognition model.
For example, information related to the user, such as at least one of a location of the user, a weather in which the user is located, a habit of the user, a previous utterance context of the user, an occupation of the user, a responsibility of the user, a financial status of the user, a current time, and a language of the user, etc., may be used as the context information.
Then, the speech recognition apparatus selects a specific speech recognition model according to the purpose/use of speech recognition from among the plurality of speech recognition models based on the context information (S9020).
For example, in the case of an address search, the speech recognition apparatus may select a speech recognition model preset for the address search as a specific speech recognition model among a plurality of speech recognition models.
Then, when voice information is obtained from the user, the voice recognition apparatus may convert the obtained voice information into voice data (S9010).
For example, the voice recognition apparatus may convert the voice information acquired from the user into an electric signal, and convert that analog signal into voice data in digital form.
Then, the voice recognition apparatus may generate a voice recognition result by recognizing voice data using the selected specific voice recognition model (S9050).
Then, the voice recognition apparatus may provide a voice recognition service based on a voice recognition result of recognizing voice data through a specific voice recognition model.
Embodiments in accordance with the present invention are implemented by various means, for example, by hardware, firmware, software, or a combination thereof. In the case of implementation by hardware, an embodiment of the present invention may be implemented by one or more ASICs (application specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field programmable gate arrays), processors, control units, micro control units, microprocessors, and the like.
In the case of implementation through firmware or software, the embodiments of the present invention are implemented in the form of modules, procedures, functions, and the like, which perform the functions or operations described above. The software codes may be stored in a memory and driven by a processor. The memory is located inside or outside the processor and may exchange data with the processor in various known ways.
It will be apparent to those skilled in the art that the present invention may be embodied in other specific forms without departing from the essential characteristics thereof. The foregoing detailed description is, therefore, not to be taken in a limiting sense, and is to be considered in all respects illustrative. The scope of the invention should be determined by reasonable interpretation of the appended claims and all changes which come within the equivalent scope of the invention are intended to be embraced therein.
Industrial applicability of the invention
The present invention can be applied to various fields of speech recognition technology, and can provide a method for selecting an optimal speech recognition model based on context.
Due to these characteristics, an optimal speech recognition result can be obtained even when unspecified speech input is received, in a service that uses a plurality of speech recognition models with different strengths in each field.
These functions can be applied not only to voice recognition but also to other AI services.

Claims (11)

1. A method of recognizing speech, the method comprising:
a step of acquiring voice information from a user;
converting the acquired voice information into voice data;
a step of generating a first voice recognition result by recognizing the converted voice data using a first voice recognition model;
a step of generating a second speech recognition result by recognizing the converted speech data using a second speech recognition model; and
a step of selecting a specific voice recognition result from the first voice recognition result and the second voice recognition result by a specific determination process.
2. The method of claim 1, wherein the particular determination process comprises:
extracting context information from the first speech recognition result and the second speech recognition result;
comparing the context information with a first feature of the first voice recognition model and a second feature of the second voice recognition model which are preset respectively; and
a step of selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.
3. The method of claim 2, wherein the context information comprises at least one of a part of the speech information, information obtained from the first speech recognition result and the second speech recognition result, or information related to the user who uttered the speech.
4. The method of claim 1, wherein the first speech recognition model and the second speech recognition model are one of a plurality of speech recognition models for recognizing the speech information obtained from the user.
5. The method according to claim 1, further comprising a step of generating a plurality of speech recognition results by recognizing the converted speech data using a plurality of speech recognition models, wherein the specific speech recognition result is one selected from the first speech recognition result, the second speech recognition result, and the plurality of speech recognition results.
6. The method according to claim 1, wherein the specific determination process is a process for determining a speech recognition result based on a context included in context information.
7. A method of recognizing speech, the method comprising:
a step of acquiring voice information from a user;
converting the acquired voice information into voice data;
a step of generating a first speech recognition result by recognizing the speech data using a first speech recognition model;
a step of selecting a second speech recognition model for recognizing the speech data from a plurality of speech recognition models based on the first speech recognition result; and
a step of generating a second speech recognition result by recognizing the speech data using the second speech recognition model.
8. The method of claim 7, further comprising:
a step of extracting context information from the first speech recognition result; and
a step of comparing the context information with preset details of a plurality of the speech recognition models,
and selecting the second speech recognition model according to the comparison result.
9. The method of claim 8, wherein the first speech recognition model is a speech recognition model used to extract the context information.
10. A method of recognizing speech, the method comprising:
a step of acquiring voice information from a user;
converting the acquired voice information into voice data; and
a step of generating a speech recognition result by recognizing the speech data using a specific speech recognition model selected from a plurality of speech recognition models.
11. The method of claim 10, further comprising:
a step of setting context information for speech recognition; and
a step of selecting, from the plurality of speech recognition models, the specific speech recognition model whose features are most suitable for the context information.
CN201880099155.1A 2018-11-02 2018-11-02 Method and apparatus for providing context-based speech recognition service Pending CN113016029A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2018/013280 WO2020091123A1 (en) 2018-11-02 2018-11-02 Method and device for providing context-based voice recognition service

Publications (1)

Publication Number Publication Date
CN113016029A (en) 2021-06-22

Family

ID=70463797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880099155.1A Pending CN113016029A (en) 2018-11-02 2018-11-02 Method and apparatus for providing context-based speech recognition service

Country Status (3)

Country Link
KR (1) KR20210052563A (en)
CN (1) CN113016029A (en)
WO (1) WO2020091123A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11721324B2 (en) 2021-06-09 2023-08-08 International Business Machines Corporation Providing high quality speech recognition
CN113889097A (en) * 2021-09-30 2022-01-04 上海喜马拉雅科技有限公司 Speech recognition method and related device
KR20240098282A (en) 2022-12-20 2024-06-28 서강대학교산학협력단 System for correcting errors of a speech recognition system and method thereof
KR102768074B1 (en) * 2024-04-25 2025-02-18 주식회사 리턴제로 Electronic device for generating a data set for language model training and operation method thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101415534B1 (en) * 2007-02-23 2014-07-07 삼성전자주식회사 Multi-stage speech recognition apparatus and method
KR101317339B1 (en) * 2009-12-18 2013-10-11 한국전자통신연구원 Apparatus and method using Two phase utterance verification architecture for computation speed improvement of N-best recognition word
KR101971513B1 (en) * 2012-07-05 2019-04-23 삼성전자주식회사 Electronic apparatus and Method for modifying voice recognition errors thereof
KR20150054445A (en) * 2013-11-12 2015-05-20 한국전자통신연구원 Sound recognition device

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182628A1 (en) * 2004-02-18 2005-08-18 Samsung Electronics Co., Ltd. Domain-based dialog speech recognition method and apparatus
CN101034390A (en) * 2006-03-10 2007-09-12 日电(中国)有限公司 Apparatus and method for verbal model switching and self-adapting
US20110153324A1 (en) * 2009-12-23 2011-06-23 Google Inc. Language Model Selection for Speech-to-Text Conversion
CN102693725A (en) * 2011-03-25 2012-09-26 通用汽车有限责任公司 Speech recognition dependent on text message content
US9502029B1 (en) * 2012-06-25 2016-11-22 Amazon Technologies, Inc. Context-aware speech processing
US20160104482A1 (en) * 2014-10-08 2016-04-14 Google Inc. Dynamically biasing language models
CN105244027A (en) * 2015-08-31 2016-01-13 百度在线网络技术(北京)有限公司 Method of generating homophonic text and system thereof
CN105654954A (en) * 2016-04-06 2016-06-08 普强信息技术(北京)有限公司 Cloud voice recognition system and method
CN108242235A (en) * 2016-12-23 2018-07-03 三星电子株式会社 Electronic device and speech recognition method thereof
WO2018134916A1 (en) * 2017-01-18 2018-07-26 三菱電機株式会社 Speech recognition device
CN107945792A (en) * 2017-11-06 2018-04-20 百度在线网络技术(北京)有限公司 Method of speech processing and device

Also Published As

Publication number Publication date
WO2020091123A1 (en) 2020-05-07
KR20210052563A (en) 2021-05-10

Similar Documents

Publication Publication Date Title
US10847137B1 (en) Trigger word detection using neural network waveform processing
US6125345A (en) Method and apparatus for discriminative utterance verification using multiple confidence measures
US20250201267A1 (en) Method and apparatus for emotion recognition in real-time based on multimodal
EP2048655B1 (en) Context sensitive multi-stage speech recognition
US10650802B2 (en) Voice recognition method, recording medium, voice recognition device, and robot
US6122615A (en) Speech recognizer using speaker categorization for automatic reevaluation of previously-recognized speech data
CN106875936B (en) Voice recognition method and device
CN113016029A (en) Method and apparatus for providing context-based speech recognition service
KR20010102549A (en) Speaker recognition
US20210398521A1 (en) Method and device for providing voice recognition service
Nasereddin et al. Classification techniques for automatic speech recognition (ASR) algorithms used with real time speech translation
CN112651247A (en) Dialogue system, dialogue processing method, translation device, and translation method
KR100930587B1 (en) Confusion Matrix-based Speech Verification Method and Apparatus
JP4340685B2 (en) Speech recognition apparatus and speech recognition method
EP1734509A1 (en) Method and system for speech recognition
CN111640423B (en) Word boundary estimation method and device and electronic equipment
US20220005462A1 (en) Method and device for generating optimal language model using big data
EP3496092B1 (en) Voice processing apparatus, voice processing method and program
JP2021529978A (en) Artificial intelligence service method and equipment for it
Al-Haddad et al. Decision fusion for isolated Malay digit recognition using dynamic time warping (DTW) and hidden Markov model (HMM)
JP2021529338A (en) Pronunciation dictionary generation method and device for that
EP2948943B1 (en) False alarm reduction in speech recognition systems using contextual information
KR101037801B1 (en) Keyword detection method using subunit recognition
JPH0997095A (en) Voice recognition device
KR100677224B1 (en) Speech Recognition Using Anti-Word Model

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 20210622)