JP5375423B2

JP5375423B2 - Speech recognition system, speech recognition method, and speech recognition program

Info

Publication number: JP5375423B2
Application number: JP2009185520A
Authority: JP
Inventors: 亮輔磯谷; 透岩沢; 誠也長田; 健花沢; 剛範辻川; 史博安達; 隆行荒川; 浩司岡部
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-08-10
Filing date: 2009-08-10
Publication date: 2013-12-25
Anticipated expiration: 2029-08-10
Also published as: JP2011039222A

Description

The present invention relates to a speech recognition system, a speech recognition method, and a speech recognition program, and more particularly to a speech recognition system, method, and program in which the user indicates the utterance timing with a button or the like.

In this type of speech recognition system, it is well known to prevent erroneous operation caused by noise or unintended utterances.

For example, the speech processing apparatus described in Patent Literature 1 first accepts input of a designated section, which the operator designates as the section of the input speech to be processed, and detects an utterance section from the input speech. Next, the apparatus determines, based on the input speech, whether the speaker of the utterance is the operator or a person other than the operator. The apparatus then detects the portion where the designated section and the utterance section overlap. When an overlapping portion is detected and the speaker is determined to be a person other than the operator, it determines the utterance section containing the overlapping portion as the section to be processed.

As a result, the speech processing apparatus described in Patent Literature 1 can appropriately determine the speech section to be processed according to the speaker, and can reduce the occurrence of erroneous operations. The apparatus can also detect erroneous operation by the operator, such as designating the start of the designated section later than the actual start of the utterance or forgetting to press the end instruction button.

Patent Literature 1: Japanese Unexamined Patent Application Publication No. 2007-264473 (JP 2007-264473 A)

However, because the technique described in Patent Literature 1 detects utterance sections from all of the input speech, utterance section detection must run at all times, and the processing load for detecting utterance sections is large. This load can affect the processing that detects erroneous operation by the user and lower the accuracy of that detection.

In view of the above, an object of the present invention is to provide a speech recognition system that reduces the processing load of utterance section detection and can accurately detect erroneous operation by the user.

To achieve the above object, the speech recognition system of the present invention comprises: utterance timing instruction acquisition means for acquiring utterance timing instructions from a user, including an utterance start instruction; speech signal holding means for holding the input speech signal and, when an utterance start instruction is acquired by the utterance timing instruction acquisition means, outputting the held speech signal and the speech signal input thereafter; utterance section detection means for detecting an utterance section from the speech signal output by the speech signal holding means; and erroneous operation detection means for detecting erroneous operation by the user based on the utterance section detected by the utterance section detection means and the utterance timing instructions acquired by the utterance timing instruction acquisition means.

The speech recognition method of the present invention acquires utterance timing instructions from a user, including an utterance start instruction; holds the input speech signal; when the utterance start instruction is acquired, outputs the held speech signal and the speech signal input thereafter; detects an utterance section from the output speech signal; and detects erroneous operation by the user based on the utterance section and the utterance timing instructions.

Furthermore, the speech recognition program of the present invention causes a computer to execute: an utterance timing instruction acquisition step of acquiring utterance timing instructions from a user, including an utterance start instruction; a speech signal holding step of holding the input speech signal and, when an utterance start instruction is acquired in the utterance timing instruction acquisition step, outputting the held speech signal and the speech signal input thereafter; an utterance section detection step of detecting an utterance section from the speech signal output in the speech signal holding step; and an erroneous operation detection step of detecting erroneous operation by the user based on the utterance section detected in the utterance section detection step and the utterance timing instructions acquired in the utterance timing instruction acquisition step.

According to the present invention, the processing load of utterance section detection can be reduced and erroneous operation by the user can be detected accurately.

Hardware configuration diagram of the speech recognition system 1 according to the first embodiment of the present invention.

Block diagram showing the functional configuration of the speech recognition system 1 according to the first embodiment of the present invention.

Example of the method by which the erroneous operation detection means 108 determines the presence and type of erroneous operation.

Flowchart showing the operation of the speech recognition system 1.

Block diagram showing the functional configuration of the speech recognition system 2 according to the second embodiment of the present invention.

Block diagram showing the functional configuration of the speech recognition system 3 according to the third embodiment of the present invention.

Block diagram showing the functional configuration of the speech recognition system 4 according to the fourth embodiment of the present invention.

<First Embodiment>
A first embodiment of the speech recognition system according to the present invention will be described.

FIG. 1 is a hardware configuration diagram of the speech recognition system 1 according to the first embodiment of the present invention.

As shown in FIG. 1, the speech recognition system 1 includes a CPU 10, a memory 12, an HDD (hard disk drive) 14, a communication IF (interface) 16 that communicates data over a network (not shown), an output device 18 such as a display, an input device 20 including a keyboard and a pointing device such as a mouse, and a speech input device 22 such as a microphone that receives speech and outputs a speech signal. These components are connected to one another through a bus 24 and exchange data with one another.

FIG. 2 is a block diagram showing the functional configuration of the speech recognition system 1 according to the first embodiment of the present invention.

As shown in FIG. 2, the speech recognition system 1 includes speech input means 100, utterance timing instruction acquisition means 102, speech signal holding means 104, utterance section detection means 106, erroneous operation detection means 108, speech recognition means 110, a speech recognition dictionary 112, an acoustic model 114, and erroneous operation notification means 116. The functions of the speech recognition system 1 are realized by loading a program into the memory 12 (FIG. 1) and executing it on the CPU 10. All or part of the functions of the speech recognition system 1 may instead be realized in hardware.

In the speech recognition system 1, the speech input means 100 receives the speech signal output from the speech input device 22 (FIG. 1), performs processing such as AD conversion or decoding of an encoded signal as needed, and outputs a digital signal of the speech waveform.

The utterance timing instruction acquisition means 102 acquires, through the input device 20 (FIG. 1), utterance timing instructions from the user, including an utterance start instruction. The utterance timing includes at least the utterance start timing and may also include the utterance end timing. The user gives these instructions by, for example, pressing a button only before starting to speak, holding a button down while speaking and releasing it after finishing, or pressing a button once before starting and once after finishing. Depending on the operating method, the utterance timing instruction acquisition means 102 acquires either the utterance start instruction alone or both the utterance start and utterance end instructions. On acquiring an instruction, it outputs the instruction to the speech signal holding means 104, the erroneous operation detection means 108, and the speech recognition means 110, either immediately or at a fixed timing with the time information of the instruction attached.
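The press-and-hold variant described above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names (`TimingInstructionAcquirer`, `subscribe`, `on_button_state`) are hypothetical and do not appear in the patent, and the time value is assumed to come from the same clock that stamps the speech signal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Instruction(Enum):
    START = "utterance_start"
    END = "utterance_end"


@dataclass(frozen=True)
class TimingInstruction:
    kind: Instruction
    time: float  # seconds, assumed to share a clock with the speech signal


class TimingInstructionAcquirer:
    """Hypothetical stand-in for means 102: button edges become instructions."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[TimingInstruction], None]] = []
        self._pressed = False

    def subscribe(self, listener: Callable[[TimingInstruction], None]) -> None:
        # Listeners play the role of means 104, 108 and 110.
        self._listeners.append(listener)

    def on_button_state(self, pressed: bool, time: float) -> None:
        if pressed == self._pressed:
            return  # no press/release edge, so no new instruction
        self._pressed = pressed
        kind = Instruction.START if pressed else Instruction.END
        instruction = TimingInstruction(kind, time)
        for listener in self._listeners:
            listener(instruction)  # forwarded immediately, with its time
```

A press is reported as an utterance start instruction and a release as an utterance end instruction; repeated samples of an unchanged button state are ignored.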

When the utterance timing instruction acquisition means 102 acquires only the utterance start instruction, the utterance end time detected by the utterance section detection means 106, described later, may be used in place of the utterance end timing. Alternatively, the utterance end time detected by the speech recognition means 110, described later, may be used.

The speech signal holding means 104 has a buffer that holds a predetermined duration of the speech signal. It stores the speech signal input from the speech input means 100 in the buffer. When the amount of speech signal exceeds the buffer capacity, the speech signal holding means 104 may discard the oldest data first so that it holds the most recent predetermined duration of the speech signal. The buffer capacity is desirably large enough to store the speech signal from the actual start of the utterance up to the utterance start instruction when that instruction is delayed. When an utterance start instruction is input from the utterance timing instruction acquisition means 102, the speech signal holding means 104 outputs the speech signal stored in the buffer at that point, together with its time information, to the utterance section detection means 106 and the speech recognition means 110. It likewise outputs the speech signal input from the speech input means 100 after the utterance start instruction, together with its time information, to the utterance section detection means 106 and the speech recognition means 110. When an utterance end instruction is input from the utterance timing instruction acquisition means 102, the speech signal holding means 104 stops outputting the speech signal and stores the speech signal input thereafter in the buffer.
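The holding behavior above can be illustrated with a bounded queue. The sketch below is one possible reading, not the patented implementation: `SpeechSignalHolder`, `max_frames`, and the `sink` callback (standing in for means 106 and 110) are hypothetical names, and the speech signal is assumed to arrive as timestamped frames.

```python
from collections import deque
from typing import Callable, Deque, Tuple

Frame = Tuple[float, bytes]  # (timestamp in seconds, audio payload)


class SpeechSignalHolder:
    """Hypothetical stand-in for means 104."""

    def __init__(self, max_frames: int, sink: Callable[[Frame], None]) -> None:
        # A deque with maxlen drops the oldest frame once it is full,
        # so only the most recent max_frames frames are held.
        self._buffer: Deque[Frame] = deque(maxlen=max_frames)
        self._sink = sink
        self._streaming = False

    def push(self, frame: Frame) -> None:
        if self._streaming:
            self._sink(frame)           # after the start instruction
        else:
            self._buffer.append(frame)  # before it, or after the end

    def on_start(self) -> None:
        # Output what was held at the moment of the start instruction,
        # oldest first, then pass later frames straight through.
        while self._buffer:
            self._sink(self._buffer.popleft())
        self._streaming = True

    def on_end(self) -> None:
        # Stop output; frames that follow are buffered again.
        self._streaming = False
```

Because nothing reaches `sink` before `on_start`, any downstream detector stays idle until the user gives the start instruction, yet it still receives the frames held just before that instruction.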

The utterance section detection means 106 detects the user's utterance section from the speech signal output by the speech signal holding means 104. It detects the utterance section using, for example, a method based on the power and zero crossings of the speech signal. The utterance section detection means 106 outputs the start time and end time of the detected utterance section.
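The patent names power and zero crossings as example cues but does not say how they are combined. The sketch below assumes one common arrangement: a frame counts as speech when its power is high, or when its power is moderate and its zero-crossing count is high. The thresholds, the function names, and the use of fixed-length, non-empty frames of float samples are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one non-empty frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossings(frame: Sequence[float]) -> int:
    """Number of sign changes between neighboring samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def detect_utterance(
    frames: Sequence[Sequence[float]],
    frame_sec: float,
    t0: float = 0.0,
    power_thr: float = 0.01,  # assumed threshold
    zc_thr: int = 10,         # assumed threshold
) -> Optional[Tuple[float, float]]:
    """Return (start_time, end_time) spanning the first to last speech frame."""
    speech = []
    for i, frame in enumerate(frames):
        p = frame_power(frame)
        loud = p >= power_thr
        hissy = p >= power_thr / 10 and zero_crossings(frame) >= zc_thr
        if loud or hissy:
            speech.append(i)
    if not speech:
        return None  # no utterance in the signal handed over
    return (t0 + speech[0] * frame_sec, t0 + (speech[-1] + 1) * frame_sec)
```

Here `t0` is the timestamp of the first frame received, so the returned times are on the same clock as the timing instructions.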

The erroneous operation detection means 108 detects erroneous operation in the user's utterance timing instructions based on the utterance section detected by the utterance section detection means 106 and the utterance timing instructions acquired by the utterance timing instruction acquisition means 102. Specifically, it compares the start and end times of the utterance section input from the utterance section detection means 106 with the presence or absence and the time information of the utterance timing instructions input from the utterance timing instruction acquisition means 102.

The method by which the erroneous operation detection means 108 determines the presence and type of erroneous operation is described later.

The speech recognition means 110 performs speech recognition on at least a partial section of the speech signal input from the speech signal holding means 104. It performs speech recognition using the speech recognition dictionary 112, the acoustic model 114, and the like, for example by applying a technique based on hidden Markov models. The speech recognition means 110 outputs text or a command as the recognition result.

The speech recognition dictionary 112 stores the set of words to be recognized and the reading of each word.

The acoustic model 114 stores an acoustic model that models the acoustic patterns corresponding to the readings.

The speech recognition means 110 may determine the section of the speech signal to be recognized based on the time information of the utterance timing instructions input from the utterance timing instruction acquisition means 102. For example, when both utterance start and utterance end instructions are input, the speech recognition means 110 may limit the section to be recognized to the period from the time of the utterance start instruction to the time of the utterance end instruction. Alternatively, the speech recognition means 110 may have an internal buffer that holds the speech signal and, adding a fixed margin to each of the start and end times, limit the section to the period from a fixed time before the utterance start instruction to a fixed time after the utterance end instruction. The speech recognition means 110 may also receive information on the presence or absence of erroneous operation from the erroneous operation detection means 108 and, when an erroneous operation has occurred, stop the speech recognition processing and the output of the recognition result for that section.
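The margin-based limiting can be written in a few lines. This is a sketch with hypothetical names (`recognition_section`, `select_frames`) and margin parameters chosen for illustration. Setting both margins to zero gives the first variant, in which the section runs exactly from the start instruction to the end instruction.

```python
from typing import List, Tuple

Frame = Tuple[float, bytes]  # (timestamp in seconds, audio payload)


def recognition_section(
    start_instruction: float,
    end_instruction: float,
    margin_before: float = 0.0,
    margin_after: float = 0.0,
) -> Tuple[float, float]:
    """Section to recognize, widened by a fixed margin on each side."""
    if end_instruction < start_instruction:
        raise ValueError("end instruction precedes start instruction")
    return (start_instruction - margin_before, end_instruction + margin_after)


def select_frames(frames: List[Frame], section: Tuple[float, float]) -> List[Frame]:
    """Keep only the frames whose timestamp falls inside the section."""
    begin, end = section
    return [frame for frame in frames if begin <= frame[0] <= end]
```

A nonzero `margin_before` only helps if frames from before the start instruction are still available, which is why the text gives the recognizer its own buffer in that variant.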

When the erroneous operation detection means 108 detects an erroneous operation, the erroneous operation notification means 116 notifies the user of a message corresponding to the type of erroneous operation, by screen display, voice, or the like.

Instead of this configuration, the erroneous operation notification means 116 may be included in the speech recognition means 110, so that the speech recognition means 110 outputs either a message corresponding to the type of erroneous operation or the recognition result, depending on whether an erroneous operation occurred.

Next, the method by which the erroneous operation detection means 108 determines the presence and type of erroneous operation will be described.

FIG. 3 illustrates an example of the method by which the erroneous operation detection means 108 determines the presence and type of erroneous operation.

In FIG. 3, "utterance start" is the start time of the utterance section detected by the utterance section detection means 106, and "utterance end" is the end time of that section. "Utterance start instruction" is the time of the user's utterance start instruction acquired by the utterance timing instruction acquisition means 102, and "utterance end instruction" is the time of the user's utterance end instruction acquired by the same means.

The erroneous operation detection means 108 compares the time of the utterance start with the time of the utterance start instruction, and the time of the utterance end with the time of the utterance end instruction. It then checks in order whether the comparison result matches each condition shown in FIG. 3 and, when a condition matches, determines the presence and type of erroneous operation from the corresponding determination result. When at least one of conditions A to C in FIG. 3 matches, the erroneous operation detection means 108 determines that an erroneous operation of the type shown in the determination result column has occurred. When condition D matches, it determines that there was no erroneous operation.

Specifically, when the comparison shows that the utterance start instruction came after the utterance start, the erroneous operation detection means 108 determines that the utterance start instruction was late. When the utterance end instruction came before the utterance end, it determines that the utterance end instruction was early. When no utterance end instruction came within a fixed time after the utterance end, it determines that the user forgot to give the utterance end instruction. When the utterance start instruction came before the utterance start and the utterance end instruction came within a fixed time after the utterance end, it determines that there was no erroneous operation.
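The four determinations above map naturally onto an ordered chain of checks. The sketch below is an illustration only: the names are hypothetical, the `grace` parameter stands for the unspecified "fixed time", an end instruction is passed as `None` when none has been received, and an instruction that coincides exactly with the detected start or end is treated as acceptable, a boundary case the text does not address.

```python
from enum import Enum
from typing import Optional


class Misoperation(Enum):
    START_LATE = "utterance start instruction is late"
    END_EARLY = "utterance end instruction is early"
    END_FORGOTTEN = "utterance end instruction was forgotten"
    NONE = "no erroneous operation"


def classify(
    utterance_start: float,
    utterance_end: float,
    start_instruction: float,
    end_instruction: Optional[float],
    grace: float = 1.0,  # assumed value for the "fixed time"
) -> Misoperation:
    """Check the conditions in order and return the first that matches."""
    if start_instruction > utterance_start:
        return Misoperation.START_LATE
    if end_instruction is not None and end_instruction < utterance_end:
        return Misoperation.END_EARLY
    if end_instruction is None or end_instruction > utterance_end + grace:
        return Misoperation.END_FORGOTTEN
    return Misoperation.NONE
```

Because the checks run in order, an utterance that has both a late start instruction and an early end instruction is reported as the former.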

When an erroneous operation has occurred, the erroneous operation notification means 116 notifies the user of a message corresponding to the type of erroneous operation shown in FIG. 3. For example, when the determination is "utterance start instruction is late", the erroneous operation notification means 116 notifies the user of a message prompting the user to give the utterance start instruction before speaking.

Next, the operation of the speech recognition system 1 will be described.

FIG. 4 is a flowchart showing the operation of the speech recognition system 1.

As shown in FIG. 4, in step 10 (S10), the speech input means 100 performs processing such as decoding on the input speech signal and outputs a speech signal. Specifically, the speech input means 100 receives a speech signal from the microphone, performs AD conversion, and outputs a digital signal of the speech waveform.

In step 12 (S12), the speech signal holding means 104 stores the speech signal input from the speech input means 100 in the buffer.

In step 14 (S14), the utterance timing instruction acquisition means 102 determines whether an utterance timing instruction from the user has been received. If so, it outputs the instruction to the speech signal holding means 104, the erroneous operation detection means 108, and the speech recognition means 110, and processing proceeds to S16; otherwise, processing returns to S12. For example, the utterance timing instruction acquisition means 102 monitors the state of the button, treating a button press as an utterance start timing instruction and a button release as an utterance end timing instruction. It outputs each detected instruction to the speech signal holding means 104, the erroneous operation detection means 108, and the speech recognition means 110.

In step 16 (S16), when the utterance start timing instruction is input from the utterance timing instruction acquisition means 102, the speech signal holding means 104 outputs the speech signal stored in the buffer at that point, together with its time information, to the utterance section detection means 106 and the speech recognition means 110.

In step 18 (S18), the speech signal holding means 104 outputs the speech signal input from the speech input means 100 after the utterance start timing instruction, together with its time information, to the utterance section detection means 106 and the speech recognition means 110.

In step 20 (S20), the utterance section detection means 106 detects the utterance section from the speech signal output by the speech signal holding means 104 and outputs its time information to the erroneous operation detection means 108. Specifically, the utterance section detection means 106 processes the speech signal output by the speech signal holding means 104 sequentially and detects the utterance start and utterance end using the computed power and similar information.

In step 22 (S22), the erroneous operation detection means 108 detects erroneous operation in the user's utterance timing instructions based on the utterance section detected by the utterance section detection means 106 and the utterance timing instructions acquired by the utterance timing instruction acquisition means 102. For example, the erroneous operation detection means 108 compares the utterance start and end times input from the utterance section detection means 106 with the presence or absence and the time information of the utterance timing instructions notified by the utterance timing instruction acquisition means 102. It determines the presence and type of erroneous operation by the user according to the criteria of FIG. 3.

When it is determined that there was no erroneous operation, in step 24 (S24), the speech recognition means 110 performs speech recognition on the speech signal output by the speech signal holding means 104 and outputs the recognition result.

When it is determined that an erroneous operation occurred, in step 26 (S26), the erroneous operation notification means 116 notifies the user of a message corresponding to the type of erroneous operation. For example, based on the conditions shown in FIG. 3, the erroneous operation notification means 116 outputs a message such as "Please press the button before speaking" or "Please release the button after you finish speaking".
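The notification step amounts to a lookup from the type of erroneous operation to a message. In the sketch below, the keys, the `notify` function, and the `show` callback are hypothetical; the first two messages follow the examples given in the text, while the wording for a forgotten end instruction is invented here because the text gives none.

```python
from typing import Callable, Optional

MESSAGES = {
    "start_late": "Please press the button before speaking.",
    "end_early": "Please release the button after you finish speaking.",
    # Hypothetical wording: the patent gives no message for this case.
    "end_forgotten": "Please release the button once you have finished speaking.",
}


def notify(kind: str, show: Callable[[str], None] = print) -> Optional[str]:
    """Show the message for the given type; do nothing for unknown types."""
    message = MESSAGES.get(kind)
    if message is not None:
        show(message)  # screen display here; speech output is equally possible
    return message
```

Passing a different `show` callback would route the same message to a speech synthesizer or a dialog instead of the console.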

For simplicity, the description above assumes that the speech recognition means 110 performs speech recognition when step 22 (S22) determines that there was no erroneous operation. In practice, the speech recognition means 110 may start speech recognition when the utterance start timing instruction is acquired and advance recognition by receiving the input speech signal sequentially. In that case, the speech recognition means 110 may stop speech recognition at the point when an erroneous operation is determined.
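To see how steps S12 to S24 fit together, the following sketch replays a recorded sequence of events in batch form. It is a simplification under several assumptions: times are integer milliseconds, each audio event carries a precomputed power value, a bare power threshold stands in for the utterance section detector, and all names and default values are hypothetical.

```python
from collections import deque
from typing import List, Tuple


def run_session(
    events: List[Tuple],
    buffer_frames: int = 5,
    frame_ms: int = 100,
    power_thr: float = 0.01,
    grace_ms: int = 1000,
) -> str:
    """events: chronological ("audio", time_ms, power) or ("start"/"end", time_ms)."""
    buffer = deque(maxlen=buffer_frames)  # S12: hold the latest frames
    released = []                         # frames handed to detection
    start_instr = end_instr = None
    for event in events:
        kind, time = event[0], event[1]
        if kind == "audio":
            if start_instr is not None and end_instr is None:
                released.append((time, event[2]))  # S18: pass through
            else:
                buffer.append((time, event[2]))    # S12: keep buffering
        elif kind == "start":                      # S14 -> S16: flush buffer
            start_instr = time
            released.extend(buffer)
            buffer.clear()
        elif kind == "end":
            end_instr = time
    # S20: detection runs only on what was released after the start instruction.
    speech = [t for t, power in released if power >= power_thr]
    if start_instr is None or not speech:
        return "no_utterance"
    utt_start, utt_end = speech[0], speech[-1] + frame_ms
    # S22: ordered checks; S24 is reached only when none of them matches.
    if start_instr > utt_start:
        return "start_late"
    if end_instr is not None and end_instr < utt_end:
        return "end_early"
    if end_instr is None or end_instr > utt_end + grace_ms:
        return "end_forgotten"
    return "recognize"
```

Without a start instruction nothing is released, so the detection step has no input, which mirrors the load reduction the embodiment aims for.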

As described above, the speech recognition system 1 according to this embodiment does not perform utterance section detection until it is notified of an utterance start instruction, and can therefore reduce the processing load of utterance section detection. This reduces the effect that the utterance section detection load has on the processing that detects erroneous operation, so the speech recognition system 1 can detect erroneous operation by the user accurately.

In addition, because the speech recognition system 1 limits the section to be recognized based on the time information of the utterance timing instructions, it can reduce the processing load of speech recognition compared with running recognition at all times.

Furthermore, because the speech recognition system 1 has the speech signal holding means 104 and performs utterance section detection going back a certain time before the utterance start instruction, it can detect the utterance section accurately even when the utterance start instruction lags behind the actual start of the utterance.

<Second Embodiment>
Next, a second embodiment of the speech recognition system according to the present invention will be described.

図５は、本発明の第２の実施形態にかかる音声認識システム２の機能構成を示すブロック図である。 FIG. 5 is a block diagram showing a functional configuration of the speech recognition system 2 according to the second exemplary embodiment of the present invention.

As shown in FIG. 5, the speech recognition system 2 according to the second embodiment differs from the speech recognition system 1 according to the first embodiment in that the utterance section detection means 106, instead of the voice signal holding means 104, outputs the voice signal to the speech recognition means 110. It also differs in that the utterance timing instruction acquisition means 102 does not notify the speech recognition means 110 of the utterance timing instructions.

The utterance section detection means 106 detects the user's utterance section from the voice signal input from the voice signal holding means 104 and outputs the start and end times of that section to the erroneous operation notification means 116. In doing so, the utterance section detection means 106 may add a fixed-length margin before and after the utterance section. The utterance section detection means 106 also outputs the voice signal input from the voice signal holding means 104 to the speech recognition means 110.
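Adding a fixed-length margin to a detected utterance section can be sketched as follows. This is only an illustration: the function name `add_margin`, the use of seconds as the time unit, and the clamping to the span of the available signal are assumptions that the text does not specify.

```python
def add_margin(start, end, margin, signal_start, signal_end):
    """Widen an utterance section by `margin` seconds on each side.

    The result is clamped to the available signal (an assumption),
    so the section never extends beyond what has been captured.
    """
    return max(signal_start, start - margin), min(signal_end, end + margin)


# Detected section 1.0 s to 2.5 s, margin 0.25 s, signal spans 0.0 s to 2.6 s.
section = add_margin(1.0, 2.5, 0.25, 0.0, 2.6)
```

Here the leading edge moves back by the full margin, while the trailing edge is limited by the end of the captured signal.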

The speech recognition means 110 performs speech recognition on a partial section of the voice signal input from the voice signal holding means 104. The speech recognition means 110 determines the section to be recognized based on the utterance section detected by the utterance section detection means 106.

The other operations are the same as in the first embodiment of the present invention.

As described above, the speech recognition system 2 according to the present embodiment limits the target of speech recognition based on the utterance section, which reduces the processing load. This is because, when there is no erroneous operation, the utterance section detected by the utterance section detection means 106 is part of the section from the utterance start instruction to the utterance end instruction and is therefore shorter than the section designated by the utterance timing instructions.

<Third Embodiment>
Next, a third embodiment of the speech recognition system according to the present invention will be described.

FIG. 6 is a block diagram showing the functional configuration of the speech recognition system 3 according to the third embodiment of the present invention.

As shown in FIG. 6, the speech recognition system 3 according to the third embodiment differs from the speech recognition system 2 according to the second embodiment in that the speech recognition means 110 includes the utterance section specifying means 118 and outputs the specified utterance section information to the erroneous operation detection means 108.

For simplicity, the present embodiment is described using isolated word recognition as an example, but it is equally applicable to continuous word recognition.

The speech recognition means 110 performs speech recognition on the target section using the speech recognition dictionary 112, in which the words to be recognized are stored.

Specifically, the speech recognition means 110 generates a standard pattern for each word using the acoustic model 114, based on the reading of each recognition target word stored in the speech recognition dictionary 112. For example, when phoneme HMMs (hidden Markov models) are used as the acoustic model, the speech recognition means 110 concatenates the phoneme HMMs according to the reading of the word to form the word's standard pattern, and adds a silence HMM before and after it. The silence HMM is stored in the acoustic model 114 in advance as a model representing background noise and the like. The speech recognition means 110 matches the input voice signal cut out by the utterance section detection means 106 against the standard pattern of each word and calculates the likelihood of each word. It then takes the word with the highest likelihood as the recognition result.
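The pattern construction and word selection described above can be outlined in a few lines. This is a simplified sketch: the names `build_pattern`, `recognize`, and `toy_score`, the dictionary contents, and the `"sil"` label are hypothetical, and the toy scorer merely counts matching labels. It is not an HMM likelihood computation, which in a real system would evaluate the concatenated models against acoustic features.

```python
def build_pattern(reading, silence="sil"):
    """Concatenate per-phoneme model names and pad with a silence model."""
    return [silence] + list(reading) + [silence]


def recognize(dictionary, score):
    """Return (word, score) for the best-scoring word in the dictionary.

    `score(pattern)` stands in for the likelihood of the input signal
    given the word's standard pattern; the caller supplies it.
    """
    scored = {word: score(build_pattern(reading))
              for word, reading in dictionary.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Illustrative dictionary: word -> phoneme reading.
dictionary = {"hai": ["h", "a", "i"], "iie": ["i", "i", "e"]}

# Toy input: pretend the signal was already labelled as sil-h-a-i-sil.
observed = ["sil", "h", "a", "i", "sil"]


def toy_score(pattern):
    """Count positions where the pattern agrees with the toy input."""
    return sum(p == o for p, o in zip(pattern, observed))


word, likelihood = recognize(dictionary, toy_score)
```

Keeping the scorer as a parameter separates the selection step (take the maximum over dictionary words) from the acoustic modelling, which is the part this sketch deliberately leaves out.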

The utterance section specifying means 118 specifies, within the section subject to speech recognition, the section in which the recognition target word was uttered.

Specifically, the utterance section specifying means 118 time-aligns the input voice signal with the standard pattern of the word that the speech recognition means 110 output as the recognition result. It finds the section of the input voice signal that is aligned with the part of the pattern other than the silence patterns before and after the word, and outputs the start and end times of that section to the erroneous operation detection means 108.
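Once each input frame has been aligned to a model in the word's standard pattern, the utterance section follows from the first and last frames not aligned to silence. The sketch below assumes the alignment is already available as one label per frame; the function name `speech_section`, the `"sil"` label, and the frame shift are hypothetical, and the 0.5 s frame shift is chosen only to keep the arithmetic easy to follow.

```python
def speech_section(frame_labels, frame_shift, offset=0.0, silence="sil"):
    """Return (start, end) in seconds of the non-silence part of an alignment.

    `frame_labels` holds one model label per frame. Returns None when
    every frame is aligned to the silence model.
    """
    speech = [i for i, label in enumerate(frame_labels) if label != silence]
    if not speech:
        return None
    start = offset + speech[0] * frame_shift
    end = offset + (speech[-1] + 1) * frame_shift
    return start, end


# Frames 2 to 5 are aligned to the word; the rest are silence.
labels = ["sil", "sil", "h", "a", "a", "i", "sil"]
section = speech_section(labels, frame_shift=0.5)
```

Returning `None` for an all-silence alignment is one possible way to signal that no utterance section was found.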

The speech recognition means 110 also has a rejection function. Specifically, when it determines that the input voice signal matches none of the recognition target words stored in the speech recognition dictionary 112, it rejects the recognition result.

When the recognition result is rejected, the utterance section specifying means 118 outputs information indicating that there was no utterance section to the erroneous operation detection means 108.

The erroneous operation detection means 108 modifies (for example, replaces) the utterance section detected by the utterance section detection means 106 based on the section specified by the utterance section specifying means 118, and then determines the presence and type of erroneous operation in the user's utterance timing instructions. Alternatively, the erroneous operation detection means 108 may use the result of the utterance section specifying means 118 without receiving the detection result of the utterance section detection means 106.

As described above, by performing speech recognition using information on the recognition target words, the speech recognition system 3 according to the present embodiment can distinguish actual utterance sections from noise sections in detail. The speech recognition system 3 can therefore determine erroneous operations using more accurate information on the actual utterance section.

In addition, because the speech recognition means 110 has a rejection function and can cancel sections that are not utterances the user intended as speech input, the speech recognition system 3 can detect utterance sections accurately. The speech recognition system 3 can therefore detect the user's erroneous operations accurately.

<Fourth Embodiment>
Next, a fourth embodiment of the speech recognition system according to the present invention will be described.

FIG. 7 is a block diagram showing the functional configuration of the speech recognition system 4 according to the fourth embodiment of the present invention.

The utterance timing instruction acquisition means 102 outputs the user's utterance timing instructions, including an utterance start instruction, to the voice signal holding means 104 and the erroneous operation detection means 108.

The voice signal holding means 104 holds the input voice signal and, when the utterance start instruction is input from the utterance timing instruction acquisition means 102, outputs the held voice signal to the utterance section detection means 106. From the time the utterance start instruction is input, the voice signal holding means 104 also outputs the voice signal input thereafter to the utterance section detection means 106.

The utterance section detection means 106 detects an utterance section from the voice signal output by the voice signal holding means 104.

The erroneous operation detection means 108 detects the user's erroneous operation based on the utterance section detected by the utterance section detection means 106 and the utterance timing instructions acquired by the utterance timing instruction acquisition means 102.
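The timing comparison can be sketched using the conditions stated in the claims: a start instruction later than the start of the utterance section, an end instruction earlier than the end of the section, or a missing end instruction. The names `Misoperation` and `detect_misoperations` and the flag `uses_end_instruction` are hypothetical, and exact time comparison without any tolerance is an assumption made to keep the sketch short.

```python
from enum import Enum


class Misoperation(Enum):
    LATE_START = "start instruction came after the utterance began"
    EARLY_END = "end instruction came before the utterance finished"
    NO_END = "no end instruction was given"


def detect_misoperations(section, start_time, end_time=None,
                         uses_end_instruction=True):
    """Compare an utterance section (start, end) with instruction times.

    Returns the list of detected misoperation types (empty if none).
    The end-instruction checks run only when the system uses one.
    """
    utt_start, utt_end = section
    found = []
    if start_time > utt_start:
        found.append(Misoperation.LATE_START)
    if uses_end_instruction:
        if end_time is None:
            found.append(Misoperation.NO_END)
        elif end_time < utt_end:
            found.append(Misoperation.EARLY_END)
    return found


# Utterance detected from 1.0 s to 2.0 s in each case.
late = detect_misoperations((1.0, 2.0), start_time=1.2, end_time=2.5)
early = detect_misoperations((1.0, 2.0), start_time=0.8, end_time=1.5)
ok = detect_misoperations((1.0, 2.0), start_time=0.8, end_time=2.3)
```

Returning the type rather than a plain flag allows a notification step to choose a message that matches the kind of misoperation.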

As described above, the speech recognition system 4 according to the present embodiment can reduce the processing load of utterance section detection and detect the user's erroneous operations accurately.

The speech recognition system according to the present invention is applicable to uses such as speech recognition apparatuses that perform data input, text input, and device operation instructions by voice.

1 Speech recognition system 1
2 Speech recognition system 2
3 Speech recognition system 3
4 Speech recognition system 4
10 CPU
12 Memory
14 HDD
16 Communication IF
18 Output device
20 Input device
22 Voice input device
24 Bus
100 Voice input means
102 Utterance timing instruction acquisition means
104 Voice signal holding means
106 Utterance section detection means
108 Erroneous operation detection means
110 Speech recognition means
112 Speech recognition dictionary
114 Acoustic model
116 Erroneous operation notification means
118 Utterance section specifying means

Claims

A speech recognition system comprising:
utterance timing instruction acquisition means for acquiring utterance timing instructions given by a user, including an utterance start instruction;
voice signal holding means for holding an input voice signal and, when the utterance start instruction is acquired by the utterance timing instruction acquisition means, outputting the voice signal input thereafter;
utterance section detection means for detecting an utterance section from the voice signal output by the voice signal holding means; and
erroneous operation detection means for comparing the time information of the utterance section detected by the utterance section detection means with the presence or absence and the time information of the utterance timing instructions acquired by the utterance timing instruction acquisition means, and detecting an erroneous operation by the user at least when the time of the utterance start instruction is later than the start time of the utterance section.

The speech recognition system according to claim 1, wherein the utterance timing instructions further include an utterance end instruction, and the erroneous operation detection means further detects an erroneous operation by the user when the time of the utterance end instruction is earlier than the end time of the utterance section or when there is no utterance end instruction.

The speech recognition system according to claim 1 or 2, wherein the voice signal holding means holds the most recent predetermined duration of the input voice signal.

The speech recognition system according to any one of claims 1 to 3, further comprising speech recognition means for performing speech recognition on at least a partial section of the voice signal output by the voice signal holding means.

The speech recognition system according to claim 4, wherein the speech recognition means determines the section to be recognized based on the utterance section detected by the utterance section detection means.

The speech recognition system according to claim 4 or 5, wherein the speech recognition means stops speech recognition when the erroneous operation detection means detects an erroneous operation.

The speech recognition system according to any one of claims 4 to 6, wherein the speech recognition means performs speech recognition using a speech recognition dictionary in which recognition target words are stored and specifies, within the partial section, the section in which a recognition target word was uttered, and
the erroneous operation detection means detects an erroneous operation by the user after modifying the utterance section detected by the utterance section detection means based on the section specified by the speech recognition means.

The speech recognition system according to any one of claims 1 to 7, further comprising erroneous operation notification means for notifying, when the erroneous operation detection means detects an erroneous operation, a message corresponding to the type of the detected erroneous operation.

A speech recognition method comprising:
acquiring utterance timing instructions given by a user, including an utterance start instruction;
holding an input voice signal and, when the utterance start instruction is acquired, outputting the held voice signal and the voice signal input thereafter;
detecting an utterance section from the output voice signal; and
comparing the time information of the detected utterance section with the presence or absence and the time information of the acquired utterance timing instructions, and detecting an erroneous operation by the user at least when the time of the utterance start instruction is later than the start time of the utterance section.

A speech recognition program causing a computer to execute:
an utterance timing instruction acquisition step of acquiring utterance timing instructions given by a user, including an utterance start instruction;
a voice signal holding step of holding an input voice signal and, when the utterance start instruction is acquired in the utterance timing instruction acquisition step, outputting the voice signal input thereafter;
an utterance section detection step of detecting an utterance section from the voice signal output in the voice signal holding step; and
an erroneous operation detection step of comparing the time information of the utterance section detected in the utterance section detection step with the presence or absence and the time information of the utterance timing instructions acquired in the utterance timing instruction acquisition step, and detecting an erroneous operation by the user at least when the time of the utterance start instruction is later than the start time of the utterance section.