
WO2008018136A1 - Speaker recognizing device, speaker recognizing method, etc. - Google Patents


Info

Publication number
WO2008018136A1
WO2008018136A1 · PCT/JP2006/315839 · JP2006315839W
Authority
WO
WIPO (PCT)
Prior art keywords
feature data
speaker
voice
collating
input
Application number
PCT/JP2006/315839
Other languages
French (fr)
Japanese (ja)
Inventor
Hajime Kobayashi
Soichi Toyama
Ikuo Fujita
Mitsuya Komamura
Original Assignee
Pioneer Corporation
Tech Experts Incorporation
Application filed by Pioneer Corporation and Tech Experts Incorporation
Priority to PCT/JP2006/315839
Publication of WO2008018136A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 17/04: Training, enrolment or model building


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Telephone Function (AREA)

Abstract

A speaker recognizing device, a speaker recognizing method, a speaker recognition processing program, and the like for further improving the performance of speaker recognition are provided. A speech uttered by a speaker is input, and an acoustic parameter indicating a feature of the input speech is extracted. The extracted acoustic parameter is compared and collated with reference acoustic data stored in a speech feature DB (3). If the collation result is incorrect, the speaker inputs personal feature information representing a speaker feature other than speech, and the input personal feature information is compared and collated with reference personal feature information stored in a speaker feature DB (6). If this collation result is correct, the reference acoustic data stored in the speech feature DB (3) corresponding to the extracted acoustic parameter is updated using that acoustic parameter.

Description

Specification
Speaker recognition device, speaker recognition method, etc.
Technical field
[0001] The present application relates to the technical field of speaker recognition apparatuses, speaker recognition methods, and the like.
Background art
[0002] As a speaker recognition device of this type, Patent Document 1, for example, discloses a device that can update registered speech at an appropriate timing and can ensure security at the time of the update.
[0003] The speaker recognition device of Patent Document 1 (see FIG. 1 thereof) collates the speech information obtained by an input unit 1 against the speaker identification information stored in a speech data storage unit 2 by means of a speech collation unit 4 and obtains a similarity. When the speaker is judged, as a result, to be the registered person, an update necessity determination unit 7 decides whether the speaker identification information should be updated; when an update is decided on, the speaker identification information is updated using the speech information from the input unit 1 and stored back into the speech data storage unit 2.
Patent Document 1: JP 2001-265385 A
Disclosure of the invention
Problems to be solved by the invention
[0004] With such a conventional speaker recognition device, however, there is concern that the utterance features of a speaker who is difficult to recognize are unlikely to be reflected in the speech data storage unit 2. This is inconvenient because, when voice quality has changed owing to poor physical condition, aging, or the like, the speaker can no longer use speaker recognition with confidence.
[0005] Accordingly, taking the elimination of such an inconvenience as one of its problems, the present application aims to provide a speaker recognition device, a speaker recognition method, and the like that can further improve the performance of speaker recognition.
Means for solving the problem
[0006] In order to solve the above problem, the invention according to claim 1 comprises: voice input means for inputting a voice uttered by a speaker; voice feature data extraction means for extracting voice feature data indicating a feature of the input voice; voice feature data storage means for storing reference voice feature data serving as a collation reference for the voice feature data; voice feature data collation means for comparing and collating the extracted voice feature data with the reference voice feature data stored in the voice feature data storage means; speaker feature input means for inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice; speaker feature data storage means for storing reference speaker feature data serving as a collation reference for the speaker feature data; speaker feature data collation means for comparing and collating the input speaker feature data with the reference speaker feature data when the collation result by the voice feature data collation means is not correct; and update means for updating, when the collation result by the speaker feature data collation means is correct, the reference voice feature data stored in the voice feature data storage means corresponding to the voice feature data, using the voice feature data extracted by the voice feature data extraction means.
[0007] The invention according to claim 4 comprises: a voice input step of inputting a voice uttered by a speaker; a voice feature data extraction step of extracting voice feature data indicating a feature of the input voice; a voice feature data storage step of storing reference voice feature data serving as a collation reference for the voice feature data; a voice feature data collation step of comparing and collating the extracted voice feature data with the stored reference voice feature data; a speaker feature input step of inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice; a speaker feature data storage step of storing reference speaker feature data serving as a collation reference for the speaker feature data; a speaker feature data collation step of comparing and collating the input speaker feature data with the reference speaker feature data when the collation result in the voice feature data collation step is not correct; and an update step of updating, when the collation result in the speaker feature data collation step is correct, the stored reference voice feature data corresponding to the voice feature data, using the extracted voice feature data.
[0008] The invention of the speaker recognition program according to claim 5 causes a computer to function as: voice input means for inputting a voice uttered by a speaker; voice feature data extraction means for extracting voice feature data indicating a feature of the input voice; voice feature data collation means for comparing and collating the extracted voice feature data with reference voice feature data that is stored in voice feature data storage means and serves as a collation reference for the voice feature data; speaker feature input means for inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice; speaker feature data collation means for comparing and collating, when the collation result by the voice feature data collation means is not correct, the input speaker feature data with reference speaker feature data that is stored in speaker feature data storage means and serves as a collation reference for the speaker feature data; and update means for updating, when the collation result by the speaker feature data collation means is correct, the stored reference voice feature data corresponding to the voice feature data, using the voice feature data extracted by the voice feature data extraction means.
[0009] The invention of the recording medium according to claim 6 is characterized in that the speaker recognition program according to claim 5 is recorded thereon so as to be readable by the computer.
Brief Description of Drawings
[0010] [FIG. 1] A diagram showing a schematic configuration example of a speaker recognition device S according to the present embodiment.
[FIG. 2] A flowchart showing the speaker recognition processing performed by the processing unit P of the speaker recognition device S according to the present embodiment.
[FIG. 3] A flowchart showing another example of the speaker recognition processing performed by the processing unit P of the speaker recognition device S according to the present embodiment.
[FIG. 4] A diagram showing a schematic configuration example of the speaker recognition device S in the case where the speaker feature DB 6 is also updated.
Explanation of symbols
[0011]
1 Voice input unit
2 Voice feature data extraction unit
3 Voice feature DB
4 Voice feature data collation unit
5 Speaker feature input unit
6 Speaker feature DB
7 Speaker feature data collation unit
8 Voice feature DB update unit
9 Speaker feature DB update unit
P Processing unit
M Storage unit
D Display unit
S Speaker recognition device
BEST MODE FOR CARRYING OUT THE INVENTION
[0012] Hereinafter, the best embodiment of the present application will be described in detail with reference to the drawings.
[0013] First, the configuration and functions of the speaker recognition device S according to the embodiment of the present application will be described with reference to FIG. 1. Such a speaker recognition device S is applied to, for example, a car navigation device or a disc (e.g., CD, MD, DVD) playback device, and is used, for example, for personal authentication when the device starts up.
[0014] FIG. 1 is a diagram showing a schematic configuration example of the speaker recognition device S according to the present embodiment.
[0015] The speaker recognition device S according to the present embodiment comprises a voice input unit 1 as the voice input means, a voice feature data extraction unit 2 as the voice feature data extraction means, a voice feature (speaker acoustic) DB (database) 3 as the voice feature data storage means, a voice (acoustic) feature data collation unit 4 as the voice feature data collation means, a speaker (individual) feature input unit 5 as the speaker feature input means, a speaker (individual) feature DB 6 as the speaker feature data storage means, a speaker (individual) feature data collation unit 7 as the speaker feature data collation means, and a voice feature DB update unit 8 as the update means.
[0016] Here, in a processing unit P comprising a CPU having a computing function, a working RAM, and a ROM storing various data and programs, the CPU executes a predetermined program (the speaker recognition program of the present application), whereby the processing unit functions as the voice feature data extraction unit 2, the voice feature data collation unit 4, the speaker feature data collation unit 7, and the voice feature DB update unit 8.
[0017] The voice feature DB 3 and the speaker feature DB 6 are built in a storage unit M such as a hard disk drive.
[0018] The voice input unit 1 is a voice input device, such as a microphone, for inputting a voice uttered by a speaker. Any type of device may be used as the voice input unit 1 as long as it can input voice.
[0019] The voice feature data extraction unit 2 calculates (extracts), from the voice input by the voice input unit 1, acoustic parameters (an example of voice feature data indicating features of the voice). As the acoustic parameters, MFCCs (Mel Frequency Cepstrum Coefficients) or LPC cepstra, for example, are used; any parameters capable of expressing acoustic features may be employed.
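As an editorial illustration of this extraction step, the sketch below computes MFCC frames for one utterance in Python. The patent does not prescribe any toolkit or settings; librosa, the 16 kHz sampling rate, and the 13 coefficients are assumptions made purely for illustration.

```python
# Minimal MFCC-extraction sketch (illustrative only; the patent allows any
# parameters that express acoustic features, e.g. LPC cepstra as well).
import numpy as np
import librosa

def extract_acoustic_parameters(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return a (num_frames, n_mfcc) matrix of MFCC vectors for one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)         # mono, resampled to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                         # one row per analysis frame
```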
[0020] The voice feature DB 3 stores and registers reference acoustic data (an example of reference voice feature data) that indicates the acoustic features of each of a plurality of registered speakers and serves as the collation reference for the above acoustic parameters. One example of the reference acoustic data is a GMM (Gaussian Mixture Model) generated from each speaker's acoustic parameters; any type of model may be used as long as it can be generated from each speaker's acoustic parameters.
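Under the GMM reading of this paragraph, enrollment might look like the following sketch, with scikit-learn's GaussianMixture standing in for the reference acoustic data and a plain dict standing in for the voice feature DB 3; the component count and the toy training data are assumptions.

```python
# Enrollment sketch: fit one diagonal-covariance GMM per registered speaker
# from that speaker's acoustic-parameter frames.
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(frames_by_speaker):
    """frames_by_speaker: {name: (num_frames, dim) array} -> {name: fitted GMM}."""
    voice_feature_db = {}
    for name, frames in frames_by_speaker.items():
        gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
        gmm.fit(frames)
        voice_feature_db[name] = gmm
    return voice_feature_db

# Toy usage: random vectors stand in for real MFCC frames.
rng = np.random.default_rng(0)
db = enroll({"alice": rng.normal(0.0, 1.0, (500, 13)),
             "bob":   rng.normal(1.0, 1.0, (500, 13))})
```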
[0021] The voice feature data collation unit 4 compares and collates the acoustic parameters calculated (extracted) by the voice feature data extraction unit 2 with the reference acoustic data stored in the voice feature DB 3, and outputs the collation result (also called a recognition result or an authentication result) to, for example, a display unit D.
[0022] For example, the voice feature data collation unit 4 checks which of the reference acoustic data stored in the voice feature DB 3 the extracted acoustic parameters are closest to, and outputs the collation result. More specifically, the extracted acoustic parameters are fitted to the reference acoustic data (e.g., the GMM) of each registered speaker to obtain a likelihood, and information (e.g., the name) on the registered speaker corresponding to the reference acoustic data (e.g., the GMM) that yields the maximum likelihood is output to the display unit D as the collation result. The user, who is the speaker, can look at the collation result displayed on the display unit D and judge whether it is correct.
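The maximum-likelihood collation described here could be sketched as below, assuming per-speaker models with a score() method such as those from the enrollment sketch above (scikit-learn's score() returns the average per-frame log-likelihood).

```python
# Identification sketch: score the utterance against every registered
# speaker's model and return the maximum-likelihood speaker.
def identify(frames, voice_feature_db):
    scores = {name: gmm.score(frames) for name, gmm in voice_feature_db.items()}
    best = max(scores, key=scores.get)
    return best, scores              # name to display, plus all similarities
```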
[0023] The speaker feature input unit 5 is a personal feature input device for inputting personal feature information as speaker feature data indicating a unique feature of the speaker (individual) other than the voice. For example, if the personal feature information is fingerprint data, the personal feature input device is a fingerprint sensor; if the personal feature information is a password, it is a keyboard, a touch panel, or the like. Various known kinds of personal feature information, such as an iris, can also be applied.
[0024] The speaker feature DB 6 stores and registers reference personal feature information for each of a plurality of registered speakers, which serves as the collation reference for the personal feature information input by the speaker feature input unit 5.
[0025] The speaker feature data collation unit 7 compares and collates the personal feature information input by the speaker feature input unit 5 with the reference personal feature information stored in the speaker feature DB 6, and judges whether the collation result (also called a recognition result or an authentication result) is correct (for example, whether a password matching the input password is registered in the speaker feature DB 6). When it judges from the collation result that the answer is correct, the speaker feature data collation unit 7 identifies (determines), from the speaker feature DB 6, the registered speaker corresponding to the correct answer (for example, the registered speaker whose password matched).
[0026] When the collation result by the speaker feature data collation unit 7 is correct, the voice feature DB update unit 8 updates the reference acoustic data of the identified registered speaker stored in the voice feature DB 3, using the acoustic parameters extracted by the voice feature data extraction unit 2. For this update, MAP estimation, for example, is used.
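The patent names MAP estimation but not a specific recipe. The sketch below shows one common form, means-only relevance-MAP adaptation of a GMM (Reynolds-style); the relevance factor of 16 is a conventional choice, not a value from the patent.

```python
# MAP-update sketch: nudge the stored GMM means toward the new utterance's
# frames, weighting each component by how much data it actually absorbed.
import numpy as np

def map_update_means(gmm, frames, relevance=16.0):
    resp = gmm.predict_proba(frames)                  # (num_frames, n_components)
    n_k = resp.sum(axis=0)                            # soft frame count per component
    x_bar = (resp.T @ frames) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]        # 0 = keep old mean, 1 = new data
    gmm.means_ = alpha * x_bar + (1.0 - alpha) * gmm.means_
    return gmm
```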
[0027] Next, the operation of the speaker recognition device S according to the present embodiment will be described with reference to FIG. 2.
[0028] FIG. 2 is a flowchart showing the speaker recognition processing performed by the processing unit P of the speaker recognition device S according to the present embodiment.
[0029] In the processing of FIG. 2, when a voice uttered by a speaker is input through the voice input unit 1, acoustic parameters are calculated (extracted) from the input voice by the voice feature data extraction unit 2 (step S1).
[0030] Next, the calculated (extracted) acoustic parameters and the reference acoustic data stored in the voice feature DB 3 are compared and collated by the voice feature data collation unit 4, and the collation result (recognition result) is output to the display unit D (step S2).
[0031] By fitting the voice feature data to the registered speakers' patterns registered in the voice feature DB, a similarity (distance or likelihood) is obtained for each speaker. When an utterance that produces an incorrect answer is input, however, the similarities (distances or likelihoods) of the individual speakers often lie close to one another; a correct or incorrect answer can therefore be determined automatically by applying a threshold to each speaker's similarity (distance or likelihood). Specifically, the voice feature data collation unit 4 of FIG. 1 judges whether the collation result displayed on the display unit D (information on the corresponding registered speaker, e.g., a name) is correct; when it is correct, an input indicating the correct answer is made via the voice input unit 1, the speaker feature input unit 5, or the like. The processing unit P, having recognized this input, determines that the collation result is correct (step S3: YES), and the processing ends.
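One plausible reading of this automatic threshold judgment is sketched below: accept the top-scoring speaker only when its similarity clears an absolute floor and stands clear of the runner-up (wrong matches tend to produce bunched scores). Both threshold values are illustrative assumptions.

```python
# Automatic correct/incorrect judgment sketch for step S3.
def auto_judge(scores, min_score=-40.0, min_margin=2.0):
    """scores: {name: average log-likelihood}; True means 'treat as correct'."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) == 1:
        return ranked[0] >= min_score
    return ranked[0] >= min_score and (ranked[0] - ranked[1]) >= min_margin
```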
[0032] When the answer is incorrect, on the other hand, an input indicating the incorrect answer is made via the voice input unit 1, the speaker feature input unit 5, or the like. The processing unit P, having recognized this input, determines that the collation result is incorrect (step S3: NO), and the processing proceeds to step S4. The above correct/incorrect determination is a function incorporated in the voice feature data collation unit 4 of FIG. 1; its input is, again, the similarity of each registered speaker from the voice feature data collation unit 4, and its output is a message shown on the display unit. When the result is judged correct, the message is the recognition result; when it is judged incorrect, the message prompts the user, for example, "Please input personal feature information into the speaker feature input unit 5." Incidentally, when the threshold determination is not applied, the recognition result is output regardless of whether the answer is correct or incorrect.
[0033] Instead of the voice feature data collation unit 4 of FIG. 1 judging whether the collation result displayed on the display unit D is correct, the speaker who sees the collation result displayed on the display unit D (information on the corresponding registered speaker, e.g., a name) may judge whether the collation result is correct.
[0034] Alternatively, the processing unit P may be configured to determine the result to be correct (or incorrect) when no input indicating an incorrect (or correct) answer is made within a predetermined time (e.g., 10 seconds) after the collation result is output, and to determine the result to be incorrect (or correct) when such an input is made within that period.
[0035] In step S4, the speaker is prompted to input personal feature information, and in response, the personal feature information is input by the speaker through the speaker feature input unit 5. The input personal feature information and the reference personal feature information stored in the speaker feature DB 6 are then compared and collated by the speaker feature data collation unit 7, and it is judged whether the collation result is correct. When the collation result is incorrect, for example when no reference personal feature information matching the input personal feature information (a match within a predetermined margin) is stored in the speaker feature DB 6 (step S5: NO), the processing ends. When the collation result is correct (step S5: YES), the registered speaker corresponding to the correct answer is identified and the voice feature DB 3 is updated (step S6). In this update of the voice feature DB 3, the reference acoustic data of the identified registered speaker, stored in the voice feature DB 3 in correspondence with the acoustic parameters, is updated using the acoustic parameters extracted by the voice feature data extraction unit 2.
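Pulling steps S1 through S6 together, the sketch below strings the earlier illustrative helpers (auto_judge, map_update_means) into the FIG. 2 decision flow, with plain dicts for the two DBs and a password lookup standing in for the personal-feature collation; none of these names or structures come from the patent.

```python
# End-to-end sketch of the FIG. 2 flow.
def recognize_and_maybe_update(frames, voice_db, speaker_db, entered_password):
    # S1-S2: score the utterance against every registered speaker's model.
    scores = {name: gmm.score(frames) for name, gmm in voice_db.items()}
    best = max(scores, key=scores.get)
    if auto_judge(scores):                            # S3: YES -> done
        return best
    # S3: NO -> S4/S5: fall back to the personal feature (a password here).
    matched = next((n for n, pw in speaker_db.items()
                    if pw == entered_password), None)
    if matched is None:                               # S5: NO -> end, no update
        return None
    map_update_means(voice_db[matched], frames)       # S6: update voice feature DB 3
    return matched
```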
[0036] As described above, according to the present embodiment, a voice uttered by a speaker is input, acoustic parameters indicating features of the input voice are extracted, and the extracted acoustic parameters are compared and collated with the reference acoustic data stored in the voice feature DB 3. When the collation result is not correct, personal feature information indicating a speaker feature other than the voice is input by the speaker, and the input personal feature information is compared and collated with the reference personal feature information stored in the speaker feature DB 6. When this collation result is correct, the reference acoustic data stored in the voice feature DB 3 in correspondence with the acoustic parameters is updated using the extracted acoustic parameters. The speaker can thus be identified from information other than the voice, so the voice feature DB 3 can be updated even for voice patterns that are prone to recognition errors, and the recognition performance can be improved.
[0037] Specifically, since an utterance whose collation result was incorrect is still considered to carry sufficient characteristics of the speaker, the device is configured to update the voice feature DB 3 by actively using incorrectly recognized utterances as well. As a result, speaker recognition can be used with confidence even when voice quality has changed owing to poor physical condition, aging, or the like.
[0038] In the present embodiment, the voice feature DB update unit 8 updates the reference acoustic data of the identified registered speaker stored in the voice feature DB 3, using the acoustic parameters extracted by the voice feature data extraction unit 2, only when the collation result by the voice feature data collation unit 4 is incorrect and the collation result by the speaker feature data collation unit 7 is correct.
[0039] As another embodiment, however, the voice feature DB update unit 8 may also update the reference acoustic data of the identified registered speaker stored in the voice feature DB 3, using the acoustic parameters extracted by the voice feature data extraction unit 2, when the collation result by the voice feature data collation unit 4 is correct, as shown in FIG. 3 (step S23: YES, step S26). Since updates are then applied more often than with the method shown in FIG. 2, a further improvement in speaker recognition accuracy can be expected.
[0040] Furthermore, in the present embodiment, for personal features whose recognition rate cannot be guaranteed to be 100%, such as fingerprints and irises, it is more desirable to configure the device to update the speaker feature DB 6 as well, aiming to improve the accuracy of the authentication result. Unlike the comparison of acoustic parameters against reference acoustic data, however, whether the collation by the speaker feature data collation unit 7 is correct cannot be judged from a further collation process; the configuration therefore also includes processing in which the speaker feature data collation unit 7 applies a threshold to the speaker similarity corresponding to the collation result and infers whether it is correct or incorrect.
[0041] FIG. 4 is a diagram showing a schematic configuration example of the speaker recognition device S in the case where the speaker feature DB 6 is also updated. In the configuration shown in FIG. 4, the same components as in FIG. 1 are given the same reference numerals, and duplicate description is omitted. In the speaker recognition device S of FIG. 4, a speaker feature DB update unit 9 is added. When the collation result by the speaker feature data collation unit 7 described above is judged correct (that is, when the speaker similarity exceeds the threshold), the speaker feature DB update unit 9 updates the reference personal feature information, stored in the speaker feature DB 6 in correspondence with the personal feature information, using the personal feature information input through the speaker feature input unit 5. The accuracy of the authentication result can thereby be improved.
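The patent does not fix how the speaker feature DB update unit 9 rewrites the stored reference. For a template-style biometric, one simple possibility is the blend sketched below; the threshold and blend weight are invented for illustration, and for an exact-match feature such as a password there is nothing to blend.

```python
# Reference-template update sketch for the FIG. 4 variant.
import numpy as np

def update_reference_template(stored, new, similarity, threshold=0.9, weight=0.1):
    """Blend the stored template toward the new sample only on a confident match."""
    if similarity <= threshold:
        return stored                 # below threshold: leave the reference unchanged
    return (1.0 - weight) * stored + weight * new
```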
[0042] The present invention is not limited to the above embodiment. The above embodiment is an example; anything that has substantially the same configuration as the technical idea described in the claims of the present invention and exhibits the same operation and effects is included in the technical scope of the present invention.

Claims

[1] A speaker recognition device comprising:
voice input means for inputting a voice uttered by a speaker;
voice feature data extraction means for extracting voice feature data indicating a feature of the input voice;
voice feature data storage means for storing reference voice feature data serving as a collation reference for the voice feature data;
voice feature data collation means for comparing and collating the extracted voice feature data with the reference voice feature data stored in the voice feature data storage means;
speaker feature input means for inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice;
speaker feature data storage means for storing reference speaker feature data serving as a collation reference for the speaker feature data;
speaker feature data collation means for comparing and collating the input speaker feature data with the reference speaker feature data when a collation result by the voice feature data collation means is not correct; and
update means for updating, when a collation result by the speaker feature data collation means is correct, the reference voice feature data stored in the voice feature data storage means corresponding to the voice feature data, using the voice feature data extracted by the voice feature data extraction means.
[2] The speaker recognition device according to claim 1, wherein the update means updates the reference voice feature data stored in the voice feature data storage means corresponding to the voice feature data, using the voice feature data, also when the collation result by the voice feature data collation means is correct.
[3] The speaker recognition device according to claim 1 or 2, wherein the update means updates, when the collation result by the speaker feature data collation means is correct, the reference speaker feature data stored in the speaker feature data storage means corresponding to the speaker feature data, using the input speaker feature data.
[4] A speaker recognition method comprising:
a voice input step of inputting a voice uttered by a speaker;
a voice feature data extraction step of extracting voice feature data indicating a feature of the input voice;
a voice feature data storage step of storing reference voice feature data serving as a collation reference for the voice feature data;
a voice feature data collation step of comparing and collating the extracted voice feature data with the stored reference voice feature data;
a speaker feature input step of inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice;
a speaker feature data storage step of storing reference speaker feature data serving as a collation reference for the speaker feature data;
a speaker feature data collation step of comparing and collating the input speaker feature data with the reference speaker feature data when a collation result in the voice feature data collation step is not correct; and
an update step of updating, when a collation result in the speaker feature data collation step is correct, the stored reference voice feature data corresponding to the voice feature data, using the extracted voice feature data.
[5] A speaker recognition program causing a computer to function as:
voice input means for inputting a voice uttered by a speaker;
voice feature data extraction means for extracting voice feature data indicating a feature of the input voice;
voice feature data collation means for comparing and collating the extracted voice feature data with reference voice feature data that is stored in voice feature data storage means and serves as a collation reference for the voice feature data;
speaker feature input means for inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice;
speaker feature data collation means for comparing and collating, when a collation result by the voice feature data collation means is not correct, the input speaker feature data with reference speaker feature data that is stored in speaker feature data storage means and serves as a collation reference for the speaker feature data; and
update means for updating, when a collation result by the speaker feature data collation means is correct, the stored reference voice feature data corresponding to the voice feature data, using the voice feature data extracted by the voice feature data extraction means.
話者特徴データ格納手段に格納されて!ヽる、前記話者特徴データの照合基準とな る基準話者特徴データとを比較照合する話者特徴データ照合手段、及び、 前記話者特徴データ照合手段による照合結果が正解である場合に、前記音声特 徴データ抽出手段により抽出された前記音声特徴データを用いて、該音声特徴デ ータに対応する前記格納された前記基準音声特徴データを更新する更新手段とし て機能させることを特徴とする話者認識プログラム。  Stored in the speaker feature data storage means! The speaker feature data collating means for comparing and collating with reference speaker feature data as a collation reference for the speaker feature data, and the speaker feature data collation When the collation result by the means is correct, the stored reference speech feature data corresponding to the speech feature data is updated using the speech feature data extracted by the speech feature data extraction unit. A speaker recognition program characterized by functioning as an updating means.
請求項 5に記載の話者認識プログラムが、前記コンピュータにより読取可能に記録さ れて!ヽることを特徴とする記録媒体。 6. A recording medium in which the speaker recognition program according to claim 5 is recorded so as to be readable by the computer.
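Taken together, claims 1 and 4 describe a two-stage flow: voice collation first, collation of a non-voice speaker feature as a fallback, and a conditional update of the stored reference voice feature data. The sketch below is one hypothetical arrangement of those steps; the feature extraction, the cosine-similarity collation, and all identifiers and thresholds are illustrative assumptions rather than anything fixed by the claims:

```python
import numpy as np

def extract_voice_features(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the voice feature data extraction step; the claims do
    not fix a particular feature type (cepstra, spectra, etc.)."""
    return audio[:128].astype(np.float64)  # placeholder transformation

def collate(features: np.ndarray, reference: np.ndarray, threshold: float) -> bool:
    """Compare-and-collate step: cosine similarity against a stored
    reference, judged correct when it exceeds the threshold."""
    denom = np.linalg.norm(features) * np.linalg.norm(reference) + 1e-12
    return float(np.dot(features, reference)) / denom > threshold

def recognize_speaker(audio: np.ndarray, speaker_features: np.ndarray,
                      speaker_id: str, voice_db: dict, speaker_db: dict,
                      voice_thr: float = 0.8, feat_thr: float = 0.8) -> bool:
    """Two-stage flow of claims 1 and 4, including the claim-1 update step."""
    voice_features = extract_voice_features(audio)
    if collate(voice_features, voice_db[speaker_id], voice_thr):
        # Voice collation correct; under claim 2 the reference voice
        # feature data could also be updated at this point.
        return True
    # Voice collation not correct: fall back to the non-voice speaker feature.
    if collate(speaker_features, speaker_db[speaker_id], feat_thr):
        # Speaker feature collation correct: update the stored reference
        # voice feature data using the features just extracted.
        voice_db[speaker_id] = 0.9 * voice_db[speaker_id] + 0.1 * voice_features
        return True
    return False
```

The point of the fallback order is that a voice template that has drifted (for example, because of a cold or an aging voice) can still be refreshed once the speaker is confirmed by the secondary feature, which is exactly the situation the updating means of claim 1 targets.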
PCT/JP2006/315839 2006-08-10 2006-08-10 Speaker recognizing device, speaker recognizing method, etc. WO2008018136A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2006/315839 WO2008018136A1 (en) 2006-08-10 2006-08-10 Speaker recognizing device, speaker recognizing method, etc.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2006/315839 WO2008018136A1 (en) 2006-08-10 2006-08-10 Speaker recognizing device, speaker recognizing method, etc.

Publications (1)

Publication Number Publication Date
WO2008018136A1 true WO2008018136A1 (en) 2008-02-14

Family

ID=39032681

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/315839 WO2008018136A1 (en) 2006-08-10 2006-08-10 Speaker recognizing device, speaker recognizing method, etc.

Country Status (1)

Country Link
WO (1) WO2008018136A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS626300A * 1985-07-02 1987-01-13 Matsushita Electric Industrial Co., Ltd. Speaker verification device
JP2001503156A * 1996-10-15 2001-03-06 Swisscom AG Speaker identification method
JP2002221990A * 2001-01-25 2002-08-09 Matsushita Electric Ind Co Ltd Personal authentication device
JP3529049B2 * 2002-03-06 2004-05-24 Sony Corporation Learning device, learning method, and robot device
JP3727927B2 * 2003-02-10 2005-12-21 Toshiba Corporation Speaker verification device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023079815A1 * 2021-11-08 2023-05-11 Panasonic Intellectual Property Corporation of America Information processing method, information processing device, and information processing program

Similar Documents

Publication Publication Date Title
CN111566729B (en) Speaker identification with super-phrase voice segmentation for far-field and near-field voice assistance applications
US11735191B2 (en) Speaker recognition with assessment of audio frame contribution
JP4213716B2 (en) Voice authentication system
JP4588069B2 (en) Operator recognition device, operator recognition method, and operator recognition program
EP3740949B1 (en) Authenticating a user
US20180040325A1 (en) Speaker recognition
US9646613B2 (en) Methods and systems for splitting a digital signal
CN104462912B (en) Improved biometric password security
JP4897040B2 (en) Acoustic model registration device, speaker recognition device, acoustic model registration method, and acoustic model registration processing program
US11081115B2 (en) Speaker recognition
JPH1173195A (en) Method for authenticating speaker's proposed identification
WO2018088534A1 (en) Electronic device, control method for electronic device, and control program for electronic device
CN117378006A (en) Mixed multilingual text-dependent and text-independent speaker confirmation
CN113241059B (en) Voice wake-up method, device, equipment and storage medium
WO2007111169A1 (en) Speaker model registration device, method, and computer program in speaker recognition system
JP3849841B2 (en) Speaker recognition device
KR102098956B1 (en) Voice recognition apparatus and method of recognizing the voice
WO2008018136A1 (en) Speaker recognizing device, speaker recognizing method, etc.
JP3818063B2 (en) Personal authentication device
JP2001265387A (en) Speaker verification apparatus and method
JP3919314B2 (en) Speaker recognition apparatus and method
EP4506838A1 (en) Methods and systems for authenticating users
US20250046317A1 (en) Methods and systems for authenticating users
JP2000148187A (en) Speaker recognition method, apparatus using the method, and program recording medium therefor
JP3841342B2 (en) Speech recognition apparatus and speech recognition program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06782633

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 06782633

Country of ref document: EP

Kind code of ref document: A1