
WO2008018136A1 - Speaker recognizing device, speaker recognizing method, etc. - Google Patents


Info

Publication number
WO2008018136A1
WO2008018136A1 · PCT/JP2006/315839 · JP2006315839W
Authority
WO
WIPO (PCT)
Prior art keywords
feature data
speaker
voice
collating
input
Application number
PCT/JP2006/315839
Other languages
French (fr)
Japanese (ja)
Inventor
Hajime Kobayashi
Soichi Toyama
Ikuo Fujita
Mitsuya Komamura
Original Assignee
Pioneer Corporation
Tech Experts Incorporation
Application filed by Pioneer Corporation and Tech Experts Incorporation
Priority to PCT/JP2006/315839
Publication of WO2008018136A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 17/04: Training, enrolment or model building


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Telephone Function (AREA)

Abstract

A speaker recognizing device, a speaker recognizing method, a speaker recognition processing program, and the like for further improving the performance of speaker recognition are provided. A speech uttered by a speaker is input, and an acoustic parameter indicating a feature of the input speech is extracted. The extracted acoustic parameter is compared and collated with reference acoustic data stored in a speech feature DB (3). If the collation result is incorrect, the speaker inputs personal feature information representing a speaker feature other than speech, and the input personal feature information is compared and collated with reference personal feature information stored in a speaker feature DB (6). If this collation result is correct, the reference acoustic data stored in the speech feature DB (3) corresponding to the extracted acoustic parameter is updated using that acoustic parameter.

Description

Specification
Speaker recognition device, speaker recognition method, etc.
Technical field
[0001] The present application relates to the technical field of speaker recognition apparatuses, speaker recognition methods, and the like.
Background art
[0002] As a speaker recognition device of this type, Patent Document 1, for example, discloses a device that can update registered speech at an appropriate timing and can ensure security at the time of the update.
[0003] The speaker recognition device of Patent Document 1 (see FIG. 1 thereof) collates the speech information obtained by an input unit 1 against the speaker identification information stored in a speech data storage unit 2 by means of a speech collation unit 4 and obtains a similarity. When the speaker is judged, as a result, to be the registered person, an update necessity determination unit 7 decides whether the speaker identification information should be updated; when an update is decided on, the speaker identification information is updated using the speech information from the input unit 1 and stored back into the speech data storage unit 2.
Patent Document 1: JP 2001-265385 A
Disclosure of the invention
Problems to be solved by the invention
[0004] With such a conventional speaker recognition device, however, there is concern that the utterance features of a speaker who is difficult to recognize are unlikely to be reflected in the speech data storage unit 2. This is inconvenient because, when voice quality has changed owing to poor physical condition, aging, or the like, the speaker can no longer use speaker recognition with confidence.
[0005] Accordingly, taking the elimination of such an inconvenience as one of its problems, the present application aims to provide a speaker recognition device, a speaker recognition method, and the like that can further improve the performance of speaker recognition.
Means for solving the problem
[0006] In order to solve the above problem, the invention according to claim 1 comprises: voice input means for inputting a voice uttered by a speaker; voice feature data extraction means for extracting voice feature data indicating a feature of the input voice; voice feature data storage means for storing reference voice feature data serving as a collation reference for the voice feature data; voice feature data collation means for comparing and collating the extracted voice feature data with the reference voice feature data stored in the voice feature data storage means; speaker feature input means for inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice; speaker feature data storage means for storing reference speaker feature data serving as a collation reference for the speaker feature data; speaker feature data collation means for comparing and collating the input speaker feature data with the reference speaker feature data when the collation result by the voice feature data collation means is not correct; and update means for updating, when the collation result by the speaker feature data collation means is correct, the reference voice feature data stored in the voice feature data storage means corresponding to the voice feature data, using the voice feature data extracted by the voice feature data extraction means.
[0007] The invention according to claim 4 comprises: a voice input step of inputting a voice uttered by a speaker; a voice feature data extraction step of extracting voice feature data indicating a feature of the input voice; a voice feature data storage step of storing reference voice feature data serving as a collation reference for the voice feature data; a voice feature data collation step of comparing and collating the extracted voice feature data with the stored reference voice feature data; a speaker feature input step of inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice; a speaker feature data storage step of storing reference speaker feature data serving as a collation reference for the speaker feature data; a speaker feature data collation step of comparing and collating the input speaker feature data with the reference speaker feature data when the collation result in the voice feature data collation step is not correct; and an update step of updating, when the collation result in the speaker feature data collation step is correct, the stored reference voice feature data corresponding to the voice feature data, using the extracted voice feature data.
[0008] The invention of the speaker recognition program according to claim 5 causes a computer to function as: voice input means for inputting a voice uttered by a speaker; voice feature data extraction means for extracting voice feature data indicating a feature of the input voice; voice feature data collation means for comparing and collating the extracted voice feature data with reference voice feature data that is stored in voice feature data storage means and serves as a collation reference for the voice feature data; speaker feature input means for inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice; speaker feature data collation means for comparing and collating, when the collation result by the voice feature data collation means is not correct, the input speaker feature data with reference speaker feature data that is stored in speaker feature data storage means and serves as a collation reference for the speaker feature data; and update means for updating, when the collation result by the speaker feature data collation means is correct, the stored reference voice feature data corresponding to the voice feature data, using the voice feature data extracted by the voice feature data extraction means.
[0009] The invention of the recording medium according to claim 6 is characterized in that the speaker recognition program according to claim 5 is recorded thereon so as to be readable by the computer.
Brief Description of Drawings
[0010] [FIG. 1] A diagram showing a schematic configuration example of a speaker recognition device S according to the present embodiment.
[FIG. 2] A flowchart showing the speaker recognition processing performed by the processing unit P of the speaker recognition device S according to the present embodiment.
[FIG. 3] A flowchart showing another example of the speaker recognition processing performed by the processing unit P of the speaker recognition device S according to the present embodiment.
[FIG. 4] A diagram showing a schematic configuration example of the speaker recognition device S in the case where the speaker feature DB 6 is also updated.
Explanation of symbols
[0011]
1 Voice input unit
2 Voice feature data extraction unit
3 Voice feature DB
4 Voice feature data collation unit
5 Speaker feature input unit
6 Speaker feature DB
7 Speaker feature data collation unit
8 Voice feature DB update unit
9 Speaker feature DB update unit
P Processing unit
M Storage unit
D Display unit
S Speaker recognition device
BEST MODE FOR CARRYING OUT THE INVENTION
[0012] Hereinafter, the best embodiment of the present application will be described in detail with reference to the drawings.
[0013] First, the configuration and functions of the speaker recognition device S according to the embodiment of the present application will be described with reference to FIG. 1. Such a speaker recognition device S is applied to, for example, a car navigation device or a disc (e.g., CD, MD, DVD) playback device, and is used, for example, for personal authentication when the device starts up.
[0014] FIG. 1 is a diagram showing a schematic configuration example of the speaker recognition device S according to the present embodiment.
[0015] The speaker recognition device S according to the present embodiment comprises a voice input unit 1 as the voice input means, a voice feature data extraction unit 2 as the voice feature data extraction means, a voice feature (speaker acoustic) DB (database) 3 as the voice feature data storage means, a voice (acoustic) feature data collation unit 4 as the voice feature data collation means, a speaker (individual) feature input unit 5 as the speaker feature input means, a speaker (individual) feature DB 6 as the speaker feature data storage means, a speaker (individual) feature data collation unit 7 as the speaker feature data collation means, and a voice feature DB update unit 8 as the update means.
[0016] Here, in a processing unit P comprising a CPU having a computing function, a working RAM, and a ROM storing various data and programs, the CPU executes a predetermined program (the speaker recognition program of the present application), whereby the processing unit functions as the voice feature data extraction unit 2, the voice feature data collation unit 4, the speaker feature data collation unit 7, and the voice feature DB update unit 8.
[0017] The voice feature DB 3 and the speaker feature DB 6 are built in a storage unit M such as a hard disk drive.
[0018] The voice input unit 1 is a voice input device, such as a microphone, for inputting a voice uttered by a speaker. Any type of device may be used as the voice input unit 1 as long as it can input voice.
[0019] The voice feature data extraction unit 2 calculates (extracts), from the voice input by the voice input unit 1, acoustic parameters (an example of voice feature data indicating features of the voice). As the acoustic parameters, MFCCs (Mel Frequency Cepstrum Coefficients) or LPC cepstra, for example, are used; any parameters capable of expressing acoustic features may be employed.
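As an editorial illustration of this extraction step, the sketch below computes MFCC frames for one utterance in Python. The patent does not prescribe any toolkit or settings; librosa, the 16 kHz sampling rate, and the 13 coefficients are assumptions made purely for illustration.

```python
# Minimal MFCC-extraction sketch (illustrative only; the patent allows any
# parameters that express acoustic features, e.g. LPC cepstra as well).
import numpy as np
import librosa

def extract_acoustic_parameters(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return a (num_frames, n_mfcc) matrix of MFCC vectors for one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)         # mono, resampled to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                         # one row per analysis frame
```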
[0020] The voice feature DB 3 stores and registers reference acoustic data (an example of reference voice feature data) that indicates the acoustic features of each of a plurality of registered speakers and serves as the collation reference for the above acoustic parameters. One example of the reference acoustic data is a GMM (Gaussian Mixture Model) generated from each speaker's acoustic parameters; any type of model may be used as long as it can be generated from each speaker's acoustic parameters.
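Under the GMM reading of this paragraph, enrollment might look like the following sketch, with scikit-learn's GaussianMixture standing in for the reference acoustic data and a plain dict standing in for the voice feature DB 3; the component count and the toy training data are assumptions.

```python
# Enrollment sketch: fit one diagonal-covariance GMM per registered speaker
# from that speaker's acoustic-parameter frames.
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(frames_by_speaker):
    """frames_by_speaker: {name: (num_frames, dim) array} -> {name: fitted GMM}."""
    voice_feature_db = {}
    for name, frames in frames_by_speaker.items():
        gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
        gmm.fit(frames)
        voice_feature_db[name] = gmm
    return voice_feature_db

# Toy usage: random vectors stand in for real MFCC frames.
rng = np.random.default_rng(0)
db = enroll({"alice": rng.normal(0.0, 1.0, (500, 13)),
             "bob":   rng.normal(1.0, 1.0, (500, 13))})
```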
[0021] The voice feature data collation unit 4 compares and collates the acoustic parameters calculated (extracted) by the voice feature data extraction unit 2 with the reference acoustic data stored in the voice feature DB 3, and outputs the collation result (also called a recognition result or an authentication result) to, for example, a display unit D.
[0022] For example, the voice feature data collation unit 4 checks which of the reference acoustic data stored in the voice feature DB 3 the extracted acoustic parameters are closest to, and outputs the collation result. More specifically, the extracted acoustic parameters are fitted to the reference acoustic data (e.g., the GMM) of each registered speaker to obtain a likelihood, and information (e.g., the name) on the registered speaker corresponding to the reference acoustic data (e.g., the GMM) that yields the maximum likelihood is output to the display unit D as the collation result. The user, who is the speaker, can look at the collation result displayed on the display unit D and judge whether it is correct.
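The maximum-likelihood collation described here could be sketched as below, assuming per-speaker models with a score() method such as those from the enrollment sketch above (scikit-learn's score() returns the average per-frame log-likelihood).

```python
# Identification sketch: score the utterance against every registered
# speaker's model and return the maximum-likelihood speaker.
def identify(frames, voice_feature_db):
    scores = {name: gmm.score(frames) for name, gmm in voice_feature_db.items()}
    best = max(scores, key=scores.get)
    return best, scores              # name to display, plus all similarities
```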
[0023] The speaker feature input unit 5 is a personal feature input device for inputting personal feature information as speaker feature data indicating a unique feature of the speaker (individual) other than the voice. For example, if the personal feature information is fingerprint data, the personal feature input device is a fingerprint sensor; if the personal feature information is a password, it is a keyboard, a touch panel, or the like. Various known kinds of personal feature information, such as an iris, can also be applied.
[0024] The speaker feature DB 6 stores and registers reference personal feature information for each of a plurality of registered speakers, which serves as the collation reference for the personal feature information input by the speaker feature input unit 5.
[0025] The speaker feature data collation unit 7 compares and collates the personal feature information input by the speaker feature input unit 5 with the reference personal feature information stored in the speaker feature DB 6, and judges whether the collation result (also called a recognition result or an authentication result) is correct (for example, whether a password matching the input password is registered in the speaker feature DB 6). When it judges from the collation result that the answer is correct, the speaker feature data collation unit 7 identifies (determines), from the speaker feature DB 6, the registered speaker corresponding to the correct answer (for example, the registered speaker whose password matched).
[0026] When the collation result by the speaker feature data collation unit 7 is correct, the voice feature DB update unit 8 updates the reference acoustic data of the identified registered speaker stored in the voice feature DB 3, using the acoustic parameters extracted by the voice feature data extraction unit 2. For this update, MAP estimation, for example, is used.
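The patent names MAP estimation but not a specific recipe. The sketch below shows one common form, means-only relevance-MAP adaptation of a GMM (Reynolds-style); the relevance factor of 16 is a conventional choice, not a value from the patent.

```python
# MAP-update sketch: nudge the stored GMM means toward the new utterance's
# frames, weighting each component by how much data it actually absorbed.
import numpy as np

def map_update_means(gmm, frames, relevance=16.0):
    resp = gmm.predict_proba(frames)                  # (num_frames, n_components)
    n_k = resp.sum(axis=0)                            # soft frame count per component
    x_bar = (resp.T @ frames) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]        # 0 = keep old mean, 1 = new data
    gmm.means_ = alpha * x_bar + (1.0 - alpha) * gmm.means_
    return gmm
```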
[0027] Next, the operation of the speaker recognition device S according to the present embodiment will be described with reference to FIG. 2.
[0028] FIG. 2 is a flowchart showing the speaker recognition processing performed by the processing unit P of the speaker recognition device S according to the present embodiment.
[0029] In the processing of FIG. 2, when a voice uttered by a speaker is input through the voice input unit 1, acoustic parameters are calculated (extracted) from the input voice by the voice feature data extraction unit 2 (step S1).
[0030] Next, the calculated (extracted) acoustic parameters and the reference acoustic data stored in the voice feature DB 3 are compared and collated by the voice feature data collation unit 4, and the collation result (recognition result) is output to the display unit D (step S2).
[0031] By fitting the voice feature data to the registered speakers' patterns registered in the voice feature DB, a similarity (distance or likelihood) is obtained for each speaker. When an utterance that produces an incorrect answer is input, however, the similarities (distances or likelihoods) of the individual speakers often lie close to one another; a correct or incorrect answer can therefore be determined automatically by applying a threshold to each speaker's similarity (distance or likelihood). Specifically, the voice feature data collation unit 4 of FIG. 1 judges whether the collation result displayed on the display unit D (information on the corresponding registered speaker, e.g., a name) is correct; when it is correct, an input indicating the correct answer is made via the voice input unit 1, the speaker feature input unit 5, or the like. The processing unit P, having recognized this input, determines that the collation result is correct (step S3: YES), and the processing ends.
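One plausible reading of this automatic threshold judgment is sketched below: accept the top-scoring speaker only when its similarity clears an absolute floor and stands clear of the runner-up (wrong matches tend to produce bunched scores). Both threshold values are illustrative assumptions.

```python
# Automatic correct/incorrect judgment sketch for step S3.
def auto_judge(scores, min_score=-40.0, min_margin=2.0):
    """scores: {name: average log-likelihood}; True means 'treat as correct'."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) == 1:
        return ranked[0] >= min_score
    return ranked[0] >= min_score and (ranked[0] - ranked[1]) >= min_margin
```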
[0032] When the answer is incorrect, on the other hand, an input indicating the incorrect answer is made via the voice input unit 1, the speaker feature input unit 5, or the like. The processing unit P, having recognized this input, determines that the collation result is incorrect (step S3: NO), and the processing proceeds to step S4. The above correct/incorrect determination is a function incorporated in the voice feature data collation unit 4 of FIG. 1; its input is, again, the similarity of each registered speaker from the voice feature data collation unit 4, and its output is a message shown on the display unit. When the result is judged correct, the message is the recognition result; when it is judged incorrect, the message prompts the user, for example, "Please input personal feature information into the speaker feature input unit 5." Incidentally, when the threshold determination is not applied, the recognition result is output regardless of whether the answer is correct or incorrect.
[0033] Instead of the voice feature data collation unit 4 of FIG. 1 judging whether the collation result displayed on the display unit D is correct, the speaker who sees the collation result displayed on the display unit D (information on the corresponding registered speaker, e.g., a name) may judge whether the collation result is correct.
[0034] Alternatively, the processing unit P may be configured to determine the result to be correct (or incorrect) when no input indicating an incorrect (or correct) answer is made within a predetermined time (e.g., 10 seconds) after the collation result is output, and to determine the result to be incorrect (or correct) when such an input is made within that period.
[0035] In step S4, the speaker is prompted to input personal feature information, and in response, the personal feature information is input by the speaker through the speaker feature input unit 5. The input personal feature information and the reference personal feature information stored in the speaker feature DB 6 are then compared and collated by the speaker feature data collation unit 7, and it is judged whether the collation result is correct. When the collation result is incorrect, for example when no reference personal feature information matching the input personal feature information (a match within a predetermined margin) is stored in the speaker feature DB 6 (step S5: NO), the processing ends. When the collation result is correct (step S5: YES), the registered speaker corresponding to the correct answer is identified and the voice feature DB 3 is updated (step S6). In this update of the voice feature DB 3, the reference acoustic data of the identified registered speaker, stored in the voice feature DB 3 in correspondence with the acoustic parameters, is updated using the acoustic parameters extracted by the voice feature data extraction unit 2.
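Pulling steps S1 through S6 together, the sketch below strings the earlier illustrative helpers (auto_judge, map_update_means) into the FIG. 2 decision flow, with plain dicts for the two DBs and a password lookup standing in for the personal-feature collation; none of these names or structures come from the patent.

```python
# End-to-end sketch of the FIG. 2 flow.
def recognize_and_maybe_update(frames, voice_db, speaker_db, entered_password):
    # S1-S2: score the utterance against every registered speaker's model.
    scores = {name: gmm.score(frames) for name, gmm in voice_db.items()}
    best = max(scores, key=scores.get)
    if auto_judge(scores):                            # S3: YES -> done
        return best
    # S3: NO -> S4/S5: fall back to the personal feature (a password here).
    matched = next((n for n, pw in speaker_db.items()
                    if pw == entered_password), None)
    if matched is None:                               # S5: NO -> end, no update
        return None
    map_update_means(voice_db[matched], frames)       # S6: update voice feature DB 3
    return matched
```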
[0036] As described above, according to the present embodiment, a voice uttered by a speaker is input, acoustic parameters indicating features of the input voice are extracted, and the extracted acoustic parameters are compared and collated with the reference acoustic data stored in the voice feature DB 3. When the collation result is not correct, personal feature information indicating a speaker feature other than the voice is input by the speaker, and the input personal feature information is compared and collated with the reference personal feature information stored in the speaker feature DB 6. When this collation result is correct, the reference acoustic data stored in the voice feature DB 3 in correspondence with the acoustic parameters is updated using the extracted acoustic parameters. The speaker can thus be identified from information other than the voice, so the voice feature DB 3 can be updated even for voice patterns that are prone to recognition errors, and the recognition performance can be improved.
[0037] Specifically, since an utterance whose collation result was incorrect is still considered to carry sufficient characteristics of the speaker, the device is configured to update the voice feature DB 3 by actively using incorrectly recognized utterances as well. As a result, speaker recognition can be used with confidence even when voice quality has changed owing to poor physical condition, aging, or the like.
[0038] In the present embodiment, the voice feature DB update unit 8 updates the reference acoustic data of the identified registered speaker stored in the voice feature DB 3, using the acoustic parameters extracted by the voice feature data extraction unit 2, only when the collation result by the voice feature data collation unit 4 is incorrect and the collation result by the speaker feature data collation unit 7 is correct.
[0039] As another embodiment, however, the voice feature DB update unit 8 may also update the reference acoustic data of the identified registered speaker stored in the voice feature DB 3, using the acoustic parameters extracted by the voice feature data extraction unit 2, when the collation result by the voice feature data collation unit 4 is correct, as shown in FIG. 3 (step S23: YES, step S26). Since updates are then applied more often than with the method shown in FIG. 2, a further improvement in speaker recognition accuracy can be expected.
[0040] Furthermore, in the present embodiment, for personal features whose recognition rate cannot be guaranteed to be 100%, such as fingerprints and irises, it is more desirable to configure the device to update the speaker feature DB 6 as well, aiming to improve the accuracy of the authentication result. Unlike the comparison of acoustic parameters against reference acoustic data, however, whether the collation by the speaker feature data collation unit 7 is correct cannot be judged from a further collation process; the configuration therefore also includes processing in which the speaker feature data collation unit 7 applies a threshold to the speaker similarity corresponding to the collation result and infers whether it is correct or incorrect.
[0041] FIG. 4 is a diagram showing a schematic configuration example of the speaker recognition device S in the case where the speaker feature DB 6 is also updated. In the configuration shown in FIG. 4, the same components as in FIG. 1 are given the same reference numerals, and duplicate description is omitted. In the speaker recognition device S of FIG. 4, a speaker feature DB update unit 9 is added. When the collation result by the speaker feature data collation unit 7 described above is judged correct (that is, when the speaker similarity exceeds the threshold), the speaker feature DB update unit 9 updates the reference personal feature information, stored in the speaker feature DB 6 in correspondence with the personal feature information, using the personal feature information input through the speaker feature input unit 5. The accuracy of the authentication result can thereby be improved.
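The patent does not fix how the speaker feature DB update unit 9 rewrites the stored reference. For a template-style biometric, one simple possibility is the blend sketched below; the threshold and blend weight are invented for illustration, and for an exact-match feature such as a password there is nothing to blend.

```python
# Reference-template update sketch for the FIG. 4 variant.
import numpy as np

def update_reference_template(stored, new, similarity, threshold=0.9, weight=0.1):
    """Blend the stored template toward the new sample only on a confident match."""
    if similarity <= threshold:
        return stored                 # below threshold: leave the reference unchanged
    return (1.0 - weight) * stored + weight * new
```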
[0042] The present invention is not limited to the above embodiment. The above embodiment is an example; anything that has substantially the same configuration as the technical idea described in the claims of the present invention and exhibits the same operation and effects is included in the technical scope of the present invention.

Claims

[1] A speaker recognition device comprising:
voice input means for inputting a voice uttered by a speaker;
voice feature data extraction means for extracting voice feature data indicating a feature of the input voice;
voice feature data storage means for storing reference voice feature data serving as a collation reference for the voice feature data;
voice feature data collation means for comparing and collating the extracted voice feature data with the reference voice feature data stored in the voice feature data storage means;
speaker feature input means for inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice;
speaker feature data storage means for storing reference speaker feature data serving as a collation reference for the speaker feature data;
speaker feature data collation means for comparing and collating the input speaker feature data with the reference speaker feature data when a collation result by the voice feature data collation means is not correct; and
update means for updating, when a collation result by the speaker feature data collation means is correct, the reference voice feature data stored in the voice feature data storage means corresponding to the voice feature data, using the voice feature data extracted by the voice feature data extraction means.
[2] The speaker recognition device according to claim 1, wherein the update means updates the reference voice feature data stored in the voice feature data storage means corresponding to the voice feature data, using the voice feature data, also when the collation result by the voice feature data collation means is correct.
[3] The speaker recognition device according to claim 1 or 2, wherein the update means updates, when the collation result by the speaker feature data collation means is correct, the reference speaker feature data stored in the speaker feature data storage means corresponding to the speaker feature data, using the input speaker feature data.
[4] A speaker recognition method comprising:
a voice input step of inputting a voice uttered by a speaker;
a voice feature data extraction step of extracting voice feature data indicating a feature of the input voice;
a voice feature data storage step of storing reference voice feature data serving as a collation reference for the voice feature data;
a voice feature data collation step of comparing and collating the extracted voice feature data with the stored reference voice feature data;
a speaker feature input step of inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice;
a speaker feature data storage step of storing reference speaker feature data serving as a collation reference for the speaker feature data;
a speaker feature data collation step of comparing and collating the input speaker feature data with the reference speaker feature data when a collation result in the voice feature data collation step is not correct; and
an update step of updating, when a collation result in the speaker feature data collation step is correct, the stored reference voice feature data corresponding to the voice feature data, using the extracted voice feature data.
[5] A speaker recognition program causing a computer to function as:
voice input means for inputting a voice uttered by a speaker;
voice feature data extraction means for extracting voice feature data indicating a feature of the input voice;
voice feature data collation means for comparing and collating the extracted voice feature data with reference voice feature data that is stored in voice feature data storage means and serves as a collation reference for the voice feature data;
speaker feature input means for inputting, from the speaker, speaker feature data indicating a speaker feature other than the voice;
speaker feature data collation means for comparing and collating, when a collation result by the voice feature data collation means is not correct, the input speaker feature data with reference speaker feature data that is stored in speaker feature data storage means and serves as a collation reference for the speaker feature data; and
update means for updating, when a collation result by the speaker feature data collation means is correct, the stored reference voice feature data corresponding to the voice feature data, using the voice feature data extracted by the voice feature data extraction means.
話者特徴データ格納手段に格納されて!ヽる、前記話者特徴データの照合基準とな る基準話者特徴データとを比較照合する話者特徴データ照合手段、及び、 前記話者特徴データ照合手段による照合結果が正解である場合に、前記音声特 徴データ抽出手段により抽出された前記音声特徴データを用いて、該音声特徴デ ータに対応する前記格納された前記基準音声特徴データを更新する更新手段とし て機能させることを特徴とする話者認識プログラム。  Stored in the speaker feature data storage means! The speaker feature data collating means for comparing and collating with reference speaker feature data as a collation reference for the speaker feature data, and the speaker feature data collation When the collation result by the means is correct, the stored reference speech feature data corresponding to the speech feature data is updated using the speech feature data extracted by the speech feature data extraction unit. A speaker recognition program characterized by functioning as an updating means.
請求項 5に記載の話者認識プログラムが、前記コンピュータにより読取可能に記録さ れて!ヽることを特徴とする記録媒体。 6. A recording medium in which the speaker recognition program according to claim 5 is recorded so as to be readable by the computer.
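Taken together, claims 1 and 4 describe a two-stage flow: voice collation first, collation of a non-voice speaker feature as a fallback, and a conditional update of the stored reference voice feature data. The sketch below is one hypothetical arrangement of those steps; the feature extraction, the cosine-similarity collation, and all identifiers and thresholds are illustrative assumptions rather than anything fixed by the claims:

```python
import numpy as np

def extract_voice_features(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the voice feature data extraction step; the claims do
    not fix a particular feature type (cepstra, spectra, etc.)."""
    return audio[:128].astype(np.float64)  # placeholder transformation

def collate(features: np.ndarray, reference: np.ndarray, threshold: float) -> bool:
    """Compare-and-collate step: cosine similarity against a stored
    reference, judged correct when it exceeds the threshold."""
    denom = np.linalg.norm(features) * np.linalg.norm(reference) + 1e-12
    return float(np.dot(features, reference)) / denom > threshold

def recognize_speaker(audio: np.ndarray, speaker_features: np.ndarray,
                      speaker_id: str, voice_db: dict, speaker_db: dict,
                      voice_thr: float = 0.8, feat_thr: float = 0.8) -> bool:
    """Two-stage flow of claims 1 and 4, including the claim-1 update step."""
    voice_features = extract_voice_features(audio)
    if collate(voice_features, voice_db[speaker_id], voice_thr):
        # Voice collation correct; under claim 2 the reference voice
        # feature data could also be updated at this point.
        return True
    # Voice collation not correct: fall back to the non-voice speaker feature.
    if collate(speaker_features, speaker_db[speaker_id], feat_thr):
        # Speaker feature collation correct: update the stored reference
        # voice feature data using the features just extracted.
        voice_db[speaker_id] = 0.9 * voice_db[speaker_id] + 0.1 * voice_features
        return True
    return False
```

The point of the fallback order is that a voice template that has drifted (for example, because of a cold or an aging voice) can still be refreshed once the speaker is confirmed by the secondary feature, which is exactly the situation the updating means of claim 1 targets.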
PCT/JP2006/315839 2006-08-10 2006-08-10 Speaker recognizing device, speaker recognizing method, etc. WO2008018136A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2006/315839 WO2008018136A1 (en) 2006-08-10 2006-08-10 Speaker recognizing device, speaker recognizing method, etc.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2006/315839 WO2008018136A1 (en) 2006-08-10 2006-08-10 Speaker recognizing device, speaker recognizing method, etc.

Publications (1)

Publication Number Publication Date
WO2008018136A1 true WO2008018136A1 (en) 2008-02-14

Family

ID=39032681

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/315839 WO2008018136A1 (en) 2006-08-10 2006-08-10 Speaker recognizing device, speaker recognizing method, etc.

Country Status (1)

Country Link
WO (1) WO2008018136A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS626300A * 1985-07-02 1987-01-13 Matsushita Electric Industrial Co., Ltd. Speaker verification device
JP2001503156A * 1996-10-15 2001-03-06 Swisscom AG Speaker identification method
JP2002221990A * 2001-01-25 2002-08-09 Matsushita Electric Ind Co Ltd Personal authentication device
JP3529049B2 * 2002-03-06 2004-05-24 Sony Corporation Learning device, learning method, and robot device
JP3727927B2 * 2003-02-10 2005-12-21 Toshiba Corporation Speaker verification device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023079815A1 * 2021-11-08 2023-05-11 Panasonic Intellectual Property Corporation of America Information processing method, information processing device, and information processing program

Similar Documents

Publication Publication Date Title
CN111566729B (en) Speaker identification with super-phrase voice segmentation for far-field and near-field voice assistance applications
US11735191B2 (en) Speaker recognition with assessment of audio frame contribution
JP4213716B2 (en) Voice authentication system
JP4588069B2 (en) Operator recognition device, operator recognition method, and operator recognition program
EP3740949B1 (en) Authenticating a user
US20180040325A1 (en) Speaker recognition
US9646613B2 (en) Methods and systems for splitting a digital signal
CN104462912B (en) Improved biometric password security
JP4897040B2 (en) Acoustic model registration device, speaker recognition device, acoustic model registration method, and acoustic model registration processing program
US11081115B2 (en) Speaker recognition
JPH1173195A (en) Method for authenticating speaker's proposed identification
WO2018088534A1 (en) Electronic device, control method for electronic device, and control program for electronic device
CN117378006A (en) Mixed multilingual text-dependent and text-independent speaker confirmation
CN113241059B (en) Voice wake-up method, device, equipment and storage medium
WO2007111169A1 (en) Speaker model registration device, method, and computer program in speaker recognition system
JP3849841B2 (en) Speaker recognition device
KR102098956B1 (en) Voice recognition apparatus and method of recognizing the voice
WO2008018136A1 (en) Speaker recognizing device, speaker recognizing method, etc.
JP3818063B2 (en) Personal authentication device
JP2001265387A (en) Speaker verification apparatus and method
JP3919314B2 (en) Speaker recognition apparatus and method
EP4506838A1 (en) Methods and systems for authenticating users
US20250046317A1 (en) Methods and systems for authenticating users
JP2000148187A (en) Speaker recognition method, apparatus using the method, and program recording medium therefor
JP3841342B2 (en) Speech recognition apparatus and speech recognition program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06782633

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 06782633

Country of ref document: EP

Kind code of ref document: A1