KR20230055025A

KR20230055025A - Device, method and computer program for recognizing voice

Info

Publication number: KR20230055025A
Application number: KR1020210138244A
Authority: KR
Inventors: 이정한
Original assignee: 주식회사 케이티
Priority date: 2021-10-18
Filing date: 2021-10-18
Publication date: 2023-04-25

Abstract

An apparatus for recognizing voice includes: a feature vector extraction unit that extracts a feature vector from voice data; a predictive model learning unit that trains a voice recognition model based on a joint vector including the feature vector and a voice presence probability value for the voice data; and a voice recognition unit that recognizes voice using the voice recognition model.

Description

Apparatus, method and computer program for recognizing voice {DEVICE, METHOD AND COMPUTER PROGRAM FOR RECOGNIZING VOICE}

The present invention relates to an apparatus, a method, and a computer program for recognizing voice.

Voice recognition technology converts an acoustic signal obtained through a sound sensor, such as a microphone, into words or sentences. A representative voice recognition service is 'Siri', Apple's voice-based personal assistant. Based on the voice commands of iPhone users, Siri provides convenience services such as mobile search, schedule management, phone calls, memos, and music playback.

In general, voice recognition technology captures an acoustic signal, removes noise, extracts features of the voice signal, and recognizes the voice by comparing those features with a voice model database.

Conventional voice recognition technology uses voice activity detection (VAD) to detect the sections of voice data in which a human voice is present and those in which it is absent. For example, based on the energy of the voice data, a section containing a human voice is marked '1' and a section without one is marked '0'.

However, because conventional voice recognition technology outputs only '0' or '1' for the presence of a human voice, recognition performance may deteriorate when the voice data contains noise.

Korean Registered Patent Publication No. 10-1066472 (registered September 15, 2011)
Korean Registered Patent Publication No. 10-0998567 (registered November 30, 2010)

The present invention is intended to solve the above problems of the prior art by providing a voice recognition apparatus, method, and computer program that can measure and quantify the sections of voice data in which a human voice may be present.

It is also intended to provide a voice recognition apparatus, method, and computer program that can effectively recognize a voice signal even when the voice data contains noise or the leading consonant has been cut off.

However, the technical problems to be addressed by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for achieving the above technical object, an embodiment of the present invention may provide a voice recognition apparatus including: a feature vector extraction unit that extracts a feature vector from voice data; a predictive model learning unit that trains a voice recognition model based on a joint vector including the feature vector and a voice presence probability value for the voice data; and a voice recognition unit that recognizes voice using the voice recognition model.

Another embodiment of the present invention may provide a voice recognition method including: extracting a feature vector from voice data; training a voice recognition model based on a joint vector including the feature vector and a voice presence probability value for the voice data; and recognizing voice using the voice recognition model.

Yet another embodiment of the present invention may provide a computer program stored on a computer-readable recording medium and including a sequence of instructions for recognizing voice, wherein the computer program, when executed by a computing device, includes a sequence of instructions that cause the device to extract a feature vector from voice data, train a voice recognition model based on a joint vector including the feature vector and a voice presence probability value for the voice data, and recognize voice using the voice recognition model.

The above means for solving the problem are merely illustrative and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description.

According to any one of the above means, it is possible to measure the sections of voice data in which a human voice may be present and to quantify the probability that a human voice is present in each such section.

It is therefore possible to provide a voice recognition apparatus, method, and computer program that can effectively recognize a human voice even in a voice signal whose voice data contains noise or whose leading consonant has been cut off in a specific section.

FIG. 1 is a configuration diagram of a voice recognition apparatus according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram for explaining the delay time and periodicity of a specific frame according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram for explaining examples of delay time values according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram for explaining a process of deriving a delay time value according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram for explaining a gamma filter according to an embodiment of the present invention.
FIG. 6 is a flowchart of a voice recognition method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the invention clearly, parts irrelevant to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that, unless stated otherwise, it may further include other components rather than excluding them, and it should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by a single piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a configuration diagram of a voice recognition apparatus according to an embodiment of the present invention. Referring to FIG. 1, the voice recognition apparatus 100 may include a feature vector extraction unit 110, a predictive model learning unit 120, and a voice recognition unit 130. However, the components 110 to 130 are shown merely as examples of components that can be controlled by the voice recognition apparatus 100.

The components of the voice recognition apparatus 100 of FIG. 1 are generally connected through a network. For example, as shown in FIG. 1, the feature vector extraction unit 110, the predictive model learning unit 120, and the voice recognition unit 130 may be connected simultaneously or at time intervals.

A network is a connection structure that allows nodes such as terminals and servers to exchange information with one another, and includes a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks. Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, Bluetooth, infrared communication, ultrasonic communication, visible light communication (VLC), and LiFi.

The voice recognition apparatus 100 may measure the sections of voice data in which a human voice may be present and quantify the probability that a human voice is present in each such section. For example, the voice recognition apparatus 100 may calculate the probability that a human voice is present in a specific frame among the plurality of frames contained in the voice data.

The voice recognition apparatus 100 can effectively recognize a human voice even in a voice signal whose voice data contains noise or whose leading consonant has been cut off in a specific section. For example, the voice recognition apparatus 100 can clearly recognize a voice signal for '학교에 갔다' ('I went to school') in which the leading consonant 'ㅎ' has been cut off.

Each component of the voice recognition apparatus 100 is described below.

The feature vector extraction unit 110 may extract a feature vector from voice data. For example, the feature vector extraction unit 110 may extract Mel-Frequency Cepstral Coefficients (MFCC), a voice-specific feature value, from the voice data.

Specifically, the feature vector extraction unit 110 may divide the voice data into frames and extract a spectrum by applying a Fast Fourier Transform (FFT) to each frame. The FFT is an algorithm that converts a signal into its frequency components; it is an optimized way of computing the Discrete Fourier Transform (DFT) more quickly.

The feature vector extraction unit 110 may calculate a mel spectrum by applying a mel filter bank to the extracted spectrum, and may then extract the feature vector, the MFCC, by performing cepstrum analysis on the mel spectrum.
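The steps above (per-frame spectrum, mel filter bank, cepstrum analysis) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 16 kHz sampling rate, the Hamming window, the 26 filters, and the 13 coefficients are assumptions, and a naive DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum (an FFT computes the same values faster)."""
    n = len(frame)
    # A Hamming window (assumed) reduces leakage at the frame edges.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filter_bank(n_filters, n_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    # Convert edge frequencies to (fractional) spectrum bin positions.
    edges = [hz / (sample_rate / 2) * (n_bins - 1) for hz in edges_hz]
    bank = []
    for f in range(1, n_filters + 1):
        lo, mid, hi = edges[f - 1], edges[f], edges[f + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate=16000, n_filters=26, n_ceps=13):
    spec = power_spectrum(frame)
    bank = mel_filter_bank(n_filters, len(spec), sample_rate)
    # Log mel energies; the small floor avoids log(0).
    log_mel = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
               for row in bank]
    # A DCT-II of the log mel spectrum yields the cepstral coefficients.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_mel))
            for c in range(n_ceps)]


# One 25 ms frame of a 440 Hz tone at the assumed 16 kHz sampling rate.
frame = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(400)]
coeffs = mfcc(frame)
```

In practice a library routine would replace the naive DFT; the sketch only shows how the three stages named in the text feed into one another.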

The predictive model learning unit 120 may train the voice recognition model based on a joint vector including the feature vector and a voice presence probability value for the voice data. For example, as in Equation 1 below, the predictive model learning unit 120 may generate the joint vector from the MFCC, the voice-specific feature value extracted from the voice data, and the probability value that voice is present.

<Equation 1>

Equation 1 is defined in terms of the joint vector, the feature vector, and the voice presence probability vector.
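As a rough illustration of how a joint vector might be assembled, the sketch below simply concatenates a per-frame feature vector with a voice presence probability vector. Concatenation is an assumption, since Equation 1 itself is not reproduced in this text, and the function name and the example values are hypothetical.

```python
def joint_vector(feature_vector, presence_vector):
    """Concatenate per-frame features with the voice presence probability
    vector (assumed form of the joint vector)."""
    return list(feature_vector) + list(presence_vector)


mfcc_frame = [0.0] * 13       # placeholder 13-dimensional MFCC vector
presence = [0.87]             # hypothetical voice presence probability
joint = joint_vector(mfcc_frame, presence)
```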

The predictive model learning unit 120 may calculate a voice presence probability value for each frame of the voice data based on the autocorrelation value and the cross-correlation ratio (CCR) of the voice data.

For example, the predictive model learning unit 120 may calculate the autocorrelation value using a normalized autocorrelation function (NACF), as in Equation 2 below.

<Equation 2>

Equation 2 is defined in terms of the input signal, the frame index m, and the delay time (lag time), which is a sample offset of the signal and may be expressed in seconds. Here, the delay time corresponds to the interval between the input signal and the earlier signal whose pattern is most similar to that of the input signal. In other words, it indicates the periodicity of the signal.

For example, referring to Equation 2, the predictive model learning unit 120 may multiply the original signal by a copy of the signal shifted in time by the delay time and then normalize the result to obtain the autocorrelation value. By calculating the autocorrelation value for each delay time, that is, for each candidate period, the predictive model learning unit 120 may extract the periodic component of the voice data.
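A normalized autocorrelation of this kind can be sketched as below. The exact normalization in Equation 2 is not recoverable from this text, so the energy normalization used here (dividing by the geometric mean of the two frame energies) is an assumption, as are the function and parameter names.

```python
import math


def nacf(signal, start, frame_len, lag):
    """Normalized autocorrelation of the frame at `start` against the
    signal `lag` samples earlier; the result lies in [-1, 1]."""
    if lag > start:
        raise ValueError("lag reaches before the start of the signal")
    x = signal[start:start + frame_len]
    y = signal[start - lag:start - lag + frame_len]
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0.0 else 0.0


# A tone with a period of 100 samples correlates fully at lag 100
# and is inverted at lag 50.
tone = [math.sin(2 * math.pi * i / 100) for i in range(1000)]
r_full = nacf(tone, start=400, frame_len=200, lag=100)
r_half = nacf(tone, start=400, frame_len=200, lag=50)
```

Values near 1 mean the frame repeats at that delay, which is why the autocorrelation exposes the periodic component of voiced speech.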

For example, the delay time in Equation 2 may be expressed in seconds as in Equation 3 below.

<Equation 3>

The two bounds in Equation 3 denote the interval in which the fundamental frequency may exist. For example, the fundamental frequency generally lies between 80 and 500 Hz, and this range can be expressed, as in Equation 4, as time-lag sample indices ranging from 200 down to 32.

<Equation 4>

The predictive model learning unit 120 may derive the fundamental frequency based on the autocorrelation function. For example, the fundamental frequency may be expressed as in Equation 5 below.

<Equation 5>
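The relation between the 80 to 500 Hz range and the lag indices 200 to 32 is consistent with a 16 kHz sampling rate (16000 / 80 = 200, 16000 / 500 = 32); the source does not state the rate, so it is an assumption here. Under that assumption, a fundamental frequency estimate can be sketched by searching the lag range for the highest normalized autocorrelation. The helper names are hypothetical.

```python
import math

SAMPLE_RATE = 16000  # assumed: 16000 / 80 = 200 and 16000 / 500 = 32


def lag_bounds(sample_rate, f0_min=80.0, f0_max=500.0):
    """Smallest and largest lag (in samples) for the f0 search range."""
    return round(sample_rate / f0_max), round(sample_rate / f0_min)


def nacf(signal, start, frame_len, lag):
    """Normalized autocorrelation between a frame and the signal `lag` samples earlier."""
    x = signal[start:start + frame_len]
    y = signal[start - lag:start - lag + frame_len]
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / den if den > 0.0 else 0.0


def estimate_f0(signal, start, frame_len, sample_rate=SAMPLE_RATE):
    """Pick the lag with the highest NACF and convert it to a frequency in Hz."""
    lag_min, lag_max = lag_bounds(sample_rate)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: nacf(signal, start, frame_len, lag))
    return best_lag, sample_rate / best_lag


# A 125 Hz tone sampled at 16 kHz repeats every 128 samples.
tone = [math.sin(2 * math.pi * 125 * i / SAMPLE_RATE) for i in range(1024)]
lag, f0 = estimate_f0(tone, start=400, frame_len=256)
```

Restricting the search to the lag range keeps the estimate inside the plausible pitch range and avoids picking a multiple of the true period that falls outside it.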

As the cross-correlation ratio, the predictive model learning unit 120 may calculate the similarity between a first frame of the voice data and a second frame, which is the frame preceding the first frame. For example, the predictive model learning unit 120 may calculate the cross-correlation ratio using Equation 6 below.

<Equation 6>

Equation 6 is defined in terms of the input signal and a filtered signal that may be calculated through convolution with a gamma filter (gammatone filter). For example, Equation 7 below may be used.

<Equation 7>

In Equations 6 and 7, n is the sample index and c is the channel, which divides the frequencies of frame index m into bands. For example, using Equations 6 and 7, the predictive model learning unit 120 may calculate the ratio between the sum of the samples in specific channels of the first frame and the corresponding sum for the second frame.

The predictive model learning unit 120 may derive the number of channels according to the fundamental frequency of the first frame. For example, it may derive the k adjacent channels in Equation 6 that include the fundamental frequency calculated using Equation 2.

For example, using Equation 2, the predictive model learning unit 120 may derive the frequency channels in Equation 6 that are multiples of the fundamental frequency. Here the range is 80 Hz to 1000 Hz, and each channel may be defined by a center frequency and a bandwidth.

Specifically, the fundamental frequency of an adult male is generally about 120 Hz, and that of adult females and children is about 300 Hz. Therefore, for an adult male, when the frequency of the current frame is 120 Hz, the predictive model learning unit 120 may derive {120, 240, 360, 480, 600, 720, 840, 960} as the k adjacent channels; for an adult female or a child, when the frequency of the current frame is 300 Hz, it may derive {300, 600, 900}.
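The channel lists in these examples are the multiples of the fundamental frequency that fall within 80 Hz to 1000 Hz. A small sketch, with a hypothetical function name, reproduces both lists:

```python
def harmonic_channels(f0, f_min=80, f_max=1000):
    """Center frequencies (Hz) at integer multiples of f0 within [f_min, f_max]."""
    channels = []
    k = 1
    while k * f0 <= f_max:
        if k * f0 >= f_min:
            channels.append(k * f0)
        k += 1
    return channels


male = harmonic_channels(120)
female_or_child = harmonic_channels(300)
```

A lower fundamental frequency therefore yields more channels (eight at 120 Hz, three at 300 Hz), which is what "deriving the number of channels according to the fundamental frequency" amounts to in these examples.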

Referring to Equations 6 and 7, as the similarity between the first frame and the second frame, the predictive model learning unit 120 may calculate, based on the derived number of channels, the ratio between the sum of at least one sample corresponding to the first frame and the sum of at least one sample corresponding to the second frame. For example, the predictive model learning unit 120 may extract only the multiples of the fundamental frequency and calculate the sum of the signal over the whole frame.

For example, the predictive model learning unit 120 may quantify the similarity between the first frame and the second frame using Equation 6. This can improve the continuity of the harmonic components of the fundamental frequency across consonant sections (unvoiced frames) and vowel sections (voiced frames).
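One way to picture such a ratio is sketched below. Several points are assumptions rather than the patent's definitions, because Equations 6 and 7 are not reproduced in this text: the "gamma filter" is read as a standard fourth-order gammatone filter with an ERB-based bandwidth, the "sum of samples" is taken as the energy of the filtered frame, and the sampling rate is 16 kHz. All function names are hypothetical.

```python
import math


def gammatone_ir(fc, sample_rate, n_taps=128, order=4):
    """Impulse response of a gammatone filter centered at fc (Hz)."""
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # equivalent rectangular bandwidth
    b = 1.019 * erb
    ir = []
    for i in range(n_taps):
        t = i / sample_rate
        ir.append(t ** (order - 1) * math.exp(-2 * math.pi * b * t)
                  * math.cos(2 * math.pi * fc * t))
    return ir


def band_energy(frame, ir):
    """Energy of the frame after causal convolution with one channel filter."""
    energy = 0.0
    for n in range(len(frame)):
        y = sum(ir[k] * frame[n - k] for k in range(min(n + 1, len(ir))))
        energy += y * y
    return energy


def cross_correlation_ratio(cur, prev, channels_hz, sample_rate=16000):
    """Ratio of summed channel energies: current frame over previous frame."""
    filters = [gammatone_ir(fc, sample_rate) for fc in channels_hz]
    cur_sum = sum(band_energy(cur, ir) for ir in filters)
    prev_sum = sum(band_energy(prev, ir) for ir in filters)
    return cur_sum / prev_sum if prev_sum > 0.0 else 0.0


# Doubling the amplitude of a frame quadruples its energy in every channel.
prev_frame = [math.sin(2 * math.pi * 300 * i / 16000) for i in range(200)]
cur_frame = [2.0 * s for s in prev_frame]
ratio = cross_correlation_ratio(cur_frame, prev_frame, [300, 600, 900])
```

A ratio close to 1 indicates that the harmonic channels carry similar energy in consecutive frames, which is the kind of frame-to-frame continuity the text describes.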

The predictive model learning unit 120 may calculate the voice presence probability value using the absolute value of the autocorrelation value. Based on that absolute value and the cross-correlation ratio, it may calculate a similarity function indicating the similarity between the first frame and the second frame, which is the frame preceding the first frame. For example, the predictive model learning unit 120 may calculate the similarity between the first frame and the second frame using Equation 8 below.

<Equation 8>

Equation 8 is defined in terms of the similarity function and the value obtained by applying an absolute value to the autocorrelation value calculated in Equation 2. For example, the predictive model learning unit 120 may use Equation 9 below to calculate the absolute value of the autocorrelation value, that is, the absolute value of the correlation value at the fundamental frequency.

<Equation 9>

Referring to Equation 9, the predictive model learning unit 120 may measure the degree of correlation by applying an absolute value to the autocorrelation value. For example, a larger absolute value may be judged to indicate a stronger relationship. In other words, the predictive model learning unit 120 can check numerically whether the fundamental frequency derived for a specific frame was derived correctly.

Referring to Equations 8 and 9, the predictive model learning unit 120 may calculate the similarity between the first frame and the second frame by multiplying the absolute value for the first frame by the cross-correlation ratio and dividing the product by the absolute value for the second frame.

The predictive model learning unit 120 may apply a sigmoid function to the similarity function to obtain the voice presence probability value as a value between 0 and 1. For example, the predictive model learning unit 120 may apply the sigmoid function to the similarity function using Equation 10 below.

<Equation 10>

Referring to Equation 10, the predictive model learning unit 120 applies the sigmoid function to the similarity function between the first frame and the second frame of the voice data to obtain the final voice presence probability value for the first frame. In other words, it quantifies the probability that a human voice is present in the first frame as a value between 0 and 1.
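The two steps described here (a similarity value from the absolute autocorrelation values and the cross-correlation ratio, followed by a sigmoid) can be sketched as below. The plain logistic function is an assumption: Equation 10 is not reproduced in this text and may include a scale or offset. With a non-negative similarity, the plain logistic used here only spans 0.5 to 1. The function names and the `eps` guard are hypothetical.

```python
import math


def similarity(r_cur, r_prev, ccr, eps=1e-10):
    """|R(m)| * CCR / |R(m-1)|, as described for Equations 8 and 9."""
    return abs(r_cur) * ccr / max(abs(r_prev), eps)


def voice_presence_probability(r_cur, r_prev, ccr):
    """Squash the similarity into (0, 1) with a logistic sigmoid (assumed form)."""
    s = similarity(r_cur, r_prev, ccr)
    return 1.0 / (1.0 + math.exp(-s))


p_voiced = voice_presence_probability(r_cur=0.9, r_prev=0.9, ccr=1.0)
p_silent = voice_presence_probability(r_cur=0.0, r_prev=0.5, ccr=1.0)
```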

In this way, the voice recognition apparatus 100 according to the present invention calculates the presence of a human voice in each section of the voice data, that is, in each frame, as a probability value between 0 and 1 rather than as a binary value of 0 or 1. Accordingly, the voice recognition apparatus 100 can effectively recognize a human voice even in a voice signal whose voice data contains noise or whose leading consonant has been cut off in a specific frame.

The voice recognition unit 130 may recognize voice using the voice recognition model. For example, the voice recognition unit 130 may recognize the human voice contained in voice data by using a voice recognition model trained on the feature vector and the voice presence probability value of that voice data.

In addition, the voice recognition unit 130 may combine the voice recognition model trained with the joint vector with a language model to recognize the voice contained in the voice data and detect a sentence. For example, the voice recognition unit 130 may use Equation 11 below.

<Equation 11>

Equation 11 is defined in terms of the joint vector, the acoustic model trained with the joint vector, the language model, and the sentence having the maximum a posteriori probability for the input voice data. Referring to Equation 11, the voice recognition unit 130 may recognize and detect a sentence contained in the voice data through the voice recognition model and the language model.
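The quantities listed for Equation 11 match the standard maximum a posteriori decoding rule of statistical speech recognition. A sketch of that rule is given below; the symbols are assumptions, since the original notation is not reproduced in this text: J denotes the joint vector sequence, W a candidate sentence, P(J | W) the acoustic model trained with the joint vector, and P(W) the language model.

```latex
\hat{W} = \arg\max_{W} P(W \mid J) = \arg\max_{W} P(J \mid W)\, P(W)
```

The second equality follows from Bayes' rule, because P(J) does not depend on W and can be dropped from the maximization.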

FIG. 2 is an exemplary diagram for explaining the delay time and periodicity of a specific frame according to an embodiment of the present invention.

Referring again to Equation 1, the voice recognition apparatus 100 may generate the joint vector using the feature vector extracted from the voice data and the voice presence probability vector. In Equation 1, the voice recognition apparatus 100 may use a three-dimensional voice presence probability vector.

For example, the voice presence probability vector of the first frame may include the voice presence probability value of the first frame, the delay time value 210 of the first frame, and the delta delay time value 220 of the first frame.

The voice recognition apparatus 100 may calculate the delay time value 210 for a specific frame, for example the first frame, using Equation 12 below.

<Equation 12>

In addition, the voice recognition apparatus 100 may calculate the delta delay time value 220 for a specific frame, for example the first frame, using Equation 13 below.

<Equation 13>
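The three-dimensional vector can be pictured with the sketch below. Equations 12 and 13 are not reproduced in this text, so the delta delay time is assumed here to be the frame-to-frame difference of the delay time value; the function name and the sample values are hypothetical.

```python
def presence_vectors(probabilities, lags):
    """Build [probability, delay time, delta delay time] for each frame.

    The delta is assumed to be the difference from the previous frame's
    delay time; the first frame gets a delta of 0.
    """
    vectors = []
    prev_lag = lags[0] if lags else 0
    for p, lag in zip(probabilities, lags):
        vectors.append([p, float(lag), float(lag - prev_lag)])
        prev_lag = lag
    return vectors


vectors = presence_vectors([0.9, 0.8, 0.2], [128, 130, 125])
```

Each such vector would then be appended to the frame's feature vector to form the joint vector.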

FIG. 3 is an exemplary diagram for explaining examples of delay time values according to an embodiment of the present invention. In (a) of FIG. 3, the signal of the first frame is most similar in pattern to the signal delayed by 2 from the signal of the first frame, and in (b), it is most similar to the signal delayed by 5.

In (c) of FIG. 3, the signal of the first frame is most similar in pattern to the signal delayed by 10 from the signal of the first frame, and in (d), it is most similar to the signal delayed by 18.

Accordingly, the delay time value may be 2 in (a) of FIG. 3, 5 in (b), 10 in (c), and 18 in (d). That is, the periodicity of the example shown in (a) of FIG. 3 may be 2, that of (b) may be 5, that of (c) may be 10, and that of (d) may be 18.

FIG. 4 is an exemplary diagram for explaining a process of deriving a delay time value according to an embodiment of the present invention. Referring to FIG. 4, the delay time value may be derived from the value 410 at which the correlation between the input signal and the earlier signals is highest.

For example, the voice recognition apparatus 100 may derive the delay time value based on the signal that is most similar to the signal of the first frame among the signals delayed from the first frame.

That is, the voice recognition apparatus 100 may derive the periodicity of the first frame based on the interval between the signal of the first frame and the most similar signal among the signals delayed from the first frame.
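The derivation of the delay time value from the most similar delayed signal can be illustrated with the following sketch. It assumes, for illustration only, that similarity is measured by normalized correlation and that the delay is counted in samples. Equation 12 is not reproduced here, so the function `estimate_delay` and its parameters are hypothetical.

```python
import numpy as np

def estimate_delay(signal, start, frame_len, max_lag):
    """Return the lag whose delayed segment best matches the frame, and its score."""
    frame = signal[start:start + frame_len]
    best_lag, best_score = 0, -np.inf
    for lag in range(1, max_lag + 1):
        if start - lag < 0:
            break  # not enough past signal for this lag
        past = signal[start - lag:start - lag + frame_len]
        denom = np.linalg.norm(frame) * np.linalg.norm(past)
        # Normalized correlation between the frame and the delayed segment
        score = float(np.dot(frame, past) / denom) if denom > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

# Synthetic signal with a period of 10 samples, similar to (c) of FIG. 3
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 10)
lag, score = estimate_delay(signal, start=100, frame_len=40, max_lag=18)
print(lag)  # 10
```

In this sketch the lag with the highest score plays the role of the delay time value, and therefore of the periodicity of the frame.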

FIG. 5 is an exemplary diagram for explaining a gamma filter according to an embodiment of the present invention. Referring to FIG. 5, the voice recognition apparatus 100 may apply a gamma filter to voice data ((a) of FIG. 5) to obtain data separated into the time domain and the frequency domain ((b) of FIG. 5).

Referring back to Equation 6, the voice recognition apparatus 100 may derive the number of channels according to the fundamental frequency and may calculate the cross-correlation ratio based on the sum of as many samples as the derived number of channels. In doing so, the voice recognition apparatus 100 may perform convolution by applying a gamma filter to the input signal and to as many samples as the number of channels.

For example, when the frequency range is 80 to 1000 Hz and the derived fundamental frequency is 100 Hz, the voice recognition apparatus 100 may derive 10 channels including the fundamental frequency (the corresponding term in Equation 6) and may base the calculation on the sum of 10 samples. As another example, when the frequency range is 80 to 1000 Hz and the derived fundamental frequency is 200 Hz, the voice recognition apparatus 100 may derive 5 channels including the fundamental frequency (the corresponding term in Equation 6) and may base the calculation on the sum of 5 samples. In this case, the voice recognition apparatus 100 may apply the gamma filter to calculate the total sum of the gamma components present in the corresponding channel and frame.
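The two numerical examples above are consistent with placing one channel at each integer multiple of the fundamental frequency up to the 1000 Hz upper limit. The following sketch illustrates that reading. The rule itself is an assumption inferred from the examples rather than one stated in the disclosure, and the function name `harmonic_channels` is hypothetical.

```python
def harmonic_channels(f0_hz, f_max_hz=1000.0):
    """List the multiples of f0 up to f_max; the list length is the channel count.

    Assumed rule: one channel per harmonic of the fundamental frequency,
    including the fundamental itself.
    """
    if f0_hz <= 0:
        raise ValueError("fundamental frequency must be positive")
    count = int(f_max_hz // f0_hz)
    return [f0_hz * k for k in range(1, count + 1)]

print(len(harmonic_channels(100.0)))  # 10
print(len(harmonic_channels(200.0)))  # 5
```

With a fundamental frequency of 100 Hz the sketch yields 10 channels, and with 200 Hz it yields 5 channels, matching the examples given above.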

FIG. 6 is a flowchart of a voice recognition method according to an embodiment of the present invention. The voice recognition method shown in FIG. 6 includes steps that are processed in time series according to the embodiments shown in FIGS. 1 to 5. Therefore, even where omitted below, the description given for the embodiments shown in FIGS. 1 to 5 also applies to the method by which the voice recognition apparatus recognizes voice.

In step S610, the voice recognition apparatus may extract a feature vector from voice data.

In step S620, the voice recognition apparatus may train a voice recognition model based on a joint vector that includes the feature vector and a voice presence probability value for the voice data.

In step S630, the voice recognition apparatus may recognize voice using the voice recognition model.

In the above description, steps S610 to S630 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as needed, and the order of the steps may be changed.

The method of recognizing voice in the voice recognition apparatus described with reference to FIGS. 1 to 6 may also be implemented in the form of a computer program stored in a computer-readable recording medium and executed by a computer, or in the form of a recording medium including instructions executable by a computer.

The computer-readable recording medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable recording medium may also include computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The above description of the present invention is provided for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: voice recognition apparatus
110: feature vector extraction unit
120: predictive model learning unit
130: voice recognition unit

Claims

A device for recognizing voice, comprising:
a feature vector extraction unit that extracts a feature vector from voice data;
a predictive model learning unit that trains a voice recognition model based on a joint vector including the feature vector and a voice presence probability value for the voice data; and
a voice recognition unit that recognizes voice using the voice recognition model.

The voice recognition device of claim 1,
wherein the predictive model learning unit
calculates a voice presence probability value for each frame included in the voice data based on an autocorrelation value and a cross-correlation ratio (CCR) of the voice data.

The voice recognition device of claim 2,
wherein the predictive model learning unit
calculates, as the cross-correlation ratio, a similarity between a first frame included in the voice data and a second frame that precedes the first frame.

The voice recognition device of claim 3,
wherein the predictive model learning unit
derives a number of channels according to a fundamental frequency of the first frame.

The voice recognition device of claim 4,
wherein the predictive model learning unit
calculates, as the similarity between the first frame and the second frame, a ratio between a sum of at least one sample corresponding to the first frame and a sum of at least one sample corresponding to the second frame, based on the derived number of channels.

The voice recognition device of claim 2,
wherein the predictive model learning unit
calculates the voice presence probability value using an absolute value of the autocorrelation value.

The voice recognition device of claim 6,
wherein the predictive model learning unit
calculates a similarity function indicating a similarity between a first frame and a second frame that precedes the first frame, based on the absolute value of the autocorrelation value and the cross-correlation ratio.

The voice recognition device of claim 7,
wherein the predictive model learning unit
calculates the voice presence probability value as a value between 0 and 1 by applying a sigmoid function to the similarity function.

The voice recognition device of claim 4,
wherein the predictive model learning unit
derives the fundamental frequency based on an autocorrelation function.

A voice recognition method, comprising:
extracting a feature vector from voice data;
training a voice recognition model based on a joint vector including the feature vector and a voice presence probability value for the voice data; and
recognizing voice using the voice recognition model.

The voice recognition method of claim 10,
wherein the training of the voice recognition model comprises
calculating a voice presence probability value for each frame included in the voice data based on an autocorrelation value and a cross-correlation ratio (CCR) of the voice data.

The voice recognition method of claim 11,
wherein the training of the voice recognition model further comprises
calculating, as the cross-correlation ratio, a similarity between a first frame included in the voice data and a second frame that precedes the first frame.

The voice recognition method of claim 12,
wherein the training of the voice recognition model further comprises
deriving a number of channels according to a fundamental frequency of the first frame.

The voice recognition method of claim 13,
wherein the training of the voice recognition model further comprises
calculating, as the similarity between the first frame and the second frame, a ratio between a sum of at least one sample corresponding to the first frame and a sum of at least one sample corresponding to the second frame, based on the derived number of channels.

The voice recognition method of claim 11,
wherein the training of the voice recognition model further comprises
calculating the voice presence probability value using an absolute value of the autocorrelation value.

The voice recognition method of claim 15,
wherein the training of the voice recognition model further comprises
calculating a similarity function indicating a similarity between a first frame and a second frame that precedes the first frame, based on the absolute value of the autocorrelation value and the cross-correlation ratio.

The voice recognition method of claim 16,
wherein the training of the voice recognition model further comprises
calculating the voice presence probability value as a value between 0 and 1 by applying a sigmoid function to the similarity function.

The voice recognition method of claim 13,
wherein the training of the voice recognition model further comprises
deriving the fundamental frequency based on an autocorrelation function.

A computer program stored in a computer-readable recording medium and comprising a sequence of instructions for recognizing voice,
wherein the computer program, when executed by a computing device, causes the computing device to:
extract a feature vector from voice data;
train a voice recognition model based on a joint vector including the feature vector and a voice presence probability value for the voice data; and
recognize voice using the voice recognition model.