
CN114038450A - Dialect identification method, dialect identification device, dialect identification equipment and storage medium - Google Patents


Info

Publication number
CN114038450A
CN114038450A
Authority
CN
China
Prior art keywords
dialect
voice
data
recognition
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111478141.0A
Other languages
Chinese (zh)
Inventor
汪雪
程刚
蒋志燕
陈诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN202111478141.0A
Publication of CN114038450A
Legal status: Pending

Classifications

    • G10L15/005 Speech recognition: language recognition
    • G10L15/063 Speech recognition: training (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L19/0212 Speech or audio coding using spectral analysis with orthogonal transformation
    • G10L25/24 Speech or voice analysis in which the extracted parameters are the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses a dialect identification method comprising the following steps: receiving dialect voice data input by a user and extracting voice features of the dialect voice data; performing similarity detection between the voice features and the training data corresponding to each dialect recognition model in a pre-built dialect model library, one model at a time, to obtain a similarity score between the voice features and each type of training data; taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model; and converting the dialect voice data using the target dialect recognition model to obtain a speech recognition result corresponding to the dialect voice data. The invention also provides a dialect identification apparatus, an electronic device, and a storage medium, and can solve the problem of low dialect identification accuracy.

Description

Dialect identification method, apparatus, device, and storage medium

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to a dialect identification method, apparatus, electronic device, and computer-readable storage medium.

Background

With the development of society, more and more software, such as input methods, navigation software, and intelligent question-answering systems, relies on speech recognition technology. Speech recognition is gradually becoming a key human-computer interaction technology in information technology. Current speech recognition systems mostly recognize Mandarin, yet dialects, as local varieties of the language, are still spoken by a large number of people; in particular, some older speakers cannot speak Mandarin and speak only their dialect. Speech recognition for dialects is therefore an important research topic.

Most current dialect recognition approaches train a model on a dialect and then use the trained model to recognize that dialect. Dialects, however, vary widely, and each dialect recognition model may recognize only one or a few of them, so it is usually necessary to train many dialect recognition models and combine them into a multi-dialect recognition model. When such a multi-dialect model is used, the type of dialect the user speaks is unknown, so a suitable dialect recognition model cannot be selected accurately, and dialect recognition accuracy suffers.

Summary of the Invention

The present invention provides a dialect identification method, apparatus, and computer-readable storage medium, whose main purpose is to solve the problem of low dialect identification accuracy.

To achieve the above object, the dialect identification method provided by the present invention includes:

receiving dialect voice data input by a user, and extracting voice features of the dialect voice data;

performing similarity detection between the voice features and the training data corresponding to each dialect recognition model in a pre-built dialect model library, one model at a time, to obtain a similarity score between the voice features and each type of training data;

taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model; and

converting the dialect voice data using the target dialect recognition model to obtain a speech recognition result corresponding to the dialect voice data.

Optionally, extracting the voice features of the dialect voice data includes:

converting the sound signal in the voice data into a digital signal; and

computing over the digital signal with a preset triangular band-pass filter bank to obtain the voice features corresponding to the voice data.

Optionally, computing over the digital signal with the preset triangular band-pass filter bank to obtain the voice features corresponding to the voice data includes:

performing pre-emphasis, framing, and windowing on the digital signal to obtain the frequency-domain energy;

performing a fast Fourier transform on the frequency-domain energy to obtain a spectrum;

applying the triangular band-pass filters to the spectrum to obtain log energies;

performing a discrete cosine transform on the log energies to obtain Mel-frequency cepstral coefficients; and

performing difference calculations on the Mel-frequency cepstral coefficients to obtain dynamic difference parameters, and taking the dynamic difference parameters as the voice features.

Optionally, performing similarity detection between the voice features and the training data corresponding to each dialect recognition model in the pre-built dialect model library, one model at a time, to obtain the similarity score between the voice features and each type of training data includes:

extracting, one by one, the voice features of the training data of each dialect recognition model;

calculating a distance value between the voice features of the training data and the voice features of the dialect voice data input by the user; and

calculating, from the distance value, a similarity score between the voice features of the dialect voice data input by the user and the voice features of each type of training data.

Optionally, converting the dialect voice data using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect voice data includes:

performing a preset number of convolution, pooling, and fully-connected operations on the dialect voice data using the target dialect recognition model to obtain an encoding vector; and

decoding the encoding vector using a preset activation function to obtain the speech recognition result.

Optionally, before taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model, the method further includes:

obtaining a random voice training set and multi-dialect voice training sets;

training a pre-built general speech model on the random voice training set to obtain the general speech model; and

performing adaptive training based on the general speech model and each of the multi-dialect voice training sets to obtain a plurality of dialect recognition models.

To solve the above problems, the present invention also provides a dialect identification apparatus, which includes:

a voice feature extraction module, configured to receive dialect voice data input by a user and extract voice features of the dialect voice data;

a target dialect recognition model determination module, configured to perform similarity detection between the voice features and the training data corresponding to each dialect recognition model in a pre-built dialect model library, one model at a time, to obtain a similarity score between the voice features and each type of training data, and to take the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model; and

a speech recognition result generation module, configured to convert the dialect voice data using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect voice data.

To solve the above problems, the present invention also provides an electronic device, which includes:

at least one processor; and

a memory communicatively connected to the at least one processor, wherein

the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the dialect identification method described above.

To solve the above problems, the present invention also provides a computer-readable storage medium storing at least one computer program, the at least one computer program being executed by a processor in an electronic device to implement the dialect identification method described above.

In the embodiments of the present invention, the corresponding dialect recognition model is selected using the voice features of the dialect voice data and the training data of the dialect recognition models in the dialect model library. Each dialect recognition model in the library corresponds to one dialect; recognizing the voice data with the model that matches its dialect improves recognition accuracy, and obtaining the speech recognition result for the dialect voice data through that model improves the accuracy of the result. The dialect identification method, apparatus, electronic device, and computer-readable storage medium proposed by the present invention can therefore solve the problem of low dialect identification accuracy.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a dialect identification method provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of obtaining similarity scores provided by an embodiment of the present invention;

FIG. 3 is a schematic flowchart of obtaining a speech recognition result provided by an embodiment of the present invention;

FIG. 4 is a functional block diagram of a dialect identification apparatus provided by an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device implementing the dialect identification method provided by an embodiment of the present invention.

The realization of the objects, functional characteristics, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are only intended to explain the present invention, not to limit it.

The embodiments of the present application provide a dialect identification method. The execution subject of the method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the method can be executed by software or hardware installed on a terminal device or a server device, and the software can be a blockchain platform. The server side includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like. The server may be an independent server, or a cloud server providing basic cloud-computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.

Referring to FIG. 1, which is a schematic flowchart of a dialect identification method provided by an embodiment of the present invention.

In this embodiment, the dialect identification method includes:

S1: receiving dialect voice data input by a user, and extracting voice features of the dialect voice data.

In this embodiment of the present invention, the dialect voice data may be in any dialect. A dialect, commonly known as local speech, is spoken only within a certain region; it is not a separate language independent of Chinese but a regional variety of it, such as Cantonese, Hokkien, or Hakka.

In this embodiment of the present invention, extracting the voice features of the dialect voice data includes:

converting the sound signal in the voice data into a digital signal; and

computing over the digital signal with a preset triangular band-pass filter bank to obtain the voice features corresponding to the voice data.

In this embodiment, the sound signal in the voice data may be converted into a digital signal through steps such as sampling, quantization, and encoding.

In detail, sampling obtains the amplitude of the sound signal at specific moments; through sampling, a time-continuous analog signal (the voice data) is converted into a time-discrete, amplitude-continuous signal.

In one embodiment of the present invention, the voice data may be sampled with a fixed time interval as the sampling period. In the quantization step, each sample, whose amplitude takes continuous values, is converted to a discrete-valued representation.
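As a concrete illustration of the sampling and quantization steps above, here is a minimal sketch in Python (NumPy). The 16 kHz rate, 16-bit depth, and the 440 Hz test tone are assumptions for the example, not values taken from the patent:

```python
import numpy as np

def sample_and_quantize(analog, duration_s=0.02, rate_hz=16000, bits=16):
    """Sample a continuous signal at a fixed period, then quantize each
    amplitude-continuous sample to a discrete integer level."""
    n = int(round(duration_s * rate_hz))
    t = np.arange(n) / rate_hz                       # fixed sampling period
    samples = analog(t)                              # time-discrete, amplitude-continuous
    levels = 2 ** (bits - 1)
    clipped = np.clip(samples, -1.0, 1.0 - 1.0 / levels)
    return np.round(clipped * levels).astype(np.int16)  # amplitude-discrete

# A 440 Hz tone stands in for the "sound signal" of the voice data.
pcm = sample_and_quantize(lambda t: 0.5 * np.sin(2 * np.pi * 440.0 * t))
print(pcm.shape, pcm.dtype)  # (320,) int16
```

Encoding (the third step named in the text) would then serialize these 16-bit integers, for example as PCM in a WAV container.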

Further, computing over the digital signal with the preset triangular band-pass filter bank to obtain the voice features corresponding to the voice data includes:

performing pre-emphasis, framing, and windowing on the digital signal to obtain the frequency-domain energy;

performing a fast Fourier transform on the frequency-domain energy to obtain a spectrum;

applying the triangular band-pass filters to the spectrum to obtain log energies;

performing a discrete cosine transform on the log energies to obtain Mel-frequency cepstral coefficients; and

performing difference calculations on the Mel-frequency cepstral coefficients to obtain dynamic difference parameters, and taking the dynamic difference parameters as the voice features.

In this embodiment of the present invention, pre-emphasis passes the digital signal through a high-pass filter to boost its high-frequency part, flattening the spectrum of the signal across the whole band from low to high frequencies. Framing divides the digital signal into frames, each containing the data collected within a preset unit of time. Windowing multiplies each frame of the digital signal by a Hamming window, which improves the continuity between the left and right ends of a frame. Together, pre-emphasis, framing, and windowing remove the effects of the vocal cords and lips during phonation, compensate for the high-frequency components of the speech signal suppressed by the articulatory system, and convert the windowed digital signal into an energy distribution in the frequency domain; different energy distributions can represent the characteristics of different speech.

The triangular band-pass filters reduce the amount of computation and, by smoothing the spectrum, suppress harmonics so as to highlight the formants of the speech. The tone or pitch of an utterance therefore does not appear in the Mel-frequency cepstral coefficients, so the coefficients are unaffected by differences in the pitch of the input speech. Standard Mel-frequency cepstral coefficients reflect only the static characteristics of speech; its dynamic characteristics can be described by the differential spectrum of those static features, and the dynamic difference parameters, which combine dynamic and static features, can effectively improve the recognition performance of the system.
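The pipeline described above can be sketched end to end in Python (NumPy). This is an illustrative implementation, not the patent's: the frame length (25 ms), step (10 ms), pre-emphasis coefficient, filter count, and coefficient count are common defaults assumed here:

```python
import numpy as np

def mfcc_with_deltas(signal, rate=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Pre-emphasis -> framing -> Hamming window -> FFT power spectrum ->
    triangular mel filter bank -> log -> DCT (MFCCs) -> first-order
    differences (the dynamic parameters the text keeps as features)."""
    # Pre-emphasis: high-pass boost of the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing: 25 ms frames with a 10 ms step.
    flen, fstep = int(0.025 * rate), int(0.010 * rate)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)            # windowing
    # Power spectrum via FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular band-pass filters spaced evenly on the mel scale.
    high = 2595.0 * np.log10(1.0 + rate / 2 / 700.0)
    hz_pts = 700.0 * (10.0 ** (np.linspace(0.0, high, n_mels + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)           # log filter-bank energies
    # DCT-II over the filter channels yields the cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    mfcc = log_energy @ dct.T
    # First-order difference: the dynamic parameters.
    delta = np.gradient(mfcc, axis=0)
    return np.hstack([mfcc, delta])                        # static + dynamic

feats = mfcc_with_deltas(np.random.default_rng(0).standard_normal(16000))
print(feats.shape)  # (98, 26): 13 static + 13 dynamic coefficients per frame
```

One second of 16 kHz audio yields 98 frames here; each frame carries the 13 static cepstral coefficients concatenated with their 13 first-order differences.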


S2: performing similarity detection between the voice features and the training data corresponding to each dialect recognition model in the pre-built dialect model library, one model at a time, to obtain a similarity score between the voice features and each type of training data.

In this embodiment of the present invention, each dialect recognition model in the pre-built dialect model library may be built as a convolutional neural network; after being trained on its own dialect voice training set, each model can convert the corresponding dialect speech into text data. For example, if dialect recognition model A is trained on Cantonese, the trained model A can convert Cantonese into text data. Each dialect voice training set includes a dialect label, which identifies the dialect type.

In this embodiment, referring to FIG. 2, performing similarity detection between the voice features and the training data corresponding to each dialect recognition model in the pre-built dialect model library, one model at a time, to obtain the similarity score between the voice features and each type of training data includes:

S21: extracting, one by one, the voice features of the training data of each dialect recognition model;

S22: calculating a distance value between the voice features of the training data and the voice features of the dialect voice data input by the user; and

S23: calculating, from the distance value, a similarity score between the voice features of the dialect voice data input by the user and the voice features of each type of training data.

Further, in this embodiment of the present invention, the distance value between the voice features of the training data and the voice features of the dialect voice data input by the user may be calculated by the following formula:

[Distance formula, rendered only as an image in the original: Figure BDA0003394363200000061]

where D is the distance value, Ri is the voice feature of the training data of the dialect recognition model, T is the voice feature of the dialect voice data input by the user, and θ is a preset coefficient.

For example, the voice feature of the training data of a dialect recognition model is A, and the distance value between A and the voice feature of the dialect voice data input by the user (a symbol rendered only as an image in the original), calculated by the above formula, is 40. The similarity score is then computed according to a preset rule; for example, with similarity score = 1 - distance value / 100, the similarity score for the voice feature of that training data is 0.6.

S3: taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model.

In this embodiment of the present invention, a higher similarity score indicates closer similarity to the voice features; the dialect recognition model with the highest score is therefore taken as the target dialect recognition model for the dialect voice data.

In one embodiment of the present invention, the dialect label of the training data with the highest similarity score may be taken as the dialect label of the dialect voice data, and this label may be shown to the user, for example through a front-end display, so that the user learns which dialect was identified from the voice data.

S4: converting the dialect voice data using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect voice data.

Referring to FIG. 3, in this embodiment of the present invention, S4 includes:

S41: performing a preset number of convolution, pooling, and fully-connected operations on the dialect voice data using the target dialect recognition model to obtain an encoding vector; and

S42: decoding the encoding vector using a preset activation function to obtain the speech recognition result.

In this embodiment of the present invention, the speech recognition result may be text data, and a CNN or an RNN may be used to perform the convolution, pooling, and fully-connected operations on the dialect voice data. For example, the network performs convolution, pooling, and fully-connected operations on the dialect voice data to obtain an encoding vector, and a single-layer neural network with a classifier then serves as the decoding layer to decode the encoding vector. The decoding layer uses an activation function such as softmax or sigmoid.
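A minimal sketch of the S42 decoding step in Python (NumPy), assuming the encoder stack has already produced one encoding vector per frame; the classifier weights, vocabulary, and all shapes are hypothetical placeholders:

```python
import numpy as np

def softmax(z):
    """Preset activation used by the decoding layer."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode(encoding, weights, bias, vocab):
    """Single-layer classifier as the decoding layer: map each per-frame
    encoding vector to a distribution over output tokens, emit the argmax.
    The conv/pool/fully-connected encoder is assumed to have run already."""
    probs = softmax(encoding @ weights + bias)   # (frames, vocab size)
    return "".join(vocab[i] for i in probs.argmax(axis=-1))

# Hypothetical shapes: 4 frames of 8-dim encodings, 3-token vocabulary.
rng = np.random.default_rng(1)
enc = rng.standard_normal((4, 8))
text = decode(enc, rng.standard_normal((8, 3)), np.zeros(3), "abc")
print(len(text))  # one emitted token per frame
```

A real system would decode over a larger vocabulary and typically collapse repeated or blank tokens; the sketch only shows the activation-then-argmax step named in the text.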

本发明另一个实施例中,所述将所述相似度分值最高的训练数据对应的方言识别模型作为目标方言识别模型之前,所述方法还包括:In another embodiment of the present invention, before the dialect recognition model corresponding to the training data with the highest similarity score is used as the target dialect recognition model, the method further includes:

获取随机语音训练集以及多方言语音训练集;Obtain random voice training sets and multi-dialect voice training sets;

利用预构建的通用语音模型根据所述随机语音训练集训练,得到通用语音模型;Utilize pre-built general speech model to train according to described random speech training set, obtain general speech model;

根据所述通用语音模型以及所述多方言语音训练集分别自适应训练,得到多个方言识别模型。According to the general speech model and the multi-dialect speech training set, adaptive training is performed respectively to obtain a plurality of dialect recognition models.

本发明实施例中,所述预构建的通用语音模型(UBM,Universal BackgroundModel)是通过先采集大量随机语音,训练得到的一个通用语音模型,然后使用部分方言语音数据,通过自适应算法调整通用语音模型的参数,得到方言识别模型。In the embodiment of the present invention, the pre-built universal voice model (UBM, Universal Background Model) is a universal voice model obtained by first collecting a large number of random voices and training, and then using part of the dialect voice data to adjust the universal voice through an adaptive algorithm The parameters of the model are obtained to obtain the dialect recognition model.

本发明实施例中,方言语音A的语音特征通过所述通用语音模型进行自适应,可以更加快速的获取方言语音A的方言识别模型,并且不需要过多的方言语音A作为语音特征的提取目标就可以得到方言识别模型。In the embodiment of the present invention, the voice features of the dialect voice A are adapted through the general voice model, so that the dialect recognition model of the dialect voice A can be obtained more quickly, and too many dialect voices A are not needed as the extraction target of voice features Then the dialect recognition model can be obtained.

In this embodiment, the speech features may be scattered around certain Gaussian components of the general speech model. The adaptive process shifts each Gaussian component of the general speech model toward the speech features, specifically: using the speech features to compute updated parameters (Gaussian weights, means, and variances) of the general speech model, and fusing the updated parameters with the original parameters of the general speech model, thereby obtaining a dialect recognition model suited to the speech features. Adaptive algorithms include, but are not limited to, maximum a posteriori (MAP) estimation and maximum likelihood linear regression (MLLR).
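The "compute updated parameters, then fuse with the original UBM parameters" step can be sketched for the means of a diagonal-covariance UBM. This is a simplified illustration under assumed toy dimensions, not the patent's implementation: it shows only the mean update of classical MAP adaptation (weights and variances can be adapted analogously), and the relevance factor of 16 is a conventional but assumed choice.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_vars, feats, relevance=16.0):
    """MAP-adapt the Gaussian means of a diagonal-covariance UBM toward
    new speech feature frames `feats` of shape (T, D)."""
    # Posterior responsibility of each of the M Gaussians for each frame
    # (log-likelihoods up to a constant; a sketch, not a full EM step).
    diff = feats[:, None, :] - ubm_means[None, :, :]          # (T, M, D)
    log_lik = -0.5 * np.sum(diff ** 2 / ubm_vars, axis=2)     # (T, M)
    log_lik += np.log(ubm_weights)
    post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    # Sufficient statistics -> per-component "new" means.
    n_k = post.sum(axis=0)                                    # soft counts (M,)
    first = post.T @ feats                                    # (M, D)
    new_means = first / np.maximum(n_k[:, None], 1e-10)

    # Fuse updated parameters with the original UBM parameters:
    # components that saw more data (large n_k) move further.
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * new_means + (1.0 - alpha) * ubm_means

# Toy example: a 2-component, 3-dimensional UBM adapted toward constant frames.
ubm_means = np.zeros((2, 3))
adapted = map_adapt_means(ubm_means, np.array([0.5, 0.5]),
                          np.ones((2, 3)), np.ones((10, 3)))
```

The data-dependent interpolation weight `alpha` is what makes this a fine-tuning step rather than retraining: components with little dialect data stay close to the general model.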

In this embodiment, the general speech model is fine-tuned toward the dialect recognition model for the speech features through the adaptive algorithm. By reducing the number of trainable parameters, this approach greatly reduces the sample size and time required for training.

In another embodiment of the present invention, after the dialect speech data is converted by the target dialect recognition model to obtain the corresponding speech recognition result, the method may further include: performing text conversion on the speech recognition result to obtain the Mandarin text corresponding to the dialect speech data.

In this embodiment, performing text conversion on the ordinary speech data obtained from the dialect speech data allows the user to view the text content of the dialect speech more intuitively.

In the embodiments of the present invention, the matching dialect recognition model is obtained from the speech features of the dialect speech data and the training data of the dialect recognition models in the dialect model library. Each dialect recognition model in the library corresponds to one dialect, and recognizing the dialect speech data with the model for that dialect improves recognition accuracy. The dialect recognition method proposed by the present invention can therefore solve the problem of low dialect recognition accuracy.
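The model-selection step at the heart of the method, scoring the user's speech feature against each model's training-data feature and picking the highest-scoring model, can be sketched as below. The library contents and the particular similarity function (an assumed distance-to-similarity mapping, 1/(1+distance)) are illustrative only; the patent does not fix a specific formula in this passage.

```python
import numpy as np

def select_dialect_model(user_feature, model_library):
    """Return the name and score of the dialect recognition model whose
    training-data feature vector is most similar to the user's feature."""
    best_name, best_score = None, -1.0
    for name, train_feature in model_library.items():
        distance = float(np.linalg.norm(user_feature - train_feature))
        score = 1.0 / (1.0 + distance)       # higher score = more similar
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Hypothetical library: one representative feature vector per dialect model.
library = {
    "dialect_A": np.array([1.0, 0.0, 0.0]),
    "dialect_B": np.array([0.0, 1.0, 0.0]),
}
target, score = select_dialect_model(np.array([0.9, 0.1, 0.0]), library)
```

The selected `target` would then be the key used to load the corresponding dialect recognition model for the conversion step.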

FIG. 4 is a functional block diagram of a dialect recognition apparatus provided by an embodiment of the present invention.

The dialect recognition apparatus 100 of the present invention may be installed in an electronic device. According to the functions implemented, the dialect recognition apparatus 100 may include a speech feature extraction module 101, a target dialect recognition model determination module 102, and a speech recognition result generation module 103. The modules of the present invention, which may also be called units, are series of computer program segments that can be executed by the processor of the electronic device, perform fixed functions, and are stored in the memory of the electronic device.

In this embodiment, the functions of the modules/units are as follows:

The speech feature extraction module 101 is configured to receive dialect speech data input by a user and to extract the speech features of the dialect speech data.

The target dialect recognition model determination module 102 is configured to perform similarity detection between the speech features and the training data corresponding to every dialect recognition model in a pre-constructed dialect model library, one model at a time, to obtain a similarity score between the speech features and each set of training data, and to take the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model.

The speech recognition result generation module 103 is configured to convert the dialect speech data using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect speech data.

In detail, when in use, the modules of the dialect recognition apparatus 100 in this embodiment employ the same technical means as the dialect recognition method described above with reference to FIG. 1 to FIG. 3 and can produce the same technical effects, which are not repeated here.

FIG. 5 is a schematic structural diagram of an electronic device implementing the dialect recognition method provided by an embodiment of the present invention.

The electronic device 1 may include a processor 10, a memory 11, a communication bus 12, and a communication interface 13, and may further include a computer program, such as a dialect recognition program, stored in the memory 11 and executable on the processor 10.

In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The processor 10 is the control unit of the electronic device: it connects the components of the entire device through various interfaces and lines, and executes the various functions of the device and processes data by running or executing programs or modules stored in the memory 11 (for example, the dialect recognition program) and by calling data stored in the memory 11.

The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disks, and optical discs. In some embodiments the memory 11 may be an internal storage unit of the electronic device, such as a removable hard disk of the device. In other embodiments the memory 11 may be an external storage device of the electronic device, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. Further, the memory 11 may include both an internal storage unit and an external storage device. The memory 11 can be used not only to store application software installed in the electronic device and various kinds of data, such as the code of the dialect recognition program, but also to temporarily store data that has been or will be output.

The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to implement connection and communication between the memory 11, the at least one processor 10, and other components.

The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is generally used to establish a communication connection between this electronic device and other electronic devices. The user interface may be a display or an input unit (such as a keyboard); optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display, which may also be called a display screen or display unit, is used to display information processed in the electronic device and to present a visual user interface.

FIG. 5 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 5 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.

For example, although not shown, the electronic device may further include a power supply (such as a battery) for powering the components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may further include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and other components. The electronic device may also include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described here.

It should be understood that the embodiments are for illustration only, and that the scope of the patent application is not limited by this structure.

The dialect recognition program stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run on the processor 10, can implement:

receiving dialect speech data input by a user, and extracting speech features of the dialect speech data;

performing similarity detection between the speech features and the training data corresponding to every dialect recognition model in a pre-constructed dialect model library, one model at a time, to obtain a similarity score between the speech features and each set of training data;

taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model;

converting the dialect speech data using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect speech data.

Specifically, for how the processor 10 implements the above instructions, reference may be made to the description of the relevant steps in the embodiments corresponding to the drawings, which is not repeated here.

Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).

The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:

receiving dialect speech data input by a user, and extracting speech features of the dialect speech data;

performing similarity detection between the speech features and the training data corresponding to every dialect recognition model in a pre-constructed dialect model library, one model at a time, to obtain a similarity score between the speech features and each set of training data;

taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model;

converting the dialect speech data using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect speech data.

In the several embodiments provided by the present invention, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only a division by logical function, and other divisions are possible in actual implementation.

The modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

It will be apparent to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that it may be embodied in other specific forms without departing from its spirit or essential characteristics.

Therefore, the embodiments are to be regarded in all respects as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the foregoing description, and all changes falling within the meaning and range of equivalents of the claims are therefore intended to be embraced by the present invention. Any reference signs in the claims shall not be construed as limiting the claims concerned.

The blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each of which contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.

The embodiments of the present application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) covers the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.

Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and that the singular does not exclude the plural. Multiple units or devices recited in the system claims may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Finally, it should be noted that the above embodiments are merely intended to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from their spirit and scope.

Claims (9)

1. A dialect identification method, the method comprising:
receiving dialect voice data input by a user, and extracting voice characteristics of the dialect voice data;
carrying out similarity detection between the speech features and training data corresponding to all dialect recognition models in a pre-constructed dialect model library, one by one, to obtain similarity scores of the speech features and each type of training data;
taking the dialect recognition model corresponding to the training data with the highest similarity score as a target dialect recognition model;
and converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
2. The dialect recognition method of claim 1, wherein said extracting speech features of the dialect speech data comprises:
converting the sound signals in the voice data into digital signals;
and calculating the digital signal by using a preset triangular band-pass filter to obtain the voice characteristics corresponding to the voice data.
3. The dialect recognition method of claim 2, wherein the calculating the digital signal by using a preset triangular band-pass filter to obtain the voice feature corresponding to the voice data comprises:
pre-emphasis, framing and windowing are carried out on the digital signal to obtain frequency domain energy;
performing fast Fourier transform on the frequency domain energy to obtain a frequency spectrum;
calculating the frequency spectrum by using the triangular band-pass filter to obtain logarithmic energy;
discrete cosine transform is carried out on the logarithmic energy to obtain a Mel frequency cepstrum coefficient;
and carrying out differential calculation according to the mel frequency cepstrum coefficient to obtain a dynamic differential parameter, and determining the dynamic differential parameter as a voice characteristic.
4. The dialect recognition method of claim 1, wherein the performing similarity detection with the speech features one by using training data corresponding to all dialect recognition models in a pre-constructed dialect model library to obtain similarity scores of the speech features and each kind of the training data comprises:
extracting the voice characteristics of the training data of each dialect recognition model one by one;
calculating a distance value between the voice feature of the training data and the voice feature of the dialect voice data input by the user;
and calculating the similarity score of the voice feature of the dialect voice data input by the user and the voice feature of each type of training data according to the distance value.
5. The dialect recognition method of claim 1, wherein the converting the dialect speech data using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect speech data comprises:
carrying out convolution, pooling and fully-connected operations on the dialect voice data a preset number of times by using the target dialect recognition model to obtain an encoding vector;
and decoding the coding vector by using a preset activation function to obtain a voice recognition result.
6. The dialect recognition method of any one of claims 1 to 5, wherein before the dialect recognition model corresponding to the training data with the highest similarity score is taken as the target dialect recognition model, the method further comprises:
acquiring a random voice training set and a multi-dialect voice training set;
training according to the random voice training set by using a pre-constructed general voice model to obtain a general voice model;
and respectively carrying out self-adaptive training according to the general voice model and the multi-dialect voice training set to obtain a plurality of dialect recognition models.
7. A dialect recognition apparatus, the apparatus comprising:
the voice feature extraction module is used for receiving dialect voice data input by a user and extracting voice features of the dialect voice data;
the target dialect recognition model determining module is used for carrying out similarity detection on the training data corresponding to all the dialect recognition models in the pre-constructed dialect model library and the voice features one by one to obtain similarity scores of the voice features and each type of the training data; taking the dialect recognition model corresponding to the training data with the highest similarity score as a target dialect recognition model;
and the voice recognition result generation module is used for converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
8. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the dialect identification method of any one of claims 1 to 7.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a dialect identification method according to any one of claims 1 to 7.
CN202111478141.0A 2021-12-06 2021-12-06 Dialect identification method, dialect identification device, dialect identification equipment and storage medium Pending CN114038450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111478141.0A CN114038450A (en) 2021-12-06 2021-12-06 Dialect identification method, dialect identification device, dialect identification equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114038450A true CN114038450A (en) 2022-02-11

Family

ID=80139879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111478141.0A Pending CN114038450A (en) 2021-12-06 2021-12-06 Dialect identification method, dialect identification device, dialect identification equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114038450A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758664A (en) * 2022-04-06 2022-07-15 维沃移动通信有限公司 Voice data screening method, apparatus, electronic device and readable storage medium
CN115762495A (en) * 2022-10-24 2023-03-07 深圳市捌零零在线科技有限公司 Voice recognition method and voice recognition device
CN118800216A (en) * 2024-07-17 2024-10-18 安徽中融芯航科技有限责任公司 Intelligent terminal system and device with dialect speech recognition

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8583432B1 (en) * 2012-07-18 2013-11-12 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN106971729A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of method and system that Application on Voiceprint Recognition speed is improved based on sound characteristic scope
CN110033765A (en) * 2019-04-11 2019-07-19 中国联合网络通信集团有限公司 A kind of method and terminal of speech recognition
CN111326157A (en) * 2020-01-20 2020-06-23 北京字节跳动网络技术有限公司 Text generation method and device, electronic equipment and computer readable medium
US20200327883A1 (en) * 2019-04-15 2020-10-15 Beijing Baidu Netcom Science And Techology Co., Ltd. Modeling method for speech recognition, apparatus and device
CN112908303A (en) * 2021-01-28 2021-06-04 广东优碧胜科技有限公司 Audio signal processing method and device and electronic equipment
CN113409774A (en) * 2021-07-20 2021-09-17 北京声智科技有限公司 Voice recognition method and device and electronic equipment




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220211