
CN112349267B - Synthesized voice detection method based on attention mechanism characteristics - Google Patents

Synthesized voice detection method based on attention mechanism characteristics

Info

Publication number
CN112349267B
CN112349267B (application CN202011172150.2A)
Authority
CN
China
Prior art keywords
unvoiced
voice
data
frame
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011172150.2A
Other languages
Chinese (zh)
Other versions
CN112349267A (en)
Inventor
靳嘉宇
魏建国
应翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011172150.2A
Publication of CN112349267A
Application granted
Publication of CN112349267B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 - Discriminating between voiced and unvoiced parts of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the field of synthesized-speech and converted-speech detection and aims to improve the robustness of the ASVspoof speaker-verification spoofing detection system. During feature extraction the unvoiced part of the speech is strengthened and the voiced part is weakened, which improves the robustness of the feature against spoofing attacks on automatic speaker verification systems. The invention discloses a synthesized speech detection method based on attention-mechanism features: an attention-mechanism algorithm trains a corresponding weight matrix for each test utterance, and the ratio of short-time energy to zero-crossing rate determines whether each framed segment is unvoiced or voiced; the extracted unvoiced data is then scored against the original data so that the data of the unvoiced part is highlighted and the data of the voiced part is weakened; a genuine-speech Gaussian mixture model and a spoofed-speech Gaussian mixture model are then trained on the resulting features and used for scoring and confirmation. The invention is mainly applied to the detection of synthesized and converted speech.

Description

Synthetic Speech Detection Method Based on Attention Mechanism Features

Technical Field

The invention relates to the field of synthesized-speech and converted-speech detection. Because unvoiced and voiced sounds differ in importance for the synthetic speech detection task, an attention-mechanism-based feature is designed that shows good robustness in distinguishing genuine speech from spoofed speech (synthesized and converted speech).

Background

In recent years, with the increasing maturity of speaker recognition technology, automatic speaker verification (ASV) systems have been widely deployed in scenarios such as voice assistants and online banking. However, many studies have revealed that speaker verification systems are vulnerable to various spoofing attacks, including synthesized speech and converted speech.

With the development and maturation of speech synthesis and voice conversion technology, synthesized and converted speech has become increasingly realistic, to the point that even the human ear finds it difficult to judge whether an utterance is genuine. This poses a serious security risk to speaker recognition systems. To prevent criminals from exploiting speech synthesis or voice conversion for illegal purposes, the Interspeech community held three Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) challenges between 2015 and 2019; the field has therefore attracted growing attention and a large number of solutions have been proposed.

Work in this field improves the robustness of ASVspoof systems mainly through better front-end feature extraction and back-end classifiers. For feature extraction, prior work has used Mel-frequency cepstral coefficients (MFCC), linear-frequency cepstral coefficients (LFCC), constant-Q cepstral coefficients (CQCC), and so on, and the Modified Group Delay (MGD) phase feature is also widely used; current experimental results indicate that phase features are more robust. On the classifier side, the Gaussian mixture model (GMM) and the support vector machine (SVM) were widely used in early work, and the best-performing classifier at present is the residual convolutional neural network (ResNet).

Beyond features and classifiers, many researchers have analyzed the influence of noise and reverberation on synthetic speech detection, the influence of different phonemes, and the role of the speaker and the spoken content.

Summary of the Invention

To overcome the deficiencies of the prior art, the purpose of the invention is to improve the robustness of the ASVspoof speaker-verification spoofing detection system by proposing a novel feature. The proposed attention mechanism strengthens the unvoiced part of the speech and weakens the voiced part during feature extraction, thereby improving the robustness of the feature against spoofing attacks on automatic speaker verification systems. To this end, the invention adopts the following technical solution: a synthetic speech detection method based on attention-mechanism features, in which an attention-mechanism algorithm trains a corresponding weight matrix for each test utterance, and the ratio of short-time energy to zero-crossing rate determines whether each frame is unvoiced or voiced; the extracted unvoiced data is then scored against the original data so that the unvoiced part is emphasized and the voiced part is weakened; a genuine-speech Gaussian mixture model and a spoofed-speech Gaussian mixture model are then trained on the resulting features and used for scoring and confirmation.

The specific steps are as follows:

Step 1, data preparation:

First, the data in the corpus is divided into a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used to check how well the model is trained, and the test set is used to verify the robustness of the model.

Step 2, speech signal processing:

The speech signal is pre-emphasized, framed, and windowed, and the spectrogram of the signal is obtained through the Fourier transform. Pre-emphasis boosts the high-frequency part of the speech and improves its resolution; framing yields quasi-stationary segments that satisfy the requirements of the Fourier transform.

Step 3, weight matrix training:

First, the unvoiced frames of the speech signal are extracted using short-time energy and zero-crossing rate. Next, the Euclidean distance is used to compute the similarity between every frame of the signal and the extracted unvoiced frames: an unvoiced frame of the original speech yields a relatively large value in the weight matrix, whereas a voiced frame yields a relatively small value. Finally, the resulting weight matrix is multiplied element-wise with the matrix representing the spectrogram of the original speech to obtain the final attention feature.

Step 4, classification model training:

The spectral features of the genuine data and of the spoofed data in the training set are used as input to a Gaussian mixture model (GMM), training one GMM for genuine speech and one GMM for spoofed speech.

Step 5, genuine/spoofed scoring:

The spectral features extracted from the development set are fed into the trained models, and the log-likelihood ratio is used as the score to judge whether the speech signal is genuine.

The detailed steps of the weight matrix training are as follows:

1) Extract unvoiced frames: the ratio of short-time energy to zero-crossing rate is used to separate unvoiced frames from voiced frames.

The short-time energy of each frame of the speech signal is computed as follows:

E_n = Σ_m [ x(m) · w(n − m) ]²

where x(m) is the framed signal and w(m) is the window function. The zero-crossing rate of each frame is computed as follows:

Z_n = (1/2) · Σ_m | sgn(x(m)) − sgn(x(m−1)) | · w(n − m)

where sgn(·) is the sign function:

sgn(x) = 1 for x ≥ 0, and sgn(x) = −1 for x < 0

The ratio of short-time energy to zero-crossing rate (EZR) is then computed. When EZR is greater than 0.02, the frame is a voiced frame; when EZR is greater than 0.002 and less than 0.02, the frame is an unvoiced frame; otherwise the frame is silence.

2) Compute the weight matrix:

The similarity between each frame of the original speech and the unvoiced frames extracted in the previous step is computed to obtain the corresponding entry of the weight matrix, as follows:

w_i = (1/n) · Σ_{j=1…n} sim(S_i, U_j),  i = 1, …, m

where S denotes the original speech frames, U denotes the unvoiced frames extracted from the original speech, sim(·,·) denotes the Euclidean-distance-based similarity between two frames, m is the number of original speech frames, and n is the number of unvoiced frames.

3) Dynamic feature extraction:

On the basis of the static features, the delta method is used to extract first-order and second-order dynamic features with the following formulas, where p = 2.

Δc_t = Σ_{i=1…p} i · (c_{t+i} − c_{t−i}) / ( 2 · Σ_{i=1…p} i² )

ΔΔc_t = Σ_{i=1…p} i · (Δc_{t+i} − Δc_{t−i}) / ( 2 · Σ_{i=1…p} i² )

The features and beneficial effects of the invention are as follows:

The invention uses an attention mechanism to strengthen the unvoiced part of the speech signal and weaken the voiced part, thereby extracting a feature based on an unvoiced/voiced-oriented attention mechanism. Compared with the baseline feature, linear-frequency cepstral coefficients (LFCC), this attention-mechanism feature (LFCC-USE) yields a 41.4% relative improvement in EER and a 40.2% relative improvement in t-DCF.

The results show that the method achieves good detection performance: both the equal error rate and the tandem detection cost function are reduced. This indicates that features based on the unvoiced/voiced-oriented attention mechanism make the final trained model more robust.

Brief Description of the Drawings

Fig. 1 is a flow chart of extracting the unvoiced segments from a speech signal.

Fig. 2 is a schematic diagram of attention weight matrix extraction.

Detailed Description

The purpose of the invention is to improve the robustness of the ASVspoof system by proposing a novel feature. Gajan Suthokumar et al. found that, in the ASVspoof2017 corpus, unvoiced sounds carry more discriminative information than voiced sounds. Therefore, an attention mechanism is proposed that strengthens the unvoiced part of the speech and weakens the voiced part during feature extraction, thereby improving the robustness of the feature against spoofing attacks on automatic speaker verification systems.

The technical solution that realizes the object of the invention is as follows:

A feature extraction method based on an attention mechanism. An attention-mechanism algorithm trains a corresponding weight matrix for each utterance in the ASVspoof2019 corpus. Since unvoiced sounds have low short-time energy and a high zero-crossing rate while voiced sounds have high short-time energy and a low zero-crossing rate, the ratio of short-time energy to zero-crossing rate can be used to decide whether each frame is unvoiced or voiced. The extracted unvoiced data is then scored against the original data so that the unvoiced part is emphasized and the voiced part is weakened. The resulting features are used to train a genuine-speech Gaussian mixture model and a spoofed-speech Gaussian mixture model, which are then used for scoring and confirmation. The experiment consists of five parts: data preparation, unvoiced feature extraction, feature extraction, classification model training, and genuine/spoofed scoring.

The invention proposes an attention-mechanism-based feature extraction method comprising the following steps:

Step 1, data preparation:

First, the data in the ASVspoof2019 corpus is divided into a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used to check how well the model is trained, and the test set is used to verify the robustness of the model.

Step 2, speech signal processing:

The speech signal is pre-emphasized, framed, and windowed, and the spectrogram of the signal is obtained through the Fourier transform. Pre-emphasis boosts the high-frequency part of the speech and improves its resolution; framing yields quasi-stationary segments that satisfy the requirements of the Fourier transform.

Step 3, weight matrix training:

First, the unvoiced frames of the speech signal are extracted using short-time energy and zero-crossing rate. Next, the Euclidean distance is used to compute the similarity between every frame of the signal and the extracted unvoiced frames: an unvoiced frame of the original speech yields a relatively large value in the weight matrix, whereas a voiced frame yields a relatively small value. Finally, the resulting weight matrix is multiplied element-wise with the matrix representing the spectrogram of the original speech to obtain the final attention feature.

Step 4, classification model training:

The spectral features of the genuine data and of the spoofed data in the training set are used as input to the GMM, training one GMM for genuine speech and one GMM for spoofed speech.

Step 5, genuine/spoofed scoring:

The spectral features extracted from the development set are fed into the trained models, and the log-likelihood ratio is used as the score to judge whether the speech signal is genuine.

The synthetic speech detection method based on the unvoiced/voiced-oriented attention mechanism implemented by the invention is described below with reference to the accompanying drawings; it mainly comprises the following steps:

Step 1, data preparation:

The data used in this invention is the Logical Access (LA) database of the ASVspoof2019 challenge. The LA database contains genuine speech and spoofed speech generated with 17 different text-to-speech (TTS) and voice conversion (VC) systems. The speech data for these 17 TTS and VC systems comes from the VCTK corpus but does not overlap with the data in the 2019 database. Six of the spoofing systems are designated as known attacks and the remaining eleven as unknown attacks. The LA database divides all data into a training set, a development set, and an evaluation set. The training and development sets contain only known attacks, while the evaluation set contains 2 known attacks and 11 unknown spoofing attacks. Among the 6 known attacks, 2 are VC systems and 4 are TTS systems.

Step 2, speech signal processing:

After the data is prepared, the speech data is first preprocessed. The purpose of preprocessing is to mitigate problems such as the high-frequency loss introduced by the human vocal apparatus or the recording equipment, which degrade the quality of the speech signal. After preprocessing, the signal is more uniform and smooth, which improves the quality of subsequent speech processing.

Preprocessing mainly includes pre-emphasis, framing, and windowing. The window function used here is the Hamming window.
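
For illustration, a minimal Python (NumPy) sketch of this preprocessing stage is given below. The pre-emphasis coefficient of 0.97 and the 25 ms frame length with 10 ms shift are common default values assumed here; the patent does not specify them.

```python
import numpy as np

def preprocess(signal, sr, pre_emph=0.97, frame_ms=25, shift_ms=10):
    """Pre-emphasis, framing, Hamming windowing and FFT; returns frames and spectrogram."""
    # Pre-emphasis: boost the high-frequency part of the speech.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // shift)

    # Framing: cut the signal into short, quasi-stationary segments.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)   # windowing

    # Magnitude spectrogram of the windowed frames (one row per frame).
    spectrogram = np.abs(np.fft.rfft(frames, axis=1))
    return frames, spectrogram
```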

Step 3, weight matrix training:

As shown in the unvoiced/voiced decision diagram of Fig. 1, this step contains three main parts.

1) Extract unvoiced frames: according to the short-time energy and zero-crossing rate characteristics of unvoiced and voiced sounds, unvoiced sounds have low short-time energy and a high zero-crossing rate, while voiced sounds have high short-time energy and a low zero-crossing rate. The ratio of short-time energy to zero-crossing rate is used to separate unvoiced frames from voiced frames.

The short-time energy of each frame of the speech signal is computed as follows:

E_n = Σ_m [ x(m) · w(n − m) ]²

where x(m) is the framed signal and w(m) is the window function. The zero-crossing rate of each frame is computed as follows:

Z_n = (1/2) · Σ_m | sgn(x(m)) − sgn(x(m−1)) | · w(n − m)

where sgn(·) is the sign function:

sgn(x) = 1 for x ≥ 0, and sgn(x) = −1 for x < 0

The ratio of short-time energy to zero-crossing rate (EZR) is then computed. When EZR is greater than 0.02, the frame is a voiced frame; when EZR is greater than 0.002 and less than 0.02, the frame is an unvoiced frame; otherwise the frame is silence. The extraction flow is shown in Fig. 1.
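
A minimal sketch of this decision rule, reusing the windowed frames produced by the preprocessing sketch above. The 0.002 and 0.02 thresholds are the ones stated in the text; since they depend on the signal scaling, they are assumed here to apply to normalized audio.

```python
import numpy as np

def ezr_labels(frames, voiced_thr=0.02, unvoiced_thr=0.002):
    """Label each windowed frame as 'voiced', 'unvoiced' or 'silence' via the energy/ZCR ratio."""
    # Short-time energy of each frame.
    energy = np.sum(frames ** 2, axis=1)
    # Zero-crossing rate: sign changes between neighbouring samples, normalized by frame length.
    signs = np.sign(frames)
    signs[signs == 0] = 1                      # sgn(0) is taken as +1
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1) / frames.shape[1]

    ezr = energy / (zcr + 1e-8)                # energy-to-ZCR ratio, guard against /0
    labels = np.full(len(frames), 'silence', dtype=object)
    labels[ezr > voiced_thr] = 'voiced'
    labels[(ezr > unvoiced_thr) & (ezr < voiced_thr)] = 'unvoiced'
    return labels

def extract_unvoiced(frames):
    """Return only the frames labelled as unvoiced."""
    return frames[ezr_labels(frames) == 'unvoiced']
```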

2) Compute the weight matrix:

The weight matrix computation borrows from the weight computation of the self-attention mechanism widely used in natural language processing. The similarity between each frame of the original speech and the unvoiced frames extracted in the previous step is computed to obtain the corresponding entries of the weight matrix. The extraction process is shown in Fig. 2, and the formula is as follows:

w_i = (1/n) · Σ_{j=1…n} sim(S_i, U_j),  i = 1, …, m

where S denotes the original speech frames, U denotes the unvoiced frames extracted from the original speech, sim(·,·) denotes the Euclidean-distance-based similarity between two frames, m is the number of original speech frames, and n is the number of unvoiced frames.

The resulting weight matrix is then multiplied element-wise with the spectrogram of the original speech to obtain the attention feature.
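
A minimal sketch of this step is given below. The exact similarity function of the original equation is not reproduced in the text, so the exponentiated, averaged Euclidean distance used here is an assumption that is consistent with the description (frames close to the unvoiced frames receive larger weights, voiced frames smaller ones); the element-wise weighting of the spectrogram follows the text.

```python
import numpy as np

def attention_feature(spectrogram, frames, unvoiced):
    """Weight each spectrogram frame by its similarity to the extracted unvoiced frames.

    spectrogram: (m, bins), frames: (m, frame_len), unvoiced: (n, frame_len).
    The similarity kernel below is an assumed form; the patent only states that
    Euclidean distance is used and that unvoiced frames receive larger weights.
    """
    # Pairwise Euclidean distances between every frame and every unvoiced frame: (m, n).
    dists = np.linalg.norm(frames[:, None, :] - unvoiced[None, :, :], axis=2)
    # Distance -> similarity, then average over the n unvoiced frames: one weight per frame.
    sims = np.exp(-dists / (dists.mean() + 1e-8))
    weights = sims.mean(axis=1)                        # shape (m,)
    # Per-frame (element-wise) weighting of the spectrogram.
    return spectrogram * weights[:, None]
```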

3) Dynamic feature extraction:

On the basis of the static features, the delta method is used to extract first-order and second-order dynamic features with the following formulas, where p = 2.

Δc_t = Σ_{i=1…p} i · (c_{t+i} − c_{t−i}) / ( 2 · Σ_{i=1…p} i² )

ΔΔc_t = Σ_{i=1…p} i · (Δc_{t+i} − Δc_{t−i}) / ( 2 · Σ_{i=1…p} i² )
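
A minimal sketch of this delta regression with p = 2, applied to a (frames × coefficients) static feature matrix. Repeating the first and last frames at the edges is a common convention assumed here; the patent does not specify the edge handling.

```python
import numpy as np

def delta(features, p=2):
    """First-order delta of a (frames, coeffs) matrix using a +/- p regression window."""
    denom = 2 * sum(i * i for i in range(1, p + 1))
    padded = np.pad(features, ((p, p), (0, 0)), mode='edge')   # repeat edge frames
    out = np.zeros_like(features)
    n = len(features)
    for i in range(1, p + 1):
        out += i * (padded[p + i:p + i + n] - padded[p - i:p - i + n])
    return out / denom

def add_dynamics(static):
    """Stack static, first-order (delta) and second-order (delta-delta) features."""
    d1 = delta(static)
    d2 = delta(d1)
    return np.concatenate([static, d1, d2], axis=1)
```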

Step 4, classification model training:

The features extracted in Step 3 are fed into the GMM for training. Since genuine and spoofed utterances are labeled in the training set, a GMM for genuine speech and a GMM for spoofed speech are obtained.
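
A minimal sketch of the two-model training step using scikit-learn. The number of mixture components (512 is a common choice in ASVspoof GMM baselines) and the diagonal covariance are assumptions not fixed by the patent; frames from all training utterances of a class are pooled.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(genuine_feats, spoof_feats, n_components=512):
    """Train one GMM on genuine features and one on spoofed features.

    genuine_feats / spoof_feats: lists of (frames, dims) feature matrices,
    one matrix per training utterance.
    """
    gmm_genuine = GaussianMixture(n_components=n_components,
                                  covariance_type='diag', max_iter=100)
    gmm_spoof = GaussianMixture(n_components=n_components,
                                covariance_type='diag', max_iter=100)
    gmm_genuine.fit(np.vstack(genuine_feats))   # pool frames of all genuine utterances
    gmm_spoof.fit(np.vstack(spoof_feats))       # pool frames of all spoofed utterances
    return gmm_genuine, gmm_spoof
```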

Step 5, genuine/spoofed scoring:

After feature extraction, the evaluation-set data is scored against the trained GMMs; maximum-likelihood (log-likelihood ratio) scoring is used here: when the score is greater than 0 the utterance is judged to be genuine speech, and when it is less than 0 it is judged to be spoofed speech.
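
A minimal sketch of this scoring step: the utterance score is the average per-frame log-likelihood under the genuine GMM minus that under the spoofed GMM, and a score above 0 is accepted as genuine, as described above.

```python
def llr_score(feats, gmm_genuine, gmm_spoof):
    """Log-likelihood ratio score for one utterance's (frames, dims) feature matrix."""
    # score_samples returns per-frame log-likelihoods; average over frames.
    return (gmm_genuine.score_samples(feats).mean()
            - gmm_spoof.score_samples(feats).mean())

def is_genuine(feats, gmm_genuine, gmm_spoof):
    """True if the utterance is judged genuine, False if judged spoofed."""
    return llr_score(feats, gmm_genuine, gmm_spoof) > 0.0
```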

The experimental results are evaluated with the equal error rate (EER) and the tandem detection cost function (t-DCF).

The above is only a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (2)

1. A synthetic speech detection method based on attention-mechanism features, characterized in that a corresponding weight matrix is trained for each test utterance by an attention-mechanism algorithm, and whether each framed segment of data is unvoiced or voiced is determined by the ratio of short-time energy to zero-crossing rate; the extracted unvoiced data and the original data are then scored so that the data of the unvoiced part is emphasized and the data of the voiced part is weakened; a genuine-speech Gaussian mixture model and a spoofed-speech Gaussian mixture model are respectively trained on the resulting features and scoring is used for confirmation, wherein the corresponding weight matrix is trained as follows: firstly, the unvoiced frames of the speech signal are extracted through short-time energy and zero-crossing rate; secondly, the Euclidean distance is used to compute the similarity between each frame of the speech signal and the extracted unvoiced frames, wherein an unvoiced frame of the original speech yields a relatively large value in the weight matrix and a voiced frame yields a relatively small value; finally, the obtained weight matrix is multiplied element-wise with the matrix represented by the spectrogram of the original speech to obtain the final attention feature;
the detailed steps of the weight matrix training are as follows:
1) extracting unvoiced frames: the ratio of short-time energy to zero-crossing rate is used to separate unvoiced frames from voiced frames, and the short-time energy of each frame of the speech signal is computed as follows:
E_n = Σ_m [ x(m) · w(n − m) ]²
wherein x(m) is the framed signal, w(m) is the window function, and the zero-crossing rate of each frame is computed as follows:
Z_n = (1/2) · Σ_m | sgn(x(m)) − sgn(x(m−1)) | · w(n − m)
wherein sgn(·) is the sign function:
sgn(x) = 1 for x ≥ 0, and sgn(x) = −1 for x < 0
calculating the ratio EZR of short-time energy to zero-crossing rate: when EZR is greater than 0.02 the frame is a voiced frame, when EZR is greater than 0.002 and less than 0.02 the frame is an unvoiced frame, and otherwise the frame is silence;
2) computing the weight matrix:
the similarity between each frame of the original speech and the unvoiced frames extracted in the previous step is computed to obtain the corresponding entry of the weight matrix, as follows:
w_i = (1/n) · Σ_{j=1…n} sim(S_i, U_j),  i = 1, …, m
wherein S denotes the original speech frames, U denotes the unvoiced frames extracted from the original speech, sim(·,·) denotes the Euclidean-distance-based similarity between two frames, m is the number of original speech frames, and n is the number of unvoiced frames;
3) dynamic feature extraction:
on the basis of the static features, first-order and second-order dynamic features are extracted using the delta method.
2. The synthetic speech detection method based on attention-mechanism features according to claim 1, characterized by comprising the following steps:
step 1: data preparation
firstly, dividing the data in a corpus into a training set, a validation set, and a test set, wherein the training set is used to train the model, the validation set is used to check the quality of model training, and the test set is used to verify the robustness of the model;
step 2: speech signal processing
pre-emphasizing, framing, and windowing the speech signal, and obtaining the spectrogram information of the speech signal through the Fourier transform;
step 3: weight matrix training
step 4: classification model training
using the spectral features of the genuine data and the spectral features of the spoofed data in the training set as input data of a Gaussian mixture model (GMM), and respectively training a GMM of genuine speech and a GMM of spoofed speech;
step 5: genuine/spoofed scoring
inputting the spectral features extracted from the development set into the trained models, and scoring with the log-likelihood ratio to determine the authenticity of the speech signal.
CN202011172150.2A 2020-10-28 2020-10-28 Synthesized voice detection method based on attention mechanism characteristics Active CN112349267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011172150.2A CN112349267B (en) 2020-10-28 2020-10-28 Synthesized voice detection method based on attention mechanism characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011172150.2A CN112349267B (en) 2020-10-28 2020-10-28 Synthesized voice detection method based on attention mechanism characteristics

Publications (2)

Publication Number Publication Date
CN112349267A CN112349267A (en) 2021-02-09
CN112349267B true CN112349267B (en) 2023-03-21

Family

ID=74358924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011172150.2A Active CN112349267B (en) 2020-10-28 2020-10-28 Synthesized voice detection method based on attention mechanism characteristics

Country Status (1)

Country Link
CN (1) CN112349267B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967712A (en) * 2021-02-25 2021-06-15 中山大学 Synthetic speech detection method based on autoregressive model coefficient
CN113012684B (en) * 2021-03-04 2022-05-31 电子科技大学 Synthesized voice detection method based on voice segmentation
WO2024261546A1 (en) * 2023-06-23 2024-12-26 Samsung Electronics Co., Ltd. System and method for distinguishing original voice from synthetic voice in an iot environment
CN119864053B (en) * 2025-01-08 2025-09-23 杭州电子科技大学 Fake voice detection method adopting attention of two-dimensional graph

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003514260A (en) * 1999-11-11 2003-04-15 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Tone features for speech recognition
US20070021958A1 (en) * 2005-07-22 2007-01-25 Erik Visser Robust separation of speech signals in a noisy environment
CN110931022A (en) * 2019-11-19 2020-03-27 天津大学 Voiceprint recognition method based on high and low frequency dynamic and static features
CN111816203A (en) * 2020-06-22 2020-10-23 天津大学 A synthetic speech detection method based on phoneme-level analysis to suppress the influence of phonemes

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003514260A (en) * 1999-11-11 2003-04-15 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Tone features for speech recognition
US20070021958A1 (en) * 2005-07-22 2007-01-25 Erik Visser Robust separation of speech signals in a noisy environment
CN110931022A (en) * 2019-11-19 2020-03-27 天津大学 Voiceprint recognition method based on high and low frequency dynamic and static features
CN111816203A (en) * 2020-06-22 2020-10-23 天津大学 A synthetic speech detection method based on phoneme-level analysis to suppress the influence of phonemes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Riadh Ajgou et al., "Novel Detection Algorithm of Speech Activity and the Impact of Speech Codecs on Remote Speaker Recognition System," WSEAS Transactions on Signal Processing, vol. 10, pp. 309-319, July 2014. *

Also Published As

Publication number Publication date
CN112349267A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN112349267B (en) Synthesized voice detection method based on attention mechanism characteristics
CN110491391B (en) Deception voice detection method based on deep neural network
Chen et al. ResNet and Model Fusion for Automatic Spoofing Detection.
Wu et al. A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case
Chen et al. Towards understanding and mitigating audio adversarial examples for speaker recognition
US7603275B2 (en) System, method and computer program product for verifying an identity using voiced to unvoiced classifiers
Gudnason et al. Voice source cepstrum coefficients for speaker identification
Irum et al. Speaker verification using deep neural networks: A
Hassan et al. Voice spoofing countermeasure for synthetic speech detection
JPH11507443A (en) Speaker identification system
CN105139857A (en) Countercheck method for automatically identifying speaker aiming to voice deception
EP1569200A1 (en) Identification of the presence of speech in digital audio data
Zheng et al. When automatic voice disguise meets automatic speaker verification
CN110459226A (en) A method of voice is detected by vocal print engine or machine sound carries out identity veritification
Chen et al. Speaker verification against synthetic speech
Zhang et al. Joint information from nonlinear and linear features for spoofing detection: An i-vector/DNN based approach
CN111816203A (en) A synthetic speech detection method based on phoneme-level analysis to suppress the influence of phonemes
CN109545191A (en) The real-time detection method of voice initial position in a kind of song
Xue et al. Cross-modal information fusion for voice spoofing detection
De Leon et al. Revisiting the security of speaker verification systems against imposture using synthetic speech
Mahesha et al. LP-Hillbert transform based MFCC for effective discrimination of stuttering dysfluencies
CN113012684B (en) Synthesized voice detection method based on voice segmentation
Weng et al. The SYSU system for the interspeech 2015 automatic speaker verification spoofing and countermeasures challenge
US20220108702A1 (en) Speaker recognition method
CN116665649A (en) Synthetic voice detection method based on prosody characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant