
CN111209429B - Unsupervised model training method and unsupervised model training device for measuring coverage of voice database - Google Patents


Info

Publication number
CN111209429B
CN111209429B
Authority
CN
China
Prior art keywords
evaluation
training
factor
training data
coverage
Prior art date
Legal status
Active
Application number
CN202010309303.7A
Other languages
Chinese (zh)
Other versions
CN111209429A (en)
Inventor
李科
张卫强
黄宇凯
郝玉峰
宋琼
Current Assignee
Beijing Speechocean Technology Co ltd
Tsinghua University
Original Assignee
Beijing Speechocean Technology Co ltd
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Beijing Speechocean Technology Co ltd and Tsinghua University
Priority to CN202010309303.7A
Publication of CN111209429A
Application granted
Publication of CN111209429B
Status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to an unsupervised model training method for measuring speech database coverage, the method comprising: acquiring training data, where the training data is speech; determining one or more evaluation factors for speech database coverage; dividing the evaluation factors into adjustable factors and non-adjustable factors, based on whether the training data can be controlled through parameter adjustment; determining a clustering algorithm corresponding to each divided evaluation factor; classifying the training data with the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and training an evaluation model from the plurality of subclasses of each evaluation factor. The method allows users to set, according to their needs, the evaluation factors against which a speech database is measured; by distinguishing the evaluation factors, it can purposefully extract different features and select suitable algorithms; and model training can be carried out on unsupervised data, reducing the cost introduced by data annotation.

Description

Unsupervised Model Training Method and Device for Measuring the Coverage of a Speech Database

Technical Field

The present disclosure relates to the field of speech signal processing, and in particular to an unsupervised model training method and device, an electronic device, and a computer-readable storage medium for measuring the coverage of a speech database.

Background

The coverage of a speech database is an important indicator of its quality; it refers to the degree to which the database covers the evaluation factors, such as the speakers' gender, the language, and the speech content. For example, when training a speech recognition system, speech from a large number of speakers must be collected for training. The better the coverage of the selected speech database, the broader the speech space it encompasses, which effectively reduces the impact of uneven sample distribution in that space.

Traditionally, the coverage of a speech database is obtained by relying on expert experience during the database design stage: when the collection plan is drawn up, the speech in the database is distributed as comprehensively as possible over the various evaluation factors. For a database that has already been collected, however, only indirect feedback can be obtained from indicators such as the recognition rate, and only after the speech signal has been processed and modeled. Moreover, when training a speech data evaluation model, incomplete partitioning of the training data and a lack of manually annotated samples make it difficult to build an accurate evaluation model.

Summary

To overcome the problems in the related art, the present disclosure provides an unsupervised model training method and device, an electronic device, and a computer-readable storage medium for measuring the coverage of a speech database.

According to a first aspect of the embodiments of the present disclosure, an unsupervised model training method for measuring the coverage of a speech database is provided. The method includes: acquiring training data, where the training data is speech; determining one or more evaluation factors for the coverage of the speech database; dividing the evaluation factors into adjustable factors and non-adjustable factors, based on whether the training data can be controlled with respect to each factor through parameter adjustment; determining a clustering algorithm corresponding to each of the divided evaluation factors; classifying the training data with the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and training an evaluation model from the plurality of subclasses of each evaluation factor.
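As a rough, non-authoritative sketch of the steps above, the method can be laid out as a pipeline. The factor names and the two placeholder helpers below are illustrative assumptions, not the patented implementation; the algorithm dispatch follows the adjustable/non-adjustable rule stated in the embodiments.

```python
# Illustrative sketch only: factor names and the two placeholder helpers
# are assumptions, not the patented implementation.

ADJUSTABLE = {"speech_rate", "pitch"}            # controllable via parameter adjustment
NON_ADJUSTABLE = {"gender", "accent", "language"}

def choose_algorithm(factor):
    """Determine the clustering algorithm per divided evaluation factor."""
    return "adaptive_training" if factor in ADJUSTABLE else "distance_based"

def cluster(data, factor, algorithm):
    """Placeholder: real clustering (K-means or adaptive GMM training) goes here."""
    return {0: data}

def train_evaluation_model(subclasses):
    """Placeholder for training an evaluation model from the subclasses."""
    return ("model", len(subclasses))

def train_coverage_models(training_data, factors):
    """Choose an algorithm per factor, cluster the speech, and train per-factor models."""
    models = {}
    for factor in factors:
        algorithm = choose_algorithm(factor)
        subclasses = cluster(training_data, factor, algorithm)
        models[factor] = train_evaluation_model(subclasses)
    return models

models = train_coverage_models(["utt1.wav", "utt2.wav"], ["gender", "speech_rate"])
```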

In an embodiment, determining the clustering algorithm corresponding to each of the divided evaluation factors includes: if the evaluation factor is a non-adjustable factor, determining its corresponding clustering algorithm to be a distance-based clustering algorithm; and if the evaluation factor is an adjustable factor, determining its corresponding clustering algorithm to be an adaptive training algorithm.

In an embodiment, classifying the training data with the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses includes: if the evaluation factor is a non-adjustable factor, extracting feature vectors from the training data; and, based on the feature vectors, dividing the training data into a plurality of subclasses with a distance-based clustering algorithm.

In an embodiment, the distance-based clustering algorithm is the K-means clustering algorithm.

In an embodiment, classifying the training data with the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses includes: if the evaluation factor is an adjustable factor, extracting feature vectors from the training data; training a Gaussian mixture model with the feature vectors and labeling the training data; and dividing the training data into a plurality of subclasses according to the labeled training data.

In an embodiment, training a Gaussian mixture model with the feature vectors and labeling the training data includes: training a Gaussian mixture model with the feature vectors; determining a control parameter according to the evaluation factor, where the control parameter can be adjusted to control the training data; traversing all values of the control parameter and transforming the training data accordingly; obtaining, for each training datum, the parameter value at which the feature vector of the transformed data maximizes the likelihood of the Gaussian mixture model; accumulating the likelihood over these parameter values; transforming the training data according to the parameter values to obtain new training data, and retraining until a stopping condition is reached; and taking, for each training datum, the parameter value that maximizes the likelihood of the Gaussian mixture model as the label of that datum.

In an embodiment, the stopping condition includes: the number of iterations reaching a preset threshold, or the rate of change between the accumulated likelihood and the accumulated likelihood of the previous iteration being smaller than a preset threshold.

In an embodiment, training an evaluation model from the plurality of subclasses of each evaluation factor includes: training one or more evaluation models on each subclass separately, or training a single evaluation model on the plurality of subclasses as a whole.

In an embodiment, the evaluation factors for the coverage of the speech database include one or more of the following: the speaker's gender, the speaker's age, the speaker's accent, speech rate, pitch, language, collection device, collection environment, pronunciation factors, or content topic.

According to a second aspect of the embodiments of the present disclosure, a method for measuring the coverage of a speech database is provided. The method includes: obtaining an evaluation model for each evaluation factor with the unsupervised model training method of the first aspect; acquiring the speech database to be evaluated, where the database includes at least one speech recording; detecting each recording in the database with the evaluation model of each evaluation factor to obtain the single-factor information entropy of the database for that factor; and determining the coverage of the database from the single-factor information entropy.
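The single-factor information entropy is not spelled out in this excerpt; a common reading, assumed here, is the Shannon entropy of the distribution of recordings over a factor's subclasses, where a more even spread means better coverage. A minimal sketch under that assumption:

```python
import math
from collections import Counter

def single_factor_entropy(labels):
    """Shannon entropy (in bits) of the subclass distribution for one factor.

    Assumption: `labels` holds the subclass predicted by the factor's
    evaluation model for each recording in the database under test.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A database whose recordings spread evenly over the subclasses has higher
# entropy, i.e. better coverage with respect to that factor.
even = single_factor_entropy(["male", "female"] * 50)       # 1.0 bit
skewed = single_factor_entropy(["male"] * 99 + ["female"])  # ~0.08 bit
```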

According to a third aspect of the embodiments of the present disclosure, an unsupervised model training device for measuring the coverage of a speech database is provided. The device includes: a data acquisition unit, configured to acquire training data, where the training data is speech; an evaluation factor determination unit, configured to determine one or more evaluation factors for the coverage of the speech database; a division unit, configured to divide the evaluation factors into adjustable factors and non-adjustable factors, based on whether the training data can be controlled with respect to each factor through parameter adjustment; an algorithm determination unit, configured to determine the clustering algorithm corresponding to each of the divided evaluation factors; a classification unit, configured to classify the training data with the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and a model training unit, configured to train an evaluation model from the plurality of subclasses of each evaluation factor.

In an embodiment, the algorithm determination unit is further configured to: when the evaluation factor is a non-adjustable factor, determine its corresponding clustering algorithm to be a distance-based clustering algorithm; and when the evaluation factor is an adjustable factor, determine its corresponding clustering algorithm to be an adaptive training algorithm.

In an embodiment, the classification unit is further configured to: when the evaluation factor is a non-adjustable factor, extract feature vectors from the training data; and, based on the feature vectors, divide the training data into a plurality of subclasses with a distance-based clustering algorithm.

In an embodiment, the distance-based clustering algorithm is the K-means clustering algorithm.

In an embodiment, the classification unit is further configured to: when the evaluation factor is an adjustable factor, extract feature vectors from the training data; train a Gaussian mixture model with the feature vectors and label the training data; and divide the training data into a plurality of subclasses according to the labeled training data.

In an embodiment, training a Gaussian mixture model with the feature vectors and labeling the training data includes: training a Gaussian mixture model with the feature vectors; determining a control parameter according to the evaluation factor, where the control parameter can be adjusted to control the training data; traversing all values of the control parameter and transforming the training data accordingly; obtaining, for each training datum, the parameter value at which the feature vector of the transformed data maximizes the likelihood of the Gaussian mixture model; accumulating the likelihood over these parameter values; transforming the training data according to the parameter values to obtain new training data, and retraining until a stopping condition is reached; and taking, for each training datum, the parameter value that maximizes the likelihood of the Gaussian mixture model as the label of that datum.

In an embodiment, the stopping condition includes: the number of iterations reaching a preset threshold, or the rate of change between the accumulated likelihood and the accumulated likelihood of the previous iteration being smaller than a preset threshold.

In an embodiment, the model training unit is further configured to: train one or more evaluation models on each subclass separately, or train a single evaluation model on the plurality of subclasses as a whole.

In an embodiment, the evaluation factors for the coverage of the speech database include one or more of the following: the speaker's gender, the speaker's age, the speaker's accent, speech rate, pitch, language, collection device, collection environment, pronunciation factors, or content topic.

According to a fourth aspect of the embodiments of the present disclosure, a device for measuring the coverage of a speech database is provided. The device includes: an evaluation model acquisition unit, configured to obtain an evaluation model for each evaluation factor with the unsupervised model training method of the first aspect; a speech database acquisition unit, configured to acquire the speech database to be evaluated, where the database includes at least one speech recording; a detection unit, configured to detect each recording in the database with the evaluation model of each evaluation factor to obtain the single-factor information entropy of the database for that factor; and an evaluation unit, configured to determine the coverage of the database from the single-factor information entropy.

According to a fifth aspect of the embodiments of the present disclosure, an electronic device is provided, including: a memory for storing instructions; and a processor for invoking the instructions stored in the memory to execute the unsupervised model training method of the first aspect for measuring the coverage of a speech database.

According to a sixth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, storing instructions that, when executed by a processor, perform the unsupervised model training method of the first aspect for measuring the coverage of a speech database.

The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects. First, different evaluation factors can be set according to the user's needs to measure the corresponding database; for example, for accent recognition and classification, an evaluation model can be built from factors such as accent, pitch, and speech rate, making the speech database evaluation model better targeted at the concrete application scenario. Second, distinguishing between different evaluation factors helps to extract the corresponding features and select suitable algorithms in subsequent processing, so that a more appropriate and accurate classification model can be built for each factor, which in turn helps to judge the coverage of the speech database accurately. Third, unsupervised data, i.e. data without annotations, can be used for model training, enabling a quantitative evaluation of the coverage of the speech database while reducing the cost introduced by data annotation.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain its principles.

Fig. 1 is a schematic flowchart of an unsupervised model training method for measuring the coverage of a speech database, according to an exemplary embodiment;

Fig. 2 is a schematic flowchart of another unsupervised model training method for measuring the coverage of a speech database, according to an exemplary embodiment;

Fig. 3 is a schematic flowchart of a method for measuring the coverage of a speech database, according to an exemplary embodiment;

Fig. 4 is a schematic block diagram of an unsupervised model training device for measuring the coverage of a speech database, according to an exemplary embodiment;

Fig. 5 is a schematic block diagram of a device, according to an exemplary embodiment;

Fig. 6 is a schematic block diagram of an electronic device, according to an exemplary embodiment.

Detailed Description

Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as recited in the appended claims.

The speech database coverage evaluation systems in common use today rely on expert-annotated data, and their partitioning of the training data is incomplete and lacks diversity. Some related techniques rely on supervised data, i.e. annotated data: the data must be divided into several subclasses according to the annotations, and a model is then trained from the label of each subclass. Such supervised data can only be obtained through manual annotation, which typically consumes a great deal of manpower, material resources, and time.

To solve the above problems, the present disclosure provides an unsupervised model training method 10 for measuring the coverage of a speech database. Referring to Fig. 1, the method includes steps S11 to S15, described in detail below.

Step S11: acquire training data, where the training data is speech.

The training data may come from one or more speech databases covering a wide range of speech with many possible classifications, which ensures the diversity of the data during model training.

Step S12: determine one or more evaluation factors for the coverage of the speech database.

In an embodiment of the present disclosure, the evaluation factors for the coverage of the speech database include one or more of the following: the speaker's gender, the speaker's age, the speaker's accent, speech rate, pitch, language, pronunciation factors, or content topic. During evaluation, the user can determine which evaluation factors the speech database needs to cover according to how the database will be used; for example, for accent recognition and classification, an evaluation model can be built from factors such as accent, pitch, and speech rate according to the user's needs. Through the steps of the above embodiment, the user can set the evaluation factors independently, making the speech database evaluation model better targeted at the concrete application scenario, so that the evaluation results better match the user's needs.

Step S13: divide the evaluation factors into adjustable factors and non-adjustable factors, based on whether the training data can be controlled with respect to each factor through parameter adjustment.

Specifically, for a given factor to be evaluated, it is first determined whether the speech can be adjusted with respect to that factor, i.e. whether there is some transformation by which one value of the factor can be converted into another. If such a transformation exists, the factor is considered adjustable: for example, when the factor is speech rate, time-domain scaling can make the speech faster or slower; likewise for pitch, frequency-domain scaling can lower or raise the pitch. If no such transformation exists, the factor is considered non-adjustable; for non-adjustable factors such as accent or language, current technology offers no simple transformation that converts one value into another. Distinguishing whether an evaluation factor can be adjusted helps developers choose an appropriate processing method for that factor, extract different features for different factors, and devise a processing algorithm better suited to each. When the overall evaluation model is subsequently built, the distinguished factors make the model architecture clearer and further improve the model's accuracy.
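The time-domain scaling mentioned above can be illustrated, in a deliberately naive form, by resampling a waveform over its sample index. This is only a toy stand-in for the (unspecified) transformation in the patent: plain resampling changes both duration and pitch, whereas a real speech-rate modifier would preserve pitch.

```python
import numpy as np

def time_scale(signal, alpha):
    """Naively time-scale a waveform by factor alpha via linear interpolation.

    alpha > 1 compresses the signal (faster), alpha < 1 stretches it (slower).
    Toy illustration of a parameter-controlled transformation only; real
    speech-rate modification (e.g. phase vocoder, WSOLA) preserves pitch.
    """
    n_out = int(round(len(signal) / alpha))
    old_idx = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(signal)), signal)

x = np.sin(np.linspace(0, 2 * np.pi, 1000))
faster = time_scale(x, 2.0)   # half as many samples
slower = time_scale(x, 0.5)   # twice as many samples
```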

Step S14: determine the clustering algorithm corresponding to each of the divided evaluation factors.

In an embodiment of the present disclosure, if the evaluation factor is non-adjustable, its corresponding clustering algorithm is determined to be a distance-based clustering algorithm; if the factor is adjustable, its corresponding clustering algorithm is determined to be an adaptive training algorithm. When no transformation exists for the factor, its categories are relatively clear: for example, speaker gender has two categories, and the language of a recording is relatively fixed. When the categories are clear and speech-related features can distinguish them well, a distance-based clustering algorithm is more intuitive and convenient, easy to implement, and highly accurate. When a transformation does exist for the factor, the category information is not directly reflected in the feature vector and can only be expressed through the measured feature parameters; in this case, an adaptive training algorithm can effectively set the feature parameters related to the evaluation factor and classify the data more precisely with respect to that factor.

Step S15: classify the training data with the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses.

In an embodiment of the present disclosure, classifying the training data with the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses includes: if the evaluation factor is a non-adjustable factor, extracting feature vectors from the training data; and, based on the feature vectors, dividing the training data into a plurality of subclasses with a distance-based clustering algorithm.

Specifically, if the evaluation factor is non-adjustable, vector representations of the relevant features of the training data are extracted. For example, for factors such as accent, language, or speaker, i-vectors or x-vectors can be extracted; for factors such as environment, device, or phoneme content, MFCC (Mel-frequency cepstral coefficient) feature vectors can be extracted. The data are then divided into several subclasses according to the extracted features. When extracting features, choosing feature vectors that effectively distinguish different speech under the given factor helps to improve clustering accuracy.

In an embodiment of the present disclosure, the distance-based clustering algorithm is the K-means clustering algorithm. K-means is easy to implement and is very efficient and practical for clustering problems.
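A plain K-means over per-utterance feature vectors might look as follows. The two-dimensional toy features stand in for the i-vectors, x-vectors, or MFCC statistics named above; the initialization and iteration count are illustrative choices, not the patent's.

```python
import numpy as np

def kmeans(features, k, n_iter=100, seed=0):
    """Plain K-means: partition (n, d) feature vectors into k subclasses."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(n_iter):
        # Assign every vector to its nearest center (Euclidean distance).
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned vectors.
        new = np.array([features[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Two well-separated toy clusters are recovered as two subclasses.
pts = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
                 np.random.default_rng(2).normal(5, 0.1, (20, 2))])
labels, centers = kmeans(pts, k=2)
```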

In an embodiment of the present disclosure, classifying the training data with the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses further includes: if the evaluation factor is an adjustable factor, extracting feature vectors from the training data; training a Gaussian mixture model with the feature vectors and labeling the training data; and dividing the training data into a plurality of subclasses according to the labeled training data. The Gaussian mixture model is a probabilistic model that assumes all samples are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. A linear combination of Gaussian probability density functions can approximate any density function, and speech features usually have smooth probability density functions; a finite number of Gaussians can therefore form a smooth approximation of the density function of the speech features and effectively distinguish speech with respect to adjustable factors.

在本公开的一实施例中，通过特征向量，训练高斯混合模型，标注训练数据，包括：通过特征向量训练高斯混合模型；根据评价因素，确定控制参数，控制参数可调整控制训练数据；遍历控制参数的所有取值，对训练数据进行变换；获取变换后的训练数据的特征向量使高斯混合模型似然度最大时的参数值；根据参数值累计似然度；根据参数值变换训练数据，得到新的训练数据，重新训练直到达到停止条件；将每个训练数据对应的使高斯混合模型似然度最大时的参数值作为训练数据的标注值。In an embodiment of the present disclosure, training a Gaussian mixture model with the feature vectors and labelling the training data includes: training the Gaussian mixture model with the feature vectors; determining a control parameter according to the evaluation factor, the control parameter being adjustable to control the training data; traversing all values of the control parameter and transforming the training data; obtaining, for each training datum, the parameter value at which the feature vector of the transformed datum maximizes the likelihood under the Gaussian mixture model; accumulating the likelihood according to that parameter value; transforming the training data according to the parameter value to obtain new training data, and retraining until a stop condition is reached; and taking, for each training datum, the parameter value that maximizes the likelihood under the Gaussian mixture model as the label of that datum.

在本公开的一实施例中,停止条件包括:迭代次数达到预设阈值,或累计似然度与上次迭代中的累计似然度变化率小于预设阈值。In an embodiment of the present disclosure, the stopping condition includes: the number of iterations reaches a preset threshold, or a rate of change between the accumulated likelihood and the accumulated likelihood in the previous iteration is less than a preset threshold.

具体地，可以采集所有数据的MFCC特征向量，训练一个GMM模型(Gaussian Mixture Model，高斯混合模型)；Specifically, the MFCC feature vectors of all data can be collected, and a GMM (Gaussian Mixture Model) can be trained;

令似然度累加器 L = 0，然后对所有数据逐一进行如下操作：Let the likelihood accumulator L = 0, and then perform the following operations on each datum in turn:

对数据 x 进行变换，变换控制参数为 α，得到 x′ = T(x; α)：例如对语速变换，α 指时域缩放因子；对音调，α 指频域缩放因子；The datum x is transformed with control parameter α to obtain x′ = T(x; α): for example, for speech-rate transformation, α is the time-domain scaling factor; for pitch, α is the frequency-domain scaling factor;

对 α 的取值离散化，遍历所有取值，使得 x′ 的MFCC特征对GMM模型 λ 的似然度最大化：Discretize the values of α and traverse all of them, so that the MFCC features of x′ maximize the likelihood under the GMM model λ:

max_α log p(MFCC(T(x; α)) | λ)

记录取得最大值时 α 的取值 α̂：Record the value α̂ of α at which the maximum is attained:

α̂ = argmax_α log p(MFCC(T(x; α)) | λ)

按取值 α̂ 累计似然度：Accumulate the likelihood at the value α̂:

L ← L + log p(MFCC(T(x; α̂)) | λ)

按取值 α̂ 对数据进行变换，得到新的数据：Transform the datum by the value α̂ to obtain a new datum:

x ← T(x; α̂)

继续采集所有数据的MFCC特征向量，训练GMM模型进行迭代，直到达到停止条件。在实施例中，我们将停止条件设置为似然度累加器 L 与上一次迭代的变化值小于预设阈值 ε，或迭代次数达到8次；Continue to collect the MFCC feature vectors of all data and retrain the GMM model iteratively until the stop condition is reached. In this embodiment, the stop condition is that the change of the likelihood accumulator L from the previous iteration is less than a preset threshold ε, or that the number of iterations reaches 8;

根据记录的每条数据取得最大值时 α 的取值 α̂，将训练数据分成M个子类。According to the recorded value α̂ of α at which the maximum was attained for each datum, the training data are divided into M subclasses.

对于可调因素下的语音分类，由于特征向量本身无法很好地体现出语音的可调整性，直接使用向量距离来度量的方法对于可调因素的聚类来说准确率不高。通过高斯混合模型的似然度来描述特征向量，在特征变换的过程中可以获得更多语音特征信息，可以对语音数据的分布具有更加精确有效的描述效果。同时，选取合适的停止条件有助于在保证算法精度的前提下提升算法运算速度。For speech classification under adjustable factors, since the feature vectors themselves cannot reflect the adjustability of speech well, measuring directly by vector distance gives low accuracy when clustering on adjustable factors. Describing the feature vectors through the likelihood of a Gaussian mixture model allows more speech feature information to be captured during feature transformation, giving a more accurate and effective description of the distribution of the speech data. At the same time, choosing an appropriate stop condition helps to increase the running speed of the algorithm while preserving its accuracy.
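下面用一维玩具数据演示上述自适应标注循环的骨架（以语速类可调因素为例）：每条"语音"压缩为一个标量特征，变换等价于按 log α 平移，GMM 简化为单高斯；α 网格、特征定义等均为示意性假设，并非实际实现。A one-dimensional toy demonstrating the skeleton of the adaptive labelling loop above (a speech-rate-like adjustable factor): each "utterance" is reduced to a scalar feature, the transform becomes a shift by log α, and the GMM is simplified to a single Gaussian; the α grid, feature definition, etc. are illustrative assumptions, not an actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Scalar stand-in features (think "log duration"): fast vs slow speakers,
# centred for the sketch near -log(1.33) and +log(1.33).
feats = np.concatenate([rng.normal(-0.285, 0.02, 30),
                        rng.normal(+0.285, 0.02, 30)])
alphas = np.array([0.5, 0.75, 1.0, 1.33, 2.0])  # discretized control values
total_alpha = np.ones_like(feats)               # cumulative alpha per datum
prev_L = -np.inf
for it in range(8):                               # hard cap of 8 iterations
    mu, sigma = feats.mean(), feats.std() + 1e-6  # "retrain" the 1-G model
    L = 0.0                                       # likelihood accumulator
    for i, x in enumerate(feats):
        cand = x - np.log(alphas)               # features after T(x; alpha)
        ll = -0.5 * ((cand - mu) / sigma) ** 2  # log-likelihood up to const
        best = ll.argmax()                      # alpha maximizing likelihood
        L += ll[best]                           # accumulate best likelihood
        feats[i] = cand[best]                   # transform the datum
        total_alpha[i] *= alphas[best]          # record cumulative label
    if abs(L - prev_L) < 1e-6:                  # stop: accumulator converged
        break
    prev_L = L
labels = (total_alpha > 1.0).astype(int)        # split into 2 subclasses
```

在该玩具设置下，"快"的一簇累计 α 小于 1、"慢"的一簇大于 1，即按标注值分成了两个子类。In this toy setting the "fast" cluster ends with cumulative α below 1 and the "slow" cluster above 1, i.e. the data are split into two subclasses by their labels.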

步骤S16,根据每个评价因素的多个子类,训练评价模型。In step S16, an evaluation model is trained according to the multiple sub-categories of each evaluation factor.

在本公开的一实施例中,将每个子类数据分别训练一个或多个评价模型,或将多个子类数据整体训练一个评价模型。对于每个评价因素的多个子类,可以根据需要训练相应的评价模型,便于后续判断语音数据库中每条语音相对于评价因素的信息熵,从而进一步判断语音数据库的覆盖性。In an embodiment of the present disclosure, one or more evaluation models are separately trained for each subclass of data, or one evaluation model is trained as a whole for multiple subclasses of data. For multiple sub-categories of each evaluation factor, the corresponding evaluation model can be trained as required, so as to facilitate the subsequent determination of the information entropy of each speech in the speech database relative to the evaluation factor, so as to further determine the coverage of the speech database.
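作为示意，可以为每个子类各训练一个简单的生成模型（此处用单对角高斯代替实际的评价模型），检测时将各子类得分归一化为子类条件概率 P(c_m | x)；模型形式与数据均为示意性假设。As a sketch, one simple generative model can be trained per subclass (a single diagonal Gaussian here standing in for the actual evaluation model); at detection time the per-subclass scores are normalized into sub-class conditional probabilities P(c_m | x); the model form and data are illustrative assumptions.

```python
import numpy as np

class SubclassModel:
    """One diagonal Gaussian per subclass, standing in for an evaluation model."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6
        return self

    def log_likelihood(self, x):
        return float(-0.5 * (((x - self.mu) ** 2) / self.var
                             + np.log(2 * np.pi * self.var)).sum())

def subclass_probs(models, x):
    """P(c_m | x) under equal priors: softmax over per-subclass log-likelihoods."""
    ll = np.array([m.log_likelihood(x) for m in models])
    ll -= ll.max()                  # subtract max for numerical stability
    p = np.exp(ll)
    return p / p.sum()

rng = np.random.default_rng(2)
# Two subclasses of 2-D "features"; one model trained per subclass.
models = [SubclassModel().fit(rng.normal(0.0, 0.3, (50, 2))),
          SubclassModel().fit(rng.normal(3.0, 0.3, (50, 2)))]
probs = subclass_probs(models, np.array([0.1, -0.1]))  # query near subclass 0
```

这些子类条件概率正是后续计算单因素信息熵所需的输入。These sub-class conditional probabilities are exactly the input needed later to compute the single-factor information entropy.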

本公开的一实施例的流程如图2所示，首先确定语音数据库覆盖性的评价因素；基于语音对应于评价因素是否存在变换方式，划分评价因素；若不存在变换方式，则直接提取特征向量，使用k均值聚类算法进行聚类，得到该评价因素下的多个子类，通过得到的子类构建该评价因素的模型；若存在变换方式，则提取MFCC向量，构建GMM模型，通过参数变换的方式确定似然度最大时的参数值，并对向量进行变换，输入GMM模型中继续迭代，通过似然度最大时的参数值标注数据，从而获得多个子类，通过得到的子类构建该评价因素的模型。上述实施例中，对于不同的评价要素，抽取了不同的特征向量、选取了对应的聚类算法，建立评价模型时更具有针对性；同时，算法全程采用无监督数据进行模型训练，降低了数据标注所引入的成本。The process of an embodiment of the present disclosure is shown in FIG. 2. First, the evaluation factors for the coverage of the speech database are determined; the evaluation factors are divided based on whether the speech has a transformation mode corresponding to each factor. If no transformation mode exists, feature vectors are extracted directly and clustered with the K-means algorithm to obtain multiple subclasses under that evaluation factor, and the model for that factor is built from the resulting subclasses. If a transformation mode exists, MFCC vectors are extracted and a GMM is built; the parameter value that maximizes the likelihood is determined by parameter transformation, the vectors are transformed and fed back into the GMM to continue iterating, and the data are labelled with the likelihood-maximizing parameter values, thereby obtaining multiple subclasses from which the model for that factor is built. In the above embodiment, different feature vectors are extracted and corresponding clustering algorithms are selected for different evaluation factors, making the evaluation model more targeted; at the same time, the algorithm uses unsupervised data for model training throughout, reducing the cost introduced by data annotation.

本公开还提供一种度量语音数据库覆盖性的方法20,如图3所示,度量语音数据库覆盖性的方法20包括步骤S21-步骤S24,详细说明如下:The present disclosure also provides a method 20 for measuring the coverage of a voice database. As shown in FIG. 3 , the method 20 for measuring the coverage of a voice database includes steps S21 to S24, and the details are as follows:

步骤S21:利用前述任一实施例的用于度量语音数据库覆盖性的无监督模型训练方法10,得到每个评价因素的评价模型。Step S21: Obtain an evaluation model for each evaluation factor by using the unsupervised model training method 10 for measuring the coverage of a speech database in any of the foregoing embodiments.

步骤S22:获取待评价的语音数据库,其中,语音数据库中包括至少一条语音。Step S22: Acquire a voice database to be evaluated, wherein the voice database includes at least one voice.

步骤S23:通过评价因素的评价模型对语音数据库中的每条语音进行检测,得到语音数据库与评价因素相对应的单因素信息熵。Step S23: Detecting each voice in the voice database through the evaluation model of the evaluation factors, and obtaining the single-factor information entropy corresponding to the evaluation factors in the voice database.

在本公开实施例中，根据确定语音数据库需要涉及的评价因素，将语音数据库中的各语音分别进行检测。评价模型可以对一种或多种评价因素进行检测。通过评价模型进行分类检测，确定语音数据库中的语音涉及各评价因素的单因素信息熵，得到语音数据库涉及各评价因素的可能性，便于确定语音数据库中是否涉及需要涉及的评价因素，以及存在的概率。各语音涉及的子类因素越多、越分散，得到的单因素信息熵的熵值越高，反之，各语音涉及的子类因素越集中，得到的单因素信息熵的熵值越低。In the embodiment of the present disclosure, each speech in the speech database is detected separately according to the evaluation factors that the speech database needs to cover. An evaluation model can detect one or more evaluation factors. Classification and detection through the evaluation models determine the single-factor information entropy of the speech in the database with respect to each evaluation factor, giving the likelihood that the database covers each factor; this makes it convenient to determine whether the required evaluation factors are covered, and with what probability. The more sub-class factors each speech involves and the more dispersed they are, the higher the resulting single-factor information entropy; conversely, the more concentrated the sub-class factors, the lower the entropy.

在一实施例中，基于评价因素，通过评价模型对每条语音进行分类检测，得到语音数据库中各语音与评价因素中多个子类因素相对应的子类条件概率；基于子类条件概率，得到语音数据库与评价因素相对应的单因素信息熵。In one embodiment, based on the evaluation factors, each speech is classified and detected through the evaluation model to obtain the sub-class conditional probabilities of each speech in the database with respect to the multiple sub-class factors of each evaluation factor; based on the sub-class conditional probabilities, the single-factor information entropy of the speech database corresponding to each evaluation factor is obtained.

具体地,根据确定需要涉及的评价因素,通过评价模型将对将语音数据库中各条语音均进行分类检测,确定各语音对应需要涉及的各评价因素条件下的概率。根据检测,能够得到语音数据库中各条语音在当前评价因素下对应各子类因素发生的条件概率,便于明确每条语音在各评价因素中涉及各子类因素条件下的发生概率,进而将每条语音在各子类因素下的子类条件概率进行整合,得到语音数据库在该评价因素下的单因素信息熵。Specifically, according to the evaluation factors that need to be involved, the evaluation model will classify and detect each speech in the speech database, and determine the probability of each speech corresponding to each evaluation factor that needs to be involved. According to the detection, the conditional probability of each speech in the speech database corresponding to each sub-category factor under the current evaluation factor can be obtained, which is convenient to clarify the occurrence probability of each speech under the condition that each sub-type factor is involved in each evaluation factor. The subclass conditional probabilities of the speech under each subclass factor are integrated to obtain the single-factor information entropy of the speech database under this evaluation factor.

在一实施例中,针对当前评价因素,通过将各条语音进行检测,能够得到各条语音在当前评价因素下对应各子类因素的子类条件概率,将其汇总进行平均,能够得到语音数据库在当前评价因素下对应各子类因素的子类平均条件概率,进而根据各子类因素下语音数据库的子类平均条件概率得到语音数据库在当前评价因素下的单因素信息熵。In one embodiment, for the current evaluation factor, by detecting each voice, the subclass conditional probabilities of each voice corresponding to each subclass factor under the current evaluation factor can be obtained, and they can be aggregated and averaged to obtain a voice database. Under the current evaluation factor, the subclass average conditional probability corresponding to each subclass factor is obtained, and then the single factor information entropy of the speech database under the current evaluation factor is obtained according to the subclass average conditional probability of the speech database under each subclass factor.

在一实施场景中，用 D = {x_1, x_2, …, x_K} 表示语音数据库，K 表示语音数据库中语音的条数，x_1 至 x_K 代表语音数据库中的每一条语音；c_1, c_2, …, c_M 代表当前评价因素下的各子类因素，M 为子类因素的个数。通过评价模型，采用如下公式获取每条语音在各子类因素下的子类条件概率：P(c_m | x_k)，k = 1, …, K，m = 1, …, M。进而采用下述公式进行整合，得到语音数据库在各子类因素下的子类平均条件概率：P(c_m) = (1/K) Σ_{k=1}^{K} P(c_m | x_k)，m = 1, …, M。从而根据下述公式得到语音数据库涉及当前评价因素的单因素信息熵：H = −Σ_{m=1}^{M} P(c_m) log P(c_m)，其中 log 为自然对数。In an implementation scenario, D = {x_1, x_2, …, x_K} denotes the speech database, where K is the number of speeches in the database and x_1 through x_K represent each speech; c_1, c_2, …, c_M represent the sub-class factors under the current evaluation factor, with M the number of sub-class factors. Through the evaluation model, the sub-class conditional probability of each speech under each sub-class factor is obtained as P(c_m | x_k), k = 1, …, K, m = 1, …, M. These are then integrated to obtain the sub-class average conditional probability of the speech database under each sub-class factor: P(c_m) = (1/K) Σ_{k=1}^{K} P(c_m | x_k), m = 1, …, M. The single-factor information entropy of the speech database with respect to the current evaluation factor is then obtained as H = −Σ_{m=1}^{M} P(c_m) log P(c_m), where log is the natural logarithm.
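按上述公式，单因素信息熵可以直接由评价模型输出的子类条件概率算得；下面给出一个 NumPy 草图，并对比"均匀覆盖各子类"与"集中于单一子类"两个假想数据库的熵值（数值仅为示意）。Following the formulas above, the single-factor information entropy can be computed directly from the sub-class conditional probabilities output by the evaluation model; below is a NumPy sketch comparing two hypothetical databases, one covering the sub-classes evenly and one concentrated in a single sub-class (the numbers are illustrative only).

```python
import numpy as np

def single_factor_entropy(cond_probs):
    """Single-factor information entropy of a speech database.

    cond_probs: (K, M) array; row k holds P(c_m | x_k) for utterance x_k
    over the M sub-class factors, as output by the evaluation model.
    """
    p = cond_probs.mean(axis=0)           # sub-class average P(c_m)
    p = p[p > 0]                          # drop zero terms (0·log 0 = 0)
    return float(-(p * np.log(p)).sum())  # natural-log entropy

# Evenly covered database vs. one stuck in a single sub-class (K=100, M=4).
even = np.full((100, 4), 0.25)
narrow = np.tile([0.97, 0.01, 0.01, 0.01], (100, 1))
h_even, h_narrow = single_factor_entropy(even), single_factor_entropy(narrow)
```

熵值越高，说明各子类覆盖越均匀，对应更好的覆盖性。A higher entropy indicates a more even spread over the sub-classes, i.e. better coverage.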

步骤S24:根据单因素信息熵,确定语音数据库的覆盖度。Step S24: Determine the coverage of the speech database according to the single-factor information entropy.

具体地，根据语音数据库在各评价因素下对应得到的各单因素信息熵，能够快速明确语音数据库中的各语音涉及需要涉及的分类因素个数以及熵值大小，对于各个评价因素的子类因素判断全部语音覆盖情况，进而评估语音数据库中的语音是否满足需求，从而确定语音数据库的质量是否合格，对于合格的语音数据库需要尽量全面且平均的覆盖各个评价因素中的各个子类因素，以保证基于该语音数据库进行模型训练或其他后续处理的结果。通过利用信息熵来判断语音数据库涉及的分类因素进而评估语音数据库的质量，有利于将不确定性因素进行量化，统一度量各语音涉及各分类因素的标准，使抽象的评判信息变为具体化，进而有助于直接且快速获取待评估的语音数据库的质量。Specifically, from the single-factor information entropies obtained for the speech database under each evaluation factor, the number of classification factors involved by each speech and the magnitude of the entropy values can be quickly determined, the overall coverage of the sub-class factors of each evaluation factor can be judged, and whether the speech in the database meets the requirements can then be evaluated, so as to determine whether the quality of the speech database is acceptable. A qualified speech database should cover every sub-class factor of each evaluation factor as comprehensively and evenly as possible, to guarantee the results of model training or other subsequent processing based on that database. Using information entropy to judge the classification factors involved in the speech database, and thereby evaluate its quality, helps quantify uncertain factors, unifies the standard by which each speech is measured against each classification factor, and turns abstract judgment information into something concrete, thus helping to assess the quality of the speech database to be evaluated directly and quickly.

基于同一个发明构思，图4示出了一种用于度量语音数据库覆盖性的无监督模型训练装置100，包括：数据获取单元110，用于获取训练数据，训练数据为语音；评价因素确定单元120，用于确定语音数据库覆盖性的一个或多个评价因素；划分单元130，用于基于训练数据对应于评价因素是否可通过参数调整控制，划分评价因素为可调因素或不可调因素；算法确定单元140，用于确定划分后的每个评价因素对应的聚类算法；分类单元150，用于通过每个评价因素对应的聚类算法分别将训练数据分类，得到多个子类；模型训练单元160，用于根据每个评价因素的多个子类，训练评价模型。Based on the same inventive concept, FIG. 4 shows an unsupervised model training apparatus 100 for measuring the coverage of a speech database, including: a data acquisition unit 110 for acquiring training data, where the training data is speech; an evaluation factor determining unit 120 for determining one or more evaluation factors of the coverage of the speech database; a dividing unit 130 for dividing the evaluation factors into adjustable factors or non-adjustable factors based on whether the training data corresponding to each evaluation factor can be controlled by parameter adjustment; an algorithm determination unit 140 for determining the clustering algorithm corresponding to each divided evaluation factor; a classification unit 150 for classifying the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and a model training unit 160 for training an evaluation model according to the multiple subclasses of each evaluation factor.

在一实施例中,算法确定单元140还用于:当评价因素为不可调因素时,确定其对应的聚类算法为基于距离的聚类算法;当评价因素为可调因素时,确定其对应的聚类算法为自适应训练算法。In one embodiment, the algorithm determining unit 140 is further configured to: when the evaluation factor is an unadjustable factor, determine that its corresponding clustering algorithm is a distance-based clustering algorithm; when the evaluation factor is an adjustable factor, determine its corresponding clustering algorithm. The clustering algorithm is an adaptive training algorithm.

在一实施例中,分类单元150还用于:当评价因素为不可调因素时,提取训练数据的特征向量;根据特征向量,采用基于距离的聚类算法,将训练数据划分为多个子类。In one embodiment, the classification unit 150 is further configured to: extract a feature vector of the training data when the evaluation factor is an unadjustable factor; and divide the training data into multiple subclasses by using a distance-based clustering algorithm according to the feature vector.

在一实施例中,基于距离的聚类算法为K均值聚类算法。In one embodiment, the distance-based clustering algorithm is a K-means clustering algorithm.

在一实施例中,分类单元150还用于:当评价因素为可调因素时,提取训练数据的特征向量;通过特征向量,训练高斯混合模型,标注训练数据;根据标注的训练数据,将训练数据分为多个子类。In one embodiment, the classification unit 150 is further configured to: extract a feature vector of the training data when the evaluation factor is an adjustable factor; train a Gaussian mixture model through the feature vector, and label the training data; Data is divided into multiple subcategories.

在一实施例中,通过特征向量,训练高斯混合模型,标注训练数据,包括:通过特征向量训练高斯混合模型;根据评价因素,确定控制参数,控制参数可调整控制训练数据;遍历控制参数的所有取值,对训练数据进行变换;获取变换后的训练数据的特征向量使高斯混合模型似然度最大时的参数值;根据参数值累计似然度;根据参数值变换训练数据,得到新的训练数据,重新训练直到达到停止条件;将每个训练数据对应的使高斯混合模型似然度最大时的参数值作为训练数据的标注值。In one embodiment, training the Gaussian mixture model and labeling the training data through the feature vector includes: training the Gaussian mixture model through the feature vector; determining control parameters according to the evaluation factors, and the control parameters can be adjusted to control the training data; traversing all the control parameters. Take the value to transform the training data; obtain the parameter value when the eigenvector of the transformed training data maximizes the likelihood of the Gaussian mixture model; accumulate the likelihood according to the parameter value; transform the training data according to the parameter value to obtain a new training data, and retrain until the stopping condition is reached; the parameter value corresponding to each training data when the likelihood of the Gaussian mixture model is maximized is used as the label value of the training data.

在一实施例中,停止条件包括:迭代次数达到预设阈值,或累计似然度与上次迭代中的累计似然度变化率小于预设阈值。In one embodiment, the stopping condition includes: the number of iterations reaches a preset threshold, or the rate of change between the accumulated likelihood and the accumulated likelihood in the previous iteration is less than a preset threshold.

在一实施例中,模型训练单元160还用于:将每个子类数据分别训练一个或多个评价模型,或将多个子类数据整体训练一个评价模型。In one embodiment, the model training unit 160 is further configured to: train one or more evaluation models on each sub-class data separately, or train a plurality of sub-class data as a whole into one evaluation model.

在一实施例中,语音数据库覆盖性的评价因素包括以下一个或多个:发音者的性别、发音者的年龄、发音者的口音、发音者的语速、发音者的音调、发音因素或内容主题。In one embodiment, the evaluation factors for the coverage of the speech database include one or more of the following: the speaker's gender, the speaker's age, the speaker's accent, the speaker's speaking rate, the speaker's pitch, the speaker's factor or content theme.

本公开还提供一种度量语音数据库覆盖性的装置，装置包括：评价模型获取单元，用于利用前述任一实施例的用于度量语音数据库覆盖性的无监督模型训练方法10，得到每个评价因素的评价模型；语音数据库获取单元，用于获取待评价的语音数据库，其中，语音数据库中包括至少一条语音；检测单元，用于通过评价因素的评价模型对语音数据库中的每条语音进行检测，得到语音数据库与评价因素相对应的单因素信息熵；评价单元，用于根据单因素信息熵，确定语音数据库的覆盖度。The present disclosure also provides an apparatus for measuring the coverage of a speech database, the apparatus including: an evaluation model obtaining unit for obtaining the evaluation model of each evaluation factor by using the unsupervised model training method 10 for measuring speech database coverage of any of the foregoing embodiments; a speech database acquisition unit for acquiring the speech database to be evaluated, the speech database including at least one speech; a detection unit for detecting each speech in the speech database through the evaluation model of each evaluation factor to obtain the single-factor information entropy of the speech database corresponding to that evaluation factor; and an evaluation unit for determining the coverage of the speech database according to the single-factor information entropy.

图5是根据一示例性实施例示出的前述任一实施例装置的示意框图。例如,装置200可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。Fig. 5 is a schematic block diagram of an apparatus according to any one of the foregoing embodiments according to an exemplary embodiment. For example, apparatus 200 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, and the like.

参照图5,装置200可以包括以下一个或多个组件:处理组件202,存储器204,电源组件206,多媒体组件208,音频组件210,输入/输出(I/ O)的接口212,传感器组件214,以及通信组件216。5, the apparatus 200 may include one or more of the following components: a processing component 202, a memory 204, a power supply component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214, and communication component 216 .

处理组件202通常控制装置200的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理组件202可以包括一个或多个处理器220来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件202可以包括一个或多个模块,便于处理组件202和其他组件之间的交互。例如,处理组件202可以包括多媒体模块,以方便多媒体组件208和处理组件202之间的交互。The processing component 202 generally controls the overall operation of the device 200, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 202 may include one or more processors 220 to execute instructions to perform all or some of the steps of the methods described above. Additionally, processing component 202 may include one or more modules that facilitate interaction between processing component 202 and other components. For example, processing component 202 may include a multimedia module to facilitate interaction between multimedia component 208 and processing component 202.

存储器204被配置为存储各种类型的数据以支持在装置200的操作。这些数据的示例包括用于在装置200上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器204可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。Memory 204 is configured to store various types of data to support operation at device 200 . Examples of such data include instructions for any application or method operating on the device 200, contact data, phonebook data, messages, pictures, videos, and the like. Memory 204 may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk.

电源组件206为装置200的各种组件提供电力。电源组件206可以包括电源管理系统,一个或多个电源,及其他与为装置200生成、管理和分配电力相关联的组件。Power supply assembly 206 provides power to various components of device 200 . Power supply components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 200 .

多媒体组件208包括在所述装置200和用户之间的提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件208包括一个前置摄像头和/或后置摄像头。当设备200处于操作模式,如拍摄模式或视频模式时,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统或具有焦距和光学变焦能力。The multimedia component 208 includes a screen that provides an output interface between the device 200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundaries of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 208 includes a front-facing camera and/or a rear-facing camera. When the device 200 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.

音频组件210被配置为输出和/或输入音频信号。例如,音频组件210包括一个麦克风(MIC),当装置200处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器204或经由通信组件216发送。在一些实施例中,音频组件210还包括一个扬声器,用于输出音频信号。Audio component 210 is configured to output and/or input audio signals. For example, audio component 210 includes a microphone (MIC) that is configured to receive external audio signals when device 200 is in operating modes, such as call mode, recording mode, and voice recognition mode. The received audio signal may be further stored in memory 204 or transmitted via communication component 216 . In some embodiments, the audio component 210 also includes a speaker for outputting audio signals.

I/ O接口212为处理组件202和外围接口模块之间提供接口,上述外围接口模块可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启动按钮和锁定按钮。The I/O interface 212 provides an interface between the processing component 202 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.

传感器组件214包括一个或多个传感器,用于为装置200提供各个方面的状态评估。例如,传感器组件214可以检测到装置200的打开/关闭状态,组件的相对定位,例如所述组件为装置200的显示器和小键盘,传感器组件214还可以检测装置200或装置200一个组件的位置改变,用户与装置200接触的存在或不存在,装置200方位或加速/减速和装置200的温度变化。传感器组件214可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件214还可以包括光传感器,如CMOS或CCD图像传感器,用于在成像应用中使用。在一些实施例中,该传感器组件214还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。Sensor assembly 214 includes one or more sensors for providing status assessments of various aspects of device 200 . For example, the sensor assembly 214 can detect the open/closed state of the device 200, the relative positioning of components, such as the display and keypad of the device 200, and the sensor assembly 214 can also detect a change in the position of the device 200 or a component of the device 200 , the presence or absence of user contact with the device 200 , the orientation or acceleration/deceleration of the device 200 and the temperature change of the device 200 . Sensor assembly 214 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. Sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

通信组件216被配置为便于装置200和其他设备之间有线或无线方式的通信。装置200可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一个示例性实施例中,通信组件216经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件216还包括近场通信(NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。Communication component 216 is configured to facilitate wired or wireless communication between apparatus 200 and other devices. Device 200 may access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 216 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

在示例性实施例中，装置200可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现，用于执行上述方法。In an exemplary embodiment, apparatus 200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.

在示例性实施例中,还提供了一种包括指令的计算机可读存储介质,例如包括指令的存储器204,上述指令可由装置200的处理器220执行以完成上述方法。例如,所述计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。In an exemplary embodiment, there is also provided a computer-readable storage medium including instructions, such as a memory 204 including instructions, which are executable by the processor 220 of the apparatus 200 to perform the method described above. For example, the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.

图6是根据一示例性实施例示出的一种电子装置300的框图。例如,装置300可以被提供为一服务器。参照图6,装置300包括处理组件322,其进一步包括一个或多个处理器,以及由存储器342所代表的存储器资源,用于存储可由处理组件322的执行的指令,例如应用程序。存储器342中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外,处理组件322被配置为执行指令,以执行上述方法。FIG. 6 is a block diagram of an electronic device 300 according to an exemplary embodiment. For example, the apparatus 300 may be provided as a server. 6, apparatus 300 includes a processing component 322, which further includes one or more processors, and a memory resource, represented by memory 342, for storing instructions executable by processing component 322, such as an application program. An application program stored in memory 342 may include one or more modules, each corresponding to a set of instructions. Additionally, the processing component 322 is configured to execute instructions to perform the above-described methods.

装置300还可以包括一个电源组件326被配置为执行装置300的电源管理，一个有线或无线网络接口350被配置为将装置300连接到网络，和一个输入输出(I/O)接口358。装置300可以操作基于存储在存储器342的操作系统，例如Windows Server™、Mac OS X™、Unix™、Linux™、FreeBSD™或类似。Device 300 may also include a power supply assembly 326 configured to perform power management of device 300, a wired or wireless network interface 350 configured to connect device 300 to a network, and an input/output (I/O) interface 358. Device 300 may operate based on an operating system stored in memory 342, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本发明的其它实施方案。本申请旨在涵盖本发明的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本发明的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本发明的真正范围和精神由下面的权利要求指出。Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or conventional techniques in the art not disclosed by this disclosure . The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.

It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (20)

1. An unsupervised model training method for measuring speech database coverage, the method comprising:
acquiring training data, wherein the training data is voice;
determining one or more evaluation factors for voice database coverage;
dividing the evaluation factor into an adjustable factor or an unadjustable factor based on whether the training data is controllable by parameter adjustment corresponding to the evaluation factor;
determining a clustering algorithm corresponding to each divided evaluation factor, wherein if the evaluation factor is an unadjustable factor, the corresponding clustering algorithm is determined to be a distance-based clustering algorithm, and if the evaluation factor is an adjustable factor, the corresponding clustering algorithm is determined to be an adaptive training algorithm;
classifying the training data respectively through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses;
training an evaluation model according to the plurality of subclasses of each of the evaluation factors.
2. The unsupervised model training method for measuring coverage of a speech database according to claim 1, wherein the classifying the training data by the clustering algorithm corresponding to each of the evaluation factors to obtain a plurality of subclasses comprises:
if the evaluation factor is an unadjustable factor, extracting a feature vector of the training data;
and according to the feature vector, dividing the training data into a plurality of subclasses by adopting the distance-based clustering algorithm.
3. The unsupervised model training method for measuring speech database coverage of claim 2, wherein the distance-based clustering algorithm is a K-means clustering algorithm.
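For unadjustable factors, claims 2 and 3 specify distance-based clustering of feature vectors, K-means in particular. The following is a minimal illustrative sketch in plain NumPy; the two-dimensional toy "feature vectors" and the cluster count are assumptions for demonstration only, not values taken from the patent:

```python
import numpy as np

def kmeans(features: np.ndarray, k: int, n_iter: int = 100, seed: int = 0) -> np.ndarray:
    """Assign each feature vector to one of k subclasses by plain K-means."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k randomly chosen samples.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each vector to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster went empty.
        new_centroids = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels

# Toy example: feature vectors from two well-separated groups of utterances.
feats = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10.0])
labels = kmeans(feats, k=2)
```

In practice the feature vectors would be acoustic features extracted from the training speech, and each resulting subclass would then be used to train an evaluation model as in claim 1.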
4. The unsupervised model training method for measuring coverage of a speech database according to claim 1, wherein the classifying the training data by the clustering algorithm corresponding to each of the evaluation factors to obtain a plurality of subclasses comprises:
if the evaluation factor is an adjustable factor, extracting a feature vector of the training data;
training a Gaussian mixture model through the feature vectors, and labeling the training data;
and according to the marked training data, dividing the training data into a plurality of subclasses.
5. The unsupervised model training method for measuring coverage of a speech database according to claim 4, wherein the training a Gaussian mixture model by the feature vector and labeling the training data comprises:
training a Gaussian mixture model through the feature vectors;
determining control parameters according to the evaluation factors, wherein the control parameters can adjust and control the training data;
traversing all values of the control parameters, and transforming the training data;
obtaining a parameter value when the feature vector of the transformed training data enables the likelihood of the Gaussian mixture model to be maximum;
accumulating likelihood according to the parameter values;
converting the training data according to the parameter values to obtain new training data, and retraining until a stopping condition is reached;
taking, for each piece of training data, the parameter value that maximizes the likelihood of the Gaussian mixture model as the labeled value of that training data.
6. The unsupervised model training method for measuring speech database coverage of claim 5, wherein the stopping condition comprises: the number of iterations reaches a preset threshold, or the rate of change between the cumulative likelihood and the cumulative likelihood of the previous iteration is smaller than a preset threshold.
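Claims 5 and 6 describe an adaptive loop for adjustable factors: fit a Gaussian mixture model, search the control-parameter values that transform the data, keep the maximum-likelihood value per sample, retrain on the transformed data, and stop on an iteration cap or when the cumulative likelihood stabilizes. A hedged sketch using scikit-learn is shown below; the scalar features, the candidate gain values, and the multiplicative transform are stand-in assumptions, since the patent leaves the concrete control parameter unspecified:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adaptive_label(features, param_values, transform, n_components=2,
                   max_iter=10, tol=1e-4, seed=0):
    """Label each sample with the control-parameter value that maximizes
    its likelihood under a GMM refit on the transformed data."""
    data = np.asarray(features, dtype=float)
    labels = np.zeros(len(data))
    prev_total = -np.inf
    for _ in range(max_iter):  # stopping condition 1: iteration cap
        gmm = GaussianMixture(n_components=n_components,
                              random_state=seed).fit(data)
        # For every sample, try all control-parameter values and keep the
        # one with the highest per-sample log-likelihood.
        scores = np.stack([gmm.score_samples(transform(data, p))
                           for p in param_values])  # (n_params, n_samples)
        best = scores.argmax(axis=0)
        labels = np.array([param_values[b] for b in best])
        total = scores.max(axis=0).sum()  # cumulative likelihood
        # stopping condition 2: cumulative likelihood change below threshold
        if np.isfinite(prev_total) and abs(total - prev_total) < tol * abs(prev_total):
            break
        prev_total = total
        # Transform each sample by its best parameter value, then retrain.
        data = np.array([transform(data[i:i + 1], labels[i])[0]
                         for i in range(len(data))])
    return labels

# Toy run: scalar "features", candidate gains as the control parameter.
feats = np.concatenate([np.random.default_rng(0).normal(0, 0.1, (20, 1)),
                        np.random.default_rng(1).normal(5, 0.1, (20, 1))])
gains = [0.5, 1.0, 2.0]
labels = adaptive_label(feats, gains, lambda x, g: x * g)
```

The per-sample parameter values returned here correspond to the labeled values of claim 5, from which the training data is divided into subclasses.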
7. The unsupervised model training method for measuring speech database coverage of claim 1, wherein said training an evaluation model according to said plurality of subclasses of each of said evaluation factors comprises: training one or more evaluation models for each subclass respectively, or training one evaluation model for a plurality of subclasses as a whole.
8. The unsupervised model training method for measuring speech database coverage as in claim 1, wherein the evaluation factors for speech database coverage include one or more of: gender of the speaker, age of the speaker, accent of the speaker, speed of speech, pitch, language, capture device, capture environment, pronunciation factors, or content subject.
9. A method for measuring coverage of a voice database, the method comprising obtaining an evaluation model of each evaluation factor by using the unsupervised model training method for measuring coverage of a voice database according to any one of claims 1 to 8;
acquiring a voice database to be evaluated, wherein the voice database comprises at least one voice;
detecting each voice in the voice database through the evaluation model of the evaluation factor to obtain a single-factor information entropy of the voice database corresponding to the evaluation factor;
determining the coverage of the voice database according to the single-factor information entropy.
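Claim 9 scores coverage through the single-factor information entropy: each voice is assigned to a subclass by the factor's evaluation model, and the entropy of the resulting subclass distribution is computed, so a database that covers all subclasses evenly yields the maximum entropy. A minimal sketch follows; the subclass labels are assumed to come from an already trained evaluation model:

```python
import math
from collections import Counter

def single_factor_entropy(subclass_labels):
    """Shannon entropy (base 2) of the subclass distribution for one factor."""
    counts = Counter(subclass_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A database spread evenly over two subclasses (e.g., a gender factor) has
# entropy 1 bit; one covering only a single subclass has entropy 0.
balanced = single_factor_entropy(["male", "female", "male", "female"])
skewed = single_factor_entropy(["male", "male", "male", "male"])
```

The per-factor entropies can then be combined (for example, summed or averaged over all evaluation factors) to give an overall coverage measure for the voice database.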
10. An unsupervised model training apparatus for measuring speech database coverage, the apparatus comprising:
a data acquisition unit, configured to acquire training data, wherein the training data is voice;
an evaluation factor determining unit, configured to determine one or more evaluation factors for voice database coverage;
a dividing unit, configured to divide the evaluation factors into adjustable factors or unadjustable factors based on whether the training data corresponding to the evaluation factor can be controlled through parameter adjustment;
an algorithm determining unit, configured to determine a clustering algorithm corresponding to each divided evaluation factor, wherein when the evaluation factor is an unadjustable factor, the corresponding clustering algorithm is determined to be a distance-based clustering algorithm, and when the evaluation factor is an adjustable factor, the corresponding clustering algorithm is determined to be an adaptive training algorithm;
a classification unit, configured to classify the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and
a model training unit, configured to train an evaluation model according to the plurality of subclasses of each evaluation factor.
11. The unsupervised model training device for measuring coverage of a speech database of claim 10, wherein the classification unit is further configured to:
when the evaluation factor is an unadjustable factor, extracting a feature vector of the training data;
and according to the feature vector, dividing the training data into a plurality of subclasses by adopting the distance-based clustering algorithm.
12. The unsupervised model training device for measuring speech database coverage of claim 11, wherein the distance-based clustering algorithm is a K-means clustering algorithm.
13. The unsupervised model training device for measuring coverage of a speech database of claim 10, wherein the classification unit is further configured to:
when the evaluation factor is an adjustable factor, extracting a feature vector of the training data;
training a Gaussian mixture model through the feature vectors, and labeling the training data;
and according to the marked training data, dividing the training data into a plurality of subclasses.
14. The unsupervised model training device for measuring coverage of a speech database according to claim 13, wherein said training a gaussian mixture model by using the feature vectors and labeling the training data comprises:
training a Gaussian mixture model through the feature vectors;
determining control parameters according to the evaluation factors, wherein the control parameters can adjust and control the training data;
traversing all values of the control parameters, and transforming the training data;
obtaining a parameter value when the feature vector of the transformed training data enables the likelihood of the Gaussian mixture model to be maximum;
accumulating likelihood according to the parameter values;
converting the training data according to the parameter values to obtain new training data, and retraining until a stopping condition is reached;
taking, for each piece of training data, the parameter value that maximizes the likelihood of the Gaussian mixture model as the labeled value of that training data.
15. The unsupervised model training device for measuring speech database coverage of claim 14, wherein the stopping condition comprises: the number of iterations reaches a preset threshold, or the rate of change between the cumulative likelihood and the cumulative likelihood of the previous iteration is smaller than a preset threshold.
16. The unsupervised model training device for measuring speech database coverage of claim 10, wherein the model training unit is further configured to: train one or more evaluation models for each subclass respectively, or train one evaluation model for a plurality of subclasses as a whole.
17. The unsupervised model training device for measuring coverage of a voice database of claim 10, wherein the evaluation factors for coverage of the voice database include one or more of: gender of the speaker, age of the speaker, accent of the speaker, speed of speech, pitch, language, capture device, capture environment, pronunciation factors, or content subject.
18. An apparatus for measuring coverage of a voice database, the apparatus comprising an evaluation model obtaining unit, configured to obtain an evaluation model of each evaluation factor by using the unsupervised model training method for measuring coverage of a voice database according to any one of claims 1 to 8;
the system comprises a voice database acquisition unit, a voice database evaluation unit and a voice evaluation unit, wherein the voice database acquisition unit is used for acquiring a voice database to be evaluated, and the voice database comprises at least one voice;
the detection unit is used for detecting each voice in the voice database through the evaluation model of the evaluation factor to obtain the single-factor information entropy of the voice database corresponding to the evaluation factor;
and the evaluation unit is used for determining the coverage of the voice database according to the single-factor information entropy.
19. An electronic device, comprising:
a memory to store instructions; and
a processor for invoking the memory-stored instructions to perform the unsupervised model training method for measuring speech database coverage of any of claims 1-8.
20. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, perform the unsupervised model training method for measuring speech database coverage of any of claims 1 to 8.
CN202010309303.7A 2020-04-20 2020-04-20 Unsupervised model training method and unsupervised model training device for measuring coverage of voice database Active CN111209429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010309303.7A CN111209429B (en) 2020-04-20 2020-04-20 Unsupervised model training method and unsupervised model training device for measuring coverage of voice database


Publications (2)

Publication Number Publication Date
CN111209429A CN111209429A (en) 2020-05-29
CN111209429B true CN111209429B (en) 2020-07-28

Family

ID=70787759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010309303.7A Active CN111209429B (en) 2020-04-20 2020-04-20 Unsupervised model training method and unsupervised model training device for measuring coverage of voice database

Country Status (1)

Country Link
CN (1) CN111209429B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862934B (en) * 2020-07-24 2022-09-27 思必驰科技股份有限公司 Improved method for speech synthesis model and speech synthesis method and device
CN113889096A (en) * 2021-09-16 2022-01-04 北京捷通华声科技股份有限公司 Method and device for analyzing sound library training data
CN114360523A (en) * 2022-03-21 2022-04-15 深圳亿智时代科技有限公司 Keyword dataset acquisition and model training methods, devices, equipment and medium
CN116061971A (en) * 2023-01-31 2023-05-05 清华大学 Automatic driving method and device based on layered reinforcement learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101340520B1 (en) * 2008-07-22 2013-12-11 삼성전자주식회사 Apparatus and method for removing noise
CN107301858B (en) * 2017-05-31 2020-09-22 华南理工大学 Audio Classification Method Based on Audio Feature Space Hierarchical Description
CN109830246B (en) * 2019-01-25 2019-10-29 北京海天瑞声科技股份有限公司 Audio quality appraisal procedure, device, electronic equipment and storage medium
CN111008299B (en) * 2020-03-11 2020-06-19 北京海天瑞声科技股份有限公司 Quality assessment method, device and computer storage medium for speech database


Similar Documents

Publication Publication Date Title
CN111209429B (en) Unsupervised model training method and unsupervised model training device for measuring coverage of voice database
CN111460150B (en) Classification model training method, classification method, device and storage medium
WO2020088069A1 (en) Hand gesture keypoints detection method and apparatus, electronic device, and storage medium
CN109389162B (en) Sample image screening technique and device, electronic equipment and storage medium
CN111753895A (en) Data processing method, device and storage medium
CN111931844B (en) Image processing method and device, electronic equipment and storage medium
CN111539443A (en) An image recognition model training method and device, and storage medium
CN109493852A (en) A kind of evaluating method and device of speech recognition
CN111160448A (en) An image classification model training method and device
CN111753917A (en) Data processing method, device and storage medium
CN113792207A (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN110648656A (en) Voice endpoint detection method and device, electronic equipment and storage medium
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN111461304A (en) Classification neural network training method, text classification method, device and equipment
CN110889489A (en) Neural network training method, image recognition method and device
CN111210844B (en) Method, device and equipment for determining speech emotion recognition model and storage medium
TW202036462A (en) Method, apparatus and electronic device for image generating and storage medium thereof
CN108665889A (en) The Method of Speech Endpoint Detection, device, equipment and storage medium
CN111862995A (en) A code rate determination model training method, code rate determination method and device
CN109409414B (en) Sample image determines method and apparatus, electronic equipment and storage medium
CN112884040B (en) Training sample data optimization method, system, storage medium and electronic equipment
CN114547421A (en) A search processing method, device, electronic device and storage medium
CN111753266A (en) User authentication method, multimedia content push method and device
CN108268667A (en) Audio file clustering method and device
CN109460458B (en) Prediction method and device for query rewriting intention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant