
CN112287666B - Corpus topic distribution calculation method based on meta information - Google Patents

Corpus topic distribution calculation method based on meta information

Info

Publication number
CN112287666B
CN112287666B (application CN202011124613.8A)
Authority
CN
China
Prior art keywords
topic
document
vocabulary
model
parameter
Prior art date
Legal status
Active
Application number
CN202011124613.8A
Other languages
Chinese (zh)
Other versions
CN112287666A (en)
Inventor
刘刚
唐宏伟
张翰墨
张瀚文
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202011124613.8A priority Critical patent/CN112287666B/en
Publication of CN112287666A publication Critical patent/CN112287666A/en
Application granted granted Critical
Publication of CN112287666B publication Critical patent/CN112287666B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of topic modeling, and in particular relates to a corpus topic distribution calculation method based on meta-information. The invention designs the TWLLDA topic model, which uses document and word meta-information and overcomes shortcomings of the prior art such as complex model structure, non-conjugacy and a single channel of information acquisition. The invention converts meta-information into label information for documents and words; because the label information is independent of the model, documents with similar labels have similar Dirichlet prior vectors, and words with similar labels have similar distribution weights over topics. The invention provides an effective closed-form Gibbs sampling method to perform the inference of TWLLDA, and multiple groups of experiments are carried out with perplexity and topic coherence as evaluation metrics. The experiments show that, under the same conditions, the meta-information-based TWLLDA model performs better than the LDA model and other baselines.

Description

A Meta-information-Based Method for Computing the Topic Distribution of a Corpus

Technical Field

The invention belongs to the technical field of topic modeling, and in particular relates to a method for computing the topic distribution of a corpus based on meta-information.

Background Art

The traditional LDA topic model performs poorly on short texts: such texts do not contain enough word co-occurrence information for its statistical assumptions to hold. How to enrich document information and apply it to short-text topic modeling has been one of the active research directions in recent years.

Although short-text topic modeling methods differ in detail, they share the same core idea, namely enriching the sparse word co-occurrence information of short texts by various means. The main implementations fall roughly into two categories: methods based on external knowledge such as word embeddings or labels, and methods that expand the content of short texts to increase the amount of information.

Most improved topic models today rely on introducing external features such as word vectors or on adding content to short texts. The general idea of introducing word-vector features is that semantically similar words become more likely to be assigned to the same topic, which improves the accuracy of the topic-word distribution. Such methods enrich word co-occurrence information only through word vectors, while the meta-information of documents is not effectively used; therefore, although their accuracy improves over the LDA model, there is still room for improvement.

Existing improved topic models often ignore the rich meta-information in the corpus. Meta-information is information that describes information: document features such as the author, the timestamp or a feature vector of the document are not part of the text content, yet describe the properties of the text and can therefore be called document meta-information. Knowledge such as word features can likewise serve as word meta-information. In a topic model the information of a word is its topic distribution, so word meta-information is information that describes the topic distribution of the word, and meta-information is known before the documents are read. Introducing meta-information adds channels of information acquisition and can effectively improve modeling accuracy.

Summary of the Invention

The purpose of the present invention is to provide a method for computing the topic distribution of a corpus based on meta-information.

The purpose of the present invention is achieved through the following technical solution, comprising the following steps:

Step 1: input the corpus to be analysed, obtain the document meta-information and word meta-information of the corpus, and set the maximum number of iterations;

Step 2: convert the document meta-information and word meta-information of the corpus into document labels and word labels; from the document labels generate the document-document-label vector matrix Fd,l; from the word labels generate the word-word-label vector matrix Gw,l*;

Step 3: assign the parameter λl,k corresponding to document label l and topic k with a Gamma distribution whose hyperparameter is u0, λl,k~Gamma(u0,u0), to obtain the topic-document-label correlation matrix; assign the parameter δl*,k corresponding to word label l* and topic k with a Gamma distribution whose hyperparameter is v0, δl*,k~Gamma(v0,v0), to obtain the topic-word-label correlation matrix; here the total number of topics is K, the total number of document labels is L, and the total number of word labels is L*;

Step 4: compute the parameter βk,v corresponding to topic k and word v from the word-label weights δl*,k and the elements of the word-word-label vector matrix Gw,l*; compute the parameter αd,k corresponding to document d and topic k from the document-label weights λl,k and the elements fd,l of the document-document-label vector matrix Fd,l; count the number of times nk,v each word v is assigned to topic k; count the number of words md,k assigned to topic k in each document d;

Step 5: sample the parameter qd via qd~Beta(αd,·, md,·);

where αd,· is the linear sum of the αd,k values over all topics of a document, αd,·=Σk αd,k, and md,· is the number of words in the document, md,·=Σk md,k;

Step 6: using the CRP (Chinese Restaurant Process), sample the parameter td,k with αd,k as the concentration and md,k as the number of customers;

Step 7: sample the parameter λ'l,k from the Gamma random function and update the parameter αd,k;

λ'l,k~Gamma(μ′,μ″)

the parameter αd,k is then updated from the newly sampled λ'l,k;

Step 8: for each topic k, sample an auxiliary Beta variable from Beta(βk,·, nk,·);

where βk,· is the sum of the relevance of topic k to every word and nk,· is the total number of words contained in topic k;

Step 9: using the CRP, sample the parameter t′k,v with βk,v as the concentration and nk,v as the number of customers;

Step 10: sample the parameter δl*,k from the Gamma random function and update the parameter βk,v;

the parameter βk,v is then updated from the newly sampled δl*,k;

Step 11: determine whether the maximum number of iterations has been reached; if not, return to step 5; otherwise, output the topic distribution of the corpus.

The beneficial effects of the present invention are as follows:

The present invention designs the TWLLDA topic model, which uses document and word meta-information and overcomes shortcomings of the prior art such as complex model structure, non-conjugacy and a single channel of information acquisition. The invention converts meta-information into label information for documents and words; because the label information is independent of the model itself, documents with similar labels have similar Dirichlet prior vectors, and words with similar labels have similar distribution weights over topics. The invention provides an effective closed-form (collapsed) Gibbs sampling method to perform the inference of TWLLDA. Multiple groups of experiments were carried out with perplexity and topic coherence as evaluation metrics, and they show that, under the same conditions, the meta-information-based TWLLDA model performs better than the LDA model and other baselines.

Brief Description of the Drawings

Fig. 1 is the overall framework diagram of the present invention.

Fig. 2 is the diagram of the TWLLDA model of the present invention.

Fig. 3 is the pseudocode of the algorithm that converts meta-information into labels in the present invention.

Fig. 4 is the structure of the document-document-label vector matrix in the present invention.

Fig. 5 is the structure of the word-word-label vector matrix in the present invention.

Fig. 6 is the structure of the topic-document-label correlation matrix in the present invention.

Fig. 7 is the structure of the topic-word-label correlation matrix in the present invention.

Fig. 8 is the pseudocode of the generative algorithm of TWLLDA in the present invention.

Fig. 9 is the pseudocode of the collapsed Gibbs sampling algorithm of TWLLDA in the present invention.

Fig. 10 shows the perplexity of each model on the Reuters dataset.

Fig. 11 compares the per-iteration time of each model on the 20NewsGroups dataset.

Detailed Description of Embodiments

The present invention is further described below with reference to the accompanying drawings.

The present invention relates to a meta-information-based method for computing the topic distribution of a corpus, comprising the following steps:

Step 1: input the corpus to be analysed, obtain the document meta-information and word meta-information of the corpus, and set the maximum number of iterations;

Step 2: convert the document meta-information and word meta-information of the corpus into document labels and word labels; from the document labels generate the document-document-label vector matrix Fd,l; from the word labels generate the word-word-label vector matrix Gw,l*;

Step 3: assign the parameter λl,k corresponding to document label l and topic k with a Gamma distribution whose hyperparameter is u0, λl,k~Gamma(u0,u0), to obtain the topic-document-label correlation matrix; assign the parameter δl*,k corresponding to word label l* and topic k with a Gamma distribution whose hyperparameter is v0, δl*,k~Gamma(v0,v0), to obtain the topic-word-label correlation matrix; here the total number of topics is K, the total number of document labels is L, and the total number of word labels is L*;

Step 4: compute the parameter βk,v corresponding to topic k and word v from the word-label weights δl*,k and the elements of the word-word-label vector matrix Gw,l*; compute the parameter αd,k corresponding to document d and topic k from the document-label weights λl,k and the elements fd,l of the document-document-label vector matrix Fd,l; count the number of times nk,v each word v is assigned to topic k; count the number of words md,k assigned to topic k in each document d;

Step 5: sample the parameter qd via qd~Beta(αd,·, md,·);

where αd,· is the linear sum of the αd,k values over all topics of a document, αd,·=Σk αd,k, and md,· is the number of words in the document, md,·=Σk md,k;

Step 6: using the CRP (Chinese Restaurant Process), sample the parameter td,k with αd,k as the concentration and md,k as the number of customers;

Step 7: sample the parameter λ'l,k from the Gamma random function and update the parameter αd,k;

λ'l,k~Gamma(μ′,μ″)

the parameter αd,k is then updated from the newly sampled λ'l,k;

Step 8: for each topic k, sample an auxiliary Beta variable from Beta(βk,·, nk,·);

where βk,· is the sum of the relevance of topic k to every word and nk,· is the total number of words contained in topic k;

Step 9: using the CRP, sample the parameter t′k,v with βk,v as the concentration and nk,v as the number of customers;

Step 10: sample the parameter δl*,k from the Gamma random function and update the parameter βk,v;

the parameter βk,v is then updated from the newly sampled δl*,k;

Step 11: determine whether the maximum number of iterations has been reached; if not, return to step 5; otherwise, output the topic distribution of the corpus (a code sketch of this sampling loop follows the step list).
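As referenced at the end of the procedure, the following sketch strings steps 5-11 together into one sampling iteration. It is an illustrative reading of the procedure, not the patented implementation: it relies on the helper routines sketched further below (label-informed priors, token reassignment, CRP table counts), and it leaves the Gamma updates of λ and δ as placeholders because their posterior formulas are not reproduced in the text. All names are assumptions.

```python
import numpy as np

def run_twllda(docs, F, G, lam, delta, K, max_iter):
    """Schematic TWLLDA sampling loop (steps 5-11); an illustrative sketch only.

    docs       : list of lists of word ids
    F, G       : 0/1 document-label and word-label matrices (step 2)
    lam, delta : Gamma-initialised label-topic weight matrices (step 3)
    """
    D, V = len(docs), G.shape[0]
    alpha, beta = label_informed_priors(F, G, lam, delta)        # step 4 (assumed product form)

    # random initial topic assignments and the corresponding count matrices
    z = [np.random.randint(K, size=len(doc)) for doc in docs]
    m_dk = np.zeros((D, K), dtype=int)
    n_kv = np.zeros((K, V), dtype=int)
    n_k = np.zeros(K, dtype=int)
    for d, doc in enumerate(docs):
        for v, k in zip(doc, z[d]):
            m_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

    for _ in range(max_iter):                                    # step 11 controls the loop
        q = np.random.beta(alpha.sum(1), np.maximum(m_dk.sum(1), 1e-12))   # step 5
        t_dk = np.vectorize(sample_crp_tables)(alpha, m_dk)                # step 6
        # steps 7 and 10: resample lam and delta from their Gamma posteriors using
        # q, t_dk and the topic-side auxiliaries, then refresh the priors; the
        # posterior formulas are not reproduced here, so this is left as a placeholder.
        alpha, beta = label_informed_priors(F, G, lam, delta)

        for d, doc in enumerate(docs):                           # collapsed Gibbs over all tokens
            for i, v in enumerate(doc):
                z[d][i] = resample_token_topic(d, v, z[d][i], alpha, beta, m_dk, n_kv, n_k)

    theta = (m_dk + alpha) / (m_dk + alpha).sum(1, keepdims=True)   # document-topic distributions
    phi = (n_kv + beta) / (n_kv + beta).sum(1, keepdims=True)       # topic-word distributions
    return theta, phi
```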

The present invention designs the TWLLDA topic model with document and word meta-information, which overcomes the low accuracy of topic models on short documents; under the same conditions the model achieves lower perplexity and higher topic coherence.

Fig. 3 shows the conversion from meta-information to labels in the present invention. First, the document meta-information is processed by the algorithm into a document-label matrix whose elements are 0/1, and the word meta-information is processed into a word-label matrix whose elements are 0/1; this both reduces the data-processing burden on the model and preserves the expression of the high-dimensional features contained in the meta-information. Next, the document-document-label vector matrix is generated from the document-label matrix, and the word-word-label vector matrix is generated from the word-label matrix. Finally, the topic-document-label correlation matrix is obtained by assigning λl,k with a Gamma distribution whose hyperparameter is u0, λl,k~Gamma(u0,u0), and the topic-word-label correlation matrix is obtained by assigning δl*,k with a Gamma distribution whose hyperparameter is v0. Fig. 4 shows the document-document-label vector matrix and Fig. 5 the word-word-label vector matrix, where d is a document in the dataset, v is a word in the dataset, l is a document label and l* is a word label. The topic-document-label correlation matrix and the topic-word-label correlation matrix are shown in Fig. 6 and Fig. 7, where k is a topic. The generative process of the model is shown in Fig. 8, and the collapsed Gibbs sampling process of TWLLDA is shown in Fig. 9.
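To make the conversion concrete, the following sketch thresholds real-valued meta-information vectors into the 0/1 label matrices described above (averaging positive and negative components separately) and then initialises the label-topic weight matrices from Gamma draws. The function names are illustrative, not those of Algorithm 1, and Gamma(u0,u0) is read here as shape u0 and rate u0 (scale 1/u0), which is an assumption.

```python
import numpy as np

def meta_vectors_to_labels(meta_vectors):
    """Threshold real-valued meta-information vectors into a 0/1 label matrix.

    A dimension is activated (set to 1) when its value is >= the mean of the
    positive components (avg+) or <= the mean of the negative components (avg-).
    """
    meta_vectors = np.asarray(meta_vectors, dtype=float)
    labels = np.zeros(meta_vectors.shape, dtype=int)
    for i, vec in enumerate(meta_vectors):
        pos, neg = vec[vec > 0], vec[vec < 0]
        avg_pos = pos.mean() if pos.size else np.inf    # no positive part: avg+ never triggers
        avg_neg = neg.mean() if neg.size else -np.inf   # no negative part: avg- never triggers
        labels[i] = ((vec >= avg_pos) | (vec <= avg_neg)).astype(int)
    return labels

# F_{d,l} from document vectors, G from word vectors (illustrative random inputs)
F = meta_vectors_to_labels(np.random.randn(5, 8))
G = meta_vectors_to_labels(np.random.randn(20, 8))

# Gamma initialisation of the label-topic weights (step 3)
K, u0, v0 = 10, 1.0, 1.0
lam = np.random.gamma(u0, 1.0 / u0, size=(F.shape[1], K))      # lambda_{l,k}
delta = np.random.gamma(v0, 1.0 / v0, size=(G.shape[1], K))    # delta_{l*,k}
```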

The mathematical derivation of the TWLLDA model comprises the sampling of the auxiliary random variables, the sampling of the Gamma random variables λ'l,k, and the sampling of the Gamma random variables δl*,k.

The joint distribution of the TWLLDA model is exactly the same as the joint distribution of LDA.

Here nk,v is the number of times word v is assigned to topic k, and md,k is the number of words assigned to topic k among all the words of document d. The conjugacy of the Dirichlet and multinomial distributions can be used to compute the parameters φ and θ, and the joint probability of the model then follows, as shown in formula (1-2).

Here BetaN(.) is the N-dimensional Beta function. The Dirichlet prior parameters represent the features of the documents and the words. Given the prior parameters αd and βk, the standard LDA model can be run: the topic assignment Zd,i can be computed directly by Gibbs sampling over the topic assignments of the words in each document, as shown in formula (1-3).
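The assignment formula referenced here did not survive extraction; the sketch below implements the standard collapsed Gibbs reassignment of a single token under document- and topic-specific Dirichlet priors αd,k and βk,v, which is what the surrounding description implies. It should be read as a conventional reconstruction, not as the patent's exact formula (1-3).

```python
import numpy as np

def resample_token_topic(d, v, old_k, alpha, beta, m_dk, n_kv, n_k):
    """Collapsed Gibbs update for one token of word v in document d.

    alpha : (D, K) document-topic priors alpha_{d,k}
    beta  : (K, V) topic-word priors beta_{k,v}
    m_dk  : (D, K) per-document topic counts;  n_kv : (K, V) per-topic word counts
    n_k   : (K,)   total token count of each topic
    """
    # remove the token's current assignment from the counts
    m_dk[d, old_k] -= 1
    n_kv[old_k, v] -= 1
    n_k[old_k] -= 1

    # p(z = k | rest) is proportional to
    # (m_{d,k} + alpha_{d,k}) * (n_{k,v} + beta_{k,v}) / (n_{k,.} + beta_{k,.})
    p = (m_dk[d] + alpha[d]) * (n_kv[:, v] + beta[:, v]) / (n_k + beta.sum(axis=1))
    p /= p.sum()
    new_k = np.random.choice(len(p), p=p)

    # add the token back under its newly sampled topic
    m_dk[d, new_k] += 1
    n_kv[new_k, v] += 1
    n_k[new_k] += 1
    return new_k
```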

TWLLDA does not assume a single fixed parameter to represent the features of documents and words; instead, the Dirichlet prior parameters that are simply assumed in LDA are computed by a linear model over the external features of the documents and words, which yields more precise priors. These parameters can be generated directly from Gamma random variables.
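The exact combination rule used to turn label weights into priors is in formulas that are not reproduced here. The sketch below assumes a multiplicative (log-linear) combination over the activated labels, a common choice for label-informed Dirichlet priors; it should be read as an assumption rather than as the patented formula.

```python
import numpy as np

def label_informed_priors(F, G, lam, delta):
    """Dirichlet priors computed from 0/1 label matrices and label-topic weights.

    F     : (D, L)  document-label matrix f_{d,l}
    G     : (V, L*) word-label matrix
    lam   : (L, K)  document-label weights lambda_{l,k}
    delta : (L*, K) word-label weights delta_{l*,k}

    Assumed form: alpha_{d,k} = prod_l lam_{l,k}^{f_{d,l}},
                  beta_{k,v}  = prod_{l*} delta_{l*,k}^{g_{v,l*}}.
    """
    alpha = np.exp(F @ np.log(lam))        # (D, K)
    beta = np.exp(G @ np.log(delta)).T     # (K, V)
    return alpha, beta
```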

The topic distribution θ of a document is a multinomial distribution, and λl,k is used to compute the Dirichlet prior α of θ. To sample the variable λl,k, formula (1-2) is first expanded by means of the Gamma function, as shown in formula (1-4).

Here αd,· is the linear sum of the α values of all topics in a document, and md,· is the number of words in the document; Γ(.) is the Gamma function, and αk is replaced by the corresponding expression. In formula (1-4), Gamma ratio 1 can be regarded as a series of beta random variables and can therefore be augmented by a Pitman-Yor process, as shown in formula (1-5).

For each document d we have qd~Beta(αd,·, md,·); therefore, given q1:D for all documents, Gamma ratio 1 can be expressed in terms of these beta variables. Gamma ratio 2 in formula (1-4) can be augmented with the auxiliary variables td,k, as shown in formula (1-6).
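A minimal sketch of this per-document auxiliary draw (NumPy names are illustrative; documents are assumed to contain at least one token):

```python
import numpy as np

# alpha : (D, K) priors alpha_{d,k};  m_dk : (D, K) per-document topic counts
alpha_d_dot = alpha.sum(axis=1)             # alpha_{d,.}
m_d_dot = m_dk.sum(axis=1)                  # m_{d,.}: number of tokens in document d
q = np.random.beta(alpha_d_dot, m_d_dot)    # q_d ~ Beta(alpha_{d,.}, m_{d,.}), one draw per document
```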

The parameter appearing in this formula denotes an unsigned Stirling number of the first kind, and Gamma ratio 2 is in fact the probability normalisation constant of the CRP (Chinese Restaurant Process). Through the CRP, the parameter td,k can be sampled with αd,k as the concentration and md,k as the number of customers, as shown in formula (1-7).

In formula (1-7), Bern(.) is the Bernoulli distribution, and the parameter td,k is obtained by sampling 0/1 values from Bernoulli distributions. By ignoring the terms unrelated to the parameter α, formula (1-6) can be simplified; through the above transformations, formula (1-4) reduces to formula (1-8).
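The table-count draw described here can be simulated as a sum of independent Bernoulli variables, the usual way of sampling the number of occupied tables in a CRP with concentration αd,k and md,k customers; a sketch with illustrative names:

```python
import numpy as np

def sample_crp_tables(concentration, n_customers, rng=np.random):
    """Number of occupied tables in a CRP after n_customers arrivals.

    Customer i (1-based) opens a new table with probability
    concentration / (concentration + i - 1), so the table count is a sum of
    independent 0/1 Bernoulli draws, exactly as described above.
    """
    n_customers = int(n_customers)
    if n_customers == 0:
        return 0
    i = np.arange(n_customers)                                   # i = 0 .. n-1
    return int((rng.random(n_customers) < concentration / (concentration + i)).sum())

# t_{d,k}: one draw per (document, topic) pair
# t_dk = np.vectorize(sample_crp_tables)(alpha, m_dk)
```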

Using the Gamma-function simplification mentioned above, formula (1-8) can be transformed into formula (1-9).

As mentioned earlier, all document labels and word labels in the TWLLDA model are 0/1 labels, and only the labels that equal 1 in the label matrix are activated and take part in computing the model parameter αd,k. Extracting from formula (1-9) all the labels activated in the parameter λl,k gives the posterior probability of the parameter λl,k, where αd,k/λl,k denotes the value of αd,k obtained without including λl,k in the computation when fd,l=1.

Through these data-augmentation techniques, the TWLLDA joint probability distribution has been transformed into a conjugate prior of the Gamma variables λ'l,k; therefore, just as with Gibbs sampling in LDA, TWLLDA can sample the parameters λ'l,k directly, as in formulas (1-10), (1-11) and (1-12):

λ'l,k~Gamma(μ′,μ″)    (1-10)

Before the first iteration of the TWLLDA model, λ'l,k is still an empty matrix and the computation is the standard LDA process; after the first iteration the λ'l,k matrix is filled, and the parameter αd,k can then be updated through λ'l,k, as shown in formula (1-13).

In formula (1-13), λ'l,k is the value newly computed according to formula (1-10). Repeating the computations from (1-10) to (1-13) constitutes the sampling of the Gamma random variables λ'l,k. In this iterative process, a document takes part in the computation only when its document label is activated.

The Gamma random variables δl′,k and λ'l,k have similar derivations: both use the same data-augmentation techniques and beta-function transformations. The second beta function in formula (1-2) is transformed into formula (1-14).

Continuing the simplification with the method above gives formula (1-15).

In correspondence with the relations above, Beta(.) is the beta distribution, βk,· is the topic prior of the multinomial distribution of topic k over the vocabulary, i.e. the sum of the relevance of topic k to every word, and nk,· is the total number of words of each topic; Bern(.) is the Bernoulli distribution, and βk,v is a correlation matrix whose computation uses the word-label activation matrix of the model. Extracting every term related to δl′,k and adding the Gamma prior distribution gives the posterior probability of δl′,k.

The parameter δl′,k can now be sampled from the Gamma random function, as in formulas (1-16), (1-17) and (1-18).

After the above three processes, the updated formula for the parameter βk,v is obtained.

Here v' is the average number of words whose label is in the activated state. The derivation of the model parameters is now complete.
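The word-side auxiliaries mirror the document-side ones. The sketch below draws the per-topic Beta variable (step 8) and the table counts t′k,v (step 9) under the same assumptions as the earlier sketches, and leaves the Gamma resampling of δl*,k (step 10) as a placeholder because formulas (1-16) to (1-18) are not reproduced in the text; it reuses sample_crp_tables from the sketch above.

```python
import numpy as np

# beta : (K, V) priors beta_{k,v};  n_kv : (K, V) per-topic word counts
beta_k_dot = beta.sum(axis=1)                                       # beta_{k,.}
n_k_dot = n_kv.sum(axis=1)                                          # n_{k,.}
q_topic = np.random.beta(beta_k_dot, np.maximum(n_k_dot, 1e-12))    # one Beta auxiliary per topic (step 8)

# t'_{k,v}: CRP table counts with beta_{k,v} as concentration and n_{k,v} customers (step 9)
t_kv = np.vectorize(sample_crp_tables)(beta, n_kv)

# Step 10 would now resample every delta_{l*,k} from its Gamma posterior built from
# q_topic, t_kv and the activated word labels, and recompute beta_{k,v} from the
# updated delta; those posterior formulas are not reproduced in the extracted text.
```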

Research on the Gibbs sampling algorithm based on the TWLLDA model:

Unlike most existing topic models, TWLLDA can exploit meta-information at two levels, the document level and the word level; moreover, topics and labels need not correspond one to one, and the number of labels can be set freely. In addition, TWLLDA does not break the conjugate prior structure of the model, so parameter inference can be performed with a collapsed Gibbs sampling method similar to that of LDA.

Aiming at the shortcomings of short-text topic modeling, the TWLLDA model converts the meta-information of the corpus into word labels and document labels and computes distinct Dirichlet prior parameters for the document-topic distribution and the topic-word distribution. This effectively merges the meta-information while enriching the sparse word co-occurrence of short texts, so that the perplexity and topic quality of TWLLDA in short-text modeling improve significantly over other models. In addition, TWLLDA also achieves good modeling accuracy on regular texts.

The document meta-information and word meta-information are obtained with the doc2vec and word2vec algorithms. According to Algorithm 1, the word meta-information (wV1, wV2, wV3, ..., wVN) and the document meta-information (dV1, dV2, dV3, ..., dVD) are taken as input, and the document labels (dL1, dL2, dL3, ..., dLD) and word labels (wL1*, wL2*, wL3*, ..., wLN*) are produced as output. The procedure is as follows. First, for each document in the meta-information list (dV1, dV2, dV3, ..., dVD), the positive components of its meta-information vector are averaged to give (avg+) and the negative components are averaged to give (avg-); each dimension dVn is then compared with (avg+) and (avg-), and if dVn ≥ (avg+) or dVn ≤ (avg-) a 1 is appended to the document label, otherwise a 0. This yields a document-label matrix with 0/1 elements and, analogously, a word-label matrix with 0/1 elements, where the document-document-label vector matrix Fd,l and the word-word-label vector matrix Gw,l* are those produced by Algorithm 1. Next, the parameter δl*,k is assigned with a Gamma distribution whose hyperparameter is v0, and λl,k is assigned with a Gamma distribution whose hyperparameter is u0, λl,k~Gamma(u0,u0), which gives the topic-document-label correlation matrix Ak,l and the topic-word-label correlation matrix B. Then the corpus, the document-label matrix Fd,l, the word-label matrix Gw,l*, the topic-document-label correlation matrix A, the topic-word-label correlation matrix B, the hyperparameter u0, the hyperparameter v0 and the number of topics K are taken as input; at the document level the prior parameter αd specific to each document d is computed from matrix F and matrix A, and at the word level the prior parameter βk specific to each topic k is computed from matrix G and matrix B. Finally, the TWLLDA model is run with the obtained parameters αd and βk, and the Gibbs sampling algorithm is iterated to convergence, yielding the document-topic correlation matrix θd,k and the topic-word correlation matrix φk,v; the generative process of the model can thus be summarised as a single joint distribution.
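The patent does not name a specific embedding toolkit for obtaining these vectors; one common way is gensim's Doc2Vec and Word2Vec (gensim 4.x conventions), as sketched below. Package, parameter and variable choices are assumptions, not part of the patent.

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# docs: list of token lists, e.g. [["topic", "model", ...], ...]
tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(docs)]

d2v = Doc2Vec(tagged, vector_size=50, min_count=2, epochs=20)   # document meta-information dV_1..dV_D
w2v = Word2Vec(docs, vector_size=50, min_count=2, epochs=20)    # word meta-information wV_1..wV_N

doc_meta = [d2v.dv[i] for i in range(len(docs))]
word_meta = {w: w2v.wv[w] for w in w2v.wv.index_to_key}

# These real-valued vectors are then thresholded into the 0/1 labels dL_1..dL_D and
# wL_1*..wL_N* by Algorithm 1 (see the conversion sketch earlier in this description).
```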

Model training is now complete; feeding the corpus to be analysed into the trained model yields the corresponding topic distribution.

The present invention proposes the TWLLDA topic-model method based on document and word meta-information. By converting meta-information into label information for documents and words, the model overcomes the previously mentioned shortcomings of complex structure, non-conjugacy and a single channel of information acquisition. Because the label information is independent of the model itself, documents with similar labels have similar Dirichlet prior vectors, and words with similar labels have similar distribution weights over topics. An effective closed-form Gibbs sampling method is proposed to carry out the inference of TWLLDA, and multiple groups of experiments are conducted with perplexity and topic coherence as evaluation metrics. The experiments show that, compared with LDA and other models, the meta-information-based TWLLDA model performs better under the same conditions.

The present invention proposes a topic model based on document and word meta-information (Text-Word Labeled Latent Dirichlet Allocation, hereinafter abbreviated TWLLDA): the meta-information of the corpus is converted into word labels and document labels, and distinct Dirichlet prior parameters are computed for the document-topic distribution and the topic-word distribution. By merging document meta-information labels into the generative process, the model can both analyse labels together with topics and assign labels to each topic, which gives TWLLDA better interpretability. In the implementation, the document and word meta-information is first converted into labels, the corresponding document-document-label vectors and word-word-label vectors are generated, and the topic-document-label correlation matrix and the topic-word-label correlation matrix are obtained by setting the hyperparameters, so that the meta-information is incorporated into the model.

Experimental Verification and Analysis:

The datasets used in the TWLLDA performance-comparison experiments are 20NewsGroups and a subset of the Reuters corpus. 20NewsGroups is one of the international standard datasets for text classification, mining and retrieval research and contains about 20,000 news documents in 20 different categories. The corpus is well-formed and the categories contain comparable numbers of texts; some categories are topically very similar while others are completely unrelated. The Reuters dataset consists of labelled documents extracted from Reuters-21578 and comprises 11,367 documents with a vocabulary of 8,817 and an average document length of 73.

The datasets used to compute topic coherence are ten novels and the NLTK data. The novels are Sophie's World, The Three-Body Problem, Lord of the Flies, To Kill a Mockingbird, Sun Tzu's Art of War, The Old Man and the Sea, The Great Gatsby, A Study in Scarlet, Women in Love, and One Hundred Years of Solitude. Each novel represents a different category, and the Chinese versions of these novels can be downloaded from the Internet.

In the experiments of the present invention, doc2vec and word2vec are used to generate the document vectors and word vectors that serve as TWLLDA's input. The hyperparameters u0 and v0 of the TWLLDA model are both set to 1.0, and the hyperparameters of the standard LDA model are both set to 0.5. Unrecognised words do not take part in the computation. All novels are split by chapter into multiple documents to form a short-text dataset. For both the long-text and the short-text datasets, 80% is used as the training set and 20% as the test set.

As the number of topics grows, the perplexity of each model decreases, but the decrease is not necessarily linear; this holds, for example, for the LDA, LF-LDA and WF-LDA models. Once the number of topics reaches a certain value, the perplexity of these three models no longer drops much, which means there is a limit to how well they can resolve topic uncertainty. Moreover, although LF-LDA is improved with word embeddings, the nature of its latent features actually increases its perplexity; and although WF-LDA introduces word features, the discrimination they provide is not enough to reduce topic perplexity markedly. Compared with LDA, the DMR model performs better in terms of perplexity, but the experimental result of TWLLDA is still slightly better, because TWLLDA combines the meta-information of the text and is more capable of enriching word co-occurrence in documents.

In this experiment, two mainstream metrics are used to evaluate the quality of a topic model: perplexity and topic coherence. The iteration times of the models are also compared.

Perplexity measures how well a probability model predicts a sample and is an important evaluation metric in the field of topic modeling; the lower the perplexity, the higher the accuracy of the model. Its formula is shown in formula (1-20).

Perplexity = exp(-ln p(wd|αd)/Nd)    (1-20)
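A sketch of how this quantity is typically computed on held-out documents from the estimated θ and φ matrices (names are illustrative; the per-token likelihood is marginalised over topics):

```python
import numpy as np

def perplexity(test_docs, theta, phi):
    """Perplexity = exp(-(sum of log-likelihoods) / (number of tokens)).

    test_docs : list of lists of word ids
    theta     : (D, K) document-topic distributions theta_{d,k}
    phi       : (K, V) topic-word distributions phi_{k,v}
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(test_docs):
        for v in doc:
            p_w = float(theta[d] @ phi[:, v])   # p(w = v | d) = sum_k theta_{d,k} * phi_{k,v}
            log_lik += np.log(p_w)
            n_tokens += 1
    return float(np.exp(-log_lik / n_tokens))
```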

For the perplexity evaluation, the TWLLDA model is compared with five other models on the two datasets described above. The comparison results are shown in the table below.

Table 1. Perplexity of each model on the 20NewsGroups dataset

It can be seen that on the 20NewsGroups dataset the perplexity of every topic model is above 1500. The reason is that this dataset has many categories with diverse content, so even a relatively large number of topics cannot reduce the topic uncertainty to a very low level.

As shown in Fig. 10, different numbers of topics lead to different perplexity values on the Reuters dataset, but under every setting the perplexity of the TWLLDA model is lower than that of the comparison models, which demonstrates the topic-extraction quality of the proposed model.

Next comes the comparison of running and iteration times. Running time, i.e. the time complexity of the model, is an important criterion of model evaluation. One iteration of the standard LDA model reassigns a topic to every word of every document, based on the topic distribution of that document and the word distribution of each topic; each round of Gibbs sampling therefore maintains two matrices, the document-topic distribution matrix and the topic-word distribution matrix. The factors that influence one iteration are thus the size of the dataset (documents and vocabulary) and the number of topics. One iteration of TWLLDA, like one iteration of LDA, completes a round of topic assignment over all documents; unlike LDA, however, the TWLLDA model must, in addition to the document-topic and topic-word distribution matrices, also perform Gibbs sampling on the document-label-topic correlation matrix and the word-label-topic correlation matrix, so two kinds of Gibbs sampling are carried out simultaneously and four matrices are maintained in total. The factors that influence one iteration of the TWLLDA model are therefore the dataset, the number of topics, the number of document labels and the number of word labels.

Table 2 records the time required for each iteration of the experimental models.

Table 2. Comparison of model iteration times

Analysis of the experimental data shows that the single-iteration time of a model generally grows linearly with the number of topics. Since LDA is the most basic model and introduces neither word features nor labels, it has the shortest running time and the fastest iterations, whereas the LF-LDA model, which introduces word embeddings and thereby increases model complexity, has a particularly long iteration time.

Fig. 11 shows the iteration times of the models on the 20NewsGroups dataset; as expected, the iteration times increase on 20NewsGroups, which contains more text. Although the TWLLDA model increases the amount of data to be computed compared with the standard LDA model, its iteration time is not particularly long thanks to the sparsity of the external data, so it shows a time advantage over the DMR model and the WF-LDA model.

Next, the difference between the TWLLDA model and the LDA model is evaluated in terms of topic coherence. Topic coherence measures the semantic consistency of the set of words belonging to the same topic in a topic model. High topic coherence indicates that the words within a topic share similar semantics and the topic model classifies well; low topic coherence indicates that the internal expression of a topic is fragmented and the topic classification is poor. The present invention uses Normalised Pointwise Mutual Information (NPMI) to measure the coherence of each topic in the topic model, and its formula is shown in formula (1-21).

With this formula, the coherence score of topic k can be computed from the T most frequent words of the topic. P(wi) is the probability that word wi occurs, and p(wi, wj) is the probability that words wi and wj occur together within a sliding window. In the experiments of the present invention, the coherence score of each topic k is computed from the 10 most frequent words of that topic. The Chinese novels are divided by chapter and length into short-text and long-text datasets, the number of topics is set to 100, and the topic coherence of all topics and of the top-20 topics is analysed. The experimental results are shown in the table below.
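Formula (1-21) is not reproduced in the extracted text; the sketch below uses the standard NPMI definition, NPMI(wi, wj) = log(p(wi, wj) / (p(wi) p(wj))) / (-log p(wi, wj)), estimated from sliding-window co-occurrence counts and averaged over all pairs of a topic's top words (illustrative names):

```python
import numpy as np
from itertools import combinations

def npmi_coherence(top_words, docs, window=10, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.

    top_words : the T most frequent words of the topic (here T = 10)
    docs      : list of token lists; probabilities are estimated from sliding windows
    """
    windows = []
    for toks in docs:
        if len(toks) <= window:
            windows.append(set(toks))
        else:
            windows.extend(set(toks[i:i + window]) for i in range(len(toks) - window + 1))
    n_win = len(windows)

    def p(*words):
        return sum(1 for w in windows if all(x in w for x in words)) / n_win

    scores = []
    for wi, wj in combinations(top_words, 2):
        p_ij = p(wi, wj)
        if p_ij == 0:
            scores.append(-1.0)                     # never co-occur: minimum NPMI
        else:
            pmi = np.log(p_ij / (p(wi) * p(wj) + eps))
            scores.append(pmi / (-np.log(p_ij)))
    return float(np.mean(scores))
```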

Under the ALL TOPICS condition, the overall topic coherence is lower than under the TOP 20 TOPICS condition. This is because topics have an aggregation effect: the top 20% of topics account for roughly 80% of the lexical information, so the leading topics have higher coherence (richer word co-occurrence information), while the trailing topics have lower coherence (sparser word co-occurrence information).

Analysis of the Corpus Experiments:

Table 3. Comparison of topic coherence between the TWLLDA model and other models

The data in Table 3 show that the topic coherence of the proposed TWLLDA model is on the whole higher than that of the standard LDA model, indicating that TWLLDA's data-augmentation technique also helps to improve topic coherence.

When only the 20 topics with the largest vocabularies take part in the computation, topic coherence improves markedly, because topics with more words also provide more word co-occurrence samples. The improvement on the medium-short dataset is on average 2 points higher than on the medium-long dataset, which is less than the 4 points of the all-topic computation, indicating that the TWLLDA model gives greater assistance to topics with fewer topic words.

When all topics are included in the topic-coherence computation, the TWLLDA model brings an improvement on all three datasets: long Chinese documents, short Chinese documents and NLTK. Compared with the LDA model, TWLLDA improves the medium-short datasets by 4 points on average and the medium-long datasets by 3 points on average, which shows that TWLLDA performs better on short documents than on long ones. The topic coherence of short documents is clearly lower than that of long documents; the obvious reason is that short documents contain insufficient word co-occurrence information, which is also why data augmentation has more room to help on short documents. When only the 20 topics with the largest vocabularies take part in the computation, topic coherence improves markedly, because topics with more words also provide more word co-occurrence samples; here the improvement on the medium-short dataset is on average 1 point higher than on the medium-long dataset, which is less than the 4 points of the all-topic computation, again indicating that the TWLLDA model gives greater assistance to topics with fewer topic words.

On the NLTK dataset, because the NLTK data are sufficient, all models achieve high topic coherence; as in the earlier analysis, the TWLLDA model has the highest topic coherence and the LDA model the lowest, while the DMR model and the WF-LDA model perform essentially the same on NLTK, indicating that they improve topic coherence to a similar degree. However, WF-LDA uses word meta-information and its computation cost is higher, so on balance the DMR model yields the greater benefit, as illustrated in the running-time experiments. From the experiments in Table 4 it can be concluded that TWLLDA's data augmentation effectively improves the metrics that are sensitive to the amount of data, and does so better than LDA, DMR and WF-LDA.

The present invention proposes the topic model TWLLDA, which converts the meta-information of the corpus into word labels and document labels and computes distinct Dirichlet prior parameters for the document-topic distribution and the topic-word distribution. By merging document meta-information labels into the generative process, the model can both analyse labels together with topics and assign labels to each topic, which gives TWLLDA better interpretability. In the implementation, the document and word meta-information is first converted into labels, the corresponding document-document-label vectors and word-word-label vectors are generated, and the topic-document-label correlation matrix and the topic-word-label correlation matrix are obtained by setting the hyperparameters, so that the meta-information is incorporated into the model.

The method of the present invention was evaluated in multiple groups of experiments with perplexity and topic coherence as metrics; the experiments show that, compared with models such as LDA, the meta-information-based TWLLDA model performs better under the same conditions.

The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (1)

1. A corpus topic distribution calculation method based on meta-information, characterized by comprising the following steps:
step 1: inputting a corpus to be calculated, obtaining document meta-information and word meta-information of the corpus, and setting the maximum number of iterations;
step 2: converting the document meta-information and word meta-information of the corpus into document labels and word labels; generating a document-document-label vector matrix Fd,l from the document labels; generating a word-word-label vector matrix from the word labels;
step 3: assigning the parameter λl,k corresponding to document label l and topic k with a Gamma distribution whose hyperparameter is u0, λl,k~Gamma(u0,u0), to obtain the topic-document-label correlation matrix; assigning the parameter corresponding to word label l* and topic k with a Gamma distribution whose hyperparameter is v0, to obtain the topic-word-label correlation matrix; wherein the total number of topics is K, the total number of document labels is L, and the total number of word labels is L*;
step 4: computing the parameter βk,v corresponding to topic k and word v from the elements of the word-word-label vector matrix; computing the parameter αd,k corresponding to document d and topic k, wherein fd,l is an element of the document-document-label vector matrix Fd,l; counting the number of times nk,v each word v is assigned to topic k; counting the number of words md,k assigned to topic k in each document d;
step 5: sampling the parameter qd through qd~Beta(αd,·, md,·), wherein αd,· is the linear sum of the αd,k values over the topics of a document and md,· is the number of words in the document;
step 6: sampling the parameter td,k through the CRP process, with αd,k as the concentration and md,k as the number of customers;
step 7: sampling the parameter λ′l,k from the Gamma random function, λ′l,k~Gamma(μ′,μ″), and updating the parameter αd,k;
step 8: sampling, for each topic k, a Beta-distributed auxiliary parameter from Beta(βk,·, nk,·), wherein βk,· is the sum of the relevance of topic k to every word and nk,· is the total number of words contained in topic k;
step 9: sampling the parameter t′k,v through the CRP process, with βk,v as the concentration and nk,v as the number of customers;
step 10: sampling the corresponding parameter from the Gamma random function and updating the parameter βk,v;
step 11: judging whether the maximum number of iterations is reached; if the maximum number of iterations is not reached, returning to step 5; otherwise, outputting the topic distribution of the corpus.
CN202011124613.8A 2020-10-20 2020-10-20 Corpus topic distribution calculation method based on meta information Active CN112287666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011124613.8A CN112287666B (en) 2020-10-20 2020-10-20 Corpus topic distribution calculation method based on meta information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011124613.8A CN112287666B (en) 2020-10-20 2020-10-20 Corpus topic distribution calculation method based on meta information

Publications (2)

Publication Number Publication Date
CN112287666A CN112287666A (en) 2021-01-29
CN112287666B true CN112287666B (en) 2023-07-25

Family

ID=74424062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011124613.8A Active CN112287666B (en) 2020-10-20 2020-10-20 Corpus topic distribution calculation method based on meta information

Country Status (1)

Country Link
CN (1) CN112287666B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119441477B (en) * 2024-10-08 2025-08-12 湖北三峡职业技术学院 Multidimensional clustering and topic evolution method for network big data text analysis

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063030A (en) * 2018-07-16 2018-12-21 南京信息工程大学 A method of theme and descriptor are implied based on streaming LDA topic model discovery document

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9251250B2 (en) * 2012-03-28 2016-02-02 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for processing text with variations in vocabulary usage
US9575952B2 (en) * 2014-10-21 2017-02-21 At&T Intellectual Property I, L.P. Unsupervised topic modeling for short texts
US11238348B2 (en) * 2016-05-06 2022-02-01 Ebay Inc. Using meta-information in neural machine translation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063030A (en) * 2018-07-16 2018-12-21 南京信息工程大学 A method of theme and descriptor are implied based on streaming LDA topic model discovery document

Also Published As

Publication number Publication date
CN112287666A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN109344236B (en) A problem similarity calculation method based on multiple features
CN109858028B (en) Short text similarity calculation method based on probability model
CN104834747B (en) Short text classification method based on convolutional neural networks
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN111832289B (en) Service discovery method based on clustering and Gaussian LDA
CN103970729B (en) A kind of multi-threaded extracting method based on semantic category
CN111353310A (en) Artificial intelligence-based named entity recognition method, device and electronic device
CN108549634A (en) A kind of Chinese patent text similarity calculating method
CN110543564B (en) Domain Label Acquisition Method Based on Topic Model
CN107180026B (en) A method and device for learning event phrases based on word embedding semantic mapping
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN108280164B (en) A short text filtering and classification method based on category-related words
CN102663139A (en) Method and system for constructing emotional dictionary
CN112101027A (en) Chinese Named Entity Recognition Method Based on Reading Comprehension
CN114706972B (en) An automatic generation method of unsupervised scientific and technological information summaries based on multi-sentence compression
CN110287309A (en) A Method for Quickly Extracting Text Summarization
CN108519971A (en) A Cross-lingual News Topic Similarity Comparison Method Based on Parallel Corpus
CN117909484B (en) Construction method and question answering system of Term-BERT model for construction information query
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN110489554A (en) Property level sensibility classification method based on the mutual attention network model of location aware
CN104317837A (en) Cross-modal searching method based on topic model
CN112287666B (en) Corpus topic distribution calculation method based on meta information
CN114492390A (en) Data expansion method, device, device and medium based on keyword recognition
CN107784112A (en) Short text data Enhancement Method, system and detection authentication service platform
CN108256055B (en) Topic modeling method based on data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant