CN113515638B - Scholar clustering-oriented research interest mining method, device and storage medium

Info

Publication number: CN113515638B
Application number: CN202111072396.7A
Authority: CN
Inventors: 寇菲菲; 王文东; 杜军平; 李昂; 薛哲; 梁美玉
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2021-09-14
Filing date: 2021-09-14
Publication date: 2021-12-07
Anticipated expiration: 2041-09-14
Also published as: CN113515638A

Abstract

The invention provides a research interest mining method, device and storage medium oriented to scholar clustering. The method includes the following steps: constructing an academic metadata set based on multi-source scholar-related academic data; inputting at least part of the academic metadata as input data into a pre-established research interest mining model, and obtaining a semantic representation of scholar interest by sampling the topic model; and clustering scholars based on the obtained semantic representation to obtain scholar clustering results. The research interest mining model shares one topic distribution among the data of scholars that come from the same data source and belong to the same professional field. In the model, the professional field-topic distribution is modeled as a Dirichlet distribution, and the topic-English word, topic-Chinese word, and topic-scholar distributions are modeled as multinomial distributions.

Description

Research interest mining method, device and storage medium for scholar clustering

Technical Field

The invention relates to the technical field of big data, and in particular to a research interest mining method, device and storage medium oriented to scholar clustering.

Background

Academic data such as scholars, research projects, and papers each belong to a professional field. For example, some data belong to software engineering and some to artificial intelligence. Research content differs across professional fields, while data in the same field often share a common topic distribution. Discovering scholars' research interests from their academic data, and clustering scholars according to those interests, is important for many tasks, such as selecting collaborators for scholars, reviewers for journals, and experts for governments. Unlike general data, academic data related to scholars' interests has distinctive properties. First, it is multi-source: both the funded projects a scholar leads and the papers the scholar publishes reflect research interests. Second, it is multilingual: papers and grant applications are usually written in the scholar's native language or in English. In addition, academic data grows over time.

Although many methods address the user clustering problem, most apply only to single-source data. For example, a User Clustering Topic model (UCT) or a deep learning method can represent users as vectors, after which a typical clustering algorithm clusters the users. UCT, for instance, models user interests from a single data source and, by slicing time, obtains semantic representations of user interests in different time slices. The K-means clustering algorithm then clusters these representations to produce the final user clusters. This method uses only one data source and only one language, so the quality of its representations is limited when it is applied to scholars' interests. In addition, K-means has its own limitations, in particular the need to set K in advance, which limits the robustness of the clustering.

Replacing single-source data with multi-source data allows user interests to be mined and represented more comprehensively and accurately, and the effectiveness of clustering depends on the quality of the semantic representation of user interests. However, most existing studies of scholars use only papers and ignore research project data. Furthermore, multi-source data cannot simply be merged, because the volume of data differs greatly between sources: a scholar typically leads only a few funded projects but publishes tens or even hundreds of times as many papers. If data from different sources is simply mixed, the influence of the small source is swamped by the large one. Integrating information from multi-source data is therefore a challenge.

Another challenge in clustering scholars is how to fully exploit the rich semantics of multilingual data. For a large multilingual corpus, translating everything into one language is impractical, because translation introduces errors and costs time and effort. Typical multilingual methods are based on probabilistic topic models or word embeddings, but they require strong links between languages, such as sentence pairs, word pairs, or document pairs with translation correspondence. Academic data that scholars produce in different languages lacks such explicit correspondence. To make full use of multilingual academic data, a bridge connecting the languages must therefore be found.

For example, the User Collaborative Interest Tracking model (UCIT) tracks user interests with data from different sources. It mines user interests from both the user's own data and the data of the user's followers, which improves considerably on using a single source. Although this model uses multi-source data, it uses only one language and does not model users in several languages at once. It also uses K-means in the user clustering stage and is therefore subject to the same limitations.

In addition, like everyday interests, scholars' research interests change over time, so these changes must be taken into account when clustering scholars. Existing dynamic representation models fall into two categories: one treats time as continuous and processes data online; the other treats time as discrete and processes data in batches by time slice. However, neither kind can track scholars' research interests from academic data. On the one hand, paper publication times and project application times are not continuous, which makes online processing unsuitable for academic data. On the other hand, existing batch methods apply only to single-source or monolingual data, so they cannot be used directly to obtain dynamic semantic representations of scholars' research interests.

How to use multi-source, multilingual academic data effectively to achieve accurate and dynamic scholar clustering, and thereby mine research interests, remains an open problem.

Summary of the Invention

In view of this, the purpose of the present invention is to provide a research interest mining method, device and storage medium for scholar clustering, so as to solve one or more problems in the prior art.

One aspect of the present invention provides a research interest mining method for scholar clustering, the method comprising the following steps:

constructing an academic metadata set based on multi-source scholar-related academic data, wherein each piece of academic metadata in the set includes professional field information, data source information, scholar information, and text content information;

inputting at least part of the academic metadata in the constructed set as input data into a pre-established research interest mining model, and obtaining a semantic representation of scholar interest by sampling the topic model, the semantic representation including a professional field-topic distribution, a topic-English word distribution, a topic-Chinese word distribution, and a topic-scholar distribution; wherein the research interest mining model shares one topic distribution among the data of scholars that come from the same data source and belong to the same professional field, and wherein, in the model, the professional field-topic distribution is modeled as a Dirichlet distribution and the topic-English word, topic-Chinese word, and topic-scholar distributions are modeled as multinomial distributions; and

clustering scholars based on the obtained semantic representation of scholar interest to obtain scholar clustering results.

In some embodiments of the present invention, the academic data includes funded project data and paper data; the professional field-topic distribution includes a professional field-topic distribution of research projects and a professional field-topic distribution of paper data.

In some embodiments of the present invention, the method further includes a step of constructing the research interest mining model, which comprises: determining the professional field to which a piece of academic metadata belongs; collaboratively sampling a topic from the topic distribution of research projects in that field and the topic distribution of papers in that field, to obtain the topic of the academic metadata; and generating English words, Chinese words, and scholars according to the topic-English word distribution, the topic-Chinese word distribution, and the topic-scholar distribution.
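The collaborative topic draw described above can be illustrated with a small sketch. The text does not give the exact way the two topic distributions are combined, so the convex mixture used here (a weight `balance` on the project distribution and `1 - balance` on the paper distribution) is only an assumption, and the names `sample_topic`, `theta_project`, `theta_paper`, and `balance` are hypothetical.

```python
import random


def sample_topic(theta_project, theta_paper, balance, rng):
    """Draw one topic index from a weighted mix of two topic distributions.

    ASSUMPTION: the mix is a convex combination controlled by `balance`;
    the source does not reproduce the actual combination rule.
    """
    mixed = [
        balance * p_proj + (1.0 - balance) * p_paper
        for p_proj, p_paper in zip(theta_project, theta_paper)
    ]
    # random.choices draws an index with probability proportional to its weight.
    return rng.choices(range(len(mixed)), weights=mixed, k=1)[0]


rng = random.Random(0)
theta_project = [1.0, 0.0, 0.0]  # toy field-topic distribution for projects
theta_paper = [0.0, 0.0, 1.0]    # toy field-topic distribution for papers

only_project = {sample_topic(theta_project, theta_paper, 1.0, rng) for _ in range(20)}
only_paper = {sample_topic(theta_project, theta_paper, 0.0, rng) for _ in range(20)}
```

With `balance` at 1.0 every draw comes from the project distribution, and at 0.0 every draw comes from the paper distribution; intermediate values let the small source keep some influence next to the large one.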

In some embodiments of the present invention, constructing the academic metadata set includes: judging the source of each piece of academic metadata to determine whether it is funded project data or paper data; determining the scholar to whom the current academic metadata belongs; extracting the text information of each funded project or paper; determining whether that text information is Chinese or English; and determining the time period to which the current academic metadata belongs and assigning it to the academic metadata subset for that period, thereby constructing an academic metadata set comprising the subsets.
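The construction steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (dicts with `source`, `scholar`, `text`, and `year` keys), the function names, and the language check (any CJK ideograph means Chinese) are all assumptions made for the example.

```python
from collections import defaultdict


def detect_language(text):
    """Return "zh" if the text contains a CJK unified ideograph, else "en".

    ASSUMPTION: a simple character-range test stands in for whatever
    language check the actual system uses.
    """
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


def build_subsets(records, period_years=1):
    """Group records into per-period subsets and tag each with its language."""
    subsets = defaultdict(list)
    for rec in records:
        # Start year of the period that contains this record.
        period = (rec["year"] // period_years) * period_years
        subsets[period].append({**rec, "lang": detect_language(rec["text"])})
    return dict(subsets)


# Hypothetical sample records.
records = [
    {"source": "project", "scholar": "scholar_001", "text": "主题模型研究", "year": 2017},
    {"source": "paper", "scholar": "scholar_001", "text": "Topic models for clustering", "year": 2021},
]
subsets = build_subsets(records, period_years=5)
```

Each key of `subsets` is the first year of a period, and each value is the academic metadata subset for that period.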

In some embodiments of the present invention, inputting at least part of the academic metadata in the constructed set as input data into the pre-established research interest mining model and obtaining the semantic representation of scholar interest by sampling the topic model includes: sampling the professional field-topic distribution of research projects for academic metadata that comes from research projects and belongs to a predetermined professional field; sampling the professional field-topic distribution of paper data for academic metadata that comes from papers and belongs to a predetermined professional field; for each predetermined topic, sampling the corresponding topic-Chinese word distribution, topic-scholar distribution, and topic-English word distribution; determining, for the text content of each piece of academic metadata, whether it is Chinese or English; if the text content is Chinese, extracting biterms (word pairs) and, for each biterm, sampling the topic of the Chinese biterm, sampling its two words independently, and sampling the author of the biterm; and if the text content is English, for each word, sampling the English topic, sampling the English word, and sampling the author of the English word.
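The biterm extraction step for Chinese text can be illustrated as below. The text does not define how biterms are formed; this sketch assumes the usual convention for short texts, in which every unordered pair of tokens in one text is a biterm, and it assumes the text has already been segmented into words by some external tool. The function name `extract_biterms` is hypothetical.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of tokens taken from one short text.

    ASSUMPTION: a biterm is any two tokens co-occurring in the same text;
    each pair is sorted so that (a, b) and (b, a) map to the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Hypothetical pre-segmented title: "topic", "model", "clustering".
tokens = ["主题", "模型", "聚类"]
biterms = extract_biterms(tokens)
```

A text with n tokens yields n(n-1)/2 biterms, which is why the approach suits short fields such as titles and keywords better than full texts.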

In some embodiments of the present invention, inputting at least part of the academic metadata in the constructed set as input data into the pre-established research interest mining model and obtaining the semantic representation of scholar interest by iteratively sampling the topic model further includes: inputting the academic metadata of the subsets for different time periods into the research interest mining model to obtain semantic representations of scholar interest at different times; and clustering the obtained representations includes: clustering the representations obtained at different times with a single-pass clustering algorithm.

In some embodiments of the present invention, the parameters to be estimated in the research interest mining model are inferred by Gibbs sampling;

The sampling of Chinese biterms uses the following formula:

[formula not reproduced in the source text]

The sampling of each English word uses the following formula:

[formula not reproduced in the source text]

where Z denotes the total number of topics;

[…] denotes the Chinese word i under the Chinese topic Z^b;

[…] denotes the set of all Chinese topics except that of the i-th word;

[…] denotes the English word i under the English topic Z^e;

[…] denotes the set of all English topics except that of the i-th word;

[…] denotes the total number of Chinese and English words, excluding word i, that belong to topic z in the research project data under the p-th professional field;

[…] denotes the total number of Chinese and English words, excluding word i, that belong to topic z in the paper data under the p-th professional field;

[…] denote, respectively, the number of occurrences of the Chinese words […] in all research project and paper data under the Chinese topic Z^b of topic z, excluding Chinese word i;

[…] denotes the number of occurrences, excluding word i, of Chinese and English words belonging to scholar s_i in all research project and paper data under topic z;

[…] denotes the number of occurrences of the English word […] in all research project and paper data under the English topic Z^e of topic z, excluding English word i;

[…] and […] are, respectively, the topic distribution of research project data and the topic distribution of paper data;

[…] is the topic-English word distribution, which follows a Dirichlet distribution;

[…] is the topic-scholar distribution, which follows a Dirichlet distribution;

[…] is the topic-Chinese word distribution, which follows a Dirichlet distribution;

R denotes the number of distinct scholars, B the set of biterms, S the set of scholars, W^b the number of distinct Chinese words, and W^e the number of distinct English words;

[…] is the balance factor;

After the research interest mining model has been iterated, the following estimated parameters are obtained:

[five formulas not reproduced in the source text]

where

[…] denotes the professional field-topic distribution of research projects for professional field p and topic z;

[…] denotes the professional field-topic distribution of paper data for professional field p and topic z;

[…] denotes the total number of Chinese and English words belonging to topic z in the research project data under the p-th professional field;

[…] denotes the total number of Chinese and English words belonging to topic z in the paper data under the p-th professional field;

[…] denotes the number of occurrences of Chinese words in all research project and paper data under the z^b-th topic;

[…] denotes the topic-Chinese word distribution of Chinese biterms with topic z;

[…] denotes the number of occurrences of English words in all research project and paper data under the z^e-th topic;

[…] denotes the topic-English word distribution of the English word w^e with topic z;

[…] denotes the topic-scholar distribution for topic z and scholar s;

[…] denotes the number of occurrences of Chinese and English words belonging to scholar s in all research project and paper data under topic z.

In some embodiments of the present invention, the method further includes: using the topic distribution of the previous period as the prior distribution of the current period, and calculating the prior parameters of the current period with the following formula:

[formula not reproduced in the source text]

where […] is the decay factor;

[…] denotes the total number of Chinese and English words belonging to topic […] in the research project data under the p-th professional field at time t-1;

[…] denotes the professional field-topic distribution of research project data at time t-1;

[…] denotes the professional field-topic distribution of research project data at time […];

[…] denotes the total number of Chinese and English words belonging to topic […] in the paper data under the […]-th professional field at time […];

[…] denotes the professional field-topic distribution of paper data at time […];

[…] denotes the professional field-topic distribution of paper data at time […];

[…] denotes the number of occurrences of the Chinese word […] in all research project and paper data under the […]-th topic at time […];

[…] denotes the topic-Chinese word distribution of all research project and paper data at time […];

[…] denotes the topic-Chinese word distribution of all research project and paper data at time […];

[…] denotes the number of occurrences of the English word […] in all research project and paper data under the […]-th topic at time […];

[…] denotes the topic-English word distribution of all research project and paper data at time […];

[…] denotes the topic-English word distribution of all research project and paper data at time […];

[…] denotes the number of occurrences of the scholar […] in all research project and paper data under the […]-th topic at time […];

[…] denotes the topic-scholar distribution of all research project and paper data at time […];

[…] denotes the topic-scholar distribution of all research project and paper data at time […].
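To make the idea of carrying the previous period forward as a prior more concrete, here is a small sketch. The actual formula is not reproduced in the text, so the form used here is an assumption: each prior parameter for the current period is a fixed base value plus the previous period's count scaled by the decay factor. The names `decayed_prior`, `prev_counts`, `decay`, and `base_prior` are hypothetical.

```python
def decayed_prior(prev_counts, decay, base_prior):
    """Turn the counts of period t-1 into Dirichlet pseudo-counts for period t.

    ASSUMPTION: prior_t[k] = base_prior + decay * count_{t-1}[k]. This is
    one common way to build an evolving prior, not the formula of the
    source, which is not shown.
    """
    return [base_prior + decay * count for count in prev_counts]


# Toy counts of words assigned to three topics in the previous period.
prev_counts = [10, 0, 5]
prior = decayed_prior(prev_counts, decay=0.5, base_prior=0.1)
```

Under this assumption a decay factor of 0 discards history entirely, while larger values make the current period's topic distributions stay closer to those of the previous period.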

Another aspect of the present invention provides a research interest mining device for scholar clustering. The device includes a processor and a memory; the memory stores computer instructions, and the processor executes the computer instructions stored in the memory. When the computer instructions are executed by the processor, the device implements the steps of the method described above.

In yet another aspect of the present invention, a computer storage medium is provided on which a computer program is stored; when the program is executed by a processor, the steps of the method described above are implemented.

The research interest mining method and device for scholar clustering provided by the embodiments of the present invention can make effective use of multi-source, multilingual academic data to obtain high-quality semantic representations of scholars' interests through the research interest mining model, improve clustering accuracy, and achieve accurate and dynamic scholar clustering, thereby realizing research interest mining.

Further, the present invention can obtain scholar clustering results for different time periods as time changes.

Additional advantages, objects, and features of the present invention will be set forth in part in the description that follows, and in part will become apparent to those of ordinary skill in the art upon study of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description, the claims, and the accompanying drawings.

Those skilled in the art will appreciate that the objects and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objects achievable by the present invention will be more clearly understood from the following detailed description.

Description of Drawings

The accompanying drawings described here provide a further understanding of the present invention and constitute a part of the present application; they do not limit the present invention. In the drawings:

FIG. 1 is a schematic flowchart of a research interest mining method for scholar clustering according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a research interest mining model in another embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and accompanying drawings. The exemplary embodiments and their descriptions explain the present invention but do not limit it.

It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the structures and/or processing steps closely related to the solution according to the present invention, and omit other details of little relevance to it.

It should be emphasized that the term "comprising/including", when used herein, refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.

Existing user interest mining methods cannot make effective use of multi-source, multilingual data and therefore cannot achieve accurate, dynamic scholar clustering. To address this, the present invention uses scholars' multi-source, multilingual information and proposes a new research interest mining model (Research Interest Mining Module, RIMM), which obtains high-quality semantic representations of scholars' research interests and thereby enables accurate scholar clustering.

The RIMM proposed by the present invention can dynamically cluster scholars from academic data. In RIMM, data from different sources is assumed to follow different topic distributions, and, so that data from different sources can be used together to obtain the semantic representation of scholars' interests, the invention introduces a balance factor for collaboratively drawing topics from those topic distributions. To bridge the vocabulary gap between languages, text in different languages that belongs to the same professional field and comes from the same data source is set to share one topic distribution, which captures the common semantics of the languages and yields the model's topic distributions. In this way the invention makes full use of multi-source, multilingual data to mine scholars' interests accurately. In addition, the invention takes the topic distributions of historical moments as prior knowledge and runs RIMM on successive time slices, obtaining the semantic representation of scholars' interests at each moment; scholars are then clustered on these representations with a single-pass clustering algorithm.
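The single-pass clustering mentioned above can be sketched as follows. The text names the algorithm but gives no details, so the choices here are assumptions: cosine similarity as the measure, a fixed `threshold` for joining a cluster, and a running-mean centroid. Unlike K-means, the number of clusters is not fixed in advance.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(vectors, threshold):
    """Assign each vector, in order, to the most similar existing cluster,
    or open a new cluster if no centroid reaches `threshold`.

    ASSUMPTION: similarity measure, threshold rule, and centroid update
    are illustrative choices, not taken from the source.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": list(vec), "members": [idx]})
        else:
            n = len(best["members"])
            # Update the centroid as the running mean of its members.
            best["centroid"] = [(c * n + x) / (n + 1) for c, x in zip(best["centroid"], vec)]
            best["members"].append(idx)
    return [cluster["members"] for cluster in clusters]


# Toy two-topic interest vectors for three scholars.
interest_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
groups = single_pass(interest_vectors, threshold=0.9)
```

Because each vector is visited once, the result depends on input order; that is the usual trade-off of single-pass clustering in exchange for not having to choose the number of clusters.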

Research interests are reflected in many ways. In terms of data sources, both research project information (such as applications and final reports) and papers reflect a scholar's interests. Academic data is also multilingual. The invention therefore maps users, project information, papers, Chinese, and English into one topic space and represents a scholar's interests as a multi-dimensional (e.g. K-dimensional) topic vector, in which each of the K components reflects the probability that the scholar is interested in the corresponding topic.

FIG. 1 is a schematic flowchart of a research interest mining method for scholar clustering in an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps S110-S130:

Step S110: constructing an academic metadata set based on multi-source scholar-related academic data.

Each piece of academic metadata in the academic metadata set may include the following information: professional field information, data source information, scholar information, and text content information.

Both research projects and papers serve as research material for the present invention. The full text of a research project application or final report is hard to obtain, but its title, keywords, and abstract are easy to obtain. For papers, the full text is obtainable but costly to collect at scale, while the title, keywords, and abstract are sufficient for mining a scholar's interests. Therefore, in this embodiment, research project (e.g. funded project) data and paper data are used as the multi-source scholar-related academic data; that is, a piece of academic metadata may come from either a research project or a paper. On this basis the concept of academic metadata is defined, and a piece of academic metadata may include professional field information, data source information, scholar information, and text content information. As an example, academic metadata can be represented as a quadruple m = (p_m, d_m, s_m, t_m), where p_m denotes the professional field, d_m the data source, s_m the scholar, t_m the text content, and m indexes the m-th piece of academic metadata. The text content includes, but is not limited to, the title, keywords, and/or abstract of a funded project or paper.
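As a minimal sketch, the quadruple m = (p_m, d_m, s_m, t_m) could be held in a small record type like the one below. The class name, field names, and sample values are illustrative assumptions; only the four components themselves come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcademicMetadata:
    """One piece of academic metadata, m = (p_m, d_m, s_m, t_m)."""

    field: str    # p_m: professional field (e.g. a funding application code)
    source: str   # d_m: data source, here "project" or "paper"
    scholar: str  # s_m: the scholar the item belongs to
    text: str     # t_m: title, keywords and/or abstract


# Hypothetical example record; all values are made up for illustration.
record = AcademicMetadata(
    field="F02",
    source="paper",
    scholar="scholar_001",
    text="Topic models for scholar clustering",
)
```

Making the record immutable (`frozen=True`) keeps each metadata item safe to share between the per-period subsets built later.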

In some embodiments of the present invention, academic data of Chinese scholars, such as funded project data and paper data, are collected as the academic data, and the application codes of the National Natural Science Foundation of China are then used to divide the scholars' professional fields, yielding the professional field information. Specifically, the present invention first obtains the application code of a research project hosted by a scholar and sets that code as the scholar's professional field for the duration of the project.

In some embodiments of the present invention, the academic metadata set may include multiple academic metadata subsets, and each subset may contain the academic metadata of a different time period.

Step S120: input at least part of the academic metadata in the constructed academic metadata set as input data into a pre-established research interest mining model (RIMM), and obtain the scholar interest semantic representation by sampling this topic model.

The scholar interest semantic representation may include a professional field-topic distribution, a topic-English word distribution, a topic-Chinese word distribution and a topic-scholar distribution. The professional field-topic distribution may in turn include the professional field-topic distribution of research projects and the professional field-topic distribution of paper data.

In the RIMM constructed in the embodiment of the present invention, data of scholars that come from the same data source and belong to the same professional field are set to share one topic distribution, while data from different sources belong to different topic distributions. For example, texts in different languages that belong to the same professional field and come from the same data source are set to share the same topic distribution. This yields common semantics across languages when the topic distributions of the model are generated, thereby bridging the lexical gap between languages.

In addition, in the RIMM constructed in the embodiment of the present invention, the professional field-topic distributions are modeled as Dirichlet distributions, and the topic-English word distribution, the topic-Chinese word distribution and the topic-scholar distribution are modeled as multinomial distributions. Fig. 2 shows an example graphical model of the research interest mining model (also called the social network multi-feature topic model) proposed by the present invention. As shown in Fig. 2, the professional field-topic distribution of research project data and the professional field-topic distribution of paper data are each modeled as a Dirichlet distribution with its own parameter. The topic-English word distribution, the topic-scholar distribution and the topic-Chinese word distribution are each modeled as a multinomial distribution with its own parameter. It follows that the topic distribution obtained by the research interest mining model is a mixture over research projects, papers, Chinese and English.

The symbols used in Fig. 2 and their descriptions are shown in Table 1 below.

Table 1. Symbols and their descriptions

In this embodiment of the present invention, either all of the academic metadata in the constructed academic metadata set or only part of it may be input as input data into the pre-established research interest mining model (RIMM). The part may be, for example, one or more academic metadata subsets of the academic metadata set.

Step S130: perform scholar clustering based on the obtained scholar interest semantic representation to obtain a scholar clustering result.

As described above, the research interest mining model yields a semantic representation of each scholar. That representation is static, however: it describes the topics of a single period only. To cluster scholars according to their interest representations in different periods, the present invention therefore tracks how scholars' interests change over time. Although interests change, they also show a degree of persistence. For this reason the present invention does not simply slice the corpus, but uses a scholar's historical interests to help infer the topic distribution of the current period. To preserve the continuity of a scholar's interests across periods, the topic distribution of the previous period (e.g. period t-1) is used as the prior distribution of the current period (e.g. period t).

In the embodiment of the present invention, step S130 clusters the scholars with a single-pass clustering algorithm.

The single-pass clustering algorithm is a classic method for clustering streaming data. For a data stream whose items arrive one after another, the algorithm processes one item at a time in input order. Depending on how well the current item matches the existing clusters, it either assigns the item to an existing cluster or creates a new cluster, which gives incremental and dynamic clustering of streaming data. In the embodiment of the present invention, applying single-pass clustering to the scholar interest semantic representations obtained at different times yields dynamic scholar clustering results and overcomes the limitations of the K-means algorithm used in traditional scholar clustering.
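The single-pass procedure described above can be sketched as follows. This is an illustration under stated assumptions: the patent does not specify the matching measure or the threshold, so cosine similarity against a running-mean centroid and the `threshold` parameter are choices made here for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def single_pass(vectors, threshold=0.8):
    """Assign each vector, in arrival order, to its best-matching cluster
    or open a new cluster. Returns one cluster index per input vector."""
    centroids = []  # running mean vector of each cluster
    sizes = []      # number of members of each cluster
    labels = []
    for v in vectors:
        best, best_sim = -1, -1.0
        for k, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim > best_sim:
                best, best_sim = k, sim
        if best >= 0 and best_sim >= threshold:
            n = sizes[best]
            # Update the running mean of the matched cluster.
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], v)]
            sizes[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(v))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy scholar-topic vectors (made-up numbers).
labels = single_pass([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]],
                     threshold=0.9)
```

Unlike K-means, the number of clusters is not fixed in advance; it grows whenever an arriving scholar does not match any existing cluster closely enough.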

The research interest mining method for scholar clustering described above can effectively use multi-source, multilingual academic data to obtain high-quality scholar interest semantic representations through the research interest mining model. It improves clustering accuracy and achieves accurate, dynamic scholar clustering, thereby realizing research interest mining.

In some embodiments of the present invention, academic metadata may come from either research projects or papers. For each piece of academic metadata, the RIMM is constructed from the academic metadata as follows:

First, the professional field p to which the piece of metadata belongs is obtained.

Then, a topic is sampled collaboratively from two distributions, namely the topic distribution of research project data belonging to professional field p and the topic distribution of paper data belonging to professional field p, to obtain a topic z. These two distributions are, respectively, the professional field-topic distribution of research projects and the professional field-topic distribution of paper data for professional field p.

Then, English words, Chinese words and scholars are generated according to the topic-English word distribution, the topic-Chinese word distribution and the topic-scholar distribution.

In addition, because Chinese data is relatively sparse, the present invention introduces a biterm (double-word) model to alleviate the sparsity: two Chinese words in the same piece of academic metadata are set to share the same topic.

In some embodiments of the present invention, step S110 of constructing the academic metadata set may specifically include the following steps:

(1) Determine the professional field of each piece of academic metadata;

(2) Determine the source of each piece of academic metadata, that is, whether it is funded project data or paper data;

(3) Extract the scholar to whom the current academic metadata belongs. This scholar may be a project host, a paper author, and so on. In other words, this step extracts the host of each research project or the author of each paper, which determines the scholar to whom the academic metadata belongs;

(4) Extract the text information (such as the title, keywords and/or abstract) of each funded project or paper;

(5) Determine whether the text information of the funded project or paper is in Chinese or in English;

(6) Determine the time period to which the academic metadata belongs, and place the current academic metadata into the academic metadata subset corresponding to that period, thereby constructing an academic metadata set that includes multiple academic metadata subsets.
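Steps (5) and (6) above can be sketched in a few lines. This is a minimal illustration, assuming that the presence of any CJK ideograph marks a text as Chinese and that a record's period is its year; the dictionary keys and sample values are hypothetical, and the patent does not prescribe either rule.

```python
from collections import defaultdict


def is_chinese(text):
    """Treat a text as Chinese if it contains any CJK Unified Ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def partition_by_period(records, period_of):
    """Group records into subsets keyed by the period that the
    caller-supplied function period_of(record) returns."""
    subsets = defaultdict(list)
    for r in records:
        subsets[period_of(r)].append(r)
    return dict(subsets)


# Hypothetical records; field codes, scholar ids and years are made up.
records = [
    {"field": "F02", "source": "project", "scholar": "s1",
     "text": "主题模型", "year": 2018},
    {"field": "F02", "source": "paper", "scholar": "s1",
     "text": "topic model", "year": 2019},
    {"field": "F02", "source": "paper", "scholar": "s2",
     "text": "scholar clustering", "year": 2019},
]
subsets = partition_by_period(records, lambda r: r["year"])
```

Passing the period rule in as a function keeps the sketch neutral about how long a period is, since the source leaves that choice open.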

For example, after information on multiple funded projects (research projects) has been crawled, the research project data is preprocessed to obtain academic metadata describing the projects. The hosts of these research projects are taken as the scholars under study, and the text information and validity period of each project are extracted. The papers published by these scholars are then crawled from Baidu Scholar, and the text information (such as title, abstract and keywords) and publication time of each paper are extracted. The academic metadata set can be constructed from the information obtained.

In addition, the preprocessing of academic metadata may include data cleaning of the text, such as word segmentation, stop-word removal and high-frequency word removal, followed by combining in pairs the words that co-occur in each piece of Chinese academic metadata to extract biterms. It may also include obtaining scholar ID numbers and encoding the scholars.
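The pairwise combination that produces biterms can be sketched as below. The sketch assumes the text has already been segmented into tokens by some external segmenter; the function name and the toy stop-word set are illustrative, not taken from the patent.

```python
from itertools import combinations


def extract_biterms(tokens, stop_words=frozenset()):
    """Return every unordered pair of tokens that remain after
    stop-word removal, in the order the tokens appear."""
    kept = [t for t in tokens if t not in stop_words]
    return list(combinations(kept, 2))


# Already-segmented Chinese tokens (toy example).
tokens = ["学者", "的", "兴趣", "聚类"]
biterms = extract_biterms(tokens, stop_words={"的"})
```

A record with n retained words yields n(n-1)/2 biterms, which is how the biterm model extracts more co-occurrence evidence from short, sparse Chinese texts.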

Afterwards, at least part of the data in the constructed academic metadata set is input into the research interest mining model (RIMM); that is, step S120 is executed. More specifically, in step S120 the processed academic metadata set is fed to the proposed RIMM as input data. The number of topics, the number of iterations and the hyperparameters (parameters whose values are set before the learning process starts) may be set empirically. Sampling is performed by executing the RIMM over multiple iterations, which finally yields the topic-related scholar interest semantic representation. This representation includes the professional field-topic distribution of research projects, the topic-English word distribution, the topic-Chinese word distribution and the topic-scholar distribution.

As an example, when the academic metadata set includes academic metadata subsets for different time periods t, the RIMM may be executed over multiple iterations on the subset of each period t separately, which yields the scholar interest representations for the different periods.

In addition, in some embodiments of the present invention, the single-pass clustering algorithm may be run on the scholar interest semantic representation of each period, finally yielding scholar clustering results for multiple periods (e.g. T periods).

In the embodiment of the present invention, the execution of the research interest mining model may include the following steps:

Step S1: for academic metadata that comes from research projects and belongs to professional field p, sample a topic distribution;

Step S2: for academic metadata that comes from papers and belongs to professional field p, sample a topic distribution;

Step S3: for each topic, sample a topic-Chinese word distribution, a topic-scholar distribution and a topic-English word distribution;

Step S4: for the text content of each piece of academic metadata, determine whether it is Chinese or English:

(1) If it is Chinese, extract the biterms, and for each biterm b_i in the set of biterms of the m-th piece of metadata belonging to professional field p (b_i denotes the i-th biterm):

i. sample a topic z^b_i, the topic of the i-th Chinese biterm, from a multinomial distribution governed by the research project and paper topic distributions of professional field p and by a balance factor, where the balance factor adjusts how strongly the scholar's interests are influenced by research projects and by papers;

ii. independently sample the two words of the Chinese biterm;

iii. sample the scholar of the Chinese biterm;

(2) If it is English, for each word:

i. sample a topic z^e_i, the topic of the i-th English word;

ii. sample the English word;

iii. sample the scholar of the English word.
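The generative steps S1 to S4 for a Chinese biterm can be sketched as follows. This is an illustration, not the patent's formulas, which are not reproduced in this text. It makes three labeled assumptions: all priors are symmetric Dirichlet distributions, the balance factor `lam` weights the paper topic distribution against the research project one as a convex combination, and the names `theta_project`, `theta_paper`, `phi_zh` and `psi` are invented for the example.

```python
import random


def dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_biterm(theta_project, theta_paper, lam, phi_zh, psi, rng):
    """Generate one Chinese biterm and its scholar (step S4(1), i to iii)."""
    # Assumed mixing rule: lam is the share taken from the paper distribution.
    mixed = [lam * a + (1.0 - lam) * b
             for a, b in zip(theta_paper, theta_project)]
    z = categorical(mixed, rng)       # i.   one topic shared by both words
    w1 = categorical(phi_zh[z], rng)  # ii.  two words drawn independently
    w2 = categorical(phi_zh[z], rng)
    s = categorical(psi[z], rng)      # iii. scholar of the biterm
    return z, (w1, w2), s


rng = random.Random(0)
Z, W_ZH, S = 3, 5, 4                      # toy sizes
theta_project = dirichlet(0.5, Z, rng)    # S1
theta_paper = dirichlet(0.5, Z, rng)      # S2
phi_zh = [dirichlet(0.5, W_ZH, rng) for _ in range(Z)]  # S3
psi = [dirichlet(0.5, S, rng) for _ in range(Z)]        # S3
z, (w1, w2), s = generate_biterm(theta_project, theta_paper, 0.5,
                                 phi_zh, psi, rng)
```

The English branch would follow the same pattern with a single word per topic draw and a topic-English word distribution in place of `phi_zh`.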

The research interest mining model RIMM has two latent variables, z^b and z^e, and five parameters to be estimated: the professional field-topic distribution of research project data, the professional field-topic distribution of paper data, the topic-Chinese word distribution, the topic-English word distribution and the topic-scholar distribution. In this embodiment of the present invention, a Gibbs sampling method (such as collapsed Gibbs sampling) may be used to infer the parameters to be estimated in the RIMM. Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm used in statistics to approximately draw a sequence of samples from a multivariate probability distribution when direct sampling is difficult.

In the embodiment of the present invention, the sampling formula for the topic of each Chinese biterm is as follows:

(1)

To derive formula (1), the joint distribution is computed first:

(2)

Here, the joint distribution denotes the approximate distribution inferred from the results obtained by the above sampling.

From formula (1) and formula (2), the sampling formula for biterms is obtained as follows:

(3)

where Z denotes the total number of topics. The Chinese-side symbols denote the Chinese word i under the Chinese topic z^b and the set of all Chinese topic assignments excluding the i-th word. The English-side symbols denote the English word i under the English topic and the set of all English topic assignments excluding the i-th word. The count symbols denote numbers of occurrences. The two professional field-topic symbols denote the topic distribution of research project data and the topic distribution of paper data, respectively. The topic-English word distribution, the topic-scholar distribution and the topic-Chinese word distribution each conform to a Dirichlet distribution. The balance factor denotes the proportion of paper information used and adjusts how strongly a scholar's interests are influenced by research projects and by papers. R denotes the number of distinct scholars, B the set of biterms, S the set of scholars, W^b the number of distinct Chinese words, and W^e the number of distinct English words.

In the embodiment of the present invention, a balance factor is designed so that topics are drawn collaboratively from the topic distributions. This allows data from different sources to be used at the same time to obtain the semantic representation of scholars' interests, and it overcomes the problem in the prior art that the influence of a small-sample source is drowned out by a large-sample source.

Similarly, in the embodiment of the present invention, a topic is sampled for each English word, with the following sampling formula:

(4)

From the joint distribution of formula (2), the sampling formula for English words is obtained as follows:

(5)

where the count term denotes the number of occurrences of the English word under topic z in all funded project and paper data, excluding the English word i.

The above sampling rules are executed iteratively until the result converges, finally yielding the following estimated parameters:

(6-1)

(6-2)

(6-3)

(6-4)

(6-5)

where the estimated quantities denote: the professional field-topic distribution of research projects for professional field p and topic z; the professional field-topic distribution of paper data for professional field p and topic z; the topic-Chinese word distribution for Chinese biterms with topic z; the topic-English word distribution for the English word w^e with topic z; and the topic-scholar distribution for topic z and scholar s.
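Formulas (6-1) to (6-5) are not reproduced in this text, so the following is only a sketch of how such estimates are commonly read off after collapsed Gibbs sampling in Dirichlet-multinomial topic models: each row of assignment counts is smoothed by a symmetric prior and normalized. The function name, the `prior` value and the toy counts are assumptions for illustration.

```python
def estimate_distribution(counts, prior):
    """Posterior-mean estimate (n_i + prior) / (sum(n) + len(n) * prior)
    for one row of assignment counts under a symmetric Dirichlet prior."""
    total = sum(counts) + len(counts) * prior
    return [(n + prior) / total for n in counts]


# Toy topic-word counts: 2 topics, 3 words (made-up numbers).
topic_word_counts = [[3, 1, 0], [0, 2, 2]]
phi = [estimate_distribution(row, prior=0.5) for row in topic_word_counts]
```

The same helper applies to any of the five count tables (field-topic for projects, field-topic for papers, topic-Chinese word, topic-English word, topic-scholar), with the prior chosen per table.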

As described above, the research interest mining model yields a semantic representation of each scholar's research interests, but that representation is static: it describes the topics of a single period only. To cluster scholars according to their interest representations at different times, the dynamic change of their interests must be reflected. Although interests change, they also show a degree of persistence. For this reason the present invention does not simply slice the corpus, but uses a scholar's historical interests to help infer the topic distribution at the current time.

Suppose there are a number of scholars in time period t, and that for each scholar the academic messages written by that scholar have been collected. Based on these academic messages, RIMM is used to obtain the semantic representations of the scholars' interests, and scholars with common interests are grouped together as a cluster c. Formally, the clustering is a mapping f from individual scholars to scholar clusters, in which c represents a group of scholars with common interests, and the numbers of scholars and of clusters are both counted within time period t.

To model the dynamic change of scholars' interests, the present invention considers how the historical topic distributions of multi-source, multilingual data can be used to infer the current topic distributions of scholars.

To preserve the continuity of a scholar's interests across periods, the present invention takes the topic distribution of the previous period (e.g. period t-1) as the prior distribution of the current period (e.g. period t). The prior parameters at the current time are therefore calculated as follows:

(7-1)

(7-2)

(7-3)

(7-4)

(7-5)

where the decay factor controls the carry-over: the larger its value, the greater the influence of historical interests on the current topic distribution. The remaining terms are defined as follows.

For research project data, the terms denote the total number of Chinese and English words belonging to a given topic under a given professional field, and the professional field-topic distribution of research project data at time t-1 and at time t.

For paper data, the terms denote the total number of Chinese and English words belonging to a given topic under a given professional field in the paper data, and the professional field-topic distribution of paper data at time t-1 and at time t.

For Chinese words, the terms denote the number of occurrences of a given Chinese word under a given topic in all research project and paper data, and the topic-Chinese word distribution of all research project and paper data at time t-1 and at time t.

For English words, the terms denote the number of occurrences of a given English word under a given topic in all research project and paper data, and the topic-English word distribution of all research project and paper data at time t-1 and at time t.

For scholars, the terms denote the number of occurrences of a given scholar under a given topic in all research project and paper data, and the topic-scholar distribution of all research project and paper data at time t-1 and at time t.
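The idea of using the previous period as the prior of the current period can be sketched as below. Because formulas (7-1) to (7-5) are not reproduced in this text, the exact functional form here is an assumption: the previous distribution is scaled by the previous total count and by the decay factor to give pseudo-counts, and a small base prior is added. The names `decay` and `base_prior` are hypothetical.

```python
def carry_over_prior(prev_distribution, prev_total, decay, base_prior):
    """Build a current-period Dirichlet prior from the previous period.

    Each component is base_prior + decay * prev_total * p, so a larger
    decay lets historical interests weigh more on the current period,
    and decay = 0 falls back to the plain base prior.
    """
    return [base_prior + decay * prev_total * p for p in prev_distribution]


# Toy values: previous topic distribution and 100 previous word assignments.
prior_t = carry_over_prior([0.5, 0.3, 0.2], prev_total=100,
                           decay=0.1, base_prior=0.01)
no_history = carry_over_prior([0.5, 0.3, 0.2], prev_total=100,
                              decay=0.0, base_prior=0.01)
```

One such prior would be computed for each of the five distributions before running the sampler on the subset of the current period.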

Given T time slices (periods), the entire corpus (the academic metadata set) M can be divided by period t into T corpus subsets, one per period. Running the research interest mining model on each subset yields the topic-scholar distribution within each time slice, that is, the semantic representation of scholars' interests in each period t. In the embodiment of the present invention, a single-pass clustering algorithm is used to cluster the scholar interest semantic representations obtained at different times, yielding dynamic scholar clustering results.

As can be seen from the above, the present invention addresses the problem of scholar clustering in multi-source, multilingual and continually growing academic data. The newly proposed RIMM infers the topic distributions of scholars and enables dynamic clustering according to their research interests. The RIMM can effectively integrate multi-source data and bridge the lexical gap in multilingual data, so that scholars' research interests are mined accurately. In addition, the present invention proposes using Gibbs sampling to infer the parameters of the RIMM. Through iterative sampling of topics, all features of the academic data are mapped into a common semantic space and their topic distributions are obtained. By capturing both the continuity and the variability of scholars' research interests, the RIMM-based research interest tracking and dynamic scholar clustering method of the present invention can track scholars' interests dynamically and cluster the scholars accordingly.

Corresponding to the above method, the present invention also provides a research interest mining device for scholar clustering, comprising a processor and a memory. The memory stores computer instructions, and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the device implements the steps of the research interest mining method for scholar clustering described above.

Embodiments of the present invention further provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the research interest mining method for scholar clustering described above. The computer-readable storage medium may be a tangible storage medium, such as an optical disc, a USB flash drive, a floppy disk or a hard disk.

It should be made clear that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method of the present invention is not limited to the specific steps described and shown; after grasping the spirit of the present invention, those skilled in the art may make various changes, modifications and additions, or change the order of the steps.

Those of ordinary skill in the art will appreciate that the exemplary components, systems and methods described in connection with the embodiments disclosed herein can be implemented in hardware, software or a combination of the two. Whether they are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention. When implemented in hardware, the implementation may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, and so on. When implemented in software, the elements of the present invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave.

It should also be noted that the exemplary embodiments mentioned in the present invention describe some methods or systems on the basis of a series of steps or devices. However, the present invention is not limited to the order of the above steps; that is, the steps may be performed in the order mentioned in the embodiments, in a different order, or with several steps performed simultaneously.

In the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, and/or combined with or substituted for features of other embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, various modifications and changes may be made to the embodiments of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A research interest mining method for scholar clustering, characterized in that the method comprises the following steps:

constructing an academic metadata set based on scholar-related academic data from multiple sources, wherein each piece of academic metadata in the academic metadata set includes the following information: professional field information, data source information, scholar information, and text content information, and the academic data includes scientific research project data and paper data;

inputting at least part of the academic metadata in the constructed academic metadata set as input data into a pre-established research interest mining model, and obtaining a semantic representation of scholar interests by executing the research interest mining model over multiple sampling iterations, wherein the semantic representation of scholar interests includes a professional field-topic distribution, a topic-English word distribution, a topic-Chinese word distribution, and a topic-scholar distribution; wherein, in the research interest mining model, data of scholars that come from the same data source and belong to the same professional field share the same topic distribution, the professional field-topic distribution is modeled as a Dirichlet distribution, the topic-English word distribution, the topic-Chinese word distribution, and the topic-scholar distribution are modeled as multinomial distributions, and the professional field-topic distribution includes the professional field-topic distribution of scientific research project data and the professional field-topic distribution of paper data;

performing scholar clustering based on the obtained semantic representation of scholar interests to obtain scholar clustering results;

wherein inputting at least part of the academic metadata in the constructed academic metadata set as input data into the pre-established research interest mining model, and obtaining the semantic representation of scholar interests by executing the research interest mining model over multiple sampling iterations, includes:

sampling the academic metadata that comes from scientific research projects and belongs to a predetermined professional field, to obtain the professional field-topic distribution of scientific research project data;

sampling the academic metadata that comes from papers and belongs to a predetermined professional field, to obtain the professional field-topic distribution of paper data;

sampling for each predetermined topic to obtain the corresponding topic-Chinese word distribution, topic-scholar distribution, and topic-English word distribution;

determining whether the text content in each piece of academic metadata is Chinese or English; where the text content is Chinese, combining in pairs the words that co-occur in each piece of Chinese academic metadata to extract double words, and performing the following operations for each double word: sampling the topic of the Chinese double word, sampling its two words independently, and sampling the author of the double word; where the text content is English, performing the following operations for each word: sampling the English topic, sampling the English word, and sampling the author of the English word.
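The language check and the pairwise double-word extraction in the last step of claim 1 can be illustrated with a minimal sketch. It assumes the text has already been segmented into words; the function names `is_chinese` and `extract_double_words`, and the rule of treating any text containing a CJK ideograph as Chinese, are illustrative assumptions and are not taken from the patent.

```python
from itertools import combinations


def is_chinese(text: str) -> bool:
    """Assumed heuristic: text counts as Chinese if it contains any
    character in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def extract_double_words(tokens: list[str]) -> list[tuple[str, str]]:
    """Combine in pairs the words that co-occur in one piece of metadata.

    Pairs are unordered and follow the order of the input tokens.
    """
    return list(combinations(tokens, 2))


# Hypothetical, already segmented Chinese title.
tokens = ["主题", "模型", "聚类"]
if is_chinese("".join(tokens)):
    pairs = extract_double_words(tokens)
```

For three co-occurring words this yields three double words; in the claimed method each such pair would then have its topic, its two words, and its author sampled.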

2. The method according to claim 1, wherein the method further comprises a step of constructing the research interest mining model, the step comprising:

identifying the professional field to which the academic metadata belongs;

obtaining the topic of the academic metadata by collaboratively sampling from the topic distribution of data that comes from scientific research projects and belongs to the identified professional field and the topic distribution of data that comes from papers and belongs to the identified professional field;

generating English words, Chinese words, and scholars according to the topic-English word distribution, the topic-Chinese word distribution, and the topic-scholar distribution.
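The generative steps of claim 2 can be sketched as follows. The claim does not state how the two field-topic distributions are combined during collaborative sampling, so the convex combination with a hypothetical `balance` weight is an assumption made only for illustration, as are all function and parameter names. For brevity, a single word distribution stands in for either language.

```python
import random


def generate_record(theta_project, theta_paper, phi_word, phi_scholar,
                    balance=0.5, n_words=3, rng=None):
    """Draw one topic, then words and a scholar conditioned on that topic."""
    rng = rng or random.Random()
    # ASSUMPTION: mix the project and paper topic distributions linearly.
    mixed = [balance * a + (1.0 - balance) * b
             for a, b in zip(theta_project, theta_paper)]
    topic = rng.choices(range(len(mixed)), weights=mixed, k=1)[0]
    # Words are drawn from the multinomial topic-word distribution.
    vocab, word_probs = zip(*phi_word[topic].items())
    words = rng.choices(vocab, weights=word_probs, k=n_words)
    # The scholar is drawn from the multinomial topic-scholar distribution.
    scholars, scholar_probs = zip(*phi_scholar[topic].items())
    scholar = rng.choices(scholars, weights=scholar_probs, k=1)[0]
    return topic, words, scholar


# Hypothetical toy distributions with two topics.
record = generate_record(
    theta_project=[0.7, 0.3],
    theta_paper=[0.6, 0.4],
    phi_word=[{"graph": 0.5, "cluster": 0.5}, {"neuron": 1.0}],
    phi_scholar=[{"scholar_a": 1.0}, {"scholar_b": 1.0}],
    rng=random.Random(0),
)
```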

3. The method according to claim 1, wherein the constructing an academic metadata set comprises:

determining the source of each piece of academic metadata, that is, whether it is scientific research project data or paper data;

extracting the scholar to which the current academic metadata belongs;

extracting the text information of each scientific research project or paper;

determining whether the text information of the scientific research project or paper is in Chinese or English;

determining the time period to which the current academic metadata belongs, and assigning the current academic metadata to the academic metadata subset corresponding to that time period, thereby constructing an academic metadata set that includes the academic metadata subsets.
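The metadata fields named in claims 1 and 3 and the assignment to time-period subsets can be sketched as below. The record layout, the use of a calendar year as the time stamp, and the fixed-width `period_years` bucketing are assumptions chosen for illustration; the claims do not specify how a time period is defined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AcademicMetadata:
    domain: str    # professional field information
    source: str    # data source: "project" or "paper" (assumed labels)
    scholar: str   # scholar information
    text: str      # text content information
    year: int      # assumed time stamp used to pick the time period


def partition_by_period(records, period_years=1):
    """Group records into subsets keyed by the first year of their period."""
    subsets = defaultdict(list)
    for record in records:
        period_start = record.year - (record.year % period_years)
        subsets[period_start].append(record)
    return dict(subsets)


# Hypothetical records.
records = [
    AcademicMetadata("artificial intelligence", "paper", "scholar_a", "topic model", 2019),
    AcademicMetadata("artificial intelligence", "project", "scholar_a", "主题 模型", 2020),
    AcademicMetadata("software engineering", "paper", "scholar_b", "code review", 2020),
]
subsets = partition_by_period(records)
```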

4. The method of claim 1, wherein

said inputting at least part of the academic metadata in the constructed academic metadata set as input data into the pre-established research interest mining model, and obtaining the semantic representation of scholar interests by executing the research interest mining model over multiple sampling iterations, further includes: inputting the academic metadata in the academic metadata subsets corresponding to different time periods as input data into the research interest mining model, to obtain semantic representations of scholar interests at different times;

said performing scholar clustering based on the obtained semantic representation of scholar interests includes: using a single-pass clustering algorithm to cluster the obtained semantic representations of scholar interests at different times.
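A common form of single-pass clustering can be sketched as follows: each vector is compared with the existing cluster centroids once, joins the most similar cluster if the similarity reaches a threshold, and otherwise starts a new cluster. The cosine similarity measure, the running-mean centroid update, and the `threshold` value are assumptions; claim 4 names only the single-pass algorithm itself.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(vectors, threshold=0.8):
    """Assign each vector to a cluster in one pass over the data."""
    centroids, counts, labels = [], [], []
    for vec in vectors:
        best, best_sim = -1, -1.0
        for k, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = k, sim
        if best >= 0 and best_sim >= threshold:
            # Update the centroid as the running mean of its members.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], vec)]
            counts[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(vec))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Hypothetical two-topic interest vectors for three scholars.
labels = single_pass([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
```

With these toy vectors the first two scholars fall into one cluster and the third starts a new one. Because the result depends on input order, processing the subsets in time order fits the time-sliced setting of claim 4.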

5. The method according to claim 1, wherein the parameters to be estimated in the research interest mining model are calculated by using a Gibbs sampling method;

the sampling of Chinese double words adopts the following formula:

[formula not reproduced in the source text]

the sampling of each English word adopts the following formula:

[formula not reproduced in the source text]

where Z represents the total number of topics;

[…] represents the Chinese word i under the Chinese topic Z ^b ;

[…] represents the set of all Chinese topics except that of the i-th word;

[…] represents the English word i under the English topic Z ^b ;

[…] represents the set of all English topics except that of the i-th word;

[…] the number of occurrences;

[…] the number of occurrences; a ₁ and a ₂ are the topic distribution of scientific research project data and the topic distribution of paper data, respectively;

[…] is the topic-scholar distribution fitted to a Dirichlet distribution;

[…] is the balance factor;

after the iterative operation of the research interest mining model, the following estimated parameters are obtained:

[formulas not reproduced in the source text]

where […] represents the topic-scholar distribution for topic z and scholar s.

6. The method according to claim 3, wherein the method further comprises:

taking the topic distribution of the previous period as the prior distribution of the current period, and calculating the prior parameters of the current period based on the following formula:

[formula not reproduced in the source text]

where

[…] is the attenuation factor;

[…] represents the total number of all Chinese and English words belonging to topic z in the scientific research project data under the p-th professional field at time t-1;

expressed in

[…] represents, at time […], the total number of all Chinese and English words belonging to topic […] in the paper data under the […] professional field;

[…] represents, at time […], the professional field-topic distribution of the paper data;

[…] represents, at time […], the professional field-topic distribution of the paper data;

[…] represents, at time […], the number of occurrences of […];

expressed in

expressed in

[…] represents, at time […], the number of occurrences of […];

expressed in

expressed in

[…] represents, at time […], the number of occurrences of […];

expressed in

expressed in

7. A research interest mining device for scholar clustering, comprising a processor and a memory, characterized in that the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the device implements the steps of the method according to any one of claims 1 to 6.

8. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method according to any one of claims 1 to 6 are implemented.