
CN118152572A - Document clustering and sorting method, system, device and medium based on language large model - Google Patents


Info

Publication number
CN118152572A
Authority
CN
China
Prior art keywords
document
cluster
documents
weighted sum
seed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410377375.3A
Other languages
Chinese (zh)
Inventor
邢国用
庄莉
梁懿
丘志强
郑耀松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Information and Telecommunication Group Co Ltd
Fujian Yirong Information Technology Co Ltd
Original Assignee
State Grid Information and Telecommunication Group Co Ltd
Fujian Yirong Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Group Co Ltd, Fujian Yirong Information Technology Co Ltd filed Critical State Grid Information and Telecommunication Group Co Ltd
Priority to CN202410377375.3A priority Critical patent/CN118152572A/en
Publication of CN118152572A publication Critical patent/CN118152572A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Creation or modification of classes or clusters
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a document clustering and sorting method, system, device and medium based on a large language model. The method comprises: collecting document data for structuring and preprocessing; feeding the document content into a large language model to obtain vectorized representations; applying a clustering algorithm to the vectorized document content to obtain document clusters and a similarity matrix within each cluster, ranking the documents within each cluster by the weighted sum of the similarity matrix, and taking the titles of the top ten documents as seed document titles; counting the number of documents at each level in a cluster, the total number of documents, and the weighted sum of the cluster's correlation coefficients, then computing a weighted sum of these three indicators to obtain each cluster's final score and ranking the clusters by score; and feeding the seed document titles together with a set prompt into the large language model to generate a short sentence that serves as the cluster's class label. The invention makes document vectorization more accurate, provides a more principled basis for cluster ranking, and makes class-label generation more specific and automated.

Description

Document clustering and sorting method, system, device and medium based on large language model

Technical Field

The present invention belongs to the field of artificial intelligence, and in particular relates to a document clustering and sorting method, system, device and medium based on a large language model.

Background Art

Traditional document classification required experts to classify documents based on business scenarios and empirical knowledge, usually ordering them alphabetically, which worked well in the era of paper documents. As technology develops, documents are increasingly digitized, and more and more organizations and individuals use digital tools and solutions to manage, store and share documents. Digitization brings new requirements for document search, which usually needs to focus on particular categories of documents; moreover, as document volumes grow by orders of magnitude, manual classification becomes very difficult. At the same time, after a large number of digital documents accumulate, information overload can make it hard for users to find the information they need. Effective management, backup and classified storage are therefore required to ensure long-term preservation and accessibility of the data. Without a good document management system, data may be lost or become unavailable, so organizing, classifying and sorting these digital documents is particularly important. The rapid development of large language model technology, represented by ChatGPT, offers a new approach to document classification and sorting.

Current document classification and sorting methods mainly include: 1. Keyword-based document search classification and text-similarity-based document sorting: related documents are retrieved with keyword queries, the retrieved documents are fed into a pre-trained similarity-score prediction model to obtain a similarity score for each candidate document, and the candidate documents are ranked by those scores. 2. Knowledge-graph-based document clustering: building on keyword techniques, a keyword-based knowledge graph is generated and divided into several subgraphs, and a clustering algorithm assigns documents to events in the knowledge graph. 3. Biclustering based on a term frequency-inverse document frequency matrix: documents are vectorized directly and clustered by computing distances between the text vectors.

The Chinese patent with publication number "CN116304071A" discloses a two-layer web document clustering algorithm based on a knowledge graph. The invention searches, classifies and sorts relevant documents according to provided keywords. In the screening stage, relevant candidate documents are selected from the document library; in splicing, the keyword-related sentences in each candidate document are concatenated to generate a spliced corpus; in similarity-score prediction, the spliced corpus is fed into a pre-trained similarity-score prediction model to compute a similarity score for each candidate document; in document sorting, the candidate documents are ranked by their similarity scores. The above invention has the following main problems. First, the keywords used for classification are determined entirely from experience, so the quality of keyword selection directly determines classification accuracy; when the number of documents is huge, category coverage is inevitably limited, which directly harms accuracy. Second, the class identifiers, or class labels, are simply the selected keywords, which limits their ability to summarize a document category; a class label should reflect the main features of all documents in the category as fully as possible, typically as a phrase or short sentence containing several keywords. Third, computational complexity: concatenating texts and computing similarity scores between documents and queries can require substantial computing resources and time, especially for large-scale document libraries, potentially causing delays and performance problems. Fourth, the similarity model is a traditional model whose performance is worse than a large language model's, which can lead to inaccurate ranking.

The Chinese patent with publication number "CN111680131B" discloses a semantics-based document clustering method, including: obtaining input documents and preprocessing them; performing term frequency statistics and inverse document frequency calculation on each word in the processed input documents to construct a term frequency-inverse document matrix; feeding the words used in the term frequency statistics into a pre-stored natural language processing model to obtain a similarity matrix adapted to the term frequency-inverse document matrix; performing semantic propagation on the term frequency-inverse document matrix according to the similarity matrix to obtain a second term frequency-inverse document matrix; performing biclustering on the second matrix to obtain at least one bicluster, each containing a document cluster and a word cluster; and labeling each document in the document cluster according to the feature words in the word cluster. Unlike the keyword method, the keywords used as class labels in this invention are no longer produced manually but are generated by clustering. The above invention has the following main problems. First, although biclustering no longer requires keywords to be given manually, it still uses keywords as class labels, and keywords alone remain insufficient to express the feature information of a document cluster. Second, although the term frequency-inverse document frequency algorithm is simple to implement, easy to understand and highly interpretable, measuring a word's importance by its frequency alone is not comprehensive, so classification precision is low. Third, this computation cannot capture word position information or distinguish the same word across different contexts. Fourth, the invention does not address the ordering of documents after clustering, although ordering is of great significance when organizing documents.

Summary of the Invention

The present invention provides a document clustering and sorting method, system, device and medium based on a large language model, aiming to solve the problems in the prior art that text vectorization methods rely on feature engineering and have weak representation ability, and that class-label generation depends on manual experience or on selection from existing text, giving the labels limited expressive power.

To solve the above technical problems, the present invention provides a document clustering and sorting method based on a large language model, comprising the following steps:

S1:收集文档数据并进行结构化处理,所述文档数据包括文档内容与文档信息,所述文档信息包括文档标题与文档等级,对经过结构化处理的文档内容进行预处理。S1: Collect document data and perform structural processing, the document data includes document content and document information, the document information includes document title and document level, and pre-process the document content that has undergone structural processing.

S2: Feed the preprocessed document content into the large language model to compute its vectorized representation.

S3: Apply a clustering algorithm to the vectorized document content to obtain multiple document clusters. For the documents in each cluster, compute a similarity matrix based on vector distance, rank the documents within each cluster by the weighted sum of the similarity matrix, and take the top ten documents in each cluster as seed documents; the titles of the seed documents are the seed document titles.

S4: Use the seed document titles to compute the correlation coefficients within each cluster and obtain their weighted sum. Based on the document level in the structured document information, count the number of documents at each level in the cluster. From three indicators, the number of documents at each level, the total number of documents, and the weighted sum of the intra-cluster correlation coefficients, compute a weighted sum to obtain each cluster's final score, and rank the clusters by score.

S5: Feed the seed document titles of each cluster together with a set prompt into the large language model to generate a short sentence that summarizes the cluster's information features, and use the short sentence as the cluster's class label.
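Step S5 might be sketched as below; `llm_complete` is a hypothetical placeholder for whatever large-language-model interface is deployed, and the prompt wording is an illustration rather than the patent's actual set prompt:

```python
def make_cluster_label(seed_titles, llm_complete):
    """Build a prompt from the seed document titles of one cluster and
    ask the large language model for a short summarizing sentence.

    seed_titles  : list of the cluster's seed document titles.
    llm_complete : callable str -> str wrapping the deployed LLM service
                   (a hypothetical placeholder, not a specific API).
    """
    prompt = (
        "Here are the top-ranked document titles of one document cluster:\n"
        + "\n".join(f"- {t}" for t in seed_titles)
        + "\nSummarize their common topic in one short sentence "
          "to serve as the cluster's class label."
    )
    return llm_complete(prompt).strip()
```

In deployment, `llm_complete` would wrap either an online interface or a locally hosted model, as the embodiment later describes.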

Preferably, in step S3, ranking the documents within each cluster by the weighted sum of the similarity matrix is specifically:

S31: Let the set of vectorized representations of the documents in a cluster be V = {v1, v2, …, vp}, where p is the total number of documents in the cluster, and let d(vn, vm) be the similarity-matrix entry for the n-th and m-th documents, computed from the vector distance:

d(vn, vm) = ‖vn − vm‖

where vn is the vectorized representation of the n-th document in the cluster and vm is the vectorized representation of the m-th document in the cluster.

S32: The weighted sum of the similarity matrix for the n-th document in the cluster is computed as:

Sn = Σ_{i≠n} αi · d(vn, vi)

where Sn is the weighted sum of the similarity matrix for the n-th document, αi is the weight of the i-th similarity-matrix entry, and vi is the vectorized representation of the i-th document in the cluster.

S33: Rank the documents within each cluster by the weighted sum of the similarity matrix.
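Steps S31 to S33 can be sketched as follows; the Euclidean distance and the uniform weights αi are assumptions, since the patent text does not fix them:

```python
import numpy as np

def intra_cluster_rank(V, alpha=None):
    """Rank a cluster's documents by the weighted sum of their pairwise
    distances S_n = sum_i alpha_i * d(v_n, v_i); a smaller sum means the
    document sits closer to the cluster core.

    V     : (p, dim) array of document vectors.
    alpha : optional length-p weights; uniform by assumption if omitted.
    """
    p = V.shape[0]
    if alpha is None:
        alpha = np.ones(p) / p
    # pairwise Euclidean distance matrix d(v_n, v_m)
    D = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    S = D @ alpha                    # weighted sums S_1..S_p
    order = np.argsort(S)            # ascending: most central documents first
    return order, S

# The first ten entries of `order` would be the cluster's seed documents.
```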

Preferably, in step S4, using the seed documents to compute the intra-cluster correlation coefficients and their weighted sum is specifically: feed the seed document titles into the large language model to compute their vectorized representations T = {t1, t2, …, t10}; the weighted sum of correlation coefficients for the k-th seed document title in a cluster is computed as:

WSCCk = Σ_{i=1..p} βi · d(tk, vi)

where WSCCk is the weighted sum of the correlation coefficients of the k-th seed document title, βi is the weight of the i-th similarity-matrix entry, and tk is the vectorized representation of the k-th seed document title.

The weighted sum of correlation coefficients within each cluster is computed as:

WSCC = Σ_{k=1..10} γk · WSCCk

where WSCC is the weighted sum of the correlation coefficients within the cluster, and γk is the weight of the k-th WSCCk.
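A sketch of the WSCC computation described above; since the published formula images are not reproduced in the text, the distance-based form of WSCCk and the uniform weights are assumptions:

```python
import numpy as np

def wscc(T, V, beta=None, gamma=None):
    """Weighted sum of correlation coefficients for one cluster.

    T    : (q, dim) seed-title vectors t_1..t_q (q = 10 in the method).
    V    : (p, dim) document vectors of the cluster.
    beta, gamma : weights; uniform by assumption if omitted.
    """
    p, q = V.shape[0], T.shape[0]
    beta = np.ones(p) / p if beta is None else beta
    gamma = np.ones(q) / q if gamma is None else gamma
    # distances d(t_k, v_i) between every title and every document vector
    D = np.linalg.norm(T[:, None, :] - V[None, :, :], axis=-1)
    wscc_k = D @ beta                 # WSCC_k = sum_i beta_i * d(t_k, v_i)
    return float(gamma @ wscc_k)      # WSCC   = sum_k gamma_k * WSCC_k
```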

Preferably, the final score of a document cluster in step S4 is computed as:

CFS = 0.3·DC1 + 0.2·DC2 + 0.1·DC + 0.4·WSCC

where CFS is the final score of the document cluster, DC1 is the number of level-1 documents, DC2 is the number of level-2 documents, DC is the total number of documents, and WSCC is the weighted sum of the correlation coefficients within the cluster.
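With the fixed weights above, the cluster score can be computed directly; for example:

```python
def cluster_final_score(dc1, dc2, dc, wscc):
    """CFS = 0.3*DC1 + 0.2*DC2 + 0.1*DC + 0.4*WSCC, using the weights
    given in the method."""
    return 0.3 * dc1 + 0.2 * dc2 + 0.1 * dc + 0.4 * wscc

# A cluster with 5 level-1 documents, 8 level-2 documents, 30 documents
# in total and WSCC = 12.5 scores 1.5 + 1.6 + 3.0 + 5.0 = 11.1.
```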

In another aspect, the present invention provides a document clustering and sorting system based on a large language model, comprising a document collection module, a text vectorization module, a document clustering module, a class sorting module and a class label generation module.

The document collection module collects document data and structures it. The document data includes document content and document information; the document information includes the document title and the document level. The module preprocesses the structured document content.

The text vectorization module feeds the preprocessed document content into the large language model to compute its vectorized representation.

The document clustering module applies a clustering algorithm to the vectorized document content to obtain multiple document clusters. For the documents in each cluster it computes a similarity matrix based on vector distance, ranks the documents within each cluster by the weighted sum of the similarity matrix, and takes the top ten documents in each cluster as seed documents; the titles of the seed documents are the seed document titles.

The class sorting module uses the seed document titles to compute the correlation coefficients within each cluster and obtain their weighted sum. Based on the document level in the structured document information, it counts the number of documents at each level in the cluster. From three indicators, the number of documents at each level, the total number of documents, and the weighted sum of the intra-cluster correlation coefficients, it computes a weighted sum to obtain each cluster's final score and ranks the clusters by score.

The class label generation module feeds the seed document titles of each cluster together with a set prompt into the large language model to generate a short sentence that summarizes the cluster's information features, and uses the short sentence as the cluster's class label.

Preferably, in the document clustering module, ranking the documents within each cluster by the weighted sum of the similarity matrix is specifically:

S31: Let the set of vectorized representations of the documents in a cluster be V = {v1, v2, …, vp}, where p is the total number of documents in the cluster, and let d(vn, vm) be the similarity-matrix entry for the n-th and m-th documents, computed from the vector distance:

d(vn, vm) = ‖vn − vm‖

where vn is the vectorized representation of the n-th document in the cluster and vm is the vectorized representation of the m-th document in the cluster.

S32: The weighted sum of the similarity matrix for the n-th document in the cluster is computed as:

Sn = Σ_{i≠n} αi · d(vn, vi)

where Sn is the weighted sum of the similarity matrix for the n-th document, αi is the weight of the i-th similarity-matrix entry, and vi is the vectorized representation of the i-th document in the cluster.

S33: Rank the documents within each cluster by the weighted sum of the similarity matrix.

Preferably, in the class sorting module, using the seed documents to compute the intra-cluster correlation coefficients and their weighted sum is specifically: feed the seed document titles into the large language model to compute their vectorized representations T = {t1, t2, …, t10}; the weighted sum of correlation coefficients for the k-th seed document title in a cluster is computed as:

WSCCk = Σ_{i=1..p} βi · d(tk, vi)

where WSCCk is the weighted sum of the correlation coefficients of the k-th seed document title, βi is the weight of the i-th similarity-matrix entry, and tk is the vectorized representation of the k-th seed document title.

The weighted sum of correlation coefficients within each cluster is computed as:

WSCC = Σ_{k=1..10} γk · WSCCk

where WSCC is the weighted sum of the correlation coefficients within the cluster, and γk is the weight of the k-th WSCCk.

Preferably, the final score of a document cluster in the class sorting module is computed as:

CFS = 0.3·DC1 + 0.2·DC2 + 0.1·DC + 0.4·WSCC

where CFS is the final score of the document cluster, DC1 is the number of level-1 documents, DC2 is the number of level-2 documents, DC is the total number of documents, and WSCC is the weighted sum of the correlation coefficients within the cluster.

In yet another aspect, the present invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the document clustering and sorting method based on a large language model as described in any embodiment of the present invention.

In yet another aspect, the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the document clustering and sorting method based on a large language model as described in any embodiment of the present invention.

Compared with the prior art, the present invention has the following technical effects:

1. More accurate document vectorization: the invention no longer requires preprocessing such as word segmentation of the original documents. Document vectorization based on a generative large language model can vectorize documents dynamically, resolving polysemy, so that subsequent vectorization-based steps are more reasonable and their results more accurate.

2. A more principled basis for class ranking: the invention does not rely only on distances between document vectors, but also considers multi-dimensional indicators such as document source level and the number of documents in a class, determining the class ranking comprehensively on a more scientific and appropriate basis.

3. Automated, specific class-label generation: the invention uses the large language model together with the titles of the top ten documents in a cluster to generate a title representing that cluster, expressing and highlighting the cluster's features more concretely.

Brief Description of the Drawings

FIG. 1 is an overall flow chart of the document clustering and sorting method based on a large language model according to the present invention;

FIG. 2 is an overall structural diagram of the document clustering and sorting system based on a large language model according to the present invention.

Detailed Description of the Embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below in combination with specific embodiments of the present application and with reference to the accompanying drawings.

Embodiment 1

This embodiment provides a document clustering and sorting method based on a large language model. Text is vectorized using the ELMo-style contextual approach of the large language model; class labels are short sentences generated by the model's language capability rather than pre-given or cluster-derived keywords; and after the clustering algorithm is applied, documents are ranked comprehensively using not only document vector distances but also features such as the number of documents and the document level. This embodiment improves the accuracy of document clustering and the quality of organization, categorization and ranking; the class labels are more specific, and the intra-class ranking rules are more reasonable. Referring to FIG. 1, the method includes the following steps:

S1: Collect document data and structure it for structured storage. The collected document data may be, for example, the various rules and regulations published by a company. The document data includes document content and document information; the document information includes the document title and the document level. Structured storage usually means storing the document-level information from the document data in a relational database. The structured document content is then preprocessed, including removing meaningless special symbols and punctuation, stripping the HTML tags of web documents, and de-classifying and de-identifying the document data. Unlike traditional approaches based on keywords or the term frequency-inverse document frequency algorithm, this embodiment uses a large language model for text vectorization, so preprocessing such as word segmentation and removal of low-frequency words and stop words is no longer required.
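As an illustration, the cleaning described in S1 might be sketched as follows; the exact removal rules are assumptions, and the de-classification and de-identification steps mentioned above are omitted:

```python
import re

def preprocess(content: str) -> str:
    """Sketch of the S1 cleaning: strip HTML tags and symbols with no
    practical meaning, then collapse whitespace. The character classes
    kept here (word characters, CJK text and common CJK punctuation)
    are an assumption, not the patent's exact rule set."""
    content = re.sub(r"<[^>]+>", " ", content)                       # drop HTML tags
    content = re.sub(r"[^\w\s\u4e00-\u9fff，。！？、]", " ", content)  # drop stray symbols
    return re.sub(r"\s+", " ", content).strip()
```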

S2:将经过预处理的文档内容输入语言大模型计算得到文档内容的向量化表示。其中语言大模型可以使用ChatGPT等在线接口,或自有本地化部署的语言大模型服务。将预处理后的文档内容输入语言大模型的embedding模块,计算得到文档内容的向量化表示。与词频-逆文档频率(TF-IDF)算法相比,语言大模型基于ELMO方法的向量化更加准确,维度更高,包含的信息特征更多,同时能够结合文本上下文信息解决一词多义问题。S2: The preprocessed document content is fed into the large language model to compute a vectorized representation. The model can be an online interface such as ChatGPT, or a self-hosted, locally deployed large language model service. The preprocessed content is passed through the model's embedding module to obtain the vector representation. Compared with the term frequency-inverse document frequency (TF-IDF) algorithm, the ELMo-based vectorization of a large language model is more accurate, higher-dimensional, and richer in information features, and can resolve polysemy by incorporating textual context.
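As a sketch of S2, the embedding step can be modelled as a thin wrapper over whichever embedding interface is deployed (online API or local service). The `toy_embed` stand-in below is purely illustrative and does not reflect any real model's output; real vectors would come from the model's embedding module.

```python
from typing import Callable, List

def vectorize(texts: List[str],
              embed_fn: Callable[[str], List[float]]) -> List[List[float]]:
    """Map each preprocessed document to its embedding vector; embed_fn
    stands in for the deployed embedding module (online API or local model)."""
    return [embed_fn(t) for t in texts]

# Toy deterministic stand-in, for illustration only (not a real embedding).
def toy_embed(text: str) -> List[float]:
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

vecs = vectorize(["文档甲", "文档乙丙"], toy_embed)
```

Injecting the embedder as a function keeps the pipeline independent of whether an online interface or a self-hosted service is used.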

S3:对经过向量化处理的文档内容使用聚类算法得到多个文档簇,所述聚类算法包括K均值聚类、DBSCAN、Hierarchical Clustering等,可根据实际场景与需求,选择合适的算法或模型。S3: A clustering algorithm is applied to the vectorized document content to obtain multiple document clusters. Candidate algorithms include K-means clustering, DBSCAN, Hierarchical Clustering, etc.; an appropriate algorithm or model can be selected according to the actual scenario and requirements.

K均值聚类是一种迭代求解的聚类分析算法,作为本实施例的优选实施方式,本实施例使用K均值聚类得到文档簇,具体的,其实施步骤为:首先将文档预分为K组,随机选取K个文档(向量化表示)作为初始的聚类中心,然后计算每个文档与各个聚类中心之间的距离,把每个文档分配给距离(欧氏距离)它最近的聚类中心。由聚类中心文档以及分配给它们的文档就代表一个文档簇。每当分配一个样本文档,类别的聚类中心会根据类中现有的文档向量重新计算。这个过程将不断重复直到满足某个终止条件。终止条件可以是没有(或设定最小数目)对象被重新分配给不同的聚类,没有(或设定最小数目)聚类中心再发生变化,误差平方和局部最小。K-means clustering is an iterative clustering analysis algorithm. As the preferred implementation of this embodiment, this embodiment uses K-means clustering to obtain document clusters. Specifically, its implementation steps are: first, pre-divide the documents into K groups, randomly select K documents (vectorized representation) as the initial cluster centers, then calculate the distance between each document and each cluster center, and assign each document to the cluster center with the closest distance (Euclidean distance) to it. A document cluster is represented by the cluster center documents and the documents assigned to them. Whenever a sample document is assigned, the cluster center of the category will be recalculated based on the existing document vectors in the class. This process will be repeated until a certain termination condition is met. The termination condition can be that no (or a set minimum number) objects are reassigned to different clusters, no (or a set minimum number) cluster centers change again, and the sum of squared errors is locally minimized.
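The K-means procedure described above can be sketched directly from the listed steps: random initial centers, Euclidean assignment, center recomputation, and termination when no document is reassigned. This is a plain illustration rather than a production implementation (no K-means++ seeding, no empty-cluster handling beyond keeping the old center).

```python
import random

def kmeans(vectors, k, max_iter=100, seed=0):
    """Plain K-means following the described steps: random initial
    centers, Euclidean assignment, center recomputation, and stopping
    once no document changes cluster."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assign = [-1] * len(vectors)
    for _ in range(max_iter):
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors
        ]
        if new_assign == assign:   # termination: no document reassigned
            break
        assign = new_assign
        for c in range(k):         # recompute each cluster center
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

docs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
labels, _ = kmeans(docs, k=2)
```

On this toy data the two nearby pairs end up in the same clusters regardless of which initial centers are sampled.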

DBSCAN,该算法利用基于密度的聚类的概念,即要求聚类空间中的一定区域内所包含对象(点或其他空间对象)的数目不小于某一给定阈值。DBSCAN算法的显著优点是聚类速度快且能够有效处理噪声点和发现任意形状的空间聚类。但是当空间聚类的密度不均匀、聚类间距相差很大时,聚类质量较差。DBSCAN uses the concept of density-based clustering: a given region of the clustering space must contain no fewer objects (points or other spatial objects) than a given threshold. Its notable advantages are fast clustering, effective handling of noise points, and the ability to discover spatial clusters of arbitrary shape. However, when cluster densities are uneven or inter-cluster distances vary greatly, clustering quality degrades.
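A minimal sketch of the density-based idea just described: a point whose neighbourhood of radius `eps` holds at least `min_pts` points seeds a cluster, which then grows through density-reachable points; everything else is labelled noise. The parameter names and toy data are illustrative.

```python
def dbscan(points, eps, min_pts):
    """Minimal density-based clustering: core points seed clusters that
    are expanded through density-reachable points; leftovers are noise (-1)."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                  # provisionally noise
            continue
        cid += 1
        labels[i] = cid
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid             # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbours(j)
            if len(jn) >= min_pts:          # core point: keep expanding
                queue.extend(jn)
    return labels

pts = [[0, 0], [0, 1], [1, 0], [10, 10]]
db_labels = dbscan(pts, eps=1.5, min_pts=2)
```

The isolated point at (10, 10) has too few neighbours and is labelled noise, illustrating the noise handling mentioned above.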

Hierarchical Clustering(层次聚类),通过计算不同类别数据点间的相似度来创建一棵有层次的嵌套聚类树,距离越小,相似度越高。在聚类树中,不同类别的原始数据点是树的最低层,树的顶层是一个聚类的根节点。创建聚类树有自下而上合并和自上而下分裂两种方法。Hierarchical Clustering creates a hierarchical nested cluster tree by calculating the similarity between data points of different categories. The smaller the distance, the higher the similarity. In the cluster tree, the original data points of different categories are the lowest level of the tree, and the top level of the tree is the root node of a cluster. There are two ways to create a cluster tree: bottom-up merging and top-down splitting.

各文档簇中文档通过向量计算得到基于向量距离的相似度矩阵,对各文档簇内文档按照相似度矩阵的加权和进行簇内排序,将簇内排序为前十的文档作为种子文档,其中,一个文档簇即一个文档类别包含多个文档,所述种子文档的文档标题为种子文档标题。The documents in each document cluster are vector-calculated to obtain a similarity matrix based on vector distance. The documents in each document cluster are sorted within the cluster according to the weighted sum of the similarity matrix. The top ten documents in the cluster are used as seed documents, where a document cluster, i.e., a document category, contains multiple documents, and the document title of the seed document is the seed document title.

作为本实施例的优选实施方式,所述步骤S3中对各文档簇内文档按照相似度矩阵的加权和进行簇内排序具体为:As a preferred implementation of this embodiment, in step S3, the documents in each document cluster are sorted within the cluster according to the weighted sum of the similarity matrix as follows:

S31:各文档簇内的文档的向量化表示集合为V={v1,v2,…,vp},其中p为文档簇内文档总数,文档簇内第n个文档与第m个文档的相似度矩阵为d(vn,vm),相似度矩阵d(vn,vm)的计算公式具体为:S31: The vectorized representation set of documents in each document cluster is V = {v1,v2,…,vp}, where p is the total number of documents in the document cluster, and the similarity matrix entry between the nth document and the mth document is d(vn,vm), computed as:

d(vn,vm) = ‖vn − vm‖ = √( Σj (vn,j − vm,j)² )  (欧氏距离,j 遍历向量的各个维度 Euclidean distance, with j running over the vector dimensions)

式中,vn为文档簇内第n个文档的向量化表示,vm为文档簇内第m个文档的向量化表示。Where vn is the vectorized representation of the nth document in the document cluster, and vm is the vectorized representation of the mth document in the document cluster.

S32:文档簇内第n个文档的相似度矩阵加权和的计算公式具体为:S32: The weighted sum of the similarity matrix for the nth document in the document cluster is computed as:

Sn = Σ(i=1..p) αi · d(vn, vi)

式中,Sn为第n个文档的相似度矩阵加权和,αi为第i个相似度矩阵的权重,vi为文档簇内第i个文档的向量化表示。Where Sn is the weighted sum of the similarity matrix for the nth document, αi is the weight of the ith similarity matrix entry, and vi is the vectorized representation of the ith document in the document cluster.

S33:对各文档簇内文档按照相似度矩阵的加权和进行簇内排序。S33: Sort the documents in each document cluster within the cluster according to the weighted sum of the similarity matrix.
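Steps S31–S33 can be sketched as below. Two assumptions are made that the original does not pin down: the weights αi are uniform, and a smaller weighted sum (i.e. a document closer to the rest of the cluster) ranks first when selecting seed documents.

```python
def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def rank_cluster(vectors, weights=None, top=10):
    """S31-S33 sketch: build the pairwise distance matrix d, take each
    document's weighted sum S_n over all documents, then rank by that
    sum.  Uniform weights and 'smaller sum ranks first' are assumptions."""
    p = len(vectors)
    w = weights or [1.0] * p
    d = [[euclidean(vectors[n], vectors[m]) for m in range(p)]
         for n in range(p)]
    s = [sum(w[i] * d[n][i] for i in range(p)) for n in range(p)]  # S_n
    order = sorted(range(p), key=lambda n: s[n])
    return order[:top]              # indices of the seed documents

cluster_vecs = [[0.0, 0.0], [0.2, 0.0], [3.0, 3.0]]
seeds = rank_cluster(cluster_vecs, top=2)
```

Here document 1 sits between the other two, so it accumulates the smallest total distance and ranks first as a seed.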

S4:使用种子文档标题计算各文档簇内相关系数得到相关系数加权和,基于经过结构化处理的文档信息中的文档等级,统计文档簇中的各等级文档数量,根据各等级文档数量、文档总数与文档簇内相关系数加权和三个指标,计算所述三个指标的加权和,得到各文档簇最终得分,按照得分高低对文档簇进行文档排序。S4: Use the seed document title to calculate the correlation coefficient within each document cluster to obtain the weighted sum of the correlation coefficients. Based on the document level in the structured document information, count the number of documents of each level in the document cluster. According to the three indicators of the number of documents of each level, the total number of documents and the weighted sum of the correlation coefficient within the document cluster, calculate the weighted sum of the three indicators to obtain the final score of each document cluster, and sort the documents in the document cluster according to the score.

作为本实施例的优选实施方式,所述步骤S4中使用种子文档计算各文档簇内相关系数得到相关系数加权和具体为:将种子文档标题输入语言大模型计算得到向量化表示T={t1,t2,…,t10},各文档簇内第k个种子文档标题的相关系数加权和的计算公式具体为:As a preferred implementation of this embodiment, in step S4 the seed documents are used to compute the intra-cluster correlation coefficients and their weighted sum as follows: the seed document titles are fed into the large language model to obtain the vectorized representations T = {t1,t2,…,t10}, and the weighted sum of correlation coefficients for the kth seed document title in each cluster is computed as:

WSCCk = Σ(i=1..p) βi · d(tk, vi)

式中,WSCCk为第k个种子文档标题的相关系数加权和,βi为第i个相似度矩阵的权重,tk为第k个种子文档标题的向量化表示。Where WSCCk is the weighted sum of the correlation coefficients of the kth seed document title, βi is the weight of the ith similarity matrix entry, and tk is the vectorized representation of the kth seed document title.

各文档簇内相关系数加权和的计算公式具体为:The weighted sum of correlation coefficients within each document cluster is computed as:

WSCC = Σ(k=1..10) γk · WSCCk

式中,WSCC为文档簇内相关系数加权和,γk为第k个WSCCk的权重。Where WSCC is the weighted sum of the correlation coefficients within the document cluster, and γk is the weight of the kth WSCCk.
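A sketch of the WSCC computation, under two stated assumptions: the per-term "correlation" is taken to be the vector distance d(·,·) used earlier, and the β/γ weights default to 1 since the original does not fix them.

```python
def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def cluster_wscc(title_vecs, doc_vecs, betas=None, gammas=None):
    """WSCC sketch: WSCC_k = sum_i beta_i * d(t_k, v_i) for each seed
    title t_k, then WSCC = sum_k gamma_k * WSCC_k over the (up to ten)
    seed titles.  Distance-as-correlation and unit weights are assumptions."""
    betas = betas or [1.0] * len(doc_vecs)
    gammas = gammas or [1.0] * len(title_vecs)
    wscc_k = [sum(b * dist(t, v) for b, v in zip(betas, doc_vecs))
              for t in title_vecs]
    return sum(g * w for g, w in zip(gammas, wscc_k))

titles = [[0.0, 0.0], [1.0, 0.0]]
doc_vecs = [[0.0, 0.0], [0.0, 1.0]]
wscc = cluster_wscc(titles, doc_vecs)
```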

作为本实施例的优选实施方式,所述步骤S4中文档簇最终得分的计算公式具体为:As a preferred implementation of this embodiment, the calculation formula for the final score of the document cluster in step S4 is specifically:

CFS=0.3*DC1+0.2*DC2+0.1*DC+0.4*WSCCCFS=0.3*DC 1 +0.2*DC 2 +0.1*DC+0.4*WSCC

式中,CFS为文档簇最终得分,DC1为等级1的文档数量,DC2为等级2的文档数量,DC为总文档数量,WSCC为文档簇内相关系数加权和。In the formula, CFS is the final score of the document cluster, DC 1 is the number of documents at level 1, DC 2 is the number of documents at level 2, DC is the total number of documents, and WSCC is the weighted sum of the correlation coefficients within the document cluster.
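The final-score formula translates directly into code; the coefficients are taken from the formula above, and the sample inputs are made up for illustration.

```python
def cluster_final_score(dc1, dc2, dc, wscc):
    """CFS = 0.3*DC1 + 0.2*DC2 + 0.1*DC + 0.4*WSCC, coefficients as given."""
    return 0.3 * dc1 + 0.2 * dc2 + 0.1 * dc + 0.4 * wscc

# e.g. 4 level-1 documents, 6 level-2 documents, 20 documents in total
cfs = cluster_final_score(dc1=4, dc2=6, dc=20, wscc=2.5)
```

Clusters are then sorted by `cfs` in descending order.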

S5:将各文档簇内的种子文档标题与设定prompt输入语言大模型生成概括文档簇信息特征的短句,将所述短句作为文档簇的类标签。基于语言大模型及前述步骤得到的种子文档标题,是文档簇中与其他文档相关系数加权和排名前十的标题,相较于预先设定的关键词,更能够代表文档簇的信息特征。所述prompt可设定为以下类似内容:将以下十个文档标题,按相似性给出一个或多个词语或短语,归纳这些标题的特征,标题如下……(列出得到的十个文档标题)。S5: The seed document titles of each cluster, together with a preset prompt, are fed to the large language model to generate a short sentence summarizing the cluster's information features; this short sentence serves as the cluster's class label. The seed titles obtained in the preceding steps belong to the ten documents whose correlation-coefficient weighted sums with the other cluster documents rank highest, so they represent the cluster's information features better than pre-set keywords. The prompt can be set to something like: "For the following ten document titles, give one or more words or phrases, based on their similarity, that summarize the titles' common features. The titles are: …" (followed by the ten titles obtained).
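Assembling the class-label prompt from the ten seed titles might look like this. The wording mirrors the example prompt in the text; the actual LLM call is omitted as deployment-specific, and the sample titles are hypothetical.

```python
def build_label_prompt(seed_titles):
    """Assemble the class-label prompt from the seed titles; the wording
    follows the example prompt in the text.  The LLM call itself is omitted."""
    listing = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(seed_titles))
    return ("将以下十个文档标题,按相似性给出一个或多个词语或短语,"
            "归纳这些标题的特征,标题如下:\n" + listing)

prompt = build_label_prompt([f"规章制度文档{i}" for i in range(1, 11)])
```

The returned string would then be sent to the deployed language model, and its short-sentence reply stored as the cluster's class label.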

实施例二Embodiment 2

相应的,参阅图2所示,本实施例提供一种基于语言大模型的文档聚类排序系统,包括文档采集模块、文本向量化模块、文档聚类模块、类排序模块与类标签生成模块。Correspondingly, referring to FIG. 2 , this embodiment provides a document clustering and sorting system based on a large language model, including a document collection module, a text vectorization module, a document clustering module, a class sorting module and a class label generation module.

文档采集模块用于收集文档数据并进行结构化处理,所述文档数据包括文档内容与文档信息,所述文档信息包括文档标题与文档等级,对经过结构化处理的文档内容进行预处理,该模块用于实现如实施例一中步骤S1的功能,在此不再赘述。The document collection module is used to collect document data and perform structured processing. The document data includes document content and document information. The document information includes document title and document level. The structured document content is pre-processed. This module is used to implement the function of step S1 in Example 1, which will not be repeated here.

文本向量化模块用于将经过预处理的文档内容输入语言大模型计算得到文档内容的向量化表示,该模块用于实现如实施例一中步骤S2的功能,在此不再赘述。The text vectorization module is used to input the preprocessed document content into the language model to calculate the vectorized representation of the document content. This module is used to implement the function of step S2 in Example 1, which will not be repeated here.

文档聚类模块用于对经过向量化处理的文档内容使用聚类算法得到多个文档簇,各文档簇中文档通过向量计算得到基于向量距离的相似度矩阵,对各文档簇内文档按照相似度矩阵的加权和进行簇内排序,将簇内排序为前十的文档作为种子文档,所述种子文档的文档标题为种子文档标题,该模块用于实现如实施例一中步骤S3的功能,在此不再赘述。The document clustering module is used to use a clustering algorithm to obtain multiple document clusters for the vectorized document content. The documents in each document cluster obtain a similarity matrix based on vector distance through vector calculation. The documents in each document cluster are sorted within the cluster according to the weighted sum of the similarity matrix. The top ten documents in the cluster are used as seed documents. The document title of the seed document is the seed document title. This module is used to implement the function of step S3 in Example 1, which will not be repeated here.

类排序模块用于使用种子文档标题计算各文档簇内相关系数得到相关系数加权和,基于经过结构化处理的文档信息中的文档等级,统计文档簇中的各等级文档数量,根据各等级文档数量、文档总数与文档簇内相关系数加权和三个指标,计算所述三个指标的加权和,得到各文档簇最终得分,按照得分高低对文档簇进行文档排序,该模块用于实现如实施例一中步骤S4的功能,在此不再赘述。The class sorting module is used to use the seed document title to calculate the correlation coefficient within each document cluster to obtain the weighted sum of the correlation coefficients. Based on the document level in the structured document information, the number of documents of each level in the document cluster is counted. According to the three indicators of the number of documents of each level, the total number of documents and the weighted sum of the correlation coefficient within the document cluster, the weighted sum of the three indicators is calculated to obtain the final score of each document cluster, and the document clusters are sorted according to the scores. This module is used to implement the function of step S4 in Example 1, which will not be repeated here.

类标签生成模块用于将各文档簇内的种子文档标题与设定prompt输入语言大模型生成概括文档簇信息特征的短句,将所述短句作为文档簇的类标签,该模块用于实现如实施例一中步骤S5的功能,在此不再赘述。The class label generation module is used to generate a short sentence summarizing the information characteristics of the document cluster based on the seed document title in each document cluster and the set prompt input language model, and use the short sentence as the class label of the document cluster. This module is used to implement the function of step S5 in Example 1, which will not be repeated here.

实施例三Embodiment 3

本实施例提供一种电子设备,所述电子设备包括:存储器、处理器以及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如本发明任一实施例所述的基于语言大模型的文档聚类排序方法。This embodiment provides an electronic device, which includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the document clustering and sorting method based on a large language model as described in any embodiment of the present invention is implemented.

实施例四Embodiment 4

本实施例提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现如本发明任一实施例所述的基于语言大模型的文档聚类排序方法。This embodiment provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the document clustering and sorting method based on a large language model as described in any embodiment of the present invention is implemented.

本申请实施例中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示单独存在A、同时存在A和B、单独存在B的情况。其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项”及其类似表达,是指的这些项中的任意组合,包括单项或复数项的任意组合。例如,a,b和c中的至少一项可以表示:a,b,c,a和b,a和c,b和c或a和b和c,其中a,b,c可以是单个,也可以是多个。In the embodiments of the present application, "at least one" refers to one or more, and "more than one" refers to two or more. "And/or" describes the association relationship of associated objects, indicating that three relationships may exist. For example, A and/or B can represent the existence of A alone, the existence of A and B at the same time, and the existence of B alone. Among them, A and B can be singular or plural. The character "/" generally indicates that the previous and next associated objects are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b and c can be represented by: a, b, c, a and b, a and c, b and c, or a and b and c, where a, b, c can be single or multiple.

本领域普通技术人员可以意识到,本文中公开的实施例中描述的各单元及算法步骤,能够以电子硬件、计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Those of ordinary skill in the art will appreciate that the various units and algorithm steps described in the embodiments disclosed herein can be implemented in a combination of electronic hardware, computer software, and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professional and technical personnel can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.

所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working processes of the systems, devices and units described above can refer to the corresponding processes in the aforementioned method embodiments and will not be repeated here.

在本申请所提供的几个实施例中,任一功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory;以下简称:ROM)、随机存取存储器(Random Access Memory;以下简称:RAM)、磁碟或者光盘等各种可以存储程序代码的介质。In several embodiments provided in the present application, if any function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application can be essentially or partly embodied in the form of a software product that contributes to the prior art. The computer software product is stored in a storage medium, including several instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present application. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (Read-Only Memory; hereinafter referred to as: ROM), random access memory (Random Access Memory; hereinafter referred to as: RAM), disk or optical disk, and other media that can store program codes.

以上所述仅为本发明的实施例,并非因此限制本发明的专利范围,凡是利用本发明说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本发明的专利保护范围内。The above descriptions are merely embodiments of the present invention and are not intended to limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the present invention specification and drawings, or directly or indirectly applied in other related technical fields, are also included in the patent protection scope of the present invention.

Claims (10)

1.基于语言大模型的文档聚类排序方法,其特征在于,包括以下步骤:1. A document clustering and sorting method based on a large language model, characterized in that it comprises the following steps: S1:收集文档数据并进行结构化处理,所述文档数据包括文档内容与文档信息,所述文档信息包括文档标题与文档等级,对经过结构化处理的文档内容进行预处理;S1: collecting document data and performing structural processing, wherein the document data includes document content and document information, wherein the document information includes a document title and a document level, and pre-processing the document content after structural processing; S2:将经过预处理的文档内容输入语言大模型计算得到文档内容的向量化表示;S2: Input the preprocessed document content into the language model to calculate the vectorized representation of the document content; S3:对经过向量化处理的文档内容使用聚类算法得到多个文档簇,各文档簇中文档通过向量计算得到基于向量距离的相似度矩阵,对各文档簇内文档按照相似度矩阵的加权和进行簇内排序,将簇内排序为前十的文档作为种子文档,所述种子文档的文档标题为种子文档标题;S3: A clustering algorithm is used to obtain multiple document clusters for the vectorized document contents. A similarity matrix based on vector distance is obtained for the documents in each document cluster through vector calculation. The documents in each document cluster are sorted within the cluster according to the weighted sum of the similarity matrix. The top ten documents in the cluster are taken as seed documents, and the document title of the seed document is the seed document title. 
S4:使用种子文档标题计算各文档簇内相关系数得到相关系数加权和,基于经过结构化处理的文档信息中的文档等级,统计文档簇中的各等级文档数量,根据各等级文档数量、文档总数与文档簇内相关系数加权和三个指标,计算所述三个指标的加权和,得到各文档簇最终得分,按照得分高低对文档簇进行文档排序;S4: using the seed document title to calculate the correlation coefficient within each document cluster to obtain the weighted sum of the correlation coefficients, based on the document level in the structured document information, counting the number of documents of each level in the document cluster, and calculating the weighted sum of the three indicators of the number of documents of each level, the total number of documents and the weighted sum of the correlation coefficient within the document cluster to obtain the final score of each document cluster, and sorting the documents in the document cluster according to the score; S5:将各文档簇内的种子文档标题与设定prompt输入语言大模型生成概括文档簇信息特征的短句,将所述短句作为文档簇的类标签。S5: The seed document titles in each document cluster and the set prompt are input into the language model to generate a short sentence summarizing the information features of the document cluster, and the short sentence is used as the class label of the document cluster. 2.根据权利要求1所述的基于语言大模型的文档聚类方法,其特征在于,所述步骤S3中对各文档簇内文档按照相似度矩阵的加权和进行簇内排序具体为:2. The document clustering method based on a language large model according to claim 1 is characterized in that the step S3 of sorting the documents in each document cluster according to the weighted sum of the similarity matrix is specifically as follows: S31:各文档簇内的文档的向量化表示集合为V={v1,v2,…,vp},其中p为文档簇内文档总数,文档簇内第n个文档与第m个文档的相似度矩阵为d(vn,vm),相似度矩阵d(vn,vm)的计算公式具体为:S31: The vectorized representation set of documents in each document cluster is V = {v 1 ,v 2 ,…,v p }, where p is the total number of documents in the document cluster, and the similarity matrix between the nth document and the mth document in the document cluster is d(v n ,v m ). 
The calculation formula of the similarity matrix d(v n ,v m ) is specifically: 式中,vn为文档簇内第n个文档的向量化表示,vm为文档簇内第m个文档的向量化表示;Where vn is the vectorized representation of the nth document in the document cluster, and vm is the vectorized representation of the mth document in the document cluster; S32:文档簇内第n个文档的相似度矩阵加权和的计算公式具体为:S32: The calculation formula for the weighted sum of the similarity matrix of the nth document in the document cluster is: 式中,Sn为第n个文档的相似度矩阵加权和,αi为第i个相似度矩阵的权重,ci为文档簇内第i个文档的向量化表示;Where S n is the weighted sum of the similarity matrix of the nth document, α i is the weight of the i-th similarity matrix, and c i is the vectorized representation of the i-th document in the document cluster; S33:对各文档簇内文档按照相似度矩阵的加权和进行簇内排序。S33: Sort the documents in each document cluster within the cluster according to the weighted sum of the similarity matrix. 3.根据权利要求2所述的基于语言大模型的文档聚类方法,其特征在于,所述步骤S4中使用种子文档计算各文档簇内相关系数得到相关系数加权和具体为:将种子文档标题输入语言大模型计算得到向量化表示T={t1,t2,…,t10},各文档簇内第k个种子文档标题的相关系数加权和的计算公式具体为:3. 
The document clustering method based on the language big model according to claim 2 is characterized in that the step S4 uses the seed document to calculate the correlation coefficient in each document cluster to obtain the weighted sum of the correlation coefficients: the seed document title is input into the language big model to calculate the vectorized representation T = {t 1 , t 2 , ..., t 10 }, and the calculation formula of the weighted sum of the correlation coefficients of the k-th seed document title in each document cluster is specifically: 式中,WSCCk为第k个种子文档标题的相关系数加权和,βi为第i个相似度矩阵的权重,tk为第k个种子文档标题的向量化表示;Where WSCC k is the weighted sum of the correlation coefficients of the k-th seed document title, β i is the weight of the i-th similarity matrix, and t k is the vectorized representation of the k-th seed document title; 各文档簇内相关系数加权和的计算公式具体为:The calculation formula for the weighted sum of correlation coefficients within each document cluster is as follows: 式中,WSCC为文档簇内相关系数加权和,γk为第k个WSCCk0的权重。Where WSCC is the weighted sum of the correlation coefficients within the document cluster, and γk is the weight of the kth WSCC k0 . 4.根据权利要求3所述的基于语言大模型的文档聚类方法,其特征在于,所述步骤S4中文档簇最终得分的计算公式具体为:4. The document clustering method based on a language large model according to claim 3 is characterized in that the calculation formula for the final score of the document cluster in step S4 is specifically: CFS=0.3*DC1+0.2*DC2+0.1*DC+0.4*WSCCCFS=0.3*DC 1 +0.2*DC 2 +0.1*DC+0.4*WSCC 式中,CFS为文档簇最终得分,DC1为等级1的文档数量,DC2为等级2的文档数量,DC为总文档数量,WSCC为文档簇内相关系数加权和。In the formula, CFS is the final score of the document cluster, DC 1 is the number of documents at level 1, DC 2 is the number of documents at level 2, DC is the total number of documents, and WSCC is the weighted sum of the correlation coefficients within the document cluster. 5.基于语言大模型的文档聚类排序系统,其特征在于,包括文档采集模块、文本向量化模块、文档聚类模块、类排序模块与类标签生成模块;5. 
A document clustering and sorting system based on a large language model, characterized by comprising a document collection module, a text vectorization module, a document clustering module, a class sorting module and a class label generation module; 文档采集模块用于收集文档数据并进行结构化处理,所述文档数据包括文档内容与文档信息,所述文档信息包括文档标题与文档等级,对经过结构化处理的文档内容进行预处理;The document collection module is used to collect document data and perform structural processing, wherein the document data includes document content and document information, wherein the document information includes document title and document level, and pre-process the document content after structural processing; 文本向量化模块用于将经过预处理的文档内容输入语言大模型计算得到文档内容的向量化表示;The text vectorization module is used to input the pre-processed document content into the language model to calculate the vectorized representation of the document content; 文档聚类模块用于对经过向量化处理的文档内容使用聚类算法得到多个文档簇,各文档簇中文档通过向量计算得到基于向量距离的相似度矩阵,对各文档簇内文档按照相似度矩阵的加权和进行簇内排序,将簇内排序为前十的文档作为种子文档,所述种子文档的文档标题为种子文档标题;The document clustering module is used to obtain multiple document clusters by using a clustering algorithm on the document content that has been vectorized. The documents in each document cluster are vectorized to obtain a similarity matrix based on vector distance. The documents in each document cluster are sorted within the cluster according to the weighted sum of the similarity matrix. The top ten documents in the cluster are used as seed documents, and the document title of the seed document is the seed document title. 
类排序模块用于使用种子文档标题计算各文档簇内相关系数得到相关系数加权和,基于经过结构化处理的文档信息中的文档等级,统计文档簇中的各等级文档数量,根据各等级文档数量、文档总数与文档簇内相关系数加权和三个指标,计算所述三个指标的加权和,得到各文档簇最终得分,按照得分高低对文档簇进行文档排序;The class sorting module is used to calculate the correlation coefficient within each document cluster using the seed document title to obtain the weighted sum of the correlation coefficient, count the number of documents of each level in the document cluster based on the document level in the structured document information, calculate the weighted sum of the three indicators of the number of documents of each level, the total number of documents and the weighted sum of the correlation coefficient within the document cluster, obtain the final score of each document cluster, and sort the document clusters according to the scores; 类标签生成模块用于将各文档簇内的种子文档标题与设定prompt输入语言大模型生成概括文档簇信息特征的短句,将所述短句作为文档簇的类标签。The class label generation module is used to generate a short sentence summarizing the information features of the document cluster based on the seed document title in each document cluster and the set prompt input language model, and use the short sentence as the class label of the document cluster. 6.根据权利要求5所述的基于语言大模型的文档聚类系统,其特征在于,所述文档聚类模块中对各文档簇内文档按照相似度矩阵的加权和进行簇内排序具体为:6. The document clustering system based on the language big model according to claim 5 is characterized in that the document clustering module sorts the documents in each document cluster according to the weighted sum of the similarity matrix in the cluster as follows: S31:各文档簇内的文档的向量化表示集合为V={v1,v2,…,vp},其中p为文档簇内文档总数,文档簇内第n个文档与第m个文档的相似度矩阵为d(vn,vm),相似度矩阵d(vn,vm)的计算公式具体为:S31: The vectorized representation set of documents in each document cluster is V = {v 1 ,v 2 ,…,v p }, where p is the total number of documents in the document cluster, and the similarity matrix between the nth document and the mth document in the document cluster is d(v n ,v m ). 
The calculation formula of the similarity matrix d(v n ,v m ) is specifically: 式中,vn为文档簇内第n个文档的向量化表示,vm为文档簇内第m个文档的向量化表示;Where vn is the vectorized representation of the nth document in the document cluster, and vm is the vectorized representation of the mth document in the document cluster; S32:文档簇内第n个文档的相似度矩阵加权和的计算公式具体为:S32: The calculation formula for the weighted sum of the similarity matrix of the nth document in the document cluster is: 式中,Sn为第n个文档的相似度矩阵加权和,αi为第i个相似度矩阵的权重,vi为文档簇内第i个文档的向量化表示;Where S n is the weighted sum of the similarity matrix of the nth document, α i is the weight of the i-th similarity matrix, and vi is the vectorized representation of the i-th document in the document cluster; S33:对各文档簇内文档按照相似度矩阵的加权和进行簇内排序。S33: Sort the documents in each document cluster within the cluster according to the weighted sum of the similarity matrix. 7.根据权利要求6所述的基于语言大模型的文档聚类系统,其特征在于,所述类排序模块中使用种子文档计算各文档簇内相关系数得到相关系数加权和具体为:将种子文档标题输入语言大模型计算得到向量化表示T={t1,t2,…,t10},各文档簇内第k个种子文档标题的相关系数加权和的计算公式具体为:7. 
The document clustering system based on the language big model according to claim 6 is characterized in that the class ranking module uses the seed document to calculate the correlation coefficient in each document cluster to obtain the weighted sum of the correlation coefficients as follows: the seed document title is input into the language big model to calculate the vectorized representation T = {t 1 , t 2 , ..., t 10 }, and the calculation formula of the weighted sum of the correlation coefficients of the k-th seed document title in each document cluster is as follows: 式中,WSCCk为第k个种子文档标题的相关系数加权和,βi为第i个相似度矩阵的权重,tk为第k个种子文档标题的向量化表示;Where WSCC k is the weighted sum of the correlation coefficients of the k-th seed document title, β i is the weight of the i-th similarity matrix, and t k is the vectorized representation of the k-th seed document title; 各文档簇内相关系数加权和的计算公式具体为:The calculation formula for the weighted sum of correlation coefficients within each document cluster is as follows: 式中,WSCC为文档簇内相关系数加权和,γk为第k个WSCCk的权重。Where WSCC is the weighted sum of the correlation coefficients within the document cluster, and γk is the weight of the kth WSCCk . 8.根据权利要求7所述的基于语言大模型的文档聚类系统,其特征在于,所述类排序模块中文档簇最终得分的计算公式具体为:8. The document clustering system based on the language large model according to claim 7 is characterized in that the calculation formula of the final score of the document cluster in the class ranking module is specifically: CFS=0.3*DC1+0.2*DC2+0.1*DC+0.4*WSCCCFS=0.3*DC 1 +0.2*DC 2 +0.1*DC+0.4*WSCC 式中,CFS为文档簇最终得分,DC1为等级1的文档数量,DC2为等级2的文档数量,DC为总文档数量,WSCC为文档簇内相关系数加权和。In the formula, CFS is the final score of the document cluster, DC 1 is the number of documents at level 1, DC 2 is the number of documents at level 2, DC is the total number of documents, and WSCC is the weighted sum of the correlation coefficients within the document cluster. 9.一种电子设备,所述电子设备包括:存储器、处理器以及存储在所述存储器上并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至4中任一项所述的基于语言大模型的文档聚类排序方法。9. 
An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the document clustering and sorting method based on a large language model as described in any one of claims 1 to 4 when executing the computer program. 10.一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1至4中任一项所述的基于语言大模型的文档聚类排序方法。10. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the document clustering and sorting method based on a large language model as described in any one of claims 1 to 4 is implemented.
CN202410377375.3A 2024-03-29 2024-03-29 Document clustering and sorting method, system, device and medium based on language large model Pending CN118152572A (en)


Publications (1)

Publication Number Publication Date
CN118152572A true CN118152572A (en) 2024-06-07

Family

ID=91293567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410377375.3A Pending CN118152572A (en) 2024-03-29 2024-03-29 Document clustering and sorting method, system, device and medium based on language large model

Country Status (1)

Country Link
CN (1) CN118152572A (en)

Similar Documents

Publication Publication Date Title
CN108197117B (en) A Chinese text keyword extraction method based on document topic structure and semantics
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN103514183B (en) Information search method and system based on interactive document clustering
CN110825877A (en) A Semantic Similarity Analysis Method Based on Text Clustering
US8849787B2 (en) Two stage search
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN111221968B (en) Author disambiguation method and device based on subject tree clustering
CN111090731A (en) Electric power public opinion abstract extraction optimization method and system based on topic clustering
CN107844533A (en) A kind of intelligent Answer System and analysis method
CN107506472B (en) Method for classifying browsed webpages of students
Jayaram et al. A review: Information extraction techniques from research papers
CN113962293A (en) A Name Disambiguation Method and System Based on LightGBM Classification and Representation Learning
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator
KR20180129001A (en) Method and System for Entity summarization based on multilingual projected entity space
CN111859961A (en) A Text Keyword Extraction Method Based on Improved TopicRank Algorithm
CN116756347A (en) A semantic information retrieval method based on big data
CN118193731A (en) Method and system for topic identification and clustering screening of scientific and technological texts based on SAO structure
CN111241410A (en) Industry news recommendation method and terminal
CN118797005A (en) Intelligent question-answering method, device, electronic device, storage medium and product
CN116738979A (en) Power grid data search method, system and electronic equipment based on core data identification
CN113157857B (en) News-oriented hot topic detection method, device and equipment
Saeed et al. An abstractive summarization technique with variable length keywords as per document diversity
CN111737607A (en) Data processing method, data processing device, electronic equipment and storage medium
Ahmad et al. News article summarization: Analysis and experiments on basic extractive algorithms
Cunningham et al. Applying connectionist models to information retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination