CN117010366A - Text specific sentence-oriented content identification and correction method - Google Patents
- Publication number
- CN117010366A (application CN202310820600.1A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- sentences
- hash
- value
- text
- Prior art date
- Legal status: Pending (an assumption, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Description
Technical Field
The present invention relates to the technical field of natural language processing, and in particular to a content identification and error correction method for specific sentences in text.
Background Art
The scenario of identifying particular passages in an article resembles that of citation recognition in academic literature, specifically the task of recognizing implicit (unmarked) citations. Citation recognition is the process of identifying and extracting citations from academic text. A common current approach uses deep neural networks with attention mechanisms to extract the referenced text from the citation context, treating the task as classification or sequence labeling: under the sequence-labeling view, a model is trained to predict a label for each word, and the sentences formed by the labeled words are judged to be citation sentences. Earlier work used a sliding-window idea instead: a suitable window size is chosen to select candidate citation segments, and text-similarity matching then determines the implicit citation part.
In the prior art, hfl/rbt3 is a Chinese pre-trained language model based on RoBERTa. It is a small model with 3 layers and 12 attention heads, released by Harbin Institute of Technology (HIT) as part of the Chinese-BERT-wwm project. hfl/rbt3 can be used for various natural language processing tasks such as text classification, named entity recognition, and sentiment analysis, and can be easily loaded and used with the Transformers or PaddleHub libraries.
Locality-Sensitive Hashing (LSH) is a technique for fast approximate nearest-neighbor search over massive high-dimensional data, and is highly efficient at that scale. The basic idea is to hash high-dimensional feature vectors into a low-dimensional feature space with hash functions chosen so that points close in the original space remain close after hashing; that is, nearby points collide under the hash functions with higher probability than distant points do. Locality-sensitive hashing effectively reduces the complexity of data storage and retrieval, and is applicable to many data types, including text, images, and audio.
Concretely, locality-sensitive hashing first defines a family of hash functions mapping high-dimensional input vectors to low-dimensional output values; the functions should satisfy the property that similar inputs map to the same output with high probability, while dissimilar inputs do so with low probability. Next, the resulting hash value is split into several parts, each recorded as a key, and each key is stored in a hash table together with the original input vector (the key as the table key, the vector as the value). To find the nearest neighbors of a query vector, the same hash functions are applied to it, the corresponding output values are looked up in the hash table, all input vectors sharing at least one key with the query are retrieved, and their distances to the query are computed; the closest vectors are returned as approximate nearest neighbors.
MinHash is a technique for quickly estimating the similarity between two sets. Invented by Andrei Broder, it was first used in the AltaVista search engine to detect duplicate web pages and remove the duplicated parts, and has also been applied to large-scale clustering problems, such as clustering documents by the similarity of their word sets. The minimum hash value has a precise relationship to Jaccard similarity: the Jaccard similarity of two sets equals the probability that a random permutation gives them the same minimum hash value. MinHash is therefore commonly used to estimate Jaccard similarity, providing a fast approximation of the Jaccard similarity between two sets.
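The relationship between MinHash and Jaccard similarity can be illustrated with a short sketch (illustrative code, not part of the patent; the function names and the use of Python's built-in `hash` to map set elements to integers are assumptions):

```python
import random

def jaccard(a: set, b: set) -> float:
    # Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
    return len(a & b) / len(a | b)

def minhash_estimate(a: set, b: set, perm_num: int = 128, seed: int = 0) -> float:
    # P[min-hash of A == min-hash of B] equals the Jaccard similarity,
    # so the fraction of matching minima over perm_num random affine
    # hash functions is an unbiased estimate of it.
    rng = random.Random(seed)
    m = (1 << 61) - 1  # Mersenne prime modulus
    equal = 0
    for _ in range(perm_num):
        coef_a, coef_b = rng.randrange(1, m), rng.randrange(0, m)
        h = lambda x: (coef_a * hash(x) + coef_b) % m
        if min(map(h, a)) == min(map(h, b)):
            equal += 1
    return equal / perm_num

s1 = set("我们要不断把中华民族伟大复兴的历史伟业推向前进")
s2 = set("我们要把中华民族伟大复兴的事业推向前进")
print(jaccard(s1, s2), minhash_estimate(s1, s2))
```

With more permutations the estimate concentrates around the exact Jaccard value; the patent's method uses exactly this approximation for long sentences.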
The above techniques serve different scenarios, but there is no comprehensive method for the specific scenario of identifying particular sentences in an article (such as the content of important speeches by leaders, lines of poetry, or famous quotations of important figures). For a concrete application, the algorithm must be adapted appropriately to the scenario; deep learning is not always the right choice, and simple rule-based recognition is sometimes more accurate and more efficient than deep learning. The present invention applies a rule-based method to recognize specific sentences in articles.
Summary of the Invention
Aiming at the scenario of recognizing specific sentences in articles, the present invention combines existing methods into a text-oriented recognition and error-correction method. The method comprises two parts, a recognition module and a matching module: the recognition module is mainly based on regular-expression matching and the RoBERTa model, and the matching module is mainly based on locality-sensitive hashing and Jaccard similarity. The recognition results are accurate and reliable.
To achieve the above objects, the present invention provides the following technical solution:
The present invention provides a content identification and error correction method for specific sentences in text, comprising the following steps:
S1. Apply regular-expression matching rules to preliminarily identify the golden-sentence part. Two cases arise: sentences enclosed in double quotation marks are matched correctly by the regular expression and proceed to step S3; sentences not enclosed in double quotation marks proceed to step S2 for further processing.
S2. Use the trained hfl/rbt3 model to perform next-sentence prediction and determine the golden-sentence part, then proceed to step S3.
S3. Collect a golden-sentence database and split the identified golden-sentence part into individual sentences. For each sentence, choose the matching method by length. If the sentence length is greater than or equal to a set threshold, use MinHash matching: compute the sentence's MinHash value, store the segmented hash values of the database sentences together with the sentences in hash tables, find the most similar sentence via inverted-index matching, and proceed to step S4. If the sentence length is below the threshold, use Jaccard matching: compute the Jaccard similarity between the target sentence and the other sentences in the database, take the sentence with the largest Jaccard value, and proceed to step S4.
S4. Starting from the first sentence, take it as the target sentence and match it against the database. If a match is found, compare the matched database sentence and its continuation with the recognized golden-sentence part one by one to detect errors, then output the recognition result and the erroneous parts.
Further, the hfl/rbt3 model in step S2 is trained on positive and negative sentence-pair samples: positive samples are consecutive sentence pairs from the golden-sentence database, and negative samples are collected manually. The negative samples are augmented before being fed into the pre-trained hfl/rbt3 model for training; augmentation methods include, but are not limited to, random deletion, random replacement, and random addition.
Further, the specific steps of the further judgment with the trained model in step S2 are as follows:
S201. First obtain a paragraph not enclosed in quotation marks and split it into sentences at full stops, exclamation marks, question marks, and ellipses.
S202. Take the first sentence as a golden sentence; it serves as the preceding sentence, and the next sentence serves as the following sentence. After processing by the hfl/rbt3 tokenizer, the pair is input to the hfl/rbt3 model and the return value is examined.
S203. Initialize the result with the first sentence as a golden sentence and feed the pair into the model. If the return value is 1, the following sentence is also a golden sentence: append it to the result, repeat step S202 with that sentence as the new preceding sentence, and judge whether the next sentence follows it. If the return value is 0, the following sentence is not a golden sentence and the judgment stops. The process ends when all sentences have been traversed; the resulting sentences form the golden-sentence part.
Further, the specific steps of the minimum-hash operation on a sentence in step S3 are:
S301. First apply 3-shingle processing to the sentence to obtain the list of shingles.
S302. Define the number of random permutation transformations, denoted perm_num, and randomly generate the coefficients a and b of perm_num affine transformation functions:

h(x) = (a·x + b) mod m

where m is the Mersenne prime 2^61 − 1, and a and b are random integers below the maximum value 2^32 − 1.
S303. For each hash function, hash every shingle and take the minimum as that function's minimum hash value; the perm_num minimum hash values together form the minimum hash signature of the original text.
Further, the specific steps of storing the database sentences in step S3 are:
(1) With the number of random permutation transformations denoted perm_num, divide all the permutations into b bands of r hashes each. If the true Jaccard similarity between two texts is js, then the probability that any given position of their perm_num minimum hashes is equal is js; it follows that the probability that the minimum-hash signatures of the two texts agree on all hashes in at least one band, i.e., the probability of becoming a candidate pair, is:

1 − (1 − js^r)^b

The curve relating Jaccard similarity to candidate-pair probability is S-shaped; the similarity at which the candidate probability reaches 1/2 is taken as the threshold, i.e., a Jaccard similarity of 0.5, and b, r, and the threshold are then approximately related by:

threshold ≈ (1/b)^(1/r)

Given the threshold, perm_num, and weights for false positives and false negatives, the optimal values of b and r are computed. A false positive is a dissimilar text pair hashed into the same bucket, and a false negative is a similar text pair not hashed into any common bucket; the false-positive and false-negative weights are both set to 0.5.
(2) Compute the MinHash value of each sentence in the database, then divide the resulting perm_num hashes into b blocks of r minimum hashes each; the contents of the b blocks serve as the sentence's b segment hash values (keys).
(3) Record the b segment hash values (keys) in hash tables (HashTables), where each key is the index and the indexed value is the corresponding sentence representation. At the same time, create a dictionary whose keys are the sentence representations, denoted mi with i an integer starting from 0, and whose values are the sentences themselves.
Further, the specific steps of inverted-index matching in step S3 are:
For the target sentence, first check its length. If it exceeds the fixed value, compute its MinHash and the corresponding keys, look up the keys in the hash tables (HashTables) to obtain the candidate set of sentences sharing part of their hash values, and then take the candidate closest to the target sentence in Hamming distance. If it is within the tolerance range, output it; otherwise mark the sentence as unrecognized.
Compared with the prior art, the beneficial effects of the present invention are:
1. The present invention identifies specific sentences, such as important speeches by leaders, by regular-expression matching rather than by applying a deep learning model directly to citation recognition. The recognition results are comparatively more accurate, no large dataset needs to be collected, and the cost of manual annotation is reduced.
2. The present invention further examines the initially identified golden sentences with the RoBERTa model, considering whether the parts quoted without quotation marks all belong to the golden sentence; this makes golden-sentence recognition more precise.
3. The present invention matches golden sentences and identifies errors based on the locality-sensitive hashing technique MinHash and Jaccard similarity computation. MinHash maps high-dimensional feature vectors to low-dimensional ones and stores long sentences in hash form, which can be computed and stored offline and matched by similarity through an inverted index; this is highly efficient for massive volumes of long text. For short sentences, direct Jaccard similarity computation is more accurate. Combining the two balances efficiency and accuracy.
4. The present invention processes target sentences with 3-shingles instead of jieba word segmentation, achieving higher accuracy and processing speed. Because of the particular nature of golden sentences such as important speeches by leaders, generally not even a single wrong character is tolerated; jieba segmentation calls a third-party library and takes longer, and removing stop words and punctuation before MinHash computation widens the gap between an erroneous sentence and the original. The simple 3-shingle approach is therefore more effective.
In summary, the content identification and error correction method for specific sentences in text proposed by the present invention combines several techniques, including regular-expression matching, next-sentence prediction with the hfl/rbt3 model, the locality-sensitive hashing technique MinHash, and Jaccard similarity computation. It can accurately identify specific sentences such as important speeches by leaders in news articles, and the improved MinHash algorithm makes it more effective on this kind of text and more accurate in error identification, allowing it to play an important role in areas of national importance such as news publishing. The method can also be extended to recognizing lines of poetry in articles or famous quotations of other important figures.
Description of the Drawings
To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them.
Figure 1 is a flowchart of the content identification and error correction method for specific sentences in text provided by an embodiment of the present invention.
Figure 2 is a flowchart of the training and inference of the next-sentence prediction model provided by an embodiment of the present invention.
Figure 3 is a flowchart of the MinHash operation provided by an embodiment of the present invention.
Figure 4 is a storage schematic of the MinHash hash tables provided by an embodiment of the present invention.
Figure 5 shows the complete input and output results provided by an embodiment of the present invention.
Detailed Description of the Embodiments
For a better understanding of the technical solution, the method of the present invention is described in detail below with reference to the accompanying drawings.
Taking important speeches by leaders as an example, a content identification and error correction method for specific sentences in text according to the present invention is introduced below; as shown in Figure 1, it comprises the following steps:
Step 1: Recognition by regular-expression matching
Apply regular matching rules to preliminarily identify important speeches by leaders, which are also recorded here as golden sentences. Two cases arise: sentences enclosed in double quotation marks are matched correctly by the regular expression and need no further processing; sentences not enclosed in double quotation marks require further processing.
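The double-quote case can be sketched as follows (an illustrative pattern and helper, assumptions of this example rather than the patent's exact regular expression):

```python
import re

# Match spans enclosed in Chinese double quotation marks "“…”".
QUOTED = re.compile(r'“([^”]+)”')

def extract_quoted(text: str):
    # Every quoted span is a candidate golden sentence and can go
    # directly to the matching module without further processing.
    return QUOTED.findall(text)

text = '会上强调,“我们要不断把中华民族伟大复兴的历史伟业推向前进”,并作出部署。'
print(extract_quoted(text))
```

Unquoted paragraphs fall through to the next-sentence prediction model in Step 2.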
Step 2: Training and inference of the next-sentence prediction model
Figure 2 shows the steps of fine-tuning and inference for the pre-trained hfl/rbt3 model. First, positive and negative sentence-pair samples of leaders' important speeches are collected: positive samples come from consecutive sentence pairs in the golden-sentence database, while negative samples are collected manually, so the positive and negative data are imbalanced. The negative samples are therefore augmented, the data is tokenized, and the pre-trained hfl/rbt3 model is then fine-tuned.
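The three augmentation operations named above (random deletion, replacement, and addition) can be sketched at the token level as follows (a minimal illustration; the function names and probabilities are invented for this example):

```python
import random

def random_delete(tokens, p=0.1, rng=random):
    # Drop each token independently with probability p,
    # keeping at least one token.
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]

def random_replace(tokens, vocab, n=1, rng=random):
    # Replace n randomly chosen positions with random vocabulary tokens.
    out = list(tokens)
    for _ in range(n):
        out[rng.randrange(len(out))] = rng.choice(vocab)
    return out

def random_add(tokens, vocab, n=1, rng=random):
    # Insert n random vocabulary tokens at random positions.
    out = list(tokens)
    for _ in range(n):
        out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out
```

Applying these to the manually collected negative pairs multiplies their count and reduces the class imbalance before fine-tuning.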
The specific steps of inference with the trained model are as follows:
S1. First obtain a paragraph not enclosed in quotation marks and split it into sentences at full stops, exclamation marks, question marks, and ellipses.
S2. Take the first sentence as a golden sentence; it serves as the preceding sentence, and the next sentence serves as the following sentence. After processing by the hfl/rbt3 tokenizer, the pair is input to the hfl/rbt3 model and the return value is examined.
S3. If the return value is 1, the following sentence is also a golden sentence: append it to the result, repeat step S2 with that sentence as the new preceding sentence, and judge whether the next sentence follows it. If the return value is 0, the following sentence is not a golden sentence and the judgment stops. The process ends when all sentences have been traversed; the resulting sentences form the golden-sentence part.
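The expansion loop in steps S1–S3 can be sketched as follows; since loading the fine-tuned hfl/rbt3 model is outside the scope of this example, a stub classifier stands in for it (the stub and all function names are assumptions of this sketch):

```python
import re

def split_sentences(paragraph: str):
    # Split at full stop, exclamation mark, question mark, or ellipsis,
    # keeping the terminator attached to its sentence.
    parts = re.split(r'(?<=[。!?…])', paragraph)
    return [p for p in parts if p.strip()]

def expand_golden(sentences, is_next_pair):
    # Greedily extend the golden span: start from the first sentence and
    # keep appending while the classifier returns 1 ("is next sentence").
    result = [sentences[0]]
    prev = sentences[0]
    for cur in sentences[1:]:
        if is_next_pair(prev, cur) != 1:  # 0 -> stop expanding
            break
        result.append(cur)
        prev = cur
    return result

# Stub standing in for the fine-tuned hfl/rbt3 next-sentence classifier.
def fake_model(prev, cur):
    return 1 if '金句' in cur else 0

para = '第一句金句。第二句金句!这是普通句子。'
print(expand_golden(split_sentences(para), fake_model))
```

In the real pipeline, `is_next_pair` would tokenize the pair with the hfl/rbt3 tokenizer and take the argmax of the model's logits.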
Step 3: MinHash operation
Figure 3 shows the general steps of MinHash; the specific steps of the MinHash operation on a sentence are as follows:
S1. First, apply 3-shingle processing to the text. Suppose the original sentence is "我们要不断把中华民族伟大复兴的历史伟业推向前进"; the result is: ['我们要','们要不','要不断','不断把','断把中','把中华','中华民','华民族','民族伟','族伟大','伟大复','大复兴','复兴的','兴的历','的历史','历史伟','史伟业','伟业推','业推向','推向前','向前进'];
S2. Define the number of random permutation transformations, denoted perm_num, and randomly generate the coefficients a and b of perm_num affine transformation functions:

h(x) = (a·x + b) mod m

where m is the Mersenne prime, here 2^61 − 1, and a and b are random integers below the maximum value 2^32 − 1.
S3. For each hash function, hash every shingle and take the minimum as that function's minimum hash value; the perm_num minimum hash values together form the minimum hash signature of the original text.
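Steps S1–S3 can be sketched end to end (an illustrative sketch rather than the patent's implementation; Python's built-in `hash` stands in for mapping shingles to integers):

```python
import random

def shingles_3(text: str):
    # Character-level 3-shingles: every run of three consecutive characters.
    return [text[i:i + 3] for i in range(len(text) - 2)]

M = (1 << 61) - 1        # Mersenne prime modulus m = 2^61 - 1
MAX_AB = (1 << 32) - 1   # a and b are drawn below 2^32 - 1

def make_hashes(perm_num: int, seed: int = 0):
    # Coefficients (a, b) for perm_num affine functions h(x) = (a*x + b) % m.
    rng = random.Random(seed)
    return [(rng.randint(1, MAX_AB), rng.randint(0, MAX_AB))
            for _ in range(perm_num)]

def minhash(text: str, params):
    # For each affine function, keep the minimum hash over all shingles;
    # the perm_num minima form the sentence's MinHash signature.
    shingle_ids = [hash(s) for s in shingles_3(text)]
    return [min((a * x + b) % M for x in shingle_ids) for a, b in params]

params = make_hashes(128)
sig = minhash('我们要不断把中华民族伟大复兴的历史伟业推向前进', params)
print(len(sig))
```

Identical sentences always produce identical signatures, and near-identical sentences agree on most positions, which is what the banding in Step 4 exploits.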
Step 4: MinHash index storage
Each sentence in the collected database of leaders' important speeches is stored in hash tables to enable fast subsequent matching of similar sentences. As shown in Figure 4, the specific steps of the storage operation are:
S1. Consider dividing the perm_num hashes into b bands of r hashes each. If the true Jaccard similarity between two texts is js, then the probability that any given position of their perm_num minimum hashes is equal is js; it follows that the probability that the minimum-hash signatures of the two texts agree on all hashes in at least one band is:

1 − (1 − js^r)^b

which is the probability of becoming a candidate pair. The curve relating Jaccard similarity to candidate-pair probability is S-shaped and largely unaffected by b and r; the similarity at which the candidate probability reaches 1/2 is taken as the threshold, i.e., a Jaccard similarity of 0.5, and b, r, and the threshold are then approximately related by:

threshold ≈ (1/b)^(1/r)

Given the threshold, perm_num, and weights for false positives and false negatives, the optimal values of b and r are computed. A false positive is a dissimilar text pair hashed into the same bucket, and a false negative is a similar text pair not hashed into any common bucket; the false-positive and false-negative weights are generally both set to 0.5, meaning both cases should occur as rarely as possible.
S2. Compute the MinHash value of each sentence in the database, then divide the resulting perm_num hashes into b blocks of r minimum hashes each; the contents of the b blocks serve as the sentence's b segment hash values (keys).
S3. Record the b segment hash values (keys) in hash tables (HashTables), where each key is the index and the indexed value is the corresponding sentence representation, such as "m1". At the same time, create a dictionary whose keys are the sentence representations "mi", with i an integer starting from 0, and whose values are the sentences themselves.
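The banded storage of steps S1–S3 can be sketched as follows (an illustrative sketch; the class name and the toy 8-value signatures are assumptions of this example):

```python
from collections import defaultdict

def band_keys(signature, b: int, r: int):
    # Split the perm_num = b*r minimum hashes into b bands of r values;
    # each band, tagged with its index, becomes one hash-table key.
    assert len(signature) == b * r
    return [(i, tuple(signature[i * r:(i + 1) * r])) for i in range(b)]

class LshIndex:
    def __init__(self, b: int, r: int):
        self.b, self.r = b, r
        self.table = defaultdict(set)  # band key -> sentence ids ("mi")
        self.sentences = {}            # "mi" -> original sentence

    def insert(self, sent_id: str, sentence: str, signature):
        self.sentences[sent_id] = sentence
        for key in band_keys(signature, self.b, self.r):
            self.table[key].add(sent_id)

    def candidates(self, signature):
        # Any stored sentence sharing at least one full band with the query.
        out = set()
        for key in band_keys(signature, self.b, self.r):
            out |= self.table.get(key, set())
        return out

idx = LshIndex(b=4, r=2)
idx.insert('m0', '句子甲', [1, 2, 3, 4, 5, 6, 7, 8])
idx.insert('m1', '句子乙', [1, 2, 9, 9, 9, 9, 9, 9])  # shares band 0 with m0
print(idx.candidates([1, 2, 0, 0, 0, 0, 0, 0]))
```

Because membership in the candidate set only requires one whole band to match, lookup cost is independent of the database size apart from the final distance comparisons.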
Step 5: Error identification in the recognized golden sentences
Split the recognized golden-sentence part into sentences. Starting from the first sentence, take it as the target sentence and match it against the database. If a match is found, compare the matched database sentence and its continuation with the recognized golden-sentence part one by one for errors.
To match a target sentence, first check its length. If it exceeds a fixed value, compute its MinHash and the corresponding band hash values (keys), look up every band key in the hash tables (HashTables), and collect the candidate set of sentences that share at least one band key. Then find the candidate with the smallest Hamming distance to the target sentence; if it is within the tolerance range, output it, otherwise mark the target as unrecognized. If the sentence length is below the fixed value, compute the Jaccard similarity between the target sentence and every other sentence in the database and take the largest; if that similarity falls within the required range, output the corresponding sentence, otherwise mark the target as unrecognized.
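The two-branch matching rule can be sketched as follows. This is a simplified, self-contained illustration under stated assumptions: it tokenizes by character, ranks LSH candidates by Jaccard similarity rather than the Hamming distance the patent names, ignores band indices when comparing keys, and all names and thresholds (`len_threshold`, `tol`) are hypothetical.

```python
import hashlib

def _minhash(tokens, perm_num):
    # Salted-hash MinHash signature (illustrative, not a library call).
    return [min(int.from_bytes(hashlib.md5(f"{i}:{t}".encode()).digest()[:8], "big")
                for t in tokens)
            for i in range(perm_num)]

def _band_keys(sig, b, r):
    return [tuple(sig[i * r:(i + 1) * r]) for i in range(b)]

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def match(target, database, b=8, r=4, len_threshold=10, tol=0.6):
    """Return the closest database sentence, or None ("unrecognized")."""
    if len(target) > len_threshold:
        # Long sentence: LSH path. Any shared band key makes a candidate.
        tkeys = set(_band_keys(_minhash(list(target), b * r), b, r))
        pool = [s for s in database
                if tkeys & set(_band_keys(_minhash(list(s), b * r), b, r))]
    else:
        # Short sentence: brute-force similarity over the whole database.
        pool = database
    best = max(pool, key=lambda s: jaccard(list(target), list(s)), default=None)
    if best is not None and jaccard(list(target), list(best)) >= tol:
        return best
    return None
```

The length check matters because short sentences produce too few tokens for a stable MinHash signature, so the exact Jaccard computation over the full database is both affordable and more reliable there.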
Figure 5 shows the complete input and output. After an article is input, the golden-sentence recognition and similarity matching described above yield the original sentence closest to each target sentence, and both the original sentence and the recognized sentence are output.
In summary, the content identification and error-correction method for specific sentences in text proposed by this invention combines several techniques, including regular-expression matching, next-sentence prediction with the hfl/rbt3 model, the locality-sensitive hashing technique MinHash, and Jaccard similarity computation. It can identify important leaders' speeches in news articles fairly accurately, and the improved MinHash algorithm handles this specific kind of text more effectively, giving high accuracy in error identification as well. The method can also be extended to recognizing lines of poetry in articles, or famous quotations of other important figures.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features equivalently replaced, without such modifications or replacements causing the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310820600.1A CN117010366A (en) | 2023-07-06 | 2023-07-06 | Text specific sentence-oriented content identification and correction method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117010366A true CN117010366A (en) | 2023-11-07 |
Family
ID=88562841
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310820600.1A Pending CN117010366A (en) | 2023-07-06 | 2023-07-06 | Text specific sentence-oriented content identification and correction method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117010366A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118153564A (en) * | 2024-03-18 | 2024-06-07 | 腾讯科技(深圳)有限公司 | Scenario processing method, scenario processing device, computer device, storage medium, and program product |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||