CN107145476A - Keyword extraction algorithm based on improved TF-IDF - Google Patents
Keyword extraction algorithm based on improved TF-IDF
- Publication number
- CN107145476A (application CN201710369600.9A)
- Authority
- CN
- China
- Prior art keywords
- words
- word
- idf
- speech
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a keyword extraction algorithm based on an improved TF-IDF, comprising the following steps. S1: uniformly format the input form of the text; S2: load the configuration file for Stanford NLP; S3: obtain the set Sentences of all sentences in the text from the configuration file; S4: take one sentence from Sentences at a time; S5: obtain the set Tokens of all words in the current sentence; S6: take one token from Tokens at a time; S7: obtain the character/word and part of speech of the current token, and assign different part-of-speech weights; S8: calculate the total number of characters/words in the current sentence and their position percentages; S9: obtain the set Words of all characters/words in the text; S10: take one word from Words at a time; S11: calculate the TF and IDF of the current word; S12: calculate the weight W of every word and select keywords according to W. By adding a part-of-speech factor, the invention improves extraction accuracy and avoids the high space complexity of constructing structures such as a Pat-tree.
Description
Technical Field
The present invention relates to a keyword extraction algorithm based on an improved TF-IDF.
Background Art
In existing TF-IDF-based keyword extraction methods, the main drawback is that measuring the importance of a word by term frequency alone is not comprehensive: important words sometimes occur only a few times. Moreover, such algorithms cannot reflect part-of-speech information, so a noun and a particle are given the same importance, which is clearly unreasonable. Other related techniques improve extraction accuracy, but at the cost of higher space complexity, and therefore fail to meet practical requirements.
Summary of the Invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a keyword extraction algorithm based on an improved TF-IDF.
The technical scheme adopted by the present invention is as follows:
A keyword extraction algorithm based on an improved TF-IDF, comprising the following steps:
S1: uniformly format the input form of the text;
S2: standardize the text format and load the Properties configuration file for Stanford NLP;
S3: obtain the set Sentences of all sentences in the text according to the sentence separators defined in the Properties configuration file;
S4: take one sentence from the set Sentences at a time;
S5: obtain the set Tokens of all words in the current sentence;
S6: take one token from the set Tokens at a time;
S7: obtain the character/word and the part of speech of the current token, and assign different part-of-speech weights to different parts of speech;
S8: calculate the total number of characters/words in the current sentence and the position percentage of each;
S9: from the preceding operations, obtain the set Words of all characters/words in the text;
S10: take one word from the set Words at a time;
S11: calculate the TF and IDF of the current word;
S12: once the part-of-speech weight, position weight, TF, and IDF of every word have been obtained, calculate each word's weight W = TF * IDF + part-of-speech weight + position weight, and output the five words with the largest W as keywords.
Further, the parameters covered by the unified formatting in step S1 are the title, tags, abstract, and body text; if a corresponding parameter does not exist, an empty string "" is used.
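A minimal sketch of this unified input form, in Java, is given below. The class name, the field names, and the way the four parts are joined into a single text are illustrative assumptions; the patent only specifies the four parameters and the empty-string default for missing ones.

```java
/** Unified input form of step S1: title, tags, abstract, and body text; missing parts default to "". */
public class FormattedText {
    public final String title;
    public final String tags;
    public final String abstractText;
    public final String body;

    public FormattedText(String title, String tags, String abstractText, String body) {
        // Step S1: a parameter that does not exist is replaced by the empty string "".
        this.title = title == null ? "" : title;
        this.tags = tags == null ? "" : tags;
        this.abstractText = abstractText == null ? "" : abstractText;
        this.body = body == null ? "" : body;
    }

    /** One possible way of handing the formatted text to the NLP pipeline of step S2 (assumed, not specified). */
    public String asSingleText() {
        return String.join("\n", title, tags, abstractText, body);
    }
}
```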
In step S2, the text standardization first sets the annotators contained in the pipeline, selecting four annotators: tokenization, sentence splitting, part-of-speech tagging, and named-entity recognition; these four annotators are necessary for the text processing of the present invention. Second, the packages required by each annotator are loaded and the corresponding parameters are set.
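The following sketch shows one way of configuring these four annotators with the Stanford CoreNLP Java API and of walking the resulting sentence and token sets used in steps S3 to S7. The sample input string is illustrative, and the exact property values and model packages chosen by the inventors are not disclosed in the patent.

```java
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class PipelineSetup {
    public static void main(String[] args) {
        // Step S2: the four annotators - tokenization, sentence splitting,
        // part-of-speech tagging, and named-entity recognition.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Text produced by the formatting of step S1 goes here.");
        pipeline.annotate(document);

        // Step S3/S4: the set Sentences, taken one sentence at a time.
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Step S5/S6: the set Tokens of the current sentence, taken one token at a time.
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);         // step S7: character/word
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);  // step S7: part of speech
                System.out.println(word + "\t" + pos);
            }
        }
    }
}
```

For Chinese input, the Chinese models would additionally have to be on the classpath (for example by constructing the pipeline from the StanfordCoreNLP-chinese.properties file shipped with the Chinese models jar); the patent does not state which language models were used.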
Further, when calculating the total number of characters/words in the current sentence in step S8, words whose part of speech cannot form a keyword are excluded.
Further, regarding step S11: in the traditional algorithm the IDF is obtained by comparing the total number of documents with the number of documents that contain the word, according to the following formula:

IDF_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i.
The IDF values used by the invention are instead computed from a test data set of 10,000 collected news articles: the IDF of every word appearing in them is calculated and stored in a configuration file (covering roughly 270,000 common words). When an IDF is needed, it is read directly from the configuration file, so there is no need to count, across all documents, how many documents contain the word; the amount of computation is relatively small and the method runs fast. For a new or rare word that is not stored in the configuration file, the mean of all stored IDF values is used as its IDF.
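A minimal sketch of this lookup, assuming the precomputed table has already been loaded into an in-memory map; the class and method names are illustrative, and the patent does not specify the configuration-file format or the loading code.

```java
import java.util.Map;

public class IdfLookup {
    private final Map<String, Double> idfTable; // word -> IDF, precomputed from the 10,000-article corpus
    private final double meanIdf;               // mean of all stored IDF values, used as the fallback

    public IdfLookup(Map<String, Double> idfTable) {
        this.idfTable = idfTable;
        this.meanIdf = idfTable.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
    }

    /** Returns the stored IDF, or the mean IDF for a new or rare word that is not in the table. */
    public double idf(String word) {
        return idfTable.getOrDefault(word, meanIdf);
    }
}
```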
Further, each of the five words output as keywords in step S12 has a length between 2 and 6.
With the above technical scheme, the present invention adds a part-of-speech factor to the keyword extraction process, which improves the accuracy of keyword extraction while effectively avoiding the high space complexity that related schemes incur by constructing structures such as a Pat-tree.
Brief Description of the Drawings
The present invention is described in further detail below with reference to the accompanying drawing and specific embodiments.
Fig. 1 is a schematic flow chart of the keyword extraction algorithm based on an improved TF-IDF according to the present invention.
Detailed Description
As shown in Fig. 1, the present invention discloses a keyword extraction algorithm based on an improved TF-IDF, comprising the following steps:
S1: uniformly format the input form of the text;
S2: standardize the text format and load the Properties configuration file for Stanford NLP;
S3: obtain the set Sentences of all sentences in the text according to the sentence separators defined in the Properties configuration file;
S4: take one sentence from the set Sentences at a time;
S5: obtain the set Tokens of all words in the current sentence;
S6: take one token from the set Tokens at a time;
S7: obtain the character/word and the part of speech of the current token, and assign different part-of-speech weights to different parts of speech;
S8: calculate the total number of characters/words in the current sentence and the position percentage of each;
S9: from the preceding operations, obtain the set Words of all characters/words in the text;
S10: take one word from the set Words at a time;
S11: calculate the TF and IDF of the current word;
S12: once the part-of-speech weight, position weight, TF, and IDF of every word have been obtained, calculate each word's weight W = TF * IDF + part-of-speech weight + position weight, and output the five words with the largest W, each with a length between 2 and 6, as keywords (a scoring sketch follows this list).
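The sketch below illustrates the scoring and selection of step S12. The per-word statistics are assumed to have been gathered in steps S4 to S11; the concrete part-of-speech and position weight values are not disclosed in the patent, so they are taken here as plain inputs, and the class and field names are illustrative.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class KeywordScorer {

    /** Per-word statistics gathered in steps S4 to S11. */
    public static class WordStats {
        final String word;
        final double tf;             // term frequency (step S11)
        final double idf;            // inverse document frequency (step S11)
        final double posWeight;      // part-of-speech weight assigned in step S7
        final double positionWeight; // weight derived from the position percentage of step S8

        public WordStats(String word, double tf, double idf, double posWeight, double positionWeight) {
            this.word = word;
            this.tf = tf;
            this.idf = idf;
            this.posWeight = posWeight;
            this.positionWeight = positionWeight;
        }

        /** Step S12: W = TF * IDF + part-of-speech weight + position weight. */
        double weight() {
            return tf * idf + posWeight + positionWeight;
        }
    }

    /** Step S12: the five highest-weighted words whose length is between 2 and 6 (inclusive) are output as keywords. */
    public static List<String> topKeywords(List<WordStats> words) {
        return words.stream()
                .filter(w -> w.word.length() >= 2 && w.word.length() <= 6)
                .sorted(Comparator.comparingDouble(WordStats::weight).reversed())
                .limit(5)
                .map(w -> w.word)
                .collect(Collectors.toList());
    }
}
```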
Further, the parameters covered by the unified formatting in step S1 are the title, tags, abstract, and body text; if a corresponding parameter does not exist, an empty string "" is used.
In step S2, the text standardization first sets the annotators contained in the pipeline, selecting four annotators: tokenization, sentence splitting, part-of-speech tagging, and named-entity recognition; these four annotators are necessary for the text processing of the present invention. Second, the packages required by each annotator are loaded and the corresponding parameters are set.
Further, when calculating the total number of characters/words in the current sentence in step S8, stop words are removed. The text is not only filtered for stop words; parts of speech that cannot form keywords, such as function words, are filtered out at the same time.
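A minimal sketch of this filtering is given below. The stop-word list and the excluded part-of-speech tags are illustrative placeholders only (the patent does not enumerate either), and the helper for the position percentage of step S8 assumes a simple rank-over-count definition that the patent likewise does not spell out.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TokenFilter {

    // Placeholder stop words and excluded POS tags; the real lists are not given in the patent.
    private static final Set<String> STOP_WORDS = Set.of("的", "了", "和");
    private static final Set<String> EXCLUDED_POS = Set.of("DEG", "AS", "SP"); // e.g. particle-like tags

    public static class TaggedToken {
        final String word;
        final String pos;
        public TaggedToken(String word, String pos) { this.word = word; this.pos = pos; }
    }

    /** Step S8 counts only tokens that are neither stop words nor of a part of speech that cannot be a keyword. */
    public static List<TaggedToken> keepCandidates(List<TaggedToken> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t.word))
                .filter(t -> !EXCLUDED_POS.contains(t.pos))
                .collect(Collectors.toList());
    }

    /** Assumed definition of the position percentage of the i-th remaining token (0-based) among n remaining tokens. */
    public static double positionPercentage(int index, int total) {
        return total == 0 ? 0.0 : (double) (index + 1) / total;
    }
}
```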
Further, regarding step S11: in the traditional algorithm the IDF is obtained by comparing the total number of documents with the number of documents that contain the word, according to the following formula:

IDF_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i. If the word does not appear in any document, the denominator becomes zero; to avoid this, 1 + |{j : t_i ∈ d_j}| is generally used as the denominator. In the present invention, however, the IDF values are computed from a test data set of 10,000 collected news articles: the IDF of every word appearing in them is calculated and stored in a configuration file (covering roughly 270,000 common words). When an IDF is needed, it is read directly from the configuration file, so there is no need to count, across all documents, how many documents contain the word; the amount of computation is relatively small and the method runs fast. For a new or rare word that is not stored in the configuration file, the mean of all stored IDF values is used as its IDF.
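The sketch below shows how such a table could be precomputed offline over a document collection with the traditional formula and the smoothed denominator described above. The natural logarithm and the in-memory representation of the corpus are assumptions; the patent specifies neither.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CorpusIdf {

    /**
     * Traditional IDF over a corpus: idf(t) = log(|D| / (1 + |{j : t in d_j}|)),
     * with 1 added to the denominator so that a word absent from every document
     * cannot cause division by zero. Each document is given as its list of tokens.
     */
    public static Map<String, Double> compute(List<List<String>> documents) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : documents) {
            Set<String> unique = new HashSet<>(doc); // count each word at most once per document
            for (String term : unique) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int totalDocs = documents.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            idf.put(entry.getKey(), Math.log((double) totalDocs / (1 + entry.getValue())));
        }
        return idf; // this map could then be written to the configuration file read at extraction time
    }
}
```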
With the above technical scheme, the present invention adds a part-of-speech factor to the keyword extraction process, which improves the accuracy of keyword extraction while effectively avoiding the high space complexity that related schemes incur by constructing structures such as a Pat-tree.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710369600.9A | 2017-05-23 | 2017-05-23 | Keyword extraction algorithm based on improved TF-IDF |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710369600.9A | 2017-05-23 | 2017-05-23 | Keyword extraction algorithm based on improved TF-IDF |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN107145476A | 2017-09-08 |
Family
ID=59779033
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710369600.9A (pending) | Keyword extraction algorithm based on improved TF-IDF | 2017-05-23 | 2017-05-23 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN107145476A (en) |
- 2017-05-23: Application CN201710369600.9A filed in China; published as CN107145476A (status: Pending)
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7440947B2 (en) * | 2004-11-12 | 2008-10-21 | Fuji Xerox Co., Ltd. | System and method for identifying query-relevant keywords in documents with latent semantic analysis |
| CN103064969A (en) * | 2012-12-31 | 2013-04-24 | 武汉传神信息技术有限公司 | Method for automatically creating keyword index table |
| CN103885934A (en) * | 2014-02-19 | 2014-06-25 | 中国专利信息中心 | Method for automatically extracting key phrases of patent documents |
| CN105740229A (en) * | 2016-01-26 | 2016-07-06 | 中国人民解放军国防科学技术大学 | Keyword extraction method and device |
| CN105808524A (en) * | 2016-03-11 | 2016-07-27 | 江苏畅远信息科技有限公司 | Patent document abstract-based automatic patent classification method |
Non-Patent Citations (1)
| Title |
|---|
| Zhu Zhenyuan, "Research on Multi-Document Automatic Summarization of Online Book Reviews Based on Information Classification", China Master's Theses Full-text Database (Information Science and Technology) * |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107885793A (en) * | 2017-10-20 | 2018-04-06 | 江苏大学 | A kind of hot microblog topic analyzing and predicting method and system |
| CN108170666A (en) * | 2017-11-29 | 2018-06-15 | 同济大学 | A kind of improved method based on TF-IDF keyword extractions |
| CN109145097A (en) * | 2018-06-11 | 2019-01-04 | 人民法院信息技术服务中心 | A kind of judgement document's classification method based on information extraction |
| CN112905771A (en) * | 2021-02-10 | 2021-06-04 | 北京邮电大学 | Characteristic keyword extraction method based on part of speech and position |
| CN113591572A (en) * | 2021-06-29 | 2021-11-02 | 福建师范大学 | Water and soil loss quantitative monitoring method based on multi-source data and multi-temporal data |
| CN113591572B (en) * | 2021-06-29 | 2023-08-15 | 福建师范大学 | Quantitative Monitoring Method of Water and Soil Erosion Based on Multi-source Data and Multi-temporal Data |
| CN113743090A (en) * | 2021-09-08 | 2021-12-03 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
| CN113743090B (en) * | 2021-09-08 | 2024-04-12 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
| CN114328865A (en) * | 2021-12-14 | 2022-04-12 | 南京航空航天大学 | Improved TextRank multi-feature fusion education resource keyword extraction method |
Similar Documents
| Publication | Title |
|---|---|
| CN107145476A (en) | Keyword extraction algorithm based on improved TF-IDF |
| CN111104794B (en) | Text similarity matching method based on subject term |
| US10042923B2 (en) | Topic extraction using clause segmentation and high-frequency words |
| US9460195B1 (en) | System and methods for determining term importance, search relevance, and content summarization |
| CN103761264B (en) | Concept hierarchy establishing method based on product review document set |
| CN103473263B (en) | News event development process-oriented visual display method |
| CN109783787A (en) | A kind of generation method of structured document, device and storage medium |
| CN104281702B (en) | Data retrieval method and device based on electric power critical word participle |
| WO2019174132A1 (en) | Data processing method, server and computer storage medium |
| CN105740229B (en) | The method and device of keyword extraction |
| CN107239439A (en) | Public sentiment sentiment classification method based on word2vec |
| CN109918556B (en) | A Depressive Emotion Recognition Method Based on Integrated Weibo Users' Social Relationships and Text Features |
| CN106407182A (en) | A method for automatic abstracting for electronic official documents of enterprises |
| CN107562717A (en) | A kind of text key word abstracting method being combined based on Word2Vec with Term co-occurrence |
| WO2020151218A1 (en) | Method and apparatus for generating specialised electric power word bank, and storage medium |
| CN104298714B (en) | A kind of mass text automatic marking method based on abnormality processing |
| CN103617158A (en) | Method for generating emotion abstract of dialogue text |
| WO2023274047A1 (en) | Standard knowledge graph construction and standard query method and apparatus |
| CN107203520A (en) | The method for building up of hotel's sentiment dictionary, the sentiment analysis method and system of comment |
| CN110083832A (en) | Recognition methods, device, equipment and the readable storage medium storing program for executing of article reprinting relationship |
| CN104778201A (en) | Multi-query result combination-based prior art retrieval method |
| CN106445914B (en) | Construction method and construction device of microblog emotion classifier |
| CN102722526B (en) | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method |
| CN113656592B (en) | Data processing method, device, electronic device and medium based on knowledge graph |
| CN118377853B (en) | Paper topic selection assistance method, system, medium and equipment based on large language model |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170908 |