CN107145476A - Keyword extraction algorithm based on improved TF-IDF - Google Patents
Keyword extraction algorithm based on improved TF-IDF
- Publication number
- CN107145476A (application CN201710369600.9A)
- Authority
- CN
- China
- Prior art keywords
- words
- word
- idf
- speech
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a keyword extraction algorithm based on an improved TF-IDF, comprising the following steps. S1: uniformly format the input form of the text; S2: load the configuration file for Stanford NLP; S3: obtain the set Sentences of all sentences in the text from the configuration file; S4: take one sentence from Sentences at a time; S5: obtain the set Tokens of all words in the current sentence; S6: take one token from Tokens at a time; S7: obtain the character/word and part of speech of the current token, and assign different part-of-speech weights; S8: calculate the total number of characters/words in the current sentence and their position percentages; S9: obtain the set Words of all characters/words in the text; S10: take one word from Words at a time; S11: calculate the TF and IDF of the current word; S12: calculate the weight W of every word and select keywords according to W. By adding a part-of-speech factor, the invention improves extraction accuracy and avoids the high space complexity of constructing structures such as a Pat-tree.
Description
Technical Field
The present invention relates to a keyword extraction algorithm based on an improved TF-IDF.
Background Art
In existing TF-IDF-based keyword extraction methods, the main drawback is that measuring the importance of a word by term frequency alone is not comprehensive: important words sometimes occur only a few times. Moreover, such algorithms cannot reflect part-of-speech information, so a noun and a particle are given the same importance, which is clearly unreasonable. Other related techniques improve extraction accuracy, but at the cost of higher space complexity, and therefore fail to meet practical requirements.
Summary of the Invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a keyword extraction algorithm based on an improved TF-IDF.
The technical scheme adopted by the present invention is as follows:
A keyword extraction algorithm based on an improved TF-IDF, comprising the following steps:
S1: uniformly format the input form of the text;
S2: standardize the text format and load the Properties configuration file for Stanford NLP;
S3: obtain the set Sentences of all sentences in the text according to the sentence separators defined in the Properties configuration file;
S4: take one sentence from the set Sentences at a time;
S5: obtain the set Tokens of all words in the current sentence;
S6: take one token from the set Tokens at a time;
S7: obtain the character/word and the part of speech of the current token, and assign different part-of-speech weights to different parts of speech;
S8: calculate the total number of characters/words in the current sentence and the position percentage of each;
S9: from the preceding operations, obtain the set Words of all characters/words in the text;
S10: take one word from the set Words at a time;
S11: calculate the TF and IDF of the current word;
S12: once the part-of-speech weight, position weight, TF, and IDF of every word have been obtained, calculate each word's weight W = TF * IDF + part-of-speech weight + position weight, and output the five words with the largest W as keywords.
Further, the parameters covered by the unified formatting in step S1 are the title, tags, abstract, and body text; if a corresponding parameter does not exist, an empty string "" is used.
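A minimal sketch of this unified input form, in Java, is given below. The class name, the field names, and the way the four parts are joined into a single text are illustrative assumptions; the patent only specifies the four parameters and the empty-string default for missing ones.

```java
/** Unified input form of step S1: title, tags, abstract, and body text; missing parts default to "". */
public class FormattedText {
    public final String title;
    public final String tags;
    public final String abstractText;
    public final String body;

    public FormattedText(String title, String tags, String abstractText, String body) {
        // Step S1: a parameter that does not exist is replaced by the empty string "".
        this.title = title == null ? "" : title;
        this.tags = tags == null ? "" : tags;
        this.abstractText = abstractText == null ? "" : abstractText;
        this.body = body == null ? "" : body;
    }

    /** One possible way of handing the formatted text to the NLP pipeline of step S2 (assumed, not specified). */
    public String asSingleText() {
        return String.join("\n", title, tags, abstractText, body);
    }
}
```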
In step S2, the text standardization first sets the annotators contained in the pipeline, selecting four annotators: tokenization, sentence splitting, part-of-speech tagging, and named-entity recognition; these four annotators are necessary for the text processing of the present invention. Second, the packages required by each annotator are loaded and the corresponding parameters are set.
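The following sketch shows one way of configuring these four annotators with the Stanford CoreNLP Java API and of walking the resulting sentence and token sets used in steps S3 to S7. The sample input string is illustrative, and the exact property values and model packages chosen by the inventors are not disclosed in the patent.

```java
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class PipelineSetup {
    public static void main(String[] args) {
        // Step S2: the four annotators - tokenization, sentence splitting,
        // part-of-speech tagging, and named-entity recognition.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Text produced by the formatting of step S1 goes here.");
        pipeline.annotate(document);

        // Step S3/S4: the set Sentences, taken one sentence at a time.
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Step S5/S6: the set Tokens of the current sentence, taken one token at a time.
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);         // step S7: character/word
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);  // step S7: part of speech
                System.out.println(word + "\t" + pos);
            }
        }
    }
}
```

For Chinese input, the Chinese models would additionally have to be on the classpath (for example by constructing the pipeline from the StanfordCoreNLP-chinese.properties file shipped with the Chinese models jar); the patent does not state which language models were used.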
Further, when calculating the total number of characters/words in the current sentence in step S8, words whose part of speech cannot form a keyword are excluded.
Further, regarding step S11: in the traditional algorithm the IDF is obtained by comparing the total number of documents with the number of documents that contain the word, according to the following formula:

IDF_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i.
The IDF values used by the invention are instead computed from a test data set of 10,000 collected news articles: the IDF of every word appearing in them is calculated and stored in a configuration file (covering roughly 270,000 common words). When an IDF is needed, it is read directly from the configuration file, so there is no need to count, across all documents, how many documents contain the word; the amount of computation is relatively small and the method runs fast. For a new or rare word that is not stored in the configuration file, the mean of all stored IDF values is used as its IDF.
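A minimal sketch of this lookup, assuming the precomputed table has already been loaded into an in-memory map; the class and method names are illustrative, and the patent does not specify the configuration-file format or the loading code.

```java
import java.util.Map;

public class IdfLookup {
    private final Map<String, Double> idfTable; // word -> IDF, precomputed from the 10,000-article corpus
    private final double meanIdf;               // mean of all stored IDF values, used as the fallback

    public IdfLookup(Map<String, Double> idfTable) {
        this.idfTable = idfTable;
        this.meanIdf = idfTable.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
    }

    /** Returns the stored IDF, or the mean IDF for a new or rare word that is not in the table. */
    public double idf(String word) {
        return idfTable.getOrDefault(word, meanIdf);
    }
}
```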
Further, each of the five words output as keywords in step S12 has a length between 2 and 6.
With the above technical scheme, the present invention adds a part-of-speech factor to the keyword extraction process, which improves the accuracy of keyword extraction while effectively avoiding the high space complexity that related schemes incur by constructing structures such as a Pat-tree.
Brief Description of the Drawings
The present invention is described in further detail below with reference to the accompanying drawing and specific embodiments.
Fig. 1 is a schematic flow chart of the keyword extraction algorithm based on an improved TF-IDF according to the present invention.
Detailed Description
As shown in Fig. 1, the present invention discloses a keyword extraction algorithm based on an improved TF-IDF, comprising the following steps:
S1: uniformly format the input form of the text;
S2: standardize the text format and load the Properties configuration file for Stanford NLP;
S3: obtain the set Sentences of all sentences in the text according to the sentence separators defined in the Properties configuration file;
S4: take one sentence from the set Sentences at a time;
S5: obtain the set Tokens of all words in the current sentence;
S6: take one token from the set Tokens at a time;
S7: obtain the character/word and the part of speech of the current token, and assign different part-of-speech weights to different parts of speech;
S8: calculate the total number of characters/words in the current sentence and the position percentage of each;
S9: from the preceding operations, obtain the set Words of all characters/words in the text;
S10: take one word from the set Words at a time;
S11: calculate the TF and IDF of the current word;
S12: once the part-of-speech weight, position weight, TF, and IDF of every word have been obtained, calculate each word's weight W = TF * IDF + part-of-speech weight + position weight, and output the five words with the largest W, each with a length between 2 and 6, as keywords (a scoring sketch follows this list).
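The sketch below illustrates the scoring and selection of step S12. The per-word statistics are assumed to have been gathered in steps S4 to S11; the concrete part-of-speech and position weight values are not disclosed in the patent, so they are taken here as plain inputs, and the class and field names are illustrative.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class KeywordScorer {

    /** Per-word statistics gathered in steps S4 to S11. */
    public static class WordStats {
        final String word;
        final double tf;             // term frequency (step S11)
        final double idf;            // inverse document frequency (step S11)
        final double posWeight;      // part-of-speech weight assigned in step S7
        final double positionWeight; // weight derived from the position percentage of step S8

        public WordStats(String word, double tf, double idf, double posWeight, double positionWeight) {
            this.word = word;
            this.tf = tf;
            this.idf = idf;
            this.posWeight = posWeight;
            this.positionWeight = positionWeight;
        }

        /** Step S12: W = TF * IDF + part-of-speech weight + position weight. */
        double weight() {
            return tf * idf + posWeight + positionWeight;
        }
    }

    /** Step S12: the five highest-weighted words whose length is between 2 and 6 (inclusive) are output as keywords. */
    public static List<String> topKeywords(List<WordStats> words) {
        return words.stream()
                .filter(w -> w.word.length() >= 2 && w.word.length() <= 6)
                .sorted(Comparator.comparingDouble(WordStats::weight).reversed())
                .limit(5)
                .map(w -> w.word)
                .collect(Collectors.toList());
    }
}
```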
Further, the parameters covered by the unified formatting in step S1 are the title, tags, abstract, and body text; if a corresponding parameter does not exist, an empty string "" is used.
In step S2, the text standardization first sets the annotators contained in the pipeline, selecting four annotators: tokenization, sentence splitting, part-of-speech tagging, and named-entity recognition; these four annotators are necessary for the text processing of the present invention. Second, the packages required by each annotator are loaded and the corresponding parameters are set.
Further, when calculating the total number of characters/words in the current sentence in step S8, stop words are removed. The text is not only filtered for stop words; parts of speech that cannot form keywords, such as function words, are filtered out at the same time.
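A minimal sketch of this filtering is given below. The stop-word list and the excluded part-of-speech tags are illustrative placeholders only (the patent does not enumerate either), and the helper for the position percentage of step S8 assumes a simple rank-over-count definition that the patent likewise does not spell out.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TokenFilter {

    // Placeholder stop words and excluded POS tags; the real lists are not given in the patent.
    private static final Set<String> STOP_WORDS = Set.of("的", "了", "和");
    private static final Set<String> EXCLUDED_POS = Set.of("DEG", "AS", "SP"); // e.g. particle-like tags

    public static class TaggedToken {
        final String word;
        final String pos;
        public TaggedToken(String word, String pos) { this.word = word; this.pos = pos; }
    }

    /** Step S8 counts only tokens that are neither stop words nor of a part of speech that cannot be a keyword. */
    public static List<TaggedToken> keepCandidates(List<TaggedToken> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t.word))
                .filter(t -> !EXCLUDED_POS.contains(t.pos))
                .collect(Collectors.toList());
    }

    /** Assumed definition of the position percentage of the i-th remaining token (0-based) among n remaining tokens. */
    public static double positionPercentage(int index, int total) {
        return total == 0 ? 0.0 : (double) (index + 1) / total;
    }
}
```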
Further, regarding step S11: in the traditional algorithm the IDF is obtained by comparing the total number of documents with the number of documents that contain the word, according to the following formula:

IDF_i = log( |D| / |{j : t_i ∈ d_j}| )

where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i. If the word does not appear in any document, the denominator becomes zero; to avoid this, 1 + |{j : t_i ∈ d_j}| is generally used as the denominator. In the present invention, however, the IDF values are computed from a test data set of 10,000 collected news articles: the IDF of every word appearing in them is calculated and stored in a configuration file (covering roughly 270,000 common words). When an IDF is needed, it is read directly from the configuration file, so there is no need to count, across all documents, how many documents contain the word; the amount of computation is relatively small and the method runs fast. For a new or rare word that is not stored in the configuration file, the mean of all stored IDF values is used as its IDF.
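The sketch below shows how such a table could be precomputed offline over a document collection with the traditional formula and the smoothed denominator described above. The natural logarithm and the in-memory representation of the corpus are assumptions; the patent specifies neither.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CorpusIdf {

    /**
     * Traditional IDF over a corpus: idf(t) = log(|D| / (1 + |{j : t in d_j}|)),
     * with 1 added to the denominator so that a word absent from every document
     * cannot cause division by zero. Each document is given as its list of tokens.
     */
    public static Map<String, Double> compute(List<List<String>> documents) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : documents) {
            Set<String> unique = new HashSet<>(doc); // count each word at most once per document
            for (String term : unique) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int totalDocs = documents.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            idf.put(entry.getKey(), Math.log((double) totalDocs / (1 + entry.getValue())));
        }
        return idf; // this map could then be written to the configuration file read at extraction time
    }
}
```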
With the above technical scheme, the present invention adds a part-of-speech factor to the keyword extraction process, which improves the accuracy of keyword extraction while effectively avoiding the high space complexity that related schemes incur by constructing structures such as a Pat-tree.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710369600.9A | 2017-05-23 | 2017-05-23 | Keyword extraction algorithm based on improved TF-IDF |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710369600.9A | 2017-05-23 | 2017-05-23 | Keyword extraction algorithm based on improved TF-IDF |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN107145476A | 2017-09-08 |
Family
ID=59779033
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710369600.9A (pending) | Keyword extraction algorithm based on improved TF-IDF | 2017-05-23 | 2017-05-23 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN107145476A (en) |
- 2017-05-23: Application CN201710369600.9A filed in China; published as CN107145476A (status: Pending)
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7440947B2 (en) * | 2004-11-12 | 2008-10-21 | Fuji Xerox Co., Ltd. | System and method for identifying query-relevant keywords in documents with latent semantic analysis |
| CN103064969A (en) * | 2012-12-31 | 2013-04-24 | 武汉传神信息技术有限公司 | Method for automatically creating keyword index table |
| CN103885934A (en) * | 2014-02-19 | 2014-06-25 | 中国专利信息中心 | Method for automatically extracting key phrases of patent documents |
| CN105740229A (en) * | 2016-01-26 | 2016-07-06 | 中国人民解放军国防科学技术大学 | Keyword extraction method and device |
| CN105808524A (en) * | 2016-03-11 | 2016-07-27 | 江苏畅远信息科技有限公司 | Patent document abstract-based automatic patent classification method |
Non-Patent Citations (1)
| Title |
|---|
| Zhu Zhenyuan, "Research on Multi-Document Automatic Summarization of Online Book Reviews Based on Information Classification", China Master's Theses Full-text Database (Information Science and Technology) * |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107885793A (en) * | 2017-10-20 | 2018-04-06 | 江苏大学 | A kind of hot microblog topic analyzing and predicting method and system |
| CN108170666A (en) * | 2017-11-29 | 2018-06-15 | 同济大学 | A kind of improved method based on TF-IDF keyword extractions |
| CN109145097A (en) * | 2018-06-11 | 2019-01-04 | 人民法院信息技术服务中心 | A kind of judgement document's classification method based on information extraction |
| CN112905771A (en) * | 2021-02-10 | 2021-06-04 | 北京邮电大学 | Characteristic keyword extraction method based on part of speech and position |
| CN113591572A (en) * | 2021-06-29 | 2021-11-02 | 福建师范大学 | Water and soil loss quantitative monitoring method based on multi-source data and multi-temporal data |
| CN113591572B (en) * | 2021-06-29 | 2023-08-15 | 福建师范大学 | Quantitative Monitoring Method of Water and Soil Erosion Based on Multi-source Data and Multi-temporal Data |
| CN113743090A (en) * | 2021-09-08 | 2021-12-03 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
| CN113743090B (en) * | 2021-09-08 | 2024-04-12 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
| CN114328865A (en) * | 2021-12-14 | 2022-04-12 | 南京航空航天大学 | Improved TextRank multi-feature fusion education resource keyword extraction method |
Similar Documents
| Publication | Title |
|---|---|
| CN107145476A (en) | Keyword extraction algorithm based on improved TF-IDF |
| CN111104794B (en) | Text similarity matching method based on subject term |
| US10042923B2 (en) | Topic extraction using clause segmentation and high-frequency words |
| US9460195B1 (en) | System and methods for determining term importance, search relevance, and content summarization |
| CN103761264B (en) | Concept hierarchy establishing method based on product review document set |
| CN103473263B (en) | News event development process-oriented visual display method |
| CN109783787A (en) | A kind of generation method of structured document, device and storage medium |
| CN104281702B (en) | Data retrieval method and device based on electric power critical word participle |
| WO2019174132A1 (en) | Data processing method, server and computer storage medium |
| CN105740229B (en) | The method and device of keyword extraction |
| CN107239439A (en) | Public sentiment sentiment classification method based on word2vec |
| CN109918556B (en) | A Depressive Emotion Recognition Method Based on Integrated Weibo Users' Social Relationships and Text Features |
| CN106407182A (en) | A method for automatic abstracting for electronic official documents of enterprises |
| CN107562717A (en) | A kind of text key word abstracting method being combined based on Word2Vec with Term co-occurrence |
| WO2020151218A1 (en) | Method and apparatus for generating specialised electric power word bank, and storage medium |
| CN104298714B (en) | A kind of mass text automatic marking method based on abnormality processing |
| CN103617158A (en) | Method for generating emotion abstract of dialogue text |
| WO2023274047A1 (en) | Standard knowledge graph construction and standard query method and apparatus |
| CN107203520A (en) | The method for building up of hotel's sentiment dictionary, the sentiment analysis method and system of comment |
| CN110083832A (en) | Recognition methods, device, equipment and the readable storage medium storing program for executing of article reprinting relationship |
| CN104778201A (en) | Multi-query result combination-based prior art retrieval method |
| CN106445914B (en) | Construction method and construction device of microblog emotion classifier |
| CN102722526B (en) | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method |
| CN113656592B (en) | Data processing method, device, electronic device and medium based on knowledge graph |
| CN118377853B (en) | Paper topic selection assistance method, system, medium and equipment based on large language model |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170908 |