CN107145476A - A keyword extraction algorithm based on improved TF-IDF - Google Patents

A keyword extraction algorithm based on improved TF-IDF

Info

Publication number
CN107145476A
Authority
CN
China
Prior art keywords
words
word
idf
speech
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710369600.9A
Other languages
Chinese (zh)
Inventor
金彪
方敏霞
沙晋明
熊金波
李璇
林劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN201710369600.9A
Publication of CN107145476A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword extraction algorithm based on improved TF-IDF, comprising the following steps: S1: format the text input into a uniform form; S2: load the configuration file for Stanford NLP; S3: obtain the set Sentences of all sentences in the text according to the configuration file; S4: take one sentence at a time from Sentences; S5: obtain the set Tokens of all tokens in the current sentence; S6: take one token at a time from Tokens; S7: obtain the character/word and part of speech of the current token, and assign a weight that depends on the part of speech; S8: compute the total number of characters/words in the current sentence and the position percentage of each; S9: obtain the set Words of all characters/words in the text; S10: take one word at a time from Words; S11: compute the TF and IDF of the current word; S12: compute the weight W of every word and select keywords according to W. The invention adds a part-of-speech factor, which improves extraction accuracy and avoids the high space complexity of approaches that must construct structures such as a Pat-tree.

Description

A Keyword Extraction Algorithm Based on Improved TF-IDF

Technical Field

The invention relates to a keyword extraction algorithm based on improved TF-IDF.

Background

In existing TF-IDF-based keyword extraction methods, the main shortcoming is that measuring a word's importance by term frequency alone is not comprehensive: an important word may not occur many times. Moreover, such algorithms cannot reflect the part-of-speech information of words. A noun and a particle, for example, are given the same importance, which is clearly unreasonable. Other related techniques achieve higher keyword extraction accuracy, but at the cost of higher space complexity, so practical requirements are still not met.

Summary of the Invention

The purpose of the invention is to overcome the deficiencies of the prior art by providing a keyword extraction algorithm based on improved TF-IDF.

The technical scheme adopted by the invention is as follows:

A keyword extraction algorithm based on improved TF-IDF, comprising the following steps:

S1: Format the text input into a uniform form;

S2: Normalize the text format and load the Properties configuration file for Stanford NLP;

S3: Obtain the set Sentences of all sentences in the text, split according to the sentence delimiters defined in the Properties configuration file;

S4: Take one sentence at a time from the set Sentences;

S5: Obtain the set Tokens of all tokens in the current sentence;

S6: Take one token at a time from the set Tokens;

S7: Obtain the character/word and part of speech of the current token, and assign a part-of-speech weight that differs by part of speech;

S8: Compute the total number of characters/words in the current sentence and the position percentage of each;

S9: The preceding operations yield the set Words of all characters/words in the text;

S10: Take one word at a time from the set Words;

S11: Compute the TF and IDF of the current word;

S12: Once the part-of-speech weight, position weight, TF, and IDF of every word have been obtained, compute each word's weight W = TF*IDF + part-of-speech weight + position weight, and output the 5 words with the largest W as keywords.
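The weight formula in S12 can be illustrated with a minimal sketch. The function name `word_weight` and the input numbers are assumptions made for illustration; the patent does not state concrete part-of-speech or position weight values.

```python
def word_weight(tf: float, idf: float, pos_weight: float, position_weight: float) -> float:
    """W = TF * IDF + part-of-speech weight + position weight (step S12)."""
    return tf * idf + pos_weight + position_weight


# Hypothetical inputs, chosen only to show the arithmetic.
w = word_weight(tf=0.5, idf=4.0, pos_weight=0.5, position_weight=0.25)
print(w)  # 2.75
```

Because the two extra weights are added rather than multiplied, a word with a low TF*IDF score can still rank highly if its part of speech and position are favorable.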

Further, in step S1 the uniform format comprises four parameters: title, tags, abstract, and body. If a parameter is absent, the empty string "" is supplied for it.
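One way to picture the uniform input form of S1 is a record with four string fields that default to the empty string. This is a sketch under assumptions: the names `Document` and `normalize_input` and the dictionary keys are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Uniform input form from step S1; a missing field is the empty string."""
    title: str = ""
    tags: str = ""
    abstract: str = ""
    body: str = ""


def normalize_input(raw: dict) -> Document:
    # Any absent (or None) field falls back to "" as the text describes.
    return Document(
        title=raw.get("title") or "",
        tags=raw.get("tags") or "",
        abstract=raw.get("abstract") or "",
        body=raw.get("body") or "",
    )


doc = normalize_input({"title": "TF-IDF keyword extraction", "body": "Some text."})
print(repr(doc.abstract))  # ''
```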

In step S2, normalizing the text format first sets the annotators included in the pipeline, selecting four: word segmentation, sentence splitting, part-of-speech tagging, and named-entity recognition. These four annotators are required for the text processing of the invention. Second, the packages each annotator needs are loaded and the corresponding parameters are set.
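The configuration described for S2 might look roughly like the properties rendered below. This is a sketch, not the patent's actual file: the annotator names `tokenize`, `ssplit`, `pos`, and `ner` are the Stanford CoreNLP names assumed to correspond to the four annotators, the delimiter pattern is a hypothetical value, and model paths are omitted because the patent does not give them.

```python
# Assumption: these CoreNLP annotator names match the four annotators in the text.
pipeline_properties = {
    "annotators": "tokenize, ssplit, pos, ner",
    # Hypothetical sentence delimiters; step S3 splits on whatever is defined here.
    "ssplit.boundaryTokenRegex": "[.。]|[!?！？]+",
}


def to_properties_text(props: dict) -> str:
    """Render the settings as 'key = value' lines, one per property."""
    return "\n".join(f"{key} = {value}" for key, value in props.items())


print(to_properties_text(pipeline_properties))
```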

Further, when the total number of characters/words in the current sentence is computed in step S8, words whose part of speech cannot be a keyword are excluded.
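The counting in S8 can be sketched as follows. Two points are assumptions, since the patent gives no formula: the position percentage is taken to be the 1-based index of a word divided by the sentence total, and the excluded tag set `EXCLUDED_POS` is a hypothetical choice of Penn Chinese Treebank tags.

```python
# Hypothetical tags for parts of speech that cannot be keywords.
EXCLUDED_POS = {"PU", "DEG", "AD", "P"}


def position_percentages(tagged_sentence):
    """Return (total, [(word, position percentage), ...]) for one sentence.

    tagged_sentence is a list of (word, pos_tag) pairs. Words with an
    excluded tag are dropped before counting, so they do not affect the total.
    """
    kept = [word for word, tag in tagged_sentence if tag not in EXCLUDED_POS]
    total = len(kept)
    # Assumption: position percentage = 1-based index / total.
    return total, [(word, (i + 1) / total) for i, word in enumerate(kept)]


sentence = [("关键词", "NN"), ("的", "DEG"), ("提取", "NN"), ("算法", "NN"), ("。", "PU")]
total, positions = position_percentages(sentence)
print(total)  # 3
```

How a position percentage is turned into the position weight used in S12 is not specified in the text, so the sketch stops at the percentage.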

Further, regarding step S11: in the traditional algorithm, IDF is obtained by comparing the total number of documents with the number of documents that contain the word, as follows:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i.

In the invention, IDF values are precomputed: 10,000 collected news articles serve as the test data set, the IDF of every word they contain is computed, and the results are stored in a configuration file (holding the IDF of about 270,000 common words in total). At run time, the IDF of a word is read directly from the configuration file, with no need to count the documents in which the word occurs, so the computation is light and fast. For a new or rare word that is not stored in the configuration file, the mean of all stored IDF values is used as its IDF.
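The lookup with a mean fallback can be sketched as below. The name `make_idf_lookup` and the three-entry table are hypothetical; the file described in the text holds about 270,000 entries, and its on-disk format is not given.

```python
def make_idf_lookup(idf_table: dict):
    """Return a function that reads IDF from a precomputed table.

    Words missing from the table get the mean of all stored IDF values.
    """
    mean_idf = sum(idf_table.values()) / len(idf_table)

    def lookup(word: str) -> float:
        return idf_table.get(word, mean_idf)

    return lookup


# Hypothetical table standing in for the stored configuration file.
idf = make_idf_lookup({"算法": 2.0, "关键词": 4.0, "文本": 3.0})
print(idf("算法"))  # 2.0
print(idf("新词"))  # 3.0, the mean of 2.0, 4.0 and 3.0
```

Computing the mean once when the table is loaded keeps each lookup a single dictionary access, which matches the stated goal of avoiding per-query document counting.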

Further, each of the top 5 words selected as keywords in step S12 has a length between 2 and 6.
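Selection of the final keywords might be sketched like this. It assumes that length means character count and that the length filter is applied before the top 5 are taken; the text does not say which comes first. The weights are made-up values.

```python
def select_keywords(weights: dict, top_n: int = 5, min_len: int = 2, max_len: int = 6):
    """Return up to top_n words with the largest W whose length is in [min_len, max_len]."""
    candidates = [w for w in weights if min_len <= len(w) <= max_len]
    return sorted(candidates, key=lambda w: weights[w], reverse=True)[:top_n]


# Hypothetical weights W for seven candidate words.
weights = {"算法": 3.2, "的": 9.9, "关键词": 4.1, "提取": 2.5, "文本": 1.0, "权重": 0.8, "词性": 0.5}
keywords = select_keywords(weights)
print(keywords)  # ['关键词', '算法', '提取', '文本', '权重']
```

The single-character word is dropped despite having the largest weight, which is the effect the length constraint is meant to have.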

By adopting the above technical scheme, the invention adds a part-of-speech factor to the keyword extraction process. This improves extraction accuracy while avoiding the high space complexity that related schemes incur by constructing structures such as a Pat-tree.

Brief Description of the Drawings

The invention is described in further detail below with reference to the accompanying drawing and specific embodiments.

Fig. 1 is a flow chart of the keyword extraction algorithm based on improved TF-IDF according to the invention.

Detailed Description

As shown in Fig. 1, the invention discloses a keyword extraction algorithm based on improved TF-IDF, comprising the following steps:

S1: Format the text input into a uniform form;

S2: Normalize the text format and load the Properties configuration file for Stanford NLP;

S3: Obtain the set Sentences of all sentences in the text, split according to the sentence delimiters defined in the Properties configuration file;

S4: Take one sentence at a time from the set Sentences;

S5: Obtain the set Tokens of all tokens in the current sentence;

S6: Take one token at a time from the set Tokens;

S7: Obtain the character/word and part of speech of the current token, and assign a part-of-speech weight that differs by part of speech;

S8: Compute the total number of characters/words in the current sentence and the position percentage of each;

S9: The preceding operations yield the set Words of all characters/words in the text;

S10: Take one word at a time from the set Words;

S11: Compute the TF and IDF of the current word;

S12: Once the part-of-speech weight, position weight, TF, and IDF of every word have been obtained, compute each word's weight W = TF*IDF + part-of-speech weight + position weight, and output as keywords the 5 words with the largest W, each with a length between 2 and 6.
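The nested iteration in S4 to S11 can be sketched as a single pass over pre-tagged sentences. This is an illustration under assumptions: the input stands in for the annotated output of the NLP pipeline, `collect_words` is a hypothetical name, and TF is taken to be a word's count divided by the total token count, which the patent does not spell out.

```python
from collections import Counter


def collect_words(tagged_sentences):
    """Walk sentences (S4) and tokens (S6), returning TF and POS per word.

    tagged_sentences is a list of sentences, each a list of (word, pos_tag) pairs.
    """
    counts = Counter()
    pos_of = {}
    for sentence in tagged_sentences:      # S4: one sentence at a time
        for word, tag in sentence:         # S6: one token at a time
            counts[word] += 1              # accumulates the set Words (S9)
            pos_of.setdefault(word, tag)   # S7: remember the first part of speech seen
    total = sum(counts.values())
    # Assumption: TF = occurrences of the word / total tokens in the text.
    tf = {word: n / total for word, n in counts.items()}
    return tf, pos_of


sentences = [[("算法", "NN"), ("提取", "VV")], [("算法", "NN"), ("文本", "NN")]]
tf, pos_of = collect_words(sentences)
print(tf["算法"])  # 0.5
```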

Further, in step S1 the uniform format comprises four parameters: title, tags, abstract, and body. If a parameter is absent, the empty string "" is supplied for it.

In step S2, normalizing the text format first sets the annotators included in the pipeline, selecting four: word segmentation, sentence splitting, part-of-speech tagging, and named-entity recognition. These four annotators are required for the text processing of the invention. Second, the packages each annotator needs are loaded and the corresponding parameters are set.

Further, when the total number of characters/words in the current sentence is computed in step S8, stop words are excluded. The text is filtered not only for stop words but also for parts of speech that cannot be keywords, such as function words.
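The two-stage filter described here, stop words plus non-keyword parts of speech, can be sketched as a single predicate. Both sets below are hypothetical examples; the patent lists neither its stop words nor the tags it filters.

```python
# Both sets are hypothetical; the patent does not list its stop words or tags.
STOP_WORDS = {"的", "了", "是"}
NON_KEYWORD_POS = {"PU", "DEG", "AS", "SP"}


def keep_token(word: str, tag: str) -> bool:
    """True if the token survives both the stop-word and the POS filter."""
    return word not in STOP_WORDS and tag not in NON_KEYWORD_POS


tokens = [("算法", "NN"), ("的", "DEG"), ("是", "VC"), ("吗", "SP"), ("准确度", "NN")]
filtered = [(w, t) for w, t in tokens if keep_token(w, t)]
print(filtered)  # [('算法', 'NN'), ('准确度', 'NN')]
```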

Further, regarding step S11: in the traditional algorithm, IDF is obtained by comparing the total number of documents with the number of documents that contain the word, as follows:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i. If the word occurs in no document, the denominator is zero; to avoid this, 1 + |{j : t_i ∈ d_j}| is generally used as the denominator. In the invention, however, IDF values are precomputed: 10,000 collected news articles serve as the test data set, the IDF of every word they contain is computed, and the results are stored in a configuration file (holding the IDF of about 270,000 common words in total). At run time, the IDF of a word is read directly from the configuration file, with no need to count the documents in which the word occurs, so the computation is light and fast. For a new or rare word that is not stored in the configuration file, the mean of all stored IDF values is used as its IDF.
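Building such an IDF table offline, with the 1 + count denominator mentioned above, might look like the sketch below. The function name `build_idf_table`, the use of the natural logarithm, and the three-document corpus are assumptions; the patent's corpus is 10,000 news articles and its log base is not stated.

```python
import math


def build_idf_table(documents):
    """Compute idf = log(|D| / (1 + df)) for every word in a tokenized corpus.

    documents is a list of token lists; df is the number of documents
    containing the word. The 1 + df denominator avoids division by zero.
    """
    doc_freq = {}
    for tokens in documents:
        for word in set(tokens):  # count each document at most once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    n_docs = len(documents)
    return {word: math.log(n_docs / (1 + df)) for word, df in doc_freq.items()}


# Hypothetical three-document corpus.
corpus = [["算法", "文本"], ["算法"], ["关键词"]]
table = build_idf_table(corpus)
print(table["算法"])  # 0.0, since log(3 / (1 + 2)) = log(1)
```

With this smoothing, a word that occurs in every document gets a slightly negative IDF, which is a known side effect of adding 1 to the denominator only.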

By adopting the above technical scheme, the invention adds a part-of-speech factor to the keyword extraction process. This improves extraction accuracy while avoiding the high space complexity that related schemes incur by constructing structures such as a Pat-tree.

Claims (6)

1. A keyword extraction algorithm based on improved TF-IDF, characterized in that it comprises the following steps: S1: format the text input into a uniform form; S2: normalize the text format and load the Properties configuration file for Stanford NLP; S3: obtain the set Sentences of all sentences in the text, split according to the sentence delimiters defined in the Properties configuration file; S4: take one sentence at a time from the set Sentences; S5: obtain the set Tokens of all tokens in the current sentence; S6: take one token at a time from the set Tokens; S7: obtain the character/word and part of speech of the current token, and assign a part-of-speech weight that differs by part of speech; S8: compute the total number of characters/words in the current sentence and the position percentage of each; S9: the preceding operations yield the set Words of all characters/words in the text; S10: take one word at a time from the set Words; S11: compute the TF and IDF of the current word; S12: once the part-of-speech weight, position weight, TF, and IDF of every word have been obtained, compute each word's weight W = TF*IDF + part-of-speech weight + position weight, and output the 5 words with the largest W as keywords.

2. The keyword extraction algorithm based on improved TF-IDF according to claim 1, characterized in that: in step S1 the uniform format comprises four parameters: title, tags, abstract, and body; if a parameter is absent, the empty string "" is supplied for it.

3. The keyword extraction algorithm based on improved TF-IDF according to claim 1, characterized in that: in step S2, normalizing the text format first sets the annotators included in the pipeline, selecting four: word segmentation, sentence splitting, part-of-speech tagging, and named-entity recognition, these four annotators being required for the text processing of the invention; second, the packages each annotator needs are loaded and the corresponding parameters are set.

4. The keyword extraction algorithm based on improved TF-IDF according to claim 1, characterized in that: when the total number of characters/words in the current sentence is computed in step S8, words whose part of speech cannot be a keyword are excluded.

5. The keyword extraction algorithm based on improved TF-IDF according to claim 1, characterized in that: the IDF values are obtained by taking 10,000 collected news articles as the test data set, computing the IDF of every word they contain, and storing the results in a configuration file holding the IDF of 270,000 common words in total; at run time, the IDF of a word is read directly from the configuration file, with no need to count the documents in which the word occurs; for a new or rare word that is not stored in the configuration file, the mean of all stored IDF values is used as its IDF.

6. The keyword extraction algorithm based on improved TF-IDF according to claim 1, characterized in that: each of the top 5 words selected as keywords in step S12 has a length between 2 and 6.
Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
CN201710369600.9A CN107145476A (en) 2017-05-23 2017-05-23 A keyword extraction algorithm based on improved TF-IDF

Applications Claiming Priority (1)

Application Number Publication Number Priority Date Filing Date Title
CN201710369600.9A CN107145476A (en) 2017-05-23 2017-05-23 A keyword extraction algorithm based on improved TF-IDF

Publications (1)

Publication Number Publication Date
CN107145476A 2017-09-08

Family

ID=59779033

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201710369600.9A Pending CN107145476A (en) 2017-05-23 2017-05-23 A keyword extraction algorithm based on improved TF-IDF

Country Status (1)

Country Link
CN (1) CN107145476A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885793A (en) * 2017-10-20 2018-04-06 江苏大学 A kind of hot microblog topic analyzing and predicting method and system
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN109145097A (en) * 2018-06-11 2019-01-04 人民法院信息技术服务中心 A kind of judgement document's classification method based on information extraction
CN112905771A (en) * 2021-02-10 2021-06-04 北京邮电大学 Characteristic keyword extraction method based on part of speech and position
CN113591572A (en) * 2021-06-29 2021-11-02 福建师范大学 Water and soil loss quantitative monitoring method based on multi-source data and multi-temporal data
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN114328865A (en) * 2021-12-14 2022-04-12 南京航空航天大学 Improved TextRank multi-feature fusion education resource keyword extraction method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7440947B2 (en) * 2004-11-12 2008-10-21 Fuji Xerox Co., Ltd. System and method for identifying query-relevant keywords in documents with latent semantic analysis
CN103064969A (en) * 2012-12-31 2013-04-24 武汉传神信息技术有限公司 Method for automatically creating keyword index table
CN103885934A (en) * 2014-02-19 2014-06-25 中国专利信息中心 Method for automatically extracting key phrases of patent documents
CN105740229A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Keyword extraction method and device
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
祝振媛: "《基于信息分类的网络书评多文档自动文摘研究》", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885793A (en) * 2017-10-20 2018-04-06 江苏大学 A kind of hot microblog topic analyzing and predicting method and system
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN109145097A (en) * 2018-06-11 2019-01-04 人民法院信息技术服务中心 A kind of judgement document's classification method based on information extraction
CN112905771A (en) * 2021-02-10 2021-06-04 北京邮电大学 Characteristic keyword extraction method based on part of speech and position
CN113591572A (en) * 2021-06-29 2021-11-02 福建师范大学 Water and soil loss quantitative monitoring method based on multi-source data and multi-temporal data
CN113591572B (en) * 2021-06-29 2023-08-15 福建师范大学 Quantitative Monitoring Method of Water and Soil Erosion Based on Multi-source Data and Multi-temporal Data
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113743090B (en) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and device
CN114328865A (en) * 2021-12-14 2022-04-12 南京航空航天大学 Improved TextRank multi-feature fusion education resource keyword extraction method

Similar Documents

Publication Publication Date Title
CN107145476A (en) A keyword extraction algorithm based on improved TF-IDF
CN111104794B (en) Text similarity matching method based on subject term
US10042923B2 (en) Topic extraction using clause segmentation and high-frequency words
US9460195B1 (en) System and methods for determining term importance, search relevance, and content summarization
CN103761264B (en) Concept hierarchy establishing method based on product review document set
CN103473263B (en) News event development process-oriented visual display method
CN109783787A (en) A kind of generation method of structured document, device and storage medium
CN104281702B (en) Data retrieval method and device based on electric power critical word participle
WO2019174132A1 (en) Data processing method, server and computer storage medium
CN105740229B (en) The method and device of keyword extraction
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
CN109918556B (en) A Depressive Emotion Recognition Method Based on Integrated Weibo Users' Social Relationships and Text Features
CN106407182A (en) A method for automatic abstracting for electronic official documents of enterprises
CN107562717A (en) A kind of text key word abstracting method being combined based on Word2Vec with Term co-occurrence
WO2020151218A1 (en) Method and apparatus for generating specialised electric power word bank, and storage medium
CN104298714B (en) A kind of mass text automatic marking method based on abnormality processing
CN103617158A (en) Method for generating emotion abstract of dialogue text
WO2023274047A1 (en) Standard knowledge graph construction and standard query method and apparatus
CN107203520A (en) The method for building up of hotel's sentiment dictionary, the sentiment analysis method and system of comment
CN110083832A (en) Recognition methods, device, equipment and the readable storage medium storing program for executing of article reprinting relationship
CN104778201A (en) Multi-query result combination-based prior art retrieval method
CN106445914B (en) Construction method and construction device of microblog emotion classifier
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN113656592B (en) Data processing method, device, electronic device and medium based on knowledge graph
CN118377853B (en) Paper topic selection assistance method, system, medium and equipment based on large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170908