CN109460477B - Information collection and classification system and method and retrieval and integration method thereof - Google Patents
Info
- Publication number
- CN109460477B (granted from application CN201811258103.2A)
- Authority
- CN
- China
- Prior art keywords
- potential new
- word
- information
- module
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an information collection and classification system and method and a retrieval and integration method thereof, wherein the retrieval and integration method comprises the following steps: step S1: acquiring a potential new word and searching the knowledge graph for the potential new word; if the potential new word exists in the knowledge graph, directly performing step S2, and if the potential new word does not exist in the knowledge graph, integrating the potential new word and all triples (e1, r, e2) related to the potential new word into the knowledge graph, wherein e1 represents the potential new word, e2 represents a word having an entity relationship with the potential new word, and r represents the relationship type between e1 and e2; step S2: performing word vector integration on the obtained potential new word; step S3: repeating the steps S1-S2 until all the potential new words are searched and integrated. The invention can effectively integrate information in an incremental way and triggers retraining only when necessary, thereby reducing the system cost and optimizing the system flow while ensuring the quality of knowledge integration.
Description
Technical Field
The invention relates to the technical field of information collection and processing, in particular to an information collection and classification system and method and a retrieval and integration method thereof.
Background
Companies in small and micro enterprise parks and industrial parks pay particular attention to the latest national hot news, policy preferences, and the subdivided fields in which their own business operates, which reflects the particularity of the information corpora these companies care about. Establishing an effective message collection, classification, retrieval and pushing system for the subdivided areas they are interested in enables them to discover information valuable to themselves at the first opportunity, and avoids their being drowned in a vast ocean of messages.
Traditional corpus classification models and word vector models are often trained on large general-purpose corpora, such as word vectors trained on Wikipedia or the news text classification corpus from the Sogou Lab. Such models provide insufficient support for the subdivided domain of national policies, and they cannot quickly respond to or effectively process new words appearing in professional and subdivided fields.
Therefore, it is necessary to provide a new information processing method.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides an information retrieval and integration method, which comprises the following steps:
step S1: acquiring a potential new word and searching the knowledge graph for the potential new word; if the potential new word exists in the knowledge graph, directly performing step S2, and if the potential new word does not exist in the knowledge graph, integrating the potential new word and all triples (e1, r, e2) related to the potential new word into the knowledge graph, wherein e1 represents the potential new word, e2 represents a word having an entity relationship with the potential new word, and r represents the relationship type between e1 and e2;
step S2: performing word vector integration on the obtained potential new words;
step S3: and repeating the steps S1-S2 until all the potential new words are searched and integrated.
Wherein the step S2 includes the following steps:
step S21: searching the potential new word in the word vector library, if the potential new word exists, returning to the step S1 to obtain the next potential new word; if not, go to step S22;
step S22: judging whether the number n of the categories of the potential new words obtained by accumulation at present is greater than or equal to a threshold value threshold_ALL, if so, clearing the number n of the categories of the potential new words, retraining the whole word vector, and returning to the step S1 to obtain the next potential new word; if not, go to step S23;
step S23: updating the value of n and the value of n_i corresponding to the potential new word, where the n_i value represents the number of times the acquired potential new word has been accumulated into the system;
step S24: judging whether the n_i value corresponding to the potential new word is greater than or equal to the threshold value threshold_ONE, if not, returning to the step S1 to obtain the next potential new word; if so, performing step S25;
step S25: integrating the word vectors of the potential new words into a word vector library.
Wherein the step S25 includes: retrieving entity words related to the potential new word in a knowledge graph;
if the word vector is searched, taking the weighted average value of the word vectors of the related entity words as the word vector of the potential new word to be stored, and returning to the step S1;
if not, searching for a synonym, near-synonym or antonym of the potential new word in at least one of the synonym thesaurus, the near-synonym thesaurus and the antonym thesaurus; if found, taking the weighted average of the word vectors of at least one of the synonyms, near-synonyms or antonyms of the potential new word as the word vector of the potential new word, storing it, and returning to the step S1; and if not found, inserting a certain preset word vector for the potential new word into the word vector library.
In step S22, when the number n of new word categories is greater than or equal to the threshold value threshold_ALL, the number n of potential new word categories is cleared, and in the process of searching and integrating subsequent potential new words, only potential new word categories that newly appear after the clearing are accumulated when calculating the n value;
in step S23, the principle of updating the n value is as follows: if the acquired potential new word has appeared in the system before, the value of n is unchanged, and if the acquired potential new word has not appeared in the system before, the value of n is increased by 1; the principle of updating n_i is that the n_i value is increased by 1.
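For illustration only, the flow of steps S1-S3 above can be sketched in Python as follows; the data structures, the threshold values and the helper functions retrain_word_vectors() and approximate_vector() are hypothetical placeholders rather than the claimed implementation.

```python
# Minimal sketch of the retrieval and integration method (steps S1-S3).
THRESHOLD_ALL = 3   # threshold_ALL: triggers retraining of the whole word-vector model
THRESHOLD_ONE = 2   # threshold_ONE: marks one potential new word as worth integrating

def retrain_word_vectors(word_vectors):
    """Placeholder for full word-vector retraining (e.g. re-running Word2Vec)."""
    pass

def approximate_vector(word, kg, word_vectors):
    """Placeholder for step S25; a fuller sketch appears later in the description."""
    return [0.0] * 300

def retrieve_and_integrate(potential_new_words, kg, word_vectors):
    n = 0       # number of categories of potential new words accumulated so far
    n_i = {}    # n_i: how many times each potential new word has entered the system
    for word, triples in potential_new_words:       # triples: list of (e1, r, e2)
        # Step S1: knowledge-graph retrieval; integrate the word and its triples if absent
        if word not in kg:
            kg[word] = triples
        # Step S21: skip words already present in the word-vector library
        if word in word_vectors:
            continue
        # Step S22: too many accumulated categories -> clear n and retrain the full model
        if n >= THRESHOLD_ALL:
            n = 0                                   # n_i is kept, so words seen before
            retrain_word_vectors(word_vectors)      # are not counted toward n again
            continue
        # Step S23: update n (new category only) and n_i (always +1)
        if word not in n_i:
            n += 1
        n_i[word] = n_i.get(word, 0) + 1
        # Steps S24/S25: integrate an approximate vector once the word is frequent enough
        if n_i[word] >= THRESHOLD_ONE:
            word_vectors[word] = approximate_vector(word, kg, word_vectors)
    return kg, word_vectors
```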
The invention also provides an information collection and classification method, which comprises the following steps:
step S1, information crawling: crawling information on related news, websites and related texts on a database through a web crawler to acquire information;
step S2: preprocessing a text;
step S3: discovering potential new words and potential new relations from the preprocessed information;
step S4, information retrieval and integration: carrying out information retrieval and integration on the found potential new words and potential new relations;
step S5: classifying the integrated information;
wherein the information retrieval and integration in the step S4 is completed according to the information retrieval and integration method as described in any one of the above.
In the step S1, information crawling is performed by using a Python crawler or the urllib web crawler; in the information crawling process, a timing starting mechanism ensures that the latest data is crawled, a crawling history management mechanism ensures that only incremental data is crawled, and a pushing or storing mechanism pushes the crawled data to a subsequent module or stores it;
in step S2, the preprocessing of the text includes removing html tags, segmenting words, or referring to a stop word list to remove stop words.
Wherein the step S3 includes:
step S31, finding potential new words: obtaining a plurality of keywords with the highest occurrence frequency in the text through feature sorting based on word frequency; acquiring a special vocabulary through the characteristic characters, acquiring all vocabularies related to the special vocabulary through syntactic analysis, and deleting entities with special meanings including names through an entity identification method;
step S32, finding a potential new relationship: and acquiring all sentences comprising potential new words, acquiring the relation words in the sentences by using a relation extraction method, and classifying the relation words by using a classifier to obtain the classified relation triples (e1, r, e 2).
Wherein the step S5 includes:
step S51: acquiring training model characteristics;
step S52: the training model features are fused into a large feature vector through a Concat layer;
step S53: outputting the training model characteristics to a single classification vector through a Fully Connected layer;
step S54: the output classification vector is normalized by the Softmax layer and finally processed into a result of (0,0, …,1, …,0), where the ith element is 1, representing that the text belongs to the ith classification.
The invention also provides an information collection and classification system, comprising:
the information crawling module is used for crawling information of related news, websites and related texts on a database to acquire information;
the text preprocessing module is connected with the information crawling module and used for performing text preprocessing on the acquired information;
the finding module is connected with the text preprocessing module and used for finding potential new words and potential new relations from the preprocessed information;
the information retrieval and integration module is connected with the discovery module and is used for retrieving and integrating the discovered potential new words and potential new relations;
the classification module is used for classifying the integrated information;
the information retrieval and integration module is used for completing information retrieval and integration according to the information retrieval and integration method.
The information crawling module comprises a policy information crawling module and a service information crawling module, and is respectively used for crawling information of related news, websites and related texts on a database through different web crawlers;
the discovery module comprises a potential new word discovery module and a potential new relation discovery module which are respectively used for discovering potential new words and potential new relations from the preprocessed information;
the information retrieval and integration module comprises a knowledge graph retrieval and integration module and a word vector retrieval and integration module, wherein the knowledge graph retrieval and integration module is used for completing the step S1 in the information retrieval and integration method, and the word vector retrieval and integration module is used for completing the step S2 in the information retrieval and integration method.
Wherein the action mechanism of the potential new word discovery module comprises: obtaining a plurality of keywords with the highest occurrence frequency in the text through feature sorting based on word frequency; acquiring a special vocabulary through the characteristic characters, acquiring all vocabularies related to the special vocabulary through syntactic analysis, and deleting entities with special meanings including names through an entity identification method;
the action mechanism of the potential new relationship discovery module comprises: and acquiring all sentences comprising potential new words, acquiring the relation words in the sentences by using a relation extraction method, and classifying the relation words by using a classifier to obtain the classified relation triples (e1, r, e 2).
The classification module classifies the integrated information by the following method:
step Sa: acquiring training model characteristics;
and Sb: the training model features are fused into a large feature vector through a Concat layer;
step Sc: outputting the training model characteristics to a single classification vector through a Fully Connected layer;
step Sd: the output classification vector is normalized by the Softmax layer and finally processed into a result of (0,0, …,1, …,0), where the ith element is 1, representing that the text belongs to the ith classification.
In step Sa, the obtained training model features include:
mixed word and sentence level characteristics of a word vector mean value formed by a plurality of keywords with the highest occurrence frequency in the text obtained by a word frequency statistical method;
article-level features formed by embedding features into a body model of a text obtained by training the text; and the number of the first and second groups,
logical features within an article formed by knowledge-graph embedded features obtained by the TransE or TransR algorithms.
When the classification module classifies the integrated information, between the steps Sb and Sc, the method further includes: normalizing the fused feature vector through a Batch Normalize layer, and randomly invalidating some nodes through at least one Dropout layer during training of the model.
The information collection and classification system further comprises a storage module which is connected with the classification module and used for storing the collected text titles, texts, keywords, main body model embedded vectors, knowledge map embedded vectors, classification vectors and classification results obtained in the whole information collection and classification process.
The information collection and classification system also comprises a user interaction module which is connected with the storage module and used for providing intelligent search service and customized push service for the user according to the information stored by the storage module.
The information collection and classification system and method and the retrieval and integration method thereof can effectively integrate information in an incremental manner, trigger retraining only when necessary, reduce the system cost and optimize the system flow under the condition of ensuring the quality of knowledge integration.
Drawings
FIG. 1: the invention provides a system architecture diagram of an information collection and classification system.
FIG. 2: the invention discloses a logic flow diagram of a working method of a discovery module.
FIG. 3: the invention is a work flow diagram of a retrieval and integration module.
FIG. 4: a workflow diagram of the classification module of the present invention.
Description of the reference numerals
10-an information crawling module, 11-a policy information crawling module and 12-a business information crawling module;
20-a text pre-processing module;
30-discovery module, 31-potential new word discovery module, 32-potential new relation discovery module;
40-a retrieval and integration module, 41-a knowledge graph retrieval and integration module and 42-a word vector retrieval and integration module;
50-classification module, 60-storage module;
70-a user interaction module, 71-a customizable push module, 72-a natural language search module;
80-user.
Detailed Description
In order to further understand the technical scheme and the advantages of the present invention, the following detailed description of the technical scheme and the advantages thereof is provided in conjunction with the accompanying drawings.
Fig. 1 is a system architecture diagram of an information collection and classification system provided by the present invention, and as shown in fig. 1, the information collection and classification system provided by the present invention mainly includes: the information crawling module 10, the text preprocessing module 20, the discovery module 30, the information retrieving and integrating module 40, the classifying module 50, the storage module 60 and the user interaction module 70 sequentially complete information acquisition, preprocessing, discovery of potential new words and potential new relations, information retrieval and integration, text classification and storage of information generated in the working process of the modules, so that a first-hand, personalized and accurate message service is provided for an end user 80, and the specific working method of the modules and the working mode of the modules in cooperation are detailed as follows.
Information crawling module
The information crawling module 10 comprises a policy information crawling module 11 and a business information crawling module 12 which can run in parallel, and adopts any technology that provides web crawler capability, such as Python's Scrapy framework or the urllib package, to crawl a large number of sources:
1. hot news and policy websites, databases (crawled by the policy information crawling module 11);
2. in-depth data on professional websites in subdivided fields (crawled by the business information crawling module 12);
Crawling covers both the web pages themselves and the links mentioned in those pages. The method limits the number of crawled web pages by presetting a maximum depth. For database data, not only the source data but also the main foreign-key relationships among the data are crawled.
The data sources may include:
1. state, provincial and municipal government official networks, channels or open databases of news, policies, finance, startup, industry, science and technology and the like;
2. service enterprises, park client official networks, main business, related technology consultation networks, channels or open databases;
3. other related networks or databases may be preset by the information collection and classification system of the present invention.
To ensure the efficiency, accuracy and depth of crawling information, the information crawling module 10 should have the following mechanisms:
1. a timing starting mechanism: specifically, the information crawling module 10 is periodically started by an external timer, so as to ensure that the latest data is crawled;
2. a crawling history management mechanism for identifying which web page contents have been crawled and are not updated, so that only incremental web page data are crawled;
3. and the pushing or storing mechanism pushes the crawling data to a subsequent module or stores the crawling data, so that the subsequent module can consume the crawling data conveniently.
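A minimal Python sketch of the three mechanisms above is given below; the seed URL, the history file name and the push_to_next_module() helper are assumptions made purely for illustration and are not part of the patented system.

```python
# Simplified sketch of timed start, crawl-history management and push/store.
import json
import time
import requests

HISTORY_FILE = "crawl_history.json"                 # crawl-history management store
SEED_URLS = ["https://example.gov/policy-news"]     # placeholder data source

def load_history():
    try:
        with open(HISTORY_FILE) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def push_to_next_module(url, html):
    # placeholder: a real system would push to a message queue or database
    print("crawled", url, len(html), "bytes")

def crawl_once():
    seen = load_history()                           # only crawl incremental pages
    for url in SEED_URLS:
        if url in seen:
            continue
        html = requests.get(url, timeout=10).text
        push_to_next_module(url, html)              # push or store for later modules
        seen.add(url)
    with open(HISTORY_FILE, "w") as f:
        json.dump(sorted(seen), f)

if __name__ == "__main__":
    while True:                                     # timed-start mechanism
        crawl_once()
        time.sleep(3600)                            # hourly interval chosen arbitrarily
```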
Text preprocessing module
The invention mainly uses some common text preprocessing techniques, such as removing html tags (using BeautifulSoup), segmenting words (using jieba segmentation), and introducing a stop word list to remove stop words.
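For illustration, a minimal preprocessing sketch using the tools named above might look as follows; the stop-word list shown is a placeholder.

```python
# Minimal text-preprocessing sketch: strip html, segment words, drop stop words.
from bs4 import BeautifulSoup
import jieba

STOP_WORDS = {"的", "了", "和", "是"}          # placeholder stop-word list

def preprocess(html):
    text = BeautifulSoup(html, "html.parser").get_text()   # remove html tags
    tokens = jieba.lcut(text)                               # word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]
```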
Third, discover module
Fig. 2 is a logic flow diagram of a working method of a discovery module according to the present invention, please refer to fig. 1 and fig. 2, wherein the discovery module 30 of the present invention includes a potential new word discovery module 31 and a potential new relationship discovery module 32.
1. Potential new word discovery module
The text preprocessing module 20 first sends the processed texts to different application process nodes in the potential new word discovery module 31, and each node processes the text allocated to it.
Specifically, the top N keywords in the text are obtained through word-frequency-based feature ranking such as tf-idf and bow; special vocabulary of potential policies and subdivided fields is acquired through special characters such as quotation marks, book title marks and brackets; all vocabularies having predicate relations with these vocabularies are obtained through syntactic analysis; and entity recognition technology is used to judge whether a vocabulary is a special-meaning entity such as a person name, place name or company name, and if so, the vocabulary is rejected. The union of the above words is used as the set of potential new words and pushed to the potential new relationship discovery module 32 together with auxiliary information (meta data, whether in quotation marks or not, whether in the top N or not).
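As an illustration of the word-frequency and special-character signals described above, a simplified Python sketch is given below; the use of scikit-learn's TfidfVectorizer and the regular expression for special characters are assumptions for the example, and the syntactic-analysis and entity-recognition steps are omitted.

```python
# Toy sketch of two potential-new-word signals: tf-idf top-N keywords and
# vocabulary enclosed in quotation marks, book-title marks or brackets.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def top_n_keywords(segmented_docs, n=10):
    # segmented_docs: list of whitespace-joined, pre-segmented texts
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(segmented_docs)
    scores = tfidf.sum(axis=0).A1              # aggregate tf-idf score per term
    terms = vec.get_feature_names_out()
    return [terms[i] for i in scores.argsort()[::-1][:n]]

def special_vocabulary(text):
    # words inside Chinese quotes, book-title marks or brackets
    return re.findall(r"[“《（(]([^”》）)]{2,10})[”》）)]", text)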
2. Potential new relationship discovery module
After the potential new words of each node are obtained, all sentences containing the potential new words are obtained. The relation words in these sentences are obtained by using a relation extraction technology, and the relation words are classified by using a pre-trained classifier. The triples (e1, r, e2) containing the classified relationships are sent to the retrieval and integration module 40.
In the present invention, the triplet (e1, r, e2) is the form of the final output of the discovery module 30. For each discovery module 30 process, processing a document will output several such entity relationship triplets. The union of all the triples referring to the entities, i.e. the set of all the potential new words, and the union of all the new relations referring to, i.e. the set of potential new relations. They will enter the retrieval and integration module 40 for subsequent processing.
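The following toy sketch shows one way such (e1, r, e2) candidates could be formed, using jieba part-of-speech tags as a stand-in for the relation extraction technique; a real system would use a trained relation extractor and classifier as described above.

```python
# Toy sketch: keep the verb between a potential new word and a following noun
# as the candidate relation word r of a triple (e1, r, e2).
import jieba.posseg as pseg

def candidate_triples(sentences, new_word):
    triples = []
    for sent in sentences:
        if new_word not in sent:
            continue
        words = [(w, flag) for w, flag in pseg.cut(sent)]
        for i, (w, flag) in enumerate(words):
            if not flag.startswith("v"):
                continue                                   # verb = candidate relation word r
            before = any(x == new_word for x, _ in words[:i])
            nouns_after = [x for x, f in words[i + 1:] if f.startswith("n")]
            if before and nouns_after:
                triples.append((new_word, w, nouns_after[0]))   # (e1, r, e2)
    return triples
```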
In the present invention, the information transmitted to the retrieving and integrating module 40 by the discovering module 30 is not limited to the discovered potential new words and potential new relationships, but may also include an acquiring path for acquiring the potential new words, that is, the acquiring path for the potential new words may be transmitted to a subsequent module as attached information, which is used as a reference for retrieving and integrating.
Fourth, search and integration module
1. Search and integration foundation
The method uses an existing Chinese model as the basic word vectors, such as a Chinese Word2Vec model pre-trained on the Sogou web news corpus (http://www.sogou.com/labs/resource/ca.php). Meanwhile, an existing knowledge graph library, such as the Chinese CN-DBpedia from Fudan University (http://kw.fudan.edu.cn/cndbpedia/intro/), is used as the Chinese knowledge graph.
Fig. 3 is a flowchart of the operation of the retrieving and integrating module of the present invention, please refer to fig. 1 and fig. 3, the retrieving and integrating module 40 of the present invention includes a knowledge graph retrieving and integrating module 41 and a word vector retrieving and integrating module 42, which respectively complete the retrieving and integrating of the knowledge graph and the word vector.
2. Knowledge graph retrieval and integration module
In the present invention, when the retrieval and integration of potential new words is carried out, each potential new word may appear repeatedly in different retrieval and integration periods, and the more times it appears, the higher the frequency of the potential new word in the text. In the present invention, n_i represents the number of occurrences of the corresponding potential new word, and i is the ordering index of each potential new word among all the distinct potential new words. For example, in the first retrieval and integration period the new word "A" is obtained and the corresponding n_1 = 1; in the second retrieval and integration period the new word "B" is obtained and the corresponding n_2 = 1; in the third retrieval and integration period the new word "A" is obtained again, so n_1 is accumulated once and becomes 2. In the present invention, n represents the number of categories of potential new words accumulated by the system; in the three retrieval and integration periods above, the value of n is 1, 2 and 2 respectively.
In the invention, after a potential new word is obtained, the knowledge graph is searched and integrated. First, whether the word is in the knowledge-graph is judged. If not, the word and all triples (e1, r, e2) associated with it are first integrated into the knowledge graph. After the step is completed, the word vector is searched and integrated.
3. Word vector retrieval and integration module
And after the retrieval and integration of the knowledge graph are completed, the retrieval and integration of the word vectors are carried out.
(1) Firstly, searching whether the potential new word exists in a word vector library, if so, ending the round of searching and integrating period, and reacquiring the next potential new word; if not, the step (2) is carried out.
(2) Judging whether the number of categories of the accumulated potential new words is greater than or equal to a preset threshold threshold_ALL. If so, this indicates that a large number of new corpora and new words have appeared in the text, and a retraining process of the word vectors is triggered in order to obtain a more accurate domain word vector model; at this moment the n value is cleared, and after retraining is finished this round of the retrieval and integration period ends and the next potential new word is acquired.
It should be noted that once the value of n is cleared, in subsequent retrieval and integration periods n is accumulated only for new words that have never appeared before. For example, with a preset threshold threshold_ALL = 3, when the system acquires the words "A", "B" and "C" in three retrieval and integration periods and none of the three words has appeared in a previous retrieval and integration period, the value of n accumulates to 3 and the retraining process is triggered; in the fourth retrieval and integration period, when these words appear again they are not counted toward the number of potential new word categories, and the value of n remains 0. That is, once the n value is cleared, previously appearing words are not considered when calculating the n value.
The setting of threshold_ALL (here threshold_ALL = 3) and of the n value determines when retraining of the full model is started.
(3) If the value of n has not reached threshold_ALL, n and n_i are updated according to the principles identified above: the n_i value is increased by 1, and the value of n is increased by 1 if the potential new word has never appeared before, or kept unchanged if it has appeared before. After the update is finished, it is further judged whether the n_i value of the acquired potential new word has reached the threshold threshold_ONE; if not, the potential new word is considered not to be a valuable potential new word, and the current retrieval and integration period ends; if so, the potential new word is integrated into the word vector library, as described in the following steps.
(4) Searching entity words related to the potential new words in a knowledge graph, judging whether the number m (i) of the obtained entity words is more than 0, if so, calculating the weighted average of the entity words on the basis of word vectors of the entity words, taking the weighted average as the word vector of the potential new words and inserting the word vector into a word vector library; if not, the synonym or the antonym (hereinafter referred to as the corresponding word) of the potential new word is searched in the synonym or antonym library, and the weighted average is calculated based on the word vector of the corresponding word, and is used as the word vector of the potential new word and inserted into the word vector library.
(5) If no valid word vector can be obtained in the above manner, a predetermined word vector is inserted into the word vector library, for example the assignment (0, 0, 0, …, 0), or the weighted average of all word vectors in the library is used as the word vector of the potential new word.
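For illustration, steps (4) and (5) can be sketched as follows; the knowledge-graph and thesaurus look-ups are simplified placeholders, and uniform weights stand in for the weighted average described above.

```python
# Sketch of the approximate word-vector calculation in steps (4) and (5).
import numpy as np

THESAURUS = {}   # placeholder synonym / near-synonym / antonym thesaurus

def related_entities(word, kg):
    # e2 words from all (e1, r, e2) triples stored for this potential new word
    return [e2 for (_e1, _r, e2) in kg.get(word, [])]

def related_words(word):
    return THESAURUS.get(word, [])

def approximate_vector(word, kg, word_vectors, dim=300):
    for candidates in (related_entities(word, kg), related_words(word)):
        vectors = [word_vectors[c] for c in candidates if c in word_vectors]
        if vectors:
            return np.mean(vectors, axis=0)     # uniform-weight average of related vectors
    return np.zeros(dim)                        # preset vector, e.g. (0, 0, ..., 0)
```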
In summary, the thresholds threshold_ALL and threshold_ONE are introduced: the retraining frequency of the system can be controlled through threshold_ALL, and the sensitivity of the system to new words can be controlled through threshold_ONE. With these two thresholds, an approximate calculation method for word vectors is provided, so that the system does not need to retrain every time a potential new word is obtained; the word vector can instead be calculated from the vectors of related words, and retraining is performed only after the n value reaches the set threshold. In this way valuable new words and new relations can be retrieved and integrated, computing resources are saved, the integration of new words and new relations is accelerated, and they can be used immediately by the subsequent classification model, search module and push module, improving the accuracy of prediction, search and the like.
Fifth, categorised module
Fig. 4 is a flowchart of the classification module according to the present invention, please refer to fig. 1 and fig. 4, the classification module 50 according to the present invention is connected to the retrieving and integrating module 40, and the training classification model is based on the input training model features, which include:
1. mixed sentence level features-TopN word vector features: the characteristics are the TopN words obtained by the word frequency statistical method (tf-idf or BOW, etc.) and the weighted average of the word vectors obtained by the retrieval and integration module.
2. Embedding characteristics of the main body model: obtained through LDA training text, which is input to classification module 50 as article-level features.
3. Knowledge picture embedding characteristics: obtained by using the algorithm TransE, TransR, etc., as logical features within the article input to the classification module 50.
The necessary tools used in the training process of the model features are the Concat layer and the Fully Connected layer shown in the figure:
1. concat layer: the method is used for fusing the three features into a large feature vector to serve as the input of a subsequent neural network.
2. Fully Connected layer (Fully Connected layer): after receiving the input features, a single classification vector is output.
In the present invention, at least one Dropout layer (discard regularization layer) and a Batch Normalize layer (batch normalization layer) may be disposed between the Concat layer and the Fully Connected layer, and there is no required order between the Dropout layer and the Batch Normalize layer; that is, the arrangements of the Dropout layer and the Batch Normalize layer between the Concat layer and the Fully Connected layer may include the following three:
1. a Dropout layer, a Batch Normalize layer;
2. a Batch Normalize layer, a Dropout layer;
3. a Dropout layer, a Batch Normalize layer, a Dropout layer.
The Batch Normalization layer is used for normalizing the fused feature vectors so as to stabilize data distribution and improve convergence speed.
The Dropout layer is used to randomly invalidate some nodes during model training to avoid overfitting.
Finally, the output classification vectors are normalized by the Softmax layer and finally processed to (0,0, …,1, …, 0). The ith element is 1 here, representing that the text belongs to the ith class.
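For illustration only, the Concat / Batch Normalize / Dropout / Fully Connected / Softmax stack described above may be sketched as follows; PyTorch, the feature dimensions and the class count are assumptions made for the example and are not specified by the invention.

```python
# Sketch of the classification head: Concat -> Batch Normalize -> Dropout ->
# Fully Connected -> Softmax.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, dims=(300, 100, 50), num_classes=10, p_drop=0.5):
        super().__init__()
        fused = sum(dims)                        # size of the Concat layer output
        self.norm = nn.BatchNorm1d(fused)        # Batch Normalize layer
        self.drop = nn.Dropout(p_drop)           # Dropout layer (active during training)
        self.fc = nn.Linear(fused, num_classes)  # Fully Connected layer

    def forward(self, topn_vec, topic_vec, kg_vec):
        x = torch.cat([topn_vec, topic_vec, kg_vec], dim=1)   # Concat layer
        x = self.drop(self.norm(x))
        logits = self.fc(x)
        return torch.softmax(logits, dim=1)      # Softmax output
```

Calling the module on three batched feature tensors of widths 300, 100 and 50 yields, for each text, a probability vector over the ten example classes, approximating the (0, 0, …, 1, …, 0) form described above.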
Sixthly, storage module
After the text classification training is completed, each new input sample is processed by the text preprocessing module 20, the discovery module 30, the retrieval and integration module 40, and the classification module 50, and is finally stored in the storage module 60. The stored information includes the article title, the text body, the Top N keywords (including their word vectors), the topic model embedding vector, the knowledge graph embedding vector, the classification vector output by the Fully Connected layer, and the classification of the text.
Seventh, user interaction module
After the storage module 60 stores the above information in a file system or a database, an intelligent search service can be provided based on the information, and purposeful, targeted pushing can be performed according to the subscriptions and interests of different users in different information.
As shown in fig. 1, the user interaction module 70 includes a customizable push module 71 and a natural language search module 72. The customizable push module mainly provides daily pushing of new messages, for example pushing the classified information subscribed by a user to the user 80 in real time through WeChat, SMS and email; the natural language search module provides real-time search of the indexed news and policy information, and supports searching by keywords, by classification and by natural-language questions.
In the present invention, the terms "Python", "Scrapy" and "urllib" all refer to common web crawler tools.
In the invention, the term "tf-idf" refers to "word frequency-inverse text frequency", the term "bow" refers to "bag of words model", both are existing general text processing technologies mainly based on word frequency, which can calculate text features at word frequency level on the basis of words, phrases or N-grams, and can order the calculated features, thereby obtaining top N such words and phrases as a source of potential new words.
In the present invention, the "syntactic analysis" method parses a sentence into several nested syntax trees so as to extract: 1. subject-copula-predicative relationships; 2. subject-predicate relationships; 3. modification relationships; 4. other relationships, and the like. Because relation words take many forms, classifying them into several categories is beneficial for subsequent retrieval and integration.
In the present invention, the "entity recognition technique" refers to a named entity recognition technique that can recognize specific entities such as company names, person names, dates and place names; such entities contribute little to model prediction and do not serve as potential new words, and therefore need to be eliminated from the potential new words.
The "classifier" used in the present invention may be any existing classifier, such as a text classifier trained on the Sogou news corpus.
In the present invention, techniques such as "TransE" and "TransR" are vector representations (embeddings) of knowledge graphs, and are mainly used to vectorize the relationships and logic inside texts into features.
In the present invention, so-called "LDA" training models the distribution of words in a text through a topic model, and represents an article as a vector describing the distribution of the article over several topics.
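As an illustration only, an LDA topic vector of the kind described here could be computed with gensim as follows; the library choice, topic count and corpus format are assumptions for the example.

```python
# Illustrative topic-model (LDA) embedding of each article.
from gensim import corpora, models

def lda_article_vectors(segmented_docs, num_topics=20):
    # segmented_docs: list of token lists produced by the preprocessing module
    dictionary = corpora.Dictionary(segmented_docs)
    bows = [dictionary.doc2bow(doc) for doc in segmented_docs]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    # each article becomes a dense vector of its distribution over the topics
    return [[p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
            for bow in bows]
```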
In the present invention, the "Concat layer" refers to a feature splicing layer, which means that two or more features are spliced in corresponding dimensions.
The invention has the following beneficial effects:
1. through a potential new word discovery module and a potential new relation discovery module, a method for discovering new words and new relations rapidly and pertinently is provided: the requirement of enterprises for the popular hot spot information can be better and faster met. Because a large number of new words have the characteristics of fast appearance, high short-term frequency and short life cycle, if certain linguistic data and data are accumulated and then are found again when being triggered to retrain and are added into the system, the best discovery, classification and indexing opportunity of the new words and new relations can be greatly missed. Frequent retraining will also increase the computational burden on the enterprise. Therefore, the method can also accelerate the discovery of valuable new words and new relations and avoid frequent retraining of the model by the aid of the potential new word discovery module and the potential new relation discovery module. The method is more efficient and rapid while ensuring the discovery quality, and reduces the calculation cost.
2. An iterative and efficient method for integrating new words and new relations is provided through the knowledge graph retrieval and integration module and the word vector retrieval and integration module: thresholds for the number of new word categories and for the accumulated count of individual new words are added to define when to trigger word vector retraining; before retraining is triggered, the weighted average of the word vectors of related words is directly used as the word vector of the new word for direct integration; and the attached information of the potential new relationship is considered in the integration as a reference basis. Therefore, information can be effectively and incrementally integrated, retraining is triggered only when necessary, and the cost of the system is reduced and the system flow is optimized while the quality of knowledge integration is ensured.
3. An information classification model combining a plurality of incomplete text features is provided through the classification module. The unstructured text information is feature-extracted along 3 dimensions using 3 features, namely the keyword dimension (keyword word vectors), the article-level word distribution dimension (topic model vector) and the logical dimension inside the article (knowledge graph embedding vector). Since each kind of information is extracted and trained separately in the system using unsupervised learning, each feature alone is incomplete; combining the above 3 dimensions improves the prediction accuracy.
Although the present invention has been described with reference to the preferred embodiments, it should be understood that the scope of the present invention is not limited thereto, and those skilled in the art will appreciate that various changes and modifications can be made without departing from the spirit and scope of the present invention.
Claims (15)
1. An information retrieval and integration method, characterized by comprising the steps of:
step S1: acquiring a potential new word, searching the knowledge graph for the potential new word, and directly performing step S2 if the potential new word exists in the knowledge graph, and integrating the potential new word and all related triples (e1, r, e2) of the potential new word into the knowledge graph if the potential new word does not exist in the knowledge graph, wherein e1 represents the potential new word, e2 represents a word having an entity relationship with the potential new word, and r represents the relationship type between e1 and e2;
step S2: performing word vector integration on the obtained potential new word, wherein the step S2 includes the following steps:
step S21: searching the potential new word in the word vector library, if the potential new word exists, returning to the step S1 to obtain the next potential new word; if not, go to step S22;
step S22: judging whether the number n of the categories of the potential new words obtained by accumulation at present is greater than or equal to a threshold value threshold_ALL, if so, clearing the number n of the categories of the potential new words, retraining the whole word vector, and returning to the step S1 to obtain the next potential new word; if not, go to step S23;
step S23: updating the value of n and the value of n_i corresponding to the potential new word, where the n_i value represents the number of times the acquired potential new word has been accumulated into the system;
step S24: judging whether the n_i value corresponding to the potential new word is greater than or equal to the threshold value threshold_ONE, if not, returning to the step S1 to obtain the next potential new word; if yes, go to step S25;
step S25: integrating the word vectors of the potential new words into a word vector library;
step S3: and repeating the steps S1-S2 until all the potential new words are searched and integrated.
2. The information retrieval and integration method of claim 1, wherein the step S25 includes: retrieving entity words related to the potential new word in a knowledge graph;
if the word vector is searched, taking the weighted average value of the word vectors of the related entity words as the word vector of the potential new word to be stored, and returning to the step S1;
if not, searching for a synonym, near-synonym or antonym of the potential new word in at least one of the synonym thesaurus, the near-synonym thesaurus and the antonym thesaurus; if found, taking the weighted average of the word vectors of at least one of the synonyms, near-synonyms or antonyms of the potential new word as the word vector of the potential new word, storing it, and returning to the step S1; and if not found, inserting a certain preset word vector for the potential new word into the word vector library.
3. The information retrieval and integration method of claim 1, wherein: in step S22, when the number n of new word categories is greater than or equal to the threshold value threshold_ALL, the number n of potential new word categories is cleared, and in the process of searching and integrating subsequent potential new words, only potential new word categories that newly appear after the clearing are accumulated when calculating the n value;
in step S23, the principle of updating the n value is as follows: if the acquired potential new word has appeared in the system before, the value of n is unchanged, and if the acquired potential new word has not appeared in the system before, the value of n is increased by 1; the principle of updating n_i is that the n_i value is increased by 1.
4. An information collection and classification method is characterized by comprising the following steps:
step S1, information crawling: crawling information on related news, websites and related texts on a database through a web crawler to acquire information;
step S2: preprocessing a text;
step S3: discovering potential new words and potential new relations from the preprocessed information;
step S4, information retrieval and integration: carrying out information retrieval and integration on the found potential new words and potential new relations;
step S5: classifying the integrated information;
wherein the information retrieval and integration in step S4 is accomplished according to the information retrieval and integration method of any one of claims 1-3.
5. The information collection and classification method according to claim 4, characterized in that:
in the step S1, in the information crawling, the information crawling is performed by a crawler of python or a web crawler of urllib, and in the information crawling process, the latest data is crawled by a timing starting mechanism, only the incremental data is guaranteed to be crawled by a crawling history management mechanism, and the crawled data is pushed to a subsequent module by a pushing or storing mechanism or is stored;
in step S2, the preprocessing of the text includes removing html tags, segmenting words, or referring to a stop word list to remove stop words.
6. The information collecting and classifying method according to claim 4, wherein said step S3 includes:
step S31, finding potential new words: obtaining a plurality of keywords with the highest occurrence frequency in the text through feature sorting based on word frequency; acquiring a special vocabulary through the characteristic characters, acquiring all vocabularies related to the special vocabulary through syntactic analysis, and deleting entities with special meanings including names through an entity identification method;
step S32, finding a potential new relationship: and acquiring all sentences comprising potential new words, acquiring the relation words in the sentences by using a relation extraction method, and classifying the relation words by using a classifier to obtain the classified relation triples (e1, r, e 2).
7. The information collecting and classifying method according to claim 4, wherein said step S5 includes:
step S51: acquiring training model characteristics;
step S52: the training model features are fused into a large feature vector through a Concat layer;
step S53: outputting the training model characteristics to a single classification vector through a Fully Connected layer;
step S54: the output classification vector is normalized by the Softmax layer and finally processed into a result of (0,0, …,1, …,0), where the ith element is 1, representing that the text belongs to the ith classification.
8. An information collection and classification system, comprising:
the information crawling module is used for crawling information of related news, websites and related texts on a database to acquire information;
the text preprocessing module is connected with the information crawling module and used for performing text preprocessing on the acquired information;
the finding module is connected with the text preprocessing module and used for finding potential new words and potential new relations from the preprocessed information;
the information retrieval and integration module is connected with the discovery module and is used for retrieving and integrating the discovered potential new words and potential new relations;
the classification module is used for classifying the integrated information;
wherein the information retrieval and integration module performs information retrieval and integration according to the information retrieval and integration method of any one of claims 1-3.
9. The information collection and classification system of claim 8, wherein:
the information crawling module comprises a policy information crawling module and a service information crawling module, and is respectively used for crawling information of related news, websites and related texts on a database through different web crawlers;
the discovery module comprises a potential new word discovery module and a potential new relation discovery module which are respectively used for discovering potential new words and potential new relations from the preprocessed information;
the information retrieving and integrating module comprises a knowledge graph retrieving and integrating module and a word vector retrieving and integrating module, wherein the knowledge graph retrieving and integrating module is used for completing the step S1 in the information retrieving and integrating method of any one of claims 1-3, and the word vector retrieving and integrating module is used for completing the step S2 in the information retrieving and integrating method of any one of claims 1-3.
10. The information collection and classification system of claim 9, wherein: the action mechanism of the potential new word discovery module comprises the following steps: obtaining a plurality of keywords with the highest occurrence frequency in the text through feature sorting based on word frequency; acquiring a special vocabulary through the characteristic characters, acquiring all vocabularies related to the special vocabulary through syntactic analysis, and deleting entities with special meanings including names through an entity identification method;
the action mechanism of the potential new relationship discovery module comprises: and acquiring all sentences comprising potential new words, acquiring the relation words in the sentences by using a relation extraction method, and classifying the relation words by using a classifier to obtain the classified relation triples (e1, r, e 2).
11. The information collection and classification system of claim 8, wherein the classification module classifies the integrated information by:
step Sa: acquiring training model characteristics;
and Sb: the training model features are fused into a large feature vector through a Concat layer;
step Sc: outputting the training model characteristics to a single classification vector through a Fully Connected layer;
step Sd: the output classification vector is normalized by the Softmax layer and finally processed into a result of (0,0, …,1, …,0), where the ith element is 1, representing that the text belongs to the ith classification.
12. The information collection and classification system of claim 11, wherein: in step Sa, the obtained training model features include:
mixed word and sentence level characteristics of a word vector mean value formed by a plurality of keywords with the highest occurrence frequency in the text obtained by a word frequency statistical method;
article-level features formed by embedding features into a body model of a text obtained by training the text; and the number of the first and second groups,
logical features within an article formed by knowledge-graph embedded features obtained by the TransE or TransR algorithms.
13. The information collecting and classifying system according to claim 11, wherein when the classifying module classifies the integrated information, between the steps Sb and Sc the method further includes: normalizing the fused feature vector through a Batch Normalize layer, and randomly invalidating some nodes through at least one Dropout layer during training of the model.
14. The information collection and classification system according to any one of claims 8-13, further comprising a storage module, coupled to the classification module, for storing the text titles, texts, keywords, body model embedding vectors, knowledge map embedding vectors, classification vectors, and classification results obtained during the entire information collection and classification process.
15. The information collection and classification system of claim 14, wherein: the information collection and classification system also comprises a user interaction module which is connected with the storage module and used for providing intelligent search service and customized push service for the user according to the information stored by the storage module.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811258103.2A CN109460477B (en) | 2018-10-26 | 2018-10-26 | Information collection and classification system and method and retrieval and integration method thereof |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109460477A CN109460477A (en) | 2019-03-12 |
| CN109460477B true CN109460477B (en) | 2022-03-29 |
Family
ID=65608499
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811258103.2A Expired - Fee Related CN109460477B (en) | 2018-10-26 | 2018-10-26 | Information collection and classification system and method and retrieval and integration method thereof |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109460477B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110609903B (en) * | 2019-08-01 | 2022-11-11 | 华为技术有限公司 | Information presentation method and device |
| CN110765235B (en) * | 2019-09-09 | 2023-09-05 | 深圳市人马互动科技有限公司 | Training data generation method, device, terminal and readable medium |
| CN112347343B (en) * | 2020-09-25 | 2024-05-28 | 北京淇瑀信息科技有限公司 | Custom information pushing method and device and electronic equipment |
| CN112035653B (en) * | 2020-11-05 | 2021-03-02 | 北京智源人工智能研究院 | A method and device for extracting key policy information, storage medium, and electronic device |
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160092448A1 (en) * | 2014-09-26 | 2016-03-31 | International Business Machines Corporation | Method For Deducing Entity Relationships Across Corpora Using Cluster Based Dictionary Vocabulary Lexicon |
| US20170255694A1 (en) * | 2014-09-26 | 2017-09-07 | International Business Machines Corporation | Method For Deducing Entity Relationships Across Corpora Using Cluster Based Dictionary Vocabulary Lexicon |
| CN107818164A (en) * | 2017-11-02 | 2018-03-20 | 东北师范大学 | A kind of intelligent answer method and its system |
| CN108509654A (en) * | 2018-04-18 | 2018-09-07 | 上海交通大学 | The construction method of dynamic knowledge collection of illustrative plates |
Non-Patent Citations (1)
| Title |
|---|
| Huang Hengqi et al., "A Survey of Knowledge Graph Research" (知识图谱研究综述), Computer Systems & Applications, Vol. 28, No. 6, 2019-06-30, pp. 1-12 * |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220329 |