CN106095737A - Document similarity calculation method and whole-network similar-document retrieval and tracking method - Google Patents

Document similarity calculation method and whole-network similar-document retrieval and tracking method

Info

Publication number
CN106095737A
CN106095737A CN201610398902.4A CN201610398902A CN106095737A CN 106095737 A CN106095737 A CN 106095737A CN 201610398902 A CN201610398902 A CN 201610398902A CN 106095737 A CN106095737 A CN 106095737A
Authority
CN
China
Prior art keywords
document
documents
similarity
computational methods
cosine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610398902.4A
Other languages
Chinese (zh)
Inventor
姚洲鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Fan Wen Science And Technology Ltd
Original Assignee
Hangzhou Fan Wen Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Fan Wen Science And Technology Ltd
Priority to CN201610398902.4A
Publication of CN106095737A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a document similarity calculation method and a whole-network similar-document retrieval and tracking method, and its object is to provide both. The technical solution is a document similarity calculation method characterized by the following steps. S01, document decomposition: the original document and the target document are each segmented into words, yielding their respective token sets. S02, preprocessing and feature weighting: TF-IDF is used to compute a weight for each token and to extract the core keywords; Word2vec is used to mine the degree of association between different tokens in a document, and semantic analysis is performed on each document. S03, vector space model and cosine similarity algorithm: the cosine of the angle between two vectors in the vector space is used to measure the similarity of the two documents; the cosine value lies between 0 and 1, and the larger it is, the more similar the two documents are. The invention is suitable for tracking reprints of news and information and for compiling dissemination statistics.

Description

Document similarity calculation method and whole-network similar-document retrieval and tracking method
Technical field
The present invention relates to a document similarity calculation method and a whole-network similar-document retrieval and tracking method. It is suitable for tracking reprints of news and information and for compiling dissemination statistics.
Background art
Traditional media are the main producers of news and information and contribute more than 80% of original news. Constrained by their own distribution platforms, however, their original documents are reprinted in large numbers by portals and new media outlets. By reprinting these documents, new media multiply their traffic and influence and earn good economic returns, while the authors of the original documents gain nothing from it. When copyright problems are pursued through legal means, finding the reprinted documents is like looking for a needle in a haystack: it consumes a great deal of manpower, and collecting evidence is very difficult.
At the same time, media organizations also want to analyze the reach of their content across all the outlets that reprint it. At present they have no good way to tally all propagation paths; the counting can only be done manually, and the workload is enormous.
At present, China has the highest proportion of social media use in the world, with each person spending an average of 5.8 hours online every day. In the past, the public got its information from television, newspapers, magazines and radio; today it increasingly gets information through social software such as Weibo, WeChat, QQ and forums. As of the end of the first quarter of this year, Sina Weibo had 260 million monthly active users and WeChat had 549 million. Weibo and WeChat have become the preferred tools for filling fragmented time.
Today, in the mobile Internet era, content, form and social interaction all matter, and that social interaction is built on strong ties. The influence of mass media is slowly declining while the influence of new media keeps deepening.
When every individual has the capacity to broadcast, the traditional media structure begins to disintegrate, and consumers no longer rely heavily on mass media as their channel for news: the age of "self-media" is born. This is an era in which ordinary people can create miracles and in which consumers hold sovereignty, so it is also the era of greatest opportunity for everyone, media professionals in particular.
With self-media developing rapidly, copyright protection for individual self-media authors matters all the more. Because these authors lack resources, they have no good way to protect the copyright of their own documents.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the problems described above, to provide a document similarity calculation method and a whole-network similar-document retrieval and tracking method that judge the similarity of two documents more accurately and track across the whole network where a document has been published, laying a solid foundation for copyright protection.
The technical solution adopted by the present invention is a document similarity calculation method, characterized by:
S01, document decomposition: the original document and the target document are each segmented into words, yielding their respective token sets;
S02, preprocessing and feature weighting:
TF-IDF is used to compute a weight for each token and to extract the core keywords;
Word2vec is used to mine the degree of association between different tokens in a document, and semantic analysis is performed on each document;
S03, vector space model and cosine similarity algorithm:
The original document and the target document are each reduced to an N-dimensional vector whose components are keyword weights;
The document cosine similarity algorithm is based on the vector model: the cosine of the angle between the two vectors in the vector space is used to measure the similarity of the two documents. The cosine value lies between 0 and 1, and the larger it is, the more similar the two documents are.
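Step S03 can be written compactly as follows. The notation is an assumption introduced here for illustration and does not appear in the patent text: a_i and b_i stand for the weights of the i-th keyword in the original and target documents, and neither vector is assumed to be all zeros.

```latex
\cos\theta
  = \frac{\sum_{i=1}^{N} a_i b_i}
         {\sqrt{\sum_{i=1}^{N} a_i^{2}}\;\sqrt{\sum_{i=1}^{N} b_i^{2}}},
\qquad
a_i \ge 0,\; b_i \ge 0 \;\Longrightarrow\; 0 \le \cos\theta \le 1 .
```

The stated range of 0 to 1 follows from the weights being non-negative; for vectors with negative components the cosine could fall as low as -1.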
Step S01 includes:
Data preparation: interference information in the document is cleaned by an ETL data cleaning system, and the document is structured and decomposed into minimum-unit structures;
Infrastructure: the ElasticSearch search engine serves as the base component on which a full-text index is built, and the fine-grained segmentation of a Chinese word segmentation library is used to create the index.
In step S02, TF-IDF is used, together with the words in an inverse-document dictionary, to delete tokens that contribute little to identifying the text content of a document but occur very frequently.
A whole-network similar-document retrieval and tracking method, characterized by:
a. Set the retrieval scope;
b. Set the search conditions: extract the N core keywords with the highest TF-IDF weights in the original document and, at a given matching rate, search the whole library with the ES full-text search engine;
c. Sort by keyword-document relevance weight: the retrieved documents are sorted in descending order of the relevance weight between keywords and document;
d. Compare the document with the highest weight value one by one with every retrieved document, using the document similarity calculation method to compute the similarity of the two documents;
e. Check whether the similarity result is higher than N%: if it is, the two documents are judged to be the same; otherwise they are judged to be different documents.
Step a includes setting the time range within which the retrieved documents were published, the carriers on which they were published, and the word count and type of the retrieved documents.
The beneficial effects of the invention are as follows. By combining TF-IDF with Word2vec, the invention obtains more accurate document similarity results, so that copyright tracking and the statistical analysis of dissemination are more precise and closer to the actual situation. The invention reduces the original document and the target document to two N-dimensional vectors whose components are keyword weights and uses the cosine of the angle between them to measure the similarity of the two documents, so the similarity is judged more accurately. The invention sets the retrieval scope conditionally and cleans interference information through the ETL data cleaning system, which improves retrieval efficiency.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the document similarity calculation method in the embodiment.
Fig. 2 is the flow chart of preprocessing and feature weighting in the embodiment.
Fig. 3 is the relationship diagram of the vector space model and the cosine similarity algorithm in the embodiment.
Fig. 4 is the flow chart of the whole-network similar-document retrieval and tracking method in the embodiment.
Detailed description of the embodiments
Fig. 1 is the system architecture diagram of the document similarity calculation method in this embodiment. The document similarity calculation method in this embodiment includes:
(1) Data preparation - ETL
Media data from the whole network is collected in real time, and interference information is cleaned by the "ETL data cleaning system". While the data is being purified, each news item is also structured and decomposed into minimum-unit structures, yielding token sets. This is referred to as data atomization.
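The cleaning and atomization step could be sketched as follows. This is a minimal illustration, not the patent's ETL system: treating HTML markup as the "interference information" and sentences as the minimum unit are both assumptions, and the names `clean_text` and `atomize` are hypothetical.

```python
import html
import re

# Assumption: interference information is approximated by HTML tags and
# redundant whitespace; the real ETL system is not described in detail.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: the "minimum unit" is a sentence ending in 。！？ or ! ?
SENTENCE_END_RE = re.compile(r"(?<=[。！？!?])\s*")


def clean_text(raw: str) -> str:
    """Strip markup, decode HTML entities and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def atomize(raw: str) -> list[str]:
    """Split a cleaned document into sentence-level units."""
    parts = SENTENCE_END_RE.split(clean_text(raw))
    return [part.strip() for part in parts if part.strip()]
```

Word-level segmentation, which produces the token sets the text refers to, would follow this step and normally relies on a Chinese word segmentation library that is outside this sketch.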
(2) Infrastructure - ElasticSearch full-text index + Chinese word segmentation
The ElasticSearch search engine serves as the base component of the whole system, and all later-stage algorithms are built on ES. ElasticSearch is a distributed, multi-user full-text search engine based on Lucene. The scalability of its distributed storage effectively solves the problem of storing the massive data that arrives every day. ElasticSearch is also a near-real-time search platform: in practice, a document becomes searchable only about 1 second after it is indexed, so it can be applied efficiently in the later propagation-path analysis. Its distributed computing can also be exploited, together with added hardware, to raise computation speed and retrieval performance.
When the full-text index is built, the fine-grained segmentation mode of the Chinese word segmentation library is used to create the index, so that the keywords of a document are decomposed completely.
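An index definition along these lines might look like the sketch below. Several things here are assumptions rather than details from the patent: the field names are hypothetical, the mapping syntax is that of recent Elasticsearch versions (7 and later), and the fine-grained analyzer is taken to be `ik_max_word` from the separately installed IK analysis plugin. The patent names no specific segmentation library.

```python
def build_index_body(shards: int = 3, replicas: int = 1) -> dict:
    """Return a request body for creating a full-text index of news documents."""
    return {
        "settings": {
            "number_of_shards": shards,      # distributed storage across nodes
            "number_of_replicas": replicas,
        },
        "mappings": {
            "properties": {
                # Fine-grained Chinese segmentation (assumed: IK plugin).
                "title": {"type": "text", "analyzer": "ik_max_word"},
                "content": {"type": "text", "analyzer": "ik_max_word"},
                # Hypothetical fields used later to restrict the retrieval scope.
                "publish_time": {"type": "date"},
                "carrier": {"type": "keyword"},
                "doc_type": {"type": "keyword"},
                "word_count": {"type": "integer"},
            }
        },
    }
```

The body would be sent in an index-creation request. Fine-grained segmentation emits overlapping sub-words, which enlarges the index but makes it less likely that a keyword is missed.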
(3) Preprocessing and feature weighting - TF-IDF + Word2vec
Fig. 2 is the flow chart of preprocessing and feature weighting in this embodiment. TF-IDF is a weighting technique used in information retrieval and data mining. It assesses how important a word is to one document in a document collection: the weight of a word rises in proportion to the number of times it occurs in the document, but falls in inverse proportion to the frequency with which it occurs across the collection. Based on TF-IDF, and according to the words in the inverse-document dictionary, words, symbols, punctuation, garbled characters and the like that contribute little to identifying the text content but occur very frequently are deleted from the document.
The keywords of each document are decomposed and the frequency of each word is counted; TF-IDF is then used to compute a weight for each token and to extract the core keywords.
TF-IDF is a method for computing the degree of association between a word and a document. It is mainly used to better pinpoint, within massive data, the range of similar documents that need statistical analysis, in preparation for the subsequent reprint-tracking analysis.
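The weighting described above can be illustrated with a small pure-Python sketch. The patent does not give an exact formula, so the smoothed IDF used here, log((1 + n) / (1 + df)) + 1, is one common variant chosen as an assumption; the function names are hypothetical, and documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def tf_idf(doc_tokens: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Return {term: weight} for one tokenized document against a corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))          # count each term once per document
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    weights = {}
    for term, count in counts.items():
        tf = count / total                    # rises with in-document frequency
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1  # falls with spread
        weights[term] = tf * idf
    return weights


def top_keywords(weights: dict[str, float], n: int) -> list[str]:
    """Return the n highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]
```

A term that occurs in every document, such as a function word, receives the lowest IDF and drops out of the core keywords; this ranking effect stands in for the dictionary-based deletion of frequent, low-information tokens.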
Because the cosine similarity algorithm cannot handle synonyms and near-synonyms, this embodiment introduces the Word2vec algorithm in the preprocessing stage to perform semantic analysis on each document in advance, removing semantic interference from the later statistical analysis. Word2vec is an efficient algorithm that represents words as numeric vectors. Drawing on ideas from deep learning, it is trained so that the processing of document keywords reduces to vector operations in a vector space, and it improves semantic accuracy by mining the degree of association between different keywords in a document.
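The patent does not say how the Word2vec output feeds into the weighting, so the following is only one plausible sketch: tokens whose vectors are nearly parallel are mapped to a single representative before weights are computed. The three-dimensional vectors are made-up stand-ins for trained embeddings, and `canonicalize` and the 0.95 threshold are hypothetical.

```python
import math

# Toy vectors standing in for trained Word2vec embeddings (invented values).
EMBEDDINGS = {
    "car": (0.90, 0.10, 0.00),
    "automobile": (0.88, 0.12, 0.02),
    "bank": (0.00, 0.95, 0.10),
    "weather": (0.05, 0.00, 0.98),
}


def cosine(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def canonicalize(tokens, embeddings, threshold=0.95):
    """Replace each token by the first earlier token that is a near-synonym."""
    representatives = []   # distinct tokens kept so far, in order of appearance
    result = []
    for token in tokens:
        match = token if token in representatives else None
        vector = embeddings.get(token)
        if match is None and vector is not None:
            for rep in representatives:
                rep_vector = embeddings.get(rep)
                if rep_vector is not None and cosine(vector, rep_vector) >= threshold:
                    match = rep
                    break
        if match is None:
            representatives.append(token)
            match = token
        result.append(match)
    return result
```

With synonyms merged this way, two documents that use "car" and "automobile" share a keyword dimension rather than occupying two unrelated ones.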
(4) Vector space model and cosine similarity algorithm
Fig. 3 is the relationship diagram of the vector space model and the cosine similarity algorithm in this embodiment. The original document and the target document are each reduced to an N-dimensional vector whose components are keyword weights, and the vector model is then used to compute the cosine similarity. The document cosine similarity algorithm is vector-based: it uses the cosine of the angle between the two vectors in the vector space to measure the similarity of the two documents and focuses on the difference in direction between the two vectors. The cosine value lies between 0 and 1, and the larger the value, the more similar the two documents are.
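A minimal sketch of this calculation is given below. It assumes each document has already been reduced to a mapping from keyword to non-negative weight, and it takes the N dimensions to be the union of both documents' keywords; the function name is hypothetical.

```python
import math


def cosine_similarity(weights_a: dict[str, float], weights_b: dict[str, float]) -> float:
    """Cosine of the angle between two keyword-weight vectors."""
    # The shared vocabulary defines the N dimensions; a missing keyword counts as 0.
    vocabulary = sorted(set(weights_a) | set(weights_b))
    vec_a = [weights_a.get(term, 0.0) for term in vocabulary]
    vec_b = [weights_b.get(term, 0.0) for term in vocabulary]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                      # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

Because only direction matters, a document and a copy of it with every weight doubled score the same as two identical documents. The result stays within 0 to 1 only because the weights are non-negative.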
As shown in Fig. 4, this embodiment provides a whole-network similar-document retrieval and tracking method, whose concrete steps are as follows:
a. Set the retrieval scope;
a01. Set the time range: for example, documents published within 3 days (72 hours) of the current time;
a02. Set the document scope: select the carriers to search, such as newspapers, websites and WeChat;
a03. Set the document selection conditions: the required word count and type of the retrieved documents, for example article word count >= 200; excluded article types: forum, special topic.
b. Set the search conditions: extract the N core keywords with the highest TF-IDF weights in the original document and, at a given matching rate, search the whole library with the ES full-text search engine;
c. Sort by keyword-document relevance weight: the retrieved documents are sorted in descending order of the relevance weight between keywords and document;
d. Compare the document with the highest weight value one by one with every retrieved document: the document similarity calculation method of this embodiment is applied to compute the similarity between the highest-weighted document and each other document;
e. Check whether the similarity result is higher than N%: if it is, the two documents are judged to be the same; otherwise they are judged to be different documents.
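Steps a to e can be sketched end to end over an in-memory corpus, as below. This is an illustration under stated assumptions, not the patented implementation: the `Doc` fields, the default parameter values, the use of summed keyword weights as a stand-in for the search engine's relevance score, and the reading of "matching rate" as the fraction of core keywords a hit must contain are all hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Doc:
    doc_id: str
    carrier: str        # e.g. "newspaper", "website", "wechat"
    doc_type: str       # e.g. "news", "forum", "special topic"
    word_count: int
    published: datetime
    weights: dict       # keyword -> TF-IDF weight, assumed precomputed


def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def track_reprints(original, corpus, now, days=3,
                   carriers=("newspaper", "website", "wechat"), min_words=200,
                   excluded_types=("forum", "special topic"),
                   top_n=5, match_rate=0.6, threshold=0.8):
    # a. Retrieval scope: time range, carriers, word count and type.
    since = now - timedelta(days=days)
    scoped = [d for d in corpus
              if since <= d.published <= now and d.carrier in carriers
              and d.word_count >= min_words and d.doc_type not in excluded_types]
    # b. Top-N core keywords; keep documents containing enough of them.
    core = sorted(original.weights, key=lambda t: (-original.weights[t], t))[:top_n]
    needed = math.ceil(match_rate * len(core))
    hits = [d for d in scoped if sum(t in d.weights for t in core) >= needed]
    # c. Descending order of keyword-document relevance (stand-in score).
    hits.sort(key=lambda d: sum(d.weights.get(t, 0.0) for t in core), reverse=True)
    if not hits:
        return []
    # d. Compare the top-ranked document with every other hit.
    reference = hits[0]
    matches = []
    for candidate in hits[1:]:
        score = cosine(reference.weights, candidate.weights)
        # e. Above the threshold, the two documents are judged to be the same.
        if score > threshold:
            matches.append((candidate.doc_id, score))
    return matches
```

In a deployment built on ElasticSearch, steps a to c would be expressed as a single filtered query and the engine's own relevance score would replace the summed weights. When the original document is itself in the library, it normally ranks first and so serves as the reference in step d.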

Claims (5)

1. A document similarity calculation method, characterized by:
S01, document decomposition: the original document and the target document are each segmented into words, yielding their respective token sets;
S02, preprocessing and feature weighting:
TF-IDF is used to compute a weight for each token and to extract the core keywords;
Word2vec is used to mine the degree of association between different tokens in a document, and semantic analysis is performed on each document;
S03, vector space model and cosine similarity algorithm:
The original document and the target document are each reduced to an N-dimensional vector whose components are keyword weights;
The document cosine similarity algorithm is based on the vector model: the cosine of the angle between the two vectors in the vector space is used to measure the similarity of the two documents. The cosine value lies between 0 and 1, and the larger it is, the more similar the two documents are.
2. The document similarity calculation method according to claim 1, characterized in that step S01 includes:
Data preparation: interference information in the document is cleaned by an ETL data cleaning system, and the document is structured and decomposed into minimum-unit structures;
Infrastructure: the ElasticSearch search engine serves as the base component on which a full-text index is built, and the fine-grained segmentation of a Chinese word segmentation library is used to create the index.
3. The document similarity calculation method according to claim 1, characterized in that in step S02, TF-IDF is used, together with the words in an inverse-document dictionary, to delete tokens that contribute little to identifying the text content of a document but occur very frequently.
4. A whole-network similar-document retrieval and tracking method, characterized by:
a. Set the retrieval scope;
b. Set the search conditions: extract the N core keywords with the highest TF-IDF weights in the original document and, at a given matching rate, search the whole library with the ES full-text search engine;
c. Sort by keyword-document relevance weight: the retrieved documents are sorted in descending order of the relevance weight between keywords and document;
d. Compare the document with the highest weight value one by one with every retrieved document, applying the document similarity calculation method of any one of claims 1 to 3 to compute the similarity of the two documents;
e. Check whether the similarity result is higher than N%: if it is, the two documents are judged to be the same; otherwise they are judged to be different documents.
5. The whole-network similar-document retrieval and tracking method according to claim 4, characterized in that step S01 includes setting the time range within which the retrieved documents were published, the carriers on which they were published, and the word count and type of the retrieved documents.
CN201610398902.4A 2016-06-07 2016-06-07 Document similarity calculation method and whole-network similar-document retrieval and tracking method Pending CN106095737A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610398902.4A CN106095737A (en) 2016-06-07 2016-06-07 Document similarity calculation method and whole-network similar-document retrieval and tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610398902.4A CN106095737A (en) 2016-06-07 2016-06-07 Document similarity calculation method and whole-network similar-document retrieval and tracking method

Publications (1)

Publication Number Publication Date
CN106095737A true CN106095737A (en) 2016-11-09

Family

ID=57227368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610398902.4A Pending CN106095737A (en) 2016-06-07 2016-06-07 Document similarity calculation method and whole-network similar-document retrieval and tracking method

Country Status (1)

Country Link
CN (1) CN106095737A (en)

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106933787A (en) * 2017-03-20 2017-07-07 上海智臻智能网络科技股份有限公司 Adjudicate the computational methods of document similarity, search device and computer equipment
CN107330820A (en) * 2017-08-28 2017-11-07 北京智诚律法科技有限公司 A kind of forecasting system and method for lawsuit result
CN107506204A (en) * 2017-09-30 2017-12-22 福建星瑞格软件有限公司 A kind of function reconstructing method of the code similarity-rough set based on the cosine law
CN107577774A (en) * 2017-09-08 2018-01-12 北京智诚律法科技有限公司 A kind of intelligent selection draws up a contract the system of lawyer
CN107749034A (en) * 2017-11-17 2018-03-02 浙江工业大学 A kind of safe friend recommendation method in social networks
CN107943762A (en) * 2017-11-24 2018-04-20 四川长虹电器股份有限公司 A kind of text similarity sort method based on ES search
CN108009599A (en) * 2017-12-27 2018-05-08 福建中金在线信息科技有限公司 A kind of original document determination methods, device, electronic equipment and storage medium
CN108206020A (en) * 2016-12-16 2018-06-26 北京智能管家科技有限公司 A kind of audio recognition method, device and terminal device
CN108241699A (en) * 2016-12-26 2018-07-03 百度在线网络技术(北京)有限公司 For the method and apparatus of pushed information
CN108932228A (en) * 2018-06-06 2018-12-04 武汉斗鱼网络科技有限公司 INDUSTRY OVERVIEW and subregion matching process, device, server and storage medium is broadcast live
CN109117474A (en) * 2018-06-25 2019-01-01 广州多益网络股份有限公司 Calculation method, device and the storage medium of statement similarity
CN109255018A (en) * 2018-08-31 2019-01-22 沈文策 A kind of method and apparatus identifying similar article
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109376231A (en) * 2018-09-29 2019-02-22 杭州凡闻科技有限公司 A kind of media hotspot tracking and system
CN109460415A (en) * 2018-11-26 2019-03-12 江苏科技大学 A kind of similar fixture search method based on N-dimensional vector included angle cosine
CN109508373A (en) * 2018-11-13 2019-03-22 深圳前海微众银行股份有限公司 Calculation method, equipment and the computer readable storage medium of enterprise's public opinion index
CN109582964A (en) * 2018-11-29 2019-04-05 天津工业大学 Intelligent legal advice auxiliary system based on marriage law judicial decision document big data
CN109614478A (en) * 2018-12-18 2019-04-12 北京中科闻歌科技股份有限公司 Construction method, key word matching method and the device of term vector model
CN109948121A (en) * 2017-12-20 2019-06-28 北京京东尚科信息技术有限公司 Article similarity method for digging, system, equipment and storage medium
CN109977196A (en) * 2019-03-29 2019-07-05 云南电网有限责任公司电力科学研究院 A kind of detection method and device of magnanimity document similarity
CN110532569A (en) * 2019-09-05 2019-12-03 浪潮软件股份有限公司 A kind of data collision method and system based on Chinese word segmentation
CN110674388A (en) * 2018-07-03 2020-01-10 百度在线网络技术(北京)有限公司 Mapping method and device for push item, storage medium and terminal equipment
CN110737839A (en) * 2019-10-22 2020-01-31 京东数字科技控股有限公司 Short text recommendation method, device, medium and electronic equipment
CN111104790A (en) * 2018-10-10 2020-05-05 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting key relation and computer readable medium
CN111104794A (en) * 2019-12-25 2020-05-05 同方知网(北京)技术有限公司 Text similarity matching method based on subject words
CN111144068A (en) * 2019-11-26 2020-05-12 方正璞华软件(武汉)股份有限公司 Similar arbitration case recommendation method and device
CN111444450A (en) * 2019-01-16 2020-07-24 北大方正集团有限公司 Method and device for determining reprinted data
CN111666428A (en) * 2020-06-04 2020-09-15 杭州凡闻科技有限公司 Network media propagation evaluation method
CN111767365A (en) * 2019-03-12 2020-10-13 株式会社理光 Document retrieval device and method
CN111859896A (en) * 2019-04-01 2020-10-30 长鑫存储技术有限公司 Formula document detection method and device, computer readable medium and electronic equipment
CN112163409A (en) * 2020-09-23 2021-01-01 平安直通咨询有限公司上海分公司 A similar document detection method, system, terminal device and computer-readable storage medium
CN112270183A (en) * 2020-10-21 2021-01-26 北京钛氪新媒体科技有限公司 News spreading effect monitoring system based on text
CN112949304A (en) * 2021-03-24 2021-06-11 中新国际联合研究院 Construction case knowledge reuse query method and device
CN113254634A (en) * 2021-02-04 2021-08-13 天津德尔塔科技有限公司 File classification method and system based on phase space
WO2021253873A1 (en) * 2020-06-15 2021-12-23 语联网(武汉)信息技术有限公司 Method and apparatus for retrieving similar document
CN114077834A (en) * 2020-08-13 2022-02-22 北京有限元科技有限公司 Method, device and storage medium for determining similar texts
CN114330301A (en) * 2021-12-29 2022-04-12 中电福富信息科技有限公司 Atomic capability matching method based on text similarity improvement
CN114418016A (en) * 2022-01-24 2022-04-29 支付宝(杭州)信息技术有限公司 An Efficient Short Text Similarity Determination Method and Device
CN115686432A (en) * 2022-12-30 2023-02-03 药融云数字科技(成都)有限公司 Document evaluation method for retrieval sorting, storage medium and terminal
CN117910479A (en) * 2024-03-19 2024-04-19 湖南蚁坊软件股份有限公司 Method, device, equipment and medium for judging aggregated news
CN118939749A (en) * 2024-07-11 2024-11-12 中汇智(山东)高新技术发展有限公司 A method for in-depth literature retrieval and analysis
CN114418016B (en) * 2022-01-24 2025-10-17 支付宝(杭州)信息技术有限公司 Efficient short text similarity determination method and device

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7440947B2 (en) * 2004-11-12 2008-10-21 Fuji Xerox Co., Ltd. System and method for identifying query-relevant keywords in documents with latent semantic analysis
CN1790321A (en) * 2005-10-28 2006-06-21 北大方正集团有限公司 Fast similarity-based retrieval method for mass text
CN101055580A (en) * 2006-04-13 2007-10-17 Lg电子株式会社 System, method and user interface for retrieving documents
CN101980196A (en) * 2010-10-25 2011-02-23 中国农业大学 Article comparison method and device
CN102567364A (en) * 2010-12-24 2012-07-11 鸿富锦精密工业(深圳)有限公司 File search system and method
CN103294693A (en) * 2012-02-27 2013-09-11 华为技术有限公司 Searching method, server and system
CN104102626A (en) * 2014-07-07 2014-10-15 厦门推特信息科技有限公司 Method for computing semantic similarities among short texts
CN104679728A (en) * 2015-02-06 2015-06-03 中国农业大学 A Text Similarity Detection Method
CN105095430A (en) * 2015-07-22 2015-11-25 深圳证券信息有限公司 Method and device for setting up word network and extracting keywords
CN105488151A (en) * 2015-11-27 2016-04-13 小米科技有限责任公司 Reference document recommendation method and apparatus

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
于天恩: "《Lucene搜索引擎开发权威经典》", 31 October 2008, 中国铁道出版社 *
吉志薇: "改进的TF-IDF算法在作品抄袭判定中的应用", 《文教资料》 *
庄毅: "《面向互联网的多媒体大数据信息高效查询处理》", 1 June 2015 *
潘华,项同德: "《数据仓库与数据挖掘原理、工具及应用》", 31 December 2007, 中国电力出版社 *

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108206020A (en) * 2016-12-16 2018-06-26 北京智能管家科技有限公司 A kind of audio recognition method, device and terminal device
CN108241699A (en) * 2016-12-26 2018-07-03 百度在线网络技术(北京)有限公司 For the method and apparatus of pushed information
CN108241699B (en) * 2016-12-26 2022-03-11 百度在线网络技术(北京)有限公司 Method and device for pushing information
CN106933787A (en) * 2017-03-20 2017-07-07 上海智臻智能网络科技股份有限公司 Adjudicate the computational methods of document similarity, search device and computer equipment
CN107330820A (en) * 2017-08-28 2017-11-07 北京智诚律法科技有限公司 A kind of forecasting system and method for lawsuit result
CN107577774A (en) * 2017-09-08 2018-01-12 北京智诚律法科技有限公司 A kind of intelligent selection draws up a contract the system of lawyer
CN107506204A (en) * 2017-09-30 2017-12-22 福建星瑞格软件有限公司 A kind of function reconstructing method of the code similarity-rough set based on the cosine law
CN107506204B (en) * 2017-09-30 2020-08-25 福建星瑞格软件有限公司 Code similarity comparison function reconstruction method based on cosine theorem
CN107749034A (en) * 2017-11-17 2018-03-02 浙江工业大学 A kind of safe friend recommendation method in social networks
CN107943762A (en) * 2017-11-24 2018-04-20 四川长虹电器股份有限公司 A kind of text similarity sort method based on ES search
CN109948121A (en) * 2017-12-20 2019-06-28 北京京东尚科信息技术有限公司 Article similarity method for digging, system, equipment and storage medium
CN108009599A (en) * 2017-12-27 2018-05-08 福建中金在线信息科技有限公司 A kind of original document determination methods, device, electronic equipment and storage medium
CN108932228A (en) * 2018-06-06 2018-12-04 武汉斗鱼网络科技有限公司 INDUSTRY OVERVIEW and subregion matching process, device, server and storage medium is broadcast live
CN108932228B (en) * 2018-06-06 2023-08-08 广东南方报业移动媒体有限公司 Live broadcast industry news and partition matching method and device, server and storage medium
CN109117474A (en) * 2018-06-25 2019-01-01 广州多益网络股份有限公司 Calculation method, device and the storage medium of statement similarity
CN110674388A (en) * 2018-07-03 2020-01-10 百度在线网络技术(北京)有限公司 Mapping method and device for push item, storage medium and terminal equipment
CN109255018A (en) * 2018-08-31 2019-01-22 沈文策 A kind of method and apparatus identifying similar article
CN109271626B (en) * 2018-08-31 2023-09-26 北京工业大学 Text semantic analysis method
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109376231A (en) * 2018-09-29 2019-02-22 杭州凡闻科技有限公司 A kind of media hotspot tracking and system
CN111104790B (en) * 2018-10-10 2024-03-22 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer readable medium for extracting key relation
CN111104790A (en) * 2018-10-10 2020-05-05 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting key relation and computer readable medium
CN109508373A (en) * 2018-11-13 2019-03-22 深圳前海微众银行股份有限公司 Calculation method, equipment and the computer readable storage medium of enterprise's public opinion index
CN109508373B (en) * 2018-11-13 2021-08-06 深圳前海微众银行股份有限公司 Calculation method, device and computer-readable storage medium for enterprise public opinion index
CN109460415A (en) * 2018-11-26 2019-03-12 江苏科技大学 A kind of similar fixture search method based on N-dimensional vector included angle cosine
CN109460415B (en) * 2018-11-26 2021-09-21 江苏科技大学 Similar fixture retrieval method based on N-dimensional vector included angle cosine
CN109582964A (en) * 2018-11-29 2019-04-05 天津工业大学 Intelligent legal advice auxiliary system based on marriage law judicial decision document big data
CN109614478B (en) * 2018-12-18 2020-12-08 北京中科闻歌科技股份有限公司 Word vector model construction method, keyword matching method and device
CN109614478A (en) * 2018-12-18 2019-04-12 北京中科闻歌科技股份有限公司 Construction method, key word matching method and the device of term vector model
CN111444450A (en) * 2019-01-16 2020-07-24 北大方正集团有限公司 Method and device for determining reprinted data
CN111767365A (en) * 2019-03-12 2020-10-13 株式会社理光 Document retrieval device and method
CN109977196A (en) * 2019-03-29 2019-07-05 云南电网有限责任公司电力科学研究院 A kind of detection method and device of magnanimity document similarity
CN111859896B (en) * 2019-04-01 2022-11-25 长鑫存储技术有限公司 Formula document detection method and device, computer readable medium and electronic equipment
CN111859896A (en) * 2019-04-01 2020-10-30 长鑫存储技术有限公司 Formula document detection method and device, computer readable medium and electronic equipment
CN110532569A (en) * 2019-09-05 2019-12-03 浪潮软件股份有限公司 A kind of data collision method and system based on Chinese word segmentation
CN110532569B (en) * 2019-09-05 2023-03-28 浪潮软件股份有限公司 Data collision method and system based on Chinese word segmentation
CN110737839A (en) * 2019-10-22 2020-01-31 京东数字科技控股有限公司 Short text recommendation method, device, medium and electronic equipment
CN111144068A (en) * 2019-11-26 2020-05-12 方正璞华软件(武汉)股份有限公司 Similar arbitration case recommendation method and device
CN111104794A (en) * 2019-12-25 2020-05-05 同方知网(北京)技术有限公司 Text similarity matching method based on subject words
CN111104794B (en) * 2019-12-25 2023-07-04 同方知网数字出版技术股份有限公司 Text similarity matching method based on subject term
CN111666428A (en) * 2020-06-04 2020-09-15 杭州凡闻科技有限公司 Network media propagation evaluation method
CN111666428B (en) * 2020-06-04 2023-08-08 杭州凡闻科技有限公司 Network media propagation force evaluation method
WO2021253873A1 (en) * 2020-06-15 2021-12-23 语联网(武汉)信息技术有限公司 Method and apparatus for retrieving similar document
CN114077834A (en) * 2020-08-13 2022-02-22 北京有限元科技有限公司 Method, device and storage medium for determining similar texts
CN112163409A (en) * 2020-09-23 2021-01-01 平安直通咨询有限公司上海分公司 A similar document detection method, system, terminal device and computer-readable storage medium
CN112270183B (en) * 2020-10-21 2024-03-19 北京钛氪新媒体科技有限公司 News propagation effect monitoring system based on text
CN112270183A (en) * 2020-10-21 2021-01-26 北京钛氪新媒体科技有限公司 News spreading effect monitoring system based on text
CN113254634A (en) * 2021-02-04 2021-08-13 天津德尔塔科技有限公司 File classification method and system based on phase space
CN112949304A (en) * 2021-03-24 2021-06-11 中新国际联合研究院 Construction case knowledge reuse query method and device
CN114330301A (en) * 2021-12-29 2022-04-12 中电福富信息科技有限公司 Atomic capability matching method based on text similarity improvement
CN114418016A (en) * 2022-01-24 2022-04-29 支付宝(杭州)信息技术有限公司 An Efficient Short Text Similarity Determination Method and Device
CN114418016B (en) * 2022-01-24 2025-10-17 支付宝(杭州)信息技术有限公司 Efficient short text similarity determination method and device
CN115686432A (en) * 2022-12-30 2023-02-03 药融云数字科技(成都)有限公司 Document evaluation method for retrieval sorting, storage medium and terminal
CN117910479A (en) * 2024-03-19 2024-04-19 湖南蚁坊软件股份有限公司 Method, device, equipment and medium for judging aggregated news
CN117910479B (en) * 2024-03-19 2024-06-04 湖南蚁坊软件股份有限公司 Method, device, equipment and medium for judging aggregated news
CN118939749A (en) * 2024-07-11 2024-11-12 中汇智(山东)高新技术发展有限公司 A method for in-depth literature retrieval and analysis

Similar Documents

Publication Publication Date Title
CN106095737A (en) Documents Similarity computational methods and similar document the whole network retrieval tracking
CN105488024B (en) The abstracting method and device of Web page subject sentence
CN105488196B (en) An automatic mining system for hot topics based on interconnected corpus
CN101944099B (en) Method for automatically classifying text documents by utilizing body
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN103377226B (en) A kind of intelligent search method and system thereof
CN110543595B (en) In-station searching system and method
CN109388743B (en) Language model determining method and device
CN103617157A (en) Text similarity calculation method based on semantics
CN101408883A (en) Method for collecting network public feelings viewpoint
CN104391835A (en) Method and device for selecting feature words in texts
Han et al. HIT at TREC 2012 Microblog Track.
CN104484380A (en) Personalized search method and personalized search device
CN110309251B (en) Text data processing method, device and computer readable storage medium
CN112862567B (en) Method and system for recommending exhibits in online exhibition
CN103207864A (en) Online novel content similarity comparison method
CN106682149A (en) Label automatic generation method based on meta-search engine
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
CN105608075A (en) Related knowledge point acquisition method and system
Li et al. A hybrid model for experts finding in community question answering
CN103092838B (en) A kind of method and device for obtaining English words
Hong et al. Project Rank: An internet topic evaluation model based on latent dirichlet allocation
Cui et al. Personalized microblog recommendation using sentimental features
CN102033961A (en) Open-type knowledge sharing platform and polysemous word showing method thereof
Huang et al. Study on multimedia network Weibo situational awareness model and emotional algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161109
