
CN108959413A - Topical web page crawling method and topical crawler system - Google Patents

Topical web page crawling method and topical crawler system

Info

Publication number
CN108959413A
Authority
CN
China
Prior art keywords
link
target
correlation
degree
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810581858.XA
Other languages
Chinese (zh)
Other versions
CN108959413B (en
Inventor
彭涛
包铁
徐凯旋
张雪松
王上
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN201810581858.XA priority Critical patent/CN108959413B/en
Publication of CN108959413A publication Critical patent/CN108959413A/en
Application granted granted Critical
Publication of CN108959413B publication Critical patent/CN108959413B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a topical web page crawling method and a topical crawler system. The method includes: obtaining an uncrawled link from a first set of links to be crawled, which includes seed links; determining a first relevance and a second relevance for the target web page corresponding to the obtained link, these two values measuring how relevant the target text content and the target link in the target web page are to a specified topic; determining a temperature value of the target web page according to the first relevance and the second relevance, and storing the content to be displayed of the target web page; if the temperature value of the target web page is greater than or equal to a preset temperature value, putting the target link into a second set of links to be crawled; and, if the first set of links to be crawled contains no link that has not yet been obtained, obtaining from the second set of links to be crawled the uncrawled link most relevant to the specified topic and continuing to crawl. The application allows a user to obtain a large number of web pages related to the specified topic from the network.

Description

Topical web page crawling method and topical crawler system

Technical field

The present invention relates to the technical field of web page crawling, and in particular to a topical web page crawling method and a topical crawler system.

Background

With the rapid development of the Internet, people have entered an era of information explosion, in which information of every kind pervades all aspects of life. Search engines emerged to make information easier to obtain: through them, people can quickly retrieve information from many web pages, which has made obtaining information more efficient. The search engines in common use today, such as Google and Baidu, are general-purpose search engines that try to cover all resources on the Internet. People's needs vary, however, and sometimes a user wants to obtain from the network the web pages related to a specified topic. General-purpose search engines cannot meet this kind of personalized need.

Summary of the invention

In view of this, the present invention provides a topical web page crawling method and a topical crawler system, with which a user can obtain web pages related to a specified topic. The technical solution is as follows:

A topical web page crawling method, comprising:

obtaining an uncrawled link from a first set of links to be crawled, the first set of links to be crawled including pre-acquired seed links;

determining a first relevance and a second relevance for the target web page corresponding to the obtained link, the first relevance being the relevance of a target link in the target web page to a specified topic, and the second relevance being the relevance of the target text content in the target web page to the specified topic;

determining a temperature value of the target web page according to the first relevance and the second relevance, and storing the content to be displayed of the target web page, wherein the temperature value characterizes the relevance of the target web page to the specified topic;

if the temperature value of the target web page is greater than or equal to a first preset temperature value, putting the target link into a second set of links to be crawled;

if the first set of links to be crawled contains no link that has not yet been obtained, obtaining from the second set of links to be crawled the uncrawled link most relevant to the specified topic, and then performing the step of determining the first relevance and the second relevance for the target web page corresponding to the obtained link.

Wherein determining the first relevance and the second relevance for the target web page corresponding to the obtained link comprises:

crawling the target web page from the network according to the obtained link;

extracting the target text content, the target link, and the anchor text corresponding to the target link from the target web page;

determining, based on the anchor text, the relevance of the corresponding target link to the specified topic as the first relevance, and determining the relevance of the target text content to the specified topic as the second relevance.

Wherein determining the relevance of the target text content to the specified topic comprises:

segmenting the target text content into words with a bidirectional long short-term memory conditional random field (BiLSTM-CRF) model, to obtain a plurality of words;

determining the topic of the target text content from the plurality of words using a pre-established topic determination module;

determining the relevance of the topic of the target text content to the specified topic as the relevance of the target text content to the specified topic.

Wherein determining, based on the anchor text, the relevance of the corresponding target link to the specified topic comprises:

converting the characters in the anchor text into character vectors;

inputting the character vectors into a pre-established topic prediction model and obtaining the prediction result output by the topic prediction model, wherein the prediction result indicates the topic of the anchor text, and the topic prediction model is trained using, as training samples, the character vectors corresponding to training anchor texts labeled with topics;

determining the relevance of the topic of the anchor text to the specified topic as the relevance of the target link corresponding to the anchor text to the specified topic.

Wherein the process of pre-establishing the topic prediction model comprises:

obtaining a plurality of anchor texts labeled with topics to form a set of training anchor texts;

converting each character of each training anchor text in the set into a character vector in turn, to obtain the set of character vectors corresponding to that training anchor text, wherein the distance between different character vectors characterizes the relatedness of the corresponding characters;

taking the set of character vectors corresponding to the training anchor text as input, training a bidirectional recurrent neural network, and using the trained bidirectional recurrent neural network as the topic prediction model.

Preferably, the topical web page crawling method further comprises:

deleting from the second set of links to be crawled the links whose relevance to the specified topic is less than a preset relevance and whose corresponding temperature value is less than a second preset temperature value.
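The pruning rule can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `QueuedLink` record, the `prune` function, and the reading that a link is deleted only when it falls below both presets are not taken from the patent text, which does not say how the temperature value is associated with a queued link.

```python
from dataclasses import dataclass


@dataclass
class QueuedLink:
    """Hypothetical record for an entry in the second set of links to be crawled."""
    url: str
    relevance: float    # relevance of the link to the specified topic
    temperature: float  # temperature value associated with the link (assumed)


def prune(links, preset_relevance, second_preset_temperature):
    """Return the links that survive pruning.

    A link is dropped only when BOTH values are below their presets;
    a link that passes either check is kept.
    """
    return [
        link
        for link in links
        if not (
            link.relevance < preset_relevance
            and link.temperature < second_preset_temperature
        )
    ]
```

Running the pruning step periodically keeps the second set from growing without bound while leaving any link that still shows promise on at least one of the two measures.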

A topical crawler system, comprising: a link acquisition module, a relevance determination module, a temperature value determination module, and a link processing module;

the link acquisition module is configured to obtain an uncrawled link from a first set of links to be crawled, the first set of links to be crawled including pre-acquired seed links;

the relevance determination module is configured to determine a first relevance and a second relevance for the target web page corresponding to the obtained link, the first relevance being the relevance of a target link in the target web page to a specified topic, and the second relevance being the relevance of the target text content in the target web page to the specified topic;

the temperature value determination module is configured to determine a temperature value of the target web page according to the first relevance and the second relevance, and to store the content to be displayed of the target web page, wherein the temperature value characterizes the relevance of the target web page to the specified topic;

the link processing module is configured to put the target link into a second set of links to be crawled when the temperature value of the target web page is greater than or equal to a first preset temperature value;

the link acquisition module is further configured to, when the first set of links to be crawled contains no link that has not yet been obtained, obtain from the second set of links to be crawled the uncrawled link most relevant to the specified topic, and then trigger the relevance determination module to determine the first relevance and the second relevance for the target web page corresponding to the obtained link.

Wherein the relevance determination module comprises: a web page crawling submodule, a data extraction submodule, a first relevance determination submodule, and a second relevance determination submodule;

the web page crawling submodule is configured to crawl the target web page from the network according to the obtained link;

the data extraction submodule is configured to extract the target text content, the target link, and the anchor text corresponding to the target link from the target web page;

the first relevance determination submodule is configured to determine, based on the anchor text, the relevance of the corresponding target link to the specified topic as the first relevance;

the second relevance determination submodule is configured to determine the relevance of the target text content to the specified topic as the second relevance.

Wherein the first relevance determination submodule comprises: a conversion submodule, a prediction submodule, and a determination submodule;

the conversion submodule is configured to convert the characters in the anchor text into character vectors;

the prediction submodule is configured to input the character vectors into a pre-established topic prediction model and obtain the prediction result output by the topic prediction model, wherein the prediction result indicates the topic of the anchor text, and the topic prediction model is trained using, as training samples, the character vectors corresponding to training anchor texts labeled with topics;

the determination submodule is configured to determine the relevance of the topic of the anchor text to the specified topic as the relevance of the target link corresponding to the anchor text to the specified topic.

Preferably, the topical crawler system further comprises: a link deletion module;

the link deletion module is configured to delete from the second set of links to be crawled the links whose relevance to the specified topic is less than a preset relevance and whose corresponding temperature value is less than a second preset temperature value.

The above technical solution has the following beneficial effects:

The topical web page crawling method and topical crawler system provided by the present invention first obtain an uncrawled link from the first set of links to be crawled. They then determine, for the target web page corresponding to the obtained link, the relevance of the target text content to the specified topic and the relevance of the target link to the specified topic. Next, they determine the temperature value of the target web page from these relevance values and store the content to be displayed of the target web page. If the temperature value of the target web page is greater than or equal to the first preset temperature value, the target link is put into the second set of links to be crawled. When the first set of links to be crawled contains no link that has not yet been obtained, the uncrawled link most relevant to the specified topic is obtained from the second set of links to be crawled and crawling continues. The method and system thus allow a user to obtain a large number of web pages related to the specified topic from the network, giving a better user experience.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a schematic flowchart of the topical web page crawling method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of the process, in the topical web page crawling method provided by an embodiment of the present invention, of determining the first relevance and the second relevance for the target web page corresponding to the obtained link;

Fig. 3 is a schematic flowchart of the process, in the topical web page crawling method provided by an embodiment of the present invention, of determining the relevance of the corresponding target link to the specified topic based on the anchor text;

Fig. 4 is a schematic diagram of character vectors provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the topic prediction model provided by an embodiment of the present invention;

Fig. 6 is a schematic diagram, provided by an embodiment of the present invention, of pooling an n×2m matrix Z into a vector P of dimension 2m;

Fig. 7 is a schematic diagram of segmenting a sentence into words with the bidirectional long short-term memory conditional random field model provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of the topical crawler system provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a topical web page crawling method for crawling web pages related to a specified topic. Referring to Fig. 1, which shows a schematic flowchart of the method, the method may include:

Step S101: obtain an uncrawled link from the first set of links to be crawled.

The first set of links to be crawled includes pre-acquired seed links.

Seed links are the starting points of the crawl; good seed links lead quickly to web pages related to the topic.

Step S102: determine the first relevance and the second relevance for the target web page corresponding to the obtained link. The first relevance is the relevance of a target link in the target web page to the specified topic, and the second relevance is the relevance of the target text content in the target web page to the specified topic.

The target text content of the target web page is the content of the text the page contains, such as an article or a news item. A target link in the target web page is a URL present in the page, through which the corresponding page can be reached.

Step S103: determine the temperature value of the target web page according to the first relevance and the second relevance, and store the content to be displayed of the target web page.

The temperature value of the target web page characterizes its relevance to the specified topic: the higher the temperature value, the more relevant the target web page is to the specified topic; the lower the temperature value, the less relevant it is.

The content to be displayed of the target web page may include the title of the target web page, its text content, and its link.

A target web page contains many HTML tags and other code that are meaningless for display, because users are not interested in them, and storing them would waste storage space. Therefore, when storing information about the target web page, it is enough to store its title, its text content, and its link.

In one possible implementation, the title, text content, and link of the target web page can be stored in a file in the following format:

<title>Title of the web page</title>

<body>Text content of the web page</body>

<url>Link of the web page</url>

When storing in a database, three fields can be set: title, body, url.
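The storage step can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: the choice of SQLite, the table name `pages`, the use of `url` as primary key, and the helper names `open_store`, `save_page`, and `to_file_record` are all assumptions made for illustration. Only the three fields (title, body, url) and the tag layout of the file record come from the text.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with the three fields named in the text."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "title TEXT, body TEXT, url TEXT PRIMARY KEY)"
    )
    return conn


def save_page(conn, title, body, url):
    """Store one page; a page crawled again replaces its earlier row."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (title, body, url) VALUES (?, ?, ?)",
        (title, body, url),
    )
    conn.commit()


def to_file_record(title, body, url):
    """Render one page in the tagged file format shown above."""
    return f"<title>{title}</title>\n<body>{body}</body>\n<url>{url}</url>\n"
```

Keying the table on `url` is one way to avoid storing the same page twice; the patent text does not state how duplicates are handled.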

Step S104: if the temperature value of the target web page is greater than or equal to the first preset temperature value, put the target link into the second set of links to be crawled.

Note that a temperature value greater than or equal to the first preset temperature value indicates that the target web page is highly relevant to the specified topic, so the target links it contains are likely to be highly relevant as well. The target links contained in the target web page are therefore put into the second set of links to be crawled.

The links in the second set of links to be crawled are not seed links; they are links extracted from web pages obtained on the basis of seed links. Such web pages include those obtained directly or indirectly through seed links.

Note that if the temperature value of the target web page is less than the first preset temperature value, the target web page has low relevance to the specified topic, and the links it contains are accordingly also of low relevance, so the links in the target web page are not crawled further.

Step S105: if the first set of links to be crawled contains no link that has not yet been obtained, obtain from the second set of links to be crawled the uncrawled link most relevant to the specified topic, and then perform step S102.

In one possible implementation, obtaining the uncrawled link most relevant to the specified topic from the second set of links to be crawled may include: sorting the links in the second set by the relevance of each uncrawled link to the specified topic, identifying the most relevant link from the sorted order, and obtaining that link from the second set.

Note that if the first set of links to be crawled still contains links that have not been obtained, uncrawled links are taken from the first set first, and step S102 and the subsequent steps are performed. Only after all links in the first set have been crawled is the uncrawled link most relevant to the specified topic taken from the second set and crawled. When the second set of links to be crawled contains no link that has not been obtained, the topic-based web page crawling process ends.
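The control flow of the steps above can be sketched in Python. This is a sketch under stated assumptions rather than the patented implementation: `analyze_page` and `temperature_of` are hypothetical callbacks standing in for the relevance models and for the temperature computation, whose formula this passage does not give, and a heap replaces the explicit sort as an equivalent way to pick the most relevant link.

```python
import heapq
from collections import deque


def crawl(seed_links, analyze_page, temperature_of, first_preset_temperature):
    """Crawl seeds first, then the most relevant queued links.

    analyze_page(url) -> (content, [(target_link, link_relevance), ...], text_relevance)
    temperature_of(link_relevances, text_relevance) -> float
    Both callbacks are hypothetical and supplied by the caller.
    """
    first_set = deque(seed_links)  # first set: seed links, taken in order
    second_set = []                # second set: max-heap via negated relevance
    seen = set(seed_links)         # links already queued or crawled
    stored = []                    # (url, content, temperature) for every crawled page
    counter = 0                    # tie-breaker so equal relevances pop in insertion order

    while first_set or second_set:
        if first_set:
            url = first_set.popleft()
        else:
            _, _, url = heapq.heappop(second_set)

        content, links, text_relevance = analyze_page(url)
        temperature = temperature_of([rel for _, rel in links], text_relevance)
        stored.append((url, content, temperature))  # content is stored for every page

        # Only pages that are hot enough contribute their links.
        if temperature >= first_preset_temperature:
            for link, relevance in links:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(second_set, (-relevance, counter, link))
                    counter += 1
    return stored
```

The loop ends when both sets are empty, which matches the termination condition described above. A real crawler would also need fetch error handling and politeness delays, which are outside the scope of this sketch.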

The Korean-oriented web page crawling method provided by this embodiment of the present invention first obtains an uncrawled link from the first set of links to be crawled. It then determines, for the target web page corresponding to the obtained link, the relevance of the target text content to the specified topic and the relevance of the target link to the specified topic. Next, it determines the temperature value of the target web page from these relevance values and stores the content to be displayed of the target web page. If the temperature value of the target web page is greater than or equal to the first preset temperature value, the target link is put into the second set of links to be crawled. When the first set of links to be crawled contains no link that has not yet been obtained, the uncrawled link most relevant to the specified topic is obtained from the second set of links to be crawled and crawling continues. The method thus allows a user to obtain a large number of web pages related to the specified topic from the network, giving a better user experience.

Note that the topical web page crawling method provided by the above embodiment is applicable to crawling topical web pages in many languages, such as Chinese, Korean, and Japanese.

Korean is the main language of China's ethnic Korean population and the official language of South Korea and North Korea, so it is a worthwhile subject of study, for the following reasons. First, ethnic Koreans in China speak Korean as their mother tongue and may use Korean when searching for information, so a search engine oriented to specific content should be provided to make their lives easier. Second, South Korea and North Korea are both neighbors of China and have had many ties with it since ancient times; as the international community keeps changing, people should pay attention not only to the development of their own country but also to the development and situation of neighboring countries. For these reasons, this embodiment uses Korean as the example when describing the steps of the method provided by the above embodiment of the present invention. Note that when the method is applied to Korean, multiple links (for example, 163) can be selected in advance from the NAVER website (https://www.naver.com/) as seed links.

Step S102 of the above embodiment, determining the first relevance and the second relevance for the target web page corresponding to the obtained link, is described below.

Referring to Fig. 2, which shows a schematic flowchart of the process of determining the first relevance and the second relevance for the target web page corresponding to the obtained link, the process may include:

Step S201: crawl the corresponding target web page from the network according to the obtained link.

Step S202: extract the target text content, the target link, and the anchor text corresponding to the target link from the target web page.

Anchor text is the text associated with a link URL in a web page. It is generally contained in an a tag, organized as follows:

<a href="URL">anchor text</a>

The anchor text corresponding to a target link contains a brief summary of the target web page and can be used to predict the topic of the target link. The process of predicting the topic of a target link from its anchor text is described later.
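Extracting links and their anchor text from a page can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only: the class name `AnchorExtractor`, the helper `extract_anchors`, and the decision to resolve relative links against a base URL are assumptions, and the patent does not say which parser is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) still belongs to the anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html, base_url):
    """Return every link in the page together with its anchor text."""
    parser = AnchorExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.anchors
```

Anchors without an `href` attribute are skipped, since they do not lead to another page and so cannot be crawled.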

Step S203: determine, based on the anchor text, the relevance of the corresponding target link to the specified topic as the first relevance, and determine the relevance of the target text content to the specified topic as the second relevance.

The process of determining the relevance of the corresponding target link to the specified topic based on the anchor text is described below. Referring to Fig. 3, which shows a schematic flowchart of this process, the process may include:

Step S301: convert the characters in the anchor text into character vectors.

Regarding how characters are vectorized: the prior art includes one-hot encoding, which represents each character with 0s and 1s, but the resulting vectors are sparse and high-dimensional. For this reason, the embodiment of the present invention may use word embedding to represent characters as vectors, using the word2vec model to extend the word-vector method to the representation of characters and obtain character vectors of a given dimension.

The word2vec model generates a representation of a word from the relationship between the word and its context, so the resulting representation carries some semantic information. In the same way, character representations are obtained by training the model on the relationship between characters and their context. Korean-language wiki articles can be used for training, and the resulting vectorized representation resembles Fig. 4.

步骤S302:将字向量输入预先建立的主题预测模型,获得主题预测模型输出的预测结果。Step S302: input the word vector into the pre-established topic prediction model, and obtain the prediction result output by the topic prediction model.

其中,预测结果用于指示锚文本的主题,主题预测模型以标注有主题的训练锚文本对应的字向量为训练样本进行训练得到。The prediction result is used to indicate the topic of the anchor text, and the topic prediction model is obtained by training the word vector corresponding to the training anchor text labeled with the topic as a training sample.

预先建立主题预测模型的过程可以包括:获取多个标注了主题的锚文本,组成训练锚文本集合;将训练锚文本集合中的训练锚文本中的每个字依次转换为字向量,得到与训练文本对应的字向量集合,其中,不同字向量之间的距离表征其对应的文字之间的关联性;将训练文本对应的字向量集合作为输入,训练双向循环神经网络,将训练得到的双向循环神经网络作为主题预测模型。The process of pre-establishing the topic prediction model may include: obtaining multiple anchor texts labeled with topics to form a training anchor text set; converting each word in the training anchor text in the training anchor text set into a word vector in turn, and obtaining and training The set of word vectors corresponding to the text, where the distance between different word vectors represents the relevance between the corresponding words; the set of word vectors corresponding to the training text is used as input to train the two-way cyclic neural network, and the two-way cyclic neural network obtained from the training is Neural Networks as Topic Prediction Models.

Referring to FIG. 5, which shows a schematic diagram of the topic prediction model: in this embodiment, the bidirectional recurrent neural network may be a bidirectional LSTM network. To make full use of the network's output, the model concatenates the forward output $h_f$ and the backward output $h_b$ for each input character to obtain $h_i$. If the dimension of $h_f$ is the number of hidden units $m$, the dimension of $h_i$ is $2m$:

$$h_i = [h_f, h_b] \tag{1}$$

For an anchor text of length $n$, the outputs computed and concatenated by the bidirectional LSTM network form an $n \times 2m$ matrix $Z$:

$$Z = \{h_0, h_1, h_2, \ldots, h_{n-1}\} \tag{2}$$

For the resulting matrix $Z$, the pooling method from convolutional networks is used to extract and compress the features in $Z$, yielding a vector $P$ of dimension $2m$, as shown in FIG. 6.

A pooling method processes a region of a matrix with a specific operation, compressing and transforming the matrix and extracting useful features for subsequent analysis. In one possible implementation, max pooling may be applied to each row of the matrix, that is:

$$k_x = \max(h_{ij}), \quad 0 \le j < n \tag{3}$$
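Equations (1) to (3) can be traced on a toy input with plain Python lists. The numbers, the sizes (n = 3 characters, m = 2 hidden units), and the function names below are made up for illustration; the forward and backward outputs would in practice come from the bidirectional LSTM. This sketch stores $Z$ with one row per character, so the maximum over the $n$ positions runs down each of the $2m$ feature columns.

```python
def concat_directions(forward, backward):
    """Equation (1): h_i = [h_f, h_b] for every position i."""
    return [hf + hb for hf, hb in zip(forward, backward)]


def max_pool(Z):
    """Equation (3): keep the maximum of each feature over all n positions."""
    return [max(feature_values) for feature_values in zip(*Z)]


# Made-up LSTM outputs: n = 3 characters, m = 2 hidden units per direction.
forward = [[0.1, 0.5], [0.4, 0.2], [0.3, 0.9]]
backward = [[0.7, 0.0], [0.6, 0.8], [0.2, 0.1]]

Z = concat_directions(forward, backward)  # equation (2): n x 2m = 3 x 4
P = max_pool(Z)                           # pooled vector of length 2m = 4
```

Whatever the length of the anchor text, `P` always has $2m$ entries, which is what lets a fixed-size layer follow the pooling step.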

Pooling is followed by a fully connected network containing h units; through the network's computation, the output is converted into probability values for the corresponding topics.
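The conversion from the pooled vector to topic probabilities can be sketched as follows. The source does not name the normalization, so the softmax here is an assumption; the weights, the bias, and the choice of two topics are made-up values, and for brevity this sketch maps the pooled vector straight to one score per topic rather than through a layer of h units.

```python
import math


def dense(x, weights, bias):
    """One fully connected layer: a weighted sum plus bias for each output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (assumed normalization)."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


P = [0.4, 0.9, 0.7, 0.8]                 # pooled vector of dimension 2m = 4
weights = [[1.0, 0.0, 0.0, 0.0],         # hypothetical weights for topic 0
           [0.0, 1.0, 0.0, 0.0]]         # hypothetical weights for topic 1
bias = [0.0, 0.0]

probabilities = softmax(dense(P, weights, bias))
```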

Step S303: Determine the relevance between the topic of the anchor text and the specified topic, and use it as the relevance between the target link corresponding to the anchor text and the specified topic.

When predicting the topic of an anchor text, this embodiment uses the entire content of the anchor text, which excludes interference from other information, and represents it with character vectors, which carry more semantic information and are better suited to predicting the topic of the anchor text. The recurrent neural network used by the topic prediction model is better at modeling character sequence information; moreover, because it is bidirectional, it captures information from both the forward and backward directions of an anchor text, so more information is available for judging the topic.

The following describes the process of determining the relevance between the target text content and the specified topic.

In one possible implementation, the process of determining the relevance between the target text content and the specified topic may include: segmenting the target text content with a bidirectional long short-term memory conditional random field (bidirectional LSTM-CRF) model to obtain multiple words; determining the topic of the target text content using the multiple words and a pre-established topic determination model; and determining the relevance between the topic of the target text content and the specified topic as the relevance between the target text content and the specified topic.

Once the character representations described above are available, the character vectors can be used for Korean word segmentation. When segmenting the target text content, this embodiment treats segmentation as a process of building words from characters and assigns each character one of four tags: B, the character is at the beginning of a word; M, the character is inside a word; S, the character is a word by itself; E, the character is at the end of a word. This embodiment combines the long short-term memory (LSTM) unit of recurrent neural networks with a conditional random field (CRF) to form a bidirectional LSTM-CRF model for Korean word segmentation. Referring to FIG. 7, which shows a schematic diagram of segmenting a Korean sentence with this model: a sentence or character sequence is taken as input; character vectors are obtained by lookup in the character vector table and fed into the bidirectional LSTM network; and once the output of the bidirectional LSTM network is obtained, the CRF decodes the relationship between characters and tags in the model to produce the final segmentation result.
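The B/M/E/S tagging scheme maps back to words in a straightforward way. The sketch below assumes the tag sequence has already been produced (in the source, by CRF decoding); the function name `decode_bmes` and the sample tags are illustrative and not taken from the source.

```python
def decode_bmes(chars, tags):
    """Turn per-character B/M/E/S tags into a list of words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # a new word starts here
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # inside the current word
            current += ch
        elif tag == "E":          # the current word ends here
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:                   # flush a word left open by a malformed sequence
        words.append(current)
    return words


words = decode_bmes("한국어분석", ["B", "M", "E", "B", "E"])
```

With these hand-written tags, the five characters are grouped into two words, 한국어 and 분석.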

To determine the topic of the target text content, a topic determination model must be built in addition to word segmentation. In one possible implementation, a support vector machine (SVM) may be used to construct the topic determination model: a feature extraction method obtains the features of the topic, the TF-IDF method weights the features to obtain webpage feature vectors, and the topic determination model is trained on them. When determining the topic of the target text content, this model is applied directly.
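The TF-IDF weighting step can be sketched with the standard library alone. This uses the plain variant tf × log(N / df), which is one of several common definitions and is an assumption here; the function name and the two tiny tokenized documents are made up, and training the SVM on the resulting vectors is left out.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns (vocabulary, weight vectors)."""
    vocabulary = sorted({tok for doc in documents for tok in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vector = []
        for tok in vocabulary:
            tf = counts[tok] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[tok])
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


docs = [["경제", "뉴스", "경제"], ["스포츠", "뉴스"]]
vocabulary, vectors = tfidf_vectors(docs)
```

A token that occurs in every document, such as 뉴스 here, gets weight 0, so the feature vector emphasizes tokens that distinguish one page from another.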

Another embodiment of the present invention describes step S103 of the foregoing embodiment: determining the temperature value of the target webpage according to the first relevance and the second relevance.

Note that topic-related pages on the Internet are not necessarily all directly connected; some are linked indirectly through other webpages, and these links form a "tunnel". Tunneling means letting the topic crawler try to cross these links to discover more on-topic webpages. For a topic crawler, tunneling is a method oriented toward future reward.

This embodiment computes a temperature value for each webpage and uses Newton's law of cooling to adjust the temperature values dynamically during crawling, which gives the topic crawler a degree of tunneling ability.

In this embodiment, the temperature value of a webpage is computed from the relevance between the text content of the webpage and the specified topic and the relevance between the links in the webpage and the specified topic. Specifically, the temperature value $T_i$ of a webpage is computed according to formula (4):

In formula (4), $T_{i-1}$ denotes the temperature of the parent webpage, $\delta$ denotes the temperature decay rate, and $t_i$ denotes the temperature of the webpage itself; the temperature value of a webpage is described as the combination of the temperature passed down from the parent webpage and the current webpage's own temperature. Note that, as formula (4) shows, computing the temperature value is an iterative process: the initial parent webpage is the webpage corresponding to a seed link, and its temperature may be set to a fixed value.

The webpage's own original temperature $t_i$ is computed according to formula (5):

In formula (5), $w(\text{content})$ denotes the relevance between the text content of the webpage and the specified topic, and $w(l_k)$ denotes the relevance between a link in the webpage and the specified topic.

The temperature decay rate $\delta$ is computed according to formula (6):

$$\delta = e^{-u \cdot \tau} \tag{6}$$

In formula (6), $u$ denotes the cooling coefficient, which may be set to a constant, and $\tau$ denotes the time interval, which may also be set to the current depth of the webpage.
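Formula (6) is easy to evaluate directly. The sketch below takes $\tau$ to be the page depth, as the text permits; the cooling coefficient value 0.5 and the function name `decay_rate` are assumptions chosen only to show how the rate falls with depth.

```python
import math


def decay_rate(u, tau):
    """Formula (6): delta = e^(-u * tau)."""
    return math.exp(-u * tau)


u = 0.5  # hypothetical cooling coefficient, held constant
rates = [decay_rate(u, depth) for depth in range(4)]  # depths 0, 1, 2, 3
```

At depth 0 the rate is 1, and it shrinks toward 0 as the depth grows, so less of the parent page's temperature carries over the further a page is from a seed link.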

During tunneling, different webpages have different temperature values and therefore different crawling abilities: the links of a high-temperature webpage are likely to be more relevant to the specified topic and get more opportunities to collect further webpage information, while the opposite holds for low-temperature webpages. So that the topic crawler can stop following links, when the temperature of a webpage falls below the first preset temperature value, the crawler gives up crawling the links corresponding to that webpage any further.

To improve crawling efficiency and avoid crawling links with low relevance to the specified topic, the method provided by the embodiments of the present invention may further include: deleting, from the second set of links to be crawled, links whose relevance to the specified topic is less than a preset relevance and whose corresponding temperature value is less than a second preset temperature value.

Note that the temperature value corresponding to a link is the temperature value of the webpage that contains the link; that is, when a webpage contains multiple links, those links all have the same temperature value, namely that of the webpage containing them.
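The deletion rule can be sketched as a simple filter. This sketch reads the condition as requiring both tests to hold (relevance below the preset relevance and temperature below the second preset temperature value); the `QueuedLink` record, the function name `prune`, and the threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueuedLink:
    url: str
    relevance: float     # relevance of the link to the specified topic
    temperature: float   # temperature of the webpage that contains the link


def prune(links, min_relevance, min_temperature):
    """Drop links that fall below both the relevance and temperature thresholds."""
    return [
        link for link in links
        if not (link.relevance < min_relevance and link.temperature < min_temperature)
    ]


queue = [
    QueuedLink("a", relevance=0.9, temperature=0.8),
    QueuedLink("b", relevance=0.1, temperature=0.1),  # low on both: removed
    QueuedLink("c", relevance=0.1, temperature=0.9),  # hot parent page: kept
]
kept = prune(queue, min_relevance=0.3, min_temperature=0.5)
```

Keeping link `c` despite its low relevance is what preserves the tunneling behavior: a weakly related link found on a hot page still gets a chance to be crawled.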

After all links in the first set of links to be crawled and the second set of links to be crawled have been crawled, the stored content to be displayed of each target webpage is displayed. Specifically, the content is displayed in descending order of the relevance between the target text content of each target webpage and the specified topic; that is, in the webpage content shown to the user, webpages whose text content is more relevant to the specified topic come first.

An embodiment of the present invention further provides a topic crawler system. Referring to FIG. 8, which shows a schematic structural diagram of the system, it may include: a link acquisition module 801, a relevance determination module 802, a temperature value determination module 803, and a link processing module 804.

The link acquisition module 801 is configured to obtain an uncrawled link from a first set of links to be crawled, the first set of links to be crawled including pre-acquired seed links.

The relevance determination module 802 is configured to determine the first relevance and the second relevance corresponding to the target webpage that corresponds to the obtained link.

Here, the first relevance is the relevance between the target link in the target webpage and the specified topic, and the second relevance is the relevance between the target text content in the target webpage and the specified topic.

The temperature value determination module 803 is configured to determine the temperature value of the target webpage according to the first relevance and the second relevance, and to store the content to be displayed of the target webpage.

Here, the temperature value can represent the relevance between the target webpage and the specified topic.

The link processing module 804 is configured to put the target link into a second set of links to be crawled when the temperature value of the target webpage is greater than or equal to the first preset temperature value.

The link acquisition module 801 is further configured to, when the first set of links to be crawled contains no link that has not yet been obtained, obtain from the second set of links to be crawled the uncrawled link with the highest relevance to the specified topic, and then trigger the relevance determination module 802 to determine the first relevance and the second relevance corresponding to the target webpage that corresponds to the obtained link.

The topic crawler system provided by the embodiment of the present invention first obtains an uncrawled link from the first set of links to be crawled, and then determines the relevance between the target text content in the corresponding target webpage and the specified topic and the relevance between the target link and the specified topic. It next determines the temperature value of the target webpage from these relevance values and stores the content to be displayed of the target webpage. If the temperature value of the target webpage is greater than or equal to the first preset temperature value, the target link is put into the second set of links to be crawled. When the first set of links to be crawled contains no link that has not yet been obtained, the system obtains from the second set the uncrawled link with the highest relevance to the specified topic and continues crawling. The topic crawler system provided by the embodiment of the present invention thus lets a user obtain a large number of webpages related to the specified topic from the network, giving a good user experience.
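The overall control flow of the two link sets can be sketched as below. This is an outline under stated assumptions, not the patented implementation: `fetch_and_score` is a hypothetical callback standing in for fetching a page and computing its temperature (formulas (4) to (6)) and link relevances, `max_pages` is an added safety limit, and the second set is kept in a heap so that the most relevant link is taken first.

```python
import heapq
from collections import deque


def crawl(seeds, fetch_and_score, threshold, max_pages=100):
    """
    seeds: seed URLs (the first set of links to be crawled).
    fetch_and_score(url): hypothetical callback returning
        (temperature, content, [(link_url, link_relevance), ...]).
    threshold: the first preset temperature value.
    Returns {url: stored content} in crawl order.
    """
    first = deque(seeds)
    second = []            # heap of (-relevance, url): most relevant link pops first
    seen = set(first)
    stored = {}
    while (first or second) and len(stored) < max_pages:
        if first:
            url = first.popleft()
        else:
            _, url = heapq.heappop(second)
        temperature, content, links = fetch_and_score(url)
        stored[url] = content
        # Only pages that are hot enough contribute their links.
        if temperature >= threshold:
            for link_url, relevance in links:
                if link_url not in seen:
                    seen.add(link_url)
                    heapq.heappush(second, (-relevance, link_url))
    return stored


# A made-up four-page web used to exercise the loop.
pages = {
    "seed": (1.0, "S", [("a", 0.2), ("b", 0.9)]),
    "a": (0.1, "A", [("c", 0.5)]),
    "b": (0.8, "B", [("d", 0.7)]),
    "d": (0.6, "D", []),
}
result = crawl(["seed"], lambda url: pages[url], threshold=0.5)
```

In this toy run, page `a` is crawled but is too cold to pass on its link, so page `c` is never fetched.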

In the topic crawler system provided by the above embodiment, the relevance determination module 802 may include: a webpage crawling submodule, a data extraction submodule, a first relevance determination submodule, and a second relevance determination submodule.

The webpage crawling submodule is configured to crawl the target webpage from the network according to the obtained link.

The data extraction submodule is configured to extract the target text content, the target link, and the anchor text corresponding to the target link from the target webpage.

The first relevance determination submodule is configured to determine, based on the anchor text, the relevance between the corresponding target link and the specified topic as the first relevance.

The second relevance determination submodule is configured to determine the relevance between the target text content and the specified topic as the second relevance.

Further, the first relevance determination submodule may include: a conversion submodule, a prediction submodule, and a determination submodule.

The conversion submodule is configured to convert the characters in the anchor text into character vectors.

The prediction submodule is configured to input the character vectors into a pre-established topic prediction model and obtain the prediction result output by the topic prediction model, where the prediction result indicates the topic of the anchor text, and the topic prediction model is trained using, as training samples, the character vectors corresponding to training anchor texts labeled with topics.

The determination submodule is configured to determine the relevance between the topic of the anchor text and the specified topic, as the relevance between the target link corresponding to the anchor text and the specified topic.

The topic crawler system provided by the above embodiment may further include a link deletion module.

The link deletion module is configured to delete, from the second set of links to be crawled, links whose relevance to the specified topic is less than a preset relevance and whose corresponding temperature value is less than a second preset temperature value.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.

In the several embodiments provided in this application, it should be understood that the disclosed methods, apparatuses, and devices may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit.

If the functions are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1.一种主题网页爬取方法,其特征在于,包括:1. A subject webpage crawling method is characterized in that, comprising: 从第一待爬取链接集合中获取未爬取的链接,所述第一待爬取链接集合中包括预先获取的种子链接;Obtain uncrawled links from the first link set to be crawled, wherein the first link set to be crawled includes pre-acquired seed links; 确定获取的链接对应的目标网页对应的第一相关度和第二相关度,所述第一相关度为所述目标网页中的目标链接与所述指定主题的相关度,所述第二相关度为所述目标网页中的目标文本内容与指定主题的相关度;Determine the first correlation degree and the second correlation degree corresponding to the target webpage corresponding to the obtained link, the first correlation degree is the correlation degree between the target link in the target webpage and the specified topic, and the second correlation degree is Relevance between the target text content in the target webpage and the specified topic; 根据所述第一相关度和所述第二相关度确定所述目标网页的温度值,并存储所述目标网页的待展示内容,其中,所述温度值能够表征所述目标网页与所述指定主题的相关度;Determine the temperature value of the target webpage according to the first correlation degree and the second correlation degree, and store the content to be displayed of the target webpage, wherein the temperature value can represent the relationship between the target webpage and the specified the relevance of the topic; 若所述目标网页的温度值大于或等于第一预设温度值,则将所述目标链接放入第二待爬取链接集合中;If the temperature value of the target webpage is greater than or equal to the first preset temperature value, then put the target link into the second set of links to be crawled; 若所述第一待爬取链接集合中不存在未获取过的链接,则从所述第二待爬取链接集合中获取与所述指定主题相关度最高的未爬取的链接,然后执行所述确定获取的链接对应的目标网页对应的第一相关度和第二相关度。If there is no unobtained link in the first link set to be crawled, then obtain the uncrawled link with the highest correlation with the specified topic from the second set of links to be crawled, and then execute the The first correlation degree and the second correlation degree corresponding to the target webpage corresponding to the obtained link are determined as described above. 2.根据权利要求1所述的主题网页爬取方法,其特征在于,所述确定获取的链接对应的目标网页对应的第一相关度和第二相关度,包括:2. 
The subject web page crawling method according to claim 1, wherein said determination of the first degree of relevance and the second degree of relevance corresponding to the target web page corresponding to the obtained link comprises: 根据获取的链接从网络上爬取所述目标网页;Crawling the target webpage from the Internet according to the obtained link; 从所述目标网页中提取所述目标文本内容、所述目标链接和所述目标链接对应的锚文本;Extracting the target text content, the target link, and the anchor text corresponding to the target link from the target webpage; 基于所述锚文本确定对应的目标链接与所述指定主题的相关度作为所述第一相关度,并确定所述目标文本内容与所述指定主题的相关度作为所述第二相关度。Determining, based on the anchor text, a degree of correlation between the corresponding target link and the specified topic as the first correlation degree, and determining a degree of correlation between the target text content and the specified topic as the second degree of correlation. 3.根据权利要求2所述的主题网页爬取方法,其特征在于,所述确定所述目标文本内容与指定主题的相关度,包括:3. The subject web page crawling method according to claim 2, wherein said determination of the degree of relevance between said target text content and a designated subject comprises: 利用双向长短记忆条件随机场模型对所述目标文本内容进行分词,获得多个词;Segmenting the target text content by using a two-way long-short-term memory conditional random field model to obtain multiple words; 通过所述多个词和预见建立的主题判定模块判定所述目标文本内容的主题;Determine the theme of the target text content through the theme determination module established by the plurality of words and prediction; 确定所述目标文本内容的主题与所述指定主题的相关度作为所述目标文本内容与所述指定主题的相关度。Determining the degree of relevance between the subject of the target text content and the specified theme as the degree of relevance between the target text content and the specified theme. 4.根据权利要求2所述的主题网页爬取方法,其特征在于,所述基于所述锚文本确定对应的目标链接与所述指定主题的相关度,包括:4. 
The topic webpage crawling method according to claim 2, wherein determining, based on the anchor text, the degree of relevance between the corresponding target link and the specified topic comprises:
converting the characters in the anchor text into character vectors;
inputting the character vectors into a pre-established topic prediction model to obtain a prediction result output by the topic prediction model, wherein the prediction result indicates the topic of the anchor text, and the topic prediction model is obtained by training with, as training samples, the character vectors corresponding to training anchor texts labeled with topics;
determining the degree of relevance between the topic of the anchor text and the specified topic as the degree of relevance between the target link corresponding to the anchor text and the specified topic.

5. The topic webpage crawling method according to claim 3, wherein the process of pre-establishing the topic prediction model comprises:
obtaining a plurality of anchor texts labeled with topics to form a training anchor text set;
converting each character of each training anchor text in the training anchor text set into a character vector in turn, to obtain a character vector set corresponding to the training anchor text, wherein the distance between different character vectors characterizes the association between their corresponding characters;
training a bidirectional recurrent neural network with the character vector set corresponding to the training text as input, and using the trained bidirectional recurrent neural network as the topic prediction model.

6. The topic webpage crawling method according to claim 1, further comprising:
deleting, from the second set of links to be crawled, links whose degree of relevance to the specified topic is less than a preset degree of relevance and whose corresponding temperature value is less than a second preset temperature value.

7. A topic crawler system, comprising: a link acquisition module, a relevance determination module, a temperature value determination module and a link processing module;
the link acquisition module is configured to obtain an uncrawled link from a first set of links to be crawled, the first set of links to be crawled including pre-acquired seed links;
the relevance determination module is configured to determine a first degree of relevance and a second degree of relevance for the target webpage corresponding to the obtained link, wherein the first degree of relevance is the degree of relevance between a target link in the target webpage and the specified topic, and the second degree of relevance is the degree of relevance between the target text content in the target webpage and the specified topic;
the temperature value determination module is configured to determine a temperature value of the target webpage according to the first degree of relevance and the second degree of relevance, and to store the content to be displayed of the target webpage, wherein the temperature value characterizes the relevance of the target webpage to the specified topic;
the link processing module is configured to put the target link into a second set of links to be crawled when the temperature value of the target webpage is greater than or equal to a first preset temperature value;
the link acquisition module is further configured to, when the first set of links to be crawled contains no link that has not yet been obtained, obtain from the second set of links to be crawled the uncrawled link with the highest degree of relevance to the specified topic, and then trigger the relevance determination module to determine the first degree of relevance and the second degree of relevance for the target webpage corresponding to the obtained link.

8. The topic crawler system according to claim 7, wherein the relevance determination module comprises: a webpage crawling submodule, a data extraction submodule, a first relevance determination submodule and a second relevance determination submodule;
the webpage crawling submodule is configured to crawl the target webpage from the network according to the obtained link;
the data extraction submodule is configured to extract, from the target webpage, the target text content, the target link and the anchor text corresponding to the target link;
the first relevance determination submodule is configured to determine, based on the anchor text, the degree of relevance between the corresponding target link and the specified topic as the first degree of relevance;
the second relevance determination submodule is configured to determine the degree of relevance between the target text content and the specified topic as the second degree of relevance.

9. The topic crawler system according to claim 8, wherein the first relevance determination submodule comprises: a conversion submodule, a prediction submodule and a determination submodule;
the conversion submodule is configured to convert the characters in the anchor text into character vectors;
the prediction submodule is configured to input the character vectors into a pre-established topic prediction model and obtain a prediction result output by the topic prediction model, wherein the prediction result indicates the topic of the anchor text, and the topic prediction model is obtained by training with, as training samples, the character vectors corresponding to training anchor texts labeled with topics;
the determination submodule is configured to determine the degree of relevance between the topic of the anchor text and the specified topic as the degree of relevance between the target link corresponding to the anchor text and the specified topic.

10. The topic crawler system according to claim 7, further comprising: a link deletion module;
the link deletion module is configured to delete, from the second set of links to be crawled, links whose degree of relevance to the specified topic is less than a preset degree of relevance and whose corresponding temperature value is less than a second preset temperature value.
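The crawl loop recited in the claims (a first set of seed links, a second set ordered by link relevance, a temperature threshold, and deletion of weak links) can be illustrated with a short Python sketch. It is not the patented implementation. The names `Page`, `fetch`, `link_relevance`, `text_relevance`, `prune` and `crawl`, all threshold values, and the toy `WEB` data are hypothetical. The claims do not say how the temperature value is computed from the two degrees of relevance, so the sketch assumes a weighted average and assumes that a page's first degree of relevance is the mean over its outgoing links. It also assumes that the temperature value "corresponding to" a queued link is that of the page on which the link was found. A keyword scorer stands in for the trained topic prediction model.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass
class Page:
    text: str              # target text content
    links: Dict[str, str]  # target link URL -> anchor text


def prune(second: List[Tuple[float, str, float]],
          min_relevance: float, min_temperature: float) -> None:
    """Delete queued links that are both weakly relevant and from a cool page."""
    second[:] = [(neg_rel, url, temp) for neg_rel, url, temp in second
                 if not (-neg_rel < min_relevance and temp < min_temperature)]
    heapq.heapify(second)


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], Optional[Page]],
          link_relevance: Callable[[str], float],
          text_relevance: Callable[[str], float],
          first_threshold: float = 0.5,   # first preset temperature value
          min_relevance: float = 0.2,     # preset degree of relevance
          second_threshold: float = 0.6,  # second preset temperature value
          link_weight: float = 0.5,       # assumption: weighted-average temperature
          max_pages: int = 100) -> Dict[str, str]:
    first = list(seeds)                          # first set: seed links
    second: List[Tuple[float, str, float]] = []  # second set: max-heap on relevance
    seen = set(first)
    stored: Dict[str, str] = {}
    while len(stored) < max_pages and (first or second):
        # Seed links are consumed first; then the most relevant queued link.
        url = first.pop(0) if first else heapq.heappop(second)[1]
        page = fetch(url)
        if page is None:
            continue
        scores = {u: link_relevance(a) for u, a in page.links.items()}
        first_rel = sum(scores.values()) / len(scores) if scores else 0.0
        second_rel = text_relevance(page.text)
        temperature = link_weight * first_rel + (1 - link_weight) * second_rel
        stored[url] = page.text                  # content to be displayed
        if temperature >= first_threshold:
            for u, s in scores.items():
                if u not in seen:
                    seen.add(u)
                    heapq.heappush(second, (-s, u, temperature))
            prune(second, min_relevance, second_threshold)
    return stored


# Toy data: a four-page "web" and a keyword scorer (both hypothetical).
WEB = {
    "seed": Page("crawler topic crawler", {"a": "topic crawler", "b": "cooking"}),
    "a": Page("topic crawler frontier", {"c": "crawler"}),
    "b": Page("pasta recipes", {"d": "pasta"}),
    "c": Page("crawler", {}),
}


def keyword_score(text: str) -> float:
    words = text.split()
    return sum(w in {"topic", "crawler"} for w in words) / len(words) if words else 0.0


result = crawl(["seed"], WEB.get, keyword_score, keyword_score)
print(list(result))
```

With this toy data the pages are stored in the order `seed`, `a`, `c`, `b`. Page `b` is fetched last because its anchor text scores lowest, and its own temperature value falls below the first threshold, so its outgoing link `d` is never queued. A heap is used for the second set only because the claims require fetching the most relevant uncrawled link; any priority structure would serve.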
CN201810581858.XA 2018-06-07 2018-06-07 Topic webpage crawling method and topic crawler system Expired - Fee Related CN108959413B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810581858.XA CN108959413B (en) 2018-06-07 2018-06-07 Topic webpage crawling method and topic crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810581858.XA CN108959413B (en) 2018-06-07 2018-06-07 Topic webpage crawling method and topic crawler system

Publications (2)

Publication Number Publication Date
CN108959413A (en) 2018-12-07
CN108959413B (en) 2020-09-11

Family

ID=64494106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810581858.XA Expired - Fee Related CN108959413B (en) 2018-06-07 2018-06-07 Topic webpage crawling method and topic crawler system

Country Status (1)

Country Link
CN (1) CN108959413B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069690A (en) * 2019-04-24 2019-07-30 成都市映潮科技股份有限公司 Topic web crawler method, apparatus and medium
CN110532450A (en) * 2019-05-13 2019-12-03 南京大学 Topic crawler method based on improved shark search
CN111143649A (en) * 2019-12-09 2020-05-12 杭州迪普科技股份有限公司 Webpage searching method and device
CN112579853A (en) * 2019-09-30 2021-03-30 顺丰科技有限公司 Method and device for sequencing crawling links and storage medium
CN112836111A (en) * 2021-02-09 2021-05-25 沈阳麟龙科技股份有限公司 URL crawling method, device, medium and electronic equipment of crawler system
CN113449168A (en) * 2021-07-14 2021-09-28 北京锐安科技有限公司 Method, device and equipment for capturing theme webpage data and storage medium
CN114117177A (en) * 2021-11-09 2022-03-01 智文有限公司 Topic crawler method and system based on TextCNN

Citations (5)

Publication number Priority date Publication date Assignee Title
CN102073730A (en) * 2011-01-14 2011-05-25 哈尔滨工程大学 Method for constructing topic web crawler system
CN102298622A (en) * 2011-08-11 2011-12-28 中国科学院自动化研究所 Search method for focused web crawler based on anchor text and system thereof
CN102662954A (en) * 2012-03-02 2012-09-12 杭州电子科技大学 Method for implementing topical crawler system based on learning URL string information
US9323861B2 (en) * 2010-11-18 2016-04-26 Daniel W. Shepherd Method and apparatus for enhanced web browsing
CN106776722A (en) * 2016-11-22 2017-05-31 新乡学院 theme prediction algorithm based on hyperlink

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US9323861B2 (en) * 2010-11-18 2016-04-26 Daniel W. Shepherd Method and apparatus for enhanced web browsing
CN102073730A (en) * 2011-01-14 2011-05-25 哈尔滨工程大学 Method for constructing topic web crawler system
CN102298622A (en) * 2011-08-11 2011-12-28 中国科学院自动化研究所 Search method for focused web crawler based on anchor text and system thereof
CN102662954A (en) * 2012-03-02 2012-09-12 杭州电子科技大学 Method for implementing topical crawler system based on learning URL string information
CN106776722A (en) * 2016-11-22 2017-05-31 新乡学院 theme prediction algorithm based on hyperlink

Non-Patent Citations (1)

Title
费晨杰等: "基于LDA扩展主题词库的主题爬虫研究", 《计算机应用与软件》 *

Cited By (10)

Publication number Priority date Publication date Assignee Title
CN110069690A (en) * 2019-04-24 2019-07-30 成都市映潮科技股份有限公司 Topic web crawler method, apparatus and medium
CN110069690B (en) * 2019-04-24 2021-12-07 成都映潮科技股份有限公司 Method, device and medium for topic web crawler
CN110532450A (en) * 2019-05-13 2019-12-03 南京大学 Topic crawler method based on improved shark search
CN112579853A (en) * 2019-09-30 2021-03-30 顺丰科技有限公司 Method and device for sequencing crawling links and storage medium
CN111143649A (en) * 2019-12-09 2020-05-12 杭州迪普科技股份有限公司 Webpage searching method and device
CN112836111A (en) * 2021-02-09 2021-05-25 沈阳麟龙科技股份有限公司 URL crawling method, device, medium and electronic equipment of crawler system
CN113449168A (en) * 2021-07-14 2021-09-28 北京锐安科技有限公司 Method, device and equipment for capturing theme webpage data and storage medium
WO2023284612A1 (en) * 2021-07-14 2023-01-19 北京锐安科技有限公司 Subject webpage data capturing method and apparatus, and device and storage medium
CN113449168B (en) * 2021-07-14 2024-02-20 北京锐安科技有限公司 Theme web page data capture method, device, equipment and storage medium
CN114117177A (en) * 2021-11-09 2022-03-01 智文有限公司 Topic crawler method and system based on TextCNN

Also Published As

Publication number Publication date
CN108959413B (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN108959413B (en) Topic webpage crawling method and topic crawler system
CN113822067A (en) Key information extraction method and device, computer equipment and storage medium
CN102629246B (en) Server for recognizing browser voice command and browser voice command recognition method
TWI695277B (en) Automatic website data collection method
CN106980664B (en) A bilingual comparable corpus mining method and device
US20130339840A1 (en) System and method for logical chunking and restructuring websites
CN106462626A (en) Modeling Interestingness Using Deep Neural Networks
CN102831246A (en) Method and device for classification of Tibetan webpage
CN111651675B (en) UCL-based user interest topic mining method and device
CN107301195A (en) Generate disaggregated model method, device and the data handling system for searching for content
CN103886020B (en) A kind of real estate information method for fast searching
US10152540B2 (en) Linking thumbnail of image to web page
CN111160007B (en) Search method and device based on BERT language model, computer equipment and storage medium
CN110275963A (en) Method and device for outputting information
CN113392195A (en) Public opinion monitoring method and device, electronic equipment and storage medium
CN105808615A (en) Document index generation method and device based on word segment weights
JP2023544925A (en) Data evaluation methods, training methods and devices, electronic equipment, storage media, computer programs
CN114663164A (en) E-commerce site promotion configuration method and its device, equipment, medium and product
CN110297994A (en) Acquisition method, device, computer equipment and the storage medium of web data
CN106933380B (en) A kind of update method and device of dictionary
CN104778232B (en) Searching result optimizing method and device based on long query
CN108595466B (en) A kind of Internet information filtering and Internet user information and network post structure analysis method
CN115203514A (en) Commodity query redirection method and device, equipment, medium and product thereof
CN111666479A (en) Method for searching web page and computer readable storage medium
CN114117242A (en) Data query method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200911