
CN103136219A - Timeliness-based method and device for requirement mining - Google Patents


Info

Publication number
CN103136219A
Authority
CN
China
Prior art keywords
query
preset
pattern
webpage
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103791203A
Other languages
Chinese (zh)
Other versions
CN103136219B (en)
Inventor
黄际洲
钟华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110379120.3A
Publication of CN103136219A
Application granted
Publication of CN103136219B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a timeliness-based method and device for requirement mining. The method comprises: obtaining click data for a selected time range from a search log, the click data comprising at least users' queries and the titles of the web pages clicked in the search results corresponding to those queries; obtaining, from the click data, the queries whose corresponding web page titles match a pattern of a preset requirement type; calculating the click ratio of each obtained query and selecting the queries whose click ratio is greater than or equal to a preset ratio threshold, where the click ratio is the number of a query's clicks on web page titles matching a pattern of the preset requirement type as a proportion of its clicks on all search results; and deriving time-sensitive queries of the preset requirement type from the selected queries. The method and device can improve the accuracy and recall of requirement query mining and thereby improve requirement identification.

Description

A timeliness-based requirement mining method and device

【Technical field】

The present invention relates to the field of computer technology, and in particular to a timeliness-based requirement mining method and device.

【Background】

With the rapid development and maturation of the Internet worldwide, online information resources keep growing and the volume of data is expanding quickly, and search engines have become the main way people obtain information. Providing users with more convenient and accurate query services is the present and future direction of search engine technology.

In search engine technology, identifying the user's search requirement is an important part of improving search accuracy and effectiveness, and it matters especially in structured search (that is, vertical search). One approach mines the queries that carry a given type of requirement from the search log in advance; when a user enters a query, it is matched directly against the pre-mined queries of each requirement type to identify the requirement behind the entered query.

When mining queries of each requirement type, keyword-based or template-based methods are typically used: a query containing a keyword of a requirement type is identified as having that requirement, a query matching a template of a requirement type is identified as having that requirement, and so on. However, such methods often fail to mine the time-sensitive queries of a requirement type. Take the query "家常菜" ("home cooking"): prior knowledge says this query carries only a recipe requirement, so a keyword-based or template-based method identifies it as a recipe query even while the TV series 《家常菜》 is airing. During that period, however, a user who enters "家常菜" probably wants video rather than recipes. Such time-sensitive queries cannot be recalled by these requirement mining methods, which in turn lowers the accuracy of requirement identification.

【Summary of the invention】

The present invention provides a method and device for identifying time-sensitive queries, so as to improve the accuracy and recall of requirement mining and thereby improve requirement identification.

The specific technical solution is as follows:

A timeliness-based requirement mining method, comprising:

S1. Obtaining click data for a selected time period from a search log, the click data including at least the users' queries and the titles of the web pages clicked in the search results corresponding to each query;

S2. Obtaining, from the click data, the queries whose corresponding web page titles match a pattern of a preset requirement type;

S3. Calculating the click ratio of each query obtained in step S2 and selecting the queries whose click ratio is greater than or equal to a preset ratio threshold, where the click ratio of a query is the number of its clicks on web page titles matching a pattern of the preset requirement type, as a proportion of its clicks on all search results;

S4. Deriving the time-sensitive queries of the preset requirement type from the selected queries.

According to a preferred embodiment of the present invention, a pattern of the requirement type consists of one or any combination of phrases, word groups, words, attribute identifiers and delimiter symbols.

According to a preferred embodiment of the present invention, mining the patterns of the requirement type specifically includes:

A1. Obtaining time-sensitive web page titles as a corpus;

A2. Clustering the obtained corpus;

A3. Performing steps A31 to A33 on the web page titles in each category of the clustering result:

A31. Segmenting the web page title into words;

A32. Replacing each named entity in the segmentation result with the corresponding named-entity type tag;

A33. Determining the n-grams of the segmentation result, where n is one or more preset positive integers; counting the occurrences of each n-gram within the category to which the web page title belongs; and extracting the n-grams whose occurrence counts meet a preset selection requirement as patterns of the requirement type.
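Steps A32 and A33 can be illustrated with a minimal Python sketch. It assumes the titles of one cluster have already been segmented into tokens (step A31 needs a word segmenter, which is not shown), and it assumes the "preset selection requirement" is a simple minimum count. The names `mine_patterns`, `entity_types`, `ns` and `min_count` are hypothetical and do not come from the patent.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mine_patterns(titles, entity_types, ns=(2, 3), min_count=2):
    """Count n-grams within one cluster and keep the frequent ones.

    titles:       list of already-segmented titles (lists of tokens)
    entity_types: token -> type tag, standing in for a named-entity tagger (A32)
    ns:           the preset values of n (assumed here to be 2 and 3)
    min_count:    assumed selection requirement: at least this many occurrences
    """
    counts = Counter()
    for tokens in titles:
        # A32: replace named entities with their type tags.
        normalized = [entity_types.get(tok, tok) for tok in tokens]
        # A33: collect n-grams for every preset n.
        for n in ns:
            counts.update(ngrams(normalized, n))
    return {gram for gram, count in counts.items() if count >= min_count}


# Hypothetical cluster of three segmented titles about accidents.
cluster = [
    ["accident", "5", "people", "injured"],
    ["accident", "8", "people", "killed"],
    ["accident", "2", "people", "injured"],
]
entity_types = {"5": "[number]", "8": "[number]", "2": "[number]"}
patterns = mine_patterns(cluster, entity_types)
```

Because the numbers are replaced by the `[number]` tag before counting, the three titles contribute to the same n-gram `("accident", "[number]", "people")`, which is the effect step A32 is meant to achieve.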

According to a preferred embodiment of the present invention, step A1 specifically includes:

Obtaining web page titles from news websites of the preset requirement type as the corpus; or,

Obtaining from the search log, as the corpus, the titles of the clicked web pages corresponding to time-sensitive seed queries of the preset requirement type.

According to a preferred embodiment of the present invention, the method further includes, between step A32 and step A33:

Looking up a synonym table and normalizing each word in the segmentation result to the root word of its synonym group.

According to a preferred embodiment of the present invention, the method further includes, after step A33:

A34. Verifying the patterns of the requirement type and retaining the patterns that pass verification. The verification specifically includes: taking time-sensitive web page titles as the positive set and non-time-sensitive web page titles as the negative set; for each pattern, computing the ratio of the number of positive-set samples the pattern matches to the sum of the numbers of positive-set and negative-set samples it matches, and using this ratio as the score of the pattern; if the score is greater than a preset score threshold, the pattern passes verification.
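The verification score in step A34 can be sketched as follows. This is an illustration only: it assumes a pattern "matches" a title when the pattern string occurs in the title, it reads the negative-set count as the number of negative samples the pattern matches, and the threshold value 0.8 and the sample titles are made up.

```python
def pattern_score(pattern, positives, negatives):
    """Score = matched positives / (matched positives + matched negatives)."""
    pos = sum(pattern in title for title in positives)
    neg = sum(pattern in title for title in negatives)
    total = pos + neg
    # A pattern that matches nothing gets score 0 (an assumption; the
    # source does not say how this case is handled).
    return pos / total if total else 0.0


def verify(patterns, positives, negatives, threshold=0.8):
    """Keep only the patterns whose score exceeds the threshold."""
    return [p for p in patterns
            if pattern_score(p, positives, negatives) > threshold]


# Hypothetical positive (time-sensitive) and negative title sets.
positives = ["quake felt (photo)", "crash: 5 injured (photo)", "blast kills 4"]
negatives = ["how to cook rice (photo)", "best hotels in town"]

kept = verify(["(photo)", "injured"], positives, negatives)
```

Here "(photo)" matches two positive titles and one negative title, so it scores 2/3 and is dropped at a threshold of 0.8, while "injured" matches only a positive title and is kept.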

According to a preferred embodiment of the present invention, step S4 includes:

Counting, for each query, the number of websites from which its web page titles matching a pattern of the preset requirement type originate, and filtering out of the selected queries those whose number of source websites is below a preset number threshold; and/or,

Filtering out of the selected queries those that contain preset blacklist words.

According to a preferred embodiment of the present invention, step S4 includes:

Obtaining the queries whose search count in the vertical search log of the preset requirement type is greater than a preset count threshold, and intersecting them with the selected queries to obtain the time-sensitive queries of the preset requirement type; or,

Obtaining the search count of each selected query in the vertical search log of the preset requirement type, and retaining the queries whose search count is greater than the preset count threshold as the time-sensitive queries of the preset requirement type.

According to a preferred embodiment of the present invention, the method further includes, after step S4:

Merging the time-sensitive queries of the preset requirement type with queries of the preset requirement type mined by other means, to obtain the final set of mined queries of the preset requirement type.

A timeliness-based requirement mining device, comprising:

A data acquisition unit, configured to obtain click data for a selected time period from a search log, the click data including at least the users' queries and the titles of the web pages clicked in the search results corresponding to each query;

A query acquisition unit, configured to obtain, from the click data, the queries whose corresponding web page titles match a pattern of a preset requirement type;

A query selection unit, configured to calculate the click ratio of each query obtained by the query acquisition unit and to select the queries whose click ratio is greater than or equal to a preset ratio threshold, where the click ratio of a query is the number of its clicks on web page titles matching a pattern of the preset requirement type, as a proportion of its clicks on all search results;

A query determination unit, configured to derive the time-sensitive queries of the preset requirement type from the queries selected by the query selection unit.

According to a preferred embodiment of the present invention, a pattern of the requirement type consists of one or any combination of phrases, word groups, words, attribute identifiers and delimiter symbols.

According to a preferred embodiment of the present invention, the device further includes a pattern mining unit;

The pattern mining unit specifically includes:

A corpus acquisition subunit, configured to obtain time-sensitive web page titles as a corpus;

A clustering subunit, configured to cluster the obtained corpus;

A pattern extraction subunit, configured to perform the following on the web page titles in each category of the clustering result: segmenting the web page title into words; replacing each named entity in the segmentation result with the corresponding named-entity type tag; determining the n-grams of the segmentation result, where n is one or more preset positive integers; counting the occurrences of each n-gram within the category to which the web page title belongs; and extracting the n-grams whose occurrence counts meet a preset selection requirement as patterns of the requirement type.

According to a preferred embodiment of the present invention, the corpus acquisition subunit obtains web page titles from news websites of the preset requirement type as the corpus; or,

Obtains from the search log, as the corpus, the titles of the clicked web pages corresponding to time-sensitive seed queries of the preset requirement type.

According to a preferred embodiment of the present invention, after replacing the named entities in the segmentation result with the corresponding named-entity type tags and before determining the n-grams of the segmentation result, the pattern extraction subunit is further configured to look up a synonym table and normalize each word in the segmentation result to the root word of its synonym group.

According to a preferred embodiment of the present invention, the pattern extraction subunit is further configured to verify the patterns of the requirement type and retain the patterns that pass verification. The verification specifically includes: taking time-sensitive web page titles as the positive set and non-time-sensitive web page titles as the negative set; for each pattern, computing the ratio of the number of positive-set samples the pattern matches to the sum of the numbers of positive-set and negative-set samples it matches, and using this ratio as the score of the pattern; if the score is greater than a preset score threshold, the pattern passes verification.

According to a preferred embodiment of the present invention, the query determination unit includes a first filtering subunit and/or a second filtering subunit;

The first filtering subunit is configured to count, for each query, the number of websites from which its web page titles matching a pattern of the preset requirement type originate, and to filter out of the selected queries those whose number of source websites is below a preset number threshold;

The second filtering subunit is configured to filter out of the selected queries those that contain preset blacklist words.

According to a preferred embodiment of the present invention, the query determination unit includes a first determination subunit or a second determination subunit;

The first determination subunit is configured to obtain the queries whose search count in the vertical search log of the preset requirement type is greater than a preset count threshold, and to intersect them with the selected queries to obtain the time-sensitive queries of the preset requirement type;

The second determination subunit is configured to obtain the search count of each selected query in the vertical search log of the preset requirement type, and to retain the queries whose search count is greater than the preset count threshold as the time-sensitive queries of the preset requirement type.

According to a preferred embodiment of the present invention, the device further includes a mining and merging unit, configured to merge the time-sensitive queries of the preset requirement type with queries of the preset requirement type mined by other requirement mining devices, to obtain the final set of mined queries of the preset requirement type.

As the above technical solution shows, the present invention obtains from the click data the queries whose web page titles match a pattern of a preset requirement type, and selects among them, based on each query's click ratio on web page titles matching such a pattern, to obtain the time-sensitive queries of the preset requirement type. In other words, the invention mines time-sensitive queries automatically during requirement mining, so that the requirements of these queries can be recognized during requirement identification, which improves the accuracy and recall of requirement identification.

【Description of the drawings】

FIG. 1 is a flowchart of the method provided by Embodiment 1 of the present invention;

FIG. 2 is a flowchart of mining pattern short strings provided by Embodiment 2 of the present invention;

FIG. 3 is a structural diagram of the requirement mining device provided by Embodiment 3 of the present invention;

FIG. 4 is a structural diagram of the pattern mining unit provided by Embodiment 4 of the present invention.

【Detailed description】

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.

Observation shows that the titles of time-sensitive web pages often follow particular patterns. If users who search for a query click many such pages and few others, the query has a time-sensitive requirement. For example, when a user searches for the query "家常菜" ("home cooking") while a TV series of that name is popular, the user may click news pages, and those pages usually share similar title patterns. If clicks on pages with these patterns far outnumber clicks on other pages, the query has a time-sensitive requirement. The method and device for identifying time-sensitive queries provided by the present invention are based on this observation. The method is described in detail below through Embodiment 1.

Embodiment 1

FIG. 1 is a flowchart of the method provided by Embodiment 1 of the present invention. As shown in FIG. 1, the method may include:

Step 101: Obtain click data for the selected time period from the search log.

To analyze whether queries have a time-sensitive requirement within a given time period, obtain the click data of that period. The selected period may be a recent one, for example the most recent day. Because requirement mining is usually performed periodically, the period used in this step may also match the period used by ordinary requirement mining: if ordinary requirement mining runs once a month, the period here may also be one month, that is, the click data of the most recent month.

The obtained click data includes at least the queries users searched for and the titles of the web pages clicked in the search results corresponding to each query.

Step 102: Obtain, from the click data, the queries whose corresponding web page titles match a pattern of the preset requirement type.

A pattern of the preset requirement type can be represented by a pattern short string; if a web page title contains the pattern short string, the title matches the pattern of the preset requirement type. A pattern short string may consist of, but is not limited to, phrases, word groups, words and attribute identifiers, and may also include delimiter symbols such as 【】, [], (), {}, ~, !, #, ¥, %, ……, &, *, _ and so on.

Taking the picture requirement type as an example, pattern short strings include "有震感(图)" ("tremors felt (photo)"), "相撞事故【number】人丧生" ("collision accident, 【number】 people killed") and "事故【number】人受伤(图)" ("accident, 【number】 people injured (photo)"), where 【number】 is the attribute identifier for a number.
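One way to test whether a title contains a pattern short string with an attribute identifier is to compile the short string into a regular expression. The sketch below is an assumption about how such matching could be done, not the patent's stated implementation; it uses English stand-in strings, writes the identifier as `[number]`, and assumes a number means a run of digits. `ATTRIBUTE_REGEX`, `compile_pattern` and `title_matches` are hypothetical names.

```python
import re

# Assumed mapping from attribute identifiers to regular expressions.
ATTRIBUTE_REGEX = {"[number]": r"\d+"}

# Splits a short string around attribute identifiers, keeping the identifiers.
_SPLITTER = re.compile(
    "(" + "|".join(re.escape(key) for key in ATTRIBUTE_REGEX) + ")"
)


def compile_pattern(short_string):
    """Turn a pattern short string into a compiled regular expression."""
    parts = _SPLITTER.split(short_string)
    # Attribute identifiers become sub-patterns; everything else is literal.
    regex = "".join(ATTRIBUTE_REGEX.get(part, re.escape(part)) for part in parts)
    return re.compile(regex)


def title_matches(title, compiled_patterns):
    """True if the title contains at least one of the pattern short strings."""
    return any(pattern.search(title) for pattern in compiled_patterns)


compiled = [
    compile_pattern("tremors felt (photo)"),
    compile_pattern("accident [number] people injured (photo)"),
]

hit = title_matches("Train accident 5 people injured (photo)", compiled)
miss = title_matches("Train accident report", compiled)
```

Literal parts are escaped with `re.escape`, so delimiter symbols such as parentheses in a short string are matched as plain characters rather than interpreted as regular-expression syntax.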

The pattern short strings above may be summarized manually by inspecting the titles of search results, or mined automatically by machine learning. The automatic mining of pattern short strings is described in detail in Embodiment 2.

Step 103: For each query, calculate the number of its clicks on web page titles matching a pattern of the preset requirement type as a proportion of its clicks on all search results, and select the queries whose ratio is greater than or equal to the preset ratio threshold.

The preset ratio threshold in this step may be set according to the requirement type and actual conditions; in one embodiment it is set to 0.3.

Suppose the queries obtained after step 102, with their corresponding web page titles and click counts, are as shown in Table 1.

Table 1

[Table 1 appears as an image in the original (Figure BDA0000112116300000081); its contents are not reproduced here.]

After the ratios are calculated in step 103, the ratios for "上海车展" ("Shanghai Auto Show") and "百度公司" ("Baidu company") are both found to be below 0.3, while the ratio for "超新星爆发" ("supernova explosion") is above 0.3, so "超新星爆发" is selected.
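The ratio computed in step 103 can be sketched in a few lines of Python. Since Table 1 is not available as text, the click records below are invented for illustration, and the pattern test is reduced to checking for the substring "(photo)". The function name `select_by_click_ratio` and its parameters are hypothetical.

```python
from collections import defaultdict


def select_by_click_ratio(click_records, matches_pattern, threshold=0.3):
    """Return {query: ratio} for queries whose ratio reaches the threshold.

    click_records:   iterable of (query, clicked_title, click_count)
    matches_pattern: function telling whether a title matches a pattern of
                     the preset requirement type
    """
    matched = defaultdict(int)  # clicks on pattern-matching titles
    total = defaultdict(int)    # clicks on all results
    for query, title, clicks in click_records:
        total[query] += clicks
        if matches_pattern(title):
            matched[query] += clicks
    ratios = {q: matched[q] / total[q] for q in total if total[q] > 0}
    return {q: r for q, r in ratios.items() if r >= threshold}


# Made-up click data: (query, clicked title, number of clicks).
records = [
    ("supernova explosion", "Supernova explosion lights up sky (photo)", 40),
    ("supernova explosion", "Supernova - encyclopedia entry", 60),
    ("Shanghai Auto Show", "Auto show opens (photo)", 10),
    ("Shanghai Auto Show", "Shanghai Auto Show official site", 90),
]

selected = select_by_click_ratio(records, lambda title: "(photo)" in title)
```

With these numbers, "supernova explosion" has a ratio of 40/100 and passes the 0.3 threshold, while "Shanghai Auto Show" has a ratio of 10/100 and is dropped.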

Step 104: Count, for each query, the number of websites from which its web page titles matching a pattern of the preset requirement type originate, and filter out the queries whose number of source websites is below the preset number threshold.

If the web page titles that indicate a query is time-sensitive come from only a few websites, their appearance and clicks may be caused by a burst of activity on those websites and do not reflect the timeliness of the query. This step therefore restricts the number of source websites: a query is considered time-sensitive only if its matching web page titles come from a number of source websites greater than or equal to the preset number threshold.

For example, the queries obtained after step 103 and their numbers of source websites are shown in Table 2.

Table 2

query | Number of source websites
超新星爆发 (supernova explosion) | 8
美国坠毁隐形直升机 (US stealth helicopter crash) | 12
利比亚巷战 (Libya street fighting) | 2
美国卫星撞地球 (US satellite hits Earth) | 4

Assuming the preset number threshold for source websites is 3, "利比亚巷战" ("Libya street fighting") is filtered out, and the remaining queries are "超新星爆发" ("supernova explosion"), "美国坠毁隐形直升机" ("US stealth helicopter crash") and "美国卫星撞地球" ("US satellite hits Earth").
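The filter in step 104 can be sketched as below. The counting helper assumes each matching clicked title comes with its page URL and treats the host name as the "website"; that is an assumption, since the source does not say how websites are identified. The filter is then run on the counts from Table 2, with the queries given in English. Function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def count_source_sites(matched_clicks):
    """Count distinct hosts per query from (query, url) pairs."""
    sites = defaultdict(set)
    for query, url in matched_clicks:
        sites[query].add(urlparse(url).netloc)
    return {query: len(hosts) for query, hosts in sites.items()}


def filter_by_source_sites(site_counts, min_sites=3):
    """Keep queries whose matching titles come from at least min_sites sites."""
    return [q for q, n in site_counts.items() if n >= min_sites]


# Counts taken from Table 2.
table2 = {
    "supernova explosion": 8,
    "US stealth helicopter crash": 12,
    "Libya street fighting": 2,
    "US satellite hits Earth": 4,
}
kept = filter_by_source_sites(table2, min_sites=3)

# Hypothetical URLs showing how the counts could be derived.
demo_counts = count_source_sites([
    ("q", "http://a.example/1"),
    ("q", "http://a.example/2"),
    ("q", "http://b.example/1"),
])
```

With a threshold of 3, only "Libya street fighting" (2 source websites) is removed, which agrees with the worked example in the text.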

Step 105: Filter out the queries that contain preset blacklist words.

This step uses blacklist rules to remove some obvious errors. The blacklist words involved include, but are not limited to, interrogative words and pornographic or reactionary words.
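A blacklist filter of this kind might look like the following sketch. The blacklist contents are invented, and the example matches whole whitespace-separated tokens because the sample queries are in English; for Chinese queries the tokens would have to come from a word segmenter, which is not shown.

```python
def filter_blacklisted(queries, blacklist):
    """Drop every query that contains at least one blacklisted token."""
    blocked = {word.lower() for word in blacklist}
    return [q for q in queries if not blocked & set(q.lower().split())]


# Hypothetical blacklist of interrogative words.
blacklist = {"why", "how", "what"}
queries = [
    "supernova explosion",
    "why supernova explosion",
    "US satellite hits Earth",
]
remaining = filter_blacklisted(queries, blacklist)
```

Token matching rather than substring matching is a deliberate choice here: a substring test would wrongly drop an English query such as "air show" because it contains "how".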

Note that the filtering in steps 104 and 105 is optional: either step may be performed alone, or both may be performed in any order.

Step 106: Obtain the queries whose search count in the vertical search log of the preset requirement type is greater than a preset count threshold, and intersect them with the queries that remain after the filtering in step 105 to obtain the time-sensitive queries of the preset requirement type.

This step further ensures that the mined queries have a significant requirement of the preset requirement type; it is likewise not a mandatory step of the invention. If a query's search count in the vertical search log of the preset requirement type is greater than the preset count threshold, the query has a strong requirement of that type.

Besides the implementation above, another approach is possible: obtain the search count, in the vertical search log of the preset requirement type, of each query that remains after the filtering in step 105, and retain the queries whose search count is greater than the preset count threshold as the time-sensitive queries of the preset requirement type.

Continuing the example, suppose "超新星爆发" ("supernova explosion"), "美国坠毁隐形直升机" ("US stealth helicopter crash") and "美国卫星撞地球" ("US satellite hits Earth") were searched 612, 5630 and 126 times respectively in the picture vertical search log. If the preset count threshold is 200, then "超新星爆发" and "美国坠毁隐形直升机" are finally determined to be time-sensitive queries with a picture requirement.
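The two alternatives in step 106 can be sketched side by side. The search counts for the three example queries come from the text; the extra entry "cat pictures" and the function names are made up to show that a frequent vertical query that was never selected does not enter the result.

```python
def confirm_by_intersection(selected, vertical_counts, min_searches=200):
    """First variant: collect frequent vertical queries, then intersect."""
    frequent = {q for q, n in vertical_counts.items() if n > min_searches}
    return set(selected) & frequent


def confirm_by_lookup(selected, vertical_counts, min_searches=200):
    """Second variant: look up each selected query's vertical search count."""
    return {q for q in selected if vertical_counts.get(q, 0) > min_searches}


selected = [
    "supernova explosion",
    "US stealth helicopter crash",
    "US satellite hits Earth",
]
# Search counts in the picture vertical search log.
vertical_counts = {
    "supernova explosion": 612,
    "US stealth helicopter crash": 5630,
    "US satellite hits Earth": 126,
    "cat pictures": 9000,  # hypothetical query that was not selected
}

result_a = confirm_by_intersection(selected, vertical_counts)
result_b = confirm_by_lookup(selected, vertical_counts)
```

Both variants return the same set. The lookup variant only touches the selected queries, which may be preferable when the vertical search log is much larger than the candidate list.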

After this step, the resulting time-sensitive queries of the preset requirement type may be merged with queries of the preset requirement type mined by other means to obtain the final set of mined queries of that type, which recovers the time-sensitive queries that existing requirement mining methods cannot recall.

Embodiment 2

FIG. 2 is a flowchart of mining pattern short strings provided by Embodiment 2 of the present invention. As shown in FIG. 2, the mining process specifically includes:

Step 201: Obtain time-sensitive web page titles as a corpus.

In this step, time-sensitive web page titles may be obtained in ways that include, but are not limited to, the following two:

Method 1: Obtain web page titles from news websites of the preset requirement type. News websites mostly report time-sensitive news. To attract readers' attention, time-sensitive news headlines tend to describe the core of an event in the same factual style and to use words that signal the special nature of the news. In the web page titles of picture news websites, factual descriptions include "事故X人受伤" ("accident, X people injured") and "事故X人丧生" ("accident, X people killed"), and words that signal the special nature of the news include "【高清大图】" ("【HD large image】"), "(图)" ("(photo)") and "组图" ("photo gallery"). Such titles are well suited to mining pattern short strings.

Method 2: Obtain from the search log the titles of the clicked web pages corresponding to time-sensitive seed queries of the preset requirement type. Several time-sensitive seed queries can be set manually; the titles of the clicked web pages corresponding to these seed queries usually also reflect timeliness and resemble one another in pattern, so they are also well suited to mining pattern short strings.

Step 202: Cluster the obtained corpus.

In this step, clustering may be based on events, sites, channels and so on. The purpose of clustering web page titles is that the titles within each cluster use the same or similar expressions: titles describing the same or similar events are likely to share a pattern, as are titles from the same site and titles from the same channel. The clustering algorithm may be any existing text clustering method and is not described further here.

For example, suppose the news headlines obtained from picture news websites in step 201 are as shown in Table 3.

Table 3

No. | Headline | Source website
1 | A 6.0-magnitude and a 4.1-magnitude earthquake strike Yili Prefecture, Xinjiang within 2 minutes | Baidu News
2 | 5.4-magnitude earthquake at the Sichuan-Gansu border felt in many parts of Sichuan Province (picture) | Baidu News
3 | Two injured in train-truck collision in Anqing City | Baidu News
4 | 5 injured in train-truck collision in Japan (picture) | Sina News
5 | Train and truck collide in Kunming, 5 carriages derailed | Sina News
6 | 5.4-magnitude earthquake at the Sichuan-Gansu border causes no casualties or building collapses | Sina News
7 | 5.4-magnitude earthquake at the border of Longnan, Gansu and Qingchuan County, Sichuan (picture) | Sina News
8 | 8 killed in train-car collision in Indonesia | Tencent News
9 | 5 dead, dozens injured in train-truck collision in Greece | Tencent News
10 | Explosion in Fuquan, Guizhou kills 4; cause under investigation | Tencent News
11 | Explosion at Machangping toll station in Fuquan, Guizhou; nearly a hundred officers and soldiers join rescue | Tencent News
12 | Photo set: explosion in Fuquan, Guizhou causes multiple casualties | Tencent News
13 | [People's Express] Two earthquakes of magnitude 6.0 and 4.1 strike Yili, Xinjiang within 2 minutes | People's Daily Online
14 | 5.4-magnitude earthquake at the Gansu-Sichuan border shakes Longnan residents awake | People's Daily Online
15 | 8.8-magnitude earthquake strikes Chile in 2010 | People's Daily Online
16 | 7.1-magnitude earthquake strikes Yushu County, Qinghai | People's Daily Online
17 | 7.3-magnitude earthquake in Indonesia triggers tsunami and typhoon | People's Daily Online
18 | 6.1-magnitude earthquake strikes Zhaosu County, Xinjiang | People's Daily Online

After clustering by event, three clusters are obtained, as shown in Table 4.

Table 4

Cluster description | Headline numbers
Earthquake | 1, 2, 6, 7, 13, 14, 15, 16, 17, 18
Traffic accident | 3, 4, 5, 8, 9
Explosion | 10, 11, 12

Step 203: Perform steps 203_1 to 203_5 for each webpage title in each category of the clustering result.

Step 203_1: Segment the webpage title into words.

Taking the webpage title "印尼发生7.3级地震引发海啸和台风" ("7.3-magnitude earthquake in Indonesia triggers tsunami and typhoon") as an example, the segmentation result is "印尼/发生/7.3/级/地震/引发/海啸/和/台风" (Indonesia / occur / 7.3 / magnitude / earthquake / trigger / tsunami / and / typhoon).

Step 203_2: Replace the named entities in the segmentation result with the corresponding named-entity type tags.

Named entities here include, but are not limited to, person names, place names, organization names, numbers, dates, currencies and addresses.

After this step, the segmentation result above becomes "【place】/发生/【number】/级/地震/引发/海啸/和/台风".

Step 203_3: Look up a synonym table and normalize the words in the segmentation result to their synonym roots.

In the synonym table, every group of synonyms has a root word that can stand for the whole group. The purpose of this step is to map more synonymous expressions onto a single template. The step is optional.

Assuming the synonym root of "和" (and) is "与", the example above becomes "【place】/发生/【number】/级/地震/引发/海啸/与/台风" after this step.
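Steps 203_2 and 203_3 can be sketched in Python on the already segmented example. This is only an illustration under stated assumptions: the lookup tables `PLACE_NAMES` and `SYNONYM_ROOTS`, the regular expression for numbers, and both function names are hypothetical stand-ins for a real named-entity recognizer and synonym table, which the text does not specify.

```python
import re

# Hypothetical resources; a real system would use an NER component
# and a full synonym table instead of these toy lookups.
PLACE_NAMES = {"印尼"}
SYNONYM_ROOTS = {"和": "与"}
NUMBER_RE = re.compile(r"\d+(\.\d+)?")


def tag_entities(tokens):
    """Step 203_2: replace named entities with their type tags."""
    tagged = []
    for tok in tokens:
        if tok in PLACE_NAMES:
            tagged.append("【place】")
        elif NUMBER_RE.fullmatch(tok):
            tagged.append("【number】")
        else:
            tagged.append(tok)
    return tagged


def normalize_synonyms(tokens):
    """Step 203_3 (optional): map each word to its synonym root if it has one."""
    return [SYNONYM_ROOTS.get(tok, tok) for tok in tokens]


# Segmentation result from step 203_1, taken from the text.
tokens = "印尼/发生/7.3/级/地震/引发/海啸/和/台风".split("/")
normalized = normalize_synonyms(tag_entities(tokens))
print("/".join(normalized))
```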

Step 203_4: Determine the n-grams of the segmentation result, count the occurrences of each n-gram within the category the webpage title belongs to, and extract the n-grams whose occurrence counts satisfy a preset selection requirement as pattern short strings.

An n-gram is a sequence of n consecutive words at the finest segmentation granularity, where n is one or more preset positive integers. In this embodiment, because 1-grams and 2-grams carry a higher risk of taking on a different meaning out of context, n is usually chosen as one or more positive integers greater than or equal to 3.

The preset selection requirement may be that the occurrence count ranks in the top N1, where N1 is a preset positive integer, or that the occurrence count exceeds a preset threshold.

Assuming n is 3 and 4, the example above yields the following 3-grams: 【place】发生【number】, 发生【number】级, 【number】级地震, 级地震引发, 地震引发海啸, 引发海啸与, 海啸与台风. The 4-grams are: 【place】发生【number】级, 发生【number】级地震, 【number】级地震引发, 级地震引发海啸, 地震引发海啸与, 引发海啸与台风.
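The n-gram enumeration for this example can be reproduced with a short Python sketch. The helper name `ngrams` is introduced here for illustration; the token list is the normalized segmentation result from the text.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined without separators."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Normalized segmentation result of the example title.
tokens = ["【place】", "发生", "【number】", "级", "地震", "引发", "海啸", "与", "台风"]

three_grams = ngrams(tokens, 3)
four_grams = ngrams(tokens, 4)

for gram in three_grams + four_grams:
    print(gram)
```

With 9 tokens, a window of size n produces 9 - n + 1 n-grams, so 7 three-grams and 6 four-grams, matching the lists above.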

After n-gram counting over the 10 titles of the earthquake category, the three n-grams with the highest occurrence counts are shown in Table 5.

Table 5

n-gram | Occurrences
[number]级地震 (magnitude-[number] earthquake) | 10
发生[number]级地震 (magnitude-[number] earthquake occurs) | 7
[place]发生[number]级地震 (magnitude-[number] earthquake occurs in [place]) | 7
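Counting and selecting n-grams within one cluster might look like the following Python sketch. Several things here are assumptions rather than details from the text: the three pre-tagged titles are a toy cluster (not the ten earthquake titles), each n-gram is counted at most once per title, and the function names are illustrative. Both selection requirements described above (top N1 and count above a threshold) are shown.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined without separators."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams_in_cluster(cluster_titles, sizes=(3, 4)):
    """Count, for each n-gram, how many titles in the cluster contain it."""
    counts = Counter()
    for tokens in cluster_titles:
        seen = set()
        for n in sizes:
            seen.update(ngrams(tokens, n))
        counts.update(seen)  # assumption: one count per title
    return counts


def select_top(counts, top_n):
    """Selection requirement 1: the top_n most frequent n-grams."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def select_above(counts, min_count):
    """Selection requirement 2: n-grams occurring more than min_count times."""
    return {gram for gram, c in counts.items() if c > min_count}


# Toy cluster of segmented, entity-tagged titles (hypothetical data).
cluster = [
    ["【place】", "发生", "【number】", "级", "地震"],
    ["【place】", "发生", "【number】", "级", "地震", "引发", "海啸"],
    ["【number】", "级", "地震", "未", "造成", "伤亡"],
]

counts = count_ngrams_in_cluster(cluster)
print(select_top(counts, 1))
print(sorted(select_above(counts, 1)))
```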

The extracted n-grams can be used directly as the pattern short strings, which ends the mining process. To make the mined pattern short strings more accurate, the following step can additionally be performed.

Step 203_5: Verify the extracted pattern short strings and keep those that pass verification.

The verification proceeds as follows. Time-sensitive webpage titles form the positive set and non-time-sensitive webpage titles form the negative set. For example, the webpage titles of news websites can serve as the positive set and the webpage titles of community websites (such as Baidu Tieba and Baidu Zhidao) as the negative set. For each pattern short string, compute the ratio of the number of positive-set samples it matches to the sum of the number of samples it matches in the positive set and in the negative set, and use this ratio as the score of the pattern short string. If the score is greater than a preset score threshold, verification passes. Otherwise it fails. A pattern short string matches a sample if the webpage title contains the pattern short string, and does not match otherwise.
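The verification score can be written as a small Python function. This is a sketch under assumptions: the negative count is taken to be the number of negative-set titles the pattern matches, a pattern that matches nothing scores 0, and the titles, pattern and threshold below are invented toy values.

```python
def pattern_score(pattern, positive_titles, negative_titles):
    """Score = matched positives / (matched positives + matched negatives).

    A title matches when it contains the pattern short string.
    """
    pos = sum(pattern in title for title in positive_titles)
    neg = sum(pattern in title for title in negative_titles)
    if pos + neg == 0:
        return 0.0  # assumption: an unmatched pattern fails verification
    return pos / (pos + neg)


# Toy data: news titles as positives, community-site titles as negatives.
positives = [
    "7.1-magnitude earthquake strikes Yushu County",
    "6.1-magnitude earthquake strikes Zhaosu County",
    "Train and truck collide in Kunming",
]
negatives = [
    "What should I do when an earthquake strikes?",
    "Best noodle shops near the station",
]

SCORE_THRESHOLD = 0.6  # hypothetical preset score threshold
score = pattern_score("earthquake strikes", positives, negatives)
passes = score > SCORE_THRESHOLD
print(score, passes)
```

The score behaves like a precision estimate: a pattern that also fires often on non-time-sensitive titles is pushed below the threshold.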

The above is a detailed description of the method provided by the present invention. The device provided by the present invention is described in detail below with reference to Embodiment 3.

Embodiment 3

Fig. 3 is a structural diagram of the demand mining device according to Embodiment 3 of the present invention. As shown in Fig. 3, the device includes a data acquisition unit 300, a query acquisition unit 310, a query selection unit 320 and a query determination unit 330.

The data acquisition unit 300 obtains click data for a selected time period from the search log. The click data includes at least the users' search queries and the titles of the clicked webpages in the search results corresponding to those queries.

The selected time period may be a recent period, for example the click data of the most recent day. Because demand mining is usually performed periodically, the time period used here may also match the period used by ordinary demand mining. For example, if ordinary demand mining runs once a month, the time period here may also be one month, in which case the click data of the most recent month is obtained.

The query acquisition unit 310 obtains, from the click data, the queries whose corresponding webpage titles satisfy a pattern of the preset demand type.

A pattern of a demand type consists of one or any combination of phrases, word groups, words, attribute identifiers and separator symbols. A pattern of the preset demand type can be represented by a pattern short string: if a webpage title contains the pattern short string, the title satisfies the pattern of the preset demand type.

The patterns of the preset demand type can be summarized manually by inspecting the titles of search results, or mined automatically by machine learning. To mine the patterns automatically, the device further includes a pattern mining unit 340, whose structure is described in detail in Embodiment 4.

The query selection unit 320 computes the click rate of each query obtained by the query acquisition unit 310 and selects the queries whose click rate is greater than or equal to a preset ratio threshold. The click rate of a query is the ratio of the number of the query's clicks on webpage titles having the pattern of the preset demand type to the number of the query's clicks on all search results.
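The click rate defined above can be computed from click data with a sketch like the following. The record layout (a list of query and clicked-title pairs), the pattern short strings "(picture)" and "Photo set", the threshold value and all names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical pattern short strings for an image demand type.
PATTERN_STRINGS = ("(picture)", "Photo set")


def matches_pattern(title):
    """A title satisfies the pattern if it contains a pattern short string."""
    return any(p in title for p in PATTERN_STRINGS)


def click_rates(click_log):
    """Return {query: pattern-title clicks / all clicks} for queries
    that have at least one click on a pattern-matching title."""
    total = defaultdict(int)
    matched = defaultdict(int)
    for query, title in click_log:
        total[query] += 1
        if matches_pattern(title):
            matched[query] += 1
    return {q: matched[q] / total[q] for q in matched}


# Toy click data: one (query, clicked title) pair per click.
click_log = [
    ("supernova explosion", "Supernova explosion (picture)"),
    ("supernova explosion", "Photo set: supernova explosion"),
    ("supernova explosion", "Supernova encyclopedia entry"),
    ("how to cook rice", "Rice recipes (picture)"),
    ("how to cook rice", "Rice cooking forum"),
    ("how to cook rice", "Rice basics"),
    ("how to cook rice", "Rice questions and answers"),
]

RATIO_THRESHOLD = 0.5  # hypothetical preset ratio threshold
rates = click_rates(click_log)
selected = [q for q, r in rates.items() if r >= RATIO_THRESHOLD]
print(rates, selected)
```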

The query determination unit 330 obtains time-sensitive queries of the preset demand type from the queries selected by the query selection unit 320.

To further improve the accuracy of the mined time-sensitive queries, the query determination unit 330 may include a first filtering subunit 331 and/or a second filtering subunit 332. Fig. 3 shows the case with both subunits. When both are present, they may process the queries selected by the query selection unit 320 in either order.

The first filtering subunit 331 counts, for each query, the number of websites from which the webpage titles that correspond to the query and have the pattern of the preset demand type originate, and filters out of the selected queries those whose number of source websites is below a preset number threshold.

The second filtering subunit 332 filters out of the selected queries those that contain preset blacklist words. The blacklist words include, but are not limited to, interrogative words and pornographic or reactionary words.
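The two filters can be sketched together in Python. Everything concrete below is assumed for illustration: the example domains, the blacklist, the threshold, the function name, and the choice to match blacklist words against whitespace-separated query words (for unsegmented Chinese queries a substring test would be the natural analogue).

```python
def filter_queries(selected, matched_clicks, min_sites, blacklist):
    """Apply the source-website filter and the blacklist filter.

    matched_clicks: (query, source_site) pairs for clicked titles that
    have the pattern of the preset demand type.
    """
    kept = []
    for query in selected:
        sites = {site for q, site in matched_clicks if q == query}
        if len(sites) < min_sites:
            continue  # too few distinct source websites
        if set(query.split()) & blacklist:
            continue  # query contains a blacklisted word
        kept.append(query)
    return kept


selected = ["supernova explosion", "local shop opening", "why supernova explosion"]
matched_clicks = [
    ("supernova explosion", "news.example.com"),
    ("supernova explosion", "photo.example.org"),
    ("local shop opening", "blog.example.net"),
    ("why supernova explosion", "news.example.com"),
    ("why supernova explosion", "photo.example.org"),
]

kept = filter_queries(selected, matched_clicks, min_sites=2, blacklist={"why", "how"})
print(kept)
```

Because each filter only removes queries, the result does not depend on the order in which the two checks run, which is consistent with the statement that the subunits may be applied in either order.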

In addition, to further ensure that the mined queries show significant demand for the preset demand type, the query determination unit 330 may include a first determination subunit 333 or a second determination subunit (not shown in Fig. 3).

The first determination subunit 333 obtains the queries whose search count in the vertical search log of the preset demand type is greater than a preset count threshold, and intersects them with the selected queries to obtain the time-sensitive queries of the preset demand type.

The second determination subunit obtains the search count of each selected query in the vertical search log of the preset demand type, and keeps the queries whose search count is greater than the preset count threshold as the time-sensitive queries of the preset demand type.
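The two determination subunits describe two routes to the same result, which the following Python sketch makes explicit. The function names and the dictionary form of the vertical search log are assumptions, and the extra log entry "cat pictures" is invented toy data.

```python
def determine_by_intersection(selected, vertical_counts, min_count):
    """First subunit: collect all frequent vertical queries, then intersect."""
    frequent = {q for q, c in vertical_counts.items() if c > min_count}
    return set(selected) & frequent


def determine_by_lookup(selected, vertical_counts, min_count):
    """Second subunit: look up each selected query's count individually."""
    return {q for q in selected if vertical_counts.get(q, 0) > min_count}


vertical_counts = {
    "supernova explosion": 612,
    "crashed US stealth helicopter": 5630,
    "US satellite crashes into Earth": 126,
    "cat pictures": 90000,  # frequent in the vertical log but not selected
}
selected = [
    "supernova explosion",
    "crashed US stealth helicopter",
    "US satellite crashes into Earth",
]

by_intersection = determine_by_intersection(selected, vertical_counts, 200)
by_lookup = determine_by_lookup(selected, vertical_counts, 200)
print(sorted(by_intersection))
```

The first route scans the whole vertical log once and may suit cases where the frequent-query set is reused. The second touches only the selected queries, which may be cheaper when the selection is small.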

Furthermore, the device also includes a mining and merging unit 350, which merges the time-sensitive queries of the preset demand type with the queries of the preset demand type mined by other demand mining devices to obtain the final mined queries of the preset demand type.

Embodiment 4

Fig. 4 is a structural diagram of the pattern mining unit according to Embodiment 4 of the present invention. As shown in Fig. 4, the pattern mining unit may include a corpus acquisition subunit 341, a clustering subunit 342 and a pattern extraction subunit 343.

The corpus acquisition subunit 341 obtains time-sensitive webpage titles as the corpus.

The ways of obtaining the corpus include, but are not limited to: obtaining webpage titles from news websites of the preset demand type as the corpus; or obtaining from the search log the clicked webpage titles corresponding to time-sensitive seed queries of the preset demand type as the corpus.

The clustering subunit 342 clusters the obtained corpus.

Clustering may be based on events, websites, channels and so on. Any existing text clustering method can be used, so the algorithm is not described further here.

The pattern extraction subunit 343 performs the following for the webpage titles in each category of the clustering result: segment the webpage title into words; replace the named entities in the segmentation result with the corresponding named-entity type tags; determine the n-grams of the segmentation result, where n is one or more preset positive integers; count the occurrences of each n-gram within the category the webpage title belongs to; and extract the n-grams whose occurrence counts satisfy the preset selection requirement as patterns of the demand type.

The named entities mentioned above include, but are not limited to, person names, place names, organization names, numbers, dates, currencies and addresses.

In a synonym table, every group of synonyms usually has a root word. To map more synonymous expressions onto a single template, the pattern extraction subunit 343 also looks up the synonym table and normalizes the words in the segmentation result to their synonym roots, after replacing the named entities with the corresponding type tags and before determining the n-grams of the segmentation result.

To further improve the precision of pattern mining, the pattern extraction subunit 343 also verifies the patterns of the demand type and keeps those that pass verification.

The verification proceeds as follows. Time-sensitive webpage titles form the positive set and non-time-sensitive webpage titles form the negative set. For each pattern, compute the ratio of the number of positive-set samples it matches to the sum of the number of samples it matches in the positive set and in the negative set, and use this ratio as the score of the pattern. If the score is greater than a preset score threshold, verification passes. A pattern matches a webpage title if the title contains the pattern, and does not match otherwise.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (18)

1.一种基于时效性的需求挖掘方法,其特征在于,该方法包括:1. A demand mining method based on timeliness, characterized in that the method comprises: S1、从搜索日志中获取所选时间段的点击数据,所述点击数据至少包括用户的搜索项query以及query对应的搜索结果中被点击网页标题;S1. Obtain the click data of the selected time period from the search log, the click data at least includes the user's search item query and the title of the clicked webpage in the search results corresponding to the query; S2、从所述点击数据中获取对应网页标题满足预设需求类型的模式的query;S2. Acquiring from the click data a query corresponding to a mode in which the title of the webpage satisfies a preset demand type; S3、分别计算所述步骤S2获取的各query的点击率,选择点击率大于或等于预设比例阈值的query,其中query的点击率为:query在具有所述预设需求类型的模式的网页标题上的点击数占该query在所有搜索结果上的点击数的比例;S3. Calculate the click-through rate of each query obtained in the step S2 respectively, and select the query whose click-through rate is greater than or equal to the preset ratio threshold, wherein the click-through rate of the query is: the title of the web page of the query in the mode with the preset demand type The proportion of clicks on the query to the number of clicks on all search results; S4、由选择的query得到具有所述预设需求类型的时效性query。S4. Obtain a time-sensitive query with the preset demand type from the selected query. 2.根据权利要求1所述的方法,其特征在于,所述需求类型的模式由短语、词组、词语、属性标识、分割符号中的一种或任意组合构成。2. The method according to claim 1, wherein the pattern of the requirement type is composed of one or any combination of phrases, phrases, words, attribute identifiers, and segmentation symbols. 3.根据权利要求1或2所述的方法,其特征在于,所述需求类型的模式的挖掘具体包括:3. The method according to claim 1 or 2, wherein the mining of the pattern of the demand type specifically comprises: A1、获取具有时效性的网页标题作为语料;A1. Obtain time-sensitive web page titles as corpus; A2、将获取的语料进行聚类;A2, clustering the acquired corpus; A3、分别针对聚类结果各类别中的网页标题执行步骤A31至步骤A33:A3. Perform steps A31 to A33 for the web page titles in each category of the clustering results: A31、将网页标题进行切词;A31. 
Segment the title of the web page into words; A32、将切词结果中的命名实体替换为对应的命名实体类型标记;A32, replace the named entity in the word segmentation result with the corresponding named entity type mark; A33、确定切词结果的n元词组n-gram,n为预设的一个或多个正整数,统计各n-gram在该网页标题所在类别中的出现次数,抽取出现次数满足预设选词要求的n-gram作为所述需求类型的模式。A33. Determine the n-gram n-gram of the word segmentation result, n is one or more preset positive integers, count the number of occurrences of each n-gram in the category of the webpage title, and extract the number of occurrences to meet the preset selected words The required n-grams serve as the schema for the type of requirement. 4.根据权利要求3所述的方法,其特征在于,所述步骤A1具体包括:4. The method according to claim 3, wherein the step A1 specifically comprises: 从预设需求类型的新闻网站获取网页标题作为语料;或者,Obtain webpage titles as corpus from news websites of preset demand types; or, 从搜索日志中获取预设需求类型的时效性种子query对应的被点击网页标题作为语料。The title of the clicked webpage corresponding to the time-sensitive seed query of the preset demand type is obtained from the search log as the corpus. 5.根据权利要求3所述的方法,其特征在于,在所述步骤A32和步骤A33之间还包括:5. The method according to claim 3, characterized in that, between said step A32 and step A33, further comprising: 查找同义词词表,将切词结果中的词语归一化为同义词词根。Look up the synonym vocabulary and normalize the words in the word segmentation results into synonym roots. 6.根据权利要求3所述的方法,其特征在于,在所述步骤A33之后还包括:6. The method according to claim 3, characterized in that, after the step A33, further comprising: A34、对所述需求类型的模式进行验证,保留验证通过的模式,其中所述验证的过程具体包括:将具有时效性的网页标题作为正例集合,将具有非时效性的网页标题作为负例集合;针对每一个模式分别计算该模式匹配到的正例集合中的样本数与该模式匹配到的正例集合中的样本数和负例集合中的样本数之和的比值,将计算的比值作为该模式的得分;如果得分大于预设的得分阈值,则验证通过。A34. 
Verify the pattern of the requirement type, and keep the verified pattern, wherein the verification process specifically includes: taking the time-sensitive webpage title as a positive example set, and taking the non-time-sensitive webpage title as a negative example set; for each pattern, calculate the ratio of the number of samples in the positive example set matched by the pattern to the sum of the number of samples in the positive example set matched by the pattern and the number of samples in the negative example set, and the calculated ratio As the score of the mode; if the score is greater than the preset score threshold, the verification is passed. 7.根据权利要求1所述的方法,其特征在于,所述步骤S4包括:7. The method according to claim 1, wherein said step S4 comprises: 分别统计各query对应的具有该预设需求类型的模式的网页标题所来源的网站数量,将来源网站数量低于预设数量阈值的网页标题所对应的query从所述选择的query中过滤掉;和/或,Statistically count the website quantity corresponding to each query corresponding to the webpage title with the pattern of the preset demand type, and filter out the query corresponding to the webpage title whose source website quantity is lower than the preset quantity threshold; and / or, 将包含预设的黑名单词语的query从所述选择的query中过滤掉。Filter out queries containing preset blacklist words from the selected queries. 8.根据权利要求1所述的方法,其特征在于,所述步骤S4包括:8. 
The method according to claim 1, wherein said step S4 comprises: 获取所述预设需求类型的垂直搜索日志中搜索次数大于预设次数阈值的query,将获取的query与所述选择的query取交集,得到具有所述预设需求类型的时效性query;或者,Obtaining a query whose search times are greater than a preset threshold in the vertical search log of the preset demand type, and intersecting the acquired query with the selected query to obtain a time-sensitive query with the preset demand type; or, 分别获取所述选择的query在所述预设需求类型的垂直搜索日志中的搜索次数,保留搜索次数大于预设次数阈值的query作为具有所述预设需求类型的时效性query。The search times of the selected query in the vertical search log of the preset demand type are obtained respectively, and the queries whose search times are greater than a preset threshold are reserved as time-sensitive queries with the preset demand type. 9.根据权利要求1所述的方法,其特征在于,在所述步骤S4之后还包括:9. The method according to claim 1, further comprising after the step S4: 将具有所述预设需求类型的时效性query与采用其他方式挖掘出的具有所述预设需求类型的query进行合并,得到最终挖掘出的具有所述预设需求类型的query。The time-sensitive query with the preset demand type is combined with the query with the preset demand type mined in other ways to obtain the finally mined query with the preset demand type. 10.一种基于时效性的需求挖掘装置,其特征在于,该装置包括:10. 
A demand excavation device based on timeliness, characterized in that the device comprises: 数据获取单元,用于从搜索日志中获取所选时间段的点击数据,所述点击数据至少包括用户的搜索项query以及query对应的搜索结果中被点击网页标题;A data acquisition unit, configured to acquire click data for a selected time period from the search log, the click data at least including the user's search item query and the title of the clicked webpage in the search results corresponding to the query; query获取单元,用于从所述点击数据中获取对应网页标题满足预设需求类型的模式的query;A query acquisition unit, configured to acquire a query corresponding to a web page title that meets a preset demand type from the click data; query选择单元,用于分别计算所述query获取单元获取的各query的点击率,选择点击率大于或等于预设比例阈值的query,其中query的点击率为:query在具有所述预设需求类型的模式的网页标题上的点击数占该query在所有搜索结果上的点击数的比例;The query selection unit is used to separately calculate the click-through rate of each query acquired by the query acquisition unit, and select the query whose click-through rate is greater than or equal to a preset ratio threshold, wherein the click-through rate of the query is: the query has the preset demand type The proportion of the number of clicks on the title of the web page of the pattern to the number of clicks of the query on all search results; query确定单元,用于由所述query选择单元选择的query得到具有所述预设需求类型的时效性query。The query determination unit is configured to obtain the time-sensitive query with the preset demand type from the query selected by the query selection unit. 11.根据权利要求10所述的装置,其特征在于,所述需求类型的模式由短语、词组、词语、属性标识、分割符号中的一种或任意组合构成。11. The device according to claim 10, wherein the pattern of the requirement type is composed of one or any combination of phrases, phrases, words, attribute identifiers, and segmentation symbols. 12.根据权利要求10或11所述的装置,其特征在于,该装置还包括:模式挖掘单元;12. 
The device according to claim 10 or 11, further comprising: a pattern mining unit; 所述模式挖掘单元具体包括:The pattern mining unit specifically includes: 语料获取子单元,用于获取具有时效性的网页标题作为语料;The corpus acquisition subunit is used to acquire time-sensitive web page titles as corpus; 聚类子单元,用于将获取的语料进行聚类;The clustering subunit is used to cluster the acquired corpus; 模式提取子单元,用于分别针对聚类结果各类别中的网页标题执行:将网页标题进行切词,将切词结果中的命名实体替换为对应的命名实体类型标记,确定切词结果的n元词组n-gram,n为预设的一个或多个正整数,统计各n-gram在该网页标题所在类别中的出现次数,抽取出现次数满足预设选词要求的n-gram作为所述需求类型的模式。The pattern extraction sub-unit is used to execute separately for the webpage titles in each category of the clustering results: segment the webpage titles, replace the named entities in the word segmentation results with the corresponding named entity type tags, and determine the n of the word segmentation results Metaphrase n-gram, n is one or more preset positive integers, count the number of occurrences of each n-gram in the category of the webpage title, and extract the n-gram whose occurrence times meet the preset word selection requirements as the described The schema of the requirement type. 13.根据权利要求12所述的装置,其特征在于,所述语料获取子单元从预设需求类型的新闻网站获取网页标题作为语料;或者,13. The device according to claim 12, wherein the corpus acquisition subunit acquires a web page title as a corpus from a news website of a preset demand type; or, 从搜索日志中获取预设需求类型的时效性种子query对应的被点击网页标题作为语料。The title of the clicked webpage corresponding to the time-sensitive seed query of the preset demand type is obtained from the search log as the corpus. 14.根据权利要求12所述的装置,其特征在于,所述模式提取子单元在将切词结果中的命名实体替换为对应的命名实体类型标记之后,且在确定切词结果的n-gram之前,还用于查找同义词词表,将切词结果中的词语归一化为同义词词根。14. 
The device according to claim 12, wherein, before replacing the named entities in the word segmentation result with the corresponding named-entity type identifiers and determining the n-grams of the word segmentation result, the pattern extraction subunit is further configured to look up a synonym vocabulary and normalize the words in the word segmentation result into synonym roots.
15. The device according to claim 12, wherein the pattern extraction subunit is further configured to verify the patterns of the requirement type and retain the patterns that pass verification, wherein the verification specifically comprises: taking time-sensitive webpage titles as a positive example set and non-time-sensitive webpage titles as a negative example set; for each pattern, calculating the ratio of the number of samples in the positive example set matched by the pattern to the sum of the number of samples in the positive example set and the number of samples in the negative example set matched by the pattern, and taking the calculated ratio as the score of the pattern; and if the score is greater than a preset score threshold, the pattern passes verification.
16. The device according to claim 10, wherein the query determining unit comprises a first filtering subunit and/or a second filtering subunit; the first filtering subunit is configured to count, for each query, the number of websites from which the corresponding webpage titles having a pattern of the preset requirement type originate, and to filter out of the selected queries any query corresponding to webpage titles whose number of source websites is lower than a preset quantity threshold; the second filtering subunit is configured to filter out of the selected queries any query containing a preset blacklist word.
17. The device according to claim 10, wherein the query determining unit comprises a first determining subunit or a second determining subunit; the first determining subunit is configured to obtain, from a vertical search log of the preset requirement type, the queries whose number of searches is greater than a preset count threshold, and to intersect the obtained queries with the selected queries to obtain the time-sensitive queries having the preset requirement type; the second determining subunit is configured to obtain the number of searches of each selected query in the vertical search log of the preset requirement type, and to retain the queries whose number of searches is greater than the preset count threshold as the time-sensitive queries having the preset requirement type.
18. The device according to claim 10, further comprising a mining and merging unit configured to merge the time-sensitive queries having the preset requirement type with the queries having the preset requirement type mined by other requirement mining devices, to obtain the finally mined queries having the preset requirement type.
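The pattern verification, query filtering, and vertical-log check recited in the device claims above can be illustrated with a minimal sketch. It is not part of the claims and is not the applicant's implementation. All function names, the sample data, and the thresholds are hypothetical. A regular expression stands in for a requirement-type pattern, blacklist words are matched as substrings, and both filters are applied even though the claims permit either one alone.

```python
import re


def pattern_score(matches, positive_titles, negative_titles):
    """Score = matched positives / (matched positives + matched negatives)."""
    pos = sum(1 for title in positive_titles if matches(title))
    neg = sum(1 for title in negative_titles if matches(title))
    # A pattern that matches nothing gets score 0.0 (an assumption of this sketch).
    return pos / (pos + neg) if pos + neg else 0.0


def filter_queries(selected, source_sites, min_sites, blacklist):
    """Drop queries backed by too few source websites or containing a blacklisted word."""
    kept = set()
    for query in selected:
        if len(source_sites.get(query, set())) < min_sites:
            continue  # first filter: too few distinct source websites
        if any(word in query for word in blacklist):
            continue  # second filter: query contains a blacklisted word
        kept.add(query)
    return kept


def timely_queries(selected, vertical_counts, min_searches):
    """Keep queries searched more than min_searches times in the vertical search log."""
    return {q for q in selected if vertical_counts.get(q, 0) > min_searches}


if __name__ == "__main__":
    # Hypothetical stand-in for a mined pattern: titles containing the word "live".
    matches = re.compile(r"\blive\b").search
    positives = ["2011 NBA finals live stream",
                 "live stream of tonight's match",
                 "latest live score"]
    negatives = ["history of live television", "how to cook rice"]
    print(pattern_score(matches, positives, negatives))

    selected = {"nba finals", "free nba finals download", "local derby"}
    sites = {"nba finals": {"a.example", "b.example", "c.example"},
             "free nba finals download": {"a.example", "b.example"},
             "local derby": {"a.example"}}
    kept = filter_queries(selected, sites, min_sites=2, blacklist={"download"})
    print(timely_queries(kept, {"nba finals": 120, "local derby": 5}, min_searches=100))
```

Retaining the selected queries whose vertical-log search count exceeds the threshold yields the same set as intersecting the selected queries with the high-count queries of the vertical log, so one function covers both determining subunits in this sketch.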

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110379120.3A CN103136219B (en) 2011-11-24 2011-11-24 Timeliness-based requirement mining method and device

Publications (2)

Publication Number Publication Date
CN103136219A true CN103136219A (en) 2013-06-05
CN103136219B CN103136219B (en) 2016-08-17

Family

ID=48496055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110379120.3A Active CN103136219B (en) 2011-11-24 2011-11-24 Timeliness-based requirement mining method and device

Country Status (1)

Country Link
CN (1) CN103136219B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101055587A (en) * 2007-05-25 2007-10-17 清华大学 Search engine retrieving result reordering method based on user behavior information
US20090313235A1 (en) * 2008-06-12 2009-12-17 Microsoft Corporation Social networks service
CN101398856A (en) * 2008-11-12 2009-04-01 北京搜狗科技发展有限公司 Method for acquiring navigation enquiry words, device and method for displaying searching result
CN102129422A (en) * 2010-01-14 2011-07-20 富士通株式会社 Template extraction method and device
CN102004792A (en) * 2010-12-07 2011-04-06 百度在线网络技术(北京)有限公司 Method and system for generating hot-searching word

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462259A (en) * 2014-11-21 2015-03-25 百度在线网络技术(北京)有限公司 Method and equipment for providing search result of time-efficient picture
CN107408122A (en) * 2015-02-25 2017-11-28 微软技术许可有限责任公司 effective retrieval of fresh Internet content
CN107408122B (en) * 2015-02-25 2021-05-14 微软技术许可有限责任公司 Media and method for efficient retrieval of fresh internet content
CN105095434A (en) * 2015-07-23 2015-11-25 百度在线网络技术(北京)有限公司 Recognition method and device for timeliness requirement
WO2017012222A1 (en) * 2015-07-23 2017-01-26 百度在线网络技术(北京)有限公司 Time-sensitivity processing requirement identification method, device, apparatus and non-volatile computer storage medium
CN105095434B (en) * 2015-07-23 2019-03-29 百度在线网络技术(北京)有限公司 Recognition method and device for timeliness requirement
CN105468782B (en) * 2015-12-21 2019-05-17 北京奇虎科技有限公司 Method and device for judging matching rate of query and resources
CN105468782A (en) * 2015-12-21 2016-04-06 北京奇虎科技有限公司 Method and device for judging matching rate of query and resources
CN106919603A (en) * 2015-12-25 2017-07-04 北京奇虎科技有限公司 The method and apparatus for calculating participle weight in query word pattern
WO2018054352A1 (en) * 2016-09-23 2018-03-29 腾讯科技(深圳)有限公司 Item set determination method, apparatus, processing device, and storage medium
CN108268552B (en) * 2016-12-30 2020-08-11 北京国双科技有限公司 Website information processing method and device
CN108268552A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 The processing method and processing device of site information
CN109582874A (en) * 2018-12-10 2019-04-05 北京搜狐新媒体信息技术有限公司 A kind of related news method for digging and system based on two-way LSTM

Also Published As

Publication number Publication date
CN103136219B (en) 2016-08-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant