CN101320387A - Web page text and image ranking method based on user attention time - Google Patents
- Publication number
- CN101320387A
- Authority
- CN
- China
- Prior art keywords
- user
- attention time
- time
- attention
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for ranking web page text and images based on user attention time. The method comprises the following steps: 1) using attention time to personalize an existing web page ranking so that the results match what the user expects; 2) using a custom browser to collect samples of attention time on text; 3) using a custom browser to collect samples of attention time on images; 4) correcting the collected attention-time samples; 5) predicting the attention time of unseen web pages from text and image similarity; 6) combining attention time with conventional search technology to produce a personalized ranking of web pages and images. The invention incorporates the user's preferences into the search process, so that the final ranking is closer to the ideal ranking the user expects and web text and image search engines can give the user better personalized service.
Description
Technical field
The invention relates to the field of computer search, and in particular to a method for ranking web page text and images based on user attention time.
Background
Existing personalization engines rely on user feedback, which may be explicit or implicit. User preferences can be derived from both kinds of feedback (Salton & Buckley 1990; White, Jose, & Ruthven 2001; White, Ruthven, & Jose 2002). Users are generally unwilling to provide explicit feedback, however, so research has increasingly turned to implicit feedback (Granka, Joachims, & Gay 2004; Guan & Cutrell 2007; Fu 2007). Studies show that implicit feedback reflects the user's search intent well (Fox et al. 2005; Dou, Song, & Wen 2007; Fu 2007), and that preferences inferred from a large volume of implicit feedback are often more reliable than those obtained from explicit feedback.
Query history: The most widely used form of implicit feedback in current research is the user's query history. Google's personalized search (http://www.google.com/psearch) is based on it. Algorithms based on query history fall into two classes: those that use the entire query history, and those that use a single query session (a series of related queries). Algorithms of the first class usually generate a profile document that describes the user's search preferences.
Click data: Click data is another very important form of implicit feedback (Dupret, Murdock, & Piwowarski 2007; Joachims 2002). On a search results page, links the user has clicked are assumed to matter more to that user than links the user has not clicked. Researchers have used many methods to derive user preferences from click behavior. For example, some researchers use the Ranking SVM algorithm (Hersh et al. 1994) to obtain the best page ranking for a user from that user's clicks. Radlinski & Joachims (2005) extract preferences not only from a single query but also from a series of queries the user issues for the same information; these preference features are then used for training with a modified Ranking SVM algorithm. Sun et al. (2005) proposed an algorithm based on Singular Value Decomposition that analyzes click data to improve the accuracy of a search engine's recommendation system.
Attention time: Attention time is a comparatively new form of implicit user feedback. Although recent studies mention it more and more often, whether it really reflects user intent is still debated. Kelly and Belkin (2004; 2001) suggest that there is no reliable correlation between the attention time on a document and its usefulness to the user. In their studies, however, attention time was the average measured over a group of users reading articles on different topics. Halabi et al. (2007) argue that the attention time of a single user within the same search session reflects that user's preferences well. We do not consider the two findings contradictory, because the attention times they compute are not the same quantity. In this work, we assume that the attention time of a single user or on a single topic reflects the user's preferences well.
Summary of the invention
The purpose of the invention is to overcome the deficiencies of the prior art and to provide a personalized web page ranking method based on attention time.
The method for ranking web page text and images based on user attention time comprises the following steps:
1) using attention time to personalize an existing web page ranking so that the results match what the user expects;
2) using a custom browser to collect samples of attention time on text;
3) using a custom browser to collect samples of attention time on images;
4) correcting the collected attention-time samples;
5) predicting the attention time of unseen web pages from text and image similarity;
6) combining attention time with conventional search technology to produce a personalized ranking of web pages and images.
In the step of using attention time to personalize an existing web page ranking, attention time serves as the source of the user's implicit feedback. The user's preferences are learned from it, the attention time of web pages or images the user has not yet viewed is predicted, and the results are finally ranked by predicted attention time. Attention time is the time a user spends reading or viewing a web page or image.
In the step of using a custom browser to collect attention-time samples for text, the client is a custom browser. For text search, the search engine usually shows a few lines of summary for each document on the results page; the browser tracks the position of the mouse to record how long the user spends on each document there. On an opened page, the browser records the time the user is active on that page. The attention time for a document is the time spent reading its summary plus the time spent reading the full document. If the user later returns to a page already viewed, that page's attention time increases accordingly.
In the step of using a custom browser to collect attention-time samples for images, the client is a custom browser. For image search, the search engine displays a thumbnail of each image on the results page. Likewise, the attention time is the time the user spends looking at the thumbnail plus the time spent looking at the original image. If a document contains both text and images, its attention time is the sum of the two.
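The bookkeeping described in these collection steps (summary or thumbnail time plus full-view time, revisits adding to the total, text and image times summed) can be sketched as follows. The class and method names are hypothetical, and the durations are assumed to have been measured already by the browser; how the browser derives them from mouse tracking is not modeled here.

```python
from collections import defaultdict


class AttentionLog:
    """Accumulates per-document attention time in seconds (illustrative only)."""

    def __init__(self):
        # doc_id -> total seconds; every kind of viewing adds to the same total
        self._seconds = defaultdict(float)

    def record(self, doc_id, seconds):
        """Add one viewing interval (summary, thumbnail, full page, or revisit)."""
        if seconds < 0:
            raise ValueError("duration must be non-negative")
        self._seconds[doc_id] += seconds

    def attention_time(self, doc_id):
        """Total attention time recorded so far for doc_id (0.0 if never seen)."""
        return self._seconds.get(doc_id, 0.0)


log = AttentionLog()
log.record("page-1", 2.5)   # time over the result-page summary
log.record("page-1", 40.0)  # time on the opened page
log.record("page-1", 10.0)  # the user comes back to the page later
print(log.attention_time("page-1"))  # 52.5
```

Keeping a single running total per document means a revisit needs no special handling: it is just another interval added to the same entry.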
In the step of correcting the collected attention-time samples, the correction formula is as follows:
where $t_{att}^{raw}$ is the collected attention time, $t_{basic}(u)$ is the time user $u$ needs to decide whether a document is worth reading, and $t_{att}^{inf}(u,d)$ is the underlying attention time contained in document $d$.
The step of predicting the attention time of unseen web pages from text and image similarity comprises:
a) Let $Sim(d_0, d_1)$ denote the similarity between documents $d_0$ and $d_1$, with $Sim(d_0, d_1) \in [0, 1]$. Before the similarity of two documents is computed, advertisements, markup tags in the page source, and the page's navigation bar are removed;
b) The training samples are written as $\{t_{att}(u, d_i) \mid i = 1, \ldots, n\}$, where $n$ is the number of documents the current user has read and the read documents are denoted $d_i$ $(i = 1, \ldots, n)$. When the user encounters a new document $d_x$, its similarity to every document in the test set (the documents the user has already read) is computed and the $k$ most similar documents are selected, with $k = \min(10, n)$. Denoting the selected documents $d_i$ $(i = 1, \ldots, k)$, the attention time of $d_x$ is predicted with the following equation,
where $\gamma$ controls how much weight the value of $Sim(\cdot,\cdot)$ carries, $\varepsilon$ is a small positive number that prevents the denominator of the expression from being zero, and the function $\delta(\cdot,\cdot)$ removes documents whose similarity is very low. It is defined as:
The step of combining attention time with conventional search technology to produce a personalized ranking of web pages and images comprises:
c) When the user submits a query, the server first forwards the query to a conventional search engine and retrieves the first $n$ pages returned. For each returned page, the system finds the $k$ samples in the user's sample set with the highest text or image similarity and predicts the page's attention time using the method of claim 8;
d) An attention-time offset is generated from the conventional ranking: the higher a document ranks in the conventional ranking, the larger its attention-time offset. The offset is defined by the following formula
where $rank(i)$ is the rank of document $i$ in Google's ranking, and the parameter $\kappa$ controls how steeply the attention time falls as the rank drops;
e) From the attention time $t_{atten}(i)$ of document $i$ and its offset $t_{atten}^{offset}(i)$, the overall attention time of document $i$ is obtained:
f) The final ranking lists the documents in descending order of overall attention time.
The invention incorporates the user's preferences into the search process, so that the final ranking is closer to the ideal ranking the user expects and web text and image search engines can give the user better personalized service.
Description of drawings
Fig. 1 is a flowchart of the specific embodiment;
Fig. 2 is a screenshot of the custom browser used in this example;
Fig. 3 shows the results of 14 groups of text search experiments; the data are given in Table 2;
Fig. 4 plots 7 groups of image search experiments; the data are given in Tables 3 and 4. Each group was carried out by a different user under the same settings. The plot shows the average rank of the images the user wanted; the smaller the average, the earlier those images appear in the search results.
Detailed description
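Example:
The flow of the method for ranking web page text and images based on user attention time is shown in Fig. 1. The personalized ranking system has a client part and a server part. The client is a custom browser 20 that captures the user's attention time. The server comprises a sample collection module 30, an attention time correction module 40, a user database 50, a document database 60, a query interface 70, a conventional engine module 80, a document preprocessing module 90, a document comparison module 100, an attention time prediction module 110, and a ranking module 120.
The custom browser 20 tracks and analyzes the user's mouse movements to derive the user's attention time on each document. Fig. 2 shows attention times recorded by the custom browser we developed for this example.
The sample collection module 30 stores the sample data sent by the client in the corresponding user's database. If a document does not yet exist in the document database, it is downloaded and stored there.
The attention time correction module 40 corrects the attention time obtained directly from the client. When a user views a document, the user has to spend some time skimming it whether or not it turns out to be useful. The measured attention time therefore generally includes both the user's actual attention time and the time spent skimming the document. To overcome this problem, we correct the originally measured attention time with the following equation:
where $t_{att}^{raw}$ is the originally measured attention time, $t_{basic}(u)$ is the time user $u$ needs to decide whether a document is worth reading, and $t_{att}^{inf}(u,d)$ is the underlying attention time contained in document $d$.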
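To make the idea of the correction concrete, the sketch below assumes the simplest form consistent with the motivation given for module 40: subtract the per-user skim time $t_{basic}(u)$ from the raw measurement and clamp the result at zero. This form is an assumption for illustration only, not the patent's own formula; in particular it does not model $t_{att}^{inf}(u,d)$, and the function name is hypothetical.

```python
def corrected_attention_time(t_raw, t_basic):
    """Assumed correction: remove the skim time, never going below zero.

    t_raw   -- attention time as measured by the client, in seconds
    t_basic -- time this user typically needs to judge a document, in seconds
    """
    return max(0.0, t_raw - t_basic)


print(corrected_attention_time(45.0, 5.0))  # 40.0
print(corrected_attention_time(3.0, 5.0))   # 0.0 (the user only skimmed)
```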
The user database 50 stores each user's attention times on documents; in this example it is a MySQL database.
The document database 60 stores document data (text web pages and images); in this example it is a MySQL database.
The query interface 70 provides a web entry point for user queries and offers two services, text search and image search. In this example it is implemented in JSP.
The conventional engine module 80: when the user submits a query, the server parses the results pages of a conventional search engine (such as Google), takes the first 300 documents from the returned results, and downloads them to the document server.
The document preprocessing module 90: web pages downloaded directly from websites contain much useless information, such as HTML tags, advertisement bars, and navigation bars. This module removes that information and keeps the main content the user will attend to. In this example, we implemented the removal of HTML tags.
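As an illustration of the tag-removal part of the preprocessing, the following sketch uses Python's standard `html.parser` module. It is not the implementation used in the example system; the class and function names are hypothetical, and skipping the contents of `<script>` and `<style>` is an added assumption. Advertisement and navigation-bar removal are not attempted.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup
        return " ".join(" ".join(self._parts).split())


def strip_html(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_html("<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"))
# Title Some bold text.
```

Because text nodes are joined with spaces, a word split by inline markup (for example `te<b>xt</b>`) would come out as two tokens; a production preprocessor would need to handle that case.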
The document comparison module 100 uses the extended Jaccard (Tanimoto) method as its text similarity measure, and a measure based on the auto color correlogram (Huang et al. 1997) as its image similarity measure.
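For reference, the extended Jaccard (Tanimoto) coefficient of two vectors $a$ and $b$ is $a \cdot b / (\lVert a \rVert^2 + \lVert b \rVert^2 - a \cdot b)$, which lies in $[0, 1]$ for non-negative vectors. The sketch below applies it to raw term-frequency vectors; the tokenization (lowercasing and splitting on whitespace) and the function name are assumptions, since the text does not say how documents are vectorized. The image measure is not sketched.

```python
from collections import Counter


def extended_jaccard(text_a, text_b):
    """Tanimoto coefficient of two term-frequency vectors, in [0, 1]."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over shared terms; Counter returns 0 for missing terms
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sum(c * c for c in a.values())
    norm_b = sum(c * c for c in b.values())
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 0.0


print(extended_jaccard("web search ranking", "web search engine"))  # 0.5
```

Identical texts score 1.0 and texts with no terms in common score 0.0, which matches the $[0, 1]$ range required of $Sim(\cdot,\cdot)$.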
The attention time prediction module 110 performs the following steps:
a) The module predicts the attention time of every document returned by the conventional engine module. First, the training samples are written as $\{t_{att}(u, d_i) \mid i = 1, \ldots, n\}$, where $n$ is the number of documents the current user has read. The read documents are denoted $d_i$ $(i = 1, \ldots, n)$. For a document $d_x$ returned by the conventional engine, we compute its similarity to every document in the test set (the documents the user has already read) and select the $k$ documents with the highest similarity. In our experiments, we set $k = \min(10, n)$. Denoting the selected documents $d_i$ $(i = 1, \ldots, k)$, we then predict the attention time of $d_x$ with the following equation.
where $\gamma$ controls how much weight the value of $Sim(\cdot,\cdot)$ carries and $\varepsilon$ is a small positive number that prevents the denominator of the expression from being zero. The function $\delta(\cdot,\cdot)$ removes documents whose similarity is very low; it is defined as
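As a hedged illustration of step a), and not the patent's own equations, the sketch below assumes a similarity-weighted average of the attention times of the $k$ nearest read documents, with weights $Sim^{\gamma} \cdot \delta$, an $\varepsilon$ term in the denominator, and $\delta$ implemented as a hard threshold on similarity. The functional form, the default values of `gamma`, `eps`, and `min_sim`, and all names are assumptions chosen to match the roles the text assigns to $\gamma$, $\varepsilon$, and $\delta$.

```python
def predict_attention_time(samples, sim_to_new, gamma=2.0, eps=1e-6, min_sim=0.01):
    """Predict the attention time of a new document d_x.

    samples    -- list of (doc_id, attention_time) for documents the user has read
    sim_to_new -- dict mapping doc_id to Sim(d_i, d_x), each value in [0, 1]
    gamma      -- exponent controlling how strongly similarity is weighted (assumed)
    eps        -- small positive number keeping the denominator non-zero
    min_sim    -- assumed cut-off below which delta drops a document
    """
    k = min(10, len(samples))
    # Keep the k read documents most similar to the new one
    nearest = sorted(samples, key=lambda s: sim_to_new[s[0]], reverse=True)[:k]
    num = 0.0
    den = eps
    for doc_id, t_att in nearest:
        sim = sim_to_new[doc_id]
        delta = 1.0 if sim > min_sim else 0.0  # filter out near-zero similarity
        weight = (sim ** gamma) * delta
        num += weight * t_att
        den += weight
    return num / den


samples = [("a", 30.0), ("b", 10.0), ("c", 100.0)]
sims = {"a": 0.5, "b": 0.5, "c": 0.0}
print(predict_attention_time(samples, sims))  # close to 20: "c" is filtered out
```

With no sufficiently similar read document, the numerator is zero and the prediction falls back to 0.0 rather than raising a division error, which is the role the text gives to $\varepsilon$.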
b) During the early stage of system operation, we also convert the conventional engine's ranking into an attention-time offset. We use the following equation to convert the conventional rank into a normalized attention-time offset with a value between 0 and 1:
where $rank(i)$ is the rank of document $i$ in the conventional search engine. We chose this form because it converts ranking information into attention time and gives lower-ranked documents a correspondingly shorter attention time. The parameter $\kappa_d$ controls how steeply the attention time falls as the rank drops; in our experiments, we set it to 0.2.
c) Once we have the attention time $t_{atten}(i)$ of document $i$ and its offset $t_{atten}^{offset}(i)$, we can obtain the document's overall attention time:
The ranking module 120 sorts all documents in descending order of overall attention time and returns the result to the user.
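Steps b) and c) and the ranking module can be illustrated together. The sketch below assumes an offset of the form $2e^{-\kappa_d \cdot rank} / (1 + e^{-\kappa_d \cdot rank})$, which lies between 0 and 1 and shrinks as the rank number grows, and assumes the overall attention time is the predicted time plus a weighted offset. Both forms, the `offset_weight` parameter, and the function names are assumptions for illustration; only $\kappa_d = 0.2$ comes from the text.

```python
import math


def rank_offset(rank, kappa_d=0.2):
    """Assumed decay: in (0, 1) for rank >= 1 and decreasing as rank grows."""
    e = math.exp(-kappa_d * rank)
    return 2.0 * e / (1.0 + e)


def rerank(results, predicted, offset_weight=1.0):
    """Re-order search results by overall attention time.

    results   -- doc ids in the conventional engine's order (rank 1 first)
    predicted -- dict mapping doc_id to predicted attention time
    Returns the doc ids in descending order of overall attention time.
    """
    overall = {
        doc: predicted.get(doc, 0.0) + offset_weight * rank_offset(rank)
        for rank, doc in enumerate(results, start=1)
    }
    return sorted(results, key=lambda doc: overall[doc], reverse=True)


print(rerank(["a", "b", "c"], {"a": 0.1, "b": 0.9, "c": 0.1}))  # ['b', 'a', 'c']
```

In this toy input, document "b" moves to the top because its predicted attention time outweighs the rank-1 offset of "a", while "a" and "c", with equal predictions, keep the conventional engine's order.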
The experimental results in Tables 1 to 4 clearly show the advantage of this method.
Table 1 shows the ranks of the top 17 documents returned by a text search with the keywords "Web search technology". From left to right, the columns give the user's ideal rank, the rank assigned by the web search engine Google, and the rank after the user has read 2, 5, 8, 10, and 15 web pages. The last row gives the sum of absolute rank errors between each ranking and the user's ideal ranking.
Table 1
Table 2 gives the data from text search experiments in which 14 different users searched for different keywords. Each row gives, for one experiment, the sum of absolute rank errors between the resulting rankings and the user's ideal ranking. These data are also plotted in Fig. 3.
Table 2
Table 3 gives the data from an image search experiment with the keyword "Picasso". The user wanted to find Picasso's self-portraits, and only 6 of the 60 images met that need. The columns give the ranks of those images in Google image search and under this method: RkGoogle is their rank in Google image search, and Rk1st, Rk2nd, and Rk3rd are their ranks after the user has viewed pages 1, 2, and 3 of the search results. The last row is the average rank of these images in each case; the smaller the average rank, the earlier the images the user wants appear.
Table 3
Table 4 gives the data from 6 further image search experiments. In each, the user was asked to find the images he or she needed among 60 images. The first column is the search keyword and the second is the number of images the user needed. RkGoogle is the average rank of the needed images in Google image search; Rk1st, Rk2nd, and Rk3rd are the average ranks of the needed images after the user has viewed pages 1, 2, and 3 of the search results.
Table 4
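The two evaluation measures named in the table descriptions, the sum of absolute rank errors against the user's ideal ranking and the average rank of the wanted images, can be computed as in the sketch below. The function names and the toy inputs are illustrative assumptions and are not data from the experiments.

```python
def sum_abs_rank_error(ranking, ideal):
    """Sum over documents of |position in ranking - position in ideal ranking|."""
    ideal_pos = {doc: i for i, doc in enumerate(ideal, start=1)}
    return sum(abs(i - ideal_pos[doc]) for i, doc in enumerate(ranking, start=1))


def average_rank(ranking, wanted):
    """Mean 1-based position of the wanted items within ranking."""
    positions = [i for i, doc in enumerate(ranking, start=1) if doc in wanted]
    return sum(positions) / len(positions)


ideal = ["a", "b", "c", "d"]
print(sum_abs_rank_error(["b", "a", "c", "d"], ideal))  # 2
print(average_rank(["x", "a", "y", "b"], {"a", "b"}))   # 3.0
```

For both measures a smaller value is better: a rank error of 0 means the ranking equals the ideal one, and a lower average rank means the wanted images appear earlier.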
The tables above show that the invention incorporates the user's preferences into the search process, so that the final ranking is closer to the ideal ranking the user expects and web text and image search engines can give the user better personalized service.
The above is only a preferred embodiment of the attention-time-based, user-oriented personalized web page ranking method and system of the invention and is not intended to limit the scope of its substantive technical content. That content is broadly defined in the claims; any technical entity or method completed by others that is identical to what the claims define, or is an equivalent variation of it, is deemed to fall within the scope of protection of this patent.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNA2008101200029A CN101320387A (en) | 2008-07-11 | 2008-07-11 | Web page text and image ranking method based on user attention time |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNA2008101200029A CN101320387A (en) | 2008-07-11 | 2008-07-11 | Web page text and image ranking method based on user attention time |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN101320387A (en) | 2008-12-10 |
Family
ID=40180436
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNA2008101200029A (Pending) CN101320387A (en) | 2008-07-11 | 2008-07-11 | Web page text and image ranking method based on user attention time |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN101320387A (en) |
2008-07-11: CN application CNA2008101200029A (CN101320387A), status: Pending
Cited By (25)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101567004B (en) * | 2009-02-06 | 2012-05-30 | 浙江大学 | English text automatic summarization method based on eyeball tracking |
| TWI585596B (en) * | 2009-10-01 | 2017-06-01 | Alibaba Group Holding Ltd | How to implement image search and website server |
| CN101916264B (en) * | 2010-07-30 | 2012-09-19 | 浙江大学 | A personalized webpage recommendation method based on user facial expression and gaze distribution detection |
| CN101916264A (en) * | 2010-07-30 | 2010-12-15 | 浙江大学 | A personalized webpage recommendation method based on user facial expression and gaze distribution detection |
| CN102117332A (en) * | 2011-03-10 | 2011-07-06 | 辜进荣 | Given time-based searching method |
| CN103827863B (en) * | 2011-05-12 | 2017-02-15 | 谷歌公司 | Dynamic image display area and image display within web search results |
| CN102231165A (en) * | 2011-07-11 | 2011-11-02 | 浙江大学 | Method for searching and sequencing personalized web pages based on user retention time analysis |
| CN102231165B (en) * | 2011-07-11 | 2013-01-09 | 浙江大学 | Method for searching and sequencing personalized web pages based on user retention time analysis |
| CN103186565A (en) * | 2011-12-28 | 2013-07-03 | 中国移动通信集团浙江有限公司 | Method and device for judging user preference according to web browsing behavior of user |
| CN103186565B (en) * | 2011-12-28 | 2017-02-22 | 中国移动通信集团浙江有限公司 | Method and device for judging user preference according to web browsing behavior of user |
| CN103838727B (en) * | 2012-11-21 | 2018-01-19 | 华为技术有限公司 | A kind of generation method and user terminal of historical record and collection |
| CN103838727A (en) * | 2012-11-21 | 2014-06-04 | 华为技术有限公司 | Generation method for history records and favorites and user terminal |
| WO2014079196A1 (en) * | 2012-11-21 | 2014-05-30 | 华为技术有限公司 | Method for generating history record and favorites folder and user terminal |
| CN104281606B (en) * | 2013-07-08 | 2021-06-25 | 腾讯科技(北京)有限公司 | A method and device for displaying microblog comments |
| CN104281606A (en) * | 2013-07-08 | 2015-01-14 | 腾讯科技(北京)有限公司 | Method and device for displaying microblog comments |
| WO2015127782A1 (en) * | 2014-02-27 | 2015-09-03 | 优视科技有限公司 | Method and system for displaying webpage self-defined content |
| US10776564B2 (en) | 2014-02-27 | 2020-09-15 | Uc Mobile Co., Ltd. | Method and system for displaying webpage self-defined content |
| WO2015143910A1 (en) * | 2014-03-28 | 2015-10-01 | 北京奇虎科技有限公司 | Method and device for defining search engine result pages by user |
| CN104166741B (en) * | 2014-09-10 | 2018-09-18 | 北京国双科技有限公司 | Web page browsing analysis and processing method and device |
| CN104166741A (en) * | 2014-09-10 | 2014-11-26 | 北京国双科技有限公司 | Webpage browsing analysis and processing method and device |
| CN106156063A (en) * | 2015-03-30 | 2016-11-23 | 阿里巴巴集团控股有限公司 | Correlation technique and device for object picture search results ranking |
| WO2016155537A1 (en) * | 2015-03-30 | 2016-10-06 | 阿里巴巴集团控股有限公司 | Method and device for ranking search results of picture objects |
| CN107644028B (en) * | 2016-07-20 | 2020-09-04 | 平安科技(深圳)有限公司 | Method and system for collecting webpage data |
| CN107644028A (en) * | 2016-07-20 | 2018-01-30 | 平安科技(深圳)有限公司 | The collection method and system of web data |
| WO2018133681A1 (en) * | 2017-01-23 | 2018-07-26 | 腾讯科技(深圳)有限公司 | Method and device for sorting search results, server and storage medium |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN101320387A (en) | Web page text and image ranking method based on user attention time | |
| CN102246167B (en) | Providing search results | |
| US8538989B1 (en) | Assigning weights to parts of a document | |
| Guan et al. | Personalized tag recommendation using graph-based ranking on multi-type interrelated objects | |
| US8209616B2 (en) | System and method for interfacing a web browser widget with social indexing | |
| CN102929928B (en) | Multidimensional-similarity-based personalized news recommendation method | |
| US9348935B2 (en) | Systems and methods for augmenting a keyword of a web page with video content | |
| TWI471737B (en) | System and method for trail identification with search results | |
| US20110225152A1 (en) | Constructing a search-result caption | |
| CN101382939B (en) | Webpage Text Personalized Search Method Based on Eye Tracking | |
| CN105512285B (en) | Adaptive network reptile method based on machine learning | |
| CN101382938B (en) | Network video ordering method based on focusing time of users | |
| CN104036038A (en) | News recommendation method and system | |
| CN109165367B (en) | News recommendation method based on RSS subscription | |
| WO2010014082A1 (en) | Method and apparatus for relating datasets by using semantic vectors and keyword analyses | |
| CN102622417A (en) | Method and device for ordering information records | |
| US10339469B2 (en) | Self-adaptive display layout system | |
| US20110072025A1 (en) | Ranking entity relations using external corpus | |
| CN102236719A (en) | Page search engine based on page classification and quick search method | |
| Parikh et al. | Search engine optimization | |
| US20130013305A1 (en) | Method and subsystem for searching media content within a content-search service system | |
| Chaudhuri et al. | SHARE: Designing multiple criteria-based personalized research paper recommendation system | |
| CN101097580A (en) | Process for ordering network advertisement | |
| CN120011634A (en) | A news recommendation method based on image and text combination | |
| CN101382940B (en) | Web page image individuation search method based on eyeball tracking |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C12 | Rejection of a patent application after its publication | |
| | RJ01 | Rejection of invention patent application after publication | Open date: 20081210 |