
CN101320387A - Web page text and image ranking method based on user attention time - Google Patents

Info

Publication number
CN101320387A
Authority
CN
China
Prior art keywords
user
attention time
time
attention
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008101200029A
Other languages
Chinese (zh)
Inventor
徐颂华
江浩
刘智满
潘云鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CNA2008101200029A
Publication of CN101320387A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for ranking web page text and images based on user attention time. The method comprises the following steps: 1) using attention time to personalize an existing web page ranking so that the results match what the user has in mind; 2) using a custom browser to collect samples of text attention time; 3) using a custom browser to collect samples of image attention time; 4) correcting the collected attention time samples; 5) predicting the attention time of unseen web pages based on text and image similarity; 6) combining attention time with conventional search technology to produce a personalized ranking of web pages and images. The invention effectively incorporates the user's preferences into the search process and brings the final ranking closer to the ideal ranking the user expects, so that web page text and image search engines can provide better personalized service.

Description

Web page text and image ranking method based on user attention time

Technical field

The invention relates to the field of computer search, and in particular to a method for ranking web page text and images based on user attention time.

Background art

Existing personalization engines rely on user feedback, which can be divided into explicit feedback and implicit feedback. User preferences can be derived from both kinds of feedback (Salton & Buckley 1990; White, Jose, & Ruthven 2001; White, Ruthven, & Jose 2002). Users are generally unwilling to provide explicit feedback, however, so research has increasingly turned to implicit feedback (Granka, Joachims, & Gay 2004; Guan & Cutrell 2007; Fu 2007). Studies have shown that implicit feedback reflects the user's search intent well (Fox et al. 2005; Dou, Song, & Wen 2007; Fu 2007), and that user preferences derived from a large amount of implicit feedback are often more reliable than those derived from explicit feedback.

Query history: In current research, the most widely used form of implicit feedback is the user's query history. Google's personalized search (http://www.google.com/psearch) is based on the user's query history. Algorithms based on query history fall into two categories: those based on the entire query history, and those based on a single query session (a series of related queries). Algorithms of the first kind usually build a profile document for the user that describes the user's search preferences.

Click data: Click data is another very important form of implicit feedback (Dupret, Murdock, & Piwowarski 2007; Joachims 2002). On a search results page, links the user has clicked are assumed to matter more to that user than links the user has not clicked. Researchers have used many methods to derive user preferences from click behavior. For example, some researchers use an algorithm called Ranking SVM (Hersh et al. 1994) to learn the best web page ranking for a user from that user's clicks. In (Radlinski & Joachims 2005), the authors extract user preferences not only from single queries but also from chains of queries for the same information; these preference features are then used to train an improved version of Ranking SVM. Sun et al. (2005) proposed an algorithm based on Singular Value Decomposition that analyzes users' click data to improve the accuracy of a search engine's recommendation system.

Attention time: Attention time is a relatively new form of implicit user feedback. Although it is mentioned more and more often in recent work, whether it really reflects user intent is still debated. Kelly and Belkin (2004; 2001) suggest that there is no reliable correlation between the attention time spent on a document and its usefulness to the user. In their study, however, attention time was measured as the average over a group of users reading articles on different topics. Halabi et al. (2007) argue that, for a single user within the same search task, attention time reflects the user's preferences well. We do not consider these two studies contradictory, because the attention times they measure are not the same. Here we assume that attention time for a single user or a single topic reflects the user's preferences well.

Summary of the invention

The purpose of the invention is to overcome the deficiencies of the prior art and provide a personalized web page ranking method based on attention time.

The method for ranking web page text and images based on user attention time comprises the following steps:

1) using attention time to personalize an existing web page ranking so that the results match what the user has in mind;

2) using a custom browser to collect samples of text attention time;

3) using a custom browser to collect samples of image attention time;

4) correcting the collected attention time samples;

5) predicting the attention time of unseen web pages based on text and image similarity;

6) combining attention time with conventional search technology to produce a personalized ranking of web pages and images.

In the step of using attention time to personalize an existing web page ranking: attention time serves as the source of implicit user feedback from which the user's preferences are learned. The attention time of web pages or images the user has not yet viewed is then predicted, and the results are finally ranked by predicted attention time. Attention time is the time a user spends reading or viewing a web page or image.

In the step of using a custom browser to collect samples of text attention time: the client is a custom browser. For text search, the search engine usually shows a few lines of summary for each document on the results page; the browser tracks the position of the mouse to record the time the user spends on each document's summary. On an opened page, it records the time the user is active on that page. The attention time for a document is the time spent reading its summary plus the time spent reading the full document. If the user later returns to a page already viewed, that page's attention time is increased accordingly.

In the step of using a custom browser to collect samples of image attention time: the client is a custom browser. For image search, the search engine displays a thumbnail of each image on the results page. As with text, the attention time is the time the user spends looking at the thumbnail plus the time spent looking at the original image. If a document contains both text and images, its attention time is the sum of the two.
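The bookkeeping described above (summary or thumbnail time plus full-view time, with repeat visits added on) can be sketched as a running sum per document. This is a minimal illustration, not the patent's browser code: the event tuple format, the function name `accumulate_attention`, and the document identifiers are all assumptions made for the example.

```python
from collections import defaultdict


def accumulate_attention(events):
    """Sum dwell intervals per document.

    `events` is an iterable of (doc_id, start_seconds, end_seconds) tuples.
    This tuple format is hypothetical; it stands in for whatever the custom
    browser logs when the mouse rests on a summary or a page is active.
    """
    totals = defaultdict(float)
    for doc_id, start, end in events:
        if end > start:  # ignore empty or inverted intervals
            totals[doc_id] += end - start
    return dict(totals)


events = [
    ("doc-a", 0.0, 2.5),    # mouse over the summary on the results page
    ("doc-a", 2.5, 40.0),   # reading the opened page
    ("doc-b", 40.0, 41.0),  # brief look at another summary
    ("doc-a", 41.0, 51.0),  # returning to a page already viewed
]
totals = accumulate_attention(events)
```

Because every interval is simply added to the document's total, a return visit and a mixed text-and-image document need no special handling.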

In the step of correcting the collected attention time samples, the correction formula is:

$$t_{att}^{inf}(u,d) = \max\left(t_{att}^{raw}(u,d) - t_{basic}(u),\; 0\right)$$

where $t_{att}^{raw}(u,d)$ is the collected attention time, $t_{basic}(u)$ is the time user $u$ needs to judge whether a document is worth reading, and $t_{att}^{inf}(u,d)$ is the underlying attention time attributable to document $d$.
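The correction is a clamped subtraction, which a short sketch makes concrete. The function name and the sample values are illustrative assumptions; the source does not say how $t_{basic}(u)$ is estimated, so it is taken here as a given input.

```python
def correct_attention_time(t_raw, t_basic):
    """Subtract the user's baseline skimming time, never going below zero.

    t_raw   -- attention time as recorded by the browser, in seconds
    t_basic -- time this user needs to decide whether a document is worth
               reading (assumed to be known; its estimation is not covered)
    """
    return max(t_raw - t_basic, 0.0)


corrected_long = correct_attention_time(50.0, 3.0)   # a document actually read
corrected_short = correct_attention_time(1.0, 3.0)   # a document skimmed and dropped
```

A document abandoned before the baseline time elapses therefore contributes zero attention time rather than a negative value.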

In the step of predicting the attention time of unseen web pages based on text and image similarity:

a) Let $\mathrm{Sim}(d_0, d_1)$ denote the similarity between documents $d_0$ and $d_1$, with $\mathrm{Sim}(d_0, d_1) \in [0, 1]$. Before the similarity of two documents is computed, advertisements, markup tags in the page source, and the navigation bar at the top of the page are removed.

b) The training samples are written as $\{t_{att}(u, d_i) \mid i = 1, \ldots, n\}$, where $n$ is the number of documents the current user has read and the documents read are $d_i$ $(i = 1, \ldots, n)$. When the user encounters a new document $d_x$, its similarity to every document in this sample set is computed and the $k$ documents with the highest similarity are selected, with $k = \min(10, n)$. Denoting the selected documents $d_i$ $(i = 1, \ldots, k)$, the attention time of $d_x$ is predicted by:

$$t_{att}(u, d_x) = \frac{\sum_{i=1}^{k} \left( t_{att}(u, d_i)\, \mathrm{Sim}^{\gamma}(d_i, d_x)\, \delta(d_i, d_x) \right)}{\sum_{i=1}^{k} \left( \mathrm{Sim}^{\gamma}(d_i, d_x)\, \delta(d_i, d_x) \right) + \epsilon}$$

where $\gamma$ controls how much weight the value of $\mathrm{Sim}(\cdot,\cdot)$ carries, $\epsilon$ is a small positive number that prevents the denominator from being zero, and the function $\delta(\cdot,\cdot)$ filters out documents with very low similarity. It is defined as:

$$\delta(d_i, d_x) = \begin{cases} 1 & \text{if } \mathrm{Sim}^{\gamma}(d_i, d_x) > 0.01 \\ 0 & \text{otherwise} \end{cases}$$
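The prediction step amounts to a similarity-weighted average over the nearest already-read documents, weighting each neighbour's recorded attention time $t_{att}(u, d_i)$. The sketch below assumes the similarities to the new document have already been computed. The function name, the default `gamma=1.0`, and `eps=1e-6` are illustrative choices, since the source gives no values for them; only `k = min(10, n)` and the 0.01 cutoff come from the text.

```python
def predict_attention_time(times, sims, gamma=1.0, eps=1e-6, k_max=10, cutoff=0.01):
    """Predict attention time for a new document d_x.

    times -- recorded attention times t_att(u, d_i) of documents already read
    sims  -- Sim(d_i, d_x) for the same documents, each in [0, 1]
    gamma, eps -- hypothetical defaults; the source does not state them
    """
    if len(times) != len(sims):
        raise ValueError("times and sims must have the same length")
    k = min(k_max, len(times))
    # Keep the k most similar documents.
    nearest = sorted(zip(sims, times), key=lambda pair: pair[0], reverse=True)[:k]
    numerator = 0.0
    denominator = 0.0
    for sim, t in nearest:
        weight = sim ** gamma
        if weight > cutoff:  # delta(d_i, d_x) = 1 only above the cutoff
            numerator += t * weight
            denominator += weight
    return numerator / (denominator + eps)


balanced = predict_attention_time([10.0, 20.0], [0.5, 0.5])
dissimilar = predict_attention_time([10.0], [0.001])
```

When no neighbour clears the cutoff, both sums are zero and the prediction is 0, which is exactly the case that $\epsilon$ guards against.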

In the step of combining attention time with conventional search technology to produce a personalized ranking of web pages and images:

c) When the user submits a query, the server first redirects the query to a conventional search engine and obtains the first n web pages returned. For each returned page, the system finds the k samples in that user's sample set whose text or images are most similar to it, and predicts the page's attention time using the method in claim 8.

d) From the conventional ranking, the system generates an attention time offset: the higher a document ranks in the conventional ranking, the larger its offset. The offset is defined as:

$$t_{atten}^{offset}(i) = \frac{2 \exp\left(-\kappa_d \cdot rank(i)\right)}{1 + \exp\left(-\kappa_d \cdot rank(i)\right)} \tag{3}$$

where $rank(i)$ is the rank of document $i$ in Google's ranking and the parameter $\kappa_d$ controls how steeply the offset falls as the rank drops.

e) From the attention time $t_{atten}(i)$ and the offset $t_{atten}^{offset}(i)$ of document $i$, its overall attention time is obtained as $t_{atten}^{overall}(i) = \kappa_{overall}\, t_{atten}(i) + t_{atten}^{offset}(i)$, where $\kappa_{overall}$ is a per-user parameter that controls how much weight the user wants personalized ranking to carry.

f) The final ranking lists the documents in descending order of overall attention time.
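Steps d) to f) can be sketched as a small re-ranking routine. The defaults `kappa_d=0.2` and `kappa_overall=0.1` are the values mentioned for the experiments in the embodiment; the function names are illustrative, and the sketch assumes the predicted attention times are already on a scale comparable to the offset, which lies between 0 and 1.

```python
import math


def attention_offset(rank, kappa_d=0.2):
    """Offset derived from the baseline rank (1 = top result)."""
    e = math.exp(-kappa_d * rank)
    return 2.0 * e / (1.0 + e)


def rerank(predicted, kappa_overall=0.1, kappa_d=0.2):
    """Return baseline ranks reordered by overall attention time.

    predicted -- predicted attention times, listed in baseline rank order
                 (assumed pre-scaled to be comparable with the offset)
    """
    scored = []
    for index, t_atten in enumerate(predicted):
        rank = index + 1
        overall = kappa_overall * t_atten + attention_offset(rank, kappa_d)
        scored.append((overall, rank))
    # Descending order of overall attention time.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rank for _, rank in scored]


unchanged = rerank([0.0, 0.0, 0.0])   # no personal signal: baseline order kept
promoted = rerank([0.0, 0.0, 100.0])  # third result has a large predicted time
```

With no personal signal the offsets alone reproduce the baseline order, so the method degrades to the conventional ranking when nothing is known about the user.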

The invention effectively incorporates the user's preferences into the search process and brings the final ranking closer to the ideal ranking the user expects, so that web page text and image search engines can provide better personalized service.

Description of the drawings

Fig. 1 is a flowchart of the specific embodiment;

Fig. 2 is a screenshot of the custom browser used in this example;

Fig. 3 shows the experimental results of 14 groups of text searches; the data are given in Table 2;

Fig. 4 plots 7 groups of image search experiments; the data are given in Tables 3 and 4. Each group of experiments was carried out by a different user under the same settings. The plot shows the average rank of the images the user wanted; the smaller the average, the nearer the top of the search results those images appear.

Detailed description of the embodiments

Example:

The workflow of the method for ranking web page text and images based on user attention time is shown in Fig. 1. The personalized ranking system consists of a client and a server. The client is a custom browser 20 that captures the user's attention time. The server comprises a sample collection module 30, an attention time correction module 40, a user database 50, a document database 60, a query interface 70, a conventional engine module 80, a document preprocessing module 90, a document comparison module 100, an attention time prediction module 110, and a ranking module 120.

Custom browser 20: tracks and analyzes the user's mouse movements to derive the user's attention time on each document. Fig. 2 shows attention times recorded by the custom browser we developed for this example.

Sample collection module 30: stores the sample data sent by the client in the corresponding user's database; if a document is not yet present in the document database, it is downloaded and stored there.

Attention time correction module 40: the attention time obtained directly from the client still needs to be corrected. Whenever a user views a document, whether or not it turns out to be useful, the user has to spend some time skimming it. The attention time obtained at this point therefore includes both the user's actual attention time and the time spent skimming. To address this, we correct the raw attention time with the following equation:

$$t_{att}^{inf}(u,d) = \max\left(t_{att}^{raw}(u,d) - t_{basic}(u),\; 0\right)$$

where $t_{att}^{raw}(u,d)$ is the attention time originally obtained, $t_{basic}(u)$ is the time user $u$ needs to judge whether a document is worth reading, and $t_{att}^{inf}(u,d)$ is the underlying attention time attributable to document $d$.

User database 50: stores each user's attention times for documents; in this example it is stored in MySQL.

Document database 60: stores document data (text web pages and images); in this example it is stored in MySQL.

Query interface 70: provides a web entry point for user queries, offering both text search and image search. In this example the query interface is implemented in JSP.

Conventional engine module 80: when the user submits a query, the server parses the results page of a conventional search engine (such as Google), takes the first 300 documents in the returned results, and downloads them to the document server.

Document preprocessing module 90: web pages downloaded directly from a site contain a lot of irrelevant material, such as HTML tags, advertisement bars, and navigation bars. This module removes the irrelevant material and keeps the main content the user will pay attention to. In this example, we implemented the removal of HTML tags.
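Tag removal of the kind described for module 90 can be sketched with Python's standard `html.parser`. This is an illustration rather than the patent's implementation (the example system is built around JSP); the class and function names are hypothetical, and skipping the contents of `script` and `style` elements is an added assumption beyond plain tag removal.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and script/style contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<html><head><style>p{}</style></head><body>"
        "<p>Hello <b>world</b></p><script>var x=1;</script></body></html>")
plain = strip_html(page)
```

Removing advertisements and navigation bars would need page-specific heuristics and is not attempted here.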

Document comparison module 100: the text similarity algorithm used is the extended Jaccard (Tanimoto) method; the image similarity algorithm used is based on the auto color correlogram (Huang et al. 1997).
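For text, the extended Jaccard (Tanimoto) coefficient of two vectors a and b is a·b / (|a|² + |b|² − a·b). The sketch below applies it to raw term-frequency vectors. The whitespace tokenization, lowercasing, and function name are assumptions for illustration; the source does not describe how documents are vectorized, and image similarity is not covered here.

```python
from collections import Counter


def tanimoto_similarity(text_a, text_b):
    """Extended Jaccard similarity of two texts over term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = sum(count * count for count in a.values())
    norm_b = sum(count * count for count in b.values())
    denominator = norm_a + norm_b - dot
    # Two empty texts have no terms to compare; treat them as dissimilar.
    return dot / denominator if denominator else 0.0


same = tanimoto_similarity("web search engine", "web search engine")
partial = tanimoto_similarity("web search engine", "web search ranking")
disjoint = tanimoto_similarity("image thumbnail", "query history")
```

For non-negative term counts the result always falls in [0, 1], which matches the range required of Sim in the prediction formula.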

Attention time prediction module 110 performs the following steps:

a.) This module predicts the attention time of every document returned by the conventional engine module. First, the training samples are written as $\{t_{att}(u, d_i) \mid i = 1, \ldots, n\}$, where $n$ is the number of documents the current user has read. The documents read are denoted $d_i$ $(i = 1, \ldots, n)$. For a document $d_x$ returned by the conventional engine, we compute its similarity to every document in this sample set and then select the $k$ documents with the highest similarity. In our experiments, we set $k = \min(10, n)$. Denoting the selected documents $d_i$ $(i = 1, \ldots, k)$, we predict the attention time of $d_x$ with the following equation:

$$t_{att}(u, d_x) = \frac{\sum_{i=1}^{k} \left( t_{att}(u, d_i)\, \mathrm{Sim}^{\gamma}(d_i, d_x)\, \delta(d_i, d_x) \right)}{\sum_{i=1}^{k} \left( \mathrm{Sim}^{\gamma}(d_i, d_x)\, \delta(d_i, d_x) \right) + \epsilon}$$

where $\gamma$ controls how much weight the value of $\mathrm{Sim}(\cdot,\cdot)$ carries and $\epsilon$ is a small positive number that prevents the denominator from being zero. The function $\delta(\cdot,\cdot)$ filters out documents with very low similarity and is defined as:

$$\delta(d_i, d_x) = \begin{cases} 1 & \text{if } \mathrm{Sim}^{\gamma}(d_i, d_x) > 0.01 \\ 0 & \text{otherwise} \end{cases}$$

b.) In the early stage of system operation, we also convert the conventional engine's ranking into an attention time offset. The following equation maps the conventional rank to a normalized attention time offset between 0 and 1:

$$t_{atten}^{offset}(i) = \frac{2 \exp\left(-\kappa_d \cdot rank(i)\right)}{1 + \exp\left(-\kappa_d \cdot rank(i)\right)}$$

where $rank(i)$ is the rank of document $i$ in the conventional search engine. We chose this expression because it converts ranking information into attention time and gives lower-ranked documents a correspondingly shorter attention time. The parameter $\kappa_d$ controls how steeply the attention time falls as the rank drops; in our experiments it was set to 0.2.
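As a worked illustration of the offset with $\kappa_d = 0.2$, the values at ranks 1, 5, and 10 (ranks chosen here only as examples) are:

```latex
\begin{align*}
rank(i) = 1:  &\quad t_{atten}^{offset}(i) = \frac{2e^{-0.2}}{1 + e^{-0.2}} \approx 0.900 \\
rank(i) = 5:  &\quad t_{atten}^{offset}(i) = \frac{2e^{-1}}{1 + e^{-1}} \approx 0.538 \\
rank(i) = 10: &\quad t_{atten}^{offset}(i) = \frac{2e^{-2}}{1 + e^{-2}} \approx 0.238
\end{align*}
```

The offset stays below 1 for every rank of 1 or more and decays toward 0 for documents far down the conventional ranking.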

c.) Once we have the attention time $t_{atten}(i)$ and the offset $t_{atten}^{offset}(i)$ of document $i$, we obtain its overall attention time: $t_{atten}^{overall}(i) = \kappa_{overall}\, t_{atten}(i) + t_{atten}^{offset}(i)$. The parameter $\kappa_{overall}$ is a per-user parameter that controls how much weight the user wants personalized ranking to carry. The final web page ranking lists documents in descending order of overall attention time. We implemented a method that sets $\kappa_{overall}$ automatically: when the training set contains few samples, $\kappa_{overall}$ is small, and as the number of samples grows, $\kappa_{overall}$ becomes larger. The reason is that our ranking algorithm is fundamentally a learning algorithm and, like other learning algorithms, produces poor results while the training set is still small, so we need to fall back on the conventional engine's ranking. In our experiments, we used a sigmoid function to check the value of $\kappa_{overall}$ automatically and found it to be a constant, typically 0.1.
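One way to make the personalization weight grow with the number of training samples is a sigmoid schedule, sketched below. The source does not give the functional form or its parameters, so the function name and the `steepness`, `midpoint`, and `kappa_max` values are all hypothetical and chosen only to show the shape.

```python
import math


def kappa_overall(n_samples, steepness=0.1, midpoint=20.0, kappa_max=1.0):
    """Sigmoid schedule for the personalization weight.

    All three parameters are illustrative assumptions: the weight is small
    with few samples, reaches kappa_max / 2 at `midpoint` samples, and
    approaches kappa_max as the sample set grows.
    """
    return kappa_max / (1.0 + math.exp(-steepness * (n_samples - midpoint)))


weight_new_user = kappa_overall(0)
weight_midway = kappa_overall(20)
weight_experienced = kappa_overall(100)
```

With a schedule of this shape, a new user's results stay close to the conventional ranking, and the personalized term takes over gradually as samples accumulate.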

Sorting module 120: the sorting module sorts all documents in descending order of overall attention time and returns the result to the user.
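A minimal sketch of what such a sorting step might look like is given below. The `ScoredDocument` type, its field names, and the example URLs and scores are hypothetical; only the descending sort by overall attention time comes from the text.

```python
from typing import List, NamedTuple


class ScoredDocument(NamedTuple):
    url: str                  # hypothetical identifier for a search result
    overall_attention: float  # overall attention time computed upstream


def rank_by_attention(docs: List[ScoredDocument]) -> List[ScoredDocument]:
    """Return the documents in descending order of overall attention time.

    Python's sort is stable, so documents with equal scores keep the order
    in which they were passed in (for example, the conventional engine's order).
    """
    return sorted(docs, key=lambda d: d.overall_attention, reverse=True)


results = rank_by_attention([
    ScoredDocument("https://example.com/a", 1.2),
    ScoredDocument("https://example.com/b", 3.9),
    ScoredDocument("https://example.com/c", 1.2),
])
for doc in results:
    print(doc.url, doc.overall_attention)
```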

The experimental results in Tables 1 to 4 clearly show the advantage of this method.

Table 1 lists the ranks of the top 17 documents returned by a text search for the keywords "Web search technology". From left to right, the columns are the user's ideal ranking, the ranking given by the web search engine Google, and the rankings produced after the user has read 2, 5, 8, 10, and 15 web pages. The last row gives the sum of absolute rank errors between each ranking and the user's ideal ranking.

Table 1

Rk_user  Rk_Google  Rk_2  Rk_5  Rk_8  Rk_10  Rk_15
6        1          1     15    13    11     7
9        4          17    16    14    12     9
1        2          2     1     1     1      1
17       3          10    17    16    16     16
2        6          3     7     2     2      2
15       5          12    9     15    15     15
16       7          13    14    17    17     17
5        8          9     4     12    10     6
11       15         13    6     6     14     11
10       14         15    13    11    13     10
14       18         16    12    9     7      14
12       16         12    11    10    9      12
3        4          4     4     3     3      3
13       11         9     10    8     8      13
8        6          5     3     4     4      8
7        5          14    8     7     6      5
4        7          8     5     5     5      4
0        96         60    52    44    42     6
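The error measure in the last row of Table 1 is the sum of absolute differences between a ranking and the user's ideal ranking. A small sketch, using the Rk_user and Rk_15 columns as they appear in Table 1 (the function name `rank_error_sum` is an illustrative choice):

```python
from typing import Sequence


def rank_error_sum(ideal: Sequence[int], produced: Sequence[int]) -> int:
    """Sum of absolute rank differences between two rankings of the same items."""
    if len(ideal) != len(produced):
        raise ValueError("both rankings must cover the same documents")
    return sum(abs(a - b) for a, b in zip(ideal, produced))


# Columns Rk_user and Rk_15 of Table 1, one entry per document (17 documents).
rk_user = [6, 9, 1, 17, 2, 15, 16, 5, 11, 10, 14, 12, 3, 13, 8, 7, 4]
rk_15 = [7, 9, 1, 16, 2, 15, 17, 6, 11, 10, 14, 12, 3, 13, 8, 5, 4]

print(rank_error_sum(rk_user, rk_15))  # prints 6, the Rk_15 entry in the last row
```

A value of 0 means the ranking is identical to the user's ideal ranking, which is why the Rk_user column itself has 0 in the last row.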

Table 2 gives experimental data for text searches on different keywords by 14 different users. Each row gives, for one experiment, the sum of absolute rank errors between each resulting ranking and the user's ideal ranking. These data are also plotted in Figure 3.

Table 2

Query word             Rk_Google  Rk_2  Rk_5  Rk_8  Rk_10  Rk_15
greenhouse effect      88         66    66    62    52     16
Gnome Linux            86         64    60    56    50     18
encryption algorithm   123        99    78    65    45     22
RISC                   94         82    62    58    50     32
Advertising ethics     94         77    62    47    41     10
da Vinci               103        99    65    49    39     21
olympic                77         72    58    50    36     10
anckor                 90         94    92    74    46     16
color management       128        146   94    90    66     52
NBA                    109        94    76    62    36     22
correlation            122        114   114   88    82     54
houston                133        98    92    85    76     43
investment             132        120   104   100   94     46
samsung                71         74    68    42    36     4

Table 3 gives data from an image search experiment with the keyword "Picasso". The user wants to use "Picasso" to find Picasso's self-portraits, and only 6 of the 60 images meet that need. Each column gives the ranks of these matching images, either in the Google image search engine or as produced by this method: Rk_Google is their rank in Google Image Search, and Rk_1st, Rk_2nd, and Rk_3rd are their ranks after the user has viewed pages 1, 2, and 3 of the search results. The last row is the average rank of these images in each case; the smaller the average rank, the earlier the images the user wants appear.

Table 3

Rk_Google  Rk_1st  Rk_2nd  Rk_3rd
9          1       1       1
16         63      3       5
17         3       2       3
23         41      15      2
41         24      37      4
48         13      4       6
25.67      24.17   10.33   3.5
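The last row of Table 3 is the arithmetic mean of each column. The short sketch below recomputes two of those averages from the column values shown in the table (the helper name `average_rank` is illustrative):

```python
from statistics import mean
from typing import Sequence


def average_rank(ranks: Sequence[int]) -> float:
    """Mean rank of the images the user is looking for; lower is better."""
    return mean(ranks)


# Ranks of the 6 matching images, taken from Table 3.
rk_google = [9, 16, 17, 23, 41, 48]
rk_3rd = [1, 5, 3, 2, 4, 6]

print(round(average_rank(rk_google), 2))  # prints 25.67
print(average_rank(rk_3rd))               # prints 3.5
```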

Table 4 gives data from 6 further image search experiments. In each experiment the user was asked to find the images he or she needed among 60 images. The first column is the search keyword and the second column is the number of images the user needs. Rk_Google is the average rank of the needed images in the Google image search engine; Rk_1st, Rk_2nd, and Rk_3rd are the average ranks of the needed images after the user has viewed pages 1, 2, and 3 of the search results.

Table 4

Query word                 #Images  Rk_Google  Rk_1st  Rk_2nd  Rk_3rd
Tree                       9        16.22      10.56   8.11    6
Desert                     10       20.2       15.4    14.1    12.6
South Pole                 14       24.57      23.5    21.21   13.71
Apple                      9        21.33      12.33   11.78   11.22
Break heart                5        22.4       22      21.2    8.2
Pirates of the Caribbean   19       24.37      27.16   20.37   15.32

The tables above show that the invention effectively incorporates the user's preferences into the search process, bringing the final ranking closer to the ideal ranking the user expects, so that web page text and image search engines can offer users better personalized service.

The above describes only preferred embodiments of the attention-time-based, user-oriented personalized web page ranking method and system of the invention and is not intended to limit the scope of its substantive technical content. The substantive technical content of the method and system is broadly defined in the claims. Any technical entity or method completed by others that is identical to what the claims define, or is an equivalent variation of it, shall be deemed to fall within the scope of protection of this patent.

Claims (7)

1. A method for ranking web page text and images based on user attention time, characterized by comprising the following steps:
1) using attention time to personalize an existing web page ranking so that the ranking results match the user's expectations;
2) using a customized browser to collect samples of attention time for text;
3) using a customized browser to collect samples of attention time for images;
4) correcting the collected attention time samples;
5) predicting the attention time of unseen web pages based on text and image similarity;
6) combining attention time with conventional search technology to produce a personalized ranking of web pages and images.

2. The method for ranking web page text and images based on user attention time according to claim 1, characterized in that the step of using attention time to personalize an existing web page ranking so that the ranking results match the user's expectations is: attention time is used as a source of implicit user feedback from which the user's preferences are learned; the attention time of web pages or images the user has not yet viewed is then predicted; and the results are finally ranked by predicted attention time. Attention time is the reading or viewing time the user spends on a web page or image.

3. The method for ranking web page text and images based on user attention time according to claim 1, characterized in that the step of using a customized browser to collect samples of attention time for text is: the client is a customized browser; for text search, the search engine usually provides a summary of a few lines for each document on the search results page, and mouse movement is tracked to record the time the user spends on a given document; on an opened page, the time the user is active on that page is recorded; the attention time for the document is the time spent reading the summary plus the time spent reading the whole document; if the user later returns to a page already viewed, the attention time of that page is increased accordingly.

4. The method for ranking web page text and images based on user attention time according to claim 1, characterized in that the step of using a customized browser to collect samples of attention time for images is: the client is a customized browser; for image search, the search engine displays a thumbnail of each image on the results page; likewise, the attention time is the time the user looks at the thumbnail plus the time the user looks at the original image; if a document contains both text and images, its attention time is the sum of the two.

5. The method for ranking web page text and images based on user attention time according to claim 1, characterized in that the step of correcting the collected attention time samples uses the following correction formula:

$$t_{att}^{inf}(u, d) = \max\left(t_{att}^{raw}(u, d) - t_{basic}(u),\ 0\right)$$

where $t_{att}^{raw}(u, d)$ is the collected attention time, $t_{basic}(u)$ is the time user u takes to judge whether a document is worth reading, and $t_{att}^{inf}(u, d)$ is the underlying attention time attributable to document d.

6. The method for ranking web page text and images based on user attention time according to claim 1, characterized in that the step of predicting the attention time of unseen web pages based on text and image similarity is:
a) $Sim(d_0, d_1)$ denotes the similarity between document $d_0$ and document $d_1$, with $Sim(d_0, d_1) \in [0, 1]$; before the similarity of two documents is computed, advertisements, markup tags in the page source, and navigation bars on the page are removed;
b) the training samples are written as $\{t_{att}(u, d_i) \mid i = 1, \dots, n\}$, where n is the number of documents the current user has read and the documents read are denoted $d_i$ $(i = 1, \dots, n)$; when the user encounters a new document $d_x$, the similarity between $d_x$ and every document in the training set is computed and the k documents with the highest similarity are selected, with k set to $\min(10, n)$; denoting the selected documents $d_i$ $(i = 1, \dots, k)$, the attention time of $d_x$ is predicted by the following equation:

$$t_{att}(u, d_x) = \frac{\sum_{i=1}^{k} \left( t_{att}(u, d_i)\, Sim^{\gamma}(d_i, d_x)\, \delta(d_i, d_x) \right)}{\sum_{i=1}^{k} \left( Sim^{\gamma}(d_i, d_x)\, \delta(d_i, d_x) \right) + \epsilon}$$

where $\gamma$ controls how much weight the value of $Sim(\cdot,\cdot)$ carries, $\epsilon$ is a small positive value that prevents the denominator from being 0, and the function $\delta(\cdot,\cdot)$ filters out documents with very low similarity; it is defined as:

$$\delta(d_i, d_x) = \begin{cases} 1 & \text{if } Sim^{\gamma}(d_i, d_x) > 0.01 \\ 0 & \text{otherwise} \end{cases}$$

7. The method for ranking web page text and images based on user attention time according to claim 1, characterized in that the step of combining attention time with conventional search technology to produce a personalized ranking of web pages and images is:
c) when the user submits a query, the server first redirects the query to a conventional search engine and obtains the first n web pages returned; for each returned page, the system finds in the user's sample set the k samples with the highest text or image similarity and predicts the attention time of the page using the method of claim 6;
d) for the conventional ranking, the system generates an attention time offset such that a document ranked higher in the conventional ranking receives a larger attention time offset; this offset is defined by the following formula:

$$t_{atten}^{offset}(i) = \frac{2 \exp\left(-\kappa_d \cdot rank(i)\right)}{1 + \exp\left(-\kappa_d \cdot rank(i)\right)} \tag{3}$$

where rank(i) is the rank of document i in Google's ranking and the parameter $\kappa_d$ controls how steeply the attention time falls off as the rank drops;
e) from the attention time $t_{atten}(i)$ and the offset $t_{atten}^{offset}(i)$ of document i, the overall attention time of document i is obtained:

$$t_{atten}^{overall}(i) = \kappa_{overall}\, t_{atten}(i) + t_{atten}^{offset}(i)$$

where the parameter $\kappa_{overall}$ is a per-user variable that controls how much weight the user wants the personalized ranking to carry;
f) the final ranking lists the documents in descending order of overall attention time.
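The correction in claim 5 and the similarity-weighted prediction in claim 6 can be sketched as follows. The sketch assumes the similarity of each already-read document to the new document has been computed elsewhere (how Sim is computed is not shown here), and that the numerator weights the attention times of the already-read documents d_i. The function names and the defaults `gamma=1.0` and `epsilon=1e-6` are hypothetical; the neighbor limit of 10 and the 0.01 threshold come from the claims.

```python
from typing import List, Tuple


def corrected_attention(t_raw: float, t_basic: float) -> float:
    """Claim 5: subtract the user's basic judging time, clamped at zero."""
    return max(t_raw - t_basic, 0.0)


def predict_attention(samples: List[Tuple[float, float]],
                      gamma: float = 1.0,      # assumed default
                      epsilon: float = 1e-6,   # assumed small positive value
                      max_neighbors: int = 10,
                      threshold: float = 0.01) -> float:
    """Claim 6: similarity-weighted average over the k most similar samples.

    `samples` holds (attention_time, similarity_to_new_document) pairs for
    documents the user has already read; similarities lie in [0, 1].
    """
    k = min(max_neighbors, len(samples))
    nearest = sorted(samples, key=lambda s: s[1], reverse=True)[:k]
    numerator = 0.0
    denominator = 0.0
    for t_att, sim in nearest:
        weight = sim ** gamma
        if weight > threshold:  # delta(d_i, d_x) = 1 only above the threshold
            numerator += t_att * weight
            denominator += weight
    return numerator / (denominator + epsilon)


# Made-up example: two read documents, one of them barely similar.
history = [(corrected_attention(105.0, 5.0), 0.005),
           (corrected_attention(15.0, 5.0), 0.5)]
print(round(predict_attention(history), 3))
```

Because of the epsilon term, a new document with no sufficiently similar neighbors gets a predicted attention time of 0 instead of a division error, leaving its position to be decided by the conventional-ranking offset.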
CNA2008101200029A 2008-07-11 2008-07-11 Web page text and image ranking method based on user attention time Pending CN101320387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008101200029A CN101320387A (en) 2008-07-11 2008-07-11 Web page text and image ranking method based on user attention time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008101200029A CN101320387A (en) 2008-07-11 2008-07-11 Web page text and image ranking method based on user attention time

Publications (1)

Publication Number Publication Date
CN101320387A true CN101320387A (en) 2008-12-10

Family

ID=40180436

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008101200029A Pending CN101320387A (en) 2008-07-11 2008-07-11 Web page text and image ranking method based on user attention time

Country Status (1)

Country Link
CN (1) CN101320387A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101567004B (en) * 2009-02-06 2012-05-30 浙江大学 English text automatic summarization method based on eyeball tracking
TWI585596B (en) * 2009-10-01 2017-06-01 Alibaba Group Holding Ltd How to implement image search and website server
CN101916264B (en) * 2010-07-30 2012-09-19 浙江大学 A personalized webpage recommendation method based on user facial expression and gaze distribution detection
CN101916264A (en) * 2010-07-30 2010-12-15 浙江大学 A personalized webpage recommendation method based on user facial expression and gaze distribution detection
CN102117332A (en) * 2011-03-10 2011-07-06 辜进荣 Given time-based searching method
CN103827863B (en) * 2011-05-12 2017-02-15 谷歌公司 Dynamic image display area and image display within web search results
CN102231165A (en) * 2011-07-11 2011-11-02 浙江大学 Method for searching and sequencing personalized web pages based on user retention time analysis
CN102231165B (en) * 2011-07-11 2013-01-09 浙江大学 Method for searching and sequencing personalized web pages based on user retention time analysis
CN103186565A (en) * 2011-12-28 2013-07-03 中国移动通信集团浙江有限公司 Method and device for judging user preference according to web browsing behavior of user
CN103186565B (en) * 2011-12-28 2017-02-22 中国移动通信集团浙江有限公司 Method and device for judging user preference according to web browsing behavior of user
CN103838727B (en) * 2012-11-21 2018-01-19 华为技术有限公司 A kind of generation method and user terminal of historical record and collection
CN103838727A (en) * 2012-11-21 2014-06-04 华为技术有限公司 Generation method for history records and favorites and user terminal
WO2014079196A1 (en) * 2012-11-21 2014-05-30 华为技术有限公司 Method for generating history record and favorites folder and user terminal
CN104281606B (en) * 2013-07-08 2021-06-25 腾讯科技(北京)有限公司 A method and device for displaying microblog comments
CN104281606A (en) * 2013-07-08 2015-01-14 腾讯科技(北京)有限公司 Method and device for displaying microblog comments
WO2015127782A1 (en) * 2014-02-27 2015-09-03 优视科技有限公司 Method and system for displaying webpage self-defined content
US10776564B2 (en) 2014-02-27 2020-09-15 Uc Mobile Co., Ltd. Method and system for displaying webpage self-defined content
WO2015143910A1 (en) * 2014-03-28 2015-10-01 北京奇虎科技有限公司 Method and device for defining search engine result pages by user
CN104166741B (en) * 2014-09-10 2018-09-18 北京国双科技有限公司 Web page browsing analysis and processing method and device
CN104166741A (en) * 2014-09-10 2014-11-26 北京国双科技有限公司 Webpage browsing analysis and processing method and device
CN106156063A (en) * 2015-03-30 2016-11-23 阿里巴巴集团控股有限公司 Correlation technique and device for object picture search results ranking
WO2016155537A1 (en) * 2015-03-30 2016-10-06 阿里巴巴集团控股有限公司 Method and device for ranking search results of picture objects
CN107644028B (en) * 2016-07-20 2020-09-04 平安科技(深圳)有限公司 Method and system for collecting webpage data
CN107644028A (en) * 2016-07-20 2018-01-30 平安科技(深圳)有限公司 The collection method and system of web data
WO2018133681A1 (en) * 2017-01-23 2018-07-26 腾讯科技(深圳)有限公司 Method and device for sorting search results, server and storage medium


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20081210