CN101901249A

CN101901249A - A Text-Based Query Expansion and Ranking Method in Image Retrieval

Info

Publication number: CN101901249A
Application number: CN2010101847252A
Authority: CN
Inventors: 张玥杰; 金城; 薛向阳; 岑磊; 彭琳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2009-05-26
Filing date: 2010-05-12
Publication date: 2010-12-01

Abstract

The invention belongs to the field of multimedia information retrieval and relates to a method for thesaurus-based query expansion and result ranking in image retrieval. The invention comprises: a WordNet-based semantic similarity measure for English words, a HowNet-based semantic similarity measure for Chinese words, a rule-based algorithm for selecting and optimizing query expansion terms, and an algorithm for scoring retrieval results and optimizing the scores. The method improves an image search engine by means of text processing techniques and semantic network dictionaries: it interacts with the user through semantic expansion and ranks the retrieval results with an improved similarity measure. Compared with traditional methods, the invention offers high accuracy, strong completeness, and low time and space cost. It is of significant value for efficient image retrieval over large-scale image datasets that takes the high-level semantic information of images into account, and it is widely applicable in cross-language and cross-media retrieval.

Description

A Text-Based Query Expansion and Ranking Method in Image Retrieval

Technical Field

The invention belongs to the field of multimedia information retrieval and relates to a retrieval method for one specific medium, images; in particular, it relates to a method for thesaurus-based query expansion and result ranking in image retrieval. The method can be used alongside content-based image retrieval to improve image search quality and the user's search experience.

Background Art

In recent years, with the growth of the Internet and the spread of information technology through society, the volume of digital images has increased rapidly, and massive amounts of image data are produced every day. How to find and access images quickly and accurately, and how to make effective use of them, has become a pressing problem; this is the problem addressed by image retrieval. The earliest systems relied mainly on text-based image retrieval (TBIR). Since the 1990s, research on and applications of content-based image retrieval (CBIR) have advanced considerably, followed by semantics-based, feedback-based, and knowledge-based image retrieval [1]. As an early technique, TBIR depends heavily on image annotations, which is its main limitation. Text-based retrieval is nevertheless a mature technology, and its speed and reliability remain notable. TBIR is therefore still worth studying: borrowing features from other methods, or combining it with them, can give good results.

Image data can be divided into two classes. Images of the first class carry associated textual descriptions, such as news photographs, which reporters usually send back with a short written caption; images of the second class have no textual description. For image collections with textual descriptions, retrieval systems generally extract keywords from the descriptions and use them as the index. Because years of research on text retrieval have produced many results, such systems usually support retrieval by high-level semantics better than systems based purely on low-level image features. However, a textual description is a brief account written by the image's author from the author's own understanding and preferences. It differs from annotations made specifically for retrieval, so the content the two describe necessarily differs as well. Keywords extracted from textual descriptions are therefore not equivalent to keywords annotated by hand for retrieval; this lowers precision and can return many irrelevant results, leaving the user at a loss [2]. In addition, in image retrieval applications there is a gap between the user's real information need and the query the user submits, and another between the submitted query and the query as the system understands it. These gaps cause a mismatch between the retrieved images and the user's query, and even between the retrieved images and the information the user wants. Using query expansion to refine the user's query, to express the information need more accurately and objectively, and to help the user obtain the required information quickly and accurately has become an important research topic in information retrieval, and in image retrieval in particular [3, 4].

Current query expansion methods fall roughly into three categories: thesaurus-based, global-analysis-based, and local-analysis-based [5, 6, 7]. Thesaurus-based methods generally draw on a semantic knowledge dictionary [8, 9, 10] and select, as expansion terms, words that are semantically related to the query terms; selection is usually based on hypernym/hyponym and synonym relations between words [11, 12, 13, 14, 15, 16]. These methods depend on a complete semantic system and are independent of the collection being searched [17, 18, 19]. The basic idea of global-analysis methods is to run a correlation analysis over the words or phrases of the entire collection and add those most strongly associated with the query terms to the initial query to form a new query [20]. Although this explores relations between words as fully as possible, updating it after the collection changes is very expensive, and as the collection grows its time and space cost eventually becomes infeasible. Local-analysis methods use a two-stage query: a first retrieval is run on the user's initial query, the top N retrieved objects are analysed to find the most important words, these words are combined with the initial query to form a new query, and the new query is used for a second retrieval [20]. This approach is prone to "query drift": when the first retrieval is poor, words unrelated to the query topic may be added to the initial query, which can seriously reduce precision, even below that obtained without query expansion.
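
The two-stage local-analysis procedure described above can be sketched as follows. This is a minimal illustration, not the method of any cited work: term importance is approximated by raw frequency in the top-ranked texts, and the function name, parameters, and sample documents are all hypothetical.

```python
from collections import Counter


def expand_by_local_analysis(query_terms, ranked_docs, top_n=2,
                             extra_terms=2, stopwords=frozenset()):
    """Build the second-stage query: the original terms plus the most
    frequent new terms found in the top_n texts of the first retrieval."""
    counts = Counter()
    for doc in ranked_docs[:top_n]:
        for term in doc.lower().split():
            if term not in query_terms and term not in stopwords:
                counts[term] += 1
    # Sort by descending frequency; ties are broken alphabetically
    # so the result is deterministic.
    best = sorted(counts, key=lambda t: (-counts[t], t))[:extra_terms]
    return list(query_terms) + best


# Hypothetical first-stage ranking for the query "tiger".
first_stage = [
    "tiger in the jungle",
    "tiger cub jungle river",
    "stock market crash",
]
second_query = expand_by_local_analysis(
    ["tiger"], first_stage, stopwords=frozenset({"in", "the"})
)
# second_query == ["tiger", "jungle", "cub"]
```

If the top-ranked texts had been off-topic, their frequent words would have been appended instead, which is the query drift the paragraph warns about.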

On the other hand, presenting the desired results to users and helping them quickly locate the resources they need has always been a goal of image retrieval. When users enter a query, they want the most relevant results to be retrieved promptly and placed at the top of the result list [21]. Especially when a large number of results is returned, browsing habits show that users attend mostly to the first few items; they are neither able nor willing to go through results far down the list one by one, and the probability that such results are read at all is almost zero [22]. The quality of the ranking therefore directly affects whether users can conveniently obtain the resources they need, and it also determines their satisfaction with the retrieval system [23]. This applies in particular to image search engines, which organize large numbers of image resources of many kinds and serve as query tools for one specific type of media. Users of such engines search with a stronger sense of purpose and care more about finding the required resource among the results as quickly as possible, which places higher demands on the engine's ranking. The main factor determining the ranking is the engine's ranking strategy, which is one of the core components of an image search engine and key to its success or failure.

In principle, the ranking algorithms of existing general-purpose search engines fall into five types: term-frequency and position weighting, the Direct Hit algorithm, Alexa's website ranking, Google's ranking algorithm, and similarity-based algorithms [24, 25, 26]. Term-frequency and position weighting was the main idea behind early search engine ranking; it is the most mature of these techniques and is still the core ranking technology of many search engines. Its advantages are that it is simple and easy to implement, and it suits structured document data fairly well. Direct Hit is a ranking algorithm that emphasizes information quality and feedback from user behaviour; to some extent it satisfies the "user warrant principle" while also taking information quality into account. Because user behaviour is rather arbitrary, however, the accuracy of the resulting ranking is hard to guarantee. Alexa focuses on publishing rankings of the world's websites and considers mainly two aspects, overall ranking and ranking by category. Google's ranking algorithm is the decisive factor behind the quality of its search results; it uses PageRank, a refined method of ranking web documents by level. The degree of similarity between the query and each retrieved record is another important basis for ranking; the common approach today treats both the query string and the document as vectors, and the lengths of the query and of the retrieved result must be taken into account.
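
The vector view of query and document mentioned at the end of the paragraph is commonly realized as cosine similarity over term-count vectors, where dividing by the vector norms is what accounts for the differing lengths of query and document. A minimal sketch, assuming whitespace tokenization and raw term counts (real engines typically weight the terms):

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between the term-count vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_q * norm_d)


same = cosine_similarity("red car", "red car red car")      # repeated text, same direction
diluted = cosine_similarity("red car", "red car parked near a blue house")
unrelated = cosine_similarity("red car", "blue sky")
```

Because of the normalization, repeating a text does not change its score, while extra unrelated words in a longer caption lower it.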

The analysis above shows that, in an image search engine that answers queries from user-entered keywords, those keywords are the only query information the engine receives. As long as the image collection and the textual descriptions of the images do not change, the information matching the keyword sequence is necessarily the one and only retrieval result. Mining as much information as possible from the keyword sequence to support the query therefore helps the engine understand the user's intent. Query expansion is one such way of enriching the information. If query expansion can extend the explicit information or implicitly reduce the uncertainty about the user's intent, it will yield better retrieval results to some degree. Moreover, an image search engine often returns a great many results, while users usually have the patience to look at only the first few dozen. In other words, it is important to place the results that best match the user's search intent nearer the top of the returned list. For example, a list of 100 results containing 50 correct ones is not necessarily more effective than one containing only 20, so both precision and recall matter. Practice shows that no user makes use of all the search results; users pick out only the useful ones, and this is exactly the convenience that ranking should provide.
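
The remark about 50 correct results versus 20 can be made concrete with precision measured over only the first k positions. The two result lists below are hypothetical: list A has more correct results overall but places them all in the bottom half, while list B has fewer but places them first.

```python
def precision_at_k(relevance, k):
    """Fraction of the first k results that are correct.

    relevance is a list of booleans in rank order."""
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0


# Hypothetical rankings of 100 results each.
list_a = [False] * 50 + [True] * 50   # 50 correct, all ranked last
list_b = [True] * 20 + [False] * 80   # 20 correct, all ranked first

overall_a = precision_at_k(list_a, 100)   # 0.5
overall_b = precision_at_k(list_b, 100)   # 0.2
top20_a = precision_at_k(list_a, 20)      # 0.0
top20_b = precision_at_k(list_b, 20)      # 1.0
```

A user who reads only the first 20 results finds nothing useful in list A and only useful results in list B, even though list A is the better list by overall precision.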

The analysis above also shows that existing algorithms for query expansion and result ranking usually originate in text retrieval and target large-scale text processing [27, 28]. Although text-based image retrieval grew out of the now relatively mature field of text retrieval, some text retrieval techniques are unsuitable for it and affect image retrieval negatively. A general query expansion or ranking model cannot be expected to work well for image retrieval in every case, and research on adapting the query expansion and document ranking models of text retrieval still needs to be strengthened and deepened.

References relevant to the invention:

[1] Alan F. Smeaton and Ian Quigley. "Experiments on Using Semantic Distances between Words in Image Caption Retrieval". In Research and Development in Information Retrieval, pp. 174-180, 1996.

[2] Yang Linpeng, Ji Donghong, Tang Li and Niu Zhenyu. "Chinese Information Retrieval Based on Terms and Relevant Terms". In ACM Transactions on Asian Language Information Processing, Vol. 4(3): 357-374, September 2005.

[3] Xiqing Lin and Ximing Chen. New Methods for Query Expansion and Query Re-weighting for Document Retrieval. Master Thesis, Department of Information Engineering, National Science and Technology University, Taiwan, 2005.

[4] YiXuan Hong. Ontological Inference for User Intention Extraction, Query Expansion and Concept-based Retrieval. Master Thesis, Department of Information Engineering, National Dong-hua University, Taiwan, 2004.

[5] C. Ch. Latiri, S. Ben Yahin, J. P. ChevaVet and A. Jaouaa. "Query Expansion using Fuzzy Association Rules between Terms". In Proceedings of the 4th JIM International Conference on Knowledge Discovery and Discrete Mathematics, Mets, France, 2003.

[6] Hsi-Ching Lin, Li-Hui Wang and Shyi-Ming Chen. "Query Expansion for Document Retrieval based on Fuzzy Rules and User Relevance Feedback Techniques". In Expert Systems with Applications, Vol. 31(2): 397-405, August 2006.

[7] Hang Cui, Ji-Rong Wen, Jian-Yun Nie and Wei-Ying Ma. "Query Expansion by Mining User Logs". In IEEE Transactions on Knowledge and Data Engineering, Vol. 15(4): 829-839, July/August 2003.

[8] Christiane Fellbaum (ed.). WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, 1998.

[9] George A. Miller and Florentina Hristea. "WordNet Nouns: Classes and Instances". In Computational Linguistics, Vol. 2(1): 1-3, 2006.

[10] Dong Zhendong, Dong Qiang and Hao Changling. "Theoretical Findings of HowNet". Journal of Chinese Information Processing, Vol. 21(4): 3-9, 2007.

[11] Zhiguo Gong, Chan Wa Cheang and Leong Hou U. "Web Query Expansion by WordNet". In Lecture Notes in Computer Science, Volume 4080/2006, pp. 379-388, 2005.

[12] Ming-Hung Hsu, Ming-Feng Tsai and Hsin-Hsi Chen. "Query Expansion with ConceptNet and WordNet: An Intrinsic Comparison". In Proceedings of AIRS 2006, pp. 1-13, 2006.

[13] Alexander Budanitsky and Graeme Hirst. "Evaluating WordNet-based Measures of Lexical Semantic Relatedness". In Computational Linguistics, Vol. 32(1): 13-47, 2006.

[14] Liu Qun and Li Sujian. "Lexical Semantic Similarity Computation Based on HowNet". Computational Linguistics and Chinese Language Processing, Vol. 7(2): 59-76, 2002.

[15] Li Feng and Li Fang. "Computing Semantic Similarity of Chinese Words Based on HowNet". Journal of Chinese Information Processing, Vol. 21(3): 99-105, 2007.

[16] Jiang Min, Xiao Shibin, Wang Hongwei and Shi Shuicai. "An Improved Word Semantic Similarity Computation Based on HowNet". Journal of Chinese Information Processing, Vol. 22(5): 84-89, 2008.

[17] Diana Inkpen and Graeme Hirst. "Building and Using a Lexical Knowledge Base of Near-Synonym Differences". In Computational Linguistics, Vol. 32(2): 223-262, 2006.

[18] Ted Pedersen, Satanjeev Banerjee and Siddharth Patwardhan. "Maximizing Semantic Relatedness to Perform Word Sense Disambiguation". In University of Minnesota Supercomputing Institute Research Report UMSI 2005/25, March 2005.

[19] Alexander Budanitsky and Graeme Hirst. "Semantic Distance in WordNet: An Experimental, Application-Oriented Evaluation of Five Measures". In Proceedings of the Workshop on WordNet and Other Lexical Resources, The Second Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, pp. 29-34, 2001.

[20] Yuen-Hsien Tseng, Da-Wei Juang and Shiu-Han Chen. "Global and Local Expansion Term Expansion for Text Retrieval". In Proceedings of the Fourth NTCIR Workshop on Evaluation of Information Retrieval, Automatic Text Summarization and Question Answering, Tokyo, Japan, June 2-4, 2004.

[21] H. Vernon Leighton and Jaideep Srivastava. Precision among World Wide Web Search Services (Search Engines): AltaVista, Excite, Hotbot, Infoseek, Lycos. September 2006. http://www.winona.msus.edu/library/webind2/webind2.htm

[22] Claudio Carpineto, Giovanni Romano and Vittorio Giannini. "Improving Retrieval Feedback with Multiple Term-Ranking". In ACM Transactions on Information Systems, Vol. 20(3): 259-290, July 2002.

[23] Kemafor Anyanwu, Angela Maduko and Amit Sheth. "SemRank: Ranking Complex Relationship Search Results on the Semantic Web". In Proceedings of WWW 2005, pp. 117-127, Chiba, Japan, May 10-14, 2005.

[24] Shengli Wu and Fabio Crestani. "Methods for Ranking Information Retrieval Systems without Relevance Judgements". In Proceedings of the 2003 ACM Symposium on Applied Computing, pp. 811-816, Melbourne, Florida, USA, 2003.

[25] Boanerges Aleman-Meza, Chris Halaschek, I. Budak Arpinar and Amit Sheth. "Context-Aware Semantic Association Ranking". In Technical Report 03-010, LSDIS Lab, Computer Science, University of Georgia, August 2003.

[26] Taher H. Haveliwala. "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search". In IEEE Transactions on Knowledge and Data Engineering, Vol. 15(4): 784-796, July/August 2003.

[27] Jinan Fiaidhi, Sabah Mohammed, Jihad Jaam and Ahmad Hasnah. "A Standard Framework for Personalization via Ontology-based Query Expansion". In Pakistan Journal of Information and Technology, Vol. 2(2): 96-103, 2003.

[28] Chris Buckley and Ellen M. Voorhees. "Retrieval Evaluation with Incomplete Information". In Proceedings of SIGIR 2004, pp. 25-32, Sheffield, UK, 2004.

Summary of the Invention

The object of the invention is to overcome the deficiencies of the prior art by providing a method for effective query expansion and result ranking in text-based image retrieval.

Drawing on text retrieval techniques, the invention establishes a "dedicated query expansion and result ranking model" suited to the characteristics of image retrieval. Following a "general to specific" design approach, it discloses a technical framework comprising four main algorithms. Starting from a "general query expansion and result ranking model", the framework improves the image search engine by means of text processing methods and semantic network dictionaries. It interacts with the user through semantic expansion, using the semantic network dictionaries together with a set of purpose-built expansion rules, and it ranks the retrieval results with an improved similarity measure, with particular attention to the construction and optimization of the scoring algorithm. The end result is a "dedicated query expansion and result ranking model" for text-based image retrieval.
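
The interplay of a semantic dictionary and expansion rules can be illustrated with a small sketch. The thesaurus entries, similarity values, and the two rules used here (a minimum similarity and a cap on terms per keyword) are assumptions made for illustration; this section does not specify the invention's actual expansion rules.

```python
# Hypothetical thesaurus: keyword -> {related word: similarity in [0, 1]}.
THESAURUS = {
    "car": {"automobile": 0.95, "vehicle": 0.80, "wheel": 0.40},
    "beach": {"seashore": 0.90, "coast": 0.75, "sand": 0.50},
}


def expand_query(keywords, thesaurus=THESAURUS,
                 min_similarity=0.7, max_per_keyword=2):
    """Append to the keyword sequence the related words that pass both
    illustrative rules: similarity threshold and per-keyword cap."""
    expanded = list(keywords)
    for keyword in keywords:
        candidates = thesaurus.get(keyword, {})
        kept = sorted(
            (w for w, s in candidates.items() if s >= min_similarity),
            key=lambda w: -candidates[w],  # most similar first
        )[:max_per_keyword]
        for word in kept:
            if word not in expanded:
                expanded.append(word)
    return expanded


expanded = expand_query(["car", "beach"])
# expanded == ["car", "beach", "automobile", "vehicle", "seashore", "coast"]
```

The original keywords always stay at the front of the sequence, and a keyword missing from the thesaurus is simply passed through unexpanded.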

The method for query expansion and result ranking in image retrieval proposed by the invention improves an image search engine by means of text processing methods and semantic network dictionaries. It comprises the following four parts:

(1) Pre-Processing and Pre-Analysis. Pre-processing segments the initial query into words and tags its punctuation marks; pre-analysis then operates on the pre-processed query to tag stop words, analyse parts of speech, and extract keywords.

(2) Word Lexical Semantic Similarity Measurement. For English words, the semantic distance is computed from path length and depth in the semantic network. For Chinese words, the similarity is computed by jointly considering the similarity of the main-class sememes, the similarity of the semantic expressions, and the similarity of the main-class sememe frames, while also incorporating a maximum-matching rule and sememe depth information.

(3) Query Expansion based on Fusion of Expansion Rules. Using semantic network knowledge combined with a set of purpose-built expansion rules, the keyword sequence derived from the initial query is expanded semantically.

(4) Retrieval Results Ranking based on Scoring. The results returned by the search engine are taken as input. The word semantic similarity measure is used to assess how close the query keyword sequence is to each image's textual description, which yields a score; the score is refined by an improved scoring algorithm, and the final score serves as the basis for ordering the images returned by the search engine.
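
As an illustration of a similarity measure driven by path length and depth in a hypernym network, the sketch below computes the well-known Wu-Palmer measure on a toy taxonomy. It is a stand-in, not the invention's formula, which this section does not give; the taxonomy and all function names are hypothetical.

```python
# Hypothetical toy hypernym tree (child -> parent); not taken from WordNet.
PARENT = {
    "animal": "entity",
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}


def path_to_root(node):
    """Return the node followed by all its ancestors up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    """Depth in the tree, counting the root as depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    return None


def path_length(a, b):
    """Number of edges on the path from a to b through their common subsumer."""
    lcs = lowest_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    if lcs is None:
        return 0.0  # the two words share no ancestor
    return 2 * depth(lcs) / (depth(a) + depth(b))


sim_dog_cat = wu_palmer("dog", "cat")          # 0.75, common subsumer "mammal"
sim_dog_sparrow = wu_palmer("dog", "sparrow")  # 0.5, common subsumer "animal"
```

Words that meet deeper in the tree (a shorter path through a more specific common ancestor) score higher, which is the intuition behind combining path length with depth.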

Compared with the prior art, the method, which obtains the final retrieval results from the image search engine on the basis of the expanded query, has three main advantages: high accuracy, strong completeness, and low time and space cost. High accuracy: in query expansion, very few expansion terms come from uncommon word senses, which ensures that the expansion terms have much in common with the initial query keywords; in result ranking, images returned by the search engine that are weakly relevant or "wrong" are pushed as far down the result list as possible, which ensures that the "better" results within the same result set are ranked nearer the top, where users see them more easily. Strong completeness: in query expansion, the expansion terms appended to the initial keyword sequence are very complete, and they are highly relevant to and consistent with the image descriptions in the collection; in result ranking, the search engine's own search behaviour is not interfered with in any way, which ensures that the returned result set is unaffected by the re-ordering. Low time and space cost: setting aside secondary factors such as network transmission speed and server processing speed, query expansion and result ranking are efficient in time and space; in a real environment transmission takes far longer than the computation itself, so the user does not perceive the additional time.

The main contributions of the present invention are: (1) a WordNet-based semantic similarity measurement algorithm for English words; (2) a HowNet-based semantic similarity measurement algorithm for Chinese words; (3) a query expansion word selection and optimization algorithm based on expansion rules; (4) a scoring and optimization algorithm for retrieval results. These four core algorithms are used to design a technical framework for query expansion and retrieval result ranking in image retrieval.

These advantages meet the application requirements of efficient image retrieval over large-scale image data sets that takes the high-level semantic information of images into account; cross-language, cross-media retrieval is the main application.

Description of drawings

Fig. 1 is the flow framework diagram of the method of the present invention. Labels in the figure: (1) the preprocessing and pre-analysis module; (2) the word semantic similarity measurement module; (3) the query expansion module that fuses expansion rules; (4) the score-based retrieval result ranking module.

Fig. 2 demonstrates the specific steps of the above algorithm framework on a concrete example. It shows the intermediate output of each algorithm module and the final retrieval results of the framework, to give an intuitive understanding.

In Fig. 2, labels (1) and (2) are the original English query and Chinese query entered by the user. Labels (3) and (4) are the expansion word sets corresponding to the original English and Chinese queries, obtained with the "query expansion word selection and optimization algorithm based on expansion rules" using the English semantic dictionary WordNet and the Chinese semantic dictionary HowNet respectively. Labels (5) and (6) are the retrieval results obtained from the image search engine with the expansion word sets of the original English and Chinese queries. Labels (7) and (8) are the final retrieval results obtained from the initial retrieval lists of the original English and Chinese queries by the "scoring and optimization algorithm for retrieval results", which incorporates the "WordNet-based semantic similarity measurement algorithm for English words" and the "HowNet-based semantic similarity measurement algorithm for Chinese words".

Detailed description

The process framework of the present invention for query expansion and retrieval result ranking in image retrieval, and the four core algorithms that make up the framework, are described in detail below with reference to the accompanying drawings:

Example 1

1. Algorithm process framework

Fig. 1 is the flow chart of the application framework; labels 1-4 denote the four main functional modules described above.

The framework is divided into four main modules: Pre-Processing and Pre-Analysis, Concept Semantic Relativity Measurement, Query Expansion based on Fusion of Expansion Rules, and Retrieval Results Ranking based on Scoring. Three of the four core algorithms, namely the WordNet-based semantic similarity measurement algorithm for English words, the HowNet-based semantic similarity measurement algorithm for Chinese words, and the scoring and optimization algorithm for retrieval results, are used in the score-based retrieval result ranking module. The query expansion word selection and optimization algorithm based on expansion rules is used in the query expansion module that fuses expansion rules. The first two modules of the framework also use some existing techniques that are already relatively mature.

(1) Pre-Processing and Pre-Analysis: for the initial query, the main task of preprocessing is word segmentation and punctuation tagging. A maximum-probability word segmentation strategy is used for Chinese queries, while English queries require an additional step that converts the case of the initial letter of each word. On the preprocessed query, pre-analysis performs three tasks. The first is to mark the stop words in the query. The second is to analyze the word class of each word and determine its correct part of speech; inflected words in English queries are additionally restored to their base form. The third is to extract the query terms that serve as keywords according to the result of the word class analysis.

(2) Word Lexical Semantic Similarity Measurement: word semantic similarity reflects how words cluster together and can be measured by the degree to which two words are interchangeable. Word semantic distance is the opposite of word semantic similarity: the greater the semantic distance between two words, the lower their similarity, and the smaller the distance, the greater the similarity. For English words, the measure is based on the organizational structure of the WordNet semantic dictionary and mainly operates on the synonym sets (synsets) that correspond to the concepts: the pair of synsets with the greatest similarity (or smallest semantic distance) represents the two groups of synsets and gives the final result. The specific time and space cost requirements of an image search engine must also be taken into account. For Chinese words, the measure is based on the organizational structure of the HowNet semantic dictionary and, following the structural characteristics of the knowledge system description language, computes the semantic similarity in three parts. The first part considers the similarity of the main class sememes alone. The second part considers the similarity of the whole semantic expressions: the sememes are divided into layers according to the hierarchical structure of the knowledge description language, and the similarity of each layer is computed by maximum matching. The third part considers the similarity of the main class sememe frames. In addition, sememe depth information is included when computing sememe similarity, so that sememes carrying different amounts of information are treated differently.

(3) Query Expansion based on Fusion of Expansion Rules: for a given query keyword, an expansion rule determines another word as its expansion word; this is the semantic expansion rule for a single word. Once the rule is fixed, one semantic expansion consists of expanding each word of the keyword sequence separately on the basis of semantic network knowledge and then merging the results. Because user input is arbitrary, the order of the words in the keyword sequence is ignored and all words are treated equally. The expanded keyword set is then optimized with respect to both the image description information in the image data set and the size of the expansion word set.
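The expand-then-merge step described above can be sketched as follows. This is only an illustration under stated assumptions: `expand_query`, `toy_rule`, `TOY_RULES` and the `max_terms` cap are hypothetical names introduced here, and the size cap is a crude stand-in for the optimization step, whose actual criteria are not given in this passage.

```python
from typing import Callable, Iterable, List

def expand_query(keywords: Iterable[str],
                 expand_word: Callable[[str], Iterable[str]],
                 max_terms: int = 10) -> List[str]:
    """Expand every keyword independently, then merge the results."""
    # Keyword order carries no meaning, so duplicates are simply dropped.
    original = list(dict.fromkeys(keywords))
    merged = list(original)
    seen = set(original)
    for word in original:
        for candidate in expand_word(word):   # single-word expansion rule
            if candidate not in seen:
                seen.add(candidate)
                merged.append(candidate)
    # Assumed optimization: cap the set size, never dropping original keywords.
    return merged[:max(max_terms, len(original))]

# Hypothetical single-word rule backed by a toy lookup table.
TOY_RULES = {"dog": ["puppy", "hound"], "ball": ["football", "volleyball"]}

def toy_rule(word: str) -> List[str]:
    return TOY_RULES.get(word, [])

expanded = expand_query(["dog", "ball", "dog"], toy_rule, max_terms=5)
```

In a real system the rule would query WordNet or HowNet rather than a lookup table, and the pruning would also consult the image descriptions.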

(4) Retrieval Results Ranking based on Scoring: each image in the retrieval results returned by the search engine is scored from the keyword sequence and the image description, and the image score is the basis for ordering the returned images. What is actually scored is the image description; nothing is known about the image itself, so the scheme trusts and relies entirely on the image description. That is, supported by the "semantic closeness of words" computation, the degree of "closeness" between the query keyword sequence and the keyword sequence of the image description is scored. The results computed for the image description keywords are given corresponding weights in order to emphasize objects that are likely to be "prominent" in the image. In addition, a suitably sized cache of similarity results is set up to form a multi-level caching mechanism and improve the scoring strategy, which reduces the computational complexity, saves a large amount of time, and speeds up processing.

2. WordNet-based semantic similarity measurement algorithm for English words

The WordNet-based semantic similarity measurement algorithm for English words rests on the following idea. While ranking image retrieval results online, the system should be able to compute dynamically the semantic similarity between the English query keyword sequence and the keyword sequence of each retrieval result, without completely losing the statistical information of the "general English word semantic similarity measurement model" and while keeping the time and space cost at a suitable order of magnitude. That is, the synsets obtained by suitable processing from a specific English semantic network knowledge source should be used to correct the original word semantic similarity measurement model. However, the "general English word semantic similarity measurement model" suffers from data sparsity and from restrictions on the part of speech of the words it can handle, so synsets and the semantic computation model cannot be combined during online processing to measure word semantic similarity accurately. The present invention therefore provides a new algorithm that uses the synsets obtained after a special expansion step to correct the original general English word semantic similarity measurement model. The new algorithm satisfies the following three conditions at the same time:

(1) Avoid data sparsity: the semantic definition of a word often consists of too few words, which causes data sparsity in the semantic computation. The problem can only be solved by correcting the original general English word semantic similarity measurement model with the word set obtained after the special expansion step.

(2) No restriction on part of speech: the measure must not apply only to pairs of nouns; it should be able to measure semantic similarity between words of different parts of speech.

(3) Low time and space complexity: algorithms with high time and space complexity cannot be used during processing. For example, probabilistic methods have to read large amounts of statistical data. Algorithmic improvements and optimizations must be considered in order to meet the time and space cost requirements.

The present invention designs a new algorithm that meets the above three conditions through the following steps.

For measuring the semantic similarity of English words, the classic Lesk algorithm treats the semantic definition of a word as an unordered bag of words and measures similarity by the overlap between the words of two definitions. Lesk's assumption is that words with similar meanings are defined using similar words. However, the number of words in a definition is often too small, which leads to data sparsity. To address this problem, the present invention proposes several extension algorithms.

Lesk extension algorithms overcome the data sparsity of the classic Lesk algorithm to some extent by extending the semantic definitions of words. Ekedahl and Golub adapted the Lesk algorithm with WordNet: the semantic definition used for counting overlaps is enlarged by looking up the two nearest hypernyms of a concept. Pedersen et al. use another extension, which considers all semantic definitions directly connected to a word in the WordNet structure, including hypernyms and hyponyms, and gives greater weight to phrases. Its authors claim that, under the same conditions, this algorithm performs significantly better than the traditional Lesk algorithm. The information most commonly used by Lesk extension algorithms is the hypernym, that is, the parent node in WordNet, which is a further abstraction of the word's meaning.

Existing Lesk extension algorithms extend the semantic definition of a word mainly with semantic information that is directly connected to the word in the WordNet hierarchy, especially hypernyms, which overcomes data sparsity to some extent. Although these methods make effective use of the direct information in the WordNet structure, they neglect some very useful indirect information. The present invention therefore establishes a Lesk extension algorithm based on coordinate terms (Lesk-C for short), which extends the semantic definition of a word further. The coordinate terms of a word are defined as the sibling nodes, in the WordNet hierarchy, of the synset to which the word belongs (for example, the coordinate terms of "basketball" include "football" and "volleyball"). Clearly, a synset and each of its coordinate terms must share a common parent node.

The Lesk-C algorithm extends the semantic definition of a concept by adding the definitions of all (or some) of its coordinate terms. The idea rests on the assumption that a concept and its coordinate terms play the same role in determining the context. Under this assumption, since "basketball", "football" and "volleyball" are a group of coordinate terms, Lesk-C extends the definition of "volleyball" with the definitions of its coordinate terms "basketball" and "football", which increases the chance that definitions share words. Coordinate terms are sibling nodes in WordNet. They are not abstractions of the original word sense and are not even directly connected to it, but under the assumption that the semantic definition of a concept and those of its coordinate terms play the same role in determining the concept in context, coordinate terms are as meaningful as hypernyms.
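The effect of adding coordinate-term definitions can be illustrated with a small sketch. Everything here is a toy assumption: the glosses in `TOY_GLOSSES`, the sibling table `TOY_COORDINATES` and the stop list `STOP` are invented for illustration and are not taken from WordNet.

```python
# Hypothetical glosses; real definitions would come from WordNet.
TOY_GLOSSES = {
    "volleyball": "a game in which two teams hit an inflated ball over a high net",
    "basketball": "a game played on a court by two teams who throw a ball through a hoop",
    "football": "a game in which two teams kick or carry a ball toward a goal",
    "hoop": "a circular ring through which a ball is thrown on a court",
}
# Hypothetical coordinate terms (siblings under a common parent).
TOY_COORDINATES = {"volleyball": ["basketball", "football"]}
STOP = {"a", "in", "which", "on", "by", "who", "or", "the", "is",
        "over", "through", "toward"}

def bag(word, use_coordinates):
    """Bag of gloss words, optionally extended with coordinate-term glosses."""
    words = set(TOY_GLOSSES[word].split())
    if use_coordinates:
        for sibling in TOY_COORDINATES.get(word, []):
            words |= set(TOY_GLOSSES[sibling].split())
    return words - STOP

def overlap(w1, w2, use_coordinates):
    """Lesk-style score: number of words shared by the two bags."""
    return len(bag(w1, use_coordinates) & bag(w2, use_coordinates))

plain = overlap("volleyball", "hoop", use_coordinates=False)
extended = overlap("volleyball", "hoop", use_coordinates=True)
```

With these toy glosses the plain overlap between "volleyball" and "hoop" contains only "ball", while the extended bag also picks up "court" from the gloss of "basketball".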

For the complete semantic definitions obtained after the Lesk-C extension, each word belongs to several synsets, so measuring the semantic relatedness (or semantic distance) between two concepts is in practice a computation on two groups of synsets. In general, the pair of synsets with the greatest similarity (or smallest semantic distance) is taken to represent the two groups, and the final result is computed from that pair.

The conventions and symbols used in the description below are defined as follows:

(1) The path distance between two synsets S₁ and S₂ in the WordNet semantic network is the number of edges on the path from S₁ to S₂, denoted Len(S₁, S₂).

(2) When only the hypernym/hyponym edges of the WordNet semantic network are considered, the network degenerates into a forest. Adding a virtual root node turns the forest into a tree. The lowest common parent node (Lowest Super-Ordinate) of two synsets S₁ and S₂ in this hypernym tree is denoted Lcs(S₁, S₂), and the depth of a node in the tree is given by the function Depth().

(3) The concept semantic relatedness Sim(C₁, C₂) and the semantic distance Dist(C₁, C₂) are related by:

$Sim(C_1, C_2) + Dist(C_1, C_2) = 1 \qquad (1)$

In measuring concept semantic similarity, the WordNet hierarchy is treated as a graph and path information is used to compute relatedness. The most direct idea is that the closer two nodes are, the more related they are. In other words, the closer the common hypernym of the concepts represented by two nodes is to those nodes, the greater their similarity. The similarity formula used here is:

$Sim(C_1, C_2) = Sim(S_1, S_2) = \frac{2 \times Depth(Lcs(S_1, S_2))}{Depth(S_1) + Depth(S_2)} \qquad (2)$

Here Depth() is the depth of a concept C or synset S in the WordNet hierarchy, and Lcs(S₁, S₂) is the deepest of all common hypernyms of concepts C₁ and C₂, or equivalently of synsets S₁ and S₂.

This formula can be rearranged into the following form:

$Dist(C_1, C_2) = \frac{Len(C_1, Lcs(C_1, C_2)) + Len(C_2, Lcs(C_1, C_2))}{Len(C_1, Lcs(C_1, C_2)) + Len(C_2, Lcs(C_1, C_2)) + 2 \times Depth(Lcs(C_1, C_2))} \qquad (3)$
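The rearrangement from formula (2) to formula (3) can be checked as follows. The derivation assumes, as holds in a tree, that the depth of a concept equals the depth of the lowest common parent plus the path length from the concept up to that parent; the shorthand symbols $d$, $l_1$ and $l_2$ are introduced here only for brevity.

```latex
% Assumption: Depth(C_i) = Depth(Lcs(C_1, C_2)) + Len(C_i, Lcs(C_1, C_2))
\begin{align*}
d &= Depth(Lcs(C_1, C_2)), \qquad l_i = Len(C_i, Lcs(C_1, C_2)) \\
Sim(C_1, C_2) &= \frac{2d}{(d + l_1) + (d + l_2)} = \frac{2d}{l_1 + l_2 + 2d} \\
Dist(C_1, C_2) &= 1 - Sim(C_1, C_2) = \frac{l_1 + l_2}{l_1 + l_2 + 2d}
\end{align*}
% Example: d = 2, l_1 = l_2 = 1 gives Sim = 4/6 = 2/3 and Dist = 2/6 = 1/3.
```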

English words are often polysemous, so word semantic similarity should be computed between concepts (word senses, or semantic definitions). The semantic similarity of two isolated words is the maximum of the similarities between all pairs of their concepts:

$Sim(W_1, W_2) = \max Sim(C_{1i}, C_{2j}), \quad i = 1 \ldots n,\ j = 1 \ldots m \qquad (4)$

Here W₁ denotes word 1, which has n concepts, W₂ denotes word 2, which has m concepts, C₁ᵢ is the i-th concept of W₁, and C₂ⱼ is the j-th concept of W₂.

The steps of the above algorithm are described in pseudocode as follows:

(1) Obtain the input: two words W₁ and W₂.

(2) Select one concept of each word, C₁ᵢ and C₂ⱼ.

(3) Look up the WordNet semantic network file to obtain the two synsets S₁ᵢ and S₂ⱼ that represent C₁ᵢ and C₂ⱼ.

(4) Using formulas (1)-(3), compute the semantic distance Dist(C₁ᵢ, C₂ⱼ) from S₁ᵢ and S₂ⱼ.

(5) Repeat steps (2)-(4) to obtain the semantic distance, and hence by formula (1) the similarity, for every pair of concepts of the two words. According to formula (4), the maximum similarity is taken as the final word similarity.

When computing Dist(C₁ᵢ, C₂ⱼ), only hypernym/hyponym relations are used.
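Steps (1)-(5) can be sketched on a toy hypernym tree as follows. The tree `PARENT`, the sense table `SENSES` and all function names are hypothetical stand-ins for lookups in the WordNet files; only formulas (1), (3) and (4) are taken from the text.

```python
from itertools import product

# Hypothetical hypernym tree (child -> parent); "ROOT" is the virtual root.
PARENT = {
    "entity": "ROOT",
    "animal": "entity", "artifact": "entity",
    "dog": "animal", "cat": "animal",
    "ball": "artifact",
}
# Hypothetical word -> synsets (concepts) table.
SENSES = {"dog": ["dog"], "cat": ["cat"], "ball": ["ball"], "pet": ["dog", "cat"]}

def path_to_root(s):
    path = [s]
    while path[-1] != "ROOT":
        path.append(PARENT[path[-1]])
    return path

def depth(s):
    return len(path_to_root(s)) - 1            # ROOT has depth 0

def lcs(s1, s2):
    """Lowest common parent: first ancestor of s2 that is also above s1."""
    ancestors = set(path_to_root(s1))
    return next(a for a in path_to_root(s2) if a in ancestors)

def dist(s1, s2):
    """Formula (3), using hypernym/hyponym edges only."""
    c = lcs(s1, s2)
    length = (depth(s1) - depth(c)) + (depth(s2) - depth(c))
    denom = length + 2 * depth(c)
    return length / denom if denom else 0.0    # both nodes are ROOT

def word_sim(w1, w2):
    """Formula (4): best similarity over all concept pairs, Sim = 1 - Dist."""
    return max(1 - dist(s1, s2) for s1, s2 in product(SENSES[w1], SENSES[w2]))
```

In this toy tree `word_sim("dog", "cat")` is 2/3, because the two synsets meet at "animal" (depth 2), while `word_sim("dog", "ball")` is 1/3, because they only meet at "entity" (depth 1).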

3. HowNet-based semantic similarity measurement algorithm for Chinese words

The HowNet-based semantic similarity measurement algorithm for Chinese words rests on the following idea. While ranking image retrieval results online, the system should be able to compute dynamically the semantic similarity between the Chinese query keyword sequence and the keyword sequence of each retrieval result, taking full account of the many difficulties and complexities of Chinese while keeping the time and space cost at a suitable order of magnitude. That is, the multi-level descriptions of Chinese word concepts in a specific semantic network knowledge source should be used to extract rich semantic information and to establish a measure that agrees better with human subjective judgment. However, the "general Chinese word semantic similarity measurement model" cannot fully capture the inherent associations between word concepts and suffers from domain imbalance and data sparsity, so online processing tends to compute the similarity of the word concepts themselves and pays little attention to their different senses. The present invention therefore provides a new algorithm that uses semantic descriptions of word concepts that are "correct, unbiased and complete" to correct the general Chinese word semantic similarity measurement model. The new algorithm satisfies three conditions at the same time:

(1) Avoid data sparsity: the semantic definition of a concept often consists of too few words, which causes data sparsity in the semantic computation. The problem can only be solved by correcting the original general Chinese word semantic similarity measurement model with the multi-level description of word concept semantics plus auxiliary information.

(2) High discriminative power: the algorithm should make effective use of the knowledge structure of the Chinese semantic network and separate different groups of words into different levels of similarity.

The design of a new algorithm that meets the above three conditions is described next.

To ensure that the semantic descriptions of word concepts are sufficiently rich, consistent and accurate, HowNet uses a knowledge description specification, the Knowledge Database Mark-up Language (KDML), which has the following four important components.

(1) Sememe: the words used in KDML are called sememes, such as "exercise|锻炼" and "sport|体育", and they are organized according to the KDML grammar. Sememes are unambiguous. They are the "most basic units of meaning that cannot easily be divided further", extracted from Chinese characters (including single-morpheme words); that is, they are the smallest units of description.

(2) Main class sememe: the first sememe in a semantic expression is also called the main class sememe; in the example above, "exercise|锻炼" is the main class sememe. The main class sememe must state the most basic meaning of the concept and can be regarded as having the strongest power to describe it.

(3) Semantic expression: "DEF={...}" is the core of the whole record. It defines and describes the concept and is called the semantic expression. KDML is used to standardize it so that concept descriptions are sufficiently rich, consistent and accurate.

(4) Main class sememe frame: put simply, most sememes are themselves given a semantic expression, just as words are, as shown in the figure below. For example, the main class sememe frame of the sememe "thing|万物" is "{entity|实体:{ExistAppear|存现:existent={~}}}", and the description syntax strictly follows KDML.

In KDML-based semantic descriptions of word concepts, sememes at different bracket levels differ in their power to describe the word's meaning. The further out the bracket level, the stronger the sememe's power to describe the concept. A sememe in an inner bracket is a specific explanation of the sememe one level up; it describes the concept only indirectly, and its descriptive power is comparatively weak. The levels must therefore be treated differently when measuring word semantic similarity.

Sememe similarity, an important basis for measuring word similarity, is computed from the sememe hierarchy (that is, the hypernym/hyponym relations). Based on the tree-shaped hierarchy, the formula below takes into account the path length between nodes and also the depth of the nodes in the hierarchy.

$Sim(S_1, S_2) = \frac{\alpha \times \min(Depth(S_1), Depth(S_2))}{\alpha \times \min(Depth(S_1), Depth(S_2)) + Dist(S_1, S_2)} \qquad (1)$

Here S₁ and S₂ are two sememes; Dist(S₁, S₂) is the path length between S₁ and S₂; α is a tuning parameter; Depth(S₁) and Depth(S₂) are the depths of S₁ and S₂ in the hierarchy; and min(Depth(S₁), Depth(S₂)) is the smaller of the two depths. Sememes differ in how much semantic information they carry: nodes lower in the hierarchy carry richer semantic information, and nodes higher up are more abstract, so sememes at different levels should be treated differently.
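A minimal sketch of this sememe similarity formula follows. The function name `sememe_sim`, the default `alpha=1.0` and the handling of a zero denominator are assumptions made for illustration; the text does not state a value for α.

```python
def sememe_sim(depth1, depth2, dist, alpha=1.0):
    """Sememe similarity from hierarchy depths and path length.

    alpha is the tuning parameter; 1.0 is an arbitrary illustrative default.
    """
    weighted_depth = alpha * min(depth1, depth2)
    denom = weighted_depth + dist
    if denom == 0:
        # Assumption: two sememes at depth 0 with no path between them
        # are the same root node, so they are treated as identical.
        return 1.0
    return weighted_depth / denom

deep_pair = sememe_sim(4, 4, 2)      # two specific sememes, 2 edges apart
shallow_pair = sememe_sim(1, 1, 2)   # two abstract sememes, 2 edges apart
```

For the same path length, the deeper pair scores higher (4/6 against 1/3 with α = 1), which reflects the statement that lower nodes carry more semantic information.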

Chinese words are often polysemous, so word semantic similarity should be computed between word concepts. The semantic similarity of two isolated words (words not placed in any particular context) is the maximum of the similarities between all pairs of their concepts:

$Sim(W_1, W_2) = \max Sim(C_{1i}, C_{2j}), \quad i = 1 \ldots n,\ j = 1 \ldots m \qquad (2)$

Here word W₁ has n concepts, word W₂ has m concepts, C₁ᵢ is the i-th concept of W₁, and C₂ⱼ is the j-th concept of W₂. Following the structural characteristics of KDML, the concept semantic similarity is computed in three parts:

$Sim(C_1, C_2) = w_1 \times P_1 + w_2 \times P_2 + w_3 \times P_3 \qquad (3)$

Here P₁ is the similarity between the main class sememes of the two concepts; P₂ is the similarity between the two whole semantic expressions; P₃ is the similarity between the main class sememe frames of the two DEFs; and w₁, w₂ and w₃ are the weights of the three partial similarities, which must satisfy w₁ + w₂ + w₃ = 1, w₂ > w₁ and w₂ > w₃.
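The weighted combination and its constraints can be written down directly. In the sketch below the function name `concept_sim` and the example weights (0.3, 0.5, 0.2) are illustrative assumptions; the text only requires that the weights sum to 1 and that w₂ be the largest.

```python
def concept_sim(p1, p2, p3, weights=(0.3, 0.5, 0.2)):
    """Combine the three partial similarities with weights w1, w2, w3."""
    w1, w2, w3 = weights
    # Constraints from the text: w1 + w2 + w3 = 1, w2 > w1, w2 > w3.
    if abs(w1 + w2 + w3 - 1.0) > 1e-9 or not (w2 > w1 and w2 > w3):
        raise ValueError("weights must sum to 1 with w2 the largest")
    return w1 * p1 + w2 * p2 + w3 * p3

score = concept_sim(1.0, 0.5, 0.0)   # 0.3*1.0 + 0.5*0.5 + 0.2*0.0
```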

P₁ is computed with formula (1). As explained above, the main-class sememe describes a concept most directly, so it is worth treating as a separate part.

For P₂, the semantic expression is a complete unit with its own grammar rules, so its similarity should be computed over the expression as a whole, with reference to the KDML rules. Because the entire semantic expression must be considered, this part is the most complex in the measure and carries the largest weight. The computation has two stages: the sememes are first partitioned into layers according to the hierarchical structure of the KDML description, and the similarity within each layer is then computed by maximum matching. First, the similarity of every sememe pair is computed and the pair with the highest value is selected; if several pairs tie, any one of them may be chosen. Next, the most similar pair among the remaining sememes is selected, and so on. When the two concepts have different numbers of sememes in the same layer, some sememes are paired with an empty element; such pairs are uniformly assigned a small value r (a preset parameter). Finally, the similarities of the selected pairs are summed and averaged to give P₂.
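The two-stage procedure for P₂ can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the default r = 0.2, the toy similarity function standing in for formula (1), and the choice to return 0.0 for two empty expressions are all hypothetical.

```python
from itertools import product, zip_longest


def greedy_layer_scores(layer_a, layer_b, sememe_sim, r):
    """Greedily pair the sememes of one layer, most similar pair first."""
    remaining_a, remaining_b = list(layer_a), list(layer_b)
    scores = []
    while remaining_a and remaining_b:
        # Pick the remaining pair with the highest similarity; on a tie,
        # max() keeps the first pair found, i.e. "any one of them".
        a, b = max(product(remaining_a, remaining_b),
                   key=lambda pair: sememe_sim(*pair))
        scores.append(sememe_sim(a, b))
        remaining_a.remove(a)
        remaining_b.remove(b)
    # Sememes left without a partner are paired with an empty element
    # and uniformly receive the small preset value r.
    scores.extend([r] * (len(remaining_a) + len(remaining_b)))
    return scores


def expression_similarity(layers_a, layers_b, sememe_sim, r=0.2):
    """Average the selected pair similarities over all layers (P2 sketch).

    layers_a / layers_b: one list of sememes per layer of the expression.
    r = 0.2 is an assumed default, not a value given in the source.
    """
    scores = []
    for layer_a, layer_b in zip_longest(layers_a, layers_b, fillvalue=()):
        scores.extend(greedy_layer_scores(layer_a, layer_b, sememe_sim, r))
    # Assumption: two empty expressions score 0.0.
    return sum(scores) / len(scores) if scores else 0.0


def toy_sim(a, b):
    """Hypothetical stand-in for the sememe similarity of formula (1)."""
    if a == b:
        return 1.0
    return 0.5 if {a, b} == {"animal", "beast"} else 0.1
```

With `toy_sim`, comparing the single-layer expressions `[["animal", "fierce"]]` and `[["beast"]]` first selects the pair (animal, beast), then pairs the leftover sememe with an empty element and averages the two values.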

P₃ is computed in the same way as P₂. Measuring the similarity of the main-class sememe frames is in effect another way of computing main-class sememe similarity, and it again emphasizes how directly the main-class sememe describes a concept.

Finally, from these three partial similarities, formula (3) gives the semantic similarity of each pair of concepts, and formula (2) takes the maximum over all pairs as the semantic similarity of the two words.

Some special cases deserve attention. When a single sememe fully explains a word, that sememe carries a relatively large amount of information and sits low in the sememe tree. Adding sememe depth information in this case strengthens the descriptive power of the single sememe and brings the word similarity closer to the expected value. In addition, special sememes enclosed in quotation marks, also called concrete words, carry rich and specific semantic information and directly determine the nature of the concepts they describe. They should therefore be distinguished from ordinary sememes, and the semantic similarity between concrete words is given its own tuning parameter.

The similarity model above works on the complete semantic expression, partitions the sememes into layers, applies maximum matching, and separately accounts for how directly the main-class sememe describes a concept. This mechanism uses the knowledge structure of HowNet more effectively and makes the results more discriminative. Adding sememe depth information to the measure also makes the results more accurate, an effect that is most visible when the semantic expression contains only a few sememes.

The steps of the algorithm are described in pseudocode as follows:

(3) Look up the HowNet semantic network file to obtain the main-class sememe, semantic expression, semantic expression frame, and related information for concepts $C_{1i}$ and $C_{2j}$.

(4) Using the sememe similarity formula (1), compute the similarity P₁ between the main-class sememes of the two concepts.

(5) Using the two-stage procedure, compute the similarity P₂ between the semantic expressions of the two concepts and the similarity P₃ between their main-class sememe frames.

(6) Combine the three partial similarities with formula (3) to obtain the similarity of the two concepts.

(7) Repeat steps (2) to (6) to obtain the similarity of every pair of concepts of the two words, then use formula (2) to select the maximum as the final word similarity.
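The way formulas (2) and (3) combine can be sketched in a few lines. This is an illustrative sketch only: the weights (0.3, 0.4, 0.3) are hypothetical values chosen to satisfy the stated constraints, `parts` is an assumed callback that returns (P₁, P₂, P₃) for a concept pair, and returning 0.0 when a word has no concepts is also an assumption.

```python
def concept_similarity(p1, p2, p3, weights=(0.3, 0.4, 0.3)):
    """Formula (3): weighted sum of the three partial similarities.

    The default weights are hypothetical; the source only requires
    w1 + w2 + w3 = 1, w2 > w1 and w2 > w3.
    """
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9 or not (w2 > w1 and w2 > w3):
        raise ValueError("weights must sum to 1 with w2 > w1 and w2 > w3")
    return w1 * p1 + w2 * p2 + w3 * p3


def word_similarity(concepts_a, concepts_b, parts):
    """Formula (2): maximum concept similarity over all concept pairs.

    parts(c1, c2) is an assumed callback returning the tuple (P1, P2, P3).
    """
    return max(
        (concept_similarity(*parts(c1, c2))
         for c1 in concepts_a for c2 in concepts_b),
        default=0.0,  # assumption: a word without concepts matches nothing
    )
```

Validating the weight constraints at the point of use keeps a mis-tuned configuration from silently producing scores that no longer give the semantic expression the largest share.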

4. Query expansion term selection and optimization algorithm based on expansion rules

Two approaches are available for semantic query expansion over the semantic network of a semantic dictionary. The first automatically folds the search results obtained for the original query into the original keyword sequence. It generally requires manual involvement and a substantial amount of machine learning and accumulated data; otherwise it introduces many irrelevant words and the expansion quality is poor. The second hands the choice to the user: the system only presents the expanded terms, and the user decides whether they apply and which of them to use. Although this approach requires active participation and makes the search engine somewhat more complex to use, the resulting expansion terms are in effect user input and are therefore of high value.

Automatic query expansion is a fairly mature technique in text retrieval, for example [3, 4, 7]. Such algorithms mostly merge relevant information from the retrieved documents, yet many retrieved documents are unrelated to the query. Semi-automatic query expansion with user interaction is also well studied, for example [5, 6, 20]. These algorithms usually present the user with every related term that can be extracted from the retrieved documents, so the user faces a very wide range of choices and can easily select poorly or introduce unnecessary noise. Moreover, both kinds of query expansion were designed for text retrieval and for the characteristics of textual information, so they are not fully applicable to text-based image retrieval. The present algorithm therefore combines the two classic query expansion techniques in a form suited to image retrieval; it is lightweight, easier to implement and use, and has a more direct effect.

Once the expansion mode is fixed, the first task in implementing query expansion is to construct expansion rules that suit image retrieval: for a given query keyword, which concrete rules decide that another term is a suitable expansion term. These are the semantic expansion rules for a single term. With the rules in place, semantic expansion of a query amounts to expanding each term of the keyword sequence separately and merging the results. Because users enter keywords in arbitrary order, the order of terms in the keyword sequence is ignored and all terms are treated equally.

To serve users who select expansion terms in order to make their search intent more explicit, the invention builds its query expansion rules around the following two situations.

(1) The user cannot find a good word to describe the target images abstractly, while image annotators tend to use direct, common words, which makes choosing query words difficult. Given the annotations attached to the images and the needs of retrieval, the expansion should therefore produce terms that share something in common with the query keyword. For example, a user who wants images of big cats and enters "big_cat" will get poor results: most images are annotated with specific terms such as "tiger" and "lion", so few results are returned. The best remedy is to enter the names of big cats such as "tiger" and "lion" in the search box, but doing so by hand is tedious. If the user could enter only "tiger" and have the names of other big cats offered as expansions, the user would only need to select the expansion terms, without having to think of the other names and type them in.

(2) The query keyword entered by the user has several meanings, which happens often. Adding selected expansion terms to the keyword sequence then gives the search engine some basis for disambiguation.

For example, if the keyword "bank" appears together with "water" or "coast", the image search engine has grounds to avoid returning images of a financial bank, or to avoid scoring them too highly.

Based on these rules, semantic query expansion searches the semantic network of a semantic dictionary (WordNet for English queries, HowNet for Chinese queries). For an original English query keyword, terms that stand in a part relation (Part), sibling relation (Sibling), or child relation (Child) to it are returned as expansion terms; an original Chinese query is expanded directly by DEF matching. For English queries, the child relation covers direct children only, that is, the direct child nodes in the semantic hierarchy.

Beyond these rules, the invention also takes into account that expansion terms are ultimately added to the keyword sequence, and that the terms of the keyword sequence are matched against the annotations of the images in the image library during search. A term that never appears in any image annotation cannot affect the search results, so producing it in the expansion module is pointless. A semantic dictionary usually contains tens of thousands of terms, sometimes more than a hundred thousand, any of which could be selected by the expansion rules. Image annotations, however, generally contain only common words, and the set of common words is much smaller. The last step of rule-based semantic query expansion is therefore to filter the expansion set against the annotation vocabulary: expansion terms that do not occur in the annotation vocabulary are discarded.

One further problem arises in WordNet-based expansion of English queries. Because the terms of the keyword sequence are independent of one another, the final expansion is the union of the expansion sets of the individual terms. Likewise, the expansion set of each term is the union of the expansion sets of the synonym sets (Synsets) that contain the term. If one Synset of a term sits in a "dense" region of the semantic network with complex semantic relations, the expansion set obtained from that Synset is far larger than those of the term's other Synsets. Since all Synsets of a term have equal standing, such a Synset can introduce many expansion terms that are merely noise. For example, when the keyword "tiger" is expanded, a large share of the expansion terms turn out to be human-related. This unexpected result comes from one Synset of "tiger" whose meaning is "a fierce or audacious person". This personified sense is not the common sense of "tiger", and without disambiguation no specific rule can identify such Synsets. Semantic expansion therefore limits the size of the expansion set of each Synset (for example, at most 15 expansion terms per Synset), so that a relatively obscure Synset cannot contribute a large number of unsuitable expansion terms.

(1) Obtain the input: the original query keyword sequence.

(2) Select one of its keywords.

(3) If it is an English keyword, look up the WordNet semantic network file to obtain its synonym sets (Synsets). If it is a Chinese keyword, look up the HowNet semantic network file to obtain its semantic definitions (DEF).

(4) Apply the expansion rules. For each Synset of an English keyword, follow the part relation (Part), sibling relation (Sibling), and child relation (Child) in the semantic network hierarchy to collect the corresponding related terms as the expansion set. For each DEF of a Chinese keyword, expand by direct matching.

(5) Apply the post-expansion processing strategy: filter the expansion set against the annotation vocabulary of the image library to obtain the final, optimized expansion set.

(6) Repeat (2) to (5) for every keyword in the original query and merge the resulting expansion sets into the expanded query expression that corresponds to the original query.
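The English branch of steps (1) to (6) can be sketched over a toy in-memory network. Everything below is hypothetical illustration data and naming: the dictionaries stand in for the WordNet files, the synset identifiers are invented, and the Chinese DEF-matching branch is omitted. The cap of 15 terms per Synset follows the example figure given in the text.

```python
# Hypothetical toy network standing in for WordNet (not real WordNet data).
SYNSETS_OF = {"tiger": ["tiger.animal", "tiger.person"]}
PARENT = {"tiger.animal": "big_cat", "lion.animal": "big_cat",
          "leopard.animal": "big_cat", "tiger.person": "person",
          "hero.person": "person"}
CHILDREN = {"big_cat": ["tiger.animal", "lion.animal", "leopard.animal"],
            "person": ["tiger.person", "hero.person"],
            "tiger.animal": ["bengal_tiger.animal"]}
PARTS = {"tiger.animal": ["claw.part"]}
WORDS_IN = {"lion.animal": ["lion"], "leopard.animal": ["leopard"],
            "bengal_tiger.animal": ["bengal_tiger"],
            "hero.person": ["hero"], "claw.part": ["claw"]}


def expand_synset(synset, limit=15):
    """Collect Part, Sibling and direct Child terms of one synset."""
    related = list(PARTS.get(synset, []))
    parent = PARENT.get(synset)
    if parent is not None:
        # Siblings: other children of the same parent node.
        related += [s for s in CHILDREN.get(parent, []) if s != synset]
    # Child relation: direct children only.
    related += CHILDREN.get(synset, [])
    terms = []
    for rel in related:
        for word in WORDS_IN.get(rel, []):
            if word not in terms:
                terms.append(word)
    # Cap the set so one dense, obscure synset cannot flood the result.
    return terms[:limit]


def expand_query(keywords, annotation_vocab, limit=15):
    """Expand each keyword, filter by the annotation vocabulary, merge."""
    expanded = []
    for keyword in keywords:
        for synset in SYNSETS_OF.get(keyword, []):
            for term in expand_synset(synset, limit):
                usable = term in annotation_vocab
                if usable and term not in keywords and term not in expanded:
                    expanded.append(term)
    return expanded
```

In this toy data the "person" sense of "tiger" contributes only "hero", which the annotation filter then drops because no image is annotated with it; the cap and the filter together are what keep such senses from dominating the expansion.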

5. Scoring and optimization algorithm for retrieval results

The basic unit that an image search engine ranks is the image, and the basis for ranking is the image's features. In content-based image retrieval, the low-level visual features serve as the features of an image; in text-based image retrieval, the annotations of the image are its features. In the latter case the user's query keywords supply the ranking criterion: images whose annotation sequences are closer to the query keyword sequence are placed higher in the result list. A scoring and optimization algorithm is therefore needed to decide which of the retrieved images is "better" for the user's query keyword sequence. No objective standard of "good" exists, however; different users who enter the same query may well judge the same results very differently. The scoring and optimization algorithm thus only defines a scoring rule whose parameters can be tuned, and ranks images by the resulting scores in the hope of achieving a "better" result.

Ranking is a fairly mature technique in text retrieval, for example [24, 25, 26]. Such algorithms mostly rely on direct matches between query keywords and retrieved documents, so documents that are relevant but do not contain the query keywords may not be returned. The invention establishes a ranking strategy suited to text-based image retrieval and provides a scoring and optimization algorithm for retrieval results. The algorithm does not interfere with the search behavior of the image search engine and does not change which results are returned. Its main purpose is to move the "better" results within the same result set toward the front, where users can see them more easily.

An image search engine often returns many results, and users usually have the patience to look only at the first few. Placing the results closest to the user's search intent near the front of the list is therefore important. The invention uses a scoring algorithm based on word semantic similarity to rank the returned results: each image is scored from the query keyword sequence and the image's annotation information (its annotation word sequence), and the scores serve as the ranking basis. The algorithm in fact scores the annotation set of an image and knows nothing about the image itself; it is a scheme that relies entirely on the image annotations. The word semantic similarity measures discussed earlier provide the computation of "semantic closeness" that this ranking algorithm needs.

Because users enter query keywords in arbitrary order, every term in the query keyword sequence is treated equally. For the annotation word sequence of an image, however, terms that appear earlier are assumed to be more trustworthy. The assumption rests on the observation that annotators tend to enter the most prominent object in the image first. Different annotators may judge the same image differently, and an image does not always have a single most prominent object, but in most images the focal object is quite clear. For this reason the scoring algorithm weights the result for each annotation word so as to emphasize the objects that are likely to be prominent in the image. The ranking score of an image is computed as follows:

$\mathrm{Score} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} w(j, m)\, \mathrm{Sim}(k_i, t_j)}{\sum_{j=1}^{m} w(j, m)} \qquad (1)$

Here, $k_i$ is the i-th keyword of the query keyword sequence; $t_j$ is the j-th annotation word of the image's annotation word sequence; $\mathrm{Sim}(k_i, t_j)$ is the semantic similarity between the terms $k_i$ and $t_j$; $w(j, m) = (m + 1 - j)^2$ is the position weight, which reflects the order of the annotation words in the sequence; and n and m are the numbers of terms in the query keyword sequence and in the annotation word sequence, respectively. The first annotation word in the sequence has weight $m^2$; relative to the total weight $\sum_{j=1}^{m} w(j, m)$, its share is:

$\frac{m^2}{\sum_{j=1}^{m} w(j, m)} = \frac{m^2}{\sum_{j=1}^{m} j^2} = \frac{6 m^2}{m (m + 1)(2 m + 1)} = \frac{6 m}{(m + 1)(2 m + 1)} \qquad (2)$

This share is a decreasing function of m: as the annotation word sequence grows longer, the influence of the leading word's weight shrinks, falling off roughly as 3/m for large m rather than linearly. This matches intuition: if an image contains too many objects, none of them is especially prominent.
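The ranking score and the leading-word share can be checked numerically with a short sketch. The function names are illustrative, and `sim` is an assumed callback for the word similarity measure; only the weight w(j, m) = (m + 1 − j)² and the two formulas come from the text. Returning 0.0 for an image without annotations is an assumption.

```python
def position_weight(j, m):
    """w(j, m) = (m + 1 - j)^2 for the j-th of m annotation words (1-based)."""
    return (m + 1 - j) ** 2


def ranking_score(keywords, tags, sim):
    """Weighted similarity between query keywords and an annotation sequence."""
    m = len(tags)
    if m == 0:
        return 0.0  # assumption: an image without annotations scores zero
    total_weight = sum(position_weight(j, m) for j in range(1, m + 1))
    weighted = sum(
        position_weight(j, m) * sim(keyword, tag)
        for keyword in keywords
        for j, tag in enumerate(tags, start=1)
    )
    return weighted / total_weight


def leading_share(m):
    """Closed form for the share of the first annotation word's weight."""
    return 6 * m / ((m + 1) * (2 * m + 1))


def exact_match(a, b):
    """Toy similarity used only for illustration."""
    return 1.0 if a == b else 0.0
```

With `exact_match`, an image annotated ["tiger", "grass"] scores 4/5 for the query ["tiger"], while ["grass", "tiger"] scores 1/5, which shows how the position weight favors annotation words listed first.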

One point to note is that the score computation contains a great deal of repeated work. The query keywords take part in every word similarity computation, the annotation sequence of every returned image contains at least one query keyword (otherwise the image would not have been returned), and the result images share many identical annotation terms. The number of semantic computations that are actually necessary is therefore far smaller than the number of times the semantic computation is invoked. A similarity result cache of suitable size that records semantic computation results speeds up processing considerably. In addition, when an image search engine paginates a large result set, it typically repeats the search each time a page of results is accessed. If the scores of some images are cached, a large amount of computation is avoided when the user moves between result pages. Optimizing the original scoring algorithm with a multi-level cache, from the similarity (semantic distance) result cache through the result document cache down to the cache of the underlying synonym set reading module, saves a large amount of processing time.

The scoring algorithm as defined yields a score that integrates the semantic similarity results exactly, but in practice approximate results do the job equally well. Only a good ranking is needed; the specific score values do not matter. In the score computation, the annotation sequence of every image contains annotation words that match the query keywords, and also annotation words whose semantic similarity to the query keywords is very small (that is, whose semantic distance is very large). Each annotation word also carries a position weight. Whether a late-ranked annotation word with very low similarity influences the final list one-twentieth or one-fiftieth as much as an early-ranked word that matches a query keyword clearly makes no difference. For computations that barely affect the final list, a single uniform value is therefore used instead of carrying out each computation in full, which greatly reduces the computational cost.

(2) Use the query expansion function to obtain the expansion set of the original query keyword sequence.

(3) Compute the semantic similarity of every pair consisting of a query keyword and one of its expansion terms, and store all results in the cache.

(4) Use the image search engine to obtain the retrieval results for the original query.

(5) Compute a score for every image in the retrieval results. If the required similarity is in the cache, use the stored result; otherwise treat the semantic distance as very large and use a single uniform constant instead of performing the semantic computation.

(6) Re-rank the images by their scores and return the final result list.

Because all semantic computation is done as preprocessing and the scoring in step (5) performs none, the total number of semantic computations is an order of magnitude lower than in the unoptimized scoring algorithm, which improves processing efficiency.
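The cached variant of the scoring step can be sketched as follows. This is a minimal illustration under assumptions: the function names, the (image id, annotation list) tuple layout, the default constant of 0.0 for uncached pairs, and the decision to also cache each keyword against the full keyword list are all hypothetical choices, not details given in the source.

```python
def build_similarity_cache(keywords, expanded_terms, sim):
    """Precompute similarities between keywords and candidate terms."""
    cache = {}
    # Assumption: the keywords themselves are cached too, so that an
    # annotation word equal to a keyword is found in the cache.
    candidates = list(keywords) + list(expanded_terms)
    for keyword in keywords:
        for term in candidates:
            cache[(keyword, term)] = sim(keyword, term)
    return cache


def rank_images(images, keywords, cache, default_sim=0.0):
    """Re-rank (image_id, tags) pairs using cached similarities only."""
    def score(tags):
        m = len(tags)
        if m == 0:
            return 0.0
        total = sum((m + 1 - j) ** 2 for j in range(1, m + 1))
        weighted = sum(
            # Cache miss: treat the pair as semantically distant and use
            # one uniform constant instead of calling the similarity measure.
            (m + 1 - j) ** 2 * cache.get((keyword, tag), default_sim)
            for keyword in keywords
            for j, tag in enumerate(tags, start=1)
        )
        return weighted / total

    # sorted() is stable, so images with equal scores keep engine order.
    return sorted(images, key=lambda item: score(item[1]), reverse=True)
```

The similarity measure is called only while the cache is built, so its cost grows with the number of keywords and expansion terms rather than with the number of returned images.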

Embodiment 2: Application example

Figure 2 walks through the steps of the algorithm framework described above on a concrete example. It shows the intermediate output of each algorithm module and the final retrieval results of the framework, to give an intuitive picture of the method.

Labels (1) and (2) are the original English query and Chinese query entered by the user. Labels (3) and (4) are the expansion sets for the original English and Chinese queries, obtained with the "query expansion term selection and optimization algorithm based on expansion rules" using the English semantic dictionary WordNet and the Chinese semantic dictionary HowNet, respectively. Labels (5) and (6) are the retrieval results that the image search engine returns for the expansion sets of the original English and Chinese queries. Labels (7) and (8) are the final retrieval results obtained from the initial retrieval lists for the English and Chinese queries by applying the "scoring and optimization algorithm for retrieval results", which incorporates the "WordNet-based semantic similarity measurement algorithm for English words" and the "HowNet-based semantic similarity measurement algorithm for Chinese words".

Claims

1. A text-based query expansion and ranking method in image retrieval, characterized by comprising the following steps:

(1) Preprocessing and pre-analysis

For the initial query, preprocessing performs word segmentation and punctuation tagging; pre-analysis then operates on the preprocessed query to perform stop-word tagging, part-of-speech analysis, and keyword extraction;

(2) Word semantic similarity measurement

For English words, semantic similarity is measured by computing semantic distance from path and depth in the network; for Chinese words, it is computed by jointly considering main-class sememe similarity, semantic expression similarity, and main-class sememe frame similarity, while incorporating the maximum matching rule and sememe depth information;

(3) Query expansion incorporating expansion rules

Based on the characteristics of the semantic network and incorporating the specific expansion rules established, semantic expansion is performed on the keyword sequence derived from the initial query;

(4) Score-based ranking of retrieval results

Taking the retrieval results returned by the search engine as the objects to be processed, the "closeness" between the query keyword sequence and the image description is assessed with the word semantic similarity measure to obtain a score, which is optimized by the scoring algorithm; the final score serves as the basis on which the search engine ranks the returned images.

2. The method according to claim 1, characterized in that, in the prototype of the English word semantic similarity measurement algorithm, an extended Lesk algorithm based on peer words is established to further enrich the semantic definition of a word, wherein a peer word is defined as a sibling node, in the WordNet hierarchy, of the synonym set to which a word belongs; that is, the synonym set and its corresponding peer words share a common parent node.

3. method according to claim 1, it is characterized in that, in the prototype of described Chinese terms semantic similarity metric algorithm, based on whole semantic formula, divide justice is former by level, adopt the method for maximum match, consider the former direct descriptive power of main classes justice separately for notion; Simultaneously, in metrics process, add the consideration of adopted former depth information, notion semantic similarity wherein is divided into following three parts and calculates:

Sim(C_{1}, C_{2}) = w_{1} P_{1} + w_{2} P_{2} + w_{3} P_{3}

Wherein, P_{1} is the similarity between the main-class sememes of the two concepts; P_{2} is the similarity between the whole semantic expressions; P_{3} is the similarity between the frameworks of the main-class sememes in the two DEFs; w_{1}, w_{2}, and w_{3} are the weights of the three partial similarities, and must satisfy the constraints w_{1} + w_{2} + w_{3} = 1, w_{2} > w_{1}, and w_{2} > w_{3}.
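The weighted combination and its constraints can be sketched as below. The default weight values (0.3, 0.5, 0.2) are purely illustrative assumptions chosen to satisfy the stated constraints; the claim does not give concrete values, and the function name is hypothetical.

```python
def concept_similarity(p1, p2, p3, w1=0.3, w2=0.5, w3=0.2):
    """Combine the three partial similarities as w1*P1 + w2*P2 + w3*P3.

    The default weights are illustrative assumptions, not values from the source.
    """
    # Constraint: the weights must sum to 1 (checked with a small tolerance).
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Constraint: w2 must exceed both w1 and w3.
    if not (w2 > w1 and w2 > w3):
        raise ValueError("w2 must be greater than w1 and w3")
    return w1 * p1 + w2 * p2 + w3 * p3
```

With these assumed weights, partial similarities of 1.0, 0.5, and 0.0 combine to 0.3 + 0.25 + 0 = 0.55.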

4. The method according to claim 1, characterized in that the algorithm steps of the query expansion with fused expansion rules are described by the following pseudocode:

(1) Obtain the input: the original query keyword sequence;

(2) Select one keyword item from it;

(3) If it is an English keyword item, look it up in the WordNet semantic network file to obtain its synonym set, Synset;

If it is a Chinese keyword item, look it up in the HowNet semantic network file to obtain its semantic definition, DEF;

(4) Based on the expansion rules, for each Synset of an English keyword item, find the corresponding near-synonym set according to the part relation, sibling relation, and child relation in the semantic network hierarchy, and use it as the expansion word set; for each DEF of a Chinese keyword item, perform the expansion by direct matching;

(5) Based on the post-expansion processing strategy, filter the expansion word set according to the annotation set information of the image library to obtain the final, optimized expansion word set;

(6) Repeat (2) to (5), and merge the expansion word sets obtained for each keyword item in the original query to form the expanded query expression corresponding to the original query.
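Steps (1) to (6) above can be sketched in runnable form as follows. Everything named here is a hypothetical stand-in: `ENGLISH_RELATED` and `CHINESE_DEF_MATCHES` replace the real WordNet and HowNet lookups, `ANNOTATION_VOCABULARY` replaces the image-library annotation set, and the language test via `str.isascii` is a simplification. Keeping the original keyword in the expanded query is also an assumption.

```python
# Hypothetical stand-ins for the WordNet / HowNet lookups and the image-library
# annotation vocabulary; none of this is real dictionary data.
ENGLISH_RELATED = {
    "car": {"part": ["wheel"], "sibling": ["truck"], "child": ["taxi"]},
}
CHINESE_DEF_MATCHES = {
    "汽车": ["轿车", "卡车"],
}
ANNOTATION_VOCABULARY = {"car", "truck", "taxi", "汽车", "轿车"}


def is_english(term):
    """Crude language test (an assumption): ASCII-only terms count as English."""
    return term.isascii()


def expand_term(term):
    """Steps (3)-(4): collect candidate expansion words for one keyword."""
    if is_english(term):
        relations = ENGLISH_RELATED.get(term, {})
        return [w for rel in ("part", "sibling", "child") for w in relations.get(rel, [])]
    return list(CHINESE_DEF_MATCHES.get(term, []))


def expand_query(keywords):
    """Steps (1)-(6): expand each keyword, filter, and merge into one query."""
    expanded = []
    for term in keywords:                       # step (2)
        candidates = expand_term(term)          # steps (3)-(4)
        # Step (5): keep only words that occur in the annotation vocabulary.
        kept = [w for w in candidates if w in ANNOTATION_VOCABULARY]
        # Step (6): merge in order, without duplicates; the original keyword
        # is retained (an assumption of this sketch).
        for word in [term] + kept:
            if word not in expanded:
                expanded.append(word)
    return expanded
```

For the toy data, `expand_query(["car"])` drops `wheel` in the filtering step because it does not occur in the assumed annotation vocabulary.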

5. The method according to claim 1, characterized in that, in the prototype of the score-based retrieval result ranking algorithm, a weight is attached to the computed result for each annotation word in the scoring algorithm, so as to emphasize the possibly "salient" object of the image; the ranking score of an image is computed with the following formula:

Score = \frac{Σ_{i = 1}^{n} Σ_{j = 1}^{m} w (j, m) Sim (k_{i}, t_{j})}{Σ_{j = 1}^{m} w (j, m)}

Wherein, k_{i} is the i-th keyword of the keyword sequence; t_{j} is the j-th annotation word of the image annotation word sequence; Sim(k_{i}, t_{j}) computes the semantic similarity between the two terms k_{i} and t_{j}; w(j, m) is the associated weight, with w(j, m) = (m + 1 - j)^{2}, used to emphasize the context of an annotation term within the annotation sequence; n and m are the numbers of terms contained in the query keyword sequence and the image annotation word sequence, respectively; the weight of the first annotation word in the image annotation word sequence is m^{2}, so its proportion of the total weight is:

\frac{m^{2}}{Σ_{j = 1}^{m} w (j, m)} = \frac{m^{2}}{Σ_{j = 1}^{m} j^{2}} = \frac{6 m^{2}}{m (m + 1) (2 m + 1)} = \frac{6 m}{(m + 1) (2 m + 1)}

This proportion is a decreasing function of m: as the image annotation word sequence grows longer, the weight share of the leading annotation word shrinks, approximately as 3/m for large m.
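The ranking score and the weight share of the leading annotation word can be checked numerically with a short sketch. The function names are hypothetical, and `exact_match` is only a placeholder for the semantic similarity Sim(k_{i}, t_{j}); a real system would plug in the WordNet- or HowNet-based measure instead.

```python
def position_weight(j, m):
    """w(j, m) = (m + 1 - j)^2 for the j-th of m annotation words (1-based)."""
    return (m + 1 - j) ** 2


def ranking_score(keywords, annotations, sim):
    """Weighted similarity sum normalised by the total annotation weight."""
    m = len(annotations)
    if m == 0:
        return 0.0
    total_weight = sum(position_weight(j, m) for j in range(1, m + 1))
    weighted = sum(
        position_weight(j, m) * sim(k, t)
        for k in keywords
        for j, t in enumerate(annotations, start=1)
    )
    return weighted / total_weight


def leading_share(m):
    """Share of the total weight held by the first annotation word: 6m / ((m+1)(2m+1))."""
    return 6 * m / ((m + 1) * (2 * m + 1))


def exact_match(a, b):
    """Placeholder similarity (an assumption): 1.0 for identical terms, else 0.0."""
    return 1.0 if a == b else 0.0
```

For an image annotated `["dog", "grass"]`, the query `["dog"]` scores 4/5 under this placeholder similarity, whereas the same image annotated `["grass", "dog"]` scores only 1/5, which shows how earlier annotation words dominate the score.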