
CN103218362B - Method and system for constructing a domain ontology - Google Patents


Info

Publication number
CN103218362B
Authority
CN
China
Prior art keywords
ontology
keyword
keywords
sequence
tree
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210017772.7A
Other languages
Chinese (zh)
Other versions
CN103218362A (en)
Inventor
董振江
吉锋
罗圣美
程龚
瞿裕忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
ZTE Corp
Original Assignee
Nanjing University
ZTE Corp
Application filed by Nanjing University, ZTE Corp
Priority to CN201210017772.7A
Publication of CN103218362A
Application granted
Publication of CN103218362B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for constructing a domain ontology, comprising: listing the names of all terms that the target ontology needs to describe, forming a keyword set W0; sorting all keywords in W0 to form a keyword sequence S0; creating a set O of ontologies to be reused, submitting all keywords of each contiguous subsequence extracted from S0 to an ontology retrieval system, and adding the top-ranked ontology in the retrieval results to O; and taking the set union of all ontologies in O to form a new ontology o. The invention also provides a system for constructing a domain ontology. The technical solution provides a method for building keyword queries for ontology retrieval that is well defined and practical to carry out, and that can achieve a high ontology reuse rate.

Description

Method and system for constructing a domain ontology

Technical field

The invention relates to information system modeling and knowledge engineering, and in particular to a method and system for constructing a domain ontology based on ontology reuse.

Background

Tom Gruber defines an ontology as an explicit specification of a conceptualization, built to be shared. A conceptualization is a model of the abstract concepts, concrete objects, object attributes, and relationships between objects in a domain or scope; an ontology expresses a conceptualization explicitly as a specification so that multiple parties can share it. In an ontology, these concepts, relationships, and so on are collectively called terms, and an ontology can be regarded as a set of descriptions of terms, called axioms. Nicola Guarino classifies ontologies into top-level ontologies, domain ontologies, task ontologies, and application ontologies. A top-level ontology describes general concepts (such as space and time); domain and task ontologies describe a generic domain (such as SLR cameras) and a generic task (such as camera sales), respectively; an application ontology describes the specific scope of a particular application (such as a particular website selling SLR cameras). Top-level ontologies are usually fairly stable, and application ontologies have little value for sharing, so domain and task ontologies are the most actively constructed, and methods for constructing them matter most.

Existing methods for constructing domain ontologies fall into two categories: manual construction and semi-automatic construction. Manual construction is typified by the ontology description capture method (IDEF5, Integrated Definition for Ontology Description Capture Method), which divides ontology construction into five steps: establishing the goals and the team, collecting raw material, analyzing the material, building an initial ontology, and refining and validating the ontology. Every step is carried out by hand. Semi-automatic construction, also called ontology learning, uses a computer program to extract terms that denote concepts, relationships between concepts, and so on from text automatically, producing a preliminary ontology that is then refined and validated by hand. However, the preliminary ontologies that programs currently produce are usually of poor quality and do not effectively reduce the reliance on manual work, so manual construction remains the mainstream approach.

One way to make manual construction of a domain ontology more efficient is to reuse existing ontologies: an existing ontology for the same or a similar domain is adapted to the new requirements to become a new ontology, which costs less than developing one from scratch. However, there are very few ways to find ontologies suitable for reuse among the large number that exist. One main approach today is to browse the ontologies in an online ontology library one by one (for example, the DARPA Agent Markup Language (DAML, Defense Advanced Research Projects Agency Agent Markup Language) ontology library), which is inefficient. A newer approach is ontology retrieval: query keywords are submitted to an ontology retrieval system (such as the Swoogle search engine), and only the ontologies that match the query keywords are retrieved and browsed, which improves efficiency. However, no well-defined method has yet emerged to guide this retrieval process, and in particular to guide how queries are built. Another way to speed up manual construction is to have several people build the ontology collaboratively; the difficulty with this approach lies in detecting and resolving conflicts among the results produced by different people.

Although a domain ontology is a concept-level model and is independent of natural language, its terms still need to be named with natural-language words so that people can understand them when using it. Term names are therefore an important part of a domain ontology. Because natural language is diverse, one term may correspond to several synonymous natural-language words (for example, "SLR camera" and "single-lens reflex camera"), so an important task in constructing a domain ontology is to obtain the synonyms of each term name as completely as possible.

Existing methods for obtaining synonyms mainly rely on thesauri compiled by linguists (such as WordNet). Such thesauri are highly precise but limited in coverage, and few are available in a form that computer programs can process easily; Chinese thesauri are fewer still. Obtaining synonyms for Chinese term names during domain ontology construction is therefore very difficult. It usually depends on the experience of the builder (that is, the domain expert), and its quality is hard to guarantee, especially its recall (that is, completeness).

Another way to obtain synonyms is to draw on the collective intelligence of the public through the user query logs of a search engine. The basic idea is that if two keywords both appear frequently in user queries, and users often open the same web pages from the results of those queries, the two keywords are taken to be synonyms. The main weakness of this method is its low precision (that is, correctness). A web page may cover several different topics, each corresponding to keywords that are not synonymous with one another, so the fact that users open the same page from different query keywords does not mean that those keywords are synonyms.

Summary of the invention

In view of this, the main purpose of the invention is to provide a method and system for constructing a domain ontology, including a method for building keyword queries for ontology retrieval that is well defined and practical to carry out and that can achieve a high ontology reuse rate.

To achieve this purpose, the technical solution of the invention is implemented as follows:

The invention provides a method for constructing a domain ontology, comprising:

listing the names of all terms that the target ontology needs to describe, forming a keyword set W0;

sorting all keywords in the keyword set W0 to form a keyword sequence S0;

creating a set O of ontologies to be reused, submitting all keywords of a contiguous subsequence extracted from the keyword sequence S0 to an ontology retrieval system, and adding the top-ranked ontology in the retrieval results to the ontology set O;

taking the set union of all ontologies in the ontology set O to form a new ontology o.

In the above method, the method further comprises: naming the terms described in the new ontology o, and obtaining synonyms based on the names of the terms described in the new ontology o.

In the above method, listing the names of all terms that the target ontology needs to describe, forming the keyword set W0, is:

for the target domain described by the target ontology, using keywords in a natural language LS to list the names of all terms that the target ontology needs to describe, forming a keyword set W0.

In the above method, sorting all keywords in the keyword set W0 to form the keyword sequence S0 is:

building a tree in which each node has a label and a processing flag;

determining whether the processing flag of every node in the tree is "processed"; if not, selecting a current node from the nodes in the tree whose processing flag is "unprocessed", the keyword set that labels the current node being the current set;

determining whether the current set contains only one keyword; when the current set contains more than one keyword, dividing the current set into two subsets, adding the more important subset WL to the tree as the left child of the current node, adding the other subset WR to the tree as the right child of the current node, and changing the processing flag of the current node to "processed"; otherwise, changing the processing flag of the current node to "processed"; then again determining whether the processing flag of every node in the tree is "processed", until the processing flag of every node in the tree is "processed", at which point the keyword sequence S0 is formed according to the depth-first traversal order of the nodes corresponding to the keywords in the keyword set W0.

In the above method, dividing the current set into two subsets is:

treating the keywords in the current set as the description of a domain or scope, and treating the keywords in the two subsets as the descriptions of two different sub-domains or sub-scopes of that domain or scope.

In the above method, submitting all keywords of a contiguous subsequence extracted from the keyword sequence S0 to the ontology retrieval system and adding the top-ranked ontology in the retrieval results to the ontology set O is:

creating the set O of ontologies to be reused, denoting the keyword sequence S0 as S, obtaining the longest subsequence SH among the contiguous prefix subsequences of S that satisfy a condition, and cutting SH off the front of S to obtain the remaining contiguous suffix subsequence ST;

determining whether SH is the empty sequence; if SH is empty, deleting the first keyword from ST; if SH is not empty, adding the top-ranked ontology in the retrieval results HITS(SH) to O;

determining whether ST is the empty sequence; if ST is not empty, denoting ST as S, again obtaining the longest subsequence SH among the contiguous prefix subsequences of S that satisfy the condition, and cutting SH off the front of S to obtain the remaining contiguous suffix subsequence ST; otherwise, if ST is empty, the procedure ends.

In the above method, the condition is that, when all keywords in the subsequence are combined into one query keyword group and the query keyword group is submitted to the ontology retrieval system, the retrieval results HITS(SH) are not empty.
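The prefix-based selection described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the `search` callback stands in for the ontology retrieval system, and its signature, the `TOY_INDEX` data, and the rule that an ontology matches only when it mentions every keyword in the group are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: takes a keyword group, returns ontology ids ranked best-first.
Search = Callable[[Sequence[str]], List[str]]


def select_ontologies(s0: Sequence[str], search: Search) -> List[str]:
    """Greedily collect the ontologies to reuse (the set O), in discovery order."""
    reused: List[str] = []
    s = list(s0)
    while s:  # stop once the remaining sequence is empty
        hits: List[str] = []
        head_len = 0
        # Try prefixes from longest to shortest; keep the first with non-empty HITS.
        for n in range(len(s), 0, -1):
            hits = search(s[:n])
            if hits:
                head_len = n
                break
        if head_len == 0:
            s = s[1:]  # SH is empty: drop the first keyword and try again
            continue
        if hits[0] not in reused:  # O is a set, so skip duplicates
            reused.append(hits[0])
        s = s[head_len:]  # the suffix ST becomes the new S
    return reused


# Toy stand-in for an ontology retrieval system (made-up data).
TOY_INDEX = {
    "o1": {"sensor", "pixel"},
    "o2": {"lens", "focal length", "aperture"},
    "o3": {"pixel"},
}


def toy_search(keywords: Sequence[str]) -> List[str]:
    # Assumption: an ontology matches only if it mentions every keyword.
    return sorted(name for name, terms in TOY_INDEX.items() if set(keywords) <= terms)
```

With the toy index, the sequence <"sensor", "pixel", "lens", "focal length", "aperture"> is covered by two queries. Because the longest matching prefix is tried first, keywords placed next to each other in S0 tend to be answered by a single ontology, which is why the ordering step matters.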

In the above method, taking the set union of all ontologies in the ontology set O to form the new ontology o is:

taking the set union of all ontologies in the ontology set O to form a new ontology o, and editing the new ontology o as required to describe the target domain;

the editing comprises at least adding terms and axioms, deleting terms and axioms, and modifying terms and axioms.

In the above method, naming the terms described in the new ontology o is: naming each term described in the new ontology o with a word from LS.

In the above method, obtaining synonyms based on the names of the terms described in the new ontology o is:

for the name t of each term described in the new ontology o, creating three keyword sets SYN, TRANS, and TS;

submitting t to a translation system from LS to another natural language LT, and adding all keywords in the translation results to the set TRANS;

for each keyword trans in the set TRANS, obtaining all synonyms of trans from a thesaurus of LT, and adding all the synonyms obtained to the set TS;

adding all keywords in the set TS to the set TRANS, and, for each keyword trans' in the set TRANS, submitting trans' to a translation system from LT to LS and adding all keywords in the translation results to the set SYN;

deleting from the set SYN all keywords that are not suitable as synonyms of t; the keywords remaining in SYN are the synonyms obtained for t.
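The round trip through a second language can be sketched as below. The sketch assumes the two translation systems and the LT thesaurus are available as plain lookup functions, and it models the final deletion step as a caller-supplied predicate `is_suitable`; all four callbacks, their signatures, and the toy tables (which use made-up `t_` tokens instead of a real language) are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set

Lookup = Callable[[str], Iterable[str]]


def acquire_synonyms(
    t: str,
    translate_s_to_t: Lookup,   # hypothetical LS -> LT translation system
    thesaurus_t: Lookup,        # hypothetical LT thesaurus
    translate_t_to_s: Lookup,   # hypothetical LT -> LS translation system
    is_suitable: Callable[[str, str], bool],  # stands in for the final filtering step
) -> Set[str]:
    trans: Set[str] = set(translate_s_to_t(t))   # TRANS: translations of t
    ts: Set[str] = set()
    for word in trans:                           # TS: LT synonyms of each translation
        ts |= set(thesaurus_t(word))
    trans |= ts                                  # add TS to TRANS
    syn: Set[str] = set()
    for word in trans:                           # SYN: back-translations into LS
        syn |= set(translate_t_to_s(word))
    return {w for w in syn if is_suitable(t, w)}  # drop unsuitable candidates


def table(data: Dict[str, List[str]]) -> Lookup:
    """Wrap a dict as a lookup that returns an empty list for unknown words."""
    return lambda word: data.get(word, [])


# Made-up toy tables; the "t_" prefix marks words of the second language LT.
s_to_t = table({"lens": ["t_lens"]})
thes_t = table({"t_lens": ["t_objective"]})
t_to_s = table({"t_lens": ["lens"], "t_objective": ["objective", "goal"]})

synonyms = acquire_synonyms(
    "lens", s_to_t, thes_t, t_to_s,
    is_suitable=lambda t, w: w != t and w != "goal",  # toy reviewer decision
)
```

In this toy run the back-translation yields {"lens", "objective", "goal"}; the predicate rejects the term itself and the off-topic "goal", leaving {"objective"}. Whether the term itself should be kept in SYN is not stated in the text, so excluding it here is a choice of the example.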

The invention also provides a system for constructing a domain ontology, comprising a listing unit, a sorting unit, an adding unit, and a union processing unit, wherein:

the listing unit is configured to list the names of all terms that the target ontology needs to describe, forming a keyword set W0;

the sorting unit is configured to sort all keywords in the keyword set W0 to form a keyword sequence S0;

the adding unit is configured to create a set O of ontologies to be reused, submit all keywords of a contiguous subsequence extracted from the keyword sequence S0 to an ontology retrieval system, and add the top-ranked ontology in the retrieval results to the ontology set O;

the union processing unit is configured to take the set union of all ontologies in the ontology set O to form a new ontology o.

In the above system, the system further comprises:

a naming unit, configured to name the terms described in the new ontology o;

an obtaining unit, configured to obtain synonyms based on the names of the terms described in the new ontology o.

The method and system provided by the invention list the names of all terms that the target ontology needs to describe, forming a keyword set W0; sort all keywords in W0 to form a keyword sequence S0; create a set O of ontologies to be reused, submit all keywords of a contiguous subsequence extracted from S0 to an ontology retrieval system, and add the top-ranked ontology in the retrieval results to O; and take the set union of all ontologies in O to form a new ontology o. The invention thus provides a method for building keyword queries for ontology retrieval in which retrieving a small number of ontologies covers a large number of important keywords; the method is well defined and practical to carry out, and can achieve a high ontology reuse rate. In addition, building on this method, the invention can name the terms described in the new ontology o and obtain synonyms based on those names. It thus provides a synonym acquisition method that, by using thesauri of natural languages, is widely applicable and can achieve high precision and recall.

Description of the drawings

Fig. 1 is a flowchart of the method for constructing a domain ontology according to the invention;

Fig. 2 is a flowchart of a specific method for implementing step 102 according to the invention;

Fig. 3 is a flowchart of Embodiment 1 of the specific method for implementing step 102 according to the invention;

Fig. 4 is an example of the binary tree data structure used in the invention;

Fig. 5 is a flowchart of a specific method for implementing step 103 according to the invention;

Fig. 6 is a flowchart of Embodiment 1 of the specific method for implementing step 103 according to the invention;

Fig. 7 is a flowchart of a specific method for implementing step 106 according to the invention;

Fig. 8 is a structural diagram of the system for constructing a domain ontology according to the invention.

Detailed description

The basic idea of the invention is: list the names of all terms that the target ontology needs to describe, forming a keyword set W0; sort all keywords in W0 to form a keyword sequence S0; create a set O of ontologies to be reused, submit all keywords of a contiguous subsequence extracted from S0 to an ontology retrieval system, and add the top-ranked ontology in the retrieval results to O; take the set union of all ontologies in O to form a new ontology o.

The invention is described in further detail below with reference to the drawings and specific embodiments.

The invention provides a method for constructing a domain ontology. Fig. 1 is a flowchart of the method. As shown in Fig. 1, the method comprises the following steps:

Step 101: list the names of all terms that the target ontology needs to describe, forming a keyword set W0.

Specifically, for the domain (the target domain) described by the ontology to be constructed (the target ontology), for example the SLR camera domain, keywords in a natural language LS are used to list the names of all terms that the target ontology needs to describe, forming a keyword set W0. For example, LS = Chinese and W0 = {"lens", "pixel", "aperture", "focal length", "sensor"}.

Step 102: sort all keywords in the keyword set W0 to form a keyword sequence S0.

Step 103: create a set O of ontologies to be reused, extract contiguous subsequences from the keyword sequence S0, submit all keywords of each subsequence to the ontology retrieval system, and add the top-ranked ontology in the retrieval results to the ontology set O.

Step 104: take the set union of all ontologies in the ontology set O to form a new ontology o.

Specifically, the set union is taken over all ontologies in O (namely o1 and o2, each ontology being regarded as a set of axioms) to form a new ontology o. The new ontology o is then edited as required to describe the target domain (for example, the SLR camera domain); the editing includes adding terms and axioms, deleting terms and axioms, modifying terms and axioms, and so on.
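Treating each ontology as a set of axioms makes the merge a plain set union. The sketch below assumes, purely for illustration, that an axiom is encoded as a (subject, relation, object) tuple; the encoding and the sample axioms are made up and are not taken from the patent.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Axiom = Tuple[str, str, str]  # hypothetical (subject, relation, object) encoding


def merge_ontologies(ontologies: Iterable[Set[Axiom]]) -> FrozenSet[Axiom]:
    """Return the set union of the axioms of all ontologies in O."""
    merged: Set[Axiom] = set()
    for axioms in ontologies:
        merged |= axioms  # shared axioms collapse into one entry
    return frozenset(merged)


# Made-up sample ontologies for the SLR camera domain.
o1 = {("Sensor", "partOf", "Camera"), ("Pixel", "partOf", "Sensor")}
o2 = {("Lens", "partOf", "Camera"), ("Sensor", "partOf", "Camera")}

o = merge_ontologies([o1, o2])

# Editing afterwards is ordinary set manipulation, for example adding an axiom:
edited = set(o) | {("Aperture", "partOf", "Lens")}
```

The axiom shared by `o1` and `o2` appears once in `o`, so the merged ontology holds three axioms. Real ontologies would also need their term identifiers reconciled before a union is meaningful, which this sketch does not attempt.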

Step 105: name the terms described in the new ontology o.

Specifically, each term described in the new ontology o is named with a word from LS; for example, one term in the new ontology o is named "lens".

Step 106: obtain synonyms based on the names of the terms described in the new ontology o.

Fig. 2 is a flowchart of a specific method for implementing step 102 according to the invention. As shown in Fig. 2, the method comprises the following steps:

Step 201: build a tree in which each node has a label and a processing flag.

Specifically, a binary tree data structure (referred to as the tree) is built, in which each node carries a label and a processing flag. Initially the tree contains a single node whose label is W0 and whose processing flag is "unprocessed".

Step 202: determine whether the processing flag of every node in the tree is "processed". If so, go to step 207; otherwise, go to step 203.

Step 203: pick any node among the nodes in the tree whose processing flag is "unprocessed"; this node is the current node, and the keyword set that labels it is the current set.

Step 204: determine whether the current set contains only one keyword. If so, go to step 206; otherwise, go to step 205.

Step 205: divide the current set into two subsets; add the more important subset WL to the tree as the left child of the current node, and add the other subset WR to the tree as the right child of the current node.

Specifically, the current set is divided into two subsets according to the following principle: the keywords in the current set are treated as the description of a domain or scope, and the keywords in the two subsets are treated as the descriptions of two different sub-domains or sub-scopes of that domain or scope.

The importance of the two subsets for describing the target domain is then evaluated. The more important subset WL is added to the tree as the left child of the current node, with label WL and processing flag "unprocessed"; the other subset WR is added to the tree as the right child of the current node, with label WR and processing flag "unprocessed".

Step 206: change the processing flag of the current node to "processed", then go to step 202.

Step 207: each keyword w in the keyword set W0 corresponds to one node in the tree that satisfies the following condition: the keyword set that labels the node contains w and nothing else. Based on this correspondence between the keywords in W0 and the labels of nodes in the tree, a keyword sequence S0 is formed according to the depth-first traversal order of the nodes corresponding to the keywords in W0.
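Steps 201 to 207 can be sketched as a worklist over unprocessed nodes followed by a left-first depth-first walk. In the text the division into a more important and a less important subset is a judgment about the target domain, so the sketch takes it as a caller-supplied `split` function; that callback, the `Node` class, and the function names are assumptions of this example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple


@dataclass
class Node:
    label: FrozenSet[str]            # the keyword set that labels the node
    processed: bool = False          # the processing flag
    left: Optional["Node"] = None    # more important subset WL
    right: Optional["Node"] = None   # other subset WR


# Hypothetical callback: returns (more important subset, other subset).
Split = Callable[[FrozenSet[str]], Tuple[Iterable[str], Iterable[str]]]


def build_tree(w0: Iterable[str], split: Split) -> Node:
    root = Node(frozenset(w0))                       # step 201
    unprocessed = [root]
    while unprocessed:                               # step 202
        node = unprocessed.pop()                     # step 203: any unprocessed node
        if len(node.label) > 1:                      # step 204
            wl, wr = (frozenset(part) for part in split(node.label))  # step 205
            if not wl or not wr or wl & wr or (wl | wr) != node.label:
                raise ValueError("split must return two non-empty disjoint subsets")
            node.left, node.right = Node(wl), Node(wr)
            unprocessed += [node.left, node.right]
        node.processed = True                        # step 206
    return root


def keyword_sequence(root: Node) -> List[str]:
    """Step 207: read the single-keyword nodes in depth-first, left-first order."""
    order: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.left is None or node.right is None:
            order.extend(node.label)                 # a leaf holds exactly one keyword
        else:
            stack.append(node.right)                 # pushed first, visited second
            stack.append(node.left)
    return order
```

As a usage example, `keyword_sequence(build_tree({"c", "a", "b"}, lambda s: ({min(s)}, s - {min(s)})))` splits off the alphabetically smallest keyword at each level and yields `["a", "b", "c"]`. The order in which unprocessed nodes are picked does not affect the result, because the sequence depends only on the final tree shape.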

Fig. 3 is a flowchart of Embodiment 1 of the specific method for implementing step 102 according to the invention. As shown in Fig. 3, the method comprises the following steps:

Step 301: build a tree in which each node has a label and a processing flag.

Specifically, a binary tree data structure (referred to as the tree) is built, in which each node carries a label and a processing flag. Initially the tree contains a single node, for example node A in Fig. 4, whose label is W0 = {"lens", "pixel", "aperture", "focal length", "sensor"} and whose processing flag is "unprocessed".

Step 302: select a current node from the nodes in the tree whose processing flag is "unprocessed"; the keyword set that labels the current node is the current set.

Specifically, because the tree contains nodes whose processing flag is "unprocessed", one of them is selected at random and called the current node, and the keyword set that labels it is called the current set. For example, for node A in Fig. 4 the current set is {"lens", "pixel", "aperture", "focal length", "sensor"}.

Step 303: if the current set contains more than one keyword, divide it into two subsets, for example {"pixel", "sensor"} and {"lens", "aperture", "focal length"}.

步骤304,将两个子集中的最重要子集WL作为当前节点的左子节点添加到树中,将两个子集中的另一个子集WR作为当前节点的右子节点添加到树中;Step 304, adding the most important subset W L of the two subsets to the tree as the left child node of the current node, and adding another subset W R of the two subsets to the tree as the right child node of the current node;

具体的,评价上述两个子集对于描述目标领域(例如单反相机领域)的重要性,两个子集中的最重要的子集WL,例如WL={“像素”,“传感器”},作为当前节点(如图4所示的节点A)的左子节点(如图4所示的节点B)添加到树中,该最重要的子集WL的标签是WL,WL={“像素”,“传感器”},该最重要的子集WL的处理标记是“未处理”;两个子集中的另一个子集WR,例如WR={“镜头”,“光圈”,“焦距”},作为当前节点(如图4所示的节点A)的右子节点(如图4所示的节点C)添加到树中,该子集的标签是WR,WR={“镜头”,“光圈”,“焦距”},子集WR的处理标记是“未处理”。Specifically, evaluate the importance of the above two subsets for describing the target field (such as the SLR camera field), the most important subset W L of the two subsets, for example W L = {"pixel", "sensor"}, as the current The left child node (node B as shown in Figure 4) of a node (node A as shown in Figure 4) is added in the tree, and the label of this most important subset W L is W L , W L = {" pixel ", "sensor"}, the processing mark of the most important subset W L is "unprocessed"; another subset W R of the two subsets, for example W R = {"lens", "aperture", "focal length "}, as the right child node (node C as shown in Figure 4) of the current node (node A as shown in Figure 4) is added to the tree, the label of this subset is W R , W R = {" lens ", "aperture", "focal length"}, the processing flag of the subset W R is "unprocessed".

Step 305: Change the processing flag of the current node (node A in Fig. 4) to "processed". The remaining nodes are handled in the same way: as shown in Fig. 4, nodes D, E, F, G, H and I are added to the tree in turn, and in the course of adding them the processing flags of nodes B and C are changed to "processed".

Step 306: If the tree still contains nodes whose processing flag is "unprocessed", select one of them at random. This node is called the current node, and the keyword set in its label is called the current set. For example, for node D in Fig. 4, the current set is {"sensor"}.

Step 307: If the current set contains only one keyword, change the processing flag of the current node (for example, node D in Fig. 4) to "processed". The processing flags of nodes E, F, H and I are changed to "processed" in the same way.

Step 308: When the processing flags of all nodes in the tree are "processed", form the keyword sequence S0 according to the depth-first traversal order of the nodes that correspond to the keywords in the keyword set W0.

Specifically, once the processing flags of all nodes in the tree are "processed", each keyword in W0 is matched to the node whose label contains only that keyword: "lens" corresponds to node F, "pixel" to node E, "aperture" to node I, "focal length" to node H, and "sensor" to node D. The keywords in W0 are then sorted by the depth-first traversal order of their corresponding nodes (D, E, F, H, I in Fig. 4), which yields the keyword sequence S0 = <"sensor", "pixel", "lens", "focal length", "aperture">.
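Steps 301 to 308 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Node`, `order_keywords`, `fs` and `FIG4_SPLITS` are hypothetical, and the `split` callback stands in for the domain-importance judgment of steps 303 and 304, which the description leaves to whoever builds the ontology. The sketch takes pending nodes from a stack instead of choosing one at random; the resulting tree is the same, because each node is split using only its own label. The split table is chosen so that it reproduces the sequence S0 given for Fig. 4.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple

Keywords = FrozenSet[str]
# Hypothetical callback: divides a set of two or more keywords into
# (more important subset WL, other subset WR); both must be non-empty.
Splitter = Callable[[Keywords], Tuple[Keywords, Keywords]]


@dataclass
class Node:
    label: Keywords                    # keyword set carried by the node
    processed: bool = False            # the "processing flag"
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def order_keywords(w0: Iterable[str], split: Splitter) -> List[str]:
    """Build the binary tree over W0 and return the keyword sequence S0."""
    root = Node(frozenset(w0))
    pending = [root]                   # nodes still flagged "unprocessed"
    while pending:
        node = pending.pop()
        if len(node.label) > 1:        # steps 303-304: split and add children
            more_important, other = split(node.label)
            if not more_important or not other:
                raise ValueError("split must return two non-empty subsets")
            node.left, node.right = Node(more_important), Node(other)
            pending.extend([node.left, node.right])
        node.processed = True          # steps 305 and 307

    sequence: List[str] = []

    def visit(node: Node) -> None:     # step 308: depth-first, left before right
        if node.left is None or node.right is None:
            sequence.extend(node.label)    # a leaf holds at most one keyword
        else:
            visit(node.left)
            visit(node.right)

    visit(root)
    return sequence


def fs(*words: str) -> Keywords:
    return frozenset(words)


# Hypothetical split decisions that mirror nodes A, B, C and G of Fig. 4.
FIG4_SPLITS = {
    fs("lens", "pixel", "aperture", "focal length", "sensor"):
        (fs("pixel", "sensor"), fs("lens", "aperture", "focal length")),
    fs("pixel", "sensor"): (fs("sensor"), fs("pixel")),
    fs("lens", "aperture", "focal length"):
        (fs("lens"), fs("focal length", "aperture")),
    fs("focal length", "aperture"): (fs("focal length"), fs("aperture")),
}
```

Calling `order_keywords(["lens", "pixel", "aperture", "focal length", "sensor"], FIG4_SPLITS.__getitem__)` returns `['sensor', 'pixel', 'lens', 'focal length', 'aperture']`, the same order as S0 in the example.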

Fig. 5 is a flowchart of a specific method for implementing step 103 of the present invention. As shown in Fig. 5, the method includes the following steps:

Step 501: Create a set O of ontologies to be reused. Initially, O is empty.

Step 502: Denote the keyword sequence S0 as S.

Step 503: Obtain SH, the longest of the prefix contiguous subsequences of S that satisfy the following condition: when all keywords in the subsequence are combined into one query keyword group and that group is submitted to the ontology retrieval system, the retrieval result is not empty. The retrieval result for SH is denoted HITS(SH). Then cut SH off the front of S; the remaining suffix contiguous subsequence is ST.

Step 504: Determine whether SH is an empty sequence. If SH is empty (that is, no prefix contiguous subsequence of S satisfied the condition in step 503), delete the first keyword from ST; otherwise, add the highest-ranked ontology in HITS(SH) to O.

Step 505: Determine whether ST is an empty sequence. If ST is not empty (that is, SH in step 503 was only a proper subsequence of S), denote ST as S and return to step 503; otherwise, the process ends.
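The loop in steps 501 to 505 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: `collect_ontologies` and the `search` callback are hypothetical names, the callback is assumed to return ontology identifiers ranked best first (an empty list when nothing matches), and joining keywords with a space is an assumed query format that the description does not specify.

```python
from typing import Callable, List, Sequence

# Hypothetical retrieval interface: one query string in, ranked ontology
# identifiers out (best first); an empty list means "no results".
Search = Callable[[str], List[str]]


def collect_ontologies(s0: Sequence[str], search: Search) -> List[str]:
    """Return the ontologies to reuse (set O), in the order they were found."""
    reuse: List[str] = []                 # step 501: O starts empty
    s = list(s0)                          # step 502: S = S0
    while s:                              # step 505: stop when nothing is left
        hits: List[str] = []
        cut = 0                           # length of SH; 0 means SH is empty
        for length in range(len(s), 0, -1):   # step 503: longest prefix first
            hits = search(" ".join(s[:length]))
            if hits:
                cut = length
                break
        if cut == 0:
            s = s[1:]                     # step 504: drop the first keyword
        else:
            if hits[0] not in reuse:      # O is a set, so skip duplicates
                reuse.append(hits[0])     # step 504: keep the top-ranked hit
            s = s[cut:]                   # ST becomes the new S
    return reuse
```

In the worst case the sketch issues a number of queries quadratic in the length of S0, because every failed prefix costs one query. Each pass removes at least one keyword from S, so the loop always terminates.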

Fig. 6 is a flowchart of Embodiment 1 of the specific method for implementing step 103 of the present invention. As shown in Fig. 6, the method includes the following steps:

Step 601: Create a set O of ontologies to be reused. Initially, O is empty.

Step 602: Denote the keyword sequence S0 as S, that is, S = <"sensor", "pixel", "lens", "focal length", "aperture">.

Step 603: Obtain SH, the longest of the prefix contiguous subsequences of S that satisfy the condition, and cut SH off the front of S to obtain the remaining suffix contiguous subsequence ST.

Specifically, obtain SH, the longest of the prefix contiguous subsequences of S that satisfy the following condition: when all keywords in the subsequence are combined into one query keyword group and that group is submitted to the ontology retrieval system (for example, Swoogle), the retrieval result is not empty. The retrieval result for SH is denoted HITS(SH).

For example, submitting "sensor pixel lens focal length aperture", "sensor pixel lens focal length" and "sensor pixel lens" to Swoogle each returns an empty result, whereas submitting "sensor pixel" returns a non-empty result, so SH = <"sensor", "pixel">. Cutting SH off the front of S leaves the suffix contiguous subsequence ST = <"lens", "focal length", "aperture">.

Step 604: Because SH is not an empty sequence, add the highest-ranked ontology o1 in the retrieval result HITS(SH) to the ontology set O.

Step 605: Because ST is not an empty sequence, denote the remaining suffix contiguous subsequence ST as S.

Step 606: Obtain SH, the longest of the prefix contiguous subsequences of S that satisfy the condition, and cut SH off the front of S to obtain the remaining suffix contiguous subsequence ST.

Specifically, SH is obtained under the same condition as in step 603: the query keyword group formed from all keywords in the subsequence must return a non-empty result, HITS(SH), when submitted to the ontology retrieval system (for example, Swoogle).

For example, submitting "lens focal length aperture", "lens focal length" and "lens" to Swoogle each returns an empty result, so SH is an empty sequence. Cutting SH off the front of S therefore removes nothing, and the remaining suffix contiguous subsequence is ST = <"lens", "focal length", "aperture">.

Step 607: Because SH is an empty sequence, delete the first keyword, here "lens", from ST, which gives ST = <"focal length", "aperture">.

Step 608: Because ST is not an empty sequence, denote ST as S.

Step 609: Obtain SH, the longest of the prefix contiguous subsequences of S that satisfy the condition, and cut SH off the front of S to obtain the remaining suffix contiguous subsequence ST.

Specifically, SH is obtained under the same condition as in step 603: the query keyword group formed from all keywords in the subsequence must return a non-empty result, HITS(SH), when submitted to the ontology retrieval system (for example, Swoogle).

For example, submitting "focal length aperture" to Swoogle returns a non-empty result, so SH = <"focal length", "aperture">. Cutting SH off the front of S leaves ST as an empty sequence.

Step 610: Because SH is not an empty sequence, add the highest-ranked ontology o2 in HITS(SH) to the ontology set O.

Step 611: Because ST is an empty sequence, the process ends, and the final ontology set is O = {o1, o2}.

Fig. 7 is a flowchart of a specific method for implementing step 106 of the present invention. As shown in Fig. 7, the method includes the following steps:

Step 701: For the name t of each term described in the new ontology o, for example t = "镜头" (lens), create three keyword sets, denoted SYN, TRANS and TS. Initially, all three are empty.

Step 702: Submit t (for example, "镜头") to a translation system from LS to another natural language LT, for example LT = English with Google Translate as the translation system, and add all keywords in the translation result, for example "shot", "camera lens" and "camera shot", to the set TRANS, so that TRANS = {"shot", "camera lens", "camera shot"}.

Step 703: For each keyword trans in the set TRANS, for example trans = "camera lens", obtain all synonyms of trans from a thesaurus of LT (for example, WordNet), such as "optical lens", and add them to the set TS. The other keywords are handled in the same way: the synonyms of "shot" include "guess" and "snap", and "camera shot" has no synonyms, so TS = {"guess", "snap", "optical lens"}.

Step 704: Add all keywords in the set TS to the set TRANS, so that TRANS = {"shot", "camera lens", "camera shot", "guess", "snap", "optical lens"}.

Step 705: For each keyword trans' in the set TRANS, for example trans' = "optical lens", submit trans' to a translation system from LT (English) to LS (Chinese), for example Google Translate, and add all keywords in the translation result, for example "光学镜头" (optical lens), to the set SYN. The other keywords are handled in the same way: the translations of "shot" include "射击" (shooting), "镜头" (lens) and "剂量" (dose); the translations of "camera lens" include "镜头"; the translations of "camera shot" include "镜头"; the translations of "guess" include "猜测" (guess) and "推测" (conjecture); and the translations of "snap" include "单元" (unit) and "乱射" (random shooting). The set SYN is then {"射击", "镜头", "剂量", "猜测", "推测", "单元", "乱射", "光学镜头"}.

Step 706 (optional): To improve the accuracy of the acquired synonyms, delete from the set SYN every keyword that is not suitable as a synonym of t (for example, "镜头"), including t itself. Here "射击", "镜头", "剂量", "猜测", "推测", "单元" and "乱射" are all unsuitable, so the keywords remaining in SYN, here "光学镜头", are taken as the acquired synonyms of t. A keyword is unsuitable as a synonym of t if the two cannot substitute for each other within the current domain, and suitable if they can.
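Steps 701 to 706 amount to a round trip through a second language with a thesaurus lookup in the middle. The following Python sketch shows that flow under stated assumptions: `acquire_synonyms` and its four callbacks are hypothetical names, no real translation or thesaurus API is called, and the suitability check of step 706 is passed in as a function because the description leaves that judgment to a domain-aware reviewer.

```python
from typing import Callable, Iterable, Optional, Set

# Hypothetical interfaces; in the description these roles are played by a
# translation system (e.g. Google Translate) and a thesaurus (e.g. WordNet).
Lookup = Callable[[str], Iterable[str]]
Suitable = Callable[[str, str], bool]     # (t, candidate) -> keep candidate?


def acquire_synonyms(
    t: str,
    to_target: Lookup,        # translates from LS to LT
    thesaurus: Lookup,        # synonyms within LT
    to_source: Lookup,        # translates from LT back to LS
    is_suitable: Optional[Suitable] = None,
) -> Set[str]:
    trans: Set[str] = set(to_target(t))        # step 702: fill TRANS
    ts: Set[str] = set()
    for word in trans:                         # step 703: fill TS
        ts.update(thesaurus(word))
    trans |= ts                                # step 704: TRANS = TRANS ∪ TS
    syn: Set[str] = set()
    for word in trans:                         # step 705: fill SYN
        syn.update(to_source(word))
    if is_suitable is not None:                # step 706 (optional filter)
        syn = {c for c in syn if c != t and is_suitable(t, c)}
    return syn
```

With dictionary-backed stubs that return the example values from steps 702 to 705, the unfiltered call yields the eight-element set SYN; a filter that accepts only "光学镜头" reduces it to that single synonym, as in step 706.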

To implement the above method, the present invention also provides a domain ontology construction system. Fig. 8 is a schematic structural diagram of the domain ontology construction system of the present invention. As shown in Fig. 8, the system includes a listing unit 81, a sorting unit 82, an adding unit 83 and a union operation processing unit 84, wherein:

the listing unit 81 is configured to list the names of all terms that need to be described by the target ontology, forming a keyword set W0;

the sorting unit 82 is configured to sort all keywords in the keyword set W0, forming a keyword sequence S0;

the adding unit 83 is configured to create a set O of ontologies to be reused, submit all keywords in contiguous subsequences extracted from the keyword sequence S0 to the ontology retrieval system, and add the highest-ranked ontology in the retrieval result to the ontology set O;

the union operation processing unit 84 is configured to perform a set union operation on all ontologies in the ontology set O, forming a new ontology o.

The system further includes:

a naming unit 85, configured to name the terms described in the new ontology o;

an acquisition unit 86, configured to acquire synonyms according to the names of the terms described in the new ontology o.
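The set union performed by the union operation processing unit 84 can be sketched in Python as follows. The representation is an assumption made for illustration only: each ontology is modeled as a set of (subject, predicate, object) axioms, which the description does not prescribe, and `union_ontologies` and the example axioms are hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple

# Assumed representation: an ontology is a set of axioms, and each axiom is
# a (subject, predicate, object) triple of strings.
Axiom = Tuple[str, str, str]
Ontology = FrozenSet[Axiom]


def union_ontologies(ontologies: Iterable[Ontology]) -> Ontology:
    """Merge every ontology in the set O into one new ontology o."""
    merged = set()
    for ontology in ontologies:
        merged |= ontology          # axioms shared by several inputs appear once
    return frozenset(merged)


# Hypothetical contents for the two ontologies o1 and o2 of the example.
o1 = frozenset({("Sensor", "hasProperty", "Pixel")})
o2 = frozenset({("Lens", "hasProperty", "FocalLength"),
                ("Lens", "hasProperty", "Aperture")})
o = union_ontologies([o1, o2])      # contains all three axioms
```

Because the result is a plain union, it can still contain redundant or conflicting terms drawn from different sources, which is why the merged ontology is subsequently edited to fit the target domain.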

The above is merely a preferred embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (11)

1. A domain ontology construction method, characterized by comprising:
listing the names of all terms that need to be described by a target ontology to form a keyword set W0;
sorting all keywords in the keyword set W0 to form a keyword sequence S0;
creating an ontology set O to be reused, submitting all keywords in contiguous subsequences extracted from the keyword sequence S0 to an ontology retrieval system, and adding the highest-ranked ontology in the retrieval result to the ontology set O;
performing a set union operation on all ontologies in the ontology set O to form a new ontology o;
wherein submitting all keywords in the contiguous subsequences extracted from the keyword sequence S0 to the ontology retrieval system and adding the highest-ranked ontology in the retrieval result to the ontology set O comprises:
creating the ontology set O to be reused, denoting the keyword sequence S0 as S, obtaining the longest subsequence SH among the prefix contiguous subsequences of S that satisfy a condition, and cutting SH off the front of S to obtain the remaining suffix contiguous subsequence ST;
determining whether SH is an empty sequence; if SH is an empty sequence, deleting the first keyword from ST; if SH is not an empty sequence, adding the highest-ranked ontology in the retrieval result HITS(SH) to O;
determining whether ST is an empty sequence; if ST is not an empty sequence, denoting ST as S, again obtaining the longest subsequence SH among the prefix contiguous subsequences of S that satisfy the condition, and cutting SH off the front of S to obtain the remaining suffix contiguous subsequence ST; otherwise, if ST is an empty sequence, ending the process.
2. The method of claim 1, further comprising: naming the terms described in the new ontology o and performing synonym acquisition according to the names of the terms described in the new ontology o.
3. The method of claim 1, wherein listing the names of all terms that need to be described by the target ontology to form the keyword set W0 comprises:
for the target domain described by the target ontology, using keywords in a natural language LS to list the names of all terms that need to be described by the target ontology, forming the keyword set W0.
4. The method of claim 1, wherein sorting all keywords in the keyword set W0 to form the keyword sequence S0 comprises:
establishing a tree, wherein each node in the tree has a label and a processing flag;
determining whether the processing flags of all nodes in the tree are "processed"; if not, selecting a current node from all nodes in the tree whose processing flag is "unprocessed", the keyword set in the label of the current node being the current set;
determining whether the current set contains only one keyword; when the current set contains more than one keyword, dividing the current set into two subsets, adding the more important subset WL of the two to the tree as the left child of the current node, adding the other subset WR to the tree as the right child of the current node, and changing the processing flag of the current node to "processed"; otherwise, changing the processing flag of the current node to "processed"; then continuing to determine whether the processing flags of all nodes in the tree are "processed" until they all are, and forming the keyword sequence S0 according to the depth-first traversal order of the nodes corresponding to all keywords in the keyword set W0.
5. The method of claim 4, wherein the dividing the current set into two subsets is:
the keywords in the current set are used as descriptions of a domain or range, and the keywords in the two subsets are respectively used as descriptions of two different sub-domains or sub-ranges of the domain or range.
6. The method of claim 1, wherein the condition is that, after all keywords in the subsequence are combined into a query keyword group and the query keyword group is submitted to the ontology retrieval system, the retrieval result HITS(SH) is not empty.
7. The method according to claim 1, wherein performing the set union operation on all ontologies in the ontology set O to form the new ontology o comprises:
performing a set union operation on all ontologies in the ontology set O to form the new ontology o, and editing the new ontology o according to the requirements of describing the target domain;
wherein the editing includes at least adding terms and axioms, deleting terms and axioms, and modifying terms and axioms.
8. The method according to claim 2, wherein naming the terms described in the new ontology o comprises: naming each term described in the new ontology o with a word in LS.
9. The method of claim 2, wherein performing synonym acquisition according to the names of the terms described in the new ontology o comprises:
creating three keyword sets SYN, TRANS and TS for the name t of each term described in the new ontology o;
submitting t to a translation system from LS to another natural language LT, and adding all keywords in the translation result to the set TRANS;
for each keyword trans in the set TRANS, obtaining all synonyms of trans from a thesaurus of LT and adding all the obtained synonyms to the set TS;
adding all keywords in the set TS to the set TRANS, and, for each keyword trans' in the set TRANS, submitting trans' to a translation system from LT to LS and adding all keywords in the translation result to the set SYN;
deleting from the set SYN all keywords that are not suitable as synonyms of t, and taking all keywords remaining in SYN as the acquired synonyms of t.
10. A domain ontology construction system, characterized in that the system comprises a listing unit, a sorting unit, an adding unit and a union operation processing unit, wherein:
the listing unit is configured to list the names of all terms that need to be described by a target ontology to form a keyword set W0;
the sorting unit is configured to sort all keywords in the keyword set W0 to form a keyword sequence S0;
the adding unit is configured to create an ontology set O to be reused, submit all keywords in contiguous subsequences extracted from the keyword sequence S0 to an ontology retrieval system, and add the highest-ranked ontology in the retrieval result to the ontology set O;
the union operation processing unit is configured to perform a set union operation on all ontologies in the ontology set O to form a new ontology o;
the adding unit is specifically configured to: create the ontology set O to be reused, denote the keyword sequence S0 as S, obtain the longest subsequence SH among the prefix contiguous subsequences of S that satisfy a condition, and cut SH off the front of S to obtain the remaining suffix contiguous subsequence ST; determine whether SH is an empty sequence, and if SH is an empty sequence, delete the first keyword from ST, or if SH is not an empty sequence, add the highest-ranked ontology in the retrieval result HITS(SH) to O; and determine whether ST is an empty sequence, and if ST is not an empty sequence, denote ST as S, again obtain the longest subsequence SH among the prefix contiguous subsequences of S that satisfy the condition, and cut SH off the front of S to obtain the remaining suffix contiguous subsequence ST; otherwise, if ST is an empty sequence, end the process.
11. The system of claim 10, further comprising:
a naming unit for naming a term described in the new ontology o;
and the acquisition unit is used for carrying out synonym acquisition according to the name of the term described in the new ontology o.
CN201210017772.7A 2012-01-19 2012-01-19 A kind of Methodologies for Building Domain Ontology and system Active CN103218362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210017772.7A CN103218362B (en) 2012-01-19 2012-01-19 A kind of Methodologies for Building Domain Ontology and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210017772.7A CN103218362B (en) 2012-01-19 2012-01-19 A kind of Methodologies for Building Domain Ontology and system

Publications (2)

Publication Number Publication Date
CN103218362A CN103218362A (en) 2013-07-24
CN103218362B true CN103218362B (en) 2016-12-14

Family

ID=48816165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210017772.7A Active CN103218362B (en) 2012-01-19 2012-01-19 A kind of Methodologies for Building Domain Ontology and system

Country Status (1)

Country Link
CN (1) CN103218362B (en)

Also Published As

Publication number Publication date
CN103218362A (en) 2013-07-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant