CN103026356B

CN103026356B - Semantic content search

Info

Publication number: CN103026356B
Application number: CN201180029819.5A
Authority: CN
Inventors: E·I-C·张; M·T·吉勒姆; 许燕; C·菲尔德; J·汉德勒
Original assignee: Microsoft Technology Licensing LLC
Current assignee: Microsoft Technology Licensing LLC
Priority date: 2010-06-18
Filing date: 2011-06-06
Publication date: 2016-08-31
Anticipated expiration: 2031-06-06
Also published as: WO2011159516A3; EP2583203A2; EP2583203A4; CN103026356A; US8380719B2; US20110314024A1; WO2011159516A2

Abstract

One or more techniques and/or systems are disclosed that provide document retrieval in which a user can identify key attributes of desired potential target documents (e.g., documents that have specific semantic content for the user). Further, related documents that comprise the desired semantic content can be retrieved. Additionally, the user can provide feedback on the retrieved documents, for example, based on key semantic concepts found in the documents, and the input can be used to update the classification. This process can be iterated, for example, to improve the retrieval and precision of documents found by machine learning techniques.

Description

Semantic Content Search

Background

Document retrieval in an enterprise environment is an important problem, especially when key information needs to be found in a timely manner. For example, in a medical setting, it may be useful to find alternative cases related to the case a doctor is currently working on (e.g., to find patterns and/or specific treatment options). As an example, a doctor may be interested in finding previous patients who were both smokers and allergic to aspirin. Typically, document search involves keyword search, in which relevant terms that may be found in a document are entered into a search engine, and those documents that include the keywords are retrieved. Document retrieval can be performed on enterprise databases (e.g., a hospital's), distributed databases, and online resources (e.g., the Internet).

Summary

This Summary is provided to introduce, in simplified form, a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key factors or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Currently, free-text searches, such as keyword searches, often return too few or too many results. As an example, a keyword search performed with an Internet-based search engine may return millions of results. Reviewing the results of a keyword search can be time-consuming and often frustrating, because the returned documents frequently lack the relevant information. For example, a user may be looking for documents in which a word or phrase carries a particular meaning, but the search engine may return irrelevant documents that contain the same keyword with a different meaning. As another example, in English the same word is often used with entirely different meanings (e.g., "bass" can mean a fish, a musical instrument, or a shoe). Users nevertheless want to retrieve key information quickly from large databases.

Accordingly, one or more techniques and/or systems are disclosed herein that provide document retrieval in which a user can identify key attributes of desired potential target documents (e.g., documents that have specific semantic content for the user). Further, related documents that comprise the desired semantic content can be retrieved. Additionally, the user can provide feedback on the retrieved documents, for example, based on key semantic concepts found in the documents, and the input can be used to update the classification. This process can be iterated, for example, to improve the retrieval and precision of documents found by machine learning techniques.

In one embodiment of document search by semantic content, an end user's selection of a desired first portion of an initial document is received, where the initial document comes from a database comprising potential target documents. The initial document comprises metadata tags that describe attributes of the components of the initial document, and the selected first portion comprises components of the initial document that have semantic content desired by the user. The initial document, together with the selected first portion, is run through one or more trained classifiers to identify, from the database, a first potential target document having a second portion that comprises the same semantic content as the first portion (e.g., as selected by the user).

In this embodiment, if the second portion does not have the same semantic content as the first portion, an end user's selection of a third portion of the first potential target document is received, where the third portion comprises the same semantic content as the first portion. Further, the first potential target document, together with the selected third portion, is run through the one or more trained classifiers to identify a second potential target document from the database, where the second potential target document has a fourth portion with the same semantic content as the third portion.

To the accomplishment of the foregoing and related ends, the following description and annexed drawings set forth certain illustrative aspects and implementations. These are indicative of but a few of the various ways in which one or more aspects may be employed. Other aspects, advantages, and novel features of the disclosure will become apparent from the following Detailed Description when read in conjunction with the annexed drawings.

Brief Description of the Drawings

FIG. 1 is a flowchart of an exemplary method for document search by semantic content.

FIG. 2 is a flowchart illustrating an exemplary embodiment of an implementation of a method for document search by semantic content.

FIG. 3 is an illustration of an example chart that may be used to visually identify classifier accuracy.

FIG. 4 is a component diagram of an exemplary system for searching by semantic content.

FIG. 5 is a component diagram illustrating an exemplary embodiment in which one or more of the systems and/or techniques described herein may be implemented.

FIG. 6 is an illustration of an exemplary computer-readable medium comprising processor-executable instructions configured to embody one or more of the principles set forth herein.

FIG. 7 illustrates an exemplary computing environment in which one or more of the principles set forth herein may be implemented.

Detailed Description

The claimed subject matter is now described with reference to the drawings, in which like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the claimed subject matter. It may be evident, however, that the claimed subject matter may be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to facilitate describing the claimed subject matter.

FIG. 1 is a flowchart of an exemplary method 100 for document search by semantic content. The exemplary method 100 begins at 102 and, at 104, involves receiving an end user's selection of a desired first portion of an initial document from a database comprising potential target documents. Here, the initial document comprises meta-tags that describe attributes of the components (e.g., words) of the initial document, and the selected first portion comprises components of the initial document that have the desired semantic content.

For example, a document can be parsed to determine its grammatical structure. That is, a document may comprise sequences of tokens, such as words, and the one or more sequences can be tokenized into individual components, which are tagged according to their grammatical structure (e.g., word types such as nouns and verbs). Further, some types of parsing may allow the specific context of each respective component (e.g., medical terminology, engineering terminology, etc.) to be determined. In one embodiment, the documents in the database may have been parsed previously according to the user's intended use, with the respective components (e.g., words, blocks of text, etc.) tagged with metadata tags that may, for example, describe their type or even their context.
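The tokenize-and-tag step described above can be illustrated with a minimal Python sketch. It is not the parser used in the patent: the two lexicons, the `Component` class, and the `tag_document` and `select_portion` helpers are hypothetical names introduced only for illustration, and a real system would presumably rely on a proper grammatical parser rather than dictionary lookups.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy lexicons; a real parser would derive these tags itself.
WORD_TYPES = {"patient": "noun", "aspirin": "noun", "allergic": "adjective"}
CONTEXTS = {"patient": "medical", "aspirin": "medical"}


@dataclass
class Component:
    """One tokenized component of a document plus its metadata tags."""
    text: str
    start: int  # character offset where the component begins
    end: int    # character offset just past the component
    tags: dict = field(default_factory=dict)


def tag_document(text):
    """Split text into word components and attach type/context tags."""
    components = []
    for match in re.finditer(r"[A-Za-z0-9']+", text):
        word = match.group().lower()
        tags = {"type": WORD_TYPES.get(word, "unknown")}
        if word in CONTEXTS:
            tags["context"] = CONTEXTS[word]
        components.append(Component(match.group(), match.start(), match.end(), tags))
    return components


def select_portion(components, start, end):
    """Return the components lying fully inside a user-selected character range."""
    return [c for c in components if c.start >= start and c.end <= end]
```

A user selection can then be represented as the tagged components inside the selected character range, which is roughly the form in which a "first portion" would be handed to a classifier.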

Further, in one embodiment, the user may select a portion of a document that comprises the desired semantic content. That is, for example, the user may select the portion of a radiology report that identifies a recommended follow-up within three weeks. In this example, the recommendation that the patient return for a follow-up visit within three weeks comprises semantic content, because it has a specific meaning in the context of the surrounding words and of the radiology report. For example, a follow-up recommendation can be written in several different ways, but the meaning behind the words is very specific. As another example, the diagnosis portion of a doctor's report may include words such as "The diagnosis is...", "I believe the patient has...", "The test results indicate...", or a number of other variations. Here too, however, the meaning of the diagnosis is very specific.

In one embodiment, the initial document may be selected by an end user, such as a doctor or some other end user who wishes to identify a plurality of documents in the database that have particular semantic content. For example, an end user may be tasked with identifying customer report documents in an enterprise database in which complaints about poor response times are identified. In this example, a poor response time carries a semantic meaning that can be written in many different ways. In the past, an IT professional might have been called in to set up the search, train the classifiers, run tests, and reset and adjust the system to identify the desired documents. In this embodiment, an end user, such as an administrative assistant, a customer service representative, or another user, can select the desired semantic content to identify the desired documents.

At 106, the initial document with the selected first portion is run through one or more trained classifiers to identify a first potential target document from the database. In this embodiment, the first potential target document has a second portion that has the same semantic content as the first portion selected by the end user. For example, the one or more classifiers may be trained to identify words in documents using a variety of techniques, such as hidden Markov models (HMM), support vector machines (SVM), conditional random fields (CRF), statistical language models, and so on.

Here, for example, the one or more classifiers may attempt to identify, from the database, a target document that has the same semantic content that the end user identified in the initial document. Further, in this example, the one or more classifiers may attempt to highlight the portion (second portion) of the target document that comprises the same semantic content as the portion highlighted by the end user (first portion). In this way, the one or more classifiers are used to find documents that have the content desired by the end user, which may or may not use the same words but have the same meaning.

At 108 of the exemplary method 100, if the second portion does not have the same semantic content as the first portion, an end user's selection of a third portion of the first potential target document is received at 110, where the third portion comprises the same semantic content as the first portion. For example, the portion of the target document highlighted by the one or more classifiers may not have the same desired semantic content as the portion the end user selected in the initial document. That is, the classifiers may have misclassified the semantic content and selected content from the retrieved target document that does not match what the end user wants.

As an example, when the one or more classifiers return misclassified content, the user can review the target document and select the content (third portion) that corresponds to the desired semantic content (e.g., the same as that of the first portion in the initial document). At 112, the first potential target document with the selected third portion can be run through the one or more trained classifiers to identify a second potential target document from the database. Here, the second potential target document comprises a fourth portion that has the same semantic content as the third portion.

As an example, after reviewing the first document returned by the one or more classifiers and determining that they did not identify the correct content, the end user highlights the correct content and reruns the document through the one or more classifiers. The one or more classifiers can then return another document from the database with a highlighted portion that has the semantic content desired by the end user. In one embodiment, the steps at 108 can be iterated, for example, until the desired semantic content is retrieved by the one or more classifiers from target documents in the database. In this way, in this example, the one or more classifiers can also be trained to identify the end user's desired semantic content.
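The select, retrieve, correct, and rerun loop described above can be sketched as follows. This is only an illustrative sketch: `ToyClassifier` is a hypothetical word-weight scorer standing in for the trained classifiers (HMM, SVM, CRF, and so on) named in the text, documents are simplified to lists of sentences, and the `judge` callback is an assumed stand-in for the end user's review.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trained classifier: word weights learned from feedback."""

    def __init__(self):
        self.weights = Counter()

    def update(self, portion, is_relevant):
        # Raise weights for words in accepted portions, lower them for rejected ones.
        for word in portion.lower().split():
            self.weights[word] += 1 if is_relevant else -1

    def score(self, sentence):
        return sum(self.weights[w] for w in sentence.lower().split())

    def best_portion(self, document):
        # A document is modeled as a list of sentences; propose the best-scoring one.
        return max(document, key=self.score)


def search_with_feedback(clf, first_portion, database, judge, max_rounds=10):
    """Retrieve documents one at a time, folding the user's corrections back in.

    judge(document, proposed) plays the end user: it returns the proposed
    portion if it is correct, a corrected portion from the same document,
    or None when the document has no matching content.
    """
    clf.update(first_portion, True)  # seed with the user's first selection
    remaining = list(database)
    accepted = []
    for _ in range(max_rounds):
        if not remaining:
            break
        doc = max(remaining, key=lambda d: clf.score(clf.best_portion(d)))
        remaining.remove(doc)
        proposed = clf.best_portion(doc)
        corrected = judge(doc, proposed)
        if corrected is None:
            clf.update(proposed, False)
            continue
        if corrected != proposed:
            clf.update(proposed, False)  # the highlighted portion was wrong
        clf.update(corrected, True)      # the user's (re)selection is right
        accepted.append((doc, corrected))
    return accepted
```

In this sketch, each correction lowers the weight of words in the wrongly highlighted portion and raises the weight of words in the user's selection, so later proposals drift toward the desired content even when the wording differs from the initial selection.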

The exemplary method 100 ends at 114 after a target document having the desired semantic content has been identified.

FIG. 2 is a flowchart illustrating an exemplary embodiment 200 of an implementation of a method for document search by semantic content. The exemplary embodiment 200 begins at 202 and, at 204, involves performing a keyword search of a document collection. For example, the document collection may comprise an enterprise database, a collection of distributed databases, or documents from the Internet. In this embodiment, a keyword search may be performed using, for example, words that may be found in documents comprising the semantic content desired by the end user.

As an example, a hospital administrator may wish to identify patients for whom a doctor has recommended a follow-up visit, for instance to determine whether the follow-up visit took place and/or what its outcome was. In this embodiment, the hospital may store millions of documents, and performing only a semantic content search on a document set of that size could be burdensome. Therefore, the administrator can run a keyword search of the collection, using words such as "follow-up," "recheck," "return," and other words with similar meaning. The results of this keyword search can be used to populate a target database 250 of potential target documents, which can then be searched by semantic content.
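The keyword pre-filter that populates the target database can be sketched in a few lines of Python. The `keyword_prefilter` function, the dictionary layout of the collection, and the plain substring match are assumptions made for illustration; the patent does not specify how the keyword search is implemented.

```python
def keyword_prefilter(collection, keywords):
    """Return the subset of documents that mention at least one keyword.

    collection maps a document id to its text. Matching is a simple
    case-insensitive substring test, which is an illustrative assumption.
    """
    lowered = [k.lower() for k in keywords]
    return {
        doc_id: text
        for doc_id, text in collection.items()
        if any(k in text.lower() for k in lowered)
    }


# Hypothetical example collection.
reports = {
    "r1": "Patient should return for a follow-up in three weeks.",
    "r2": "No further action needed.",
    "r3": "RECHECK in one month.",
}
target_database = keyword_prefilter(reports, ["follow-up", "recheck", "return"])
```

The point of the two-stage design is that the cheap keyword pass shrinks the collection, so the more expensive semantic classifiers only run over documents that plausibly contain the desired content.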

At 206 in the exemplary embodiment 200, the end user can select desired text (e.g., text comprising the desired semantic content) in an initial document 252. As an example, the hospital administrator may identify a document that comprises a doctor's recommendation that the "patient return for a checkup within one month." In this example, the administrator can select this part of the doctor's report as comprising the desired semantic content for the semantic content search. At 208, the initial document with the selected text is run through the classifiers 254.

In one embodiment, running a document through the classifiers 254 comprises instructing the classifiers to look for text of the selected type in the documents in the target document database 250. A plurality of trained classifiers 254 can be used in parallel to retrieve potential target documents 256 from the database. In one embodiment, the end user identifies which of the plurality of classifiers has a desired accuracy in retrieving potential target documents 256 for the desired semantic content. In this embodiment, the identified classifier can then be used, for example, to retrieve target documents from the database for the desired semantic content, so that retrieval can be performed faster and with fewer errors.

In another embodiment, a second classifier can be used to cross-validate the potential target documents retrieved by a first classifier. Further, a combination of two or more classifiers can be identified that has a desired accuracy in retrieving potential target documents for the desired semantic content. In this embodiment, the identified combination of classifiers can then be used to retrieve documents from the database for the desired semantic content.

FIG. 3 is an illustration of an example chart 300 that may be used to visually identify classifier accuracy. For example, a chart such as the example 300 may be displayed to the end user while the end user trains the classifiers to retrieve documents comprising the desired content, so that the end user can see how the classifiers are performing. In the example 300, accuracy is represented by the vertical line 310, and the classifiers are represented along the horizontal line 312. In this example, four classifiers 302-308 are used to identify target documents.

An accuracy 310 can be determined for each respective classifier, based on whether the classifier correctly identified the content and/or the document. In this example, classifier C2 304 can be seen to have the highest accuracy. Therefore, the user may decide to use only C2 304 to perform target document retrieval for the particular semantic content. Different semantic content and/or different types of documents may yield different classifier accuracy results. Thus, in one example, classifier C2 304 may not yield the same accuracy for different content and/or documents.
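The per-classifier accuracy comparison above reduces to a small calculation, sketched here in Python. The `accuracy` and `pick_classifier` functions and the feedback values are hypothetical; the sketch assumes accuracy is simply the fraction of retrievals the end user marked as correct, which the patent does not state explicitly.

```python
def accuracy(judgements):
    """Fraction of retrievals the end user marked as correct (0.0 if none yet)."""
    return sum(judgements) / len(judgements) if judgements else 0.0


def pick_classifier(feedback):
    """Return the name of the most accurate classifier and all accuracy values.

    feedback maps a classifier name to a list of booleans, one per retrieved
    document, True meaning the user confirmed the retrieval.
    """
    rates = {name: accuracy(marks) for name, marks in feedback.items()}
    best = max(rates, key=rates.get)
    return best, rates


# Hypothetical feedback for four classifiers, mirroring C1-C4 in FIG. 3.
feedback = {
    "C1": [True, False, False, True],
    "C2": [True, True, True, False],
    "C3": [False, True],
    "C4": [False],
}
best, rates = pick_classifier(feedback)
```

Because the accuracy values depend on the content being searched for, such a comparison would need to be recomputed for each new kind of semantic content rather than cached across searches.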

Returning to FIG. 2, a potential target document 256 is retrieved from the database 250. At 210, for example, it is determined whether the potential target document meets the selection criteria, that is, whether the document comprises selected content that corresponds to the desired semantic content selected by the end user. If the second portion identified in the potential target document 256 has the same semantic content as the first portion selected by the end user in the initial document, the end user can indicate that the classifier has correctly identified the desired semantic content in the target document and, at 212, the classifiers 254 can be updated with this information.

That is, in one embodiment, user input on the potential target documents returned by the one or more classifiers can be used to update the one or more classifiers. In this way, for example, the classifiers are trained to identify appropriate documents that comprise the desired semantic content. Further, after the classifiers have been updated with the indication that the retrieval was correct, the one or more classifiers can be run against the database again to select a third potential target document 256. This process can be iterated over the database to retrieve a plurality of appropriate target documents.

However, at 210, if the retrieved target document 256 does not identify content that has the same semantic meaning as the content selected by the end user, then at 214 the user can select the portion of the returned document that comprises the desired semantic content. For example, the end user may have selected a social security number (SSN) in the initial document, intending for the classifiers to retrieve SSNs from the documents in the database 250. After the initial document is run through the classifiers, however, a phone number is identified in the potential target document 256. In this example, the end user then indicates that the classifiers (e.g., one or more of them) identified the semantic content incorrectly, and highlights the correct information, namely the SSN in the document.

At 208, the corrected document can be run through the classifiers 254 again to retrieve a second potential target document 256. In one embodiment, the portion (second portion) of the potential target document 256 identified by the classifiers 254 may not have the same semantic content as the first portion, and the potential target document may not comprise any content with the same semantic content as the user-selected portion (first portion) of the initial document. That is, the returned document may have no matching content for the user to select and rerun through the classifiers.

In this embodiment, the end user can indicate that the first potential target document does not comprise content having the same semantic content as the first portion and, at 212, the classifiers can be updated with this information. Further, the one or more classifiers can then be run against the database 250 to select another potential target document 256 (a third potential target document). As an example, at 210, if that document meets the selection criteria, the classifiers can be updated again in order to train them to identify the appropriate semantic content.

In one embodiment, the end user can provide input to classifier training that is used to update the classifiers (e.g., to make them more accurate). For example, for each document, the end user can indicate that a document retrieved from the database by the one or more classifiers comprises the desired semantic content. Further, the end user can indicate that a document retrieved from the database by the one or more classifiers does not comprise the desired semantic content. Additionally, the end user can provide a selected portion of a document retrieved from the database by the one or more classifiers, where the selected portion comprises the desired semantic content.

In this embodiment 200, a threshold can be set for the classifiers, for example, when retrieving target documents. In this embodiment, a plurality of documents can be run through the one or more classifiers until a desired threshold is reached, which can be determined at 216. The desired threshold may comprise different criteria, which can be selected by the end user or set automatically (e.g., by default).

In one embodiment, a plurality of documents can be run through the one or more classifiers until a desired document selection accuracy is reached for the desired semantic content. For example, a user may want document retrieval to be one hundred percent accurate where the information is critical, or may be satisfied with ninety percent accuracy where accuracy is less important.

Further, in another embodiment, a plurality of documents can be run through the one or more classifiers until a desired number of correct documents have been retrieved without any incorrect document being retrieved. For example, a user may be satisfied once document retrieval has returned one hundred correct documents without any error, and can then let the retriever run unsupervised. In yet another embodiment, a plurality of documents can be run through the one or more classifiers until a desired number of documents have been retrieved from the database. For example, an end user may need only one thousand documents for the intended purpose, and can run document retrieval until that number is reached.

At 216, if the desired threshold is not met, the classifiers 254 are run against the database 250 at 218 to select another potential target document 256. Further, each potential target document that is selected and meets the selection criteria is stored in a target document database 258. At 216, if the threshold is met, the exemplary embodiment 200 ends at 220.
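The three kinds of stopping threshold described above (a selection accuracy, a run of correct retrievals without an error, and a target number of documents) can be sketched as a single check in Python. The `StopCriteria` fields and the `threshold_met` function are hypothetical names, and two details are assumptions rather than statements from the patent: the document-count criterion here counts only accepted documents, and the accuracy criterion needs at least one judgement before it can be met.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopCriteria:
    """Optional stopping thresholds; a value of None disables that criterion."""
    min_accuracy: Optional[float] = None    # e.g. 0.9 for ninety percent
    correct_streak: Optional[int] = None    # consecutive correct retrievals
    target_documents: Optional[int] = None  # accepted documents wanted


def trailing_streak(judgements):
    """Count the consecutive correct retrievals at the end of the history."""
    streak = 0
    for correct in reversed(judgements):
        if not correct:
            break
        streak += 1
    return streak


def threshold_met(judgements, criteria):
    """Return True once any enabled criterion is satisfied.

    judgements lists the end user's verdicts in retrieval order,
    True meaning the retrieved document was correct.
    """
    total = len(judgements)
    correct = sum(judgements)
    if criteria.min_accuracy is not None and total > 0:
        if correct / total >= criteria.min_accuracy:
            return True
    if criteria.correct_streak is not None:
        if trailing_streak(judgements) >= criteria.correct_streak:
            return True
    if criteria.target_documents is not None:
        if correct >= criteria.target_documents:
            return True
    return False
```

A retrieval loop would call `threshold_met` after each user verdict, which corresponds to the decision at 216, and keep selecting further potential target documents while it returns False.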

In one aspect, the one or more classifiers can be decision engines that compute the probability that a potential target document in the database comprises the desired semantic content. In one embodiment, a classification threshold can be provided to the classifiers (e.g., by the end user or by default) that determines whether a document should be presented for end user input. That is, for example, a semi-supervised type of training may occur, in which the classifier presents a document only when it is not confident of the classification result.

As an example, in an SVM, decisions about documents (e.g., "yes," the document comprises the semantic content, or "no," the document does not comprise the semantic content) can be plotted on a matrix to determine on which side of the decision matrix they fall. In this example, there may be a gap between the "yes" and "no" portions of the matrix, in which the SVM is uncertain about the decision. In this embodiment, for example, this gap may comprise the threshold within which the classifier presents the document to the end user for input on whether the document comprises the desired semantic content. That is, the desired threshold may comprise an uncertainty threshold for the classifier model, and a document in doubt is presented to the user only when the classifier model's computation falls within that threshold.
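The uncertainty gap described above can be sketched in Python with a plain linear decision function standing in for a trained SVM. The weights, the bias, the `margin` value, and the function names are hypothetical; the sketch only illustrates the routing rule, in which a document is shown to the end user only when its score falls inside the gap between the "yes" and "no" regions.

```python
def decision_value(weights, bias, features):
    """Signed score of a linear decision function (a stand-in for an SVM)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def route(score, margin=1.0):
    """Decide what to do with a document given its decision score.

    Scores at or beyond +margin count as a confident "yes", scores at or
    beyond -margin as a confident "no"; anything strictly in between lies
    in the uncertainty gap and is presented to the end user.
    """
    if score >= margin:
        return "accept"
    if score <= -margin:
        return "reject"
    return "ask_user"


# Hypothetical model parameters and document feature vectors.
weights, bias = [2.0, -1.0], -0.5
confident_yes = route(decision_value(weights, bias, [1.0, 0.0]))
confident_no = route(decision_value(weights, bias, [0.0, 1.0]))
uncertain = route(decision_value(weights, bias, [0.5, 0.25]))
```

Widening `margin` sends more documents to the user and gathers more training feedback, while narrowing it lets the classifier run with less supervision.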

For example, a system can be devised for finding documents that comprise particular semantic content. FIG. 4 is a component diagram of an exemplary system 400 for searching by semantic content. A memory component 402 stores a database 450 comprising a plurality of potential target documents. A processor component 404 is operably coupled with the memory component 402 and executes the instructions of one or more classifiers 410. In one embodiment, the memory component 402 and the processor component 404 may be disposed on the same computing device. In other embodiments, these components may be disposed separately and/or may reside on the same computing device as other components of the exemplary system 400.

An end user input receiving component 406 receives input from an end user 452 for a document, such as a document retrieved from the database 450 and/or, for example, an initial document that can be used to seed the database. The input provided by the end user 452 may comprise a desired portion of a first document selected by the end user from the database 450, where the selected portion comprises document components of the initial document that have the desired semantic content. For example, the end user 452 may provide an initial document as a seed for the database 450 by selecting a portion of the initial document (e.g., literal text in the document) that has the semantic content desired by the user 452.

In addition, the input provided by the end user 452 may comprise an end-user indication that a second document retrieved from the database 450 contains the same semantic content as the selected desired portion of the first document. For example, the end user may run the initial document, with the selected desired semantic content, through the one or more classifiers 410 to retrieve the second document. In this example, the user 452 may review the second document to determine that it does have the same semantic content as that selected in the initial document. The user can then indicate, for example by an input, that the retrieved document is correct.

Further, the input provided by the end user 452 may comprise an end-user indication that a second document retrieved from the database does not contain the same semantic content as the selected desired portion of the first document. For example, the end user may run the initial document, with the selected desired semantic content, through the one or more classifiers 410 to retrieve the second document. In this example, the user 452 may review the second document and determine that it does not have the same semantic content as that selected in the initial document (e.g., or that the content selected by the classifier is incorrect). The user can then indicate, for example by an input, that the retrieved document is incorrect.

The one or more classifier components 410 may be operably coupled with the processor component 404 and the memory component 402. The one or more classifier components 410 are used to identify a second document from the database, where the second document contains a target portion having the same semantic content as the selected desired portion of the first document. For example, the end user can seed the database 450 by running an initial document through the one or more classifier components 410, so that the classifier components 410 can be trained to retrieve from the database 450 only those target documents that contain the desired semantic content.

A classification update component 408 may be operably coupled with the end user input receiving component 406 and is used to update the one or more classifier components 410, using the end-user input, to identify the desired semantic content. For example, user input such as that described above is used to train the one or more classifier components so that they identify target documents in the database 450 that contain the desired semantic content more accurately.
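As a rough illustration of updating a classifier from end-user feedback, the sketch below uses a bag-of-words perceptron. This is an assumed stand-in chosen for brevity: the source does not specify this model, and the class `FeedbackClassifier` and its methods are hypothetical names.

```python
class FeedbackClassifier:
    """Toy bag-of-words perceptron updated from end-user yes/no feedback."""

    def __init__(self) -> None:
        self.weights: dict[str, float] = {}
        self.bias = 0.0

    @staticmethod
    def tokens(text: str) -> list[str]:
        # Naive whitespace tokenizer, used only for illustration.
        return text.lower().split()

    def score(self, text: str) -> float:
        return self.bias + sum(self.weights.get(t, 0.0) for t in self.tokens(text))

    def predict(self, text: str) -> bool:
        """True means the document is predicted to contain the desired content."""
        return self.score(text) > 0

    def update(self, text: str, has_content: bool) -> None:
        """Perceptron step: adjust weights only when the user disagrees."""
        if self.predict(text) == has_content:
            return
        step = 1.0 if has_content else -1.0
        for t in self.tokens(text):
            self.weights[t] = self.weights.get(t, 0.0) + step
        self.bias += step
```

Each call to `update` corresponds to one piece of end-user input, either a confirmation that a retrieved document is correct or an indication that it is incorrect.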

FIG. 5 is a component diagram illustrating one embodiment 500 in which one or more of the systems and/or techniques described herein may be implemented. A database population component 520 is used to populate the database 450 with potential target documents using a keyword search of a document collection 560. For example, using the database population component 520, a document collection containing millions of potential target documents can be reduced to a few thousand by keyword search.
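The keyword-based population step can be sketched as a simple filter. This is a minimal sketch under assumptions: the collection is modeled as a dictionary from document ID to text, matching is a case-insensitive substring test, and the function name `populate_database` is hypothetical.

```python
def populate_database(collection: dict[str, str], keywords: list[str]) -> dict[str, str]:
    """Keep only documents whose text mentions at least one keyword.

    Matching is a case-insensitive substring test, which is an assumption;
    a production system would more likely use a search index.
    """
    wanted = [k.lower() for k in keywords]
    return {
        doc_id: text
        for doc_id, text in collection.items()
        if any(k in text.lower() for k in wanted)
    }
```

The reduced dictionary then plays the role of the database of potential target documents that the classifiers are run against.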

A document presentation component 522 presents documents from the database 450 to the end user 452 for identification of the desired semantic content. For example, a potential target document can be retrieved from the database 450 using a trained classifier, and the document presentation component 522 can present it to the end user 452 to determine whether it satisfies the desired semantic content selection criteria. In one embodiment, the document presentation component 522 may utilize a computer-based display, such as a monitor, and the end user 452 may simply review the content on the display.

A database indexing component 524 can be used to tag documents in the database 450 that contain the desired semantic content. For example, when a classifier component retrieves a target document and determines that it contains the desired semantic content, the database indexing component 524 can attach metadata to the document indicating that it has the desired semantic content. Further, in one embodiment, the database indexing component 524 can attach metadata to the portion of the target document that contains the desired semantic content. For example, identifying this content may facilitate information gathering by the end user.
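One way to picture portion-level tagging is to record a label together with the character span of the matching text. The sketch below is illustrative only: the `IndexedDocument` type, the `tag_portion` function, and the tag layout are assumptions, not details from the source.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDocument:
    """Hypothetical document record carrying metadata tags."""
    doc_id: str
    text: str
    tags: list[dict] = field(default_factory=list)


def tag_portion(doc: IndexedDocument, portion: str, label: str) -> bool:
    """Attach a metadata tag to the first occurrence of `portion` in the document.

    Returns False, leaving the document unchanged, if the portion is absent.
    """
    start = doc.text.find(portion)
    if start == -1:
        return False
    doc.tags.append({"label": label, "start": start, "end": start + len(portion)})
    return True
```

Storing offsets rather than copying the text lets a viewer highlight the tagged portion in place when the document is later presented to the end user.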

In one embodiment, the database may include one or more sample documents (e.g., seed documents) containing the desired semantic content (e.g., what the end user is searching for). In this embodiment, the one or more sample documents can be used to train the one or more classifiers 410 to identify potential target documents having the same semantic content as the desired semantic content. In another embodiment, the desired portion of a document may be text, where the document components consist of words (e.g., a text-based document). As an example, the text may contain numbers, symbols, or other text-based elements combined into strings.

Yet another embodiment involves a computer-readable medium comprising processor-executable instructions configured to implement one or more of the techniques presented herein. An exemplary computer-readable medium that can be devised in these ways is shown in FIG. 6, where the implementation 600 comprises a computer-readable medium 608 (e.g., a CD-R, DVD-R, or a platter of a hard disk drive) on which computer-readable data 606 is encoded. This computer-readable data 606 in turn comprises a set of computer instructions 604 configured to operate according to one or more of the principles set forth herein. In one such embodiment 602, the processor-executable instructions 604 may be configured to perform a method, such as, for example, the exemplary method 200 of FIG. 1. In another such embodiment, the processor-executable instructions 604 may be configured to implement a system, such as, for example, the exemplary system 400 of FIG. 4. Those of ordinary skill in the art can devise many such computer-readable media that are configured to operate in accordance with the techniques described herein.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

As used in this application, the terms "component," "module," "system," "interface," and the like are generally intended to refer to a computer-related entity, which may be hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers.

Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.

FIG. 7 and the following discussion provide a brief, general description of a suitable computing environment for implementing embodiments of one or more of the principles set forth herein. The operating environment of FIG. 7 is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the operating environment. Example computing environments include, but are not limited to, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, personal digital assistants (PDAs), media players, and the like), multiprocessor systems, consumer electronics, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

Although not required, embodiments are described in the general context of "computer readable instructions" being executed by one or more computing devices. Computer readable instructions may be distributed via computer readable media (discussed below). Computer readable instructions may be implemented as program modules, such as functions, objects, application programming interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer readable instructions may be combined or distributed as desired in various environments.

FIG. 7 illustrates an example of a system 710 comprising a computing device 712 configured to implement one or more embodiments provided herein. In one configuration, the computing device 712 includes at least one processing unit 716 and memory 718. Depending on the exact configuration and type of computing device, the memory 718 may be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.), or some combination of the two. This configuration is illustrated in FIG. 7 by dashed line 714.

In other embodiments, the device 712 may include additional features and/or functionality. For example, the device 712 may also include additional storage (e.g., removable and/or non-removable) including, but not limited to, magnetic storage, optical storage, and the like. Such additional storage is illustrated in FIG. 7 by storage 720. In one embodiment, computer readable instructions to implement one or more embodiments described herein may be in the storage 720. The storage 720 may also store other computer readable instructions to implement an operating system, an application program, and the like. Computer readable instructions may be loaded into the memory 718 for execution by the processing unit 716, for example.

The term "computer readable media" as used herein includes computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data. The memory 718 and the storage 720 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the device 712. Any such computer storage media may be part of the device 712.

The device 712 may also include communication connection(s) 726 that allow the device 712 to communicate with other devices. The communication connection(s) 726 may include, but are not limited to, a modem, a network interface card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting the computing device 712 to other computing devices. The communication connection(s) 726 may include a wired connection or a wireless connection. The communication connection(s) 726 may transmit and/or receive communication media.

The term "computer readable media" may include communication media. Communication media typically embodies computer readable instructions or other data in a "modulated data signal" such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" refers to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.

The device 712 may include input device(s) 724 such as a keyboard, mouse, pen, voice input device, touch input device, infrared camera, video input device, and/or any other input device. Output device(s) 722, such as one or more displays, speakers, printers, and/or any other output device, may also be included in the device 712. The input device(s) 724 and output device(s) 722 may be connected to the device 712 via a wired connection, a wireless connection, or any combination thereof. In one embodiment, an input device or an output device from another computing device may be used as the input device(s) 724 or output device(s) 722 for the computing device 712.

Components of the computing device 712 may be connected by various interconnects, such as a bus. Such interconnects may include a Peripheral Component Interconnect (PCI), such as PCI Express, a Universal Serial Bus (USB), FireWire (IEEE 1394), an optical bus structure, and the like. In another embodiment, components of the computing device 712 may be interconnected by a network. For example, the memory 718 may comprise multiple physical memory units located in different physical locations and interconnected by a network.

Those skilled in the art will realize that storage devices utilized to store computer readable instructions may be distributed across a network. For example, a computing device 730 accessible via a network 728 may store computer readable instructions to implement one or more embodiments provided herein. The computing device 712 may access the computing device 730 and download a part or all of the computer readable instructions for execution. Alternatively, the computing device 712 may download pieces of the computer readable instructions as needed, or some instructions may be executed at the computing device 712 and some at the computing device 730.

Various operations of embodiments are provided herein. In one embodiment, one or more of the operations described may constitute computer readable instructions stored on one or more computer readable media, which, if executed by a computing device, will cause the computing device to perform the operations described. The order in which some or all of the operations are described should not be construed as to imply that these operations are necessarily order dependent. Alternative orderings will be appreciated by those skilled in the art having the benefit of this description. Further, it will be understood that not all operations are necessarily present in each embodiment provided herein.

Moreover, the word "exemplary" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "exemplary" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "exemplary" is intended to present concepts in a concrete fashion. As used in this application, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing instances. In addition, the articles "a" and "an" as used in this application and the appended claims may generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.

Also, although the invention has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The invention includes all such modifications and alterations and is limited only by the scope of the following claims. In particular, regarding the various functions performed by the above-described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component that performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure that performs the function in the exemplary implementations of the invention illustrated herein. In addition, while a particular feature of the invention may have been disclosed with respect to only one of several implementations, such a feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes", "having", "has", "with", or variants thereof are used in either the specification or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".

Claims

1. A computer-based method for searching documents based on semantic content, comprising:

receiving an end-user selection of a desired first portion of an initial document from a database comprising potential target documents, the initial document comprising first metadata tags describing attributes of respective components of the initial document, the first portion comprising desired semantic content;

using a processor of a computer, running the initial document comprising the selected first portion through one or more trained classifiers to identify, from the database, a first potential target document comprising a second portion; and

in response to determining that the second portion does not have the desired semantic content:

receiving an end-user selection of a third portion of the first potential target document, the third portion comprising the desired semantic content; and

running the first potential target document comprising the selected third portion through the one or more trained classifiers to identify, from the database, a second potential target document comprising a fourth portion having the desired semantic content; and

at least one of the following acts:

receiving end-user input for respective documents, the input comprising one or more of:

an indication that a document retrieved from the database by the one or more classifiers comprises the desired semantic content;

an indication that a document retrieved from the database by the one or more classifiers does not comprise the desired semantic content; and

a selected portion of a document retrieved from the database by the one or more classifiers, wherein the selected portion comprises the desired semantic content;

running a plurality of documents through the one or more classifiers until a desired document selection accuracy is reached for the desired semantic content;

running the plurality of documents through the one or more classifiers until a desired number of correct documents is retrieved without retrieving an incorrect document;

running the plurality of documents through the one or more classifiers until a desired number of documents is retrieved from the database;

using a second classifier to confirm a document retrieved by a first classifier; or

identifying a combination of two or more classifiers that retrieves documents with a desired accuracy rate, and using the identified combination to retrieve documents to find the desired semantic content.

2. The method of claim 1, characterized in that it further comprises populating the database with potential target documents by performing a keyword search on a set of documents.

3. The method of claim 1, characterized in that, in response to determining that the second portion has the desired semantic content, the method comprises:

running the one or more classifiers on the database to select a third potential target document.

4. The method of claim 1, characterized in that, if the second portion does not have the same semantic content as the first portion and the first potential target document does not comprise content having the same semantic content as the first portion, the method comprises:

receiving an end-user indication that the first potential target document does not comprise content having the same semantic content as the first portion; and

5. The method of claim 1, characterized in that it comprises updating the one or more classifiers using user input on potential target documents returned by the one or more classifiers.

6. The method of claim 1, characterized in that it comprises running the plurality of documents through the one or more classifiers until a desired threshold is reached.

7. The method of claim 1, characterized in that it comprises presenting a potential target document to the end user only when the document classification yields a result within a desired threshold.

8. The method of claim 7, characterized in that the desired threshold comprises an uncertainty threshold of a classifier model.