KR20160124079A

KR20160124079A - Systems and methods for in-memory database search

Info

Publication number: KR20160124079A
Application number: KR1020167017516A
Authority: KR
Inventors: 스캇 라이트너; 프란츠 베케서; 라케시 데이브; 산제이 보두; 조셉 베크넬; 비르알리 하키즈므와미
Original assignee: 큐베이스 엘엘씨
Priority date: 2013-12-02
Filing date: 2014-12-02
Publication date: 2016-10-26
Also published as: CN106164889A; WO2015084759A1; CA2932401A1; EP3077918A1; EP3077918A4; JP2017504105A

Abstract

Systems and methods for identifying related entities using an entity co-occurrence knowledge base are disclosed. Embodiments extract the entities identified in search queries using an entity co-occurrence knowledge base of entities extracted from an entity-indexed corpus, and provide related entities as search results. Embodiments that generate search suggestions using fuzzy-score matching against the entity co-occurrence knowledge base are also disclosed; these embodiments extract partial entities from search queries, run a matching algorithm selected according to the type of the extracted entities, and search the entity co-occurrence knowledge base. Further embodiments generate search suggestions of related entities based on co-occurrence and/or fuzzy-score matching; they process partial search queries and suggest completed queries that are then used as new search queries. Another embodiment generates search suggestions from entity co-occurrence by extracting entities from search queries using entity and trend co-occurrence databases. Finally, embodiments that enable geographic and place-name-based search in a content management system are disclosed.

Description

SYSTEM AND METHODS FOR IN-MEMORY DATABASE SEARCH

This disclosure relates generally to methods and systems for information retrieval, and more particularly to a method for searching for related entities using entity co-occurrence. This disclosure relates generally to query enhancement, and more particularly to search suggestions that use fuzzy-score matching and entity co-occurrence in a knowledge base. This disclosure relates generally to computer query processing, and more particularly to electronic search suggestions of related entities based on co-occurrence and/or fuzzy-score matching. This disclosure relates generally to methods and systems for information retrieval, and more particularly to a method for obtaining search suggestions. This disclosure relates generally to search engines and content management, and more particularly to extending the search engine technology of a content management system to enable geotagging and named-entity enrichment of digital content.

In a commercial context, a well-known search engine parses a set of search terms and returns a list of items (web pages, in a typical search) sorted in some manner. Most known approaches to search build a search query database, typically from the historical references of other users, that can ultimately be used to generate an index based on keywords. User search queries may include one or more entities identified by a name or by attributes that can be associated with the entity. Entities may include organizations, people, locations, dates, and/or times. In a typical search, if a user is looking for information about two particular organizations, the search engine may return an assortment of results that mixes in other entities with the same or similar names. This approach can leave the user sifting through a large number of documents that may have nothing to do with what the user is actually interested in.

Thus, there is a need for a method of searching for related entities that lets the user find the related entities of interest.

Users frequently use search engines to locate information of interest on the Internet or in any database system. A search engine typically operates by receiving a search query from a user and returning search results to the user. The search engine typically orders the results by the relevance of each returned result to the search query. The quality of the search query is therefore critical to the quality of the search results. In most cases, however, search queries from users are incomplete or partial (for example, the query does not contain enough words to produce a focused set of relevant results and instead produces a large number of irrelevant results) and are sometimes misspelled (for example, Bill Smith may be misspelled as Bill Smitth).

One common way to improve the quality of search results is to enhance the search query. One way to enhance search queries is to generate possible suggestions based on the user's input. To this end, several approaches propose methods for identifying candidate query refinements for a given query from past queries submitted by one or more users. However, these approaches rely on query logs, which can sometimes lead the user to results that are not of interest. Other approaches use different techniques that may not be accurate enough. Thus, there is still a need for a method of improving or enhancing search queries from users in order to obtain more accurate results.

Users frequently use search engines to locate information of interest from the Internet or from any database system. A search engine typically operates by receiving a search query from a user and returning search results to the user. The results are typically ordered by the relevance of each returned result to the search query. The quality of the search query is therefore critical to the quality of the search results. In most cases, however, search queries from users are incomplete or partial (for example, the query does not contain enough words to produce a focused set of relevant results and instead produces a large number of irrelevant results) and are sometimes misspelled (for example, Bill Smith may be misspelled as Bill Smitth).

One common way to improve the quality of search results is to enhance the search query. One way to enhance search queries is to generate possible suggestions based on the user's input. To this end, several approaches propose methods for identifying candidate query refinements for a given query from past queries submitted by one or more users. However, these approaches rely on query logs, which can sometimes lead the user to results that are not of interest. Other approaches use different techniques that may not be accurate enough. Thus, there is still a need for a method that improves or enhances search queries from users in order to obtain more accurate results, and that also offers users useful related entities of interest as they type the search query.

Search engines include a number of features that provide a forecast of the user's query. Such forecasts may include query auto-complete and search suggestions. Today, these forecasting methods are based on historical keyword references. Such historical references may be inaccurate, because a single keyword can relate to several topics within a single text.

In addition, user search queries may include one or more entities identified by a name or by attributes that can be associated with the entity. Entities may include organizations, people, locations, events, dates, and/or times. In a typical search, if a user is looking for information about two particular organizations, the search engine may return an assortment of results that mixes in other entities with the same or similar names. This approach can leave the user sifting through a large number of documents that may have nothing to do with what the user is actually interested in.

Thus, there is a need for a method of obtaining search suggestions that is faster and more accurate.

Content management and document management systems for document versioning and collaborative project management are known. One non-limiting example is Microsoft's SharePoint 2013® suite of software and tools. SharePoint 2013® is a family of software products developed by Microsoft for collaboration, file sharing, and web publishing. SharePoint 2013® can present users with vast amounts of content or information, which makes it difficult for a user to find the information most relevant to a particular situation. To alleviate this problem, SharePoint 2013® provides a search engine that helps users find the content they need. A user can enter a keyword-based search query, and once the content has been indexed, the search engine in SharePoint 2013® can return to the user a list of the most relevant results found within the context of the SharePoint 2013® platform.

Occasionally, a user may want to find content in SharePoint 2013® that relates to geographic entities, or to other types of entities such as the organizations or people mentioned in a document. SharePoint 2013® does not provide out-of-the-box functionality for automatically extracting entities from documents. In particular, it does not support geotagging of content, that is, extracting geographic entities and resolving them to geographic locations. Nor does SharePoint 2013® support entity tagging for identifying unambiguous, accurately named entities such as the organizations or people in a document. SharePoint 2013® search can, however, be extended to enable effective geographic search and other entity-related search, including entity-based search facets. Versions of SharePoint before SharePoint 2013® included "FAST Search" for SharePoint, whose content processing pipeline could be extended through a sandboxed application, but this was slow and limited in the information it could access.

SharePoint 2013® introduces a much more open API that makes it possible to add specialized linguistic processing such as concept extraction, relationship extraction, geotagging, summarization, and sophisticated text analytics. There is therefore an opportunity to extend the functionality of the SharePoint 2013® search engine to enable geographic and other entity-based searches.

A method for searching for related entities using entity co-occurrence is disclosed. In one aspect of the present disclosure, the method may be employed in a search system that may include a client/server architecture. In one embodiment, the search system includes a user interface for a search engine that communicates with one or more server devices over a network connection. The server device may include an entity-indexed corpus of electronic data, an entity co-occurrence knowledge base database, and an entity extraction computer module. The knowledge base may be built as an in-memory database and may include other components, such as one or more search controllers, a plurality of search nodes, a collection of compressed data, and a disambiguation module. One search controller may be selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search over the collection of compressed data and return a set of scored results to its associated search controller.
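The controller/node arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names `SearchNode` and `SearchController`, the use of `zlib` for the compressed collection, and the use of `difflib.SequenceMatcher` as the fuzzy scorer are all hypothetical choices made for the example, not details given in the disclosure.

```python
import json
import zlib
from difflib import SequenceMatcher


class SearchNode:
    """Hypothetical search node: owns one compressed partition of records."""

    def __init__(self, records):
        # Keep the partition compressed in memory (zlib is an assumption).
        self._blob = zlib.compress(json.dumps(records).encode("utf-8"))

    def fuzzy_key_search(self, key, min_score=0.6):
        """Score every record key against `key`; return (score, record) pairs."""
        records = json.loads(zlib.decompress(self._blob).decode("utf-8"))
        scored = []
        for record in records:
            score = SequenceMatcher(None, key.lower(), record["key"].lower()).ratio()
            if score >= min_score:
                scored.append((score, record))
        return scored


class SearchController:
    """Hypothetical controller: fans a query out to its nodes and merges results."""

    def __init__(self, nodes):
        self._nodes = nodes

    def search(self, key, limit=5):
        results = []
        for node in self._nodes:  # each node searches independently
            results.extend(node.fuzzy_key_search(key))
        results.sort(key=lambda pair: pair[0], reverse=True)
        return results[:limit]


nodes = [
    SearchNode([{"key": "Bill Smith", "type": "person"}]),
    SearchNode([
        {"key": "Bill Smythe", "type": "person"},
        {"key": "Acme Corp", "type": "organization"},
    ]),
]
controller = SearchController(nodes)
hits = controller.search("Bill Smitth")  # misspelled query from the background section
```

Here the nodes are queried one after another for simplicity; because each node's search is independent, the same loop could be run in parallel without changing the merge step.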

In one embodiment, a computer-implemented method comprises: receiving, by an entity extraction computer, a search query comprising one or more entities from a client computer; comparing, by the entity extraction computer, each entity against one or more co-occurrences of that entity in a co-occurrence database; extracting, by the entity extraction computer, a subset of the one or more entities from the search query, in response to determining that each entity of the subset exceeds a confidence score of the co-occurrence database, based on a degree of certainty of co-occurrence of the extracted entity with one or more related entities in an electronic data corpus according to the co-occurrence database; assigning, by the entity extraction computer, an index identifier (index ID) to each of the plurality of extracted entities; storing, by the entity extraction computer, the index ID for each of the plurality of extracted entities in the electronic data corpus, which is indexed by an index ID corresponding to each of the one or more related entities; searching, by a search server computer, the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify index IDs of data records in which at least two of the plurality of extracted entities co-occur; and building, by the search server computer, a search result list containing the data records that correspond to the identified index IDs.
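The claimed steps (extract entities whose co-occurrence confidence clears a threshold, assign index IDs, then find records in which at least two extracted entities co-occur) can be illustrated with a small in-memory sketch. The data, the function names, the threshold value, and the reading of "confidence score" as the highest co-occurrence score stored for an entity are all assumptions made for the example.

```python
from itertools import count

# Hypothetical co-occurrence knowledge base: entity -> {related entity: confidence}.
CO_OCCURRENCE = {
    "acme corp": {"globex": 0.9, "bill smith": 0.7},
    "globex": {"acme corp": 0.9},
    "bill smith": {"acme corp": 0.7},
    "initech": {"globex": 0.2},
}

# Hypothetical entity-indexed corpus: record ID -> entities found in that record.
CORPUS = {
    "doc-1": {"acme corp", "globex"},
    "doc-2": {"acme corp"},
    "doc-3": {"globex", "bill smith", "acme corp"},
    "doc-4": {"initech"},
}


def extract_entities(query, threshold=0.5):
    """Keep entities mentioned in the query whose best confidence exceeds the threshold."""
    text = query.lower()
    extracted = []
    for entity, related in CO_OCCURRENCE.items():
        if entity in text and max(related.values(), default=0.0) > threshold:
            extracted.append(entity)
    return extracted


def assign_index_ids(entities):
    """Give each extracted entity a sequential index ID (the ID scheme is an assumption)."""
    ids = count(1)
    return {entity: next(ids) for entity in entities}


def search_co_occurring(entities, corpus, min_shared=2):
    """Return IDs of records in which at least `min_shared` extracted entities co-occur."""
    wanted = set(entities)
    return sorted(rid for rid, found in corpus.items() if len(wanted & found) >= min_shared)


query = "Acme Corp and Globex merger, also Initech"
entities = extract_entities(query)       # "initech" is dropped: confidence 0.2
index_ids = assign_index_ids(entities)
results = search_co_occurring(entities, CORPUS)
```

Substring matching stands in for real entity extraction here; it is enough to show how the confidence filter and the "at least two entities" rule shape the result list.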

In one embodiment, a system comprises one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, the plurality of computer modules including an entity extraction module configured to receive user input of search query parameters, and a search server module. The entity extraction module is configured to: extract a plurality of entities from the search query parameters by comparing each of the plurality of extracted entities with an entity co-occurrence database that includes a confidence score indicating a degree of certainty of co-occurrence of the extracted entity with one or more related entities in an electronic data corpus; assign an index identifier (index ID) to each of the plurality of extracted entities; and store the index ID for each of the plurality of extracted entities in the electronic data corpus, which is indexed by an index ID corresponding to each of the one or more related entities. The search server module is configured to search the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify index IDs of data records in which at least two of the plurality of extracted entities co-occur, and is further configured to build a search result list containing the data records that correspond to the identified index IDs.

In another embodiment, a non-transitory computer-readable medium has computer-executable instructions stored thereon, the computer-executable instructions causing: an entity extraction computer to receive user input of search query parameters; the entity extraction computer to extract a plurality of entities from the search query parameters by comparing each of the plurality of extracted entities with an entity co-occurrence database that includes a confidence score indicating a degree of certainty of co-occurrence of the extracted entity with one or more related entities in an electronic data corpus; the entity extraction computer to assign an index identifier (index ID) to each of the plurality of extracted entities; the entity extraction computer to store the index ID for each of the plurality of extracted entities in the electronic data corpus, which is indexed by an index ID corresponding to each of the one or more related entities; a search server computer to search the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify index IDs of data records in which at least two of the plurality of extracted entities co-occur; and the search server computer to build a search result list containing the data records that correspond to the identified index IDs.

A method for generating search suggestions using fuzzy-score matching and entity co-occurrence in a knowledge base is disclosed. In one aspect of the present disclosure, the method may be employed in a search system that may include a client/server type architecture. In one embodiment, the search system may include a user interface to a search engine that communicates with one or more server devices over a network connection. The server device may include an entity extraction computer module, a fuzzy-score matching computer module, and an entity co-occurrence knowledge base database. The knowledge base may be built as an in-memory database and may include other hardware and/or software components, such as one or more search controllers, a plurality of search nodes, a collection of compressed data, and a disambiguation computer module. One search controller may be selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search over the collection of compressed data and return a set of scored results to its associated search controller.

In another aspect of the present disclosure, the method may include an entity extraction module that performs partial entity extraction on the provided search query to identify whether the search query refers to an entity and, if so, what type of entity it refers to. The method may also include a fuzzy-score matching module that generates algorithms based on the type of the extracted entity and performs a search against the entity co-occurrence knowledge base. In addition, portions of the query text that are not detected as corresponding to entities are treated as conceptual features, such as topics, facts, and key phrases, that can be used to search the entity co-occurrence knowledge base. In an embodiment, the entity co-occurrence knowledge base includes a repository in which entities can be indexed as entities to entities, entities to topics, or entities to facts, among others, which allows fast and accurate suggestions to be returned to the user to complete the search query.

In one embodiment, a method is disclosed. The method comprises: receiving, by an entity extraction computer, user input of search query parameters from a user interface; extracting, by the entity extraction computer, one or more entities from the search query parameters by comparing the search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more entities in an electronic data corpus, and by identifying at least one entity type corresponding to the one or more entities in the search query parameters; and selecting, by a fuzzy-score matching computer, a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the search query parameters, wherein the fuzzy matching algorithm corresponds to the at least one identified entity type. The method further comprises: searching, by the fuzzy-score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm and forming one or more suggested search query parameters from the one or more records based on the search; and presenting, by the fuzzy-score matching computer, the one or more suggested search query parameters through the user interface.
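Selecting a fuzzy matching algorithm according to the identified entity type can be pictured as a simple dispatch table. The sketch below is an assumption-laden illustration: the two scoring functions, the `ALGORITHMS` table, the sample knowledge base, and the threshold are hypothetical, and the entity type is passed in directly rather than being identified by an extraction step.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Generic similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def person_score(query, candidate):
    # People: compare sorted tokens so "Smith Bill" still matches "Bill Smith".
    q = " ".join(sorted(query.lower().split()))
    c = " ".join(sorted(candidate.lower().split()))
    return ratio(q, c)


def prefix_score(query, candidate):
    # Organizations: favor candidates that begin with the typed text.
    q, c = query.lower(), candidate.lower()
    return 1.0 if c.startswith(q) else ratio(q, c[:len(q)])


# Hypothetical mapping from entity type to matching algorithm.
ALGORITHMS = {"person": person_score, "organization": prefix_score}

# Hypothetical slice of an entity co-occurrence knowledge base.
KNOWLEDGE_BASE = [
    {"name": "Bill Smith", "type": "person"},
    {"name": "Bill Smythe", "type": "person"},
    {"name": "Acme Corporation", "type": "organization"},
    {"name": "Acme Holdings", "type": "organization"},
]


def suggest(query, entity_type, limit=3, min_score=0.7):
    """Return suggested query terms, scored by the algorithm chosen for the type."""
    score_fn = ALGORITHMS.get(entity_type, ratio)
    scored = [
        (score_fn(query, record["name"]), record["name"])
        for record in KNOWLEDGE_BASE
        if record["type"] == entity_type
    ]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


person_suggestions = suggest("Smitth Bill", "person")
org_suggestions = suggest("Acme", "organization")
```

Keeping the algorithms behind a table keyed by entity type means a new type (for example, locations) only needs a new scoring function and one table entry.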

In another embodiment, a system is provided. The system includes one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, including an entity extraction module configured to receive user input of search query parameters from a user interface, and further configured to extract one or more entities from the search query parameters by comparing the search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more entities in an electronic data corpus, and by identifying at least one entity type corresponding to the one or more entities in the search query parameters. The system further includes a fuzzy-score matching module configured to select a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the search query parameters, wherein the fuzzy matching algorithm corresponds to the at least one identified entity type. The fuzzy-score matching module is further configured to search the entity co-occurrence database using the selected fuzzy matching algorithm, form one or more suggested search query parameters from the one or more records based on the search, and present the one or more suggested search query parameters through the user interface.

A method for generating search suggestions of related entities based on co-occurrence and/or fuzzy-score matching is disclosed. In one aspect of the present disclosure, the method may be employed in a computer search system that may include a client/server type architecture. In one embodiment, the search system includes a user interface to a search engine that communicates with one or more server devices over a network connection. The server device may include one or more processors executing instructions for a plurality of special-purpose computer modules, including an entity extraction computer module, a fuzzy-score matching computer module, and an entity co-occurrence knowledge base database. The knowledge base may be built as an in-memory database and may include other components, such as one or more search controllers, a plurality of search nodes, a collection of compressed data, and a disambiguation computer module. One search controller may be selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search over the collection of compressed data and return a set of scored results to its associated search controller.

In another aspect of the present disclosure, the method may include the entity extraction module performing partial entity extraction on the provided search query to identify whether the search query refers to an entity and, if so, to determine the entity type. The method may also include the fuzzy-score matching module generating algorithms that correspond to the type of the extracted entity and performing a search against the entity co-occurrence knowledge base. In addition, portions of the query text that are not detected as entities are treated as conceptual features, such as topics, facts, and key phrases, that can be used to search the entity co-occurrence knowledge base. The entity co-occurrence knowledge base, which already holds a repository in which entities can be indexed as entities to entities, entities to topics, or entities to facts, among others, returns fast and accurate suggestions to the user to complete the search query.

In a further aspect of the present disclosure, the completed search query may be used as a new search query. The search system processes the new search query, performs entity extraction, finds the related entities with the highest scores in the entity co-occurrence knowledge base, and presents those related entities to the user in a drop-down list.

In one embodiment, a method is disclosed. The method comprises: receiving, by an entity extraction computer, user input of partial search query parameters from a user interface, the partial search query parameters having at least one incomplete search query parameter; extracting, by the entity extraction computer, one or more first entities from the partial search query parameters by comparing the partial search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more first entities in an electronic data corpus and by identifying at least one entity type corresponding to the one or more first entities in the partial search query parameters; and selecting, by a fuzzy-score matching computer, a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the partial search query parameters, wherein the fuzzy matching algorithm corresponds to the at least one identified entity type. The method further includes: searching, by the fuzzy-score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm and forming one or more suggested first search query parameters from the one or more records based on the search; presenting, by the fuzzy-score matching computer, the one or more suggested first search query parameters via the user interface; receiving, by the entity extraction computer, a user selection of the one or more suggested first search query parameters so as to form completed search query parameters; and extracting, by the entity extraction computer, one or more second entities from the completed search query parameters. The method further includes: searching, by the entity extraction computer, the entity co-occurrence database to identify one or more entities related to the one or more second entities so as to form one or more suggested second search query parameters; and presenting, by the entity extraction computer, the one or more suggested second search query parameters via the user interface.
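By way of illustration only, the selection of a fuzzy matching algorithm according to the identified entity type might be sketched as follows. The disclosure does not specify the algorithms themselves, so the two scoring functions, the `MATCHERS` table, the `suggest` function, and the threshold value are all hypothetical names and assumptions introduced for this sketch.

```python
from difflib import SequenceMatcher


def ratio_score(partial: str, candidate: str) -> float:
    """Similarity in [0, 1]; tolerant of typos such as 'jon' for 'john'."""
    return SequenceMatcher(None, partial.lower(), candidate.lower()).ratio()


def prefix_score(partial: str, candidate: str) -> float:
    """Fraction of the candidate covered by the typed prefix, else 0.0."""
    p, c = partial.lower(), candidate.lower()
    return len(p) / len(c) if c and c.startswith(p) else 0.0


# Hypothetical mapping from entity type to matching algorithm (an assumption).
MATCHERS = {"person": ratio_score, "organization": prefix_score}


def suggest(partial, entity_type, records, threshold=0.5, limit=5):
    """Score each record with the matcher chosen for the entity type."""
    matcher = MATCHERS.get(entity_type, ratio_score)  # fallback is assumed
    scored = [(matcher(partial, name), name) for name in records]
    kept = [(score, name) for score, name in scored if score >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [name for _, name in kept[:limit]]


if __name__ == "__main__":
    print(suggest("jon smith", "person", ["Jane Doe", "John Smith"]))
    print(suggest("micro", "organization", ["Microsoft", "Apple", "Micron"]))
```

In this sketch the entity type only selects the scoring function; an actual embodiment could equally vary thresholds, tokenization, or the records searched per type.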

In another embodiment, a system is disclosed. The system includes one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, including an entity extraction module configured to receive user input of partial search query parameters from a user interface, the partial search query parameters having at least one incomplete search query parameter. The entity extraction module is further configured to extract one or more first entities from the partial search query parameters by comparing the partial search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more first entities in an electronic data corpus and by identifying at least one entity type corresponding to the one or more first entities in the partial search query parameters. The system further includes a fuzzy-score matching module configured to select a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the partial search query parameters, wherein the fuzzy matching algorithm corresponds to the at least one identified entity type. The fuzzy-score matching module is further configured to search the entity co-occurrence database using the selected fuzzy matching algorithm, form one or more suggested first search query parameters from the one or more records based on the search, and present the one or more suggested first search query parameters via the user interface. Additionally, the entity extraction module is further configured to receive a user selection of the one or more suggested first search query parameters so as to form completed search query parameters, extract one or more second entities from the completed search query parameters, search the entity co-occurrence database to identify one or more entities related to the one or more second entities so as to form one or more suggested second search query parameters, and present the one or more suggested second search query parameters via the user interface.

A method for obtaining search suggestions related to entities using entity and feature co-occurrence is disclosed. In one aspect of the present disclosure, the method may be employed in a search system that includes a client/server type architecture.

The search system uses a method that may employ entities stored on one or more servers, which provide for an entity database and a trends database. Entities in such databases may have a score, with indexing based on the higher score. The method for obtaining search suggestions may combine the information stored in the two databases to generate a single list of search suggestions. The trends database may provide previous search queries from one or more users on a local network and/or the Internet. The entity database may provide search suggestions based on entity extraction from a plurality of data available on a local network and/or the Internet. Such a list may provide the user with a more accurate and faster group of suggestions.

In one embodiment, a computer-implemented method comprises: receiving, by a computer, from a search engine a search query comprising one or more strings of data, wherein each entity corresponds to a subset of the one or more strings; identifying, by the computer, one or more entities in the one or more strings of data based on comparing the one or more entities against an entity database and a trends database; identifying, by the computer, one or more features in the one or more strings of data that are not identified as corresponding to at least one entity; assigning, by the computer, each of the one or more features to at least one of the one or more entities based on a matching algorithm; assigning, by the computer, an extraction score to each entity based on a score assigned to each feature assigned to that entity; receiving, by the computer, from the entity database a first search list containing one or more entities having a score within a threshold distance from the extraction score of each entity; receiving, by the computer, from the trends database a second search list containing one or more entities having a score within a threshold distance from the extraction score of each entity; generating, by the computer, an aggregated list comprising the first search list and the second search list, wherein the entities of the aggregated list are each ranked according to an aggregated score; and providing, by the computer, suggested searches according to the aggregated list.
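As a non-limiting sketch, the two search lists and their aggregation might be modeled as below. The disclosure does not state how the aggregated score is computed; summing the per-database scores is an assumption made here, and the function names, the sample entities, and all numeric values are hypothetical.

```python
def within_threshold(database: dict, extraction_score: float, threshold: float) -> dict:
    """Keep entities whose stored score lies within the threshold distance."""
    return {
        name: score
        for name, score in database.items()
        if abs(score - extraction_score) <= threshold
    }


def aggregate(first_list: dict, second_list: dict) -> list:
    """Merge both lists; the aggregated score is the sum (an assumption)."""
    totals = {}
    for hits in (first_list, second_list):
        for name, score in hits.items():
            totals[name] = totals.get(name, 0.0) + score
    # Highest aggregated score first; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    entity_db = {"Paris": 0.9, "Paris Hilton": 0.7, "Parish": 0.2}
    trends_db = {"Paris Hilton": 0.8, "Paris weather": 0.75}
    first = within_threshold(entity_db, extraction_score=0.8, threshold=0.15)
    second = within_threshold(trends_db, extraction_score=0.8, threshold=0.15)
    for name, score in aggregate(first, second):
        print(name, round(score, 2))
```

An entity that appears in both databases accumulates both scores in this sketch, so a suggestion supported by both extracted entities and past user queries tends to rank first.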

Systems and methods are disclosed herein for enabling geographic entity-based search in a content management system such as Microsoft's SharePoint 2013®. Embodiments are described. The method involves extending the SharePoint 2013® search architecture by adding a geographic tagging web service. The system includes a computer processor operatively associated with computer memory and one or more I/O devices, the processor and memory being configured to run one or more SharePoint 2013® processes. The system includes another computer processor operatively associated with computer memory and one or more I/O devices, that processor and memory being configured to host and provide processing for a geotagging web service. The SharePoint 2013® system may include a crawling component, a content processing component, and a search indexing component to enable searching of content. The content processing component in SharePoint 2013® search can extend its functionality using the Content Enrichment Web Service (CEWS) feature.

The method involves crawling content from different sources to obtain an array of crawled properties that are sent for content processing. During content processing, a trigger condition determines whether the crawled properties can benefit from additional processing in order to enrich the original content with additional geographic metadata properties. If the crawled properties do not benefit from additional processing, the crawled properties are mapped to managed processing and sent to the search index. If the crawled properties benefit from external web service processing, the CEWS makes a SOAP (simple object access protocol) request to a configurable endpoint using HTTP (hypertext transfer protocol) or any other web service call method. An entity enrichment service determines the content type. If the content is in an image format, its metadata, such as the file location, is sent to OCR (optical character recognition), where the original document is retrieved and processed asynchronously, converted to text, and sent back to the crawl component so that it can be re-crawled in text format. If the content is in text format, the geotagging web service identifies geographic metadata and associates it with the content as managed properties. Once the content has been geotagged, it is sent to the indexing component.
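The routing decisions described above can be sketched, purely for illustration, as a small dispatch function. This sketch does not call any SharePoint 2013® or CEWS interface; the `trigger_condition` rule, the property names (`content_type`, `body`, `file_location`, `geo`), the destination labels, and the toy `GAZETTEER` are all assumptions introduced here.

```python
import re

# Toy gazetteer standing in for a real geotagging web service (hypothetical data).
GAZETTEER = {"paris": (48.8566, 2.3522), "tokyo": (35.6762, 139.6503)}


def trigger_condition(props: dict) -> bool:
    """Assumed trigger: only text or image content can gain geographic metadata."""
    return props.get("content_type") in {"text", "image"}


def geotag(text: str) -> dict:
    """Return place names found in the text together with their coordinates."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word: GAZETTEER[word] for word in words if word in GAZETTEER}


def process(props: dict):
    """Return a (destination, payload) pair for one crawled item."""
    if not trigger_condition(props):
        # No benefit from enrichment: pass straight through to the index.
        return "search_index", props
    if props["content_type"] == "image":
        # Only metadata such as the file location is forwarded to OCR;
        # the converted text would later be re-crawled in text format.
        return "ocr", {"file_location": props.get("file_location")}
    enriched = dict(props, geo=geotag(props.get("body", "")))
    return "search_index", enriched


if __name__ == "__main__":
    print(process({"content_type": "text", "body": "Offices in Paris and Tokyo."}))
    print(process({"content_type": "image", "file_location": "/scans/a.png"}))
```

A production geotagger would also disambiguate place names and attach feature types; the dictionary lookup here only marks where that step would occur.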

An additional search user interface (UI) may be added using SharePoint 2013® web parts or by modifying the standard layout of SharePoint 2013® search with standard web development tools such as HTML, HTML 5, JavaScript, and CSS, among others. The search UI may assist the user in performing geographic search queries or in displaying geographic search results using digital geographic features such as, for example and without limitation, a digital map. The search UI may be enhanced to perform faceted search using additional enriched entities or the metadata associated with them.

Various other aspects, features, and advantages of the present disclosure will become apparent from the following detailed description.

The present disclosure may be better understood by reference to the following drawings. The components in the figures are not necessarily drawn to scale; emphasis is instead placed on illustrating the principles of the disclosure. In the figures, reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a block diagram illustrating an exemplary environment of a computer system in which one embodiment of the present disclosure operates.
FIG. 2 is a flow diagram illustrating a method for searching using entity co-occurrence, according to an embodiment.
FIG. 3 is a flow diagram illustrating an embodiment of a simple search in which the search results returned by the system include related entities of interest.
FIG. 4 is a block diagram illustrating an exemplary system environment in which one embodiment of the present disclosure operates.
FIG. 5 is a flow diagram illustrating a method for search suggestions using fuzzy-score matching and entity co-occurrence in a knowledge base, according to an embodiment.
FIG. 6 is a diagram illustrating an example of a user interface through which search suggestions can be generated using fuzzy matching and entity co-occurrence in the knowledge base of FIGS. 4 to 6.
FIG. 7 is a block diagram illustrating an exemplary system environment in which one embodiment of the present disclosure operates.
FIG. 8 is a flow diagram illustrating a method for generating search suggestions of related entities based on co-occurrence and/or fuzzy-score matching, according to an embodiment.
FIG. 9 is a diagram illustrating an exemplary embodiment of a user interface associated with the method shown in FIG. 8.
FIG. 10 is a block diagram illustrating a method for obtaining search suggestions based on entity and trends databases.
FIG. 11 is a block diagram illustrating a method for obtaining search suggestions based on entity and trends databases by generating a list of suggestions based on the individual score of the search suggestions in each of the entity and trends databases.
FIG. 12 is a block diagram illustrating a method for obtaining search suggestions based on entity and trends databases by generating a list of suggestions based on the overall score of the search suggestions across the entity and trends databases.
FIG. 13 is a diagram illustrating a system architecture for tagging content and enriching entities in a content management system.
FIG. 14 is a diagram illustrating a process for tagging and indexing content for named and geographic entity search.

Definitions

As used herein, the following terms may have the following definitions.

"Entity extraction" refers to information processing methods for extracting information such as names, places, and organizations.

"Corpus" refers to a collection of one or more documents.

"Features" refers to any information that is at least partially derived from a document.

"Event concept store" refers to a database of event template models.

"Event" refers to one or more features characterized by at least the occurrence of the features in real time.

"Event model" refers to a collection of data that can be used to compare against and identify a specific type of event.

"Module" refers to computer or software components suitable for carrying out at least one task.

"Feature attribute" refers to metadata associated with a feature, such as the location of the feature in a document and its confidence score, among others.

"Fact" refers to an objective relationship between features.

"Entity knowledge base" refers to a computer database containing features/entities.

"Query" refers to a computer-generated request to retrieve information from one or more suitable databases.

"Topic" refers to a set of thematic information that is at least partially derived from a corpus.

"Geotagging" refers to the process of extracting geographic entities from an unstructured text file, including disambiguating the entities to specific geographic places and attaching geographic metadata such as geographic coordinates, geographic feature type, and other metadata.

"Entity tagging" refers to the process of extracting named entities from unstructured text, including entity disambiguation, entity name normalization, and attachment of entity metadata.

"Named entity" refers to a person, organization, or topic.

"Geographic entity" refers to a geographic location or geographic place.

"Crawled properties" refers to content management system metadata obtained by examining documents during crawls.

Reference will now be made in detail to the preferred embodiments, examples of which are illustrated in the accompanying drawings. The embodiments described herein are illustrative. Those skilled in the art will recognize that numerous alternative components and embodiments may be substituted for the particular examples described herein and still fall within the scope of the invention. Other embodiments may be used and/or other changes may be made without departing from the spirit or scope of the present disclosure. The illustrative embodiments described in the detailed description are not meant to limit the subject matter presented herein.

It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Alterations and further modifications of the inventive features described herein, and additional applications of the principles of the invention described herein, which would occur to one skilled in the relevant art having possession of this disclosure, are to be considered within the scope of the invention.

The present disclosure describes systems and methods for detecting, extracting, and authenticating events from a plurality of sources. The sources may include news sources, social media websites, and/or any other source that may contain data related to events.

Various embodiments of the systems and methods disclosed herein collect data from different sources in order to identify independent events.

FIG. 1 shows a block diagram of a search system 100 in accordance with the present disclosure. The search system 100 may include one or more client computing devices comprising processor-executed software modules associated with the search system 100, which may include a graphical user interface 102 that accesses a search engine 104, the search engine 104 communicating search queries in the form of binary data with a server device 106 over a network 108. In the exemplary embodiment, the search system 100 may be implemented in a client-server computing architecture. However, it should be appreciated that the search system 100 may be implemented using other computer architectures (for example, a stand-alone computer, a mainframe system with terminals, an ASP (application service provider) model, a peer-to-peer model, and the like). The network 108 may comprise any suitable hardware and software modules capable of communicating digital data between computing devices, such as a local area network (LAN), a wide area network (WAN), the Internet, a wireless network, a mobile phone network, and the like. As such, it should be appreciated that the system 100 may be implemented over a single network 108 or using a plurality of networks 108.

A user's computing device 102 may access the search engine 104, which may include software modules capable of transmitting search queries. Search queries are parameters provided to the search engine 104 that indicate the desired information to retrieve. Search queries may be provided by a user or by another software application in any suitable data format (for example, integers, strings, complex objects) compatible with the parsing and processing routines of the search engine 104. In some embodiments, the search engine 104 may be a web-based tool that is accessible through a browser or other software application on the user's computing device 102 and that enables users or software applications to locate information on the World Wide Web. In some embodiments, the search engine 104 may be application software modules native to the system 100 that enable users or applications to locate information within the databases of the system 100.

The server device 106, which may be implemented as a single server device 106 or in a distributed architecture across a plurality of server computers, may include an entity extraction module 110, an entity co-occurrence knowledge base 112, and an entity index corpus 114. The entity extraction module 110 may be a computer software and/or hardware module able to extract and disambiguate independent entities from a given set of queries, such as a query string, structured data, and the like. Entities may include, for example, people, organizations, geographic locations, dates, and/or times. During the extraction, one or more feature recognition and extraction algorithms may be employed. In addition, a score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight or relevance of each feature may be determined. Additionally, the relevance of the association between features may be determined using a weighted scoring model.
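As a minimal sketch of one possible weighted scoring model, the certainty score of each extracted feature may be scaled by a weight tied to its entity type. The disclosure does not give the weights or the combining rule, so the `TYPE_WEIGHTS` values, the default weight, and the multiplication used in `rank_features` are assumptions made for illustration.

```python
# Hypothetical per-type weights; the disclosure does not specify any values.
TYPE_WEIGHTS = {"person": 1.0, "organization": 0.8, "location": 0.6}
DEFAULT_WEIGHT = 0.5  # assumed weight for entity types not listed above


def rank_features(features: list) -> list:
    """Rank extracted features by type weight multiplied by extraction confidence."""
    scored = [
        (f["text"], TYPE_WEIGHTS.get(f["type"], DEFAULT_WEIGHT) * f["confidence"])
        for f in features
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    extracted = [
        {"text": "Chicago", "type": "location", "confidence": 0.95},
        {"text": "Acme", "type": "organization", "confidence": 0.5},
        {"text": "Obama", "type": "person", "confidence": 0.9},
    ]
    for text, score in rank_features(extracted):
        print(text, round(score, 3))
```

Under these assumed weights a confidently extracted location can still rank below a person, which reflects the idea that some entity types carry more weight than others.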

According to various embodiments, the entity co-occurrence knowledge base 112 may be built as, but is not limited to, an in-memory computer database (not shown) and may include other components (not shown), such as one or more search controllers, a plurality of search nodes, collections of compressed data, and a disambiguation computer module. One search controller may be selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search through a collection of compressed data and return a set of scored results to its associated search controller.
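The relationship between a search controller and its search nodes might be sketched as follows, under stated assumptions. The class names `SearchNode` and `SearchController`, the use of `zlib` for the compressed collection, the `difflib` similarity ratio as the fuzzy key match, and the thread pool used to run nodes independently are all choices made for this illustration and are not taken from the disclosure.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


class SearchNode:
    """Holds one compressed collection of keys and searches it independently."""

    def __init__(self, keys):
        self._blob = zlib.compress("\n".join(keys).encode("utf-8"))

    def fuzzy_search(self, query, threshold=0.6):
        keys = zlib.decompress(self._blob).decode("utf-8").split("\n")
        hits = []
        for key in keys:
            score = SequenceMatcher(None, query.lower(), key.lower()).ratio()
            if score >= threshold:
                hits.append((key, score))
        return hits  # a set of scored results for the controller


class SearchController:
    """Fans a query out to its associated nodes and merges their results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def search(self, query, limit=5):
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda node: node.fuzzy_search(query), self.nodes))
        merged = [hit for part in partials for hit in part]
        merged.sort(key=lambda hit: (-hit[1], hit[0]))  # best score first
        return merged[:limit]


if __name__ == "__main__":
    controller = SearchController([
        SearchNode(["Microsoft", "Micron"]),
        SearchNode(["Apple", "Microsoft Research"]),
    ])
    print(controller.search("microsft"))
```

Each node decompresses and scans only its own collection, so nodes never need to coordinate with one another; the controller alone merges and orders the scored results.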

The entity co-occurrence knowledge base 112 may include related entities based on features and ranked by a confidence score. Various methods for linking the features may be employed, which may essentially use a weighted model for determining which entity types are most important and which have more weight, and for determining, based on confidence scores, how confidently the extraction of the correct features has been performed. The entity index corpus 114 may include data from a plurality of sources, such as the Internet, having a massive corpus or a live corpus.

FIG. 2 is a flow diagram illustrating a method for searching for related entities using entity co-occurrence, which may be implemented in a search system 100 such as that shown in FIG. 1. According to various embodiments, prior to the start of the method 200, an entity index corpus 114 similar to that shown in FIG. 1 is fed with data from a plurality of sources, such as a massive corpus or live corpus of electronic data (for example, the Internet, websites, blogs, word-processing files, plaintext files). The entity index corpus 114 may include a plurality of indexed entities, which may be constantly updated as new data is discovered.

In one embodiment, the method 200 starts at step 202, when a user or a software application of the computing device 102 provides the search engine 104 with one or more search queries containing one or more entities. The search system 100 may process from 1 to n of the search queries provided in step 202 at a time. For example, the search query in step 202 may be a combination of keywords in a form such as a string, structured data, or another suitable data format. In this exemplary embodiment of FIG. 2, the keywords of the search query may be entities representing people, organizations, geographic locations, dates, and/or times.

The search queries from step 202 may then be processed for entity extraction at step 204. In this step, the entity extraction module 110 processes the search queries from step 202 as entities and compares all of them against the entity co-occurrence knowledge base 112 in order to extract and disambiguate as many entities as possible. During extraction, one or more feature recognition and extraction algorithms may be employed. A score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight or relevance of each feature may be determined. Additionally, the relevance of the association between features may be determined using a weighted scoring model.

Furthermore, various methods for linking the features may be employed. These methods may essentially use a weighted model to determine which entity types are most important and which carry more weight, and may determine, based on the confidence scores, how confidently the correct features were extracted. Once the entities are extracted and ranked based on confidence scores, an index ID, which in some cases may be a number, may be assigned to each extracted entity at step 206.
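The description does not specify the weighted model. As a minimal sketch, assuming each entity type has a fixed weight and that the final score is simply the product of that weight and the extraction confidence, the ranking could look like the following. The weights, the sample confidences, and the names `TYPE_WEIGHTS` and `weighted_score` are all hypothetical.

```python
# Hypothetical per-type weights; the disclosure does not give concrete values.
TYPE_WEIGHTS = {"person": 1.0, "organization": 0.8, "location": 0.6, "date": 0.4}


def weighted_score(entity_type, confidence, default_weight=0.5):
    """Combine an assumed type weight with the extraction confidence."""
    return TYPE_WEIGHTS.get(entity_type, default_weight) * confidence


# (entity, type, extraction confidence) triples, invented for illustration.
extracted = [
    ("2014", "date", 0.95),
    ("Jobs", "person", 0.6),
    ("Apple", "organization", 0.9),
]

# Rank extracted entities from most to least relevant.
ranked = sorted(extracted, key=lambda e: weighted_score(e[1], e[2]), reverse=True)
```

A multiplicative combination is only one plausible choice; an additive or learned model would fit the text equally well.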

Next, at step 208, a search based on the entity index IDs assigned at step 206 may be performed. In search step 208, the extracted entities may be located within the data of the entity index corpus 114 using standard indexing methods. Once the extracted entities are located, the method proceeds to entity association step 210. In entity association step 210, all data in which at least two extracted entities overlap, such as documents, videos, pictures, and files, is pulled from the entity index corpus 114. Finally, at step 212, a list of potential results is built, sorted by relevance, and presented to the user as search results. The list of results shows only links to data in which the user can find related entities of interest.

FIG. 3 is a specific example of a method 300 for searching for related entities using entity co-occurrence, as described above in connection with FIG. 2. As in FIG. 2, according to various embodiments, before method 300 starts, an entity index corpus 114 similar to that shown in FIG. 1 may already have been fed with data from a plurality of sources, such as a massive corpus or live corpus (the Internet). The entity index corpus 114 may include a plurality of indexed entities that may be updated as new data is discovered.

In this exemplary embodiment, a user may be looking for information about "jobs" at the company "Apple". To that end, the user may enter one or more entities (the search query of step 302) through a user interface 102, which may be, but is not limited to, an interface with a search engine 104 such as the one described in FIG. 1. By way of illustration and not limitation, the user may enter a combination of entities such as "Apple + Jobs". Next, the search engine 104 generates the search queries of step 302 and sends them to the server device 106 for processing. At the server device 106, the entity extraction module 110 may perform entity extraction step 304 on the search query entered at step 302.

The entity extraction module 110 processes the search queries entered at step 302, such as "Apple" and "Jobs", as entities and compares all of them against the entity co-occurrence knowledge base 112 in order to extract and disambiguate as many entities as possible. During extraction, one or more feature recognition and extraction algorithms may be employed. A score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight and relevance of each feature may be determined. The relevance of the association between features may be determined using a weighted scoring model.

Furthermore, various methods for linking the features may be employed. These methods may essentially use a weighted model to determine which entity types are most important and which carry more weight, and may determine, based on the confidence scores, how confidently the correct features were extracted. As a result, a table 306 containing the entity and its co-occurrences may be created. Table 306 shows the entity "Apple" and its co-occurrences, which in this case may be Apple and Jobs, and Apple and Steve Jobs. Table 306 may also include Apple and an organization A, for which a relation was found because organization A does business with Apple and thereby generates "jobs" at organization A. Other, less important co-occurrences may also be found. In this case, Apple and Jobs may have the highest score (1) and therefore be listed at the top, followed by Apple and Steve Jobs with the second-highest score (0.8), and finally Apple and organization A at the bottom with the lowest score (0.3).

Once the entities are extracted and ranked based on confidence scores, an index ID, which in some cases may be a number, may be assigned to each extracted entity at step 308. Table 310 shows the index IDs assigned to the extracted entities: "Apple" with index ID 1, "Jobs" with index ID 2, "Steve Jobs" with index ID 3, and "organization A" with index ID 4.
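The ranking of table 306 and the ID assignment of table 310 can be sketched as follows. The scores and entity names are those of the example above; the rule that IDs are handed out sequentially in ranked order is an assumption that happens to reproduce table 310, and the function name `rank_and_index` is hypothetical.

```python
# Co-occurrence pairs with the confidence scores given for table 306.
cooccurrences = [
    ("Apple", "Organization A", 0.3),
    ("Apple", "Jobs", 1.0),
    ("Apple", "Steve Jobs", 0.8),
]


def rank_and_index(pairs):
    """Sort pairs by descending score, then number entities in that order."""
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    index_ids = {}
    for left, right, _score in ranked:
        for entity in (left, right):
            if entity not in index_ids:
                # Assumed scheme: next free integer, starting at 1.
                index_ids[entity] = len(index_ids) + 1
    return ranked, index_ids


ranked, index_ids = rank_and_index(cooccurrences)
```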

Next, search step 312 may be performed based on the entity index IDs assigned at step 308. In search step 312, the extracted entities, such as "Apple", "Jobs", "Steve Jobs", and "organization A", may be located within the data of the entity index corpus 114 using standard indexing methods.

After the extracted entities are located within the entity index corpus 114, entity association step 314 is performed. In entity association step 314, all data in which at least two extracted entities overlap, such as documents, videos, pictures, and files, is pulled from the entity index corpus 114 to build a list of links as search results (step 318). By way of illustration and not limitation, table 316 shows how the extracted entities are associated with data in the entity index corpus 114. In table 316, documents 1, 4, 5, 7, 8, and 10 show an overlap of two extracted entities, so links to these documents appear as search results at step 318.
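The overlap test of the entity association step can be sketched with an inverted index that maps each entity index ID to the documents containing it. The postings below are invented for illustration and do not reproduce table 316; the function name `documents_with_overlap` is likewise hypothetical.

```python
from collections import Counter

# Hypothetical inverted index: entity index ID -> IDs of documents containing it.
postings = {
    1: {1, 2, 4, 5},  # "Apple"
    2: {1, 4, 6},     # "Jobs"
    3: {5, 9},        # "Steve Jobs"
    4: {3},           # "Organization A"
}


def documents_with_overlap(postings, entity_ids, min_entities=2):
    """Return documents in which at least `min_entities` of the entities occur."""
    counts = Counter()
    for entity_id in entity_ids:
        for doc_id in postings.get(entity_id, ()):
            counts[doc_id] += 1
    hits = [doc for doc, n in counts.items() if n >= min_entities]
    # More overlapping entities first; ties broken by document ID.
    return sorted(hits, key=lambda doc: (-counts[doc], doc))


results = documents_with_overlap(postings, [1, 2, 3, 4])
```

Sorting by the number of overlapping entities is a simple stand-in for the relevance ordering mentioned in the text.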

FIG. 4 is a block diagram of a search computer system 400 in accordance with the present disclosure. The search system 400 may include one or more user interfaces 402 to a search engine 404 that communicates with a server device 406 over a network 408. In this embodiment, the search system 400 may be implemented on one or more special-purpose computers and computer modules referenced below, including via a client/server type architecture. However, the search system 400 may also be implemented using other computer architectures (e.g., a stand-alone computer, a mainframe system with terminals, an ASP model, a peer-to-peer model). In an embodiment, the search computer system 400 may include a plurality of networks, such as a local area network (LAN), a wide area network (WAN), the Internet, a wireless network, and a mobile phone network.

The search engine 404 may include a user interface, such as a web-based tool, that enables users to locate information on the World Wide Web. The search engine 404 may also include user interface tools that enable users to locate information within internal database systems. The server device 406, which may be implemented on a single server device 406 or in a distributed architecture spanning multiple server computers, may include an entity extraction module 410, a fuzzy-score matching module 412, and an entity co-occurrence knowledge base database 414.

The entity extraction module 410 may be a hardware and/or software module configured to extract and disambiguate independent entities on the fly from a given set of queries, such as a query string, a partial query, or structured data. For example, entities may include people, organizations, geographic locations, dates, and/or times. During extraction, one or more feature recognition and extraction algorithms may be employed. A score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight and relevance of each feature may be determined. Additionally, the relevance of the association between features may be determined using a weighted scoring model.

The fuzzy-score matching module 412 may include a plurality of algorithms that may be selected according to the type of entity extracted from a given search query. The function of these algorithms is to determine whether the given search query received via user input and the other searched strings identified by the algorithm are similar to each other, or approximately match a given pattern string. Fuzzy matching is also known as fuzzy string matching, inexact matching, and approximate matching. The entity extraction module 410 and the fuzzy-score matching module 412 work in conjunction with the entity co-occurrence knowledge base 414 to generate search suggestions for the user.

According to various embodiments, the entity co-occurrence knowledge base 414 may be built as an in-memory database and may include, but is not limited to, components such as one or more search controllers, a plurality of search nodes, collections of compressed data, and a disambiguation module. One search controller may be selectively associated with one or more search nodes. Each search node may independently perform a fuzzy key search through a collection of compressed data and return a set of scored results to its associated search controller.
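The controller and node arrangement can be sketched as follows, assuming each node holds its partition as a zlib-compressed JSON blob and scores candidates with the standard library's `difflib.SequenceMatcher`. The class names, the compression format, the similarity measure, and the threshold are all assumptions; the disclosure specifies none of them.

```python
import difflib
import heapq
import json
import zlib


class SearchNode:
    """Holds one partition of records as a compressed in-memory blob."""

    def __init__(self, records):
        self._blob = zlib.compress(json.dumps(records).encode("utf-8"))

    def fuzzy_search(self, key, threshold=0.6):
        """Return (score, record) pairs whose similarity to `key` passes the threshold."""
        records = json.loads(zlib.decompress(self._blob).decode("utf-8"))
        scored = []
        for record in records:
            score = difflib.SequenceMatcher(None, key.lower(), record.lower()).ratio()
            if score >= threshold:
                scored.append((score, record))
        return scored


class SearchController:
    """Fans a query out to its nodes and merges their scored results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def search(self, key, limit=3):
        merged = []
        for node in self.nodes:  # each node searches its own partition independently
            merged.extend(node.fuzzy_search(key))
        return heapq.nlargest(limit, merged)


controller = SearchController([
    SearchNode(["Michael Jordan", "Michael Jackson"]),
    SearchNode(["Michael J. Fox", "Steve Jobs"]),
])
top = controller.search("Michael Jordan", limit=1)
```

In a real deployment the nodes would presumably run in parallel on separate processes or machines; the sequential loop here only illustrates the data flow.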

The entity co-occurrence knowledge base 414 may include related entities that are based on features and ranked by a confidence score. Various methods for linking the features may be employed. These methods may essentially use a weighted model to determine which entity types are most important and which carry more weight, and may determine, based on the confidence scores, how confidently the correct features were extracted.

FIG. 5 is a flowchart illustrating a method 500 for generating search suggestions using fuzzy-score matching and entity co-occurrence in a knowledge base. Method 500 may be implemented in a search system 400 similar to that shown in FIG. 4.

In one embodiment, method 500 starts when a user begins typing a search query at step 502 into the search engine interface 402 shown in FIG. 4. As the search query is typed at step 502, the search system 400 performs an on-the-fly process. According to various embodiments, the search query entered at step 502 may be complete or partial, and may be spelled correctly or misspelled. Next, in the search system 400, a partial entity extraction step 504 may be performed on the search query entered at step 502. Partial entity extraction step 504 runs a quick search against the entity co-occurrence knowledge base 414 to identify whether the search query entered at step 502 is an entity and, if so, what type of entity it is. According to various embodiments, the search query entered at step 502 may refer to a person, an organization, the location of a place, or a date, among others. Once the entity type of the search query input is identified, the fuzzy-score matching module 412 may select a corresponding fuzzy matching algorithm at step 506. For example, if the search query was identified as an entity referring to a person, the fuzzy-score matching module 412 selects a string matching algorithm for persons, for example one that extracts the different components of a person's name, including first, middle, and last names and title. In another embodiment, if the search query was identified as an entity referring to an organization, the fuzzy-score matching module 412 may select a string matching algorithm for organizations, which may include identifying terms such as school, university, corp., and inc. In this way, the fuzzy-score matching module 412 selects the string matching algorithm that corresponds to the type of entity identified in the search query input, in order to improve the search. Once the string matching algorithm has been tailored to the identified entity type, fuzzy-score matching step 508 may be performed.
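Selecting a matcher by entity type can be sketched as a simple dispatch table. The normalizers below only prepare the comparison keys (name components for persons, name without identifying terms for organizations); the term lists and every function name are hypothetical, chosen to mirror the examples in the paragraph above.

```python
import re

# Hypothetical vocabularies; the disclosure only gives examples of such terms.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
ORG_TERMS = {"inc", "corp", "corporation", "company", "university", "school", "college"}


def _tokens(text):
    """Lowercase alphanumeric tokens; used as the generic fallback matcher key."""
    return re.findall(r"[a-z0-9]+", text.lower())


def person_key(text):
    """Split a person name into first, middle, and last components, dropping titles."""
    parts = [t for t in _tokens(text) if t not in TITLES]
    if not parts:
        return {}
    return {
        "first": parts[0],
        "middle": parts[1:-1],
        "last": parts[-1] if len(parts) > 1 else "",
    }


def organization_key(text):
    """Drop identifying terms such as 'Inc' so only the distinctive name remains."""
    return [t for t in _tokens(text) if t not in ORG_TERMS]


MATCHERS = {"person": person_key, "organization": organization_key}


def select_matcher(entity_type):
    """Pick the key builder for the identified entity type (step 506 analogue)."""
    return MATCHERS.get(entity_type, _tokens)
```

For instance, `select_matcher("person")("Dr. Michael J. Fox")` yields the components `michael`, `j`, and `fox`, which a type-specific comparison could then score individually.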

In fuzzy-score matching step 508, the extracted entity or entities, as well as any non-entities, are searched for and compared against the entity co-occurrence knowledge base 414. The extracted entity or entities may include, among others, incomplete names of persons (for example, the first characters of the first and last names), abbreviations of organizations (for example, "UN" standing for "United Nations"), short forms, and nicknames. The entity co-occurrence knowledge base 414 may already hold a plurality of registered records indexed as structured data, such as entity to entity, entity to topic, and entity to fact, among others. This pre-indexing allows the fuzzy-score matching at step 508 to be carried out very quickly. The fuzzy-score matching at step 508 may use, but is not limited to, common string metrics such as Levenshtein distance, strcmp95, and ITF scoring. The Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other.
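The Levenshtein distance defined above can be computed with the standard dynamic-programming recurrence. The following is a textbook implementation given for illustration, not code from the disclosure.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

A misspelled query such as "Micheal" is two substitutions away from "Michael", so a small distance threshold would still match it.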

Finally, once fuzzy-score matching step 508 has finished searching and comparing the search query against all records in the entity co-occurrence knowledge base 414, the record that best or most closely matches the given pattern string (i.e., the search query entered at step 502) may be selected at step 510 as the first candidate for a search suggestion. Other records that match the given pattern string less closely are placed below the first candidate in descending order. The search suggestions of step 510 may be presented to the user in a drop-down list of possible matches, which the user may or may not ignore.
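Ordering candidates from closest to least close can be sketched with `difflib.get_close_matches` from the Python standard library, which returns matches sorted by descending similarity. It is used here as a stand-in for the metrics named above; the record list, the cutoff, and the function name `suggestions` are assumptions.

```python
import difflib

# Hypothetical records standing in for the knowledge base contents.
RECORDS = ["Michael Jackson", "Steve Jobs", "Michael Jordan", "Michael J. Fox"]


def suggestions(query, records=RECORDS, limit=3, cutoff=0.5):
    """Return up to `limit` records, best match first, for a drop-down list."""
    return difflib.get_close_matches(query, records, n=limit, cutoff=cutoff)


# A misspelled query still surfaces the intended record as the first candidate.
drop_down = suggestions("Micheal Jordan")
```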

FIG. 6 is an exemplary user interface 600 for the method of generating search suggestions using fuzzy-score matching and entity co-occurrence in a knowledge base, as described above with reference to FIGS. 4 and 5. In this example, a user enters a partial query 604 into a search box 606 through a search engine interface 602 similar to that shown in FIG. 4. By way of illustration and not limitation, the partial query 604 may be an incomplete name of a person, such as "Michael J", as shown in FIG. 6. The query 604 is considered partial because the user may not yet have selected the search button 608 to perform the actual search and obtain results, or may not otherwise have submitted the partial query 604 to the search system 400.

Following method 500 (FIG. 5), as the user types "Michael J", the entity extraction module 410 performs a quick search on the fly for the first word (Michael) against the entity co-occurrence knowledge base 414 to identify the type of entity. In this example, the entity may refer to the name of a person. Consequently, the fuzzy-score matching module 412 may select a string matching algorithm tailored to names of persons. Names of persons may be found in databases written in different forms, for example using only initials (short forms), the first characters of the first and last names, the initials of the first, middle, and last names, or any combination of these. The fuzzy-score matching module 412 may use a common string metric, such as Levenshtein distance, to determine and assign a score to each entity, topic, or fact within the entity co-occurrence knowledge base 414 that may match the entity "Michael". In this example, Michael matches a large number of records containing that name. However, as the user types the following character "J", the fuzzy-score matching module 412 may perform another comparison, based on Levenshtein distance, against all co-occurrences with Michael in the entity co-occurrence knowledge base 414. The entity co-occurrence knowledge base 414 selects all possible matches with the highest scores for "Michael J". For example, the fuzzy-score matching module 412 may in some cases return search suggestions 610 such as "Michael Jackson", "Michael Jordan", "Michael J. Fox", or "Michael Dell" to the user. The user may then select one of the suggested persons from the drop-down list to complete the search query. Expanding on this example, a query such as "Michael the basketball player" would lead to the suggestion "Michael Jordan", based on the results returned by searching the entity co-occurrence knowledge base for "Michael" among person entity name variations and for "the basketball player" among co-occurrence features such as key phrases, facts, and topics. Another example may be "Alexander the actor", which would lead to the suggestion "Alexander Polinsky". Those skilled in the art will appreciate that existing search platforms cannot generate suggestions in the manner described above.
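The two query shapes in this example, a partial name and a name combined with a co-occurring feature, can be sketched as follows. The feature sets attached to each name are invented placeholders, exact prefix matching is used where the described system would apply fuzzy scoring (so a looser match such as "Michael Dell" is not returned here), and the names `KNOWLEDGE_BASE` and `suggest` are hypothetical.

```python
# Hypothetical knowledge-base records: entity name -> co-occurring features.
KNOWLEDGE_BASE = {
    "Michael Jordan": {"basketball player"},
    "Michael Jackson": {"singer"},
    "Michael J. Fox": {"actor"},
    "Michael Dell": {"businessman"},
}


def suggest(query):
    """Suggest entity names for a partial name or a '<name> the <feature>' query."""
    q = query.lower().strip()
    # Case 1: the query is a prefix of one or more entity names.
    by_prefix = [name for name in KNOWLEDGE_BASE if name.lower().startswith(q)]
    if by_prefix:
        return sorted(by_prefix)
    # Case 2: split into a name part and a co-occurrence feature.
    name_part, separator, feature = q.partition(" the ")
    if not separator:
        return []
    return sorted(
        name
        for name, features in KNOWLEDGE_BASE.items()
        if name.lower().startswith(name_part) and feature in features
    )
```

With these placeholder records, `suggest("Michael J")` returns the three names beginning with "Michael J", and `suggest("Michael the basketball player")` narrows the result to "Michael Jordan".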

FIG. 7 is a block diagram of a search system 700 in accordance with the present disclosure. The search system 700 may include one or more user interfaces 702 to a search engine 704 that communicates with a server device 706 over a network 708. In this embodiment, the search system 700 may be implemented in a client/server type architecture. However, the search system 700 may also be implemented using other computer architectures (e.g., a stand-alone computer, a mainframe system with terminals, an ASP model, a peer-to-peer model) and a plurality of networks, such as a local area network (LAN), a wide area network (WAN), the Internet, a wireless network, and a mobile phone network.

The search engine 704 may include, but is not limited to, an interface via a web-based tool that enables users to locate information on the World Wide Web. The search engine 704 may also include tools that enable users to locate information within internal database systems. The server device 706, which may be implemented on a single server device 706 or in a distributed architecture spanning multiple server computers, may include an entity extraction module 710, a fuzzy-score matching module 712, and an entity co-occurrence knowledge base database 714.

The entity extraction module 710 may be a hardware and/or software module configured to extract and disambiguate independent entities on the fly from a given set of queries, such as a query string, a partial query, or structured data. For example, entities may include people, organizations, geographic locations, dates, and/or times. During extraction, one or more feature recognition and extraction algorithms may be employed. A score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight and relevance of each feature may be determined. Additionally, the relevance of the association between features may be determined using a weighted scoring model.

The fuzzy-score matching module 712 may include a plurality of algorithms that may be adjusted or selected according to the type of entity extracted from a given search query. The function of these algorithms is to determine whether the given search query (input) and the suggested searched strings are similar to each other, or approximately match a given pattern string. Fuzzy matching is also known as fuzzy string matching, inexact matching, and approximate matching. The entity extraction module 710 and the fuzzy-score matching module 712 work in conjunction with the entity co-occurrence knowledge base 714 to generate search suggestions for the user.

According to various embodiments, the entity co-occurrence knowledge base 714 may be built as an in-memory database and may include, but is not limited to, components such as one or more search controllers, a plurality of search nodes, collections of compressed data, and a disambiguation module. One search controller may be selectively associated with one or more search nodes. Each search node may independently perform a fuzzy key search through a collection of compressed data and return a set of scored results to its associated search controller.

The entity co-occurrence knowledge base 714 may include related entities that are based on features and ranked by a confidence score. Various methods for linking the features may be employed. These methods may essentially use a weighted model to determine which entity types are most important and which carry more weight, and may determine, based on the confidence scores, how confidently the correct features were extracted.

FIG. 8 is a flowchart illustrating an embodiment of a method 800 for generating search suggestions of related entities based on co-occurrence and/or fuzzy-score matching. Method 800 may be implemented in a search system 700 similar to that shown in FIG. 7.

In one embodiment, method 800 begins when a user starts typing a search query, at step 802, into the search engine 702 shown in FIG. 7. As the search query is typed, the search system 700 performs an on-the-fly process. In various embodiments, the search query may be complete and/or partial, and may be spelled correctly or incorrectly. Next, a partial entity extraction step 804 may be performed on the search query. The partial entity extraction step 804 runs a quick search against the entity co-occurrence knowledge base 714 to identify whether the search query includes an entity and, if so, the type of the entity. According to various embodiments, an entity in the search query may refer to a person, an organization, the location of a place, or a date, among others. If an entity is present, the fuzzy-score matching module 712 may select a corresponding fuzzy matching algorithm at step 806. For example, if the search query is identified as an entity that refers to a person, the fuzzy-score matching module 712 adjusts or selects a string matching algorithm for persons, which can extract the different components of a person's name, including, for example, first, middle, and last names and title. In another embodiment, if the search query is identified as an entity that refers to an organization, the fuzzy-score matching module 712 may adjust or select a string matching algorithm for organizations, which may include identifying terms such as school, university, corp., inc., and the like. The fuzzy-score matching module 712 can thus adjust or select the string matching algorithm for the type of entity in order to facilitate the search. Once the string matching algorithm has been adjusted or selected to correspond to the type of entity, fuzzy-score matching may be performed at step 808.
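
One way to picture the type-dependent selection of step 806 is a dispatch table keyed by entity type. The sketch below is only an illustration under assumptions: the function names, the `TITLES` and `ORG_TERMS` word lists, and the idea of normalizing a string before matching are hypothetical and are not taken from the disclosure.

```python
import re

# Hypothetical word lists; a real system would use far richer resources.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
ORG_TERMS = {"school", "university", "corp", "inc", "llc"}


def normalize_person(text):
    # Drop titles so that "Dr. Michael J. Fox" compares as "michael j fox".
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in TITLES)


def normalize_organization(text):
    # Drop generic identifying terms so that "Qbase, LLC" compares as "qbase".
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in ORG_TERMS)


def normalize_default(text):
    # Fallback for entity types without a dedicated routine.
    return " ".join(text.lower().split())


NORMALIZERS = {
    "person": normalize_person,
    "organization": normalize_organization,
}


def select_normalizer(entity_type):
    """Pick the string-preparation routine for the detected entity type."""
    return NORMALIZERS.get(entity_type, normalize_default)
```

For example, `select_normalizer("person")("Dr. Michael J. Fox")` yields `"michael j fox"`, which can then be passed to whichever string metric is used in step 808.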

In the fuzzy-score matching step 808, the extracted entity or entities, together with any non-entities, are searched for and compared against the entity co-occurrence knowledge base 714. The extracted entity or entities may include, among others, an incomplete name of a person (for example, the first characters of the first or last name), and abbreviations, short forms, and nicknames of organizations (for example, "UN" for "United Nations"). The entity co-occurrence knowledge base 714 may already hold a large number of records indexed as structured data, such as entity-to-entity, entity-to-topic, and entity-to-facts indexes, among others. This allows the fuzzy-score matching at step 808 to be performed quickly. Fuzzy-score matching may use, but is not limited to, common string metrics such as Levenshtein distance, strcmp95, and ITF scoring. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
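
For reference, the Levenshtein distance mentioned above can be computed with the standard dynamic-programming recurrence. The function below is a generic textbook sketch, not code from the disclosure, and the name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]
```

As a worked case, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.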

Once the fuzzy-score matching of step 808 has finished comparing and searching the search query against every record in the entity co-occurrence knowledge base 714, at step 810 the record that best or most closely matches the given pattern string of the search query input may be selected as the first candidate for a search suggestion. Other records that match the pattern string less closely are placed below the first candidate in descending order. The search suggestions of step 810 may be presented to the user as a drop-down list of possible matches from which the user can select to complete the query.

In another embodiment, after the user selects the match of interest, the search system 700 takes that selection as a new search query at step 812. Subsequently, an entity extraction step 814 may be performed on the new search query. During extraction, one or more feature recognition and extraction algorithms may be employed. In addition, a score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight and relevance of each feature may be determined. Additionally, the relevance of the associations between features may be determined using a weighted scoring model. The entity extraction module 710 may then run a search against the entity co-occurrence knowledge base 714, at step 816, to find related entities based on the co-occurrences with the highest scores. Finally, a drop-down list of search suggestions that includes the related entities may be presented to the user at step 818, before an actual search of the data in the electronic document corpus is performed.

FIG. 9 is an exemplary embodiment of a user interface 900 associated with the method 800 for generating search suggestions of related entities based on co-occurrence and/or fuzzy-score matching. In this example, a user enters a partial query 904 into a search box 906 through a search engine interface 902 similar to that shown in FIG. 7. By way of illustration and not limitation, the partial query 904 may be an incomplete name of a person such as "Michael J", as shown in FIG. 9. The query is regarded as a partial query 904 because the user may not yet have selected the search button 908, or otherwise submitted the partial query 904 to the search system 700, to run an actual search and obtain results.

Following method 800, as the user types "Michael J", the entity extraction module 710 performs a quick on-the-fly search of the first word ("Michael") against the entity co-occurrence knowledge base 714 to identify the type of entity. In this example, the entity may refer to a person's name. Subsequently, the fuzzy-score matching module 712 may select a string matching algorithm tailored to names of persons. A person's name may appear in a database written in different forms, for example using only initials (a short form), or the first characters of the first or last name, or the initials of the first, middle, and last names, or any combination of these. The fuzzy-score matching module 712 may use a common string metric such as Levenshtein distance to determine and assign a score to each entity, topic, or fact in the entity co-occurrence knowledge base 714 that may match the entity "Michael". In this example, "Michael" matches a large number of records containing that name. However, as the user types the next character, "J", the fuzzy-score matching module 712 may perform another comparison, based on Levenshtein distance, against all co-occurrences of "Michael" in the entity co-occurrence knowledge base 714. The entity co-occurrence knowledge base 714 then selects all possible matches with the highest scores for "Michael J". For example, the fuzzy-score matching module 712 may return search suggestions 910 to the user for completing "Michael J", such as "Michael Jackson", "Michael Jordan", "Michael J. Fox", or, in some cases, "Michael Dell". The user may select one of the suggested persons from the drop-down list, or ignore the suggestions and keep typing. Extending the example above, a query such as "Michael the basketball player" may lead to the suggestion "Michael Jordan", based on the results returned by searching the entity co-occurrence knowledge base for "Michael" among person entity name variants and for "the basketball player" among co-occurring features such as key phrases, facts, and topics. Another example may be "Alexander the actor", which may lead to the suggestion "Alexander Polinsky". Those skilled in the art will appreciate that existing search platforms cannot provide suggestions generated in the manner described above.
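
The "Michael J" walk-through can be approximated with a small type-ahead sketch. It rests on several assumptions that are not in the disclosure: the query is compared only with the equally long prefix of each record, `difflib.SequenceMatcher.ratio()` replaces the Levenshtein-based score, and the 0.8 cutoff and the record list are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical person records; the disclosure does not list the stored data.
RECORDS = [
    "Michael Jackson",
    "Michael Jordan",
    "Michael J. Fox",
    "Michael Dell",
    "United Nations",
]


def suggest(partial_query, records, cutoff=0.8):
    """Rank records by how closely their prefix matches the partial query."""
    query = partial_query.lower()
    scored = []
    for record in records:
        prefix = record.lower()[:len(query)]  # compare like-for-like lengths
        score = SequenceMatcher(None, query, prefix).ratio()
        if score >= cutoff:
            scored.append((record, score))
    # Stable sort: equally scored records keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [record for record, _ in scored]


suggestions = suggest("Michael J", RECORDS)
```

Under these assumptions the three names whose prefix is exactly "Michael J" score 1.0, and "Michael Dell" still clears the cutoff as a near miss, which mirrors the "in some cases" behaviour described above.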

In this embodiment, the user may select "Michael Jordan" from the drop-down list to complete the partial query 904 shown in FIG. 9. The selection may be processed by the search system 700 as a new search query 912. Subsequently, entity extraction may be performed on the new search query. During extraction, one or more feature recognition and extraction algorithms may be employed. In addition, a score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight and relevance of each feature may be determined. Additionally, the relevance of the associations between features may be determined using a weighted scoring model. The entity extraction module 710 may run a search for "Michael Jordan" against the entity co-occurrence knowledge base 714 to find related entities based on the co-occurrences with the highest scores. Finally, a drop-down list of search suggestions 914 that includes the related entities may be presented to the user before an actual search is performed by clicking the search button 908. The systems and methods described with reference to FIGS. 7 to 9 are fast and convenient for the user, because they allow the user to discover useful relationships.

FIG. 10 is a block diagram of a search system 1000 according to the present disclosure. The search system 1000 may include a search engine 1002, and the search engine 1002 may include one or more user interfaces that accept data input from a user, such as user queries.

The search system 1000 may include one or more databases. These databases may include an entity database 1004 and a trend database 1006. The databases may be stored on a local server or on a web-based server. Accordingly, although the search system 1000 may be implemented in a client/server architecture, it may also be implemented using other computer architectures, such as a stand-alone computer, a mainframe system with terminals, an ASP model, or a peer-to-peer model, and a variety of networks, such as a local area network (LAN), a wide area network (WAN), the Internet, a wireless network, or a mobile phone network.

The search engine 1002 may include, but is not limited to, an interface through a web-based tool that enables users to locate information on the World Wide Web. The search engine 1002 may also include a tool that enables users to locate information within an internal database system.

The entity database 1004 may be implemented on a single server or in a distributed architecture spanning multiple servers. The entity database 1004 may accommodate a set of entity queries, such as query strings, structured data, and the like. This set of entity queries may be extracted in advance from a plurality of corpora available on the Internet and/or a local network. The entity queries are indexed and scored. The entities include, for example, persons, organizations, geographic locations, dates, and/or times. During extraction, one or more feature recognition and extraction algorithms may be employed. In addition, a score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight and relevance of each feature may be determined. Additionally, the relevance of the associations between features may be determined using a weighted scoring model.

The trend database 1006 may be implemented on a single server or in a distributed architecture spanning multiple servers. The trend database 1006 may accommodate a set of entity queries, such as query strings, structured data, and the like. This set of entity queries may be extracted in advance from historical queries run by a user and/or multiple users on the Internet and/or a local network. The entity queries are indexed and scored. The entities include, for example, persons, organizations, geographic locations, dates, and/or times. During extraction, one or more feature recognition and extraction algorithms may be employed. In addition, a score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight and relevance of each feature may be determined. Additionally, the relevance of the associations between features may be determined using a weighted scoring model.

The entity database 1004 and the trend database 1006 may include an entity co-occurrence knowledge base, which may be built as an in-memory database (not shown) and may include other components (not shown) such as one or more search controllers, a plurality of search nodes, collections of compressed data, and a disambiguation module. One search controller may be selectively associated with one or more search nodes. Each search node may independently perform a fuzzy key search through a collection of compressed data and return a set of scored results to its associated search controller.

The co-occurrence knowledge base may include related entities that are based on features and ranked by a confidence score. Various methods for linking the features may be employed. These methods essentially use a weighted model to determine which entity types are most important and which carry more weight, and use confidence scores to determine how reliably the correct features were extracted.

The search system 1000 may compare a user query entered in the search engine 1002 against the entity database 1004 and the trend database 1006. An auto-complete mode for the search engine 1002 may be enabled from both databases, that is, the entity database 1004 and the trend database 1006. The search system 1000 presents the user with a list 1008 of search suggestions, which is generated and indexed based on the fuzzy score assigned to each entity suggestion in the databases. The score of each entity suggestion may be assigned automatically by the search system 1000 or manually by a system supervisor. Entity suggestions are ordered from most relevant to least relevant based on the score achieved by each entity. In addition, scores in the trend database 1006 may be assigned using trends and query frequencies from one or more users on a local network and/or the Internet.

The entity suggestions from each database are compared with one another, indexed, and sorted by the rank obtained from their scores, so that the search suggestion list 1008 shown to the user combines the entity suggestions from both databases, that is, the entity database 1004 and the trend database 1006. When the user selects a suggestion from the list, or selects a result that is not in the suggestion list, the search system 1000 may store that information in the trend database. This enables a self-learning system that increases the reliability and accuracy of the search system 1000. In short, the trend co-occurrence knowledge base is continuously updated with features extracted from users' queries and selected suggestions, providing on-the-fly learning that improves search relevance and accuracy. The trend co-occurrence knowledge base may also be populated by other users of the system and by automated methods such as a trend search module.
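
The feedback loop described above can be pictured with a very small frequency-based store. This is a sketch under assumptions: the class name `TrendStore`, its methods, and the choice of relative selection frequency as the trend score are hypothetical, since the disclosure does not specify how trend scores are computed.

```python
from collections import Counter


class TrendStore:
    """Hypothetical trend store that learns from user selections."""

    def __init__(self):
        self._counts = Counter()

    def record_selection(self, suggestion):
        # Called whenever a user picks a suggestion or submits a query.
        self._counts[suggestion] += 1

    def score(self, suggestion):
        # Relative selection frequency in [0, 1]; 0.0 if nothing is recorded.
        total = sum(self._counts.values())
        return self._counts[suggestion] / total if total else 0.0


store = TrendStore()
for choice in ["Indiana Nascar", "Indiana Nascar", "Indiana Name", "Indiana Nascar"]:
    store.record_selection(choice)
```

Each recorded selection shifts the scores immediately, so suggestions that users pick more often rise in later trend-based rankings.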

FIG. 11 is a block diagram of a search system 1100 according to the present disclosure. The search system 1100 includes a search engine 1102, and the search engine 1102 may include one or more user interfaces that accept data input from a user, such as user queries.

The search system 1100 may include one or more databases. These databases may include an entity database 1104 and a trend database 1106. The databases may be stored on a local server or on a web-based server. Accordingly, although the search system 1100 may be implemented in a client/server architecture, it may also be implemented using other computer architectures, such as a stand-alone computer, a mainframe system with terminals, an ASP model, or a peer-to-peer model, and a variety of networks, such as a local area network (LAN), a wide area network (WAN), the Internet, a wireless network, or a mobile phone network.

In one embodiment, the search system 1100 may start when a user enters one or more entities (in search queries) through a user interface in the search engine 1102. For example, a search query may be a combination of keywords in a string data format, structured data, and the like. These keywords may be entities representing persons, organizations, geographic locations, dates, and/or times. In this embodiment, "Indiana Na" is used as the search query.

"Indiana Na" may be processed for entity extraction. The entity extraction module processes search queries such as "Indiana Na" as entities and compares all of them against the entity co-occurrence knowledge base in the entity database 1104 and the trend database 1106, in order to extract and disambiguate as many entities as possible. Additionally, portions of the query text that are not detected as entities (for example, persons, organizations, locations) are treated as conceptual features (for example, topics, facts, key phrases) that may be used to search the entity co-occurrence knowledge base (for example, the entity and trend databases). During extraction, one or more feature recognition and extraction algorithms may be employed. In addition, a score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Taking the feature attributes into account, the relative weight or relevance of each feature may be determined. Additionally, the relevance of the associations between features may be determined using a weighted scoring model.

In this embodiment, the entity database 1104 presents a list of search suggestions as a list 1108 of entity suggestions, which may be indexed and ranked. The trend database 1106 presents a list of search suggestions as a trend-based suggestion list 1110, which may be indexed and ranked. Subsequently, the search system 1100 may build a search suggestion list 1112 based on the lists provided by the entity database 1104 and the trend database 1106. The search suggestion list 1112 may be indexed and ranked based on the individual score of each entity suggestion in each database, so that the most relevant suggestion appears first and less relevant results follow below it.

An exemplary use of the search system 1100 to obtain search suggestions is as follows. The search suggestion list 1112 shows suggestions based on the user query "Indiana Na". "Indiana Name" appears first, based on an individual score of 0.9 for that entity; "Indiana Nascar" appears next, with an individual score of 0.8; and "Indiana Nashville" appears last, with an individual score of 0.7. The individual scores may be compared using the entity suggestion list 1108 and the trend-based suggestion list 1110, without taking repeated entities into account.
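
The individual-score ranking of FIG. 11 can be sketched as below. Several details are assumptions made for illustration: the text gives only the final scores, so the split of suggestions between the two lists is invented, the 0.6 trend score for "Indiana Nascar" is borrowed from the FIG. 12 example, and keeping the higher of two scores for a repeated entity is one possible reading of "without taking repeated entities into account".

```python
def merge_by_individual_score(entity_list, trend_list):
    """Rank suggestions by their best individual score across both lists."""
    best = {}
    for suggestions in (entity_list, trend_list):
        for text, score in suggestions.items():
            # A repeated entity keeps its higher score; scores are not added.
            best[text] = max(score, best.get(text, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical contents of lists 1108 and 1110 for the query "Indiana Na".
entity_suggestions = {"Indiana Name": 0.9, "Indiana Nascar": 0.8, "Indiana Nashville": 0.7}
trend_suggestions = {"Indiana Nascar": 0.6}

individual_ranking = merge_by_individual_score(entity_suggestions, trend_suggestions)
```

With these inputs the ranking is "Indiana Name" (0.9), "Indiana Nascar" (0.8), "Indiana Nashville" (0.7), matching the order given in the text.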

FIG. 12 is a block diagram of a search system 1200 according to the present disclosure. The search system 1200 includes a search engine 1202, and the search engine 1202 may include one or more user interfaces that accept data input from a user, such as user queries.

The search system 1200 may include one or more databases. These databases may include an entity database 1204 and a trend database 1206. The databases may be stored on a local server or on a web-based server. Accordingly, although the search system 1200 may be implemented in a client/server architecture, it may also be implemented using other computer architectures, such as a stand-alone computer, a mainframe system with terminals, an ASP model, or a peer-to-peer model, and a variety of networks, such as a local area network (LAN), a wide area network (WAN), the Internet, a wireless network, or a mobile phone network.

In one embodiment, the search system 1200 may start when a user enters one or more entities (search queries) through a user interface in the search engine 1202. For example, a search query may be a combination of keywords such as strings, structured data, and the like. These keywords may be entities representing persons, organizations, geographic locations, dates, and/or times. In this embodiment, "Indiana Na" is used as the search query.

"Indiana Na" may be processed for entity extraction. The entity extraction module processes search queries such as "Indiana Na" as entities and compares all of them against the entity co-occurrence knowledge base in the entity database 1204 and the trend database 1206, in order to extract and disambiguate as many entities as possible. Additionally, portions of the query text that are not detected as entities (for example, persons, organizations, locations) are treated as conceptual features (for example, topics, facts, key phrases) that may be used to search the entity co-occurrence knowledge base (for example, the entity database and the trend database). During extraction, one or more feature recognition and extraction algorithms may be employed. In addition, a score may be assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. Based on each feature's attributes, the relative weight or relevance of each feature may be determined. Additionally, the relevance of the associations between features may be determined using a weighted scoring model.

In the present embodiment, entity database 1204 presents its search suggestions as entity suggestion list 1208, which is already indexed and ranked. Likewise, trend database 1206 presents its search suggestions as trend-based suggestion list 1210, which is already indexed and ranked. Subsequently, search system 1200 may build search suggestion list 1212 from the lists provided by entity database 1204 and trend database 1206. Search suggestion list 1212 may be indexed and ranked based on the overall score of each entity suggestion across both databases, so that the most relevant suggestion appears first and less relevant results follow below it.

An exemplary use of search system 1200 for obtaining search suggestions is now described. Search suggestion list 1212 shows suggestions based on the user query "Indiana Na". "Indiana Nascar" is shown first, based on an overall score of 1.4, which is the sum of its score of 0.8 in entity suggestion list 1208 and its score of 0.6 in trend-based suggestion list 1210. Similarly, "Indiana Name" is shown with an overall score of 0.9, and finally "Indiana Nashville" is shown with an overall score of 0.7.
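The ranking in this example can be illustrated with a minimal Python sketch. It assumes each suggestion list is a plain mapping from suggestion text to score and that the overall score is a simple sum, as in the "Indiana Nascar" case. The function name `build_suggestion_list` and the per-list scores for "Indiana Name" and "Indiana Nashville" are hypothetical; the description states only their totals.

```python
from collections import defaultdict

# Scores for "Indiana Nascar" come from the description. The per-list splits
# for the other two suggestions are assumptions; only their totals (0.9 and
# 0.7) are stated.
entity_suggestions = {
    "Indiana Nascar": 0.8,
    "Indiana Name": 0.5,       # assumed split
    "Indiana Nashville": 0.4,  # assumed split
}
trend_suggestions = {
    "Indiana Nascar": 0.6,
    "Indiana Name": 0.4,       # assumed split
    "Indiana Nashville": 0.3,  # assumed split
}


def build_suggestion_list(*scored_lists):
    """Sum each suggestion's scores across all lists and rank by total, highest first."""
    totals = defaultdict(float)
    for scored in scored_lists:
        for suggestion, score in scored.items():
            totals[suggestion] += score
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


ranked = build_suggestion_list(entity_suggestions, trend_suggestions)
for suggestion, total in ranked:
    print(f"{suggestion}: {total:.1f}")
```

With these inputs the suggestions come out in the order "Indiana Nascar", "Indiana Name", "Indiana Nashville". A suggestion that appears in only one list would simply keep its single score under this scheme.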

FIG. 13 shows system architecture 1300 for geotagging content in SharePoint 2013®. Search index 1324 is one of several key components that enable search in SharePoint 1302. Another key part of enabling search in SharePoint 2013® 1302 is capturing content so that it can be indexed. SharePoint 1302 includes a crawler 1304 component to perform this content capture.

Crawler 1304 may crawl through different content sources 1306, adding a list of metadata properties to each content item. For example, and without limitation, a content source may include SharePoint content, network file shares, or user or Internet content. Crawler 1304 may be configured to perform secure connection functions to content sources 1306, associating documents from the source with their metadata as crawled properties. Crawler 1304 may be configured for full or incremental crawls of the content. Crawled properties may include, for example and without limitation, author, title, and creation date, among others.

SharePoint 2013® includes a content processing 1308 component. The content processing 1308 component takes content from crawler 1304 and prepares it for indexing. Content processing 1308 involves steps that include, among others, linguistic processing (language detection), parsing, entity extraction management, content-based file format detection, content processing error reporting, natural language processing, and mapping crawled properties to managed properties.

Content processing 1308 may be extended by a content enrichment web service (CEWS) 1310. CEWS 1310 allows content processing 1308 to be enriched by having a web service callout 1312 invoke an external web service that performs additional operations and enriches the crawled data properties. Web service callout 1312 may be a standard Simple Object Access Protocol (SOAP) request, or any other web service invocation method, used to exchange structured information about the crawled data with entity enrichment service 1314. Web service callout 1312 may include trigger conditions, configured in a content enrichment configuration object, that control when the external web service is called for enrichment processing. In addition, entity enrichment service 1314 may determine the document type of the crawled data in order to identify content that arrives in image form (scanned documents, pictures, and the like). Whenever image content is found, entity enrichment service 1314 sends the location of the crawled document to OCR processing engine 1316, which may be, for example and without limitation, an optical character recognition component or another image processing component. OCR processing engine 1316 may retrieve and process the image files and asynchronously convert them into text files. Subsequently, the OCR-processed files 1318 are fed back to crawler 1304, crawled as text files, and sent on to content processing 1308 to continue through the rest of the workflow.

System architecture 1300 includes an external geotagger web service 1320 and a named entity tagger service 1322. Geotagger web service 1320 and named entity tagger service 1322 function as web service application providers and may be configured to respond to web service callout 1312. Geotagger web service 1320 may use natural language processing entity extraction techniques, machine learning models, and other techniques to identify and disambiguate geographic entities in the crawled content. For example, geotagger web service 1320 may disambiguate geographic entities by analyzing the statistical co-occurrence of entities found in a gazetteer. Geotagger web service 1320 may include a database of statistically co-occurring entities that can be linked to the content found by crawler 1304. Following the same techniques, named entity tagger service 1322 may be used to extract additional entities or text features, such as organizations, people, or topics.
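As a rough illustration of co-occurrence-based disambiguation, the following Python sketch picks, among the gazetteer candidates for an ambiguous place name, the one that co-occurs most often with the other entities found in the same text. The function name, data layout, and counts are all hypothetical; the description does not specify how geotagger web service 1320 stores or scores co-occurrences.

```python
def disambiguate(mention, context_entities, gazetteer, cooccurrence):
    """Pick the gazetteer candidate with the strongest co-occurrence support.

    gazetteer maps a surface name to candidate places; cooccurrence maps
    (candidate, other_entity) pairs to a co-occurrence count.
    """
    candidates = gazetteer.get(mention, [])
    if not candidates:
        return None

    def support(candidate):
        return sum(cooccurrence.get((candidate, other), 0) for other in context_entities)

    return max(candidates, key=support)


# Hypothetical gazetteer entries and co-occurrence counts, for illustration only.
gazetteer = {
    "Springfield": ["Springfield, Illinois", "Springfield, Massachusetts"],
}
cooccurrence = {
    ("Springfield, Illinois", "Chicago"): 80,
    ("Springfield, Illinois", "Abraham Lincoln"): 120,
    ("Springfield, Massachusetts", "Boston"): 95,
}

print(disambiguate("Springfield", ["Chicago"], gazetteer, cooccurrence))
```

When no candidate has any support, `max` returns the first candidate listed, so a real service would likely need an explicit fallback or a confidence threshold.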

Geotagger web service 1320 may analyze the array of managed properties sent by CEWS 1310 as input properties and identify any geographic entities referenced in the text. As a non-limiting example, the input properties may include FileType, IsDocument, OriginalPath, and body, among others. Geotagger web service 1320 may geotag the text by creating and modifying managed properties that reference each geographic entity found. Geotagger web service 1320 may send the modified or new managed properties to entity enrichment service 1314, where a transformation maps the modified managed properties and returns them to CEWS 1310 as output properties. The same process may be used in interacting with named entity tagger service 1322 for extracting and tagging other entities or text features, such as organizations, people, or topics.
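The input-to-output property mapping can be sketched in Python as follows. This is a simplified stand-in, not the SharePoint API: properties are modeled as a plain dictionary, place names are matched by exact whole-word lookup against an assumed list, and the output property name `GeoEntities` is invented for illustration.

```python
import re


def geotag(input_properties, known_places):
    """Return a copy of the input properties with a hypothetical GeoEntities property added."""
    body = input_properties.get("body", "")
    found = [
        place for place in known_places
        if re.search(r"\b" + re.escape(place) + r"\b", body)
    ]
    output_properties = dict(input_properties)
    output_properties["GeoEntities"] = found  # illustrative name, not a SharePoint property
    return output_properties


# Input properties named in the description, with made-up values.
crawled = {
    "FileType": "docx",
    "IsDocument": True,
    "OriginalPath": "http://example.invalid/reports/q3.docx",
    "body": "The team met in Nashville and later in Indianapolis.",
}
tagged = geotag(crawled, ["Indianapolis", "Nashville", "Chicago"])
print(tagged["GeoEntities"])
```

The function copies the input rather than modifying it, which mirrors the description's separation between input properties and the output properties returned to the CEWS.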

After the enriched managed properties are returned by entity enrichment service 1314, they are merged with the managed properties of the crawled file and sent to search index 1324.

Once geographic and other entity tags have been associated with the content and indexed, search queries may be executed using the geographic and named entity properties. Search UI 1326 in SharePoint 2013® may include a specialized display that assists the user in performing geography-based searches and supports an enhanced display of faceted search results. Search UI 1326 may be a custom web part, or may be implemented by modifying the standard layout of SharePoint 2013® search with standard tools such as HTML, HTML5, JavaScript, and CSS.

FIG. 14 is a flowchart 1400 illustrating process steps for tagging content for SharePoint 2013® search. The process begins when the crawler component in SharePoint 2013® performs a crawl of the content (step 1402). In one embodiment the crawl may be a full crawl, and in another embodiment it may be an incremental crawl. The crawler component feeds the crawled properties and metadata to content processing (step 1404). A determination may then be made as to whether the crawled content contains geographic or named entities. By way of example and without limitation, a trigger condition may be used. The trigger condition may include program logic or a rule set that determines whether the content can benefit from geotagging or entity tagging. If the trigger condition evaluates to false, the crawled content is associated with its managed properties (step 1406) and passed to the search index component (step 1408). If the trigger condition evaluates to true, the CEWS sends a web service callout (step 1410) to the entity enrichment service. The entity enrichment service may analyze the content it receives to determine whether it is in an image format (scanned documents, pictures, and the like). Content found to be in an image format is processed asynchronously by OCR and sent back as a text file to be crawled again by the crawler component (step 1412). If the content is not in an image format, it may be processed by the geotagging web service or the named entity tagger service (step 1414). The web service extracts and disambiguates the geographic or named entities referenced in the content and enriches them with entity metadata. The identified entities and their metadata are sent back to the content processing component as managed properties and associated with the content (step 1416). The associated metadata may then be sent to the search index component (step 1406).
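The branching in this flow can be summarized with a short Python sketch. It is a simplification under stated assumptions: items are plain dictionaries, OCR runs synchronously here although the description says it is asynchronous, and the trigger rule, OCR step, and tagger are hypothetical stand-ins passed in as functions.

```python
def route_crawled_item(item, trigger, run_ocr, tag_entities):
    """Return (next_stage, item) for one crawled item.

    next_stage is "index" when the item can go to the search index, or
    "recrawl" when OCR output must be crawled again as text.
    """
    if not trigger(item):
        # Trigger false: keep the crawled managed properties as they are.
        return "index", item
    if item.get("is_image", False):
        # Image content: run OCR and hand the text back to the crawler.
        return "recrawl", run_ocr(item)
    # Text content: tag entities and attach them before indexing.
    enriched = dict(item)
    enriched["entities"] = tag_entities(item.get("body", ""))
    return "index", enriched


# Hypothetical stand-ins for the trigger rule, OCR engine, and tagger services.
def has_body_or_image(item):
    return bool(item.get("body")) or item.get("is_image", False)


def fake_ocr(item):
    return {"body": "text recovered by OCR", "is_image": False}


def fake_tagger(body):
    return [name for name in ("Indiana", "Nashville") if name in body]


print(route_crawled_item({"body": "A race in Indiana"}, has_body_or_image, fake_ocr, fake_tagger))
```

Passing the trigger, OCR, and tagger in as functions keeps the routing logic separate from any particular service, which matches the description's point that the trigger may be arbitrary program logic or a rule set.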

While various aspects and embodiments have been disclosed, other aspects and embodiments are contemplated. The various aspects and embodiments disclosed are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.

The foregoing method descriptions and process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by one of ordinary skill in the art, the steps in the foregoing embodiments may be performed in any order. Words such as "then" and "next" are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of the methods. Although process flow diagrams may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.

The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Embodiments implemented in computer software may be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof. A code segment or machine-executable instructions may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, and the like may be passed, forwarded, or transmitted via any suitable means, including memory sharing, message passing, token passing, network transmission, and the like.

The actual software code or specialized control hardware used to implement these systems and methods is not limiting of the invention. Thus, the operation and behavior of the systems and methods were described without reference to specific software code, it being understood that software and control hardware can be designed to implement the systems and methods based on the description herein.

When implemented in software, the functions may be stored as one or more instructions on a non-transitory computer-readable or processor-readable storage medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module, which may reside on a computer-readable or processor-readable storage medium. Non-transitory computer-readable or processor-readable media include both computer storage media and tangible storage media that facilitate transfer of a computer program from one place to another. A non-transitory processor-readable storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such non-transitory processor-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible storage medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer or processor. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.

It will be appreciated that the various components of the technology can be located at distant portions of a distributed network and/or the Internet, or within a dedicated secure, unsecured, and/or encrypted system. Thus, it should be appreciated that the components of the system can be combined into one or more devices or co-located on a particular node of a distributed network, such as a telecommunications network. As will be appreciated from the description, and for reasons of computational efficiency, the components of the system can be arranged at any location within a distributed network without affecting the operation of the system. Moreover, the components could be embedded in a dedicated machine.

Furthermore, it should be appreciated that the various links connecting the elements can be wired or wireless links, or any combination thereof, or any other known or later developed element(s) capable of supplying and/or communicating data to and from the connected elements. The term module as used herein can refer to any known or later developed hardware, software, firmware, or combination thereof that is capable of performing the functionality associated with that element. The terms determine, calculate, and compute, and variations thereof, as used herein are used interchangeably and include any type of methodology, process, mathematical operation, or technique.

The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.

The embodiments described above are intended to be exemplary. One skilled in the art will recognize that numerous alternative components and embodiments may be substituted for the particular examples described herein and still fall within the scope of the invention.

Claims

1. A computer-implemented method comprising:
receiving, by an entity extraction computer, from a client computer a search query having one or more entities;
comparing, by the entity extraction computer, each entity with one or more co-occurrences of that entity in a co-occurrence database;
extracting, by the entity extraction computer, a subset of the one or more entities from the search query in response to determining that each such entity exceeds a confidence score of the co-occurrence database, the confidence score being based on a level of certainty of co-occurrence of the entity and one or more related entities in an electronic data corpus according to the co-occurrence database;
assigning, by the entity extraction computer, an index identifier (index ID) to each of the entities in the plurality of extracted entities;
storing, by the entity extraction computer, the index ID for each of the plurality of extracted entities in the electronic data corpus, which is indexed by an index ID corresponding to each of the one or more related entities;
searching, by a search server computer, the entity-indexed electronic data corpus to identify index IDs of data records in which at least two of the plurality of extracted entities co-occur; and
building, by the search server computer, a search result list having data records corresponding to the identified index IDs.

2. The method according to claim 1, further comprising:
sorting, by the search server computer, the search result list by relevance based on the confidence score; and
transmitting, by the search server computer, the sorted search result list to a user device.

3. The method according to claim 1, wherein the plurality of extracted entities are ranked based on the confidence score.

4. The method according to claim 1, further comprising associating, by the entity extraction computer, the extracted entities with one or more co-occurring entities in the entity-indexed electronic data corpus.

5. The method according to claim 4, wherein the associated entities are ranked by the confidence score.

6. The method according to claim 1, wherein each of the plurality of entities is selected from the group comprising a person, an organization, a geographic location, a date, and a time.

7. A system comprising:
one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, the plurality of computer modules including:
an entity extraction module configured to receive user input of search query parameters; and
a search server module,
wherein the entity extraction module is further configured to:
extract a plurality of entities from the search query parameters by comparing each entity in the plurality of extracted entities with an entity co-occurrence database that includes a confidence score representing a level of certainty of co-occurrence of an extracted entity and one or more related entities in an electronic data corpus;
assign an index identifier (index ID) to each entity in the plurality of extracted entities; and
store the index ID for each of the plurality of extracted entities in the electronic data corpus, which is indexed by an index ID corresponding to each of the one or more related entities, and
wherein the search server module is configured to:
locate the plurality of extracted entities and identify index IDs of data records in which at least two of the plurality of extracted entities co-occur; and
build a search result list having data records corresponding to the identified index IDs.

8. The system of claim 7, wherein the search server module is further configured to sort the search result list by relevance based on the confidence score and to transmit the sorted search result list to a user device.

9. The system of claim 7, wherein the plurality of extracted entities are ranked based on the confidence score.

10. The system of claim 7, wherein the entity extraction module is further configured to associate the extracted entities with one or more co-occurring entities in the entity-indexed electronic data corpus.

11. The system of claim 10, wherein the associated entities are ranked by the confidence score.

12. The system of claim 7, wherein each of the plurality of entities is selected from the group comprising a person, an organization, a geographic location, a date, and a time.

13. A non-transitory computer-readable medium having computer-executable instructions stored thereon, the instructions causing:
an entity extraction computer to receive user input of search query parameters;
the entity extraction computer to extract a plurality of entities from the search query parameters by comparing each entity in the plurality of extracted entities with an entity co-occurrence database that includes a confidence score representing a level of certainty of co-occurrence of an extracted entity and one or more related entities in an electronic data corpus;
the entity extraction computer to assign an index identifier (index ID) to each entity in the plurality of extracted entities;
the entity extraction computer to store the index ID for each of the plurality of extracted entities in the electronic data corpus, which is indexed by an index ID corresponding to each of the one or more related entities;
a search server computer to search the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify index IDs of data records in which at least two of the plurality of extracted entities co-occur; and
the search server computer to build a search result list having data records corresponding to the identified index IDs.

14. The non-transitory computer-readable medium of claim 13, wherein the instructions further cause:
the search server computer to sort the search result list by relevance based on the confidence score; and
the search server computer to transmit the sorted search result list to a user device.

15. The non-transitory computer-readable medium of claim 13, wherein the plurality of extracted entities are ranked based on the confidence score.

16. The non-transitory computer-readable medium of claim 13, wherein the instructions further cause the entity extraction computer to associate the extracted entities with one or more co-occurring entities in the entity-indexed electronic data corpus.

17. The non-transitory computer-readable medium of claim 16, wherein the associated entities are ranked by the confidence score.

18. The non-transitory computer-readable medium of claim 13, wherein each of the plurality of entities is selected from the group comprising a person, an organization, a geographic location, a date, and a time.

A method comprising:
receiving, by an entity extraction computer, user input of search query parameters from a user interface;
extracting, by the entity extraction computer, one or more entities from the search query parameters by comparing the search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more entities in an electronic data corpus, and by identifying at least one entity type corresponding to the one or more entities in the search query parameters;
selecting, by a fuzzy-score matching computer, a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type;
searching, by the fuzzy-score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm, and forming one or more suggested search query parameters from the one or more records based on the search; and
providing, by the fuzzy-score matching computer, the one or more suggested search query parameters via the user interface.
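
The selection step recited above, in which the fuzzy matching algorithm depends on the identified entity type, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the claims: the three matcher functions, the `MATCHER_BY_TYPE` table, the in-memory `RECORDS` list standing in for the entity co-occurrence database, and the 0.6 threshold. The similarity measure is the Python standard library's `difflib.SequenceMatcher`, not the matching algorithm of the disclosure.

```python
from difflib import SequenceMatcher


def ratio_match(query: str, candidate: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def token_match(query: str, candidate: str) -> float:
    """Order-insensitive similarity: compare the sorted word tokens."""
    q = " ".join(sorted(query.lower().split()))
    c = " ".join(sorted(candidate.lower().split()))
    return SequenceMatcher(None, q, c).ratio()


def prefix_match(query: str, candidate: str) -> float:
    """1.0 when the candidate starts with the query, otherwise 0.0."""
    return 1.0 if candidate.lower().startswith(query.lower()) else 0.0


# Hypothetical mapping from entity type to matching algorithm.
MATCHER_BY_TYPE = {
    "person": token_match,
    "organization": ratio_match,
    "location": prefix_match,
}

# Hypothetical stand-in for records of the entity co-occurrence database.
RECORDS = [
    ("John Smith", "person"),
    ("Jane Doe", "person"),
    ("New York", "location"),
    ("Newark", "location"),
    ("Boston", "location"),
]


def suggest(query: str, entity_type: str, threshold: float = 0.6) -> list[str]:
    """Pick the matcher for the entity type, score records, return best first."""
    matcher = MATCHER_BY_TYPE.get(entity_type, ratio_match)
    scored = [(matcher(query, name), name)
              for name, kind in RECORDS if kind == entity_type]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [name for score, name in scored if score >= threshold]
```

Calling `suggest("Smith John", "person")` uses the token-based matcher, so word order does not matter, while `suggest("New", "location")` uses the prefix matcher. A table keyed by entity type keeps the selection step separate from the search step, which mirrors the order of the claim.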

The method of claim 19, further comprising searching, by the fuzzy-score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm before the user input is completed.

The method of claim 19, wherein the one or more records associated with the search query parameters include conceptual features.

The method of claim 19, wherein the one or more suggested search query parameters include a plurality of suggested search query parameters, the method further comprising sorting, by the fuzzy-score matching computer, the plurality of suggested search query parameters in descending order based on closeness of match to the search query parameters in the user input.
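
The sorting described above can be sketched as follows. This is only an illustration under stated assumptions: `closeness` and `rank_suggestions` are hypothetical names, the `limit` parameter is an assumed cap on how many entries a drop-down list would show, and closeness of match is approximated with `difflib.SequenceMatcher.ratio()` rather than with the scoring of the disclosure.

```python
from difflib import SequenceMatcher


def closeness(user_input: str, suggestion: str) -> float:
    """Approximate closeness of match as a similarity ratio in [0, 1]."""
    return SequenceMatcher(None, user_input.lower(), suggestion.lower()).ratio()


def rank_suggestions(user_input: str, suggestions: list[str],
                     limit: int = 5) -> list[str]:
    """Return suggestions in descending order of closeness, capped at `limit`."""
    ranked = sorted(suggestions,
                    key=lambda s: closeness(user_input, s),
                    reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    for entry in rank_suggestions("micro", ["Microsoft", "Micron", "Macro Inc"]):
        print(entry)
```

The returned list is already in display order, so a user interface could render it directly as the entries of a drop-down list.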

The method of claim 22, further comprising providing, by the fuzzy-score matching computer via the user interface, the sorted plurality of suggested search query parameters in a drop-down list.

The method of claim 19, wherein the entity co-occurrence database is indexed.

The method of claim 1, wherein the entity co-occurrence database includes an entity-to-entity index.

The method of claim 19, wherein the entity co-occurrence database includes an entity-to-topic index.

The method of claim 19, wherein the entity co-occurrence database includes an entity-to-facts index.

A system comprising one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, the plurality of computer modules comprising:
an entity extraction module configured to:
receive user input of search query parameters from a user interface; and
extract one or more entities from the search query parameters by comparing the search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more entities in an electronic data corpus, and by identifying at least one entity type corresponding to the one or more entities in the search query parameters; and
a fuzzy-score matching module configured to:
select a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type;
search the entity co-occurrence database using the selected fuzzy matching algorithm, and form one or more suggested search query parameters from the one or more records based on the search; and
provide the one or more suggested search query parameters via the user interface.

The system of claim 28, wherein the fuzzy-score matching module is further configured to search the entity co-occurrence database using the selected fuzzy matching algorithm before the user input is completed.

The system of claim 28, wherein the one or more records associated with the search query parameters include conceptual features.

The system of claim 28, wherein the one or more suggested search query parameters include a plurality of suggested search query parameters, and wherein the fuzzy-score matching computer is further configured to sort the plurality of suggested search query parameters in descending order based on closeness of match to the search query parameters in the user input.

The system of claim 32, wherein the fuzzy-score matching computer is configured to provide, via the user interface, the sorted plurality of suggested search query parameters in a drop-down list.

The system of claim 28, wherein the entity co-occurrence database is indexed.

The system of claim 28, wherein the entity co-occurrence database includes an entity-to-entity index.

The system of claim 28, wherein the entity co-occurrence database includes an entity-to-topic index.

The system of claim 28, wherein the entity co-occurrence database includes an entity-to-facts index.

A method comprising:
receiving, by an entity extraction computer, user input of partial search query parameters from a user interface, the partial search query parameters having at least one incomplete search query parameter;
extracting, by the entity extraction computer, one or more first entities from the partial search query parameters by comparing the partial search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more first entities in an electronic data corpus, and by identifying at least one entity type corresponding to the one or more first entities in the partial search query parameters;
selecting, by a fuzzy-score matching computer, a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the partial search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type;
searching, by the fuzzy-score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm, and forming one or more suggested first search query parameters from the one or more records based on the search;
providing, by the fuzzy-score matching computer, the one or more suggested first search query parameters via the user interface;
receiving, by the entity extraction computer, a user selection of the one or more suggested first search query parameters so as to form completed search query parameters;
extracting, by the entity extraction computer, one or more second entities from the completed search query parameters;
searching, by the entity extraction computer, the entity co-occurrence database to identify one or more entities associated with the one or more second entities so as to form one or more suggested second search query parameters; and
providing, by the entity extraction computer, the one or more suggested second search query parameters via the user interface.
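
The two-stage flow recited above, first completing a partial query and then suggesting entities related to the completed query, can be sketched as below. The data and names are hypothetical: `CO_OCCURRENCE` is a toy stand-in for the entity co-occurrence database with invented counts, and `complete` uses plain prefix matching as a simplified substitute for the fuzzy matching step of the claim.

```python
from collections import Counter

# Hypothetical co-occurrence counts: entity -> entities seen alongside it.
CO_OCCURRENCE = {
    "Barack Obama": Counter({"White House": 12, "Michelle Obama": 9, "Chicago": 4}),
    "Barcelona": Counter({"Spain": 15, "Lionel Messi": 7}),
    "Boston": Counter({"Massachusetts": 10}),
}


def complete(partial: str) -> list[str]:
    """Stage 1: propose completed queries for a partial query.

    Prefix matching stands in for the fuzzy matching step.
    """
    p = partial.lower()
    return sorted(name for name in CO_OCCURRENCE if name.lower().startswith(p))


def related(entity: str, top_n: int = 2) -> list[str]:
    """Stage 2: propose the entities that co-occur most often with `entity`."""
    counts = CO_OCCURRENCE.get(entity, Counter())
    return [name for name, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    first_suggestions = complete("bar")    # shown to the user
    selected = first_suggestions[0]        # simulated user selection
    print(selected, "->", related(selected))
```

The second stage runs only after a selection is made, which matches the claim's ordering: the completed query, not the partial one, drives the co-occurrence lookup.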

The method of claim 37, further comprising searching, by the fuzzy-score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm before the user input is completed.

The method of claim 37, wherein the one or more records associated with the partial search query parameters include conceptual features.

The method of claim 37, wherein the one or more suggested first search query parameters include a plurality of suggested first search query parameters, the method further comprising sorting, by the fuzzy-score matching computer, the plurality of suggested first search query parameters in descending order based on closeness of match to the partial search query parameters in the user input.

The method of claim 40, further comprising providing, by the fuzzy-score matching computer via the user interface, the sorted plurality of suggested first search query parameters in a drop-down list.

The method of claim 37, wherein the entity co-occurrence database is indexed.

The method of claim 37, wherein the entity co-occurrence database includes an entity-to-entity index.

The method of claim 37, wherein the entity co-occurrence database includes an entity-to-topic index.

The method of claim 37, wherein the entity co-occurrence database includes an entity-to-facts index.

A system comprising one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, the plurality of computer modules comprising:
an entity extraction module configured to:
receive user input of partial search query parameters from a user interface, the partial search query parameters having at least one incomplete search query parameter; and
extract one or more first entities from the partial search query parameters by comparing the partial search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more first entities in an electronic data corpus, and by identifying at least one entity type corresponding to the one or more first entities in the partial search query parameters; and
a fuzzy-score matching module configured to:
select a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the partial search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type;
search the entity co-occurrence database using the selected fuzzy matching algorithm, and form one or more suggested first search query parameters from the one or more records based on the search; and
provide the one or more suggested first search query parameters via the user interface,
wherein the entity extraction module is further configured to:
receive a user selection of the one or more suggested first search query parameters so as to form completed search query parameters;
extract one or more second entities from the completed search query parameters;
form one or more suggested second search query parameters by searching the entity co-occurrence database to identify one or more entities associated with the one or more second entities; and
provide the one or more suggested second search query parameters via the user interface.

The system of claim 46, wherein the fuzzy-score matching module is further configured to search the entity co-occurrence database using the selected fuzzy matching algorithm before the user input is completed.

The system of claim 46, wherein the one or more records associated with the partial search query parameters include conceptual features.

The system of claim 46, wherein the one or more suggested first search query parameters include a plurality of suggested first search query parameters, and wherein the fuzzy-score matching module is further configured to sort the plurality of suggested first search query parameters in descending order based on closeness of match to the partial search query parameters in the user input.

The system of claim 49, wherein the fuzzy-score matching computer is configured to provide, via the user interface, the sorted plurality of suggested first search query parameters in a drop-down list.

The system of claim 46, wherein the entity co-occurrence database is indexed.

The system of claim 46, wherein the entity co-occurrence database includes an entity-to-entity index.

The system of claim 46, wherein the entity co-occurrence database includes an entity-to-topic index.

The system of claim 46, wherein the entity co-occurrence database includes an entity-to-facts index.

A computer-implemented method comprising:
receiving, by a computer, a search query comprising one or more data strings from a search engine, each entity corresponding to a subset of the one or more data strings;
identifying, by the computer, one or more entities in the one or more data strings based on comparing the one or more entities against an entity database and a trend database;
identifying, by the computer, one or more features in the one or more data strings that are not identified as corresponding to at least one entity;
assigning, by the computer, each of the one or more features to at least one of the one or more entities based on a matching algorithm;
assigning, by the computer, an extraction score to each entity based on a score assigned to each feature assigned to that entity;
receiving, by the computer, a first search list from the entity database, the first search list including one or more entities having a score within a threshold distance of the extraction score of each entity;
receiving, by the computer, a second search list from the trend database, the second search list including one or more entities having a score within a threshold distance of the extraction score of each entity;
generating, by the computer, an aggregated list comprising the first search list and the second search list, the entities of the aggregated list each being ranked according to an aggregated score; and
providing, by the computer, a suggested search according to the aggregated list.
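
The list-building and aggregation steps recited above can be sketched as follows. The claim does not say how scores from the two databases are combined, so summing them is purely an assumption of this sketch. The function names, the dictionary representation of each database, and all scores are likewise hypothetical.

```python
def within_threshold(candidates: dict[str, float],
                     extraction_score: float,
                     threshold: float) -> dict[str, float]:
    """Keep candidates whose score lies within `threshold` of the extraction score."""
    return {name: score for name, score in candidates.items()
            if abs(score - extraction_score) <= threshold}


def aggregate(first_list: dict[str, float],
              second_list: dict[str, float]) -> list[tuple[str, float]]:
    """Merge two search lists and rank by aggregated score.

    Assumption: an entity found in both lists gets the sum of its scores.
    """
    combined = dict(first_list)
    for name, score in second_list.items():
        combined[name] = combined.get(name, 0.0) + score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


def suggest_searches(entity_db: dict[str, float],
                     trend_db: dict[str, float],
                     extraction_score: float,
                     threshold: float) -> list[str]:
    """Build both search lists, aggregate them, and return names best first."""
    first = within_threshold(entity_db, extraction_score, threshold)
    second = within_threshold(trend_db, extraction_score, threshold)
    return [name for name, _ in aggregate(first, second)]


if __name__ == "__main__":
    entity_db = {"Apple Inc": 0.9, "Apple Records": 0.7, "Applebee's": 0.3}
    trend_db = {"Apple Inc": 0.8, "Apple Watch": 0.85}
    print(suggest_searches(entity_db, trend_db, extraction_score=0.8, threshold=0.15))
```

With summation, an entity that appears in both the entity database and the trend database rises above entities found in only one of them. A different combination rule, such as taking the maximum, would change the ranking.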

A computer-implemented method comprising:
receiving, by a computer, a plurality of data streams, each associated with one of a plurality of data sources;
generating, by the computer, an array of characteristics associated with each of the data streams;
in response to the computer detecting a triggering condition associated with data in a data stream, generating, by the computer, geographic data associated with the data in the data stream;
in response to the computer not detecting a triggering condition for a data source, mapping, by the computer, the array of characteristics for the data source to a set of managed properties associated with a search index;
in response to determining that a content type of the data source is image data, executing, by the computer, an optical character recognition routine on metadata associated with the data received from the data source; and
retrieving, by the computer, an updated data stream of a data source from a web service, the data source being associated with the web service identified by the metadata.