JP2004501412A

JP2004501412A - Method and apparatus for flexibly assigning tokenization procedures

Info

Publication number: JP2004501412A
Application number: JP2001550618A
Authority: JP
Inventors: アンブロジアク，　ジャセク
Original assignee: Sun Microsystems Inc
Current assignee: Sun Microsystems Inc
Priority date: 2000-01-06
Filing date: 2001-01-02
Publication date: 2004-01-15
Also published as: EP1386248A2; WO2001050327A3; AU2757901A; WO2001050327A2

Abstract

One embodiment of the present invention provides a system that tokenizes the text of a document by converting the text into tokens corresponding to units of text having individual meanings, in order to facilitate searching the text of the document. The system operates by receiving a document to be tokenized and retrieving a set of tokenization instructions associated with the document. The system then tokenizes the document by translating it, in a manner specified by the set of tokenization instructions, into tokens corresponding to units of text having individual meanings.
[Selected drawing] FIG. 5

Description

[0001]
(background)
The present invention relates to index structures for facilitating computerized data searches. More particularly, the present invention relates to a method and apparatus for flexibly assigning tokenization procedures that convert a document into tokens associated with units of text having individual meanings, such as words or numbers.
[0002]
The explosive growth of the Internet has been closely tied to the development of search engines that allow users to quickly search vast amounts of text data from thousands or millions of different websites. A user interested in a particular topic need only enter several keywords into a search engine to receive links to various web pages containing those keywords.
[0003]
Search engines typically create an "index" of the documents (such as web pages) that are available on the World Wide Web (WWW). An index typically stores individual words (or other meaningful text strings) in a more compact and easily searchable form known as "tokens".
[0004]
The process of building a useful index can be greatly complicated by the fact that documents come in a wide variety of formats that need to be indexed differently. For example, a useful index for a technical article may include the abstract and title of the article but not its body, while a useful index for a television schedule may include the ratings of individual television programs.
[0005]
The process of generating an index is further complicated by the fact that in common document formats, such as Hypertext Markup Language (HTML) or Extensible Markup Language (XML), much of the information that is important for search purposes is stored in attribute fields rather than in the ordinary text of the document.
[0006]
Furthermore, the structure of a document can change over time, which can require a corresponding change in the index structure. For example, suppose that a product catalog structure is updated to include consumer reviews of individual products. This change may require the index to change so that it includes those consumer reviews.
[0007]
Existing systems generate indexes for documents using ad hoc rules. For example, one such rule generates an index for all textual information that is not in an attribute field. Unfortunately, such ad hoc rules often include a great deal of unimportant information and often exclude important information.
[0008]
A similar problem exists in converting a document into tokens (that is, tokenizing the document) during the index generation process. In this process, the relevant parts of the document are converted into tokens associated with units of text having individual meanings, such as words or numbers. In English, words are usually delimited by whitespace and punctuation marks, so the tokenization process is relatively simple. In contrast, languages such as Japanese have no such delimiters. As a result, the tokenization process depends on contextual information and can be very complicated.
[0009]
The tokenization process may also be domain dependent. For example, a period in an email address such as "person.dept@companyx.com" is a connecting element, while a period in other textual information typically delimits word and sentence boundaries.
[0010]
Thus, the tokenization process varies between languages and between domains.
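The domain dependence described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names and both regular expressions are assumptions chosen for illustration, and the email pattern is deliberately simplistic.

```python
import re

# Hypothetical pattern: an email address is kept together as one token.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Words are runs of letters, digits, or underscores; all else delimits.
WORD = re.compile(r"\w+")


def tokenize_plain(text: str) -> list[str]:
    """Treat whitespace and punctuation, including periods, as boundaries."""
    return WORD.findall(text)


def tokenize_email_aware(text: str) -> list[str]:
    """Keep email addresses whole; tokenize everything else as plain text."""
    tokens: list[str] = []
    pos = 0
    for match in EMAIL.finditer(text):
        tokens.extend(WORD.findall(text[pos:match.start()]))
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(WORD.findall(text[pos:]))
    return tokens
```

With the input "Mail person.dept@companyx.com today.", the plain procedure splits the address at every period, while the email-aware procedure keeps it as a single token.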
[0011]
(Summary)
One embodiment of the present invention provides a system that tokenizes text in a document by converting the text into tokens corresponding to units of text having individual meanings, in order to facilitate text searches. The system operates by receiving a document to be tokenized and retrieving a set of tokenization instructions associated with the document. The system then tokenizes the document by translating it, in a manner specified by the set of tokenization instructions, into tokens corresponding to units of text having individual meanings.
[0012]
In one embodiment of the invention, tokenizing the document includes using a first set of tokenization instructions to tokenize a first section of the document, and using a second set of tokenization instructions to tokenize a second section of the document.
[0013]
In one embodiment of the invention, the set of tokenization instructions is included in a plug-in module.
[0014]
In one embodiment of the invention, the set of tokenization instructions is invoked via an object defined in an object-oriented programming system.
[0015]
In one embodiment of the invention, the system further uses the tokenized document in generating an index for the document. In a variation of this embodiment, the system makes the index available to a search engine so that the search engine can scan the index.
[0016]
In one embodiment of the invention, the system retrieves the set of tokenization instructions from a remote address over a network.
[0017]
In one embodiment of the invention, the set of tokenization instructions is appended to the document.
[0018]
In one embodiment of the invention, the set of tokenization instructions is included in a tokenization procedure associated with the document.
[0019]
In one embodiment of the invention, the document is received from a client at a tokenization server. In this embodiment, the tokenization server returns the tokenized document to the client.
[0020]
In one embodiment of the invention, the set of tokenization instructions is provided by a remote server over a network.
[0021]
(Detailed description)
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
[0022]
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), and DVDs (digital video discs), as well as computer instruction signals embodied in a transmission medium (with or without a carrier wave upon which the signals are modulated). For example, the transmission medium may include a communications network such as the Internet.
[0023]
(Distributed computer system)
FIG. 1 shows a distributed computer system 100 according to an embodiment of the present invention. Distributed computer system 100 includes clients 102 and 118, which are connected to index server 112 and search engine 122 via network 110.
[0024]
Network 110 may include any wired or wireless communication channel capable of connecting computer nodes to each other. This includes, but is not limited to, a local area network, a wide area network, or a combination of networks. In one embodiment of the present invention, network 110 includes the Internet.
[0025]
Clients 102 and 118 may include any node on network 110 that has computational capability and a mechanism for communicating across network 110.
[0026]
Client 102 includes a plurality of documents 104-106, which are integrated into an index 116 in index server 112. Index server 112 may include a node on a computer network that includes a mechanism for servicing requests from clients for computing and / or data storage resources. More specifically, index server 112 includes resources for generating index 116 in database 114 to index documents 104-106. Database 114 may include any type of mechanism for storing data in a non-volatile format. In one embodiment of the present invention, database 114 includes an ORACLE8® database distributed by Oracle Corporation of Redwood Shores, California.
[0027]
Client 118 includes a browser 120 that communicates with search engine 122 in order to scan index 116. Browser 120 may include any type of browser capable of viewing websites, such as the INTERNET EXPLORER® browser distributed by Microsoft Corporation of Redmond, Washington. Search engine 122 may include any type of computer system or application that can search data.
[0028]
In operation, index server 112 retrieves documents 104-106 from client 102 and utilizes documents 104-106 to generate index 116. Note that client 102 may send documents 104-106 to index server 112. Alternatively, index server 112 may collect documents 104-106 from client 102.
[0029]
Index server 112 generates index 116 by tokenizing selected portions of documents 104-106 and building the index from the resulting tokens. Note that client 102 may itself be a server that makes documents 104-106 available over network 110.
[0030]
After index 116 is generated, client 118 sends a query 124 to search engine 122 via browser 120. Query 124 may specify keywords of interest to a user of client 118. In response to query 124, search engine 122 searches index 116 for documents containing the matching keywords. If such documents are located, search engine 122 returns them to browser 120 in a list of query hits 126.
[0031]
(Index server)
FIG. 2 illustrates how index server 112 generates indexes for different document types according to an embodiment of the present invention. In FIG. 2, index server 112 receives a number of different documents from different sources and integrates these different types of documents into index 116. Note that index 116 may consist of a single index covering many different document types. Alternatively, index 116 may include a separate index for each document type.
[0032]
An unlimited number of document types can be indexed. For example, FIG. 2 shows a news document 202, a product catalog 204, a television program schedule 206, a document 208 containing user documentation, and a document 210 containing financial information. Each of these document types may have a different document structure, which may be defined in a language such as XML. Each of these document structures may be associated with a different indexing scheme. For some documents, certain attributes may be included in the index. For example, for some types of user documentation, it may be advantageous to index an attribute indicating whether the documentation is suited to novice or expert users.
[0033]
(Index style sheet and tokenization procedure)
FIG. 3 illustrates how an index style sheet and tokenization procedures are used to generate an index for a document according to an embodiment of the present invention. In FIG. 3, index builder 310 in index server 112 receives document 302 as input and generates index 312 for document 302. Index 312 becomes part of the larger index 116 for a collection of documents, which is contained in database 114 (from FIG. 1).
[0034]
During the index construction process, index builder 310 references index style sheet 304 and tokenization procedures 306-307. Index style sheet 304 includes a set of instructions for generating index 312 for document 302. For example, index style sheet 304 can specify which sections of document 302 are to be skipped in generating index 312. Index style sheet 304 can also specify attributes of document 302 that are to be included in index 312. For example, one attribute may specify the minimum security level a person needs in order to access document 302. Another attribute may specify a content rating (G, PG-13, PG, R, X) for document 302.
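As a rough illustration of what an index style sheet might express, the following Python sketch selects the parts of an XML document to index. The patent does not define a style sheet syntax; the dictionary layout, the element names, and the function name here are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document with attributes on the root element.
DOCUMENT = """<article rating="PG" security="2">
  <title>Flexible tokenization</title>
  <abstract>Tokens aid search</abstract>
  <body>Long text that is skipped</body>
</article>"""

# Hypothetical style sheet: which elements to skip and which
# document attributes to carry into the index.
STYLE_SHEET = {"skip": {"body"}, "attributes": ["rating"]}


def select_for_index(xml_text: str, style_sheet: dict) -> tuple[list[str], dict]:
    """Return the text sections and attributes the style sheet selects."""
    root = ET.fromstring(xml_text)
    texts = [
        child.text.strip()
        for child in root
        if child.tag not in style_sheet["skip"] and child.text
    ]
    attrs = {
        name: root.attrib[name]
        for name in style_sheet["attributes"]
        if name in root.attrib
    }
    return texts, attrs
```

Here the title and abstract are selected, the body is skipped, and only the content rating attribute is carried forward.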
[0035]
Tokenization procedures 306-307 specify how certain portions of document 302 are tokenized. For example, the tokenization procedure 306 specifies how a first part of the document 302 is tokenized, while the tokenization procedure 307 specifies how a second part of the document 302 is tokenized. Most documents will probably use one tokenization procedure, but other documents may include parts in different languages or parts from different domains that require different tokenization procedures.
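A minimal Python sketch of applying different tokenization procedures to different parts of one document follows. The section labels, the per-character procedure (a crude stand-in for a language without word delimiters), and all names are assumptions for illustration only.

```python
from typing import Callable

Tokenizer = Callable[[str], list[str]]


def whitespace_tokenizer(text: str) -> list[str]:
    """Split on whitespace, as is adequate for simple English text."""
    return text.split()


def character_tokenizer(text: str) -> list[str]:
    """Emit one token per non-space character (illustrative only)."""
    return [ch for ch in text if not ch.isspace()]


def tokenize_document(
    sections: list[tuple[str, str]],
    procedures: dict[str, Tokenizer],
    default: Tokenizer = whitespace_tokenizer,
) -> list[str]:
    """Tokenize each (label, text) section with the procedure assigned to it."""
    tokens: list[str] = []
    for label, text in sections:
        tokens.extend(procedures.get(label, default)(text))
    return tokens
```

A document with an English part and a Japanese part could then be passed as `[("en", "search engine"), ("ja", "検索")]` with `{"ja": character_tokenizer}`, so that each part is handled by its own procedure.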
[0036]
Index style sheet 304 is analogous to a formatting style sheet as specified by the XML standard. A formatting style sheet is used to specify display attributes, such as font and color, for displaying an XML document. Similarly, index style sheet 304 specifies how the index for document 302 is generated.
[0037]
Although FIG. 3 shows index and token instructions in the form of style sheets and procedures, it should be noted that other representations are possible. For example, the index and token instructions may be included in a plug-in module that may be plugged into index builder 310.
[0038]
These index and token instructions can also be accessed via objects defined in an object-oriented programming system. For example, an index parameter object may include a method for retrieving these instructions in order to build an index for document 302.
[0039]
Tokenization procedures 306-307 can also take the form of code modules containing tokenization instructions, or can be provided by a remote service over a network.
[0040]
Note also that index style sheet 304 may include a reference specifying where tokenization procedures 306-307 can be retrieved from.
[0041]
The index builder 310 includes a standardized interface that can receive input from many different index style sheets and tokenization procedures. This enables the index builder 310 to generate indexes for a number of different document types using a number of different tokenization rules.
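One way to picture such a standardized interface is a small Python protocol that any tokenization procedure can satisfy. This is a sketch under assumptions: the class names and the single `tokenize` method are hypothetical and are not described in the patent.

```python
from typing import Protocol


class TokenizationProcedure(Protocol):
    """Hypothetical interface every pluggable procedure implements."""

    def tokenize(self, text: str) -> list[str]: ...


class WhitespaceProcedure:
    def tokenize(self, text: str) -> list[str]:
        return text.split()


class LowercaseProcedure:
    def tokenize(self, text: str) -> list[str]:
        return text.lower().split()


class IndexBuilder:
    """Builds an inverted index with whichever procedure is plugged in."""

    def __init__(self, procedure: TokenizationProcedure) -> None:
        self.procedure = procedure
        self.index: dict[str, set[str]] = {}

    def add(self, doc_id: str, text: str) -> None:
        for token in self.procedure.tokenize(text):
            self.index.setdefault(token, set()).add(doc_id)
```

Because the builder depends only on the `tokenize` method, a new procedure can be supplied without changing the builder itself.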
[0042]
(Tokenization process)
FIG. 4 shows an example of the tokenization process. In this example, the text "MAY 7, 2000" is divided into three tokens 402-404. Token 402 contains the word "MAY", token 403 contains the day number "7", and token 404 contains the year number "2000". Each of these tokens is associated with a unique token number, which is used in generating the index. Using token numbers leads to a more compact representation, because a token number generally takes up less space than the string it stands for. Furthermore, during the search process it is easier to compare numbers than character strings.
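The mapping from tokens to unique token numbers can be sketched in Python as follows. The `TokenTable` class, the sequential numbering scheme, and the regular expression are assumptions for illustration; the patent does not say how token numbers are assigned.

```python
import re


class TokenTable:
    """Assigns each distinct token string a unique integer."""

    def __init__(self) -> None:
        self._numbers: dict[str, int] = {}

    def number_for(self, token: str) -> int:
        # A new token receives the next unused number; a known token
        # keeps the number it was given the first time.
        return self._numbers.setdefault(token, len(self._numbers))


def tokenize_date(text: str) -> list[str]:
    """Split on whitespace and punctuation, dropping the comma."""
    return re.findall(r"\w+", text)
```

Calling `tokenize_date("MAY 7, 2000")` yields the three tokens "MAY", "7", and "2000", and a fresh `TokenTable` numbers them 0, 1, and 2 in order of first appearance.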
[0043]
(Index generation process)
FIG. 5 is a flowchart illustrating a process for generating an index according to an embodiment of the present invention. The system starts by downloading a configuration file (step 502). This may involve downloading the configuration file over a network. The system then parses the configuration file (step 504) and identifies the address of index style sheet 304 within the configuration file (step 506).
[0044]
The system then downloads index style sheet 304 from the identified address (step 508). This may involve retrieving the style sheet over a network from a location specified by a Uniform Resource Locator (URL). Alternatively, the style sheet can be attached to the document. In either case, the index style sheet can be readily retrieved.
[0045]
The system then parses the index's stylesheet 304 so that the instructions in the index's stylesheet can be used during the index generation process (step 510).
[0046]
The system similarly identifies the address of the tokenization procedure 306 (step 512). (The address of the tokenization procedure 306 may be included in this configuration file or in the style sheet 304 of the index). The system then downloads the tokenization procedure 306 from the identified address (step 514). In some embodiments of the present invention, the tokenization procedure 306 may be retrieved from the same location as the style sheet 304 of the index. In another embodiment of the invention, the tokenization procedure 306 may be retrieved from another location.
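The address lookup in steps 502 through 512 might be sketched in Python as follows. The patent does not specify a configuration file format, so the JSON layout, the key names, and the injected `fetch` function (which stands in for a network download) are all hypothetical.

```python
import json
from typing import Callable


def resolve_addresses(
    config_text: str, fetch: Callable[[str], str]
) -> tuple[str, str]:
    """Return (style sheet address, tokenization procedure address)."""
    config = json.loads(config_text)                # step 504: parse config
    style_sheet_url = config["index_style_sheet"]   # step 506
    tokenizer_url = config.get("tokenization_procedure")
    if tokenizer_url is None:
        # The address may instead be given inside the style sheet.
        style_sheet = json.loads(fetch(style_sheet_url))  # steps 508, 510
        tokenizer_url = style_sheet["tokenization_procedure"]  # step 512
    return style_sheet_url, tokenizer_url
```

Passing `fetch` as a parameter keeps the sketch runnable without a network; a real system would substitute an HTTP download.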
[0047]
The system then enters a plurality of documents into index 116. This is accomplished by downloading a document to index builder 310 (step 518) and then parsing the document using the instructions specified in index style sheet 304 (step 520). The system converts the parsed document into tokens using tokenization procedure 306 (step 522) and generates an index using the tokens (step 524). This process is repeated for each document entered into index 116.
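The per-document loop of steps 518 through 524 can be sketched in Python as shown below. The in-memory dictionary of documents stands in for downloading, and `strip_tags` is a hypothetical, deliberately naive parser; neither comes from the patent.

```python
import re
from typing import Callable


def strip_tags(raw: str) -> str:
    """Naive parser: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", raw)


def build_index(
    documents: dict[str, str],
    parse: Callable[[str], str],
    tokenize: Callable[[str], list[str]],
) -> dict[str, set[str]]:
    """Map each token to the set of document ids that contain it."""
    index: dict[str, set[str]] = {}
    for doc_id, raw in documents.items():   # step 518: obtain a document
        text = parse(raw)                   # step 520: parse it
        for token in tokenize(text):        # step 522: tokenize it
            index.setdefault(token, set()).add(doc_id)  # step 524
    return index
```

For example, `build_index({"a": "<p>red fox</p>", "b": "<p>red hen</p>"}, strip_tags, str.split)` maps "red" to both documents and "fox" to document "a" only.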
[0048]
After the index 116 is completed, the system makes the index 116 available to the search engine 122 (from FIG. 1) so that the search engine 122 can scan the index 116 for query processing (step 526).
[0049]
FIG. 6 is a flowchart illustrating a process for dynamically generating an index that is updated for a document in accordance with an embodiment of the present invention. For some types of searches, the data has a limited lifetime. For example, in searching for current weather data, old weather data is not of interest.
[0050]
Under these circumstances, one embodiment of the present invention operates as follows. The system receives a search request (step 602). In response to the search request, the system checks the creation date and time of each document involved in the search (step 604). If the system determines that a document is out of date (for example, by comparing the document's age with an age limit), the system causes a new version of the document to be generated (for example, by collecting new weather data) (step 606). The system then generates an index for the new version of the document (step 608). This process implicitly removes the old version of the document (step 610). Finally, the system performs the search using the newly updated index.
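The refresh-on-search behavior can be sketched in Python as follows. This is a simplification under stated assumptions: it checks every stored document rather than only those involved in the search, it rebuilds the whole index on each request, and the `Document` class, the `regenerate` callback, and the age limit are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Document:
    doc_id: str
    text: str
    created_at: float  # creation time in seconds


def search(
    query: str,
    docs: dict[str, Document],
    regenerate: Callable[[str, float], Document],
    max_age: float,
    now: float,
) -> set[str]:
    """Refresh stale documents, re-index, then look up the query term."""
    for doc_id, doc in list(docs.items()):          # step 604: check ages
        if now - doc.created_at > max_age:
            docs[doc_id] = regenerate(doc_id, now)  # step 606: new version
    # Steps 608 and 610: indexing only the current versions implicitly
    # drops the old versions from the index.
    index: dict[str, set[str]] = {}
    for doc in docs.values():
        for token in doc.text.lower().split():
            index.setdefault(token, set()).add(doc.doc_id)
    return index.get(query.lower(), set())
```

A stale weather report is thus replaced before the query runs, so a search for the old report's wording no longer matches.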
[0051]
The foregoing description of embodiments of the present invention has been presented for purposes of illustration and description only. It is not intended to be exhaustive or to limit the invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art.
[0052]
For example, although the invention has been described with reference to a distributed computer system including clients and servers, the invention is not necessarily limited to a distributed client-server computer system. In general, the invention is applicable to any system that generates an index for text information or any system that tokenizes text information.
[0053]
Moreover, the above disclosure is not intended to limit the present invention. The scope of the invention is defined by the appended claims.
[Brief description of the drawings]
FIG. 1
FIG. 1 shows a distributed computer system according to an embodiment of the present invention.
FIG. 2
FIG. 2 illustrates how an index server generates indexes for different document types according to an embodiment of the present invention.
FIG. 3
FIG. 3 illustrates how indexing style sheets and tokenization procedures are used to index documents according to embodiments of the present invention.
FIG. 4
FIG. 4 shows an example of the tokenization process.
FIG. 5
FIG. 5 shows a flowchart illustrating a process for generating an index according to an embodiment of the present invention.
FIG. 6
FIG. 6 shows a flowchart illustrating a process for dynamically generating an updated index for old documents according to an embodiment of the present invention.

Claims

A method of tokenizing text in a document by converting the text into tokens corresponding to units of text having individual meanings to facilitate searching for the text in the document,
Receiving the document to be tokenized;
Retrieving a set of tokenized instructions associated with the document;
Tokenizing the document in a manner specified by the set of tokenizing instructions by translating the document into tokens corresponding to a unit of text having individual meanings.

Tokenizing the document comprises using a first set of tokenizing instructions for tokenizing a first section of the document; and a second tokenizing instruction for tokenizing a second section of the document. Using a set of

The method of claim 1, wherein the set of tokenized instructions is included in a plug-in module.

The method of claim 1, wherein the set of tokenized instructions is driven via an object defined in an object-oriented programming system.

The method of claim 1, further comprising using the tokenized document in generating an index for the document.

The method of claim 5, further comprising making the index available to a search engine so that the search engine can scan the index.

The method of claim 1, wherein retrieving the set of tokenization instructions comprises retrieving the set of tokenization instructions over a network from a remote address.

The method of claim 1, wherein the set of tokenization instructions is added to the document.

The method of claim 1, wherein the set of tokenization instructions is included in a tokenization procedure associated with the document.

The method of claim 1, wherein the document is received at a tokenization server from a client, the method further comprising returning the tokenized document from the tokenization server to the client.

The method of claim 1, wherein the set of tokenization instructions is provided by a remote service over a network.

A computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform a method of tokenizing text in a document by converting the text into tokens corresponding to units of text having individual meanings, to facilitate searching for the text in the document, the method comprising:
receiving the document to be tokenized;
retrieving a set of tokenization instructions associated with the document; and
tokenizing the document, in a manner specified by the set of tokenization instructions, by translating the document into tokens corresponding to units of text having individual meanings.

The computer-readable storage medium of claim 12, wherein tokenizing the document comprises using a first set of tokenization instructions to tokenize a first section of the document, and using a second set of tokenization instructions to tokenize a second section of the document.

The computer-readable storage medium of claim 12, wherein the set of tokenization instructions is included in a plug-in module.

The computer-readable storage medium of claim 12, wherein the set of tokenization instructions is driven via an object defined in an object-oriented programming system.

The computer-readable storage medium of claim 12, wherein the method further comprises using the tokenized document in generating an index for the document.

The computer-readable storage medium of claim 16, wherein the method further comprises making the index available to a search engine so that the search engine can scan the index.

The computer-readable storage medium of claim 12, wherein retrieving the set of tokenization instructions comprises retrieving the set of tokenization instructions from a remote address over a network.

The computer-readable storage medium of claim 12, wherein the set of tokenization instructions is added to the document.

The computer-readable storage medium of claim 12, wherein the set of tokenization instructions is included in a tokenization procedure associated with the document.

The computer-readable storage medium of claim 12, wherein the document is received at a tokenization server from a client, and wherein the method further comprises returning the tokenized document from the tokenization server to the client.

The computer-readable storage medium of claim 12, wherein the set of tokenization instructions is provided by a remote service over a network.

An apparatus for tokenizing text in a document by converting the text into tokens corresponding to units of text having individual meanings, to facilitate searching for the text in the document, the apparatus comprising:
a receiving mechanism configured to receive the document to be tokenized;
an instruction fetch mechanism configured to retrieve a set of tokenization instructions associated with the document; and
a tokenization mechanism configured to tokenize the document, in a manner specified by the set of tokenization instructions, by translating the document into tokens corresponding to units of text having individual meanings.

The apparatus of claim 23, wherein, if the document is of a hybrid token type, the tokenization mechanism is configured to use a first set of tokenization instructions to tokenize a first section of the document and a second set of tokenization instructions to tokenize a second section of the document.

The apparatus of claim 23, wherein the set of tokenization instructions is included in a plug-in module.

The apparatus of claim 23, wherein the instruction fetch mechanism is configured to fetch the set of tokenization instructions via an object defined in an object-oriented programming system.

The apparatus of claim 23, further comprising an index generation mechanism configured to use the tokenized document in generating an index for the document.

The apparatus of claim 27, further comprising an access mechanism configured to make the index available to a search engine so that the search engine can scan the index.

The apparatus of claim 23, wherein the instruction fetch mechanism is configured to fetch the set of tokenization instructions over a network from a remote address.

The apparatus of claim 23, wherein the set of tokenization instructions is added to the document.

The apparatus of claim 23, wherein the set of tokenization instructions is included in a tokenization procedure associated with the document.

The apparatus of claim 23, further comprising a tokenization server configured to receive the document from a client, wherein the tokenization server is configured to return the tokenized document to the client.

The apparatus of claim 23, wherein the set of tokenization instructions is provided by a remote service over a network.
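
The steps recited in the method claims (receive a document, retrieve the tokenization instructions associated with it, tokenize each section accordingly, and use the result to generate an index) can be illustrated with a minimal Python sketch. It is not part of the claims or of any disclosed implementation. All names here (`INSTRUCTIONS`, `retrieve_instructions`, `tokenize_document`, `build_index`, the document types and section names) are hypothetical and chosen only for illustration; the in-memory registry stands in for whatever plug-in module, object, or remote service actually supplies the instructions.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Set

Tokenizer = Callable[[str], List[str]]


def word_tokenizer(text: str) -> List[str]:
    """Split text into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rating_tokenizer(text: str) -> List[str]:
    """Keep hyphenated codes such as 'TV-PG' together as one token."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


# Hypothetical registry: document type -> section name -> tokenizer.
# A section with no entry is simply not indexed.
INSTRUCTIONS: Dict[str, Dict[str, Tokenizer]] = {
    "article": {"title": word_tokenizer, "abstract": word_tokenizer},
    "tv_schedule": {"program": word_tokenizer, "rating": rating_tokenizer},
}


def retrieve_instructions(doc_type: str) -> Dict[str, Tokenizer]:
    """Retrieve the set of tokenization instructions for a document type."""
    return INSTRUCTIONS[doc_type]


def tokenize_document(doc: dict) -> List[str]:
    """Tokenize each section with the tokenizer assigned to that section."""
    instructions = retrieve_instructions(doc["type"])
    tokens: List[str] = []
    for section, text in doc["sections"].items():
        tokenizer = instructions.get(section)
        if tokenizer is None:
            continue  # no instructions for this section: skip it
        tokens.extend(tokenizer(text))
    return tokens


def build_index(docs: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Build an inverted index mapping each token to the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in tokenize_document(doc):
            index[token].add(doc_id)
    return index
```

In this sketch a document with several sections is handled in the manner of the hybrid case: each section is tokenized by its own set of instructions, and the body of an article is left out of the index because no instructions are registered for it.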