JP2001222545A

JP2001222545A - Similar document search device, similar document search method, and recording medium

Info

Publication number: JP2001222545A
Application number: JP2000031598A
Authority: JP
Inventors: Shigemi Nakazato; 茂美中里; Tsutomu Kobayashi; 勉小林; Yukio Nakamoto; 幸夫中本; Takuya Nishina; 卓哉仁科; Hiroshi Yamazaki; 弘山崎; Takeshi Matsukuma; 剛松隈
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2000-02-09
Filing date: 2000-02-09
Publication date: 2001-08-17

Abstract

(57) [Abstract]
[Problem] To enable highly accurate extraction of similar documents even when the search goes back through past categories and the expressions used for synonymous words have changed.
[Solution] When classifications such as a new category, old category [1], and old category [2] have been made in chronological order, link information between categories and link information for words whose expressions change are stored in memory. After the category to which the document given as the search key belongs is identified, old category [1] and old category [2], which are related to the identified category, are made search targets on the basis of the information stored in the memory, and documents similar to the search key document are extracted using the word link information. As a result, similar documents can be extracted with high accuracy even when the search goes back to past categories.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a similar document search device that retrieves similar documents from a database, and in particular to a similar document search device suited to searching for similar documents among documents classified by category, and to a similar document search method and a recording medium used with this device.

[0002]

[Prior Art] Conventionally, there are systems in which a large number of documents of various kinds are stored in a database and documents similar to a designated document (hereinafter called the search key document) are retrieved automatically. Such a system compares the words contained in the search key document with the words contained in each search target document, calculates a similarity from the types, locations, and frequencies of the shared words using the vector space method or a similar technique, and outputs the documents with high similarity as the search result.
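As a rough illustration of the vector space method mentioned above, the sketch below computes the cosine similarity between two documents represented as lists of words. It is a generic textbook formulation, not the specific calculation used by the systems described here; the function name `cosine_similarity` and the use of raw term frequencies are assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(key_words, target_words):
    """Cosine of the angle between two term-frequency vectors."""
    key_tf = Counter(key_words)
    target_tf = Counter(target_words)
    # Only words that occur in both documents contribute to the dot product.
    shared = key_tf.keys() & target_tf.keys()
    dot = sum(key_tf[w] * target_tf[w] for w in shared)
    norm_key = math.sqrt(sum(c * c for c in key_tf.values()))
    norm_target = math.sqrt(sum(c * c for c in target_tf.values()))
    if norm_key == 0 or norm_target == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_key * norm_target)
```

Documents that share no words score 0.0, which is exactly the weakness the invention addresses when the same item is written differently in older documents.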

[0003] In this type of system, the documents in the database are classified into predetermined categories, the category of the documents to be searched is identified, and the similar document search is then performed on the documents belonging to that category. The number and definitions of the categories are not permanently fixed; they are updated, for example, from year to year. The present inventors therefore developed a technique that links each updated category to the related categories from before the update and, at search time, treats the documents contained in the linked categories as the search target documents (Japanese Patent Application No. H11-328330).

[0004]

[Problems to Be Solved by the Invention] The conventional technique is effective in that documents contained in the multiple categories linked across updates can be extracted as search target documents. However, as the search goes back through the chronologically linked categories, the wording of terms often changes. Despite this, the similarity was calculated directly from the words contained in the current search key document and the words contained in the different, older search target documents, so the accuracy of the extracted similar documents fell the further back the search went.

[0005] The present invention was made to solve this problem. Its object is to provide a similar document search device that improves the accuracy of similarity calculation and similar document retrieval when using a database whose categories are updated over time, together with a similar document search method used with this device.

[0006]

[Means for Solving the Problems] To achieve the above object, the similar document search device of the present invention comprises: a database that stores a plurality of documents classified into predetermined categories; first storage means that stores category link information which, for each category change, associates documents belonging to a first category, representing the classification used in a first period, with documents belonging to a second category, representing the classification used in a second period; second storage means that stores word link information associating words that are synonymous but are expressed differently in the first category and the second category; category identification means that identifies, from among the first categories, the category to which a document input as the search key belongs; determination means that determines, from the link information stored in the first storage means, the second category related to the first category identified by the category identification means; search means that takes the documents belonging to the first category and the second category obtained by the determination means as search targets and retrieves from the database documents similar to the search key document, using the word link information stored in the second storage means; and output means that outputs the similar documents obtained by the search means as the search result for the search key document. With this configuration, when similarity calculation and similar document retrieval are performed on a database whose categories are updated over time, the documents in the linked categories can be searched easily. In addition, even if the wording of terms has changed going back in time, those terms are associated by the word link information, so similar document retrieval can be performed with high accuracy on the documents of all the target categories.

[0007]

[Embodiments of the Invention] Before the embodiments of the present invention are described, an overview of the similar document search device of the present invention is given to aid understanding.

[0008] When retrieving similar documents from a database in which a plurality of documents classified into predetermined categories are registered, the similar document search device of the present invention uses link information that, for each category change, associates the post-change category with the pre-change categories, so that related categories are successively added to the search targets. It also links the changes in word expression between the linked categories. That is, if categories were merged at some point, the categories from before the merge are also searched; if a category was split, the category from before the split is also searched from each of the resulting categories. If multiple categories were reorganized at some point, the relevant documents in the pre-reorganization categories are narrowed down using a search expression or the like and made search targets. A further feature is that word expressions that change over time between these categories are also linked.

[0009] For example, as shown in FIG. 1, suppose that classifications called the new category, old category [1], and old category [2] were made in chronological order. The new category is the latest classification and here contains categories [11] to [15]. Old category [1] is the preceding classification and here contains categories [21] to [25]. Old category [2] is the classification before that (two generations back) and here contains categories [31] to [33] and [25].

[0010] In this example, categories [11] and [12] are merged into the single category [21] in old category [1]. In old category [2], category [21] is in turn merged, together with category [22], into category [31]. Similarly, the other categories [13] to [15] are merged into the same category or split into different categories in old categories [1] and [2].

[0011] As a concrete example, suppose that in the new category, category [11] denotes the field "soft tennis" and category [12] the field "hard-ball tennis". In old category [1] these were merged as the field "tennis" in category [21], and in old category [2] they were further merged as the field "sports" in category [31].
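The category links in this example can be pictured as a small graph that is walked backwards in time. The following is a minimal sketch under assumed data structures: the table `CATEGORY_LINKS` and the function `linked_categories` are hypothetical names, and only the links stated in the tennis example are filled in.

```python
from collections import deque

# Hypothetical category link table: each category maps to the categories it
# corresponds to in the previous generation (numbers follow the FIG. 1 example).
CATEGORY_LINKS = {
    "11": ["21"],  # soft tennis -> tennis
    "12": ["21"],  # hard-ball tennis -> tennis
    "21": ["31"],  # tennis -> sports
    "22": ["31"],
}


def linked_categories(start, links=CATEGORY_LINKS):
    """Return the start category plus every older category reachable from it."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for older in links.get(current, []):
            if older not in seen:  # a category may be reachable by several paths
                seen.append(older)
                queue.append(older)
    return seen
```

Starting from category "11", the walk yields "11", "21", and "31", which are the categories whose documents would become search targets in this example.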

[0012] As another example, suppose that category [14] in the new category denotes the field "transportation". In old category [1] it was divided into the field "automobiles" in category [23] and the field "bicycles" in category [24]. Suppose further that an item usually expressed by the word "4WD" in the era of the new category containing category [14] was usually expressed by the word "四輪駆動車" ("four-wheel-drive vehicle") in the era of old category [1] containing category [23]. In such a case, because both words clearly refer to the same item, the two words are linked and treated as the same word.
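One simple way to treat two linked words as the same word is to map every variant to a single canonical form before comparing documents. The sketch below assumes a plain dictionary as the word link table; `WORD_LINKS` and `normalize` are hypothetical names, and the patent does not prescribe this representation.

```python
# Hypothetical word link table: each older expression maps to the current one.
WORD_LINKS = {
    "四輪駆動車": "4WD",  # "four-wheel-drive vehicle" -> "4WD"
}


def normalize(words, word_links=WORD_LINKS):
    """Replace linked variants with their canonical form; leave others as-is."""
    return [word_links.get(w, w) for w in words]
```

After normalization, an old document that says "四輪駆動車" and a new one that says "4WD" share a word and can therefore contribute to the similarity score.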

[0013] In this way, even after the categories are updated, documents similar to the search key document can be extracted from old category [1] and old category [2], which are linked to the new category, and the accuracy of similarity calculation and similar document retrieval can be improved.

[0014] An embodiment of the present invention that realizes this kind of similar document search is described below.

[0015] FIG. 2 shows the hardware configuration of a similar document search device according to one embodiment of the present invention. The device is built as one function on a computer with a general-purpose architecture.

[0016] As shown in FIG. 2, the device consists of a control device 1; an input device 2 comprising a keyboard, a pointing device, and a scanner; a display device 3 that displays similar document search results and the like; and an external storage device 4. The control device 1 consists of a control unit 1a, which is a CPU, and a memory unit 1b, and controls the whole device according to a predetermined program. The input device 2 is used, for example, to enter search conditions. The display device 3 is, for example, a CRT (cathode-ray tube) or LCD (liquid crystal display) and displays the similar document search results and the like. The external storage device 4 is, for example, a magnetic disk device or an optical disk device and holds the various data handled by the device.

[0017] The external storage device 4 holds the following: a document database 4a in which a plurality of documents are registered by predetermined category; a category identification document database 4b in which, for each category of the new (current) classification, the IDs of the corresponding documents are registered; a search target category document information database 4c in which information such as the search target category, the names of its preceding-period categories, document IDs, and document registration dates is registered; a word conversion information database 4d in which word IDs and their corresponding words in a particular category are registered as pairs; a word weight table database 4e in which the weights of the words appearing in each category are registered; and a time-series word conversion information database 4f in which word IDs and their corresponding words in the categories going back in time from a particular category are registered as pairs.

[0018] FIG. 3 shows the internal configuration of the control device 1. The control unit 1a consists of a CPU and ROM, and the figure shows each of the functions they execute as a separate block. The memory unit 1b consists of RAM and is shown divided according to the content of the data stored in it.

[0019] The control unit 1a has an initialization unit 201, an input unit 203, an output unit 205, a search key document input unit 207, a search key document item cutout unit 209, a search key word extraction unit 211, a search key word appearance frequency calculation unit 213, a similar category identification unit 215, a search target category document information reading unit 217, a specific category word conversion information reading unit 219, a time-series word weight reading unit 221, a search target document reading unit 223, a search target document item cutout unit 225, a search target document word extraction unit 227, a search target word appearance frequency calculation unit 229, a common word extraction unit 231, a similarity calculation unit 233, a previous category search unit 235, a time-series word conversion information reading unit 237, an inter-category word conversion information generation unit 239, and a search result output unit 241.

[0020] The memory unit 1b has a search key document storage buffer unit 251, a search key document item storage buffer unit 253, a search key word information storage buffer unit 255, a specific category storage buffer unit 257, a search target category document information storage buffer unit 259, a previous category search buffer unit 261, a specific category word conversion information storage buffer unit 263, a time-series word weight reading buffer unit 265, a search target document storage buffer unit 267, a search target document item storage buffer unit 269, a search target word information storage buffer unit 271, a common word information storage buffer unit 273, an inter-category word conversion information storage buffer unit 275, a similarity storage buffer unit 277, a time-series word conversion information storage buffer unit 279, and a search result output buffer unit 281.

[0021] The initialization unit 201 initializes each buffer unit. The input unit 203 accepts input such as the search key document settings in response to the user's operations on the input device 2. The output unit 205 outputs to the display device 3 the search key document entered via the input unit 203 and the content of the search results.

[0022] The search key document input unit 207 stores the text of the search key document entered from the input device in the search key document storage buffer unit 251.

[0023] The search key document item cutout unit 209 analyzes the sentence structure of the search key document stored in the search key document storage buffer unit 251, cuts it into items, each of which is a coherent unit of content, and stores the items one by one in the search key document item storage buffer unit 253. The search key word extraction unit 211 segments the text stored in the search key document item storage buffer unit 253 into words. It then extracts the words that are key to representing the content of the document and stores the extracted word types in the search key word information storage buffer unit 255. For example, word segmentation is performed using morphological analysis or the like, and the key words are selected using the part-of-speech information of each word (for example, "noun" or "sa-hen noun", that is, a verbal noun).
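The part-of-speech filter can be sketched as follows. The example assumes the text has already been segmented and tagged by some morphological analyzer, which is not shown; the tag names in `KEY_POS` and the function `extract_key_words` are hypothetical and do not correspond to the output format of any particular analyzer.

```python
# Part-of-speech tags treated as content-bearing (tag names are assumptions).
KEY_POS = {"noun", "sa-hen noun"}


def extract_key_words(tagged_tokens, key_pos=KEY_POS):
    """Keep the surface form of every token whose tag is in key_pos."""
    return [surface for surface, pos in tagged_tokens if pos in key_pos]


# Pre-tagged tokens for a fragment meaning "stored in the information database".
tagged = [
    ("情報", "noun"),          # information
    ("データベース", "noun"),  # database
    ("に", "particle"),
    ("格納", "sa-hen noun"),   # storage
    ("さ", "verb"),
    ("れ", "auxiliary"),
]
key_words = extract_key_words(tagged)
```

Particles, verbs, and auxiliaries are dropped, so only the three content words remain in `key_words`.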

[0024] For each key word extracted by the search key word extraction unit 211, the search key word appearance frequency calculation unit 213 calculates, per word type, how often the word appears in the text stored in the search key document item storage buffer unit 253, and stores the result in the search key word information storage buffer unit 255.

[0025] The similar category identification unit 215 calculates the similarity between the search key and each document from the word information of the category-labeled documents in the category identification document database 4b stored on the external storage device and the word information of the search key document stored in the search key word information storage buffer unit 255. From each document's similarity and the category assigned to that document, the unit identifies an arbitrary number of categories for the search key document and stores the identified categories in the specific category storage buffer unit 257.
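The text does not say how the per-document similarities are turned into categories, so the sketch below makes one possible choice explicit: each category is scored by its best-matching labeled document, and the top-scoring categories are returned. The functions `cosine` and `identify_categories` and the `top_n` parameter are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {word: frequency} dicts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def identify_categories(key_tf, labeled_docs, top_n=1):
    """Rank categories by the best similarity of any of their labeled documents.

    labeled_docs is a list of (category, {word: frequency}) pairs.
    """
    best = {}
    for category, doc_tf in labeled_docs:
        score = cosine(key_tf, doc_tf)
        best[category] = max(best.get(category, 0.0), score)
    ranked = sorted(best, key=lambda c: best[c], reverse=True)
    return ranked[:top_n]
```

Other aggregation rules, such as averaging the similarities within a category or voting among the nearest documents, would fit the description equally well.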

[0026] The search target category document information reading unit 217 reads from the external storage device 4 the search target category document information for the search target categories stored in the specific category storage buffer unit 257 and stores it in the search target category document information storage buffer unit 259. The search target category document information consists of the category name, the names of the preceding-period categories, the identification numbers (document IDs) of the documents in the category, the field codes of the time-series word conversion information belonging to the category (a category code and the ID of a word contained in that category), and so on. The number of documents registered in a category, its period, and the number of its preceding-period categories may differ from category to category.

[0027] The specific category word conversion information reading unit 219 reads from the external storage device 4 the word conversion information for the specific category found to be similar to the search key document and stores it in the specific category word conversion information storage buffer unit 263. The specific category word conversion information consists of field codes and the word strings corresponding to those codes.

[0028] The time-series word weight reading unit 221 reads from the external storage device 4 the word weight table for the category stored in the search target category document information storage buffer unit 259 and stores the weights of the words contained in the documents of that category in the time-series word weight reading buffer unit 265. Because word weights vary by category and by period, the word weight tables are stored on the external storage device 4 per category and per period.

[0029] To build a document database from the information on the documents stored on the external storage device, the search target document reading unit 223 reads the documents to be placed in the document database from the external storage device and stores their text in the search target document storage buffer unit 267.

[0030] The search target document item cutout unit 225 analyzes the sentence structure of the search target document stored in the search target document storage buffer unit 267, cuts it into items, each of which is a coherent unit of content, and stores the items one by one in the search target document item storage buffer unit 269. The search target document word extraction unit 227 segments the text stored in the search target document item storage buffer unit 269 into words. It then extracts the words that are key to representing the content of the document or item and stores the extracted words in the search target word information storage buffer unit 271.

[0031] For each key word extracted by the search target document word extraction unit 227, the search target word appearance frequency calculation unit 229 calculates, per word, how often the key word appears in the text stored in the search target document storage buffer unit 267 or the search target document item storage buffer unit 269, and stores the result in the search target word information storage buffer unit 271.

[0032] The common word extraction unit 231 takes the word information of the search key document stored in the search key word information storage buffer unit 255 and the word information of the search target document stored in the search target word information storage buffer unit 271 and, using the inter-category word conversion information in the inter-category word conversion information storage buffer unit 275, stores in the common word information storage buffer unit 273 the words held in both buffers together with their frequencies. That is, even when a word stored in the search key word information storage buffer unit 255 differs from a word stored in the search target word information storage buffer unit 271, if the inter-category word conversion information storage buffer unit 275 holds information that the two are related words, this is taken into account and the words are stored in the common word information storage buffer unit 273.
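The common word extraction described above might be sketched as follows. The dictionaries stand in for the buffer units, and `common_words` and its parameters are hypothetical names; the sketch prefers an exact match and falls back to the linked word only when the exact word is absent, which is a simplification of whatever the actual unit does.

```python
def common_words(key_tf, target_tf, word_links):
    """Map each shared word to its (key frequency, target frequency) pair.

    key_tf and target_tf are {word: frequency} dicts. word_links maps a word
    used in the key document's category to the related word used in the
    target document's category.
    """
    common = {}
    for word, key_count in key_tf.items():
        if word in target_tf:
            # Same spelling in both documents.
            common[word] = (key_count, target_tf[word])
        elif word_links.get(word) in target_tf:
            # Different spelling, but linked as related words.
            common[word] = (key_count, target_tf[word_links[word]])
    return common
```

With the link from "4WD" to "四輪駆動車", a key document containing "4WD" and a target document containing "四輪駆動車" produce a common entry even though the strings differ.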

[0033] Using the contents of the search key word information storage buffer unit 255, the search target word information storage buffer unit 271, the common word information storage buffer unit 273, and the time-series word weight reading buffer unit 265, the similarity calculation unit 233 weights each word's frequency by its word weight, calculates the similarity between the search key document and one search target document by a technique such as the word vector space method, and stores the similarity value in the similarity storage buffer unit 277.
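A weighted variant of the vector space calculation could look like the sketch below. It is one plausible reading rather than the patented formula: frequencies are multiplied by a per-word weight, and linked words are first folded onto a shared axis so that they count as the same term. The function `weighted_similarity`, the single `weights` table, and the default weight of 1.0 are all assumptions.

```python
import math


def weighted_similarity(key_tf, target_tf, weights, word_links=None):
    """Weighted cosine similarity between two {word: frequency} dicts.

    word_links maps an older expression to the current one, so that linked
    words fall on the same vector axis. Missing weights default to 1.0.
    """
    word_links = word_links or {}

    def to_vector(tf):
        vec = {}
        for word, count in tf.items():
            axis = word_links.get(word, word)
            vec[axis] = vec.get(axis, 0.0) + weights.get(axis, 1.0) * count
        return vec

    key_vec, target_vec = to_vector(key_tf), to_vector(target_tf)
    dot = sum(key_vec[a] * target_vec[a] for a in key_vec.keys() & target_vec.keys())
    norm = math.sqrt(sum(v * v for v in key_vec.values())) * math.sqrt(
        sum(v * v for v in target_vec.values())
    )
    return dot / norm if norm else 0.0
```

In a fuller implementation the weights would presumably be looked up per category and per period, as the word weight tables are described as being stored that way.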

[0034] The previous category search unit 235 starts the search target category document information reading unit 217 and, if a previous category is stored in the previous category search buffer unit 261, reads that category's search target category document information from the external storage device 4 and stores the information for the preceding-period category in the search target category document information storage buffer unit 259. This search target category document information is the category link information that associates a given category with its preceding-period categories.

[0035] The time-series word conversion information reading unit 237 reads the time-series word conversion information stored on the external storage device 4 and stores it in the time-series word conversion information storage buffer unit 279. The time-series word conversion information consists of field codes (a category code and the ID of a word contained in that category) and the word strings corresponding to those field codes.

[0036] The inter-category word conversion information generation unit 239 compares the field codes of the similar category identified as similar to the search key document, which are stored in the specific category word conversion information storage buffer unit 263, with the field codes stored in the time-series word conversion information storage buffer unit 279. It treats words with the same field code (that is, the same field code but different word strings) as the same word and stores that information in the inter-category word conversion information storage buffer unit 275. This inter-category word conversion information is the word link information that associates words that are synonymous but written differently in different categories.
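Generating the inter-category word conversion information amounts to joining two tables on their field codes. A minimal sketch, assuming each table is a dictionary from field code to word string; the function `build_word_links` and the code values such as "F001" are hypothetical.

```python
def build_word_links(current_words, past_words):
    """Link words that share a field code but differ in their word string.

    Both arguments map a field code to a word string. The result maps each
    current word to the differently written past word.
    """
    links = {}
    for code, current in current_words.items():
        past = past_words.get(code)
        if past is not None and past != current:
            links[current] = past
    return links


# Hypothetical field codes; only F001 is written differently across periods.
current = {"F001": "4WD", "F002": "道"}
past = {"F001": "四輪駆動車", "F002": "道", "F003": "自転車"}
word_links = build_word_links(current, past)
```

Here only "4WD" is linked, to "四輪駆動車"; "道" is spelled identically in both periods and needs no link, and "自転車" has no counterpart in the current table.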

[0037] From the per-document similarities stored in the similarity storage buffer unit 277, the search result output unit 241 stores the document information (for example, document IDs) of the highest-similarity documents in the search result output buffer unit 281. It then has the contents of the search result output buffer unit 281 output to the display device 3 via the output unit 205.
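Selecting the highest-similarity documents is a simple sort and cut. The sketch below assumes the similarities are held as a dictionary from document ID to score; `top_documents` and the `limit` parameter are hypothetical, since the text does not say how many results are kept.

```python
def top_documents(similarities, limit=10):
    """Return up to `limit` document IDs, highest similarity first."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:limit]]
```

For example, `top_documents({"D1": 0.2, "D2": 0.9, "D3": 0.5}, limit=2)` keeps "D2" and "D3" in that order.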

[0038] Next, the operation of the similar document search device of this embodiment is described with reference to FIG. 4. The operations described here are executed by the CPU of the control unit 1a of the control device 1, using the program in the ROM and the storage areas in the RAM described as the memory unit 1b.

[0039] First, the initialization unit 201 starts and clears the memory unit 1b (step 400). Next, the search key document input unit 207 starts, and when a search key document is input via the input device 2, it stores the search key document as a text document in the search key document storage buffer unit 251 (step 401). Here a patent application specification is used as the input search key document; an example is shown in FIG. 5. The search key document item extraction unit 209 analyzes the structure of the search key document stored in the search key document storage buffer unit 251, cuts it into items, each a coherent unit of content, and stores each item in the search key document item storage buffer unit 253 in a format such as that illustrated in FIG. 6. In the storage example of FIG. 6, items are cut out at "[Title of the Invention]", "[Claims]", and "[Abstract]" and stored in the buffer item by item. Next, the search key word extraction unit 211 starts, segments the text into words using morphological analysis or a similar technique, extracts key words that represent the content of the document, such as nouns and sa-hen nouns (verbal nouns), and stores them in the search key word information storage buffer unit 255 as shown in FIG. 7. For example, from "... of the road information stored in the information database ..." it extracts "information / database / storage / road / information / ... / 4WD". Next, the search key word occurrence frequency calculation unit 213 starts and, for each word stored in the search key word information storage buffer unit 255, calculates its frequency of occurrence in the text of the items stored in the search key document item storage buffer unit 253 and stores the result in the search key word information storage buffer unit 255 (step 402). For example, as shown in FIG. 8, each word is written alongside its frequency, so that if "4WD" occurs twice it is stored as "4WD=2". This completes the extraction of search key document information.
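The frequency step (step 402) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the input is taken to be a list of key words that has already been produced by morphological analysis (which is not performed here), and the function name `count_key_words` and the sample word list are hypothetical.

```python
from collections import Counter


def count_key_words(words):
    """Count occurrences of each extracted key word.

    Returns entries in the "word=frequency" form described for FIG. 8,
    in order of first appearance. The input is assumed to be the output
    of a separate word-segmentation step.
    """
    counts = Counter(words)
    return [f"{word}={n}" for word, n in counts.items()]


# Hypothetical segmented key words from a search key document.
key_words = ["information", "database", "storage", "road",
             "information", "4WD", "4WD"]
print(count_key_words(key_words))
```

`Counter` keeps keys in insertion order, so the output lists each word once, in the order it first appeared.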

[0040] Processing then proceeds to similar category identification in step 403. The similar category identification unit 215 calculates the similarity between the search key document and each categorized document from the word information of the categorized documents in the category identification document database 4b of the external storage device 4 and the word information of the search key document stored in the search key word information storage buffer unit 255. From the similarity of each categorized document and the categories assigned to those documents, it identifies an arbitrary number of categories for the search key document (for example, the top two). The categories identified in this way are stored in the specific category storage buffer unit 257. The categorized document information used in step 403 includes, as shown for example in FIG. 9, each new category and the document IDs to which that category is assigned. For each identified category, the category name is stored as shown in FIG. 10.
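One way to picture step 403 is the sketch below. The text does not say which similarity measure is used here or how document similarities are combined into a category choice, so two assumptions are made and should be read as such: Jaccard overlap of word sets stands in for the similarity measure, and each category is scored by its best-matching document. All names and sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two word collections (stand-in similarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def identify_categories(key_words, categorized_docs, top_n=2):
    """Pick the top_n categories for the search key document.

    Assumption: a category's score is the highest similarity among
    the categorized documents that carry that category.
    """
    best = {}
    for doc in categorized_docs:
        score = jaccard(key_words, doc["words"])
        cat = doc["category"]
        best[cat] = max(best.get(cat, 0.0), score)
    ranked = sorted(best, key=lambda c: best[c], reverse=True)
    return ranked[:top_n]


# Hypothetical categorized documents.
docs = [
    {"category": "traffic", "words": ["road", "4WD", "information"]},
    {"category": "cooking", "words": ["recipe", "oven"]},
    {"category": "communication", "words": ["information", "network"]},
]
print(identify_categories(["information", "road", "4WD", "database"], docs))
```

Other aggregation rules (for example, averaging or voting over the most similar documents) would fit the description equally well.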

[0041] When the identification of similar categories in step 403 is complete, processing proceeds to category reading in step 404. The search target category document information reading unit 217 obtains the search target category document information for the categories stored in the specific category storage buffer unit 257 from the search target category document information database 4c of the external storage device 4 and stores it in the search target category document information storage buffer unit 259. For example, as shown in FIG. 11, it stores the category name "category ID=14, traffic", the storage period "1999.1 to 1999.2", previous category 1 "category ID=23, automobile", previous category 2 "category ID=24, bicycle", and the IDs and dates (year and month) of the corresponding documents. This indicates that the category "traffic" currently exists and holds documents from January 1999 to February 1999, and that under the classification used before January 1999 what is now "traffic" was divided into the categories "automobile" and "bicycle". If the classification relating to the category "traffic" has not been updated, "category" and "previous category" are represented by the same category name. The previous category names contained in this search target category document information (here "automobile" and "bicycle") are also stored in the previous category search buffer unit 261, as shown in FIG. 12.
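The category record described for FIG. 11 can be modeled roughly as below. This is a sketch only: the class `CategoryRecord`, its field names, and the helper `categories_to_search` are hypothetical, and skipping a previous category that is identical to the current one is an assumption about how the "no update" case would be handled, not something the text states.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CategoryRecord:
    """Hypothetical in-memory form of one search target category entry."""
    category_id: int
    name: str
    period: str                                   # e.g. "1999.1-1999.2"
    previous: List[Tuple[int, str]] = field(default_factory=list)
    documents: List[Tuple[int, str]] = field(default_factory=list)  # (doc ID, year.month)


def categories_to_search(record):
    """Current category first, then each distinct previous category."""
    targets = [(record.category_id, record.name)]
    for prev in record.previous:
        if prev not in targets:   # assumption: unchanged category is searched once
            targets.append(prev)
    return targets


traffic = CategoryRecord(14, "traffic", "1999.1-1999.2",
                         previous=[(23, "automobile"), (24, "bicycle")])
print(categories_to_search(traffic))
```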

[0042] Next, in step 405, the specific category word conversion information is read. The specific category word conversion information reading unit 219 refers to the specific category stored in the specific category storage buffer unit 257, reads the word conversion information corresponding to that category from the word conversion information database 4d of the external storage device 4, and stores it in the specific category word conversion information storage buffer unit 263. The specific category word conversion information consists of field codes (a category code and the ID of a word contained in that category) and the word character string corresponding to each code. For example, as in FIG. 13, a category ID, a word ID within that category, and the word character string corresponding to that word ID are stored; here the entry "word ID=101, word=4WD" is stored. Word IDs are assigned in the order in which new words appear, but, as described later, an operator corrects them so that words to be associated across categories share a common ID.

[0043] Following step 405, word weights are read in step 406. The time-series word weight reading unit 221 reads, from the word weight table database 4e of the external storage device 4, the word weight table corresponding to the category stored in the search target category document information storage buffer unit 259, that is, the table specific to that category, and stores the word weight values in the time-series word weight reading buffer unit 265 as in FIG. 14. This example shows the weight values of words contained in search target documents belonging to "category=traffic" and "storage period=1999.1 to 1999.2", stored as word and weight pairs such as "automobile=10" and "motor=5". Here, a word that appears in many documents is given a large "weight", and conversely a word that appears in few documents is given a small weight. That is, because a word that appears in many documents contributes little to the similarity judgment, the weighting described above is applied.

[0044] In step 406, if a previous category of the specific category is stored in the search target category document information storage buffer unit 259, the word weight table corresponding to that previous category is stored in the time-series word weight reading buffer unit 265, because, as noted above, word weight tables are specific to each category.

[0045] Next, in step 407, a search target document is extracted. The search target document reading unit 223 reads the text of one search target document from the external storage device 4 and stores it in the search target document storage buffer unit 267, as illustrated in FIG. 15.

[0046] After the search target document has been extracted, search target document information is extracted in step 408. First, the search target document item extraction unit 225 starts and, just as the search key document item extraction unit 209 did for the search key document, analyzes the structure of the search target document stored in the search target document storage buffer unit 267, cuts it into items, each a coherent unit of content, and stores each item in the search target document item storage buffer unit 269. In the example of FIG. 16, items are cut out at "Title of the Invention", "Claims", "Abstract", and so on, and stored in the buffer unit 269 item by item. The search target word extraction unit 227 then segments the text into words using morphological analysis or a similar technique, extracts key words that represent the content of the document, such as nouns and sa-hen nouns (verbal nouns), and stores the word types in the search target word information storage buffer unit 271. In the example of FIG. 17, "information / database / accumulation / roadway / ... / four-wheel drive vehicle" is extracted from "... of the roadway information accumulated in the information database ..." and stored in the buffer unit 271. Next, the search target word occurrence frequency calculation unit 229 starts, calculates the frequency of occurrence in the text document of each word stored in the search target word information storage buffer unit 271, and stores the result back in the search target word information storage buffer unit 271. For example, as shown in FIG. 18, each word is written alongside its frequency, so that if "four-wheel drive vehicle" occurs twice it is stored as "four-wheel drive vehicle=2".

[0047] When the extraction of search target document information is complete, processing proceeds to similarity calculation in step 409, where the similarity between the search key document and each search target document is calculated. First, the common word extraction unit 231 starts and stores the words that appear in both the search target word information storage buffer unit 271 and the search key word information storage buffer unit 255 in the common word information storage buffer unit 273.

[0048] When step 409 is performed for search target documents of a previous category, after the similarity of the search target documents in the initially targeted category has been obtained, the inter-category word conversion information storage buffer unit 275 may already hold conversion information between words of the current category and those of another category that has already been processed; common words are extracted taking this information into account. For example, if the entry "4WD ... four-wheel drive vehicle" exists, "4WD" and "four-wheel drive vehicle" are regarded as the same word, and the words common to "information / database / storage / road / ... / 4WD" and "information / database / accumulation / roadway / ... / four-wheel drive vehicle", namely "information / database / ... / 4WD", are stored in the common word information storage buffer unit 273 as shown in FIG. 19. The method of storing inter-category word conversion information in the inter-category word conversion information storage buffer unit 275 is described later.
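Common word extraction that honors word links can be sketched as follows. The representation is an assumption made for illustration: the inter-category word conversion information is modeled as a dictionary mapping an older spelling to the spelling used in the search key document's category, and the names `canonical` and `common_words` are hypothetical.

```python
def canonical(word, links):
    """Map a word to its linked form, or return it unchanged."""
    return links.get(word, word)


def common_words(key_words, target_words, links):
    """Words shared by both documents after applying word links.

    Order follows the search key document; each word is reported once.
    """
    target_set = {canonical(w, links) for w in target_words}
    seen = set()
    common = []
    for w in key_words:
        c = canonical(w, links)
        if c in target_set and c not in seen:
            seen.add(c)
            common.append(c)
    return common


# Hypothetical link table: old-category spelling -> current spelling.
links = {"four-wheel drive vehicle": "4WD"}
key = ["information", "database", "storage", "road", "4WD"]
target = ["information", "database", "accumulation", "roadway",
          "four-wheel drive vehicle"]
print(common_words(key, target, links))
```

Without the link entry, "4WD" and "four-wheel drive vehicle" would not match and the common set would shrink to "information" and "database".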

[0049] Next, the similarity calculation unit 233 starts and calculates the similarity, using the vector space method or a similar technique, from the information stored in the search target word information storage buffer unit 271, the search key word information storage buffer unit 255, the time-series word weight reading buffer unit 265, and the common word information storage buffer unit 273, and stores the similarity value in the similarity storage buffer unit 277. Similarities are stored so that each search target document is paired with its similarity value, for example "document ID=10, similarity=0.5404; document ID=13, similarity=0.2351" as in FIG. 20.

[0050] After step 409, the search target document reading unit 223 starts and determines whether the external storage device 4 still holds search target documents that have not been processed (step 410); if so, processing returns to step 407. If no unprocessed search target documents remain at step 410, processing proceeds to step 411. In step 411, the previous category search unit 235 determines whether any category document information is stored in the previous category search buffer unit 261, and if a previous category exists, processing proceeds to step 412.

[0051] In step 412, the previous category search unit 235 starts the search target category document information reading unit 217, which retrieves the search target category document information for the category stored in the previous category search buffer unit 261 from the search target category document information database 4c of the external storage device 4 and stores it in the search target category document information storage buffer unit 259.

[0052] Processing then proceeds to step 413. The time-series word conversion information reading unit 237 reads the time-series word conversion information stored in the time-series word conversion information database 4f of the external storage device 4 and stores it in the time-series word conversion information storage buffer unit 279 as in FIG. 21. The time-series word conversion information consists of field codes (a category code and the ID of a word contained in that category) and the word character string corresponding to each code. In the example of FIG. 21, the entry "word ID=101, word=four-wheel drive vehicle" is stored within category ID=23, "automobile". The word IDs used here are also assigned in the order in which new words appear, as with the specific category word conversion information. An operator then corrects them so that words to be associated share a common word ID, both between the specific category and this previous category and likewise between any other categories.

[0053] When the reading of the time-series word conversion information is complete, processing proceeds to step 414, where inter-category word conversion information is generated. The inter-category word conversion information generation unit 239 refers to the field codes (category code and word ID) stored in the specific category word conversion information storage buffer unit 263, which belong to the similar category identified as similar to the search key document, and to the field codes (category code and word ID) stored in the time-series word conversion information storage buffer unit 279. So that words with the same word ID, already aligned as described above (that is, words whose word ID is the same but whose character strings differ), are treated as the same term, it stores the word information in the inter-category word conversion information storage buffer unit 275. For example, as shown in FIG. 22, from the contents of the specific category word conversion information storage buffer unit 263 shown in FIG. 13 and the contents of the time-series word conversion information storage buffer unit 279 shown in FIG. 21, "4WD" and "four-wheel drive vehicle", which have the same category and the same word ID, are stored in association with each other so that the two words are treated as the same term. When the generation of inter-category word conversion information in step 414 is complete, processing returns to step 406, and the steps up to step 410 are repeated as described above.
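Step 414 amounts to joining two word tables on the word ID. The sketch below illustrates this under stated assumptions: each table is a list of `(category_id, word_id, word)` tuples, matching is done on the word ID alone, and the result maps the previous category's spelling to the specific category's spelling. The function name, tuple layout, and the category ID used for "4WD" are hypothetical.

```python
def build_word_links(specific_entries, time_series_entries):
    """Link words that share a word ID but are spelled differently.

    Both arguments are iterables of (category_id, word_id, word).
    Returns a dict: previous-category word -> specific-category word.
    """
    by_id = {word_id: word for _, word_id, word in specific_entries}
    links = {}
    for _, word_id, word in time_series_entries:
        current = by_id.get(word_id)
        # Identical spellings need no conversion entry.
        if current is not None and current != word:
            links[word] = current
    return links


# Word ID 101 and category ID 23 follow the example in the text;
# category ID 14 for "4WD" and the "engine" row are assumptions.
specific = [(14, 101, "4WD")]
time_series = [(23, 101, "four-wheel drive vehicle"), (23, 102, "engine")]
print(build_word_links(specific, time_series))
```

This only works because the word IDs have already been aligned across categories by an operator; IDs assigned purely in order of appearance would not line up.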

[0054] In step 411, if all the search target category document information stored in the previous category search buffer unit 261 has been processed, processing proceeds to step 415. The search result output unit 241 extracts the search target documents with high similarity from the per-document similarity values stored in the similarity storage buffer unit 277 and stores their document information (for example, document IDs) in the search result output buffer unit 281, as illustrated in FIG. 23. It then outputs the contents of the search result output buffer unit 281 to the display device 3 via the output unit 205, as illustrated in FIG. 24. As described above, this embodiment creates inter-category word conversion information from the specific category word conversion information and the time-series word conversion information, so that when common words are extracted during the similarity calculation between the search key document and a search target document, words can be handled as common regardless of changes in wording over time. This improves the accuracy of similarity calculation and similar document search when using a database whose categories are updated over time.

[0055] The present invention is not limited to the embodiment described above and may be modified within a scope that does not depart from its gist. It is widely applicable to document creation support devices, information retrieval devices, information management devices, information filtering devices, and the like.

[0056]

[Effects of the Invention] As detailed above, the present invention provides a similar document search device, and a similar document search method used in that device, that improve the accuracy of similarity calculation and similar document search when using a database whose categories are updated over time.

[Brief description of the drawings]

FIG. 1 is a diagram explaining the outline of the similar document search device of the present invention.

FIG. 2 is a block diagram showing the configuration of an embodiment of the similar document search device of the present invention.

FIG. 3 is a block diagram showing the internal functions of the control device of the similar document search device of this embodiment.

FIG. 4 is a diagram showing the operation of the similar document search process of this embodiment.

FIG. 5 is a diagram showing a search key document storage example.

FIG. 6 is a diagram showing a search key document item storage example.

FIG. 7 is a diagram showing a search key word information storage example.

FIG. 8 is a diagram showing a storage example of search key word information with occurrence frequency information.

FIG. 9 is a diagram showing an example of document information for similar category identification.

FIG. 10 is a diagram showing a specific category storage example.

FIG. 11 is a diagram showing a search target category document information storage example.

FIG. 12 is a diagram showing a previous category search storage example.

FIG. 13 is a diagram showing a specific category word conversion information storage example.

FIG. 14 is a diagram showing an example of time-series word weights read and stored.

FIG. 15 is a diagram showing a search target document storage example.

FIG. 16 is a diagram showing a search target document item storage example.

FIG. 17 is a diagram showing a search target word information storage example.

FIG. 18 is a diagram showing a storage example of search target document items with occurrence frequency information.

FIG. 19 is a diagram showing a common word information storage example.

FIG. 20 is a diagram showing a similarity storage example.

FIG. 21 is a diagram showing a time-series word conversion information storage example.

FIG. 22 is a diagram showing an inter-category word conversion information storage example.

FIG. 23 is a diagram showing a similarity storage example.

FIG. 24 is a diagram showing a search result output example.

[Explanation of symbols]

1 … control device
1a … control unit
1b … memory unit
2 … input device
3 … display device
4 … external storage device
4b … category identification document database
4c … search target category document information database
4d … word conversion information database
4f … time-series word conversion information database
215 … similar category identification unit
217 … search target category document information reading unit
219 … specific category word conversion information reading unit
221 … time-series word weight reading unit
235 … previous category search unit
237 … time-series word conversion information reading unit
239 … inter-category word conversion information generation unit
257 … specific category storage buffer unit
259 … search target category document information storage buffer unit
261 … previous category search buffer unit
263 … specific category word conversion information storage buffer unit
265 … time-series word weight reading buffer unit
275 … inter-category word conversion information storage buffer unit
279 … time-series word conversion information storage buffer unit

Continuation of front page
(72) Inventor: Takuya Nishina, 3-3-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Hiroshi Yamazaki, 3-3-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Takeshi Matsukuma, 3-3-1 Shinmachi, Ome-shi, Tokyo
F-terms (reference): 5B075 ND03 NK02 NK32 NK44 NR06 NR12 PP25 PQ02 PR06 QM08 UU06

Claims

[Claims]

1. A similar document search device comprising: a database storing a plurality of documents classified by predetermined category; first storage means for storing category link information that associates documents belonging to a first category, indicating a classification used in a first period, with documents belonging to a second category, indicating a classification used in a second period, each period being a unit of category change; second storage means for storing word link information that associates words that have the same meaning but are expressed differently in the first category and the second category; category specifying means for specifying, among the first categories, the category to which a document input as a search key belongs; determining means for determining, from the link information stored in the first storage means, the second category associated with the specified first category; search means for searching the database, with documents belonging to the first category and to the second category obtained by the determining means as search targets, for documents similar to the search key document using the word link information stored in the second storage means; and output means for outputting the similar documents obtained by the search means as a search result for the search key document.

2. The similar document search device according to claim 1, wherein the search means includes means for creating the word link information by comparing words used in the first category with words used in the second category.

3. The similar document search device according to claim 1, wherein the search means calculates similarity treating words linked by the word link information as the same word, and extracts similar documents based on that similarity.

4. The similar document search device according to claim 1, wherein the search means calculates similarity taking a first weight value into account for words contained in documents belonging to the first category and a second weight value into account for words contained in documents belonging to the second category, and extracts similar documents based on the calculated similarity.

5. The similar document search device according to claim 1, wherein the first category and the second category are created in chronological order.

6. A similar document search method for retrieving similar documents from a database in which a plurality of documents classified by predetermined category are registered, the method comprising: storing in a memory category link information that associates documents belonging to a first category, indicating a classification used in a first period, with documents belonging to a second category, indicating a classification used in a second period, each period being a unit of category change, and word link information that associates words that have the same meaning but are expressed differently in the first category and the second category; specifying, among the first categories, the category to which a document given as a search key belongs; determining, from the category link information stored in the memory, the second category associated with the specified first category; searching the database, with documents belonging to the first category and the second category as search targets, for documents similar to the search key document using the word link information; and outputting the similar documents as a search result for the search key document.

7. A computer-readable recording medium storing a program that causes a computer provided with a database in which a plurality of documents classified by predetermined category are registered to execute: a function of storing in a memory category link information that associates documents belonging to a first category, indicating a classification used in a first period, with documents belonging to a second category, indicating a classification used in a second period, each period being a unit of category change, and word link information that associates words that have the same meaning but are expressed differently in the first category and the second category; a function of specifying, among the first categories, the category to which a document given as a search key belongs; a function of determining, from the link information stored in the memory, the second category related to the specified first category; a function of searching the database, with documents belonging to the first category and the second category as search targets, for documents similar to the search key document using the word link information; and a function of outputting the similar documents as a search result for the search key document.