JPH07121547A

JPH07121547A - Information retrieval device

Info

Publication number: JPH07121547A
Application number: JP5263281A
Authority: JP
Inventors: Masao Ito (伊藤正雄)
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-10-21
Filing date: 1993-10-21
Publication date: 1995-05-12

Abstract

(57) [Abstract]

[Purpose] To enable accurate retrieval, by converting search conditions into regular expressions, even when the text contains typographical errors or omissions, or contains line-feed codes.

[Constitution] A regular expression conversion unit 12 converts each keyword entered at a search condition input unit 11 into a regular expression so that the keyword can be found even when the text contains typographical errors or omissions. A different-notation expansion unit 13 expands the condition so that differently written forms can be found, and a similar-notation expansion unit 14 expands it so that similar-looking forms can be found. Converting the search condition in this way improves retrieval accuracy.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an information retrieval device used for retrieval processing on a database of electronic documents.

[0002]

[Prior Art] In recent years, with the spread of word processors and character recognition devices, the number of electronic documents created with them has grown. Interest has therefore increased in document databases that store large volumes of document information and retrieve it as needed. In conventional document databases, documents were generally retrieved by keyword search, using keywords assigned to each document. This approach had several problems: keyword assignment could not keep pace with the growth of stored documents; keywords became obsolete over time; and searches using keywords that the database administrator had not anticipated could not be handled, so many relevant documents were missed. Against this background, document databases known as full-text databases have recently attracted attention. A full-text database compares the search condition supplied by the user against all of the information in the stored documents and outputs the documents that satisfy the condition. However, when a user searches a full-text database, the search fails if the character type registered in the database differs from the character type used in the search. Search omissions have therefore been reduced by converting keywords into hiragana, katakana, kanji and romaji notations and by expanding them into different notations using a different-notation dictionary.

[0003] A conventional information retrieval device is described below. FIG. 6 shows the configuration of a conventional information retrieval device. In FIG. 6, 1 is a search condition input unit, 2 is a character type conversion unit, 3 is a different-notation expansion unit, 4 is a search unit, and 5 is a search result display unit.

[0004] The operation of the information retrieval device configured as above is described below. First, a search condition is entered at the search condition input unit 1. Next, the character type conversion unit 2 converts each keyword of the entered search condition into character types such as hiragana, katakana, kanji and romaji. For example, when 「検索」 is entered as the search condition, it is converted into 「けんさく」 (hiragana), 「ケンサク」 (katakana) and "kensaku" (romaji). The different-notation expansion unit 3 then performs different-notation expansion on each word produced by the character type conversion unit 2. For example, "kensaku" is expanded to "KENSAKU". Finally, the search unit 4 performs a search based on the search condition expanded by the different-notation expansion unit 3, and the search result display unit 5 displays the search results.

[0005]

[Problems to Be Solved by the Invention] However, the conventional information retrieval described above cannot find a document when a document entered through an optical character recognition device (OCR) contains recognition errors, or when a document created with a word processor or the like contains typographical errors or omissions. It also cannot find a word that is split within the document by a line-feed code, page-break code, space code or tab code.

[0006] The present invention solves these problems of the prior art. Its object is to provide an information retrieval device that can search correctly even in documents misrecognized by OCR and in documents containing typographical errors, omissions or special codes.

[0007]

[Means for Solving the Problems] To achieve the above object, the present invention comprises a search condition input unit for entering keywords for a search, a regular expression conversion unit that converts the entered keywords into regular expressions, a search unit that performs a search based on the search condition converted into regular expressions, and a search result display unit that displays the search results.

[0008] To achieve the above object, the present invention further comprises a different-notation expansion unit that expands the search condition converted into regular expressions into different notations, and/or a similar-notation expansion unit that expands it into similar-looking notations.

[0009] To achieve the above object, the present invention is further characterized in that the search condition converted into regular expressions is expanded into different notations by the different-notation expansion unit and then expanded into similar notations by the similar-notation expansion unit.

[0010] To achieve the above object, in the present invention the regular expression conversion unit further comprises a skip-read conversion unit that converts a keyword into a regular expression that ignores line-feed codes, page-break codes, space codes and tab codes.

[0011] To achieve the above object, in the present invention the regular expression conversion unit further comprises an adjacency matching conversion unit that converts a limit on the number of characters between keywords into a regular expression using the any-character symbol.

[0012] To achieve the above object, in the present invention the regular expression conversion unit further comprises a variable-length character string conversion unit that converts the gap between keywords into a regular expression matching any characters other than the full stop (。) and the comma (、).

[0013]

[Operation] According to the present invention, therefore, when the target document contains typographical errors or omissions, the regular expression conversion unit creates three kinds of regular expression for a keyword (one character wrong, one character missing, and one character extra) and searches with them, so that a highly accurate search can be performed.

[0014] In addition, by expanding the search condition converted into regular expressions in the different-notation expansion unit, the present invention can find documents created with a word processor or the like even if they contain typographical errors or omissions.

[0015] In addition, by expanding the search condition converted into regular expressions with the similar-notation expansion means, the present invention can also find characters misrecognized by OCR.

[0016] In addition, the skip-read conversion unit creates a regular expression in which a repetition of the characters to be skipped, such as line-feed codes, is inserted between the characters of the keyword, so that documents containing line-feed codes and the like can also be searched accurately.

[0017] In addition, when a limit on the number of characters between keywords is specified, the adjacency matching conversion unit creates a regular expression in which runs of the any-character symbol, up to the specified limit, are joined by logical OR, so that search conditions can be entered easily.

[0018] In addition, the variable-length character string conversion unit creates a regular expression that matches characters other than the full stop and the comma, so that even in an arbitrary-match search between keywords, matches do not span clause boundaries.

[0019]

[Embodiments]

(Embodiment 1) A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the configuration of the information retrieval device in the first embodiment. In FIG. 1, 11 is a search condition input unit, 12 is a regular expression conversion unit, 13 is a different-notation expansion unit, 14 is a similar-notation expansion unit, 15 is a search unit, and 16 is a search result display unit. The different-notation expansion unit 13 and the similar-notation expansion unit 14 are connected in parallel between the regular expression conversion unit 12 and the search unit 15.

[0020] The operation of the information retrieval device configured as above is described next. First, the regular expression conversion unit 12 converts each keyword entered at the search condition input unit 11 into a regular expression. There are three kinds of conversion (one character wrong, one character missing, and one character extra), and whether each is applied can be controlled independently. For example, when the keyword is "abcd", the conversion for one wrong character is "bcd|a.cd|ab.d|abc" (where "." denotes any single character and "|" denotes logical OR). The conversion for one missing character is "bcd|acd|abd|abc". The conversion for one extra character is "a.bcd|ab.cd|abc.d". Performing these three kinds of conversion makes it possible to find the keyword even if one character is missing.
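The three conversions can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation in the patent: the function names (`one_wrong`, `one_missing`, `one_extra`, `typo_regex`) and the `trim_edges` option are hypothetical. With `trim_edges=True` the wildcard is dropped at the first and last positions, which reproduces the alternatives in the form given in the "abcd" example; for a substring search, a leading or trailing wildcard adds little, which may be why the example omits it.

```python
import re


def one_wrong(keyword: str, trim_edges: bool = True) -> str:
    """Alternatives in which one character of the keyword is replaced by '.'."""
    n = len(keyword)
    alts = []
    for i in range(n):
        # Hypothetical option: drop the wildcard at either end of the keyword,
        # which gives the same form as the "abcd" example in the text.
        wildcard = "" if trim_edges and i in (0, n - 1) else "."
        alts.append(re.escape(keyword[:i]) + wildcard + re.escape(keyword[i + 1:]))
    return "|".join(alts)


def one_missing(keyword: str) -> str:
    """Alternatives in which one character of the keyword is deleted."""
    return "|".join(
        re.escape(keyword[:i] + keyword[i + 1:]) for i in range(len(keyword))
    )


def one_extra(keyword: str) -> str:
    """Alternatives in which one arbitrary character is inserted inside the keyword."""
    return "|".join(
        re.escape(keyword[:i]) + "." + re.escape(keyword[i:])
        for i in range(1, len(keyword))
    )


def typo_regex(keyword: str, wrong: bool = True, missing: bool = True,
               extra: bool = True) -> str:
    """Join the enabled conversions with logical OR; each one is switchable."""
    parts = []
    if wrong:
        parts.append(one_wrong(keyword))
    if missing:
        parts.append(one_missing(keyword))
    if extra:
        parts.append(one_extra(keyword))
    parts = [p for p in parts if p]  # very short keywords can yield empty parts
    return "|".join(parts) if parts else re.escape(keyword)


if __name__ == "__main__":
    print(one_wrong("abcd"))
    print(one_missing("abcd"))
    print(one_extra("abcd"))
    print(bool(re.search(typo_regex("abcd"), "xx abXd yy")))
```

Keeping the three generators separate mirrors the statement that each conversion can be enabled or disabled independently.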

[0021] Next, the different-notation expansion unit 13 expands these character strings. There are two expansion methods: expanding the whole keyword within a single character type, and expanding each character independently. For example, when the keyword is "abcd", the former expands it to "ABCD|abcd", which matches either "ABCD" or "abcd". The latter expands it to "(a|A)(b|B)(c|C)(d|D)", which matches any of "ABCD", "ABCd", "ABcd", ..., "abcd".
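The two expansion methods can be sketched as below. Letter case is used as the only kind of variant because that is the example given in the text; the function names `expand_whole` and `expand_per_char` are hypothetical, and a real different-notation dictionary would presumably cover other kinds of variant as well.

```python
import re


def expand_whole(keyword: str) -> str:
    """Expand the keyword as a whole within one character type: 'ABCD|abcd'."""
    return re.escape(keyword.upper()) + "|" + re.escape(keyword.lower())


def expand_per_char(keyword: str) -> str:
    """Expand each character independently: '(a|A)(b|B)(c|C)(d|D)'."""
    return "".join(
        "(" + re.escape(c.lower()) + "|" + re.escape(c.upper()) + ")"
        for c in keyword
    )


if __name__ == "__main__":
    print(expand_whole("abcd"))
    print(expand_per_char("abcd"))
    # Mixed case is accepted only by the per-character expansion.
    print(bool(re.fullmatch(expand_per_char("abcd"), "ABcd")))
    print(bool(re.fullmatch(expand_whole("abcd"), "ABcd")))
```

The per-character form accepts every mixture of cases, so it misses less but also matches more unintended strings than the whole-keyword form.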

[0022] Next, the similar-notation expansion unit 14 converts the condition into character strings that OCR is likely to misrecognize. Pairs of characters with similar appearance, such as 「工」 (the kanji kō) and 「エ」 (the katakana e), or 「一」 (the kanji ichi, "one") and 「−」 (the minus sign), are registered in advance, and when either character of a pair appears in the condition, the condition is expanded.
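A minimal sketch of this expansion, assuming the registered pairs are held in a simple table: the table `SIMILAR_PAIRS` contains only the two pairs named in the text, and the names `build_lookup` and `expand_similar` are hypothetical. Unicode escapes are used for the look-alike characters so that they cannot be confused in the source code itself.

```python
import re

# Hypothetical table of look-alike characters, limited to the pairs named in the text.
SIMILAR_PAIRS = [
    ("\u5de5", "\u30a8"),  # kanji 工 (kō) vs katakana エ (e)
    ("\u4e00", "\u2212"),  # kanji 一 (ichi) vs the minus sign
]


def build_lookup(pairs):
    """Map each character to the set of characters it may be confused with."""
    lookup = {}
    for a, b in pairs:
        lookup.setdefault(a, set()).add(b)
        lookup.setdefault(b, set()).add(a)
    return lookup


def expand_similar(keyword: str, pairs=SIMILAR_PAIRS) -> str:
    """Replace every registered character by an alternation over its look-alikes."""
    lookup = build_lookup(pairs)
    out = []
    for ch in keyword:
        if ch in lookup:
            options = [ch] + sorted(lookup[ch])
            out.append("(" + "|".join(re.escape(o) for o in options) + ")")
        else:
            out.append(re.escape(ch))
    return "".join(out)


if __name__ == "__main__":
    pattern = expand_similar("加\u5de5")  # 加 followed by kanji 工
    # The pattern also accepts 加 followed by katakana エ, a typical OCR confusion.
    print(bool(re.fullmatch(pattern, "加\u30a8")))
```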

[0023] Next, the search unit 15 performs a search based on the search condition expanded in this way, and the search result display unit 16 displays the search results. The search is executed by converting the expanded character string into a state transition table and performing string matching with a finite-state automaton algorithm. String matching with finite-state automata is described in detail in the literature (高橋恒介, 「テキスト検索プロセッサ」 ["Text Search Processor"], 電子情報通信学会 [The Institute of Electronics, Information and Communication Engineers]), so its description is omitted here.
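The text defers to the cited literature for the matching algorithm, so the following is only a rough illustration of automaton-style matching and not the state transition table construction used by the device. It simulates a nondeterministic finite automaton by tracking the set of active states while scanning the text once, and it is restricted to alternatives built from literal characters and the "." wildcard. The function name `automaton_search` and the state encoding are assumptions.

```python
def automaton_search(alternatives, text):
    """Return True if any alternative occurs somewhere in text.

    Each alternative is a string of literal characters and '.' wildcards.
    A state is a pair (alternative index, position inside that alternative).
    """
    alternatives = [a for a in alternatives if a]  # ignore empty alternatives
    starts = [(i, 0) for i in range(len(alternatives))]
    active = set()
    for ch in text:
        next_active = set()
        # A match may begin at every text position, so the start states
        # are always considered in addition to the currently active states.
        for i, pos in list(active) + starts:
            expected = alternatives[i][pos]
            if expected == "." or expected == ch:
                if pos + 1 == len(alternatives[i]):
                    return True  # reached the accepting state of alternative i
                next_active.add((i, pos + 1))
        active = next_active
    return False


if __name__ == "__main__":
    alts = "bcd|a.cd|ab.d|abc".split("|")
    print(automaton_search(alts, "xxabXdyy"))
    print(automaton_search(alts, "xyz"))
```

The text is read once from left to right regardless of how many alternatives the expanded condition contains, which is the property that makes automaton-based matching attractive for heavily expanded conditions.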

[0024] As described above, in this embodiment the different-notation expansion unit 13 and the similar-notation expansion unit 14 are provided in parallel between the regular expression conversion unit 12 and the search unit 15, so that accurate searches are possible even for documents created by OCR that contain recognition errors and for documents created with a word processor that contain typographical errors or omissions. As a simpler configuration, the different-notation expansion unit 13 and the similar-notation expansion unit 14 may be omitted, so that the entered search condition is simply converted into a regular expression and searched.

[0025] (Embodiment 2) Next, a second embodiment of the present invention is described with reference to the drawings. FIG. 2 shows the configuration of the information retrieval device in the second embodiment. This embodiment adds a skip-read conversion unit 17 to the regular expression conversion unit 12 of the first embodiment shown in FIG. 1. The rest of the configuration is the same as in the first embodiment, so the same elements are given the same reference numerals and duplicate description is omitted.

[0026] For the information retrieval device configured as above, the operation of the skip-read conversion unit 17 is mainly described below. The skip-read conversion unit 17 inserts the characters to be skipped between the characters of the keyword. For example, a search for "abcd" becomes "a(\n|\f|\s|\t)*b(\n|\f|\s|\t)*c(\n|\f|\s|\t)*d" (where \n is a line feed, \f is a page break, \s is a space, \t is a tab, and * means zero or more repetitions). Character strings of this kind are inserted as skip strings.
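The skip-read conversion can be sketched as follows. One adaptation is made for Python: the text writes the space as \s, but in Python's `re` module \s matches any whitespace character, so a literal space is used instead. The names `SKIP` and `skip_regex` are hypothetical.

```python
import re

# Line feed, form feed (page break), space and tab, repeated zero or more times.
# A literal space stands in for the \s of the text (see the note above).
SKIP = r"(?:\n|\f| |\t)*"


def skip_regex(keyword: str) -> str:
    """Insert the skip pattern between every pair of adjacent keyword characters."""
    return SKIP.join(re.escape(ch) for ch in keyword)


if __name__ == "__main__":
    pattern = skip_regex("abcd")
    print(pattern)
    # The keyword is found even though it is broken by a line feed, a space and a tab.
    print(bool(re.search(pattern, "xa\nb \tcdy")))
```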

[0027] As described above, in this embodiment the regular expression conversion unit 12 is provided with the skip-read conversion unit 17, which inserts skip strings. In addition to the operation of the first embodiment, a keyword can therefore be found even when a line-feed code, page-break code, space code or tab code occurs in the middle of it, because these codes are ignored, and search omissions are prevented.

[0028] (Embodiment 3) Next, a third embodiment of the present invention is described with reference to the drawings. FIG. 3 shows the configuration of the information retrieval device in the third embodiment. This embodiment adds an adjacency matching conversion unit 18 to the regular expression conversion unit 12 of the first embodiment shown in FIG. 1. The rest of the configuration is the same as in the first embodiment, so the same elements are given the same reference numerals and duplicate description is omitted.

[0029] For the information retrieval device configured as above, the operation of the adjacency matching conversion unit 18 is mainly described below. The adjacency matching conversion unit 18 creates a regular expression from the specified maximum number of characters between one keyword and the next. The regular expression is created by writing runs of the any-character symbol (.) of every length from 1 up to the maximum number of characters and joining them with the logical-OR vertical bar (|). For example, when the keywords are 「情報」 and 「装置」 and at most four characters may come between them, the condition is converted into the regular expression 「情報(.|..|...|....)装置」. With this conversion, character strings such as 「情報検索装置」 and 「情報入出力装置」 can be found.
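The construction described above can be sketched as below; the function name `adjacency_regex` and its parameters are hypothetical. In most modern regex dialects the same gap could be written more compactly as `.{1,4}`, but the sketch follows the alternation form given in the text.

```python
import re


def adjacency_regex(first: str, second: str, max_gap: int) -> str:
    """Join two keywords with 1 to max_gap arbitrary characters in between."""
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1")
    # Runs of '.' of length 1..max_gap, joined by the logical-OR bar.
    gap = "(" + "|".join("." * k for k in range(1, max_gap + 1)) + ")"
    return re.escape(first) + gap + re.escape(second)


if __name__ == "__main__":
    pattern = adjacency_regex("情報", "装置", 4)
    print(pattern)
    for text in ["情報検索装置", "情報入出力装置", "情報装置"]:
        print(text, bool(re.search(pattern, text)))
```

Because the runs start at length 1, two keywords that are directly adjacent (as in 「情報装置」) are not matched by this form.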

[0030] As described above, in this embodiment the regular expression conversion unit 12 is provided with the adjacency matching conversion unit 18. In addition to the operation of the first embodiment, a condition can therefore be converted into a regular expression that uses the any-character symbol up to the maximum number of characters set between keywords, so that search conditions can be entered more easily. The regular expression conversion unit 12 of this embodiment may also be provided with the skip-read conversion unit 17 of the second embodiment.

[0031] (Embodiment 4) Next, a fourth embodiment of the present invention is described with reference to the drawings. FIG. 4 shows the configuration of the information retrieval device in the fourth embodiment. This embodiment adds a variable-length character string conversion unit 19 to the regular expression conversion unit 12 of the first embodiment shown in FIG. 1. The rest of the configuration is the same as in the first embodiment, so the same elements are given the same reference numerals and duplicate description is omitted.

[0032] For the information retrieval device configured as above, the operation of the variable-length character string conversion unit 19 is mainly described below. The variable-length character string conversion unit 19 creates a regular expression in which the gap between one keyword and the next is a run of characters drawn from the set of all characters except the full stop (。) and the comma (、). For example, to search for 「国際」 followed by something and then 「会議」, the condition is converted into the regular expression 「国際[^、。]*会議」 (where [ ] matches any one of the characters inside the square brackets, and ^ inverts the set so that it matches any character other than those inside the brackets).
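A short sketch of this conversion, with the hypothetical function name `variable_gap_regex`; the `stop_chars` parameter is an assumption added so that the excluded punctuation can be changed.

```python
import re


def variable_gap_regex(first: str, second: str, stop_chars: str = "、。") -> str:
    """Allow any run of characters between the keywords except the stop characters."""
    return (re.escape(first)
            + "[^" + re.escape(stop_chars) + "]*"
            + re.escape(second))


if __name__ == "__main__":
    pattern = variable_gap_regex("国際", "会議")
    print(pattern)
    # Matches within one clause ...
    print(bool(re.search(pattern, "国際学術会議")))
    # ... but not across a full stop.
    print(bool(re.search(pattern, "国際化が進む。会議は明日")))
```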

[0033] As described above, in this embodiment the regular expression conversion unit 12 is provided with the variable-length character string conversion unit 19. In addition to the operation of the first embodiment, arbitrary-character matching therefore never matches the full stop or the comma, so that matches spanning clause boundaries are excluded and searches are more accurate. The regular expression conversion unit 12 of this embodiment may also be provided with the skip-read conversion unit 17 of the second embodiment and/or the adjacency matching conversion unit 18 of the third embodiment.

[0034] (Embodiment 5) Next, a fifth embodiment of the present invention is described with reference to the drawings. FIG. 5 shows the configuration of the information retrieval device in the fifth embodiment. In this embodiment, the different-notation expansion unit 13 and the similar-notation expansion unit 14 of the first embodiment are connected in series. The rest of the configuration is the same as in the first embodiment, so the same elements are given the same reference numerals and duplicate description is omitted.

[0035] For the information retrieval device configured as above, the operation of the different-notation expansion unit 13 and the similar-notation expansion unit 14 is mainly described below. First, the different-notation expansion unit 13 expands the input keyword. For example, 「加工」 is expanded to 「下降」, 「仮構」, 「河口」 and so on (all read "kakō"). Next, the similar-notation expansion unit 14 expands each of these strings into similar notations. For example, 「加工」 is expanded to 「加エ」, 「下降」 to 「下隆」, 「仮構」 to 「板構」, and 「河口」 to 「河ロ」.

[0036] In the first embodiment, different-notation expansion and similar-notation expansion are performed in parallel, so the different-notation expansion of 「加工」 is the same as here, but similar-notation expansion is applied only to 「加工」 itself. The present embodiment therefore yields a richer set of candidate strings and correspondingly fewer search omissions.
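The difference between the parallel connection of the first embodiment and the series connection of this embodiment can be sketched with toy dictionaries. The tables `VARIANTS` and `SIMILAR` contain only the examples given in the text, and all names are hypothetical. For simplicity the expansions are produced as sets of literal strings rather than as regular expressions, and Unicode escapes are used for the look-alike characters 工/エ and 口/ロ so that they cannot be confused in the code.

```python
from itertools import product

# Toy dictionaries built from the examples in the text (assumed, not exhaustive).
VARIANTS = {"加\u5de5": ["下降", "仮構", "河\u53e3"]}  # 加工 -> 下降, 仮構, 河口
SIMILAR = {
    "\u5de5": "\u30a8",  # kanji 工 -> katakana エ
    "\u53e3": "\u30ed",  # kanji 口 -> katakana ロ
    "降": "隆",
    "仮": "板",
}


def different_notations(word):
    """The word itself plus its registered different notations."""
    return {word, *VARIANTS.get(word, [])}


def similar_notations(word):
    """All strings obtained by optionally swapping in look-alike characters."""
    choices = [(ch, SIMILAR[ch]) if ch in SIMILAR else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*choices)}


def expand_parallel(word):
    """Embodiment 1: both expansions are applied to the original word only."""
    return different_notations(word) | similar_notations(word)


def expand_serial(word):
    """Embodiment 5: similar-notation expansion is applied to every variant."""
    result = set()
    for variant in different_notations(word):
        result |= similar_notations(variant)
    return result


if __name__ == "__main__":
    word = "加\u5de5"
    print(sorted(expand_parallel(word)))
    print(sorted(expand_serial(word)))
```

With these toy tables the series connection produces every string of the parallel connection plus the look-alike forms of the other variants, which is the richer candidate set the text describes.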

[0037] As described above, in this embodiment the search condition is expanded by different-notation expansion and then by similar-notation expansion, so search omissions can be reduced further. In this embodiment too, the regular expression conversion unit 12 may be provided with the skip-read conversion unit 17 and/or the adjacency matching conversion unit 18 and/or the variable-length character string conversion unit 19.

[0038] In each of the above embodiments, the regular expression conversion unit 12 places no particular limit on the number of characters in a keyword to be converted. However, precision deteriorates for keywords of two characters or fewer, so a limit such as a minimum of three characters may be imposed. In addition, the different-notation expansion unit 13 and the similar-notation expansion unit 14 may be made to operate independently so that either one or both can be selected.

[0039]

[Effects of the Invention] As described above, according to the present invention, when the target document contains typographical errors or omissions, the regular expression conversion unit creates three kinds of regular expression for a keyword (one character wrong, one character missing, and one character extra) and searches with them, so that a highly accurate search can be performed.

[0040] In addition, by expanding the search condition converted into regular expressions in the different-notation expansion unit, the present invention can find documents created with a word processor or the like even if they contain typographical errors or omissions.

[0041] In addition, by expanding the search condition converted into regular expressions with the similar-notation expansion means, the present invention can also find characters misrecognized by OCR.

[0042] In addition, the skip-read conversion unit creates a regular expression in which a repetition of the characters to be skipped, such as line-feed codes, is inserted between the characters of the keyword, so that documents containing line-feed codes and the like can also be searched accurately.

[0043] In addition, when a limit on the number of characters between keywords is specified, the adjacency matching conversion unit creates a regular expression in which runs of the any-character symbol, up to the specified limit, are joined by logical OR, so that search conditions can be entered easily.

[0044] In addition, the variable-length character string conversion unit creates a regular expression that matches characters other than the full stop and the comma, so that even in an arbitrary-match search between keywords, matches do not span clause boundaries.

[Brief description of drawings]

FIG. 1 is a schematic block diagram showing the configuration of an information retrieval device according to a first embodiment of the present invention.

FIG. 2 is a schematic block diagram showing the configuration of an information retrieval device according to a second embodiment of the present invention.

FIG. 3 is a schematic block diagram showing the configuration of an information retrieval device according to a third embodiment of the present invention.

FIG. 4 is a schematic block diagram showing the configuration of an information retrieval device according to a fourth embodiment of the present invention.

FIG. 5 is a schematic block diagram showing the configuration of an information retrieval device according to a fifth embodiment of the present invention.

FIG. 6 is a schematic block diagram showing the configuration of a conventional information retrieval device.

[Explanation of symbols]

11: search condition input unit
12: regular expression conversion unit
13: different-notation expansion unit
14: similar-notation expansion unit
15: search unit
16: search result display unit
17: skip-read conversion unit
18: adjacency matching conversion unit
19: variable-length character string conversion unit

Claims

1. An information retrieval device comprising: a search condition input unit for entering keywords for a search; a regular expression conversion unit that converts the entered keywords into regular expressions; a search unit that performs a search based on the search condition converted into regular expressions; and a search result display unit that displays the search results.

2. The information retrieval device according to claim 1, further comprising a different-notation expansion unit that expands the search condition converted into regular expressions into different notations, and/or a similar-notation expansion unit that expands it into similar-looking notations.

3. The information retrieval device according to claim 1, wherein the search condition converted into regular expressions is expanded into different notations by the different-notation expansion unit and then expanded into similar notations by the similar-notation expansion unit.

4. The information retrieval device according to claim 1, wherein the regular expression conversion unit comprises a skip-read conversion unit that converts a keyword into a regular expression that ignores line-feed codes, page-break codes, space codes and tab codes.

5. The information retrieval device according to claim 1, wherein the regular expression conversion unit comprises an adjacency matching conversion unit that converts a limit on the number of characters between keywords into a regular expression using the any-character symbol.

6. The information retrieval device according to claim 1, wherein the regular expression conversion unit comprises a variable-length character string conversion unit that converts the gap between keywords into a regular expression matching characters other than the full stop and the comma.