JP2000259629A

JP2000259629A - Morphological analysis method and its device

Info

Publication number: JP2000259629A
Application number: JP11064237A
Authority: JP
Inventors: Hirobumi Sakurai; 博文櫻井; Toru Hisamitsu; 徹久光
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1999-03-11
Filing date: 1999-03-11
Publication date: 2000-09-22

Abstract

(57)【要約】【課題】単語分かち書きされていない文書を解析し
て、その文を構成する単語を取り出しそれぞれの単語の
品詞を同定する処理を計算機によって、より高速に行う
こと。【解決手段】単語辞書の各検索結果の単語リストごと
に、与えられた単語と接続可能な語のリストをあらかじ
め計算しておき、解析時にその結果を利用する。 (57) [Summary] [PROBLEMS] To analyze a document that has not been word-separated, extract words constituting the sentence, and identify the part of speech of each word at a higher speed by a computer. A list of words connectable to a given word is calculated in advance for each word list of each search result in a word dictionary, and the result is used during analysis.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、連続文字列を構成
単語列に分解するプロセスを含むすべての分野において
利用可能である。特に、日本語のような単語を分かち書
きされない言語の文を入力とする検索や機械翻訳、かな
漢字変換処理、自然言語インターフェースにおいて有効
である。なお、本発明で「文」と言うときは、必ずし
も、読点“。”で終わる文章を言うわけではなく、人が
理解できる意味のある文字列のことを言うものである。The present invention can be used in all fields including a process of decomposing a continuous character string into constituent word strings. In particular, the present invention is effective in retrieval, machine translation, kana-kanji conversion processing, and natural language interface in which a sentence in a language in which words such as Japanese cannot be separated is input. In the present invention, the term “sentence” does not necessarily mean a sentence ending with a reading mark “.” But refers to a character string that has a meaning that can be understood by a person.

【０００２】[0002]

【従来の技術】形態素解析とは、単語分かち書きされて
いない言語の文を解析して、その文を構成する単語を取
り出し、それぞれの単語の品詞を同定する処理のことで
ある。従来、形態素解析においては以下のような解析手
法が使われている。2. Description of the Related Art Morphological analysis is a process of analyzing a sentence in a language that is not separated by words, extracting words constituting the sentence, and identifying the parts of speech of each word. Conventionally, the following analysis methods have been used in morphological analysis.

【０００３】Σを文字列の集合、ＦＴを単語の前接続属
性コードの集合、ＲＴを単語の後接続属性コードの集
合、単語辞書ＤをΣ×ＦＴ×ＲＴ×［０，∞）の有限な
部分集合（未知語や数字列はここには含まれない）とす
る。今、文字列ｗについてｗ＝（ｓ_ｗ，ｆｔ_ｗ，ｒｔ
_ｗ，ｃ_ｗ）∈Ｄと表現できるとき、これを単語ｗある
いは形態素ｗ(以下、簡単のために単語と言う)という。
ここで、ｓ_ｗは単語ｗの表記、ｆｔ_ｗは単語ｗの前接
続属性コード、ｒｔ_ｗは単語ｗの後接続属性コード、
ｃ_ｗは単語コストである。単語コストｃ_ｗは０または
正の実数で表現され単語の使用される頻度が高いときは
小さい値が与えられる。単語の前接続属性、後接続属性
とは、単語同士の接続の特性を表す属性で、単語の表記
や品詞、活用型、活用形などによって決まる。また、ｆ
ｔ_ｓｔａｒｔ（∈ＦＴ），ｒｔ_ｓｔａｒｔ（∈ＲＴ）
およびｆｔ_ｅｎｄ（∈ＦＴ），ｒｔ_ｅｎｄ（∈ＲＴ）
はそれぞれ文の先頭および末尾を表す特別な文法コー
ド、εを長さ０の文字列で表現するものとすると、先頭
を表す特別な文法コードはｗ_ｓｔａｒｔ＝（ε，ｆｔ_
ｓｔａｒｔ，ｒｔ_ｓｔａｒｔ，０）で、末尾を表す特
別な文法コードはｗ_ｅｎｄ＝（ε，ｆｔ_ｅｎｄ，ｒｔ
_ｅｎｄ，０）で、それぞれ表現される。[0003] 文字 is a set of character strings, FT is a set of preceding connection attribute codes of a word, RT is a set of post-connection attribute codes of a word, and a word dictionary D is a finite of Σ × FT × RT × [0, ∞). It is a subset (unknown words and digit strings are not included here). Now, for a character string w, w = (s_w, ft_w, rt
_w, c_w) ∈D, this is called a word w or a morpheme w (hereinafter referred to as a word for simplicity).
Here, s_w is the notation of the word w, ft_w is the connection attribute code before the word w, rt_w is the connection attribute code after the word w,
c_w is the word cost. The word cost c_w is expressed as 0 or a positive real number, and a small value is given when the frequency of use of the word is high. The pre-connection attribute and post-connection attribute of a word are attributes that represent the characteristics of the connection between words, and are determined by the word description, part of speech, inflection type, inflection type, and the like. Also, f
t_start (@FT), rt_start (@RT)
And ft_end (@FT), rt_end (@RT)
Is a special grammatical code representing the beginning and end of a sentence, respectively, and ε is represented by a character string of length 0, and the special grammatical code representing the beginning is w_start = (ε, ft_
start, rt_start, 0), and a special grammar code representing the end is w_end = (ε, ft_end, rt)
_end, 0).

【０００４】単語の後接続属性ＲＴとこれに接続される
単語の前接続属性ＦＴの間の接続しやすさを示す関数γ
について見ると、γ×ＲＴ＝ＦＴ→｛ｘ｜ｘは０以上の
整数、または−１｝と表現され、前側のある単語ｗ’と
その後のある単語ｗとがγ（ｆｔ_ｗ，ｒｔ_ｗ’）≧０
のとき単語ｗ’と単語ｗとは接続可能、−１のとき接続
不能と呼ぶ。ここで、関数γを接続コスト関数、γの返
す値を接続コストと呼ぶ。[0004] A function γ indicating the ease of connection between the post-connection attribute RT of a word and the pre-connection attribute FT of a word connected thereto.
, Γ × RT = FT → {x | x is expressed as an integer of 0 or more, or −1}, and a certain word w ′ on the front side and a certain word w thereafter are γ (ft_w, rt_w ′) ≧ 0
When the word w ′ and the word w are connectable, when −1, the word w ′ and the word w are not connectable. Here, the function γ is called a connection cost function, and the value returned by γ is called a connection cost.

【０００５】単語間の接続のしやすさはこの接続属性間
の接続コストによって表現され、その値は０以上の整数
で、値が小さい程接続しやすいことをあらわす。たとえ
ば、普通名詞と“は”という表記を持つ格助詞は一般に
接続しやすいので、小さい値を与え、助詞“は”と助詞
“が”は普通接続しないので無限大の値を与える。The ease of connection between words is expressed by the connection cost between the connection attributes, and the value is an integer of 0 or more. The smaller the value, the easier the connection. For example, a case particle having the notation "ha" and a common noun is generally easily connected, so a small value is given, and a particle "ha" and a particle "ga" are usually not connected, so an infinite value is given.

【０００６】ある文Ｓ∈Σの位置ｉ（≧０）から始まる
候補単語の集合をＷ＾ｓｔａｒｔ_ｉ（Ｓ），文の位置
ｉで終る候補単語の集合をＷ＾ｅｎｄ_ｉ（Ｓ）と書
く。ただし、文の位置ｉとは正の整数で文字と文字の間
の数を指し、文Ｓの先頭位置はｉ＝０、文Ｓのｉ文字目
とｉ＋１文字目の間の位置を位置ｉと呼ぶこととする。
また、候補単語とは文Ｓの連続した部分列を表記に持つ
単語のことである。この場合、同じ表記であっても、品
詞が異なれば候補単語としては別物である。A set of candidate words starting from position i (≧ 0) of a certain sentence S∈Σ is written as W ＾ start_i (S), and a set of candidate words ending at position i of the sentence is written as W ＾ end_i (S). However, the position i of the sentence is a positive integer and indicates the number between characters, the head position of the sentence S is i = 0, and the position between the i-th character and the (i + 1) -th character of the sentence S is a position i. I will call it.
The candidate word is a word having a continuous substring of the sentence S in the notation. In this case, even if the notation is the same, if the part of speech is different, it is a different candidate word.

【０００７】経路情報ｒを、単語ｗ、経路情報へのポイ
ンタｐの集合Ｐ、コストｃｃの３つの組み（ｗ，Ｐ，ｃ
ｃ）で表現するものとする。ただし、ポインタｐとは経
路情報ｒの値（ｗ，Ｐ，ｃｃ）の格納場所を指すアドレ
スである。^*ｐはポインタｐの指す経路情報を表すこと
とする。また、経路情報の全体集合をＲｅと書く。[0007] The route information r is divided into three sets (w, P, c) of a word w, a set P of pointers p to the route information, and a cost cc.
c). Here, the pointer p is an address indicating the storage location of the value (w, P, cc) of the route information r. ^* p indicates path information indicated by the pointer p. Also, the entire set of the route information is written as Re.

【０００８】位置ｉから始まる単語ｗまでの部分解の集
合Ｒ（Ｓ，ｉ，ｗ）とは、単語ｗと接続可能でかつ接し
ている部分解へのポインタと、単語ｗまでのコストの和
からなる経路情報の集合Ｒ_ａｌｌ（Ｓ，ｉ，ｗ）中か
らコストが最小なものをひとつ選んで作った集合で、
（数１）のように帰納的に定義される。A set R (S, i, w) of partial solutions from a position i to a word w is a sum of a pointer to a partial solution connectable to and in contact with the word w and a cost to the word w. A set created by selecting one of the routes having the lowest cost from a set of route information R_all (S, i, w) consisting of
It is defined recursively as in (Equation 1).

【０００９】[0009]

【数１】 (Equation 1)

【００１０】ただし、関数ａｒｇｍｉｎは集合Ｒ∈Ｒｅ
と評価関数ｆ：Ｒｅ→［０，∞）を引数に取り、評価関
数の値を最小とするＲの要素の集合を返す関数である。
また、φは空集合を表す。Note that the function argmin is a set R∈Re
And an evaluation function f: Re → [0, ∞), and returns a set of R elements that minimize the value of the evaluation function.
Φ represents an empty set.

【００１１】ｆ，ａｒｇｍｉｎを（数２）のように定義
できる。F and argmin can be defined as (Equation 2).

【００１２】[0012]

【数２】 (Equation 2)

【００１３】また、関数ｏｎｅは与えられた経路の集合
から要素ｒ＝（ｗ，Ｐ，ｃｃ）を一つ取り出し、さらに
Ｐの要素ｐをひとつ取り出して作った集合｛（ｗ，
｛ｐ｝，ｃｃ）｝を返す。（集合Ｒ（Ｓ，ｉ，ｗ）の定
義としては、コストｃの順位がＮ番目以下の要素だけを
選択する演算を使う場合や、コストが最も小さいものか
ら一定の幅以内のものを選択する演算を使う場合があ
る。）Ｒ（Ｓ，ｉ，ｗ_ｅｎｄ）を形態素解析の解と呼
ぶ。これはＲ（Ｓ，ｉ，ｗ），（ｗ∈Ｗ＾ｓｔａｒｔ_ｉ（Ｓ））を開始位置が前の単語から順に求めることによって得ら
れる。A function one extracts one element r = (w, P, cc) from a given set of paths, and further extracts one element p of P to form a set ｛(w,
{P}, cc)} is returned. (As a definition of the set R (S, i, w), a case where an operation for selecting only the element whose rank of the cost c is Nth or less is used, or a set whose cost is within a certain width from the smallest cost is selected. (A calculation may be used.) R (S, i, w_end) is called a morphological analysis solution. This is obtained by obtaining R (S, i, w), (w∈W ＾ start_i (S)) in order from the word whose start position is the previous one.

【００１４】上記のような日本語の形態素解析では、Ｒ
_ａｌｌ（Ｓ，ｉ，ｗ）の計算において動的計画法に基
づく横型の探索を用いる。解析速度を向上するためには
この探索の時間を減らすことが重要である。In the Japanese morphological analysis described above, R
A horizontal search based on dynamic programming is used in the calculation of _all (S, i, w). It is important to reduce this search time to improve the analysis speed.

【００１５】次に示すのは、与えられた文Ｓと位置ｉ
と、ｉから始まるすべての候補単語ｗ∈Ｗ＾ｓｔａｒｔ
_ｉ（Ｓ）についてＲ_ａｌｌ（Ｓ，ｉ，ｗ）を計算する
従来使われていた手続きである。ここでｗ’は与えられ
た文Ｓにおける位置ｉ−１で終わる候補単語である。ステップ１：各ｗ∈Ｗ＾ｓｔａｒｔ_ｉ（Ｓ）につい
て、Ｒ_ａｌｌ（Ｓ，ｉ，ｗ）を空にする。ステップ２：各ｗ’＝（ｓ_ｗ’，ｒｔ_ｗ’，ｒｔ_
ｗ’，ｃ_ｗ’）∈Ｗ＾ｅｎｄ_ｉ（Ｓ）について、候補
単語ｗ’が部分解ｒ’＝（ｗ’，Ｐ’，ｃｃ）∈Ｒ
（Ｓ，ｉ，ｗ’）を持つ場合、次のステップ３、ステッ
プ４を繰り返す。ステップ３：各ｗ＝（ｓ，ｆｔ_ｗ，ｒｔ_ｗ，ｃｗ）∈
Ｗ＾ｓｔａｒｔ_ｉ（Ｓ）について次を繰り返す。ステップ４：もしも候補単語ｗ’と候補単語ｗが接続可
能（γ（ｒｔ_ｗ’，ｆｔ_ｗ）≧０）ならば、ｒ＝
（ｗ，Ｐ，ｃ_ｗ’＋γ（ｒｔ_ｗ’，ｆｔ_ｗ）＋ｃ
ｃ）をＲ_ａｌｌ（Ｓ，ｉ，ｗ）に加える。The following shows a given sentence S and a position i.
And all candidate words w \ W \ start starting with i
_i (S) is a conventionally used procedure for calculating R_all (S, i, w). Here, w ′ is a candidate word ending at position i−1 in the given sentence S. Step 1: For each w \ W \ start_i (S), empty R_all (S, i, w). Step 2: Each w '= (s_w', rt_w ', rt_
w ′, c_w ′) ∈W ＾ end_i (S), the candidate word w ′ is partially resolved r ′ = (w ′, P ′, cc) ∈R
If it has (S, i, w '), the following steps 3 and 4 are repeated. Step 3: Each w = (s, ft_w, rt_w, cw) ∈
The following is repeated for W @ start_i (S). Step 4: If the candidate word w ′ and the candidate word w are connectable (γ (rt_w ′, ft_w) ≧ 0), then r =
(W, P, c_w '+ γ (rt_w', ft_w) + c
c) is added to R_all (S, i, w).

【００１６】ただし、この手続きにおいては、ステップ
２とステップ３のループは逆でも構わない。However, in this procedure, the loop of step 2 and step 3 may be reversed.

【００１７】[0017]

【発明が解決しようとする課題】従来の手法においては
Ｒ_ａｌｌ（Ｓ，ｉ，ｗ）を計算する手続きのステップ
３、ステップ４の処理において論理上の接続可能性のあ
る単語ｗ、ｗ’の組み合わせの数が多く処理に時間がか
かっていたが、これらの接続可能性のある単語ｗ，ｗ’
の組み合わせのうち、意味のある接続が可能（γ（ｒｔ
_ｗ’，ｆｔ_ｗ）≧０）な組は、一般的に、実データに
おいておよそ１０％程度と少ないにもかかわらず、全て
のケースについて計算をするので非効率的だった。In the conventional method, the combination of words w and w 'which are logically connectable in steps 3 and 4 of the procedure for calculating R_all (S, i, w) And the processing took a long time, but these connectable words w and w '
, Meaningful connection is possible (γ (rt
_w ′, ft_w) ≧ 0) is generally inefficient because the calculation is performed for all cases, although the actual data is as small as about 10%.

【００１８】[0018]

【課題を解決するための手段】本発明では、この部分で
接続できない語の組の対応関係のチェックを省くことに
よって解析を効率的に行い高速化することが可能である
ことに着目して、上記探索のステップ３において候補単
語ｗと接続可能なＷ＾ｅｎｄ_ｉ（Ｓ）中の語ｗ’のリ
ストをあらかじめ用意して、探索時には接続可能な語の
組についてのみステップ４を実行する。これによって上
記ステップ３、ステップ４を高速に処理することを可能
にする。In the present invention, attention is paid to the fact that the analysis can be performed efficiently and the processing speed can be increased by omitting the check of the correspondence relation of the set of words that cannot be connected at this part. In step 3 of the search, a list of words w 'in W @ end_i (S) that can be connected to the candidate word w is prepared in advance, and step 4 is executed only for a set of connectable words at the time of search. This makes it possible to process Steps 3 and 4 at high speed.

【００１９】[0019]

【発明の実施の形態】本発明では、この課題を高速に動
作しかつ十分小さいデータ量で表現可能とするために、
以下のようなデータ構造を用いる。DESCRIPTION OF THE PREFERRED EMBODIMENTS In the present invention, in order to operate this object at high speed and to express it with a sufficiently small data amount,
The following data structure is used.

【００２０】ステップ３において、ステップ２で与えら
れた部分解ｒ’に接続する（Ｗ＾ｓｔａｒｔ_ｉ中の）
単語のリストは経路情報ｒ’の単語ｗ’の後接続属性コ
ードｍ’と候補単語ｗの集合Ｗ＾ｓｔａｒｔ_ｉ（Ｓ）
から求めることができる。単語辞書Ｄに含まれる単語の
数は有限だから、可能なＷ＾ｓｔａｒｔ＿ｉ（Ｓ）の集
合Ｌ_Ｗ（Ｄ）＝｛Ｗ＾ｓｔａｒｔ_ｉ（Ｓ）｜Ｓ∈Σ，
ｉ（０≦ｉ＜｜Ｓ｜）｝は有限である。これは単語辞書
Ｄをトライで表現した場合｜Ｌ_Ｗ（Ｄ）｜はトライの
ノードの数でおさえられることによる。In step 3, connect to the partial solution r 'given in step 2 (in W @ start_i)
The list of words is a set W ＾ start_i (S) of the connection attribute code m ′ and the candidate word w after the word w ′ in the path information r ′.
Can be obtained from Since the number of words included in the word dictionary D is finite, a set L_W (D) of possible W ＾ start_i (S) = {W ＾ start_i (S) | S},
i (0 ≦ i <| S |)} is finite. This is because when the word dictionary D is represented by a trie, | L_W (D) | is suppressed by the number of nodes of the trie.

【００２１】したがって、あらかじめすべての後接続属
性コードｒｔ∈ＲＴと、単語辞書から得られるすべての
候補単語ｗの集合Ｗ＾ｓｔａｒｔ_ｉ（Ｓ）について、
後接続属性コードｒｔを持つ単語ｗ’に接続可能なＷ＾
ｓｔａｒｔ_ｉ中の単語のリストを用意することができ
る。Therefore, for all post-connection attribute codes rt @ RT and a set W @ start_i (S) of all candidate words w obtained from the word dictionary in advance,
W $ connectable to word w ′ having post-connection attribute code rt
A list of words in start_i can be prepared.

【００２２】以下のようなリストの配列の集合である配
列テーブルＡ_Ｗによりこれらを表現する。These are represented by an array table A_W which is a set of arrays of the following lists.

【００２３】Ｗ∈Ｌ_Ｗ（Ｄ）ごとに、後接続属性コー
ドｒｔのＩＤ番号をインデックスとするポインタ列ＡＨ
_Ｗと、Ｗ＾ｓｔａｒｔ_ｉ（Ｓ）中の単語のインデック
スの内容を示すデータ列ＡＤ_Ｗ［ｒｔ］を用いた。ポ
インタ列ＡＨ_Ｗの指すデータ列ＡＤ_Ｗ［ｒｔ］の要素
が候補単語の接続可能なＷ＾ｓｔａｒｔ_ｉ中の単語の
リストを指定する。For each W∈L_W (D), a pointer sequence AH indexed by the ID number of the subsequent connection attribute code rt
_W and a data string AD_W [rt] indicating the contents of the index of the word in W @ start_i (S). The element of the data string AD_W [rt] pointed to by the pointer string AH_W specifies a list of connectable candidate words in W @ start_i.

【００２４】図１で、これを具体的に見ると、次のよう
である。たとえば、入力文Ｓ「犬ははしる」の位置２で
終わる単語ｗ'１の“は”（ft格助詞，rt格助詞，１
０）に着目し、この“は”の後接続属性コードrtに接続
出来る可能性のある単語を単語辞書Ｄから検索する場合
を考える。位置２から始まる“はしる”に先頭部分が一
致する単語ｗとしては、検索文字列“はしる”に対応す
る単語辞書Ｄの内容から、ＷＬ＝Ｗ＾ｓｔａｒｔ_２
（Ｓ）としてブロック１２内にあるような単語が候補単
語ｗとしてあげられる。これらの候補単語ｗと着目して
いる単語ｗ'１の“は”との間では、接続できる単語は
（“は”，ft名詞，rt名詞，１００）、（“はし”，ft
名詞，rt名詞，１００）および（“はしる”，ft動詞，
rt動詞，１００）の三つにかぎられる。FIG. 1 shows this in detail as follows. For example, the word “ha” (ft case particle, rt case particle, 1) of the word w′1 ending at position 2 of the input sentence S “Inu wa Hashiru”
Focusing on 0), consider a case in which a word that can be connected to the connection attribute code rt after this “wa” is searched from the word dictionary D. As the word w whose head part matches “Hashiru” starting from position 2, WL = W ＾ start — 2 from the contents of the word dictionary D corresponding to the search character string “Hashiru”
A word in the block 12 as (S) is given as a candidate word w. Between these candidate words w and the word “ha” of the word of interest w ′ 1, connectable words are (“wa”, ft noun, rt noun, 100), (“hashi”, ft
Noun, rt noun, 100) and ("hasuru", ft verb,
rt verb, 100).

【００２５】したがって、ブロック１２内にある候補単
語ｗと位置２で終わる単語ｗ'１の“は”との接続につ
いて調べるときは、この三つとの関係を調べれば足りる
ことは明らかである。すなわち、本発明はこの点に着目
し、単語の接続可能性についてのデータを予め用意して
おき、接続できる単語間についてのみ形態素解析の処理
を行うことにするのである。Therefore, when examining the connection between the candidate word w in the block 12 and the word "ha" of the word w'1 ending at position 2, it is clear that it is sufficient to examine the relationship between these three. That is, the present invention pays attention to this point, prepares data on the connectability of words in advance, and performs morphological analysis processing only between connectable words.

【００２６】図１で見ると、配列テーブルＡ_Ｗのポイ
ンタ列ＡＨ_Ｗを検索して、ｉ＝１の単語ｗ'１の“は”
に対応するポインタを探す。次いで、このポインタ列で
の接続可能単語のリストを、rt格助詞への接続可能性を
示すデータ列ＡＤ_Ｗ［rt］から検索して、単語ｗ'１の
“は”のポインタに対応するデータＡＤ_Ｗ［rt格助
詞］を得て、単語ｗ'１の“は”と単語ＤのＷＬ＝Ｗ＾
ｓｔａｒｔ_２（Ｓ）との接続可能性性を決める。図の
入力文Ｓの場合には、ｉ＝１の単語ｗ'１の“は”に対
応するポインタ列ＡＨ_ＷからデータＡＤ_Ｗ［rt格助
詞］が指定され、データＡＤ_Ｗ［rt格助詞］の内容か
ら、接続できる単語は（“は”，ft名詞，rt名詞，１０
０）、（“はし”，ft名詞，rt名詞，１００）および
（“はしる”，ft動詞，rt動詞，１００）の三つである
ことが知れる。Referring to FIG. 1, the pointer string AH_W of the array table A_W is searched, and "ha" of the word w'1 of i = 1 is searched.
Find the pointer corresponding to. Next, the list of connectable words in the pointer string is searched from the data string AD_W [rt] indicating the connectability to the rt case particle, and the data AD_W corresponding to the pointer of “ha” of the word w′1 is retrieved. [Rt case particle] is obtained, “ha” of word w′1 and WL = W ＾ of word D are obtained.
The connection possibility with start_2 (S) is determined. In the case of the input sentence S in the figure, data AD_W [rt case particle] is designated from the pointer sequence AH_W corresponding to “ha” of the word w′1 of i = 1, and data AD_W [rt case particle] , Connectable words are ("wa", ft noun, rt noun, 10
0), ("Hashi", ft noun, rt noun, 100) and ("Hashiru", ft verb, rt verb, 100).

【００２７】図２では、他の入力文「かれはしる」の位
置２から始まる検索文字列が図１と同じ“はしる”とな
る場合の例を示している。検索文字列が同じであるか
ら、単語辞書Ｄの内容から、ＷＬ＝Ｗ＾ｓｔａｒｔ_ｎ
（Ｓ）としてブロック１２内にあるような単語が候補単
語としてあげられる。一方、配列テーブルＡ_Ｗのポイ
ンタ列ＡＨ＿Ｗから、位置２で終わる単語ｗ'２の“か
れ”に対応するポインタを探すことにより、rt名詞への
接続可能性を示すデータ列ＡＤ_Ｗ［rt］から検索し
て、単語ｗ'２の“かれ”のポインタに対応するデータ
ＡＤ_Ｗ［ｒｔ名詞］を得ることができる。この場合、
ＡＤ_Ｗ［ｒｔ名詞］が指す単語は（“は”，ft格助
詞，rt格助詞，１００）および（“はしる”，ft動詞，
rt動詞，１００）の二つであることが知れる。FIG. 2 shows an example in which the search character string starting from the position 2 of another input sentence "Karahashiru" is "Hashiru" as in FIG. Since the search character strings are the same, WL = W @ start_n
Words in the block 12 as (S) are listed as candidate words. On the other hand, by searching the pointer string AH_W of the array table A_W for a pointer corresponding to "he" of the word w'2 ending at position 2, a search is made from the data string AD_W [rt] indicating the possibility of connection to the rt noun. Thus, data AD_W [rt noun] corresponding to the pointer of “he” of the word w′2 can be obtained. in this case,
The words pointed to by AD_W [rt noun] are (“wa”, ft case particle, rt case particle, 100) and (“hashiru”, ft verb,
rt verb, 100).

【００２８】本発明では、このように配列テーブルＡ_
Ｗから後接続属性ｒｔに接続可能なＷ＾ｓｔａｒｔ_ｉ
中の単語のリストは、単にＷ＾ｓｔａｒｔ_ｉに対応す
るポインタ列の対応によって得られ、接続可能性のある
単語についてのみ形態素解析の処理を進めることを可能
にする。In the present invention, the array table A_
W @ start_i connectable from W to post-connection attribute rt
The list of words in the middle is obtained simply by the correspondence of the pointer sequence corresponding to W @ start_i, and enables the morphological analysis to proceed only for words that can be connected.

【００２９】単語辞書Ｄがトライで表現されている場
合、Ｗ∈Ｌ_Ｗ（Ｄ）はトライの各ノードごとに与えら
れる。本発明ではトライの各ノードに対応した単語リス
トＷＬを記録している。単語リストＷＬを保存すること
は単語を保存するより余分にメモリを必要とするが、形
態素の解析処理の中でリストを作成する時間が不要とな
り高速化につながる。When the word dictionary D is represented by a trie, W∈L_W (D) is given for each node of the trie. In the present invention, the word list WL corresponding to each node of the trie is recorded. Saving the word list WL requires more memory than saving words, but it does not require time to create the list in the morpheme analysis process, leading to higher speed.

【００３０】従って、トライのノードごとに対応する配
列テーブルＡ_Ｗもあらかじめ決められる。トライの各
ノードごとにあらかじめ配列テーブルＡ_Ｗを作成して
おくことにより、形態素の解析処理実行時に高速に検索
することができる。Therefore, an array table A_W corresponding to each node of the trie is also determined in advance. By creating the array table A_W in advance for each node of the trie, a high-speed search can be performed when executing the morpheme analysis processing.

【００３１】図３は配列テーブルＡ_Ｗのデータ量を圧
縮するための工夫を説明する図である。この実施例で
は、ポインタ列ＡＨ＿Ｗを圧縮する場合について説明す
る。FIG. 3 is a diagram for explaining a device for compressing the data amount of the array table A_W. In this embodiment, a case where the pointer sequence AH_W is compressed will be described.

【００３２】図１に示したように、後接続属性ｒｔに対
応するポインタ列ＡＨ＿Ｗのポインタには、それぞれ、
データ列ＡＤ_Ｗ［ｒｔ］が対応する。したがって、ポ
インタ列ＡＨ_Ｗのデータ量はポインタ列ＡＨ＿Ｗの長
さとポインタ列ＡＨ＿Ｗによって指定されるデータ列Ａ
Ｄ_Ｗ［ｒｔ］を特定するための幅に対応した広がりの
あるものとなる。図３(ａ)はこの広がりを模式的に示す
ために幅広の列で描いた。一方、図１には示されていな
いが、ポインタ列ＡＨ＿Ｗの指定するデータ列ＡＤ_Ｗ
［ｒｔ］が同じものになることは非常に多い。図３(ａ)
では、これを同じ文字ａ，ｂ，ｃで模式的に表した。こ
のことから容易に分かるように、ポインタ列ＡＨ_Ｗを
ポインタの序列を示す単純なインデックスナンバーを配
列した対応表ＡＭ_Ｗと、このインデックスナンバーに
対応する内容を示すデータリストａ，ｂ，ｃを一組だけ
持つテーブルＡＰ_Ｗとの組み合わせで構成するものと
した。したがって、この実施形態によれば、ポインタ列
ＡＨ_Ｗは、実質的に、インデックスナンバー示す列で
構成されている小容量のものと出来る。すなわち、イン
デックスナンバーの列ＡＭ_Ｗは、データとして必要な
行数を減らすことはできないが、データ量としては４バ
イトあれば足りる。データリストＡＰ_Ｗはデータ量と
しては、それぞれのデータで必要とするバイト数は必要
であるが、図に例示するようにその数は限られたもので
あるから、トータルのポインタ列ＡＨ_Ｗについてのデ
ータ量は大幅に削減できる。As shown in FIG. 1, the pointers of the pointer sequence AH_W corresponding to the post-connection attribute rt include:
The data string AD_W [rt] corresponds. Therefore, the data amount of the pointer string AH_W is equal to the length of the pointer string AH_W and the data string AH_W specified by the pointer string AH_W.
It has a width corresponding to the width for specifying D_W [rt]. FIG. 3 (a) is drawn in wide columns to schematically show this spread. On the other hand, although not shown in FIG. 1, the data string AD_W designated by the pointer string AH_W
[Rt] is very often the same. FIG. 3 (a)
Here, this is schematically represented by the same characters a, b, and c. As can be easily understood from this, the pointer sequence AH_W is composed of a set of a correspondence table AM_W in which simple index numbers indicating the order of pointers are arranged, and a data list a, b, c indicating the contents corresponding to the index numbers. It is configured in combination with the table AP_W. Therefore, according to the present embodiment, the pointer sequence AH_W can be of a small capacity substantially composed of a column indicating the index number. That is, for the column AM_W of the index number, the number of rows required as data cannot be reduced, but a data amount of 4 bytes is sufficient. The data list AP_W requires the number of bytes required for each data as the data amount, but the number is limited as shown in the figure, so the data amount for the total pointer row AH_W is limited. Can be greatly reduced.

【００３３】実施例としての説明は省略するが、データ
列ＡＤ_Ｗ［ｒｔ］についても、同様に圧縮できる。Although the description of the embodiment is omitted, the data string AD_W [rt] can be similarly compressed.

【００３４】以下、本発明にしたがって構成した日本語
形態素解析プログラムについての検証内容について説明
する。The verification contents of the Japanese morphological analysis program constructed according to the present invention will be described below.

【００３５】１．プログラムの実装方法１．１．単語辞書実験には約５３万語の単語辞書と６６９１個の接続関係
情報を利用した。1. Implementation method of program 1.1. Word dictionary For the experiment, a word dictionary of about 530,000 words and 6691 pieces of connection relation information were used.

【００３６】活用する語は語幹部分と語尾の部分を分割
して辞書に登録した。対応する語幹と語尾の間の接続は
後に述べる接続コストテーブルにおいてコスト０の接続
関係を設定することにより処理可能となる。The words to be used are registered in the dictionary by dividing the stem portion and the ending portion. The connection between the corresponding stem and the ending can be processed by setting a connection relationship of zero cost in a connection cost table described later.

【００３７】このことにより、辞書の構造を複雑にする
ことなく活用語を効率よく処理することが可能となる。
しかし反面、文の各位置で辞書から検索される候補単語
の数が増えるという欠点がある。しかし、この欠点は本
発明を用いることにより解析時間に対する悪影響にはな
らない。As a result, it is possible to efficiently process the vocabulary words without complicating the structure of the dictionary.
However, on the other hand, there is a disadvantage that the number of candidate words searched from the dictionary at each position of the sentence increases. However, this disadvantage does not adversely affect the analysis time by using the present invention.

【００３８】この辞書の単語は、最小接頭辞トライ（mi
nimal prefix TRIE）をダブル配列（double array）で
表現したデータ構造を用いてインプリメントした（松本
裕治ら著「単語と辞書」岩波書店ｐ．８０参照）。ただ
し、語の末尾を示す記号をもつリンクは省いた。従っ
て、トライの木の葉でないノードにも単語が対応しう
る。The words in this dictionary are the minimum prefix trie (mi
Implemented using a data structure expressed as a double array (double prefix, nimal prefix TRIE) (see “Words and Dictionaries” written by Yuji Matsumoto et al., Iwanami Shoten p.80). However, links with symbols indicating the end of words have been omitted. Therefore, a word can correspond to a node that is not a leaf of the trie tree.

【００３９】また、各ノードには先頭からそのノードま
での枝のラベルを連結した単語の品詞情報だけでなく、
先頭からの経路の途中までの枝のラベルを連結した単語
の長さと品詞情報も含めたリスト（を指す値）を記録し
てある点が上記の文献の例とは異なる。この変更によ
り、辞書を検索するときに途中の経路に規則された情報
を記録・整理する必要がなくなるため実行時間を省くこ
とができる。Each node has not only part-of-speech information of a word obtained by connecting the labels of the branches from the head to the node, but also
The point different from the example of the above-mentioned document is that a list (a value indicating) including the word length and part-of-speech information in which the labels of the branches from the beginning to the middle of the path are connected is recorded. With this change, it is not necessary to record and organize information regulated along a route in the middle of searching a dictionary, so that execution time can be saved.

【００４０】各ノードにつけられたリストについては、
ノードａがノードｂの先祖のノードであった場合、ａに
記録してあるリストはｂのリストの一部分と同じであ
る。したがって、トライの葉のノード以外のノードは子
孫のノードのリストの一部（長さの短いものだけ）を参
照することにより、このリストの情報の総量を減らすこ
とができる。この性質を利用して、リストを記憶する領
域を節約した。単語の文字列は日本語ＥＵＣコードで表
現した。トライの各枝には文字列の１バイトごとの値を
割り当てた。For the list attached to each node,
If node a is an ancestor node of node b, the list recorded in a is the same as a part of the list of b. Therefore, nodes other than the trie leaf node can reduce the total amount of information in this list by referring to a part (only a short one) of the descendant node list. By utilizing this property, the area for storing the list was saved. The character strings of the words were expressed in Japanese EUC code. Each branch of the trie was assigned a value for each byte of the character string.

【００４１】次に、トライをダブル配列により表現し
た。ダブル配列はトライの各ノードから出る枝のリンク
情報と、各ノードが記録する単語の品詞および長さ情報
のリストを指す値（ポインタ）、最小接頭辞以降の文字
列を指す値を保存する。Next, the tries were represented by a double array. The double array stores link information of a branch from each node of the trie, a value (pointer) indicating a list of word class and length information of words recorded by each node, and a value indicating a character string after the minimum prefix.

【００４２】このトライは与えられた文Ｓのある位置ｉ
から始まる文字列と一致するすべての単語を辞書から検
索をするのに適している。This try is performed at a position i of a given sentence S.
Suitable for searching the dictionary for all words that match a string starting with.

【００４３】１．２．接続コストテーブル接続コストテーブルは後接続属性と前接続属性の間の接
続コストを記した行列である。後接続属性ｒｔと前接続
属性ｆｔが接続可能の場合、０または正の整数値を、接
続できない場合は−１を記す。1.2. Connection Cost Table The connection cost table is a matrix describing connection costs between the back connection attribute and the previous connection attribute. When the rear connection attribute rt and the front connection attribute ft are connectable, 0 or a positive integer value is described.

【００４４】１．３．接続可能単語検索テーブル接続可能単語検索テーブルである配列テーブルＡ_Ｗの
ポインタ列ＡＨ＿Ｗは図３(ｂ)に示すような２種類の配
列からなるデータ構造によって作成した。ある単語のリ
ストに対応した接続可能な単語の検索テーブルからは、
後接続属性ｒｔと接続可能な語のインデックスのリスト
が得られる。単語リストＷＬのｉ番目の要素をＷＬ［ｒ
ｔ］、接続属性をインデックスとした配列テーブルＡ_
Ｗの後接続属性ｒｔに対応した要素をＡ_Ｗ［ｒｔ］、
Ａ_Ｗ［ｒｔ］が指すリストがｎ個の要素を持ちそのｊ
番目の要素がＡ_Ｗ［ｒｔ］［ｊ］である場合、ｒｔに
接続する単語のリストは、ＷＬ［Ａ_Ｗ［ｒｔ］
［１］］，ＷＬ［Ａ_Ｗ［ｒｔ］［２］］，．．．，Ｗ
Ｌ［Ａ_Ｗ［ｒｔ］［ｎ］］となる。1.3. Connectable Word Search Table The pointer row AH_W of the array table A_W, which is a connectable word search table, was created with a data structure composed of two types of arrays as shown in FIG. From the connectable word lookup table corresponding to a list of words,
A list of indices of words that can be connected to the post-connection attribute rt is obtained. Let the ith element of the word list WL be WL [r
t], array table A_ using connection attributes as indexes
A_W [rt] is an element corresponding to W after connection attribute rt,
The list pointed to by A_W [rt] has n elements and its j
If the th element is A_W [rt] [j], the list of words connected to rt is WL [A_W [rt]
[1]], WL [A_W [rt] [2]],. . . , W
L [A_W [rt] [n]].

【００４５】また、この配列テーブルにより接続可能な
語の数に比例した時間ですべての接続可能な語を列挙す
ることが可能となる。これは特に、単語リストＷＬが大
きい場合に非常に有効である。Further, it is possible to enumerate all connectable words in a time proportional to the number of connectable words by using the arrangement table. This is particularly effective when the word list WL is large.

【００４６】１．４．形態素解析ルーチン配列テーブルＡ_Ｗを用いて高速に接続可能な語のみを
選び出し、接続可能な語についてのみ接続コストテーブ
ルを検索することになるため処理速度が高速になる。1.4. Morphological analysis routine Only words that can be connected at high speed are selected using the array table A_W, and the connection cost table is searched for only the connectable words, so that the processing speed is increased.

【００４７】２．結果探索を高速化するために用いた配列テーブルＡ_Ｗのポ
インタ列ＡＨ_Ｗは、図３（ｂ）に示すように、同じ配
列同士をマージすることにより９７１個になった。ま
た、ポインタ列ＡＨ_Ｗが指すデータＡＤ_Ｗはマージ後
で２４，３６２個である。2. Result As shown in FIG. 3B, the number of the pointer arrays AH_W of the array table A_W used for speeding up the search became 971 by merging the same arrays. The number of data AD_W indicated by the pointer sequence AH_W is 24,362 after merging.

【００４８】探索の高速化のためのデータの量は合計で
４，１４７，２５２バイトである。また、単語辞書（単
語リストＬ_Ｗ（Ｄ）のデータを含む）の大きさは約１
２Ｍバイトである。The amount of data for speeding up the search is 4,147,252 bytes in total. The size of the word dictionary (including data of the word list L_W (D)) is about 1
2 Mbytes.

【００４９】本実施例の実行時には、最大で約２２Ｍｂ
ｙｔｅｓの主記憶を使うが、ＯＳの機能であるmapped f
ile systemを用いて主記憶は１０Ｍｂｙｔｅｓ以下でも
動作可能である。実行速度はＣＰＵにもよるが、最高３
２，０００バイト／秒以上が可能であり、従来に比べて
６〜８倍程度の高速化が実現できていることになる。At the time of execution of this embodiment, a maximum of about 22 Mb
It uses the main memory of ytes, but the mapped f
Using the ile system, the main memory can operate even at 10 Mbytes or less. The execution speed depends on the CPU, but up to 3
2,000 bytes / sec or more is possible, which means that a speedup of about 6 to 8 times as compared with the related art can be realized.

【００５０】なお、この高速動作が可能な要因は前述の
配列テーブルＡ_Ｗのほかにもメモリアクセス方法のチ
ューニングや、単語リストＷ＾ｓｔａｒｔ_ｉを辞書中
にあらかじめ用意しておくなどといった改良も寄与して
いる。The factors enabling this high-speed operation also contribute to improvements such as tuning the memory access method and preparing the word list W @ start_i in the dictionary in advance in addition to the array table A_W. I have.

【００５１】なお、この手法は必要な探索を切り捨てる
ことはないので、探索の結果は従来方法と同じであるこ
とは言うまでもなかろう。It should be noted that this method does not truncate the necessary search, so that the result of the search is the same as the conventional method.

【００５２】[0052]

【発明の効果】接続可能な単語のリストを使うことによ
って、従来の手続きでもっとも処理時間を要した処理の
不要な部分を除き、効率を上げることにより解析処理を
高速化することができる。By using a list of connectable words, the analysis process can be speeded up by increasing the efficiency except for the unnecessary portion of the process that requires the longest processing time in the conventional procedure.

[Brief description of the drawings]

【図１】配列テーブルＡ_Ｗにより接続可能な単語の対
応を検索する処理の流れを説明するための模式図。FIG. 1 is a schematic diagram for explaining a flow of a process of searching for correspondence of connectable words using an array table A_W.

【図２】配列テーブルＡ_Ｗにより図１とは異なる品詞
の接続可能な単語の対応を検索する処理の流れを説明す
るための模式図。FIG. 2 is a schematic diagram for explaining a flow of a process of searching a correspondence of connectable words having parts of speech different from those in FIG. 1 using an arrangement table A_W.

【図３】ポインタ列ＡＨ_Ｗのデータ量を削減する工夫
を説明するための模式図。FIG. 3 is a schematic diagram for explaining a device for reducing the data amount of a pointer array AH_W.

【手続補正書】[Procedure amendment]

【提出日】平成１１年３月１５日（１９９９．３．１
５）[Submission date] March 15, 1999 (1999.3.1.
5)

【手続補正１】[Procedure amendment 1]

【補正対象書類名】明細書[Document name to be amended] Statement

【補正対象項目名】０００４[Correction target item name] 0004

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【０００４】単語の後接続属性ＲＴとこれに接続される
単語の前接続属性ＦＴの間の接続しやすさを示す関数γ
について見ると、γ：ＲＴ×ＦＴ→｛ｘ｜ｘは０以上の
整数、または−１｝と表現され、前側のある単語ｗ’と
その後のある単語ｗとがγ（ｆｔ_ｗ，ｒｔ_ｗ’）≧０
のとき単語ｗ’と単語ｗとは接続可能、−１のとき接続
不能と呼ぶ。ここで、関数γを接続コスト関数、γの返
す値を接続コストと呼ぶ。[0004] A function γ indicating the ease of connection between the post-connection attribute RT of a word and the pre-connection attribute FT of a word connected thereto.
, Γ: RT × FT → {x | x is expressed as an integer equal to or greater than 0 or −1}, and a certain word w ′ on the front side and a certain word w thereafter are γ (ft_w, rt_w ′) ≧ 0
When the word w ′ and the word w are connectable, when −1, the word w ′ and the word w are not connectable. Here, the function γ is called a connection cost function, and the value returned by γ is called a connection cost.

Claims

[Claims]

1. A word string D having a word string D having a meaning and a word dictionary D described by a trie in which a pair of a word notation and a word connection attribute is recorded, and a connection cost table describing connection costs between word attributes. A morphological analysis method for decomposing a sentence S into a word string, wherein a set of words w in a dictionary D starting from a certain position i of the sentence S is defined as W ^ start_i (S), and after a word w 'ending with i-1. When the connection attribute is rt_w ', the connection possibility between the word w' and the word w is determined by the post-connection attribute of rt_w '.
A morphological analysis method that searches only for a relationship with a subset of words in W ^ start_i (S) that can be connected to the word.

2. The connection possibility between the word w ′ and the word w is
2. The morphological analysis method according to claim 1, wherein the post-connection attribute is prepared as an array table indicating connectable words corresponding to rt_w '.

3. The arrangement table is prepared as a table including a pointer string corresponding to the connection attribute rt_w 'after the word w' and a data string indicating a connectable word list corresponding to each pointer of the pointer string. 3. The morphological analysis method according to claim 2, wherein the pointer sequence is composed of an index number indicating the order of the pointer sequence and data indicating the content of the index number, and the same content of the index number is integrated.

4. A sentence comprising a word dictionary D described by a trie in which a pair of a word notation and a word connection attribute is recorded, a connection cost table describing connection costs between word attributes, and a character string having a meaning. A table indicating a set of words w in the dictionary D starting from a position i connectable to a word w ′ ending at a position i-1 of S, and performing a morphological analysis of the sentence S to a subset of the connectable words w A morphological analyzer that searches only for.

5. A connection possibility between the word w ′ and the word w,
The morphological analyzer according to claim 4, wherein the post-connection attribute is prepared as an array table indicating connectable words corresponding to rt_w '.

6. The method according to claim 6, wherein the table is connected to the word w 'after the connection attribute rt
_w 'is prepared as an array table including a data string indicating a connectable word list corresponding to each pointer of the pointer string and each pointer of the pointer string, and the pointer string has an index number indicating an order of the pointer string. 6. The morphological analyzer according to claim 5, comprising data indicating the contents of the index numbers, and integrating the same contents of the index numbers.