JP2004252775A

JP2004252775A - Word extraction device, word extraction method and program

Info

Publication number: JP2004252775A
Application number: JP2003043311A
Authority: JP
Inventors: Yoshihiro Matsuo; 義博松尾
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-02-20
Filing date: 2003-02-20
Publication date: 2004-09-09

Abstract

【課題】所定の文書から単語を抽出する場合、抽出された単語に、誤認識語が混入される割合を低減することができる単語抽出装置を提供することを目的とする。
【解決手段】文書から所定の単語を抽出する単語抽出装置において、所定の入力文書から文書内単語を抽出し、この抽出された文書内単語を、単語リストに記憶する単語抽出部と、上記単語リスト中の各単語に、重要度を付与する重要度付与部と、上記単語リスト中の所定の単語と他の単語との結束度を、上記所定の単語に付与する結束度付与部と、上記重要度と上記結束度とが付加されている単語リストを出力する単語リスト出力部とを有することを特徴とする単語抽出装置。
【選択図】図１
[Problem] To provide a word extraction device that, when extracting words from a predetermined document, can reduce the rate at which misrecognized words are mixed into the extracted words.
[Solution] A word extraction device for extracting a predetermined word from a document, comprising: a word extraction unit that extracts words from a predetermined input document and stores the extracted words in a word list; an importance assigning unit that assigns an importance to each word in the word list; a cohesion degree assigning unit that assigns, to a predetermined word in the word list, a cohesion degree between that word and the other words; and a word list output unit that outputs the word list to which the importance and the cohesion degree have been added.
[Selected drawing] Fig. 1

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識や文字認識等を行った結果の語に重要度を付与し、重要語を抽出する単語抽出装置に関する。
【０００２】
【従来の技術】
従来の単語抽出装置では、入力文書中の出現頻度や外部コーパスでの出現頻度に基づいて、重要度を計算する手法が一般的である。たとえば、ｔｆ・ｉｄｆ法では、入力文書中で、その語の出現頻度（ｔｆ）と、新聞記事等の外部コーパス中においてその語が出現する文書数の逆数（ｉｄｆ；一般にはさらに対数を取って利用する）との積をもって重要度とする（たとえば、特許文献１参照）。また、文書の重要度と単語間の関連性とを評価することが知られている（たとえば、特許文献２参照）。
【０００３】
これらの手法において、入力文書が、完全に信頼がおけるものであるとして設計され、たとえば、音声認識や文字認識の結果のように、入力に誤りが含まれている場合にも、誤り語を無分別に採用するという問題がある。
【０００４】
【特許文献１】
特開２００１−０６７３６２号公報
【特許文献２】
特開２００１−１０１１９４号公報
【０００５】
【発明が解決しようとする課題】
従来の単語抽出装置では、認識誤りの可能性を考慮しないので、誤認識率に応じて、選ばれた重要語に誤認識語が混入する。
【０００６】
重要語は、いわば文書の特徴を少数の語で代表させた語であるということができ、選択された重要語に誤認識語が混入していれば、その重要語を用いた処理（要約や関連文書検索等の処理）に、致命的な悪影響を及ぼす。
【０００７】
したがって、選択された重要語に誤認識語が混入していれば、この重要語の誤認識語混入は、単に非重要語が選ばれた場合とは、質の異なる誤抽出であり、誤認識語の混入を避けることが望まれる。認識誤りかどうかを決定的に判別することは不可能であるので、ある程度、誤認識語が混入することを避けることはできないが、誤認識語混入の割合をできるだけ低くすることが望まれる。
【０００８】
本発明は、所定の文書から単語を抽出する場合、抽出された単語に、誤認識語が混入される割合を低減することができる単語抽出装置を提供することを目的とする。
【０００９】
【課題を解決するための手段】
本発明は、抽出単語と文書中の他の単語との結束度を表す尺度を、抽出単語に付与することによって、重要語への誤認識語の混入割合を低減することを可能とする。
【００１０】
また、本発明は、文書から所定の単語を抽出する単語抽出方法において、所定の入力文書から文書内単語を抽出し、この抽出された文書内単語を、単語リストに記憶する単語抽出段階と、上記単語リスト中の各単語に、重要度を付与する重要度付与段階と、上記単語リスト中の所定の単語と他の単語との結束度を、上記所定の単語に付与する結束度付与段階と、上記重要度と上記結束度とが付加されている単語リストを出力する単語リスト出力段階とを有することを特徴とする単語抽出方法である。
【００１１】
さらに、本発明は、文書から所定の単語を抽出するプログラムにおいて、所定の入力文書から文書内単語を抽出し、この抽出された文書内単語を、単語リスト記憶装置に格納されている単語リストに、単語抽出部が、記憶させる単語抽出手順と、上記単語リスト中の各単語に、重要度付与部が、重要度を付与する重要度付与手順と、上記単語リスト中の所定の単語と他の単語との結束度を、結束度付与部が、上記所定の単語に付与する結束度付与手順と、上記重要度と上記結束度とが付加されている単語リストを、出力部に出力させる単語リスト出力手順とをコンピュータに実行させるプログラムである。
【００１２】
【発明の実施の形態および実施例】
図１は、本発明の一実施例である単語抽出装置１００を示す構成図である。
【００１３】
入力文書は、入力装置１１０から入力され、単語抽出部１２０によって文書内の単語が抽出され、単語リスト記憶装置１３０に保存される。
【００１４】
単語抽出部１２０は、形態素解析によってテキストから単語に分割するものである。
【００１５】
重要度付与部１４０は、単語リスト記憶装置１３０内の各単語について、重要度を付与する。付与される重要度は、自動要約装置や文書自動分類装置、関連文書検索装置等の自然言語処理システムで広く用いられる。
【００１６】
重要度を計算する場合、たとえば、ｔｆ・ｉｄｆ等を利用するようにしていもよい。結束度付与部１５０は、単語リスト記憶装置１３０内の各単語について、結束度を付与する。結束度を計算する場合、たとえば、別コーパスから算出した単語共起ベクトルの距離や、別コーパスから算出した単語間の相互情報量等を利用することができる。
【００１７】
単語ｗ_ｉとｗ_ｊとの相互情報量Ｉ（ｗ_ｉ，ｗ_ｊ）を用いて、結束度（ｗ_ｉ）を計算する場合、たとえば、次の式（１）によって、結束度（ｗ_ｉ）を算出することができる。
【００１８】
【数１】

相互情報量Ｉ（ｗ_ｉ，ｗ_ｊ）は、新聞記事等の別コーパスから予め算出し、別コーパス中における所定の単語の出現確立Ｐ（ｗ）と、共起出現確立Ｐ（ｗ_ｉ，ｗ_ｊ）とを用い、次の式（２）によって、相互情報量Ｉ（ｗ_ｉ，ｗ_ｊ）を計算することができる。
【００１９】
【数２】

なお、上記式（２）において、対数の底に、何にとるかは任意である。
【００２０】
単語共起ベクトルの距離を用いる場合、たとえば、単語ベクトル
【００２１】
【数３】

と、文書ベクトル
【００２２】
【数４】

とのなす角度を用い、次の式（３）によって、単語共起ベクトルの距離を算出することができる。
【００２３】
【数５】

ここで、新聞記事等の別コーパスから、各単語ベクトルｗ_ｉを予め算出し、別コーパス内の各文書がベクトルの次元を持ち、所定の単語が、文書中に出現すれば、１であり、出現しなければ、０であると定義する。角度（上記単語ベクトルと上記文書ベクトルとのなす角度）の代わりに、内積やユークリッド距離等を用いても、上記と同様である。
【００２４】
単語リスト出力部１６０は、重要度と結束度とが付加された単語を、記憶装置１３０から取り出して出力する。
【００２５】
出力されたリストから重要語を抽出する場合、たとえば、重要度と結束度との積が大きい順にソートし、上位ｎ単語を抽出すればよい。重要度と結束度とについて、上記のように積を求める代わりに、加算するようにしてもよく、上記と同様である。この場合、重み付けして加算してもよく、また、重み付けしないで加算するようにしてもよい。
【００２６】
次に、上記実施例の動作について説明する。
【００２７】
日本語音声を自動音声認識した文書を例にとって、単語抽出装置１００の動作を説明する。
【００２８】
図２は、日本語音声を自動音声認識した文書の例を示す図である。
【００２９】
図２に示す例文書では、「アザラシ」、「タマ」、「帷子川」は正しく音声認識された単語であるが、「イチロー」は誤認識された語であるとして説明する。
【００３０】
図２の入力文書は、入力装置１１０から入力され、上記入力された入力文書から、単語抽出部１２０が、単語「アザラシ」、「タマ」、「帷子川」、「イチロー」を抽出し、これら抽出された単語が、単語リスト記憶装置１３０に保存される。
【００３１】
この例では、入力文書は、日本語プレインテキストであるので、単語抽出部１２０は、形態素解析装置を用い、容易に構成可能である。仮に入力文書が、音声認識装置からのマークアップつきの文書であれば、単語抽出部１２０の処理は、単語とマークされている箇所とを取り出すだけであり、さらに容易に構成可能である。
【００３２】
重要度付与部１４０は、単語リスト記憶装置１３０に記憶されている各単語について、重要度を付与する。ここでは、ｔｆ・ｉｄｆ法を用い、
重要度（「アザラシ」）＝１．０
重要度（「タマ」）＝４．０
重要度（「帷子川」）＝３．０
重要度（「イチロー」）＝４．０
であると計算されたとする。
【００３３】
ｔｆ・ｉｄｆの計算については、たとえば「東京大学出版会：情報検索と言語処理」を参照。
【００３４】
次に、結束度付与部１５０は、単語リスト記憶装置１３０に記憶されている各単語について、結束度を付与する。
【００３５】
図３は、共起頻度の例を示す図である。
【００３６】
上記実施例において、結束度を算出する場合、別コーパスから、図３に示す共起頻度が得られていたとし、この場合における相互情報量を用いる。図３に示す共起頻度から、相互情報量は、
Ｉ（「アザラシ，タマ」）＝８．７６
Ｉ（「アザラシ，帷子川」）＝８．９７
Ｉ（「アザラシ，イチロー」）＝０
Ｉ（「タマ，帷子川」）＝９．３８
Ｉ（「タマ，イチロー」）＝０
Ｉ（「帷子川，イチロー」）＝０
と計算できる（ここでは、対数の底を２とした）。
【００３７】
したがって、各単語の結束度ｆ（ｗ）は、
結束度（「アザラシ」）＝１７．７３
結束度（「タマ」）＝１８．１４
結束度（「帷子川」）＝１８．３５
結束度（「イチロー」）＝０
である。なお、重要度付与部１４０の処理順序と、結束度付与部１５０の処理順序とは任意である。
【００３８】
次に、出力部１６０は、重要度と結束度とが付加ている単語リストを、記憶装置１３０から取り出し、出力する。
【００３９】
図４は、上記実施例において、重要度と結束度とが付加されている単語リストを取り出した例を示す図である。
【００４０】
この結果に基づいて、たとえば、重要度２語を取り出すことを考える。従来技術では、図４において、「重要度」のみによって判断しているので、取り出される単語は、「タマ」と「イチロー」であるが、上記実施例において、たとえば、重要度と結束度との積を尺度とすれば、「タマ」と「帷子川」が取り出される。上記積の代わりに、線形結合を用いても、上記と同様の結果になる。
【００４１】
すなわち、出力部１６０が出力した単語リストから、上記重要度と上記結束度との積、または、線形結合に応じて、単語を取り出すようにしてもよい。
【００４２】
つまり、上記実施例によれば、抽出単語と、文書中の他の単語との結束度を表す尺度を付与することによって、誤認識語が重要語に混入する割合を低減することができる。これによって、算出された重要度を用いた要約や文書分類、関連文書検索等の自然言語処理の精度を向上させることができる。
【００４３】
また、上記実施例を、単語抽出方法として把握することができる。
【００４４】
つまり、上記実施例は、文書から所定の単語を抽出する単語抽出方法において、所定の入力文書から文書内単語を抽出し、この抽出された文書内単語を、単語リストに記憶する単語抽出段階と、上記単語リスト中の各単語に、重要度を付与する重要度付与段階と、上記単語リスト中の所定の単語と他の単語との結束度を、上記所定の単語に付与する結束度付与段階と、上記重要度と上記結束度とが付加されている単語リストを出力する出力段階とを有することを特徴とする単語抽出方法の例である。
【００４５】
さらに、上記実施例をプログラムとして把握することができる。すなわち、上記実施例は、文書から所定の単語を抽出するプログラムにおいて、所定の入力文書から文書内単語を抽出し、この抽出された文書内単語を、単語リスト記憶装置に格納されている単語リストに、単語抽出部が、記憶させる単語抽出手順と、上記単語リスト中の各単語に、重要度付与部が、重要度を付与する重要度付与手順と、上記単語リスト中の所定の単語と他の単語との結束度を、結束度付与部が、上記所定の単語に付与する結束度付与手順と、上記重要度と上記結束度とが付加されている単語リストを、出力部に出力させる単語リスト出力手順とをコンピュータに実行させるプログラムの例である。
【００４６】
また、上記実施例を次のように把握することができる。すなわち、上記実施例は、文書から所定の単語を抽出する単語抽出装置において、所定の入力文書から文書内単語を抽出し、この抽出された文書内単語を、単語リストに記憶する単語抽出部と、上記単語リスト中の各単語に、重要度を付与する重要度付与部と、上記単語リスト中の所定の単語と他の単語との結束度を、上記所定の単語に付与する結束度付与部と、上記単語抽出部が抽出した単語のうちで、上記重要度と上記結束度とに応じた単語を出力する単語出力部とを有することを特徴とする単語抽出装置の例である。
【００４７】
【発明の効果】
本発明によれば、所定の文書から単語を抽出する場合、抽出された単語に、誤認識語が混入される割合を低減することができるという効果を奏する。
【図面の簡単な説明】
【図１】本発明の一実施例である単語抽出装置１００を示す構成図である。
【図２】日本語音声を自動音声認識した文書の例を示す図である。
【図３】共起頻度の例を示す図である。
【図４】上記実施例において、重要度と結束度とが付加されている単語リストを取り出した例を示す図である。
【符号の説明】
１００…単語抽出装置、
１１０…入力装置、
１２０…単語抽出装置、
１３０…単語リスト記憶装置、
１４０…重要度付与部、
１５０…結束度付与部、
１６０…出力部、
ｗ_ｉ、ｗ_ｊ…単語、
Ｉ（ｗ_ｉ，ｗ_ｊ）…単語ｗ_ｉとｗ_ｊとの相互情報量、
ｗ_ｉ…結束度、
Ｐ（ｗ）…別コーパス中における所定の単語の出現確立、
Ｐ（ｗ_ｉ，ｗ_ｊ）…共起出現確立、
ｆ（ｗ）…単語の結束度。

[0001]
[Technical field of the invention]
The present invention relates to a word extraction device that assigns importance to words resulting from speech recognition, character recognition, and the like, and extracts important words.
[0002]
[Prior art]
Conventional word extraction devices generally calculate importance from the frequency of occurrence in the input document and the frequency of occurrence in an external corpus. For example, in the tf-idf method, the importance of a word is the product of its frequency of occurrence in the input document (tf) and the reciprocal of the number of documents in which it appears in an external corpus such as newspaper articles (idf; in practice the logarithm is usually taken) (for example, see Patent Document 1). It is also known to evaluate the importance of a document together with the relatedness between words (for example, see Patent Document 2).
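As a rough illustration of the tf-idf weighting described above, the following Python sketch multiplies a word's frequency in the input document by the logarithm of the inverse document frequency in an external corpus. The function name `tf_idf`, the toy corpus, the `log(N / df)` form, and the handling of words absent from the corpus are all assumptions for illustration; the cited references may define the weighting differently.

```python
import math
from collections import Counter


def tf_idf(document_words, corpus_documents):
    """Return {word: tf * log(N / df)} for the words of one input document.

    document_words: list of words in the input document.
    corpus_documents: list of word lists forming the external corpus (toy data here).
    Words that never occur in the corpus are given df = 1 so the score stays
    finite; this smoothing is an assumption of the sketch.
    """
    n_docs = len(corpus_documents)
    tf = Counter(document_words)
    scores = {}
    for word, freq in tf.items():
        df = sum(1 for doc in corpus_documents if word in doc)
        scores[word] = freq * math.log(n_docs / max(df, 1))
    return scores


# Hypothetical external corpus of four short documents.
corpus = [["seal", "river"], ["river", "bridge"], ["baseball", "game"], ["river", "game"]]
scores = tf_idf(["seal", "river", "seal"], corpus)
```

In this toy input, "seal" is frequent in the document and rare in the corpus, so it scores higher than "river", which appears in three of the four corpus documents.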
[0003]
These methods are designed on the assumption that the input document is completely reliable. As a result, even when the input contains errors, as with the output of speech recognition or character recognition, erroneous words are adopted indiscriminately.
[0004]
[Patent Document 1]
Japanese Patent Application Laid-Open No. 2001-067362
[Patent Document 2]
Japanese Patent Application Laid-Open No. 2001-101194
[0005]
[Problems to be solved by the invention]
Conventional word extraction devices do not take the possibility of recognition errors into account, so misrecognized words are mixed into the selected important words in proportion to the misrecognition rate.
[0006]
Important words can be regarded as a small set of words that represent the characteristics of a document. If misrecognized words are mixed into the selected important words, processing that uses those important words (such as summarization or related document search) is fatally impaired.
[0007]
Therefore, the mixing of a misrecognized word into the selected important words is an extraction error of a different kind from the mere selection of a non-important word, and it is desirable to avoid it. Since it is impossible to determine conclusively whether a word is a recognition error, some mixing of misrecognized words is unavoidable, but it is desirable to keep the proportion of misrecognized words as low as possible.
[0008]
An object of the present invention is to provide a word extraction device that can reduce the rate at which misrecognized words are mixed into extracted words when words are extracted from a predetermined document.
[0009]
[Means for Solving the Problems]
The present invention makes it possible to reduce the mixing ratio of misrecognized words into important words by assigning a scale indicating the degree of cohesion between the extracted words and other words in the document to the extracted words.
[0010]
The present invention also provides a word extraction method for extracting a predetermined word from a document, comprising: a word extraction step of extracting words from a predetermined input document and storing the extracted words in a word list; an importance assigning step of assigning an importance to each word in the word list; a cohesion degree assigning step of assigning, to a predetermined word in the word list, a cohesion degree between that word and the other words; and a word list output step of outputting the word list to which the importance and the cohesion degree have been added.
[0011]
Further, the present invention provides a program for extracting a predetermined word from a document, the program causing a computer to execute: a word extraction procedure in which a word extraction unit extracts words from a predetermined input document and stores the extracted words in a word list held in a word list storage device; an importance assigning procedure in which an importance assigning unit assigns an importance to each word in the word list; a cohesion degree assigning procedure in which a cohesion degree assigning unit assigns, to a predetermined word in the word list, a cohesion degree between that word and the other words; and a word list output procedure of causing an output unit to output the word list to which the importance and the cohesion degree have been added.
[0012]
[Embodiments and examples of the invention]
FIG. 1 is a configuration diagram showing a word extraction device 100 according to one embodiment of the present invention.
[0013]
The input document is input from the input device 110, words in the document are extracted by the word extraction unit 120, and stored in the word list storage device 130.
[0014]
The word extraction unit 120 divides text into words by morphological analysis.
[0015]
The importance assigning unit 140 assigns an importance to each word in the word list storage device 130. The assigned importance is widely used in natural language processing systems such as an automatic summarizing apparatus, an automatic document classifying apparatus, and a related document search apparatus.
[0016]
The importance may be calculated using, for example, tf-idf. The cohesion degree assigning unit 150 assigns a cohesion degree to each word in the word list storage device 130. The cohesion degree can be calculated using, for example, the distance between word co-occurrence vectors calculated from another corpus, or the mutual information between words calculated from another corpus.
[0017]
When the cohesion degree (w_i) is calculated using the mutual information I (w_i, w_j) between words w_i and w_j, it can be calculated, for example, by the following equation (1).
[0018]
(Equation 1)
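The image of equation (1) is not reproduced in this text. A form consistent with the worked example in paragraph [0037], where each word's cohesion degree equals the sum of its mutual information with every other extracted word (for instance 8.76 + 8.97 + 0 = 17.73), would be the following. The summation over the other words of the word list is an inference from that example, not a transcription of the original formula.

```latex
\mathrm{cohesion}(w_i) = \sum_{j \neq i} I(w_i, w_j)
```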

The mutual information I (w_i, w_j) is calculated in advance from another corpus such as newspaper articles. Using the occurrence probability P (w) of a predetermined word in that corpus and the co-occurrence probability P (w_i, w_j), the mutual information I (w_i, w_j) can be calculated by the following equation (2).
[0019]
(Equation 2)
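The image of equation (2) is not reproduced in this text. Since the surrounding text says the formula uses the occurrence probability P (w), the co-occurrence probability P (w_i, w_j), and a logarithm, the standard pointwise mutual information would fit that description. The following is an assumed reconstruction on that basis.

```latex
I(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\, P(w_j)}
```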

In the above equation (2), the base of the logarithm may be chosen arbitrarily.
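A minimal Python sketch of estimating the mutual information from document counts in a separate corpus, assuming the pointwise form log(P(w_i, w_j) / (P(w_i) P(w_j))). The function name `mutual_information`, the count-based probability estimates, and the example counts are hypothetical. Returning 0 for pairs that never co-occur is also an assumption, chosen because the worked example in paragraph [0036] lists 0 for every pair involving the misrecognized word.

```python
import math


def mutual_information(n_docs, count_i, count_j, count_ij, base=2.0):
    """Pointwise mutual information estimated from document counts.

    n_docs: number of documents in the separate corpus.
    count_i, count_j: number of documents containing each word.
    count_ij: number of documents containing both words.
    Returns 0.0 when the words never co-occur (assumption of this sketch).
    """
    if count_ij == 0:
        return 0.0
    p_i = count_i / n_docs
    p_j = count_j / n_docs
    p_ij = count_ij / n_docs
    return math.log(p_ij / (p_i * p_j), base)


# Hypothetical counts, not taken from FIG. 3.
example = mutual_information(n_docs=8, count_i=2, count_j=2, count_ij=2)
never_together = mutual_information(n_docs=8, count_i=2, count_j=2, count_ij=0)
```

The `base` parameter reflects the statement that the base of the logarithm is arbitrary; the worked example uses base 2.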
[0020]
When the distance between word co-occurrence vectors is used, for example, the word vector
[0021]
(Equation 3)

and the document vector
[0022]
(Equation 4)

are taken, and the distance between word co-occurrence vectors can be calculated by the following equation (3) using the angle formed by the two vectors.
[0023]
(Equation 5)

Here, each word vector w_i is calculated in advance from another corpus such as newspaper articles. Each document in that corpus corresponds to one dimension of the vector, and the component is defined as 1 if the predetermined word appears in that document and 0 if it does not. An inner product, a Euclidean distance, or the like may be used in the same way instead of the angle (the angle between the word vector and the document vector).
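The vector-based alternative can be sketched in Python as follows. The binary word vectors follow the definition in the text (one dimension per corpus document, 1 if the word appears in it). The definition of the document vector is not recoverable from this text, so the sketch assumes it is the sum of the word vectors of the extracted words; that choice, the function name `angle_between`, and the five-document toy corpus are all hypothetical.

```python
import math


def angle_between(u, v):
    """Angle in radians between two equal-length vectors; pi/2 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return math.pi / 2
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)


# Hypothetical occurrence vectors over a five-document corpus:
# component k is 1 if the word appears in corpus document k, else 0.
word_vectors = {
    "seal": [1, 1, 0, 0, 1],
    "Tama": [1, 1, 0, 0, 0],
    "Ichiro": [0, 0, 1, 1, 0],
}

# Assumption: the document vector is the sum of the extracted words' vectors.
document_vector = [sum(column) for column in zip(*word_vectors.values())]

angles = {word: angle_between(vec, document_vector) for word, vec in word_vectors.items()}
```

Under these toy vectors, the word that shares no corpus documents with the others ends up at the largest angle from the document vector, which is the behavior a cohesion measure is meant to capture.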
[0024]
The word list output unit 160 extracts the word to which the importance and the cohesion degree are added from the storage device 130 and outputs the word.
[0025]
To extract important words from the output list, the words may, for example, be sorted in descending order of the product of importance and cohesion degree, and the top n words extracted. Instead of taking the product, the importance and the cohesion degree may be added, with the same effect. The addition may be weighted or unweighted.
[0026]
Next, the operation of the above embodiment will be described.
[0027]
The operation of the word extraction device 100 will be described using, as an example, a document obtained by automatic speech recognition of Japanese speech.
[0028]
FIG. 2 is a diagram illustrating an example of a document obtained by automatic speech recognition of Japanese speech.
[0029]
In the example document shown in FIG. 2, it is assumed that "seal" (アザラシ), "Tama" (タマ), and "Katabira River" (帷子川) are words that have been correctly speech-recognized, while "Ichiro" (イチロー) is a word that has been misrecognized.
[0030]
The input document of FIG. 2 is input from the input device 110, the word extraction unit 120 extracts the words "seal", "Tama", "Katabira River", and "Ichiro" from the input document, and the extracted words are stored in the word list storage device 130.
[0031]
In this example, since the input document is Japanese plain text, the word extraction unit 120 can be easily configured using a morphological analyzer. If the input document were instead a marked-up document from a speech recognition device, the word extraction unit 120 would only need to pick out the portions marked as words, and could be configured even more easily.
[0032]
The importance assigning unit 140 assigns an importance to each word stored in the word list storage device 130. Here, suppose that the tf-idf method yields:
Importance ("seal") = 1.0
Importance ("Tama") = 4.0
Importance ("Katabira River") = 3.0
Importance ("Ichiro") = 4.0
[0033]
For the calculation of tf-idf, see, for example, "Information Retrieval and Language Processing" (University of Tokyo Press).
[0034]
Next, the cohesion degree assigning unit 150 assigns a cohesion degree to each word stored in the word list storage device 130.
[0035]
FIG. 3 is a diagram illustrating an example of the co-occurrence frequency.
[0036]
In the above embodiment, the cohesion degree is calculated using mutual information, on the assumption that the co-occurrence frequencies shown in FIG. 3 have been obtained from another corpus. From the co-occurrence frequencies shown in FIG. 3, the mutual information is calculated as
I ("seal", "Tama") = 8.76
I ("seal", "Katabira River") = 8.97
I ("seal", "Ichiro") = 0
I ("Tama", "Katabira River") = 9.38
I ("Tama", "Ichiro") = 0
I ("Katabira River", "Ichiro") = 0
(here, the base of the logarithm is 2).
[0037]
Therefore, the cohesion degree f (w) of each word is
Cohesion degree ("seal") = 17.73
Cohesion degree ("Tama") = 18.14
Cohesion degree ("Katabira River") = 18.35
Cohesion degree ("Ichiro") = 0
Note that the importance assigning unit 140 and the cohesion degree assigning unit 150 may perform their processing in either order.
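The cohesion values above can be reproduced by summing, for each word, its mutual information with every other extracted word. The Python sketch below does this with the six pairwise values quoted in paragraph [0036]. The word 帷子川 is written as "Katabira River", and the function name `cohesion` and the `frozenset` keys are illustrative choices, not part of the original description.

```python
# Pairwise mutual information values quoted in paragraph [0036] (log base 2).
mutual_info = {
    frozenset(["seal", "Tama"]): 8.76,
    frozenset(["seal", "Katabira River"]): 8.97,
    frozenset(["seal", "Ichiro"]): 0.0,
    frozenset(["Tama", "Katabira River"]): 9.38,
    frozenset(["Tama", "Ichiro"]): 0.0,
    frozenset(["Katabira River", "Ichiro"]): 0.0,
}

words = ["seal", "Tama", "Katabira River", "Ichiro"]


def cohesion(word, all_words, pair_scores):
    """Sum of mutual information between `word` and every other word in the list."""
    return sum(
        pair_scores[frozenset([word, other])]
        for other in all_words
        if other != word
    )


# Rounded to two decimals to match the precision used in the text.
cohesion_by_word = {w: round(cohesion(w, words, mutual_info), 2) for w in words}
```

The misrecognized word "Ichiro" has no association with the other words in the separate corpus, so its cohesion is 0 even though its importance is high.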
[0038]
Next, the output unit 160 extracts the word list to which the importance and the cohesion degree are added from the storage device 130 and outputs the word list.
[0039]
FIG. 4 is a diagram illustrating an example in which the word list to which the importance and the cohesion degree are added in the embodiment is extracted.
[0040]
On the basis of this result, consider, for example, extracting the two most important words. In the prior art, the determination is made based only on the "importance" in FIG. 4, so the words extracted are "Tama" and "Ichiro". In the above embodiment, if, for example, the product of importance and cohesion degree is used as the measure, "Tama" and "Katabira River" are extracted. Using a linear combination instead of the product gives the same result.
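The comparison can be checked with a short Python sketch that uses the importance values from paragraph [0032] and the cohesion values from paragraph [0037]. The word 帷子川 is written as "Katabira River". The helper `top_n` and the equal weights in the linear combination are assumptions for illustration; the text leaves the weights open.

```python
# Importance ([0032]) and cohesion ([0037]) values from the worked example.
word_list = [
    {"word": "seal", "importance": 1.0, "cohesion": 17.73},
    {"word": "Tama", "importance": 4.0, "cohesion": 18.14},
    {"word": "Katabira River", "importance": 3.0, "cohesion": 18.35},
    {"word": "Ichiro", "importance": 4.0, "cohesion": 0.0},
]


def top_n(entries, score, n):
    """Return the n words with the highest score, in descending order of score."""
    ranked = sorted(entries, key=score, reverse=True)
    return [entry["word"] for entry in ranked[:n]]


# Prior art: rank by importance alone.
by_importance = top_n(word_list, lambda e: e["importance"], 2)

# Embodiment: rank by the product of importance and cohesion.
by_product = top_n(word_list, lambda e: e["importance"] * e["cohesion"], 2)

# Linear combination with equal weights (the weights are an assumption).
by_sum = top_n(word_list, lambda e: 1.0 * e["importance"] + 1.0 * e["cohesion"], 2)
```

Ranking by importance alone selects the misrecognized "Ichiro", whereas both combined measures replace it with "Katabira River".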
[0041]
That is, a word may be extracted from the word list output by the output unit 160 in accordance with the product of the degree of importance and the degree of cohesion or a linear combination.
[0042]
That is, according to the above-described embodiment, by assigning a scale indicating the degree of cohesion between the extracted word and another word in the document, it is possible to reduce the rate at which the erroneously recognized word mixes with the important word. As a result, it is possible to improve the accuracy of natural language processing such as summarization, document classification, and related document search using the calculated importance.
[0043]
Further, the above embodiment can be understood as a word extracting method.
[0044]
That is, the above-described embodiment is an example of a word extraction method for extracting a predetermined word from a document, comprising: a word extraction step of extracting words from a predetermined input document and storing the extracted words in a word list; an importance assigning step of assigning an importance to each word in the word list; a cohesion degree assigning step of assigning, to a predetermined word in the word list, a cohesion degree between that word and the other words; and an output step of outputting the word list to which the importance and the cohesion degree have been added.
[0045]
Further, the above embodiment can be understood as a program. That is, the above embodiment is an example of a program for extracting a predetermined word from a document, the program causing a computer to execute: a word extraction procedure in which a word extraction unit extracts words from a predetermined input document and stores the extracted words in a word list held in a word list storage device; an importance assigning procedure in which an importance assigning unit assigns an importance to each word in the word list; a cohesion degree assigning procedure in which a cohesion degree assigning unit assigns, to a predetermined word in the word list, a cohesion degree between that word and the other words; and a word list output procedure of causing an output unit to output the word list to which the importance and the cohesion degree have been added.
[0046]
Further, the above embodiment can be understood as follows. That is, the above embodiment is an example of a word extraction device for extracting a predetermined word from a document, comprising: a word extraction unit that extracts words from a predetermined input document and stores the extracted words in a word list; an importance assigning unit that assigns an importance to each word in the word list; a cohesion degree assigning unit that assigns, to a predetermined word in the word list, a cohesion degree between that word and the other words; and a word output unit that outputs, from among the words extracted by the word extraction unit, words selected according to the importance and the cohesion degree.
[0047]
[Effects of the invention]
According to the present invention, when words are extracted from a predetermined document, it is possible to reduce the rate at which misrecognized words are mixed into the extracted words.
[Brief description of the drawings]
FIG. 1 is a configuration diagram showing a word extraction device 100 according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a document in which Japanese speech has been automatically recognized.
FIG. 3 is a diagram illustrating an example of a co-occurrence frequency.
FIG. 4 is a diagram showing an example of extracting a word list to which importance and cohesion are added in the above embodiment.
[Explanation of symbols]
100 ... word extraction device,
110 ... input device,
120 ... word extraction unit,
130 ... word list storage device,
140 ... importance assigning unit,
150 ... cohesion degree assigning unit,
160 ... output unit,
w_i, w_j ... words,
I (w_i, w_j) ... mutual information between words w_i and w_j,
cohesion degree (w_i) ... cohesion degree of word w_i,
P (w) ... occurrence probability of a predetermined word in another corpus,
P (w_i, w_j) ... co-occurrence probability,
f (w) ... cohesion degree of a word.

Claims

A word extraction device for extracting a predetermined word from a document, comprising:
a word extraction unit that extracts words from a predetermined input document and stores the extracted words in a word list;
an importance assigning unit that assigns an importance to each word in the word list;
a cohesion degree assigning unit that assigns, to a predetermined word in the word list, a cohesion degree between that word and the other words; and
a word list output unit that outputs the word list to which the importance and the cohesion degree have been added.

A word extraction method for extracting a predetermined word from a document, comprising:
a word extraction step of extracting words from a predetermined input document and storing the extracted words in a word list;
an importance assigning step of assigning an importance to each word in the word list;
a cohesion degree assigning step of assigning, to a predetermined word in the word list, a cohesion degree between that word and the other words; and
a word list output step of outputting the word list to which the importance and the cohesion degree have been added.

A program for extracting a predetermined word from a document, the program causing a computer to execute:
a word extraction procedure in which a word extraction unit extracts words from a predetermined input document and stores the extracted words in a word list held in a word list storage device;
an importance assigning procedure in which an importance assigning unit assigns an importance to each word in the word list;
a cohesion degree assigning procedure in which a cohesion degree assigning unit assigns, to a predetermined word in the word list, a cohesion degree between that word and the other words; and
a word list output procedure of causing a word list output unit to output the word list to which the importance and the cohesion degree have been added.