
WO2018174816A1 - Method and apparatus for semantic coherence analysis of texts

Method and apparatus for semantic coherence analysis of texts

Info

Publication number
WO2018174816A1
Authority
WO
WIPO (PCT)
Prior art keywords
subject
subgraph
words
tuple
tuples
Prior art date
Application number
PCT/SG2017/050154
Other languages
English (en)
Inventor
Shangfeng Hu
Jung Jae Kim
Rajaraman Kanagasabai
Original Assignee
Agency For Science, Technology And Research
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency For Science, Technology And Research filed Critical Agency For Science, Technology And Research
Priority to PCT/SG2017/050154 priority Critical patent/WO2018174816A1/fr
Publication of WO2018174816A1 publication Critical patent/WO2018174816A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • Various aspects of this disclosure generally relate to machine learning, and more particularly, to the training and applications of machine learning based models for semantic coherence analysis of texts.
  • Machine learning is a subfield of computer science that explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms make data-driven predictions or decisions by building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible. Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction. These analytical models allow researchers, data scientists, engineers, and analysts to produce reliable, repeatable decisions and results and to uncover hidden insights by learning from historical relationships and trends in the data.
  • Semantic text analytics aims to analyse text by making sense of context, meaning, and domain knowledge.
  • Semantic text analytics uses natural language processing (NLP), ontologies, and/or machine learning (e.g., deep learning) approaches.
  • Various applications of semantic text analytics include information extraction, question answering (Q&A) systems, machine translation, and so on.
  • Many text analytics problems involve the process of determining whether two text structures are coherent to each other when connected via a common term (pivot). Such process may be referred to as semantic text coherence analysis. Semantic text coherence analysis may be used in areas such as text summarization, information extraction, Q&A systems, dialogue systems, machine translation, and so on.
  • Supervised learning is the machine learning task of inferring a function from labeled training data. It is costly and time-consuming to annotate training data sets manually.
  • Distant-supervised machine learning is the machine learning task of generating the labeled training data required for supervised learning by automatically or semi-automatically labeling "unlabeled" data with positive samples from, e.g., a relational database or a knowledge base.
  • For example, supervised learning for information extraction (e.g., event extraction) requires texts in which events (e.g., (acquiring_company, business_acquisition_keyword, company_acquired)) are labelled, while distant-supervised learning assumes an event database and an unlabelled text corpus, automatically or semi-automatically annotates the events of the database on the texts of the corpus, and uses the labelled data as the training data of supervised learning.
  • The cost of automatically or semi-automatically labelling unlabelled text with known positive samples is usually much lower than that of manually labelling texts. Therefore, it may be desirable to use distant-supervised learning to construct machine learning based models for semantic text coherence analysis.
  • It may be desirable to use semantic text coherence analysis to identify text structures that convey the same meaning (e.g., to resolve pronouns with precedent referents). It may also be desirable to use semantic text coherence analysis to populate a knowledge base with unknown knowledge (e.g., ontology learning).
  • A method, a computer-readable medium, and an apparatus for constructing machine learning based models for semantic text coherence analysis may automatically generate a plurality of tuples based on a text corpus.
  • Each tuple of the plurality of tuples may include a first subject, a second subject, a first subgraph of a first dependency graph corresponding to a first sentence, a second subgraph of a second dependency graph corresponding to a second sentence, and a relationship between the first subject and the second subject.
  • The first subgraph includes the first subject, and the second subgraph includes the second subject.
  • For each tuple of the plurality of tuples, the apparatus may normalize the first subgraph and the second subgraph of the tuple by replacing the first subject and the second subject with a first label and replacing other words within the first subgraph and the second subgraph with a second label. For each normalized tuple of the plurality of normalized tuples, the apparatus may then merge the two normalized subgraphs via the first label to obtain a joined pattern. The apparatus may classify the plurality of tuples into a plurality of groups based on the joined pattern of each normalized tuple. For each group of the plurality of groups, the apparatus may train a machine learning based model based on the tuples classified into the group.
  • A method, a computer-readable medium, and an apparatus for validating a syntactic structure may parse the syntactic structure into a dependency graph.
  • The apparatus may split the dependency graph into a first subgraph and a second subgraph based on a subject within the syntactic structure.
  • The apparatus may generate a tuple based on the subject, a subgraph of the first subgraph, a subgraph of the second subgraph, and the relationship between the subject and the rest of the syntactic structure.
  • The two subgraphs in a tuple include the subject.
  • The apparatus may identify a trained machine learning based model corresponding to the tuple.
  • The apparatus may estimate the validity of the syntactic structure using the trained machine learning based model.
  • FIG. 1 is a diagram illustrating an example of two dependency graphs.
  • FIG. 2 is a diagram illustrating an example of two respective subgraphs of the dependency graphs described above in FIG. 1.
  • FIG. 3 is a diagram illustrating an example of two respective subgraph patterns of the subgraphs described above in FIG. 2.
  • FIG. 4 is a diagram illustrating an example of a joined pattern based on the subgraph patterns described above in FIG. 3.
  • FIG. 5 is a flowchart of a method of constructing machine learning based models for semantic text coherence analysis.
  • FIG. 6 is a diagram illustrating an example of a dependency graph corresponding to a candidate sentence.
  • FIG. 7 is a diagram illustrating an example of the dependency graph described above in FIG. 6 being split into two subgraphs.
  • FIG. 8 is a diagram illustrating an example of two respective subgraphs of the subgraphs described above in FIG. 7.
  • FIG. 9 is a diagram illustrating an example of a joined pattern based on the subgraph patterns described above in FIG. 8.
  • FIG. 10 is a flowchart of a method of validating a syntactic structure.
  • FIG. 11 is a flowchart of a method of pronoun resolution.
  • FIG. 12 is a flowchart of a method of knowledge base completion.
  • FIG. 13 is a conceptual data flow diagram illustrating the data flow between different means/components in an exemplary apparatus.
  • FIG. 14 depicts a schematic drawing of an exemplary computer system.
  • Examples of processors include microprocessors, microcontrollers, graphics processing units (GPUs), central processing units (CPUs), application processors, digital signal processors (DSPs), reduced instruction set computing (RISC) processors, systems on a chip (SoC), baseband processors, field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described throughout this disclosure.
  • One or more processors in the processing system may execute software.
  • Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software components, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
  • The functions described may be implemented in hardware, software, or any combination thereof. If implemented in software, the functions may be stored on or encoded as one or more instructions or code on a computer-readable medium.
  • Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computer.
  • By way of example, such computer-readable media may include a random-access memory (RAM), a read-only memory (ROM), an electrically erasable programmable ROM (EEPROM), optical disk storage, magnetic disk storage, other magnetic storage devices, combinations of the aforementioned types of computer-readable media, or any other medium that can be used to store computer executable code in the form of instructions or data structures that can be accessed by a computer.
  • One aspect of this disclosure is a distant-supervised method of constructing machine learning based models over coherent syntactic structures of valid sentences.
  • The structural coherence of automatically generated sentences is estimated using the constructed machine learning based models in order to support applications that test the validity of candidate sentences and syntactic structures. Such applications may include pronoun resolution and knowledge base completion.
  • The method of one embodiment works in a distant-supervised manner, such that it requires only positive samples and unlabeled sentences, but neither negative samples nor labelled sentences. As a result, costly manual labelling of sentences and collection of negative samples may be avoided. Also, the collection of positive samples may be semi-automated, as further described below.
  • Pronoun resolution (or pronoun disambiguation) is the task of locating the referent (or antecedent) of a given pronoun in its earlier context.
  • If the syntactic structure around a candidate referent is coherent with the syntactic structure around the pronoun, the two may match and the candidate referent may be identified as a correct referent of the pronoun.
  • An existing knowledge base (KB) may consist of triples (e.g., Obama-is_a-US_president).
  • Knowledge base completion is the task of adding more relevant triples to the KB. Specifically, given a candidate triple, knowledge base completion determines whether the candidate is relevant to the KB or not.
  • Two assumptions may be made: 1) a candidate triple is generated by replacing one of the three elements in a known triple (i.e., a triple already registered to the KB) with an unknown element (e.g., Clinton is an unknown element to the triple Obama-is_a-US_president), where the name of the unknown element is designated as u; and 2) each triple is assumed to have a textual description in the form of a sentence (e.g., "Obama is a US president"), where the sentence is designated as s. Based on these two assumptions, the task of knowledge base completion may be redefined as determining the validity of the sentence s in which the unknown element name u replaces the corresponding element. In terms of structural coherence, if the word u and the rest of the sentence s together form a coherent syntactic structure, it may be concluded that the candidate triple is relevant to the KB.
  • Pronoun resolution is a fundamental task of natural language processing, which has many applications in discourse analysis including question answering and information extraction.
  • Knowledge bases have formalized answers to questions and are thus computationally efficient for question answering, while answer-bearing texts (e.g. Web pages) are unstructured, thus computationally inefficient.
  • The completion (or population) of knowledge bases can enhance efficient methods of KB-based question answering.
  • The positive samples of pronoun resolution required for constructing machine learning based models for structural coherence measurement may be constructed by mining unlabeled, unstructured texts using a nominal coreference resolution system that correlates nouns and noun phrases (e.g., the president - US President).
  • Nominal coreference resolution is usually easier, and thus more accurate, than pronoun resolution.
  • The positive samples of knowledge base completion may be constructed by manually giving a textual description not to each triple (e.g., Obama-is_a-US_president), but to each unique relationship type.
  • Machine learning based models may be constructed for structural coherence measurement.
  • The construction process may work in an offline manner (i.e., done before the applications of the models).
  • The construction process may take as inputs a text corpus and a nominal coreference resolution system and generate machine learning based models for structural coherence measurement.
  • FIGS. 1-4 describe an example of constructing machine learning based models for structural coherence measurement.
  • For each noun m in the text corpus, a list of coreferences of m may be located within a given local window of m (for example, three sentences before or after the noun m).
  • The list of coreferences of m may be designated as coreferences(m).
  • For example, the text corpus may include two sentences. The first sentence is "A man looks at a cat." The second sentence is "The cat is playing a ball." When m is 'cat' in the second sentence, coreferences(m) has one element, 'cat' in the first sentence.
  • The mention pair {m, m1} may be labeled as positive because m1 is a coreference of m.
  • The mention pair {m, m2} may be labeled as negative because m2 is not a coreference of m.
  • For example, the mention pair {'cat' in the second sentence, 'cat' in the first sentence} may be labeled as positive.
  • The mention pair {'cat' in the second sentence, 'man' in the first sentence} may be labeled as negative.
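  • By way of illustration only (not part of the original disclosure), the mention-pair labeling step may be sketched in Python as follows, where the coreferences mapping is a hypothetical stand-in for the output of an off-the-shelf nominal coreference resolution system:

    def label_mention_pairs(mention, candidates, coreferences):
        """Label each (mention, candidate) pair positive when the candidate
        is a known coreference of the mention, negative otherwise."""
        labeled = []
        for candidate in candidates:
            known = coreferences.get(mention, [])
            labeled.append((mention, candidate,
                            "positive" if candidate in known else "negative"))
        return labeled

    # The example from the text: m = 'cat' in the second sentence; the
    # candidate mentions are 'cat' and 'man' in the first sentence.
    pairs = label_mention_pairs(
        mention=("cat", 2),
        candidates=[("cat", 1), ("man", 1)],
        coreferences={("cat", 2): [("cat", 1)]},
    )
    # -> [(('cat', 2), ('cat', 1), 'positive'), (('cat', 2), ('man', 1), 'negative')]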
  • FIG. 1 is a diagram illustrating an example of two dependency graphs 100 and 120.
  • The first sentence is parsed into the dependency graph 100.
  • The second sentence is parsed into the dependency graph 120.
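  • As a minimal sketch (an illustration, not the parser prescribed by the disclosure), the two example sentences may be parsed into dependency graphs with spaCy, assuming the en_core_web_sm model is installed; any off-the-shelf dependency parser would serve equally:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    for sentence in ["A man looks at a cat.", "The cat is playing a ball."]:
        doc = nlp(sentence)
        # Each token's head and dependency label define one edge of the graph.
        for token in doc:
            print(f"{token.head.text} --{token.dep_}--> {token.text}")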
  • FIG. 2 is a diagram illustrating an example of two respective subgraphs 200 and 220 of the dependency graphs 100 and 120 described above in FIG. 1.
  • The subgraphs 200 and 220 are examples of gi and g'j, respectively.
  • An example tuple may be ('cat' in the second sentence, 'cat' in the first sentence, positive, subgraph 200, subgraph 220).
  • The two words m and m' in gi and g'j may be replaced with a first label (e.g., NodeRef), which indicates that m and m' coreference each other, and all the other words may be replaced with a second label (e.g., Node). Consequently, two new subgraphs called "subgraph patterns" may be generated.
  • The subgraph patterns may be designated as pattern(gi, m) and pattern(g'j, m').
  • FIG. 3 is a diagram illustrating an example of two respective subgraph patterns 300 and 320 of the subgraphs 200 and 220 described above in FIG. 2.
  • The subgraph pattern 300 may be designated as pattern(gi, cat).
  • The subgraph pattern 320 may be designated as pattern(g'j, cat).
  • The subgraph patterns 300 and 320 are examples of subgraph patterns of gi and g'j, respectively.
  • The two subgraph patterns 300 and 320 may be combined to form a joined pattern.
  • The two NodeRef nodes of the subgraph patterns 300 and 320 may be merged into a single node.
  • The graph that combines pattern(gi, m) and pattern(g'j, m') may be referred to as pattern_join(gi, m, g'j, m'). All such pattern_join graphs may be collected.
  • The collection of pattern_join graphs may be referred to as {p1, p2, ..., pK}, and the subset of tuples that corresponds to pk may be designated as Tk.
  • The whole tuple collection T is thereby subcategorized into {T1, T2, ..., TK}.
  • FIG. 4 is a diagram illustrating an example of a joined pattern 400 based on the subgraph patterns 300 and 320 described above in FIG. 3. As illustrated, the joined pattern 400 may be designated as pattern_join(gi, m, g'j, m').
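  • A sketch, under simplifying assumptions, of the normalization and merge steps is shown below: the coreferring words become NodeRef, every other word becomes an anonymous Node, and the two NodeRef nodes are merged into one. Graphs are assumed to be networkx objects whose nodes are (word, position) pairs; the helper names are illustrative, not taken from the original disclosure:

    import networkx as nx

    def subgraph_pattern(g, subject, tag):
        """Replace the subject with 'NodeRef' and every other word with an
        anonymous Node; `tag` keeps Node ids distinct across the two graphs."""
        rename = {}
        for i, node in enumerate(g.nodes):
            rename[node] = "NodeRef" if node == subject else ("Node", tag, i)
        return nx.relabel_nodes(g, rename)

    def pattern_join(g_i, m, g_j, m_prime):
        """Merge the two subgraph patterns on their shared NodeRef node."""
        p1 = subgraph_pattern(g_i, m, tag=1)
        p2 = subgraph_pattern(g_j, m_prime, tag=2)
        return nx.compose(p1, p2)  # identical 'NodeRef' nodes are merged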
  • The words in the subgraph gi except the word m may be collected into a word sequence following the original sequence of the words in the sentence, designated as (wi1, ..., win).
  • The words in the subgraph g'j except the word m' may likewise be collected into a word sequence, designated as (wj1, ..., wjn').
  • The two sequences may then be joined, with m placed at the beginning of the joined sequence, designated as qij = (m, wi1, ..., win, wj1, ..., wjn').
  • The word sequence may be converted into a joint vector vij by replacing each word with its word vector, which may be generated by a word embedding method (e.g., word2vec), and by joining all the word vectors into a joint vector. For example, if there are 5 words in the sequence and word vectors have 200 dimensions, the joint vector of the word sequence will have 1000 dimensions.
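  • For illustration, converting a joined word sequence into a joint vector may be sketched as follows, assuming an embeddings lookup table that maps each word to a fixed-size numpy vector (e.g., trained with word2vec); the zero-vector fallback for out-of-vocabulary words is an assumption, not specified in the text:

    import numpy as np

    def joint_vector(word_sequence, embeddings, dim=200):
        # Concatenate one word vector per word; e.g., five 200-dimensional
        # vectors yield one 1000-dimensional joint vector.
        vectors = [embeddings.get(word, np.zeros(dim)) for word in word_sequence]
        return np.concatenate(vectors)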
  • Each tuple (m, m', l, gi, g'j) in the tuple subset may be replaced with a pair that includes the joint vector vij of qij and the label l.
  • The tuple subset Tk may thus have pairs of (vij, l).
  • The word sequences qij of all the tuples in each tuple subset Tk have the same length, i.e., the same number of words, designated as length(Tk).
  • A machine learning based model may be trained for each tuple subset Tk.
  • The associated pairs of (vij, l) may be provided as the training data set.
  • The corresponding label l may be provided as the expected output of the model.
  • In this way, machine learning based models X = {X1, ..., XK} may be constructed.
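  • The per-subset training step may be sketched as follows. The disclosure does not fix a model family, so scikit-learn's logistic regression is used here purely as a placeholder:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_models(tuple_subsets):
        """tuple_subsets maps each joined pattern p_k to a list of
        (joint_vector, label) pairs; one model X_k is trained per pattern."""
        models = {}
        for pattern, pairs in tuple_subsets.items():
            X = np.stack([vector for vector, _ in pairs])
            y = np.array([1 if label == "positive" else 0 for _, label in pairs])
            models[pattern] = LogisticRegression(max_iter=1000).fit(X, y)
        return models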
  • FIG. 5 is a flowchart 500 of a method of constructing machine learning based models for semantic text coherence analysis.
  • The method may perform the operations described above with reference to FIGS. 1-4.
  • The method may be performed by a computing device.
  • The device may automatically generate a plurality of tuples based on a text corpus.
  • The plurality of tuples may be the dataset T described above, and each tuple may be the tuple (m, m', l, gi, g'j) described above.
  • Each tuple of the plurality of tuples may include a first subject (e.g., 'man'), a second subject (e.g., 'cat'), a relationship (e.g., l) between the first subject and the second subject, a first subgraph (e.g., gi) of a first dependency graph (e.g., g) corresponding to a first sentence that includes the first subject, and a second subgraph (e.g., g'j) of a second dependency graph (e.g., g') corresponding to a second sentence that includes the second subject.
  • The first sentence and the second sentence may be the same sentence or two different sentences.
  • In one embodiment, the second subject may be a coreference of the first subject within a local window of the first subject in the text corpus, and the relationship between the first subject and the second subject is positive. In another embodiment, the second subject may not be a coreference of the first subject in the text corpus, and the relationship between the first subject and the second subject is negative.
  • The device may further parse the first sentence into the first dependency graph, and parse the second sentence into the second dependency graph.
  • The first subgraph may include the first subject and the second subgraph may include the second subject.
  • The device may normalize the tuple by replacing the first subject and the second subject with a first label (e.g., NodeRef) and replacing other words within the first subgraph and the second subgraph with a second label (e.g., Node).
  • The device may merge the first subgraph and the second subgraph via the first label to obtain a joined pattern (e.g., pattern_join(gi, m, g'j, m')).
  • The device may classify the plurality of tuples into a plurality of groups (e.g., tuple subsets {T1, T2, ..., TK}) based on the joined pattern associated with each tuple.
  • The device may generate a joined word sequence (e.g., qij) that includes the first subject, a first set of words in the first subgraph except the first subject, and a second set of words in the second subgraph except the second subject. The first subject may be placed at the beginning of the joined word sequence, followed by the first set of words and the second set of words.
  • The first set of words may follow the original sequence of words in the first sentence and the second set of words may follow the original sequence of words in the second sentence.
  • The device may further convert the joined word sequence into a joint word vector (e.g., vij).
  • For each group of the plurality of groups, the device may train a machine learning based model (e.g., Xk) based on the tuples classified into the group.
  • The device may train the machine learning based model with the joint word vector as input and the relationship between the first subject and the second subject as the expected output.
  • The validity of a given sentence may be checked by using the machine learning based models (e.g., X) described above with reference to FIGS. 1-5.
  • FIGS. 6-9 describe an example of applying the trained machine learning based models to candidate sentences or syntactic structures.
  • In this example, the candidate sentence s is "A plaster cat is playing a ball" and the noun n is 'cat'.
  • The sentence s may be parsed into a dependency graph g by using an off-the-shelf parser.
  • FIG. 6 is a diagram illustrating an example of a dependency graph 600 corresponding to the candidate sentence s. As illustrated, the dependency graph 600 may be designated as g.
  • The dependency graph g may be split into two subgraphs g' and g'' at the noun n.
  • The node of the noun n may be copied as n' included in g' and n'' included in g''.
  • FIG. 7 is a diagram illustrating an example of the dependency graph 600 described above in FIG. 6 being split into two subgraphs 700 and 720.
  • The subgraphs 700 and 720 may be designated as g' and g'', respectively.
  • The node n' in subgraph 700 and the node n'' in subgraph 720 both correspond to the word 'cat'.
  • Two sets of subgraphs G' and G'' may be collected.
  • The set of subgraphs G' includes g'1, g'2, ..., g'm, which are subgraphs of g'.
  • The set of subgraphs G'' includes g''1, g''2, ..., g''n, which are subgraphs of g''.
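  • A sketch of the splitting step is shown below, under the assumption that removing the noun n separates the dependency graph into exactly two connected components, into each of which n is then copied back:

    import networkx as nx

    def split_at_noun(g, n):
        """Split graph g at node n; returns the halves [g', g''] when
        removing n leaves exactly two components."""
        rest = g.copy()
        rest.remove_node(n)
        halves = []
        for component in nx.connected_components(rest.to_undirected()):
            # The induced subgraph over the component plus n copies n into the half.
            halves.append(g.subgraph(set(component) | {n}).copy())
        return halves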
  • FIG. 8 is a diagram illustrating an example of two respective subgraphs 800 and 820 of the subgraphs 700 and 720 described above in FIG. 7.
  • The subgraphs 800 and 820 are examples of g'i and g''j, respectively.
  • A test tuple (n', n'', ?, g'i, g''j) may be generated.
  • For instance, an example tuple may be ('cat' in the subgraph 800, 'cat' in the subgraph 820, ?, the subgraph 800, the subgraph 820).
  • The two words n' and n'' in g'i and g''j may be replaced with a first label (e.g., NodeRef), and all the other words may be replaced with a second label (e.g., Node). Consequently, two new subgraphs called "subgraph patterns" may be generated.
  • The subgraph patterns may be designated as pattern(g'i, n') and pattern(g''j, n'').
  • The two subgraph patterns pattern(g'i, n') and pattern(g''j, n'') may be combined to form a joined pattern.
  • The two NodeRef nodes of the subgraph patterns pattern(g'i, n') and pattern(g''j, n'') may be merged into a single node.
  • The graph that combines pattern(g'i, n') and pattern(g''j, n'') may be referred to as pattern_join(g'i, n', g''j, n'').
  • FIG. 9 is a diagram illustrating an example of a joined pattern 900 based on the subgraph patterns 800 and 820 described above in FIG. 8.
  • The joined pattern 900 may be designated as pattern_join(g'i, n', g''j, n'').
  • The pattern_join(g'i, n', g''j, n'') is generated in the same way as described above with reference to FIGS. 3 and 4.
  • The words in the subgraph g'i except the word n' may be collected into a word sequence following the original sequence of the words in the sentence.
  • The words in the subgraph g''j except the word n'' may likewise be collected into a word sequence following the original sequence of the words in the sentence.
  • The two sequences may then be joined, with n' placed at the beginning of the joined sequence.
  • The joined sequence may be designated as qij.
  • The word sequence may be converted into a joint vector vij by replacing each word with its word vector, which may be generated by a word embedding method (e.g., word2vec), and by joining all the word vectors into a joint vector.
  • The joint vector vij is generated in the same way as described above for constructing the machine learning based models.
  • For each test tuple, the corresponding model Xk is used to generate its regression score rk.
  • The correctness score r of the sentence s is assigned as the weighted average of the scores rk, i.e., the sum of rk × wk divided by the sum of wk.
  • The whole validation process may be formalized as a function validation-sentence(s, n) ∈ (0..1).
  • The sub-process of validating a test tuple may be formalized as a sub-function validation-tuple(n', n'', g'i, g''j) ∈ (0..1).
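  • A sketch of the sentence-level score: each test tuple is scored by its matching model, and the scores are combined as a weighted average. The per-model weights wk are assumed to be given (e.g., per-model confidences); the function name mirrors the formalization above:

    def validation_sentence(tuple_scores):
        """tuple_scores: list of (r_k, w_k) pairs, one per test tuple.
        Returns the correctness score r of the sentence, in (0..1)."""
        total_weight = sum(w for _, w in tuple_scores)
        return sum(r * w for r, w in tuple_scores) / total_weight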
  • FIG. 10 is a flowchart 1000 of a method of validating a syntactic structure.
  • The syntactic structure may be a sentence.
  • The method may perform the operations described above with reference to FIGS. 6-9.
  • The method may be performed by a computing device.
  • The device may parse the syntactic structure into a dependency graph (e.g., the dependency graph 600).
  • The device may split the dependency graph into a first subgraph (e.g., the subgraph 700) and a second subgraph (e.g., the subgraph 720) based on a subject (e.g., 'cat') within the syntactic structure.
  • The subject may be included in both the first subgraph and the second subgraph.
  • The device may generate a plurality of tuples (e.g., the tuple (n', n'', ?, g'i, g''j)) of the syntactic structure.
  • The device may identify a trained machine learning based model (e.g., Xk) corresponding to each tuple.
  • The device may normalize the tuple by replacing the subject with a first label and replacing other words within the first subgraph and the second subgraph with a second label.
  • The device may further merge the first subgraph and the second subgraph via the first label to obtain a joined pattern.
  • The device may then identify the trained machine learning based model based on the joined pattern.
  • The device may estimate the validity of the syntactic structure using the trained machine learning based models of the plurality of tuples.
  • The device may generate a word vector for the tuple, and feed the word vector as input to the trained machine learning based model.
  • The device may generate a joined word sequence that includes the subject, a first set of words in the first subgraph except the subject, and a second set of words in the second subgraph except the subject. The subject may be placed at the beginning of the joined word sequence, followed by the first set of words and the second set of words. The first set of words and the second set of words may follow the original sequence of words in the sentence.
  • The device may further convert the joined word sequence into the word vector.
  • The device may determine the validity of the syntactic structure based on the outputs of the trained machine learning based models in response to the word vectors of the plurality of tuples of the syntactic structure. In one embodiment, the device may estimate the validity of the syntactic structure as the weighted sum of the validity scores of the plurality of tuples.
  • The machine learning based models (e.g., X) described above with reference to FIGS. 1-5 may be applied to pronoun resolution.
  • Pronoun resolution may be treated as text coherence analysis. Thus, if the two syntactic structures of two sentences that contain a pronoun and its referent are coherent to each other, it is likely that the pronoun refers to the referent.
  • For pronoun resolution, it is assumed that the input has one or two sentences, where the second sentence (or the first, if there is only one) has a pronoun p, and that the task is to identify the referent (or antecedent) of the pronoun in the first sentence.
  • All nouns and noun phrases in the first sentence may be located.
  • The validation process described above with reference to FIGS. 6-10 may be used to estimate the match between the two subgraphs by calculating validation-tuple(ni, p, g'j, gk).
  • The noun or noun phrase ni with the highest validation score for the given pronoun p may be considered the correct referent of the pronoun p.
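  • The referent selection may be sketched as follows, where validation_tuple stands in for the tuple-validation function described above and tuples_for enumerates the subgraph pairs of a candidate; taking the maximum score per candidate is one possible aggregation, assumed here for illustration:

    def resolve_pronoun(candidates, pronoun, tuples_for, validation_tuple):
        """Return the noun or noun phrase n_i with the highest validation
        score for the given pronoun p."""
        def score(n_i):
            return max(validation_tuple(n_i, pronoun, g_j, g_k)
                       for g_j, g_k in tuples_for(n_i, pronoun))
        return max(candidates, key=score)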
  • FIG. 11 is a flowchart 1100 of a method of pronoun resolution.
  • The method may be performed by a computing device.
  • The device may identify a set of subjects (e.g., N) in a first sentence and a pronoun (e.g., p) in a second sentence.
  • The set of subjects may include nouns and noun phrases.
  • The first sentence and the second sentence may be the same sentence or two different sentences.
  • The device may generate a set of tuples based on the first sentence and the second sentence.
  • Each tuple of the set of tuples may include a subject (e.g., ni) of the set of subjects, the pronoun, a first subgraph (e.g., g'j) of a first dependency graph corresponding to the first sentence, and a second subgraph (e.g., gk) of a second dependency graph corresponding to the second sentence.
  • For each tuple of the set of tuples, the device may identify a trained machine learning based model corresponding to the tuple.
  • The device may normalize the tuple by replacing the subject and the pronoun with a first label and replacing other words within the first subgraph and the second subgraph with a second label.
  • The device may further merge the first subgraph and the second subgraph via the first label to obtain a joined pattern.
  • The device may then identify the trained machine learning based model based on the joined pattern.
  • The device may determine a validation score of the tuple using the trained machine learning based model corresponding to the tuple.
  • The device may generate a word vector for the tuple, and feed the word vector as input to the trained machine learning based model.
  • The device may generate a joined word sequence that includes the subject, a first set of words in the first subgraph except the subject, and a second set of words in the second subgraph except the pronoun.
  • The subject may be placed at the beginning of the joined word sequence, followed by the first set of words and the second set of words.
  • The first set of words may follow the original sequence of words in the first sentence and the second set of words may follow the original sequence of words in the second sentence.
  • The device may further convert the joined word sequence into the word vector.
  • The device may determine a referent of the pronoun based on the validation scores.
  • A noun or noun phrase contained in a tuple that has the highest validation score is determined to be the referent of the pronoun.
  • The device may determine the validation score of each of the nouns and noun phrases based on the outputs of the trained machine learning based models in response to the word vectors of tuples of the noun or noun phrase. The noun or noun phrase whose validation score is the highest may be determined to be the referent of the pronoun.
  • The machine learning based models (e.g., X) described above with reference to FIGS. 1-5 may also be applied to knowledge base completion.
  • KB completion may be treated as text coherence analysis.
  • A knowledge base may consist of triples, designated as (head, relation, tail) or in short as (h, r, t). It is assumed that there are textual descriptions for each relation type r, designated as D(r), where such a description has two empty slots for the head and the tail.
  • The set of all heads and tails from the existing triples of the KB, designated as S, may be collected.
  • The set of all relation types of the KB, designated as R, may also be collected. If a head or tail is not a noun but a noun phrase, the headword of the noun phrase may be identified and added to S; the noun phrase itself is not added to S. If a head or tail is neither a noun nor a noun phrase (e.g., a verb phrase), the first noun in the head or tail may be identified and added to S. If a head or tail has no noun, it may be discarded.
  • A set of candidate triples (h, r, t) may be generated, where h ∈ S, r ∈ R, t ∈ S, and (h, r, t) does not exist in the KB.
  • For each candidate triple (h, r, t), a textual description may be randomly selected from D(r) and its two empty slots replaced with h and t; the resulting sentence is designated as d.
  • The set of tuples (h, r, t, d) may be referred to as C.
  • The correctness of each candidate triple may be decided by using the validation process described above with reference to FIGS. 6-10. That is, the correctness of the triple may be determined by calculating validation-sentence(d, h).
  • If the validation score satisfies a threshold, the tuple may be considered positive/valid.
  • The candidate tuples (h, r, t, d) that are validated in the previous step may be collected, and the corresponding triples (h, r, t) may be added to the KB.
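  • The knowledge base completion loop may be sketched as follows; validation_sentence and the {head}/{tail} description templates are assumed helpers, and the 0.5 threshold is only an example (a threshold is discussed further below):

    import itertools
    import random

    def complete_kb(kb_triples, subjects, relations, descriptions,
                    validation_sentence, threshold=0.5):
        """Generate candidate triples absent from the KB, render each into a
        sentence d via a randomly chosen description of r, and keep the
        triples whose sentences validate above the threshold."""
        added = []
        for h, r, t in itertools.product(subjects, relations, subjects):
            if (h, r, t) in kb_triples:
                continue
            template = random.choice(descriptions[r])
            d = template.format(head=h, tail=t)  # fill the two empty slots
            if validation_sentence(d, h) > threshold:
                added.append((h, r, t))
        return added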
  • FIG. 12 is a flowchart 1200 of a method of knowledge base completion.
  • The method may be performed by a computing device.
  • The device may identify a plurality of heads and tails (e.g., S), and a plurality of relations (e.g., R), in a knowledge base including a plurality of triples, each triple including a head, a tail, and a relation between the head and the tail.
  • The device may generate a candidate triple (e.g., (h, r, t)) including a candidate head, a candidate tail, and a candidate relation.
  • The candidate head and the candidate tail may be selected from the plurality of heads and tails.
  • The candidate relation may be selected from the plurality of relations.
  • The candidate triple may be outside of the plurality of triples. That is, the candidate triple is not part of the knowledge base.
  • The device may estimate the validity of a sentence formed based on the candidate triple.
  • The device may parse the sentence into a dependency graph.
  • The device may further split the dependency graph into a first subgraph and a second subgraph at the candidate head.
  • The candidate head of a tuple may be included in both the first subgraph and the second subgraph of the tuple.
  • The device may further generate a set of tuples for the dependency graph. Each tuple may be based on the candidate head, a subgraph of the first subgraph, and a subgraph of the second subgraph.
  • The device may further identify a trained machine learning based model corresponding to each tuple.
  • The device may then estimate the validity of the sentence using the trained machine learning based models of the set of tuples for the dependency graph.
  • The device may normalize the tuple by replacing the candidate head with a first label and replacing other words within the first subgraph and the second subgraph with a second label.
  • The device may further merge the first subgraph and the second subgraph via the first label to obtain a joined pattern.
  • The device may then identify the trained machine learning based model based on the joined pattern.
  • The device may generate a word vector for the tuple, and feed the word vector as input to the trained machine learning based model.
  • The device may generate a joined word sequence that includes the candidate head, a first set of words in the first subgraph except the candidate head, and a second set of words in the second subgraph except the candidate head.
  • The candidate head may be placed at the beginning of the joined word sequence, followed by the first set of words and the second set of words.
  • The first set of words and the second set of words may follow the original sequence of words in the sentence.
  • The device may further convert the joined word sequence into the word vector.
  • The device may determine the validity of the sentence based on the outputs of the trained machine learning based models in response to the word vectors of the set of tuples.
  • The device may add the candidate triple into the knowledge base when the sentence is estimated to be valid. For example, the device may add the candidate triple into the knowledge base when the output of the trained machine learning based model satisfies a threshold (e.g., above 0.5).
  • FIG. 13 is a conceptual data flow diagram 1300 illustrating the data flow between different means/components in an exemplary apparatus 1302.
  • The apparatus 1302 may be a computing device.
  • The apparatus 1302 may include a training component 1304 and/or an application component 1306.
  • The training component 1304 may receive a text corpus and construct machine learning based models in a distant-supervised manner based on the text corpus. In one configuration, the training component 1304 may perform the operations described above with reference to FIG. 5.
  • The application component 1306 may perform semantic coherence analysis of texts based on the trained machine learning based models, which may be constructed by the apparatus 1302 or one or more different devices. In one configuration, the application component 1306 may perform the operations described above with reference to FIG. 10, 11, or 12.
  • The apparatus may include additional components that perform each of the blocks of the algorithm in the aforementioned flowcharts of FIGS. 5, 10, 11, and 12. As such, each block in the aforementioned flowcharts of FIGS. 5, 10, 11, and 12 may be performed by a component, and the apparatus may include one or more of those components.
  • The components may be one or more hardware components specifically configured to carry out the stated processes/algorithm, implemented by a processor configured to perform the stated processes/algorithm, stored within a computer-readable medium for implementation by a processor, or some combination thereof.
  • The methods or functional modules of the various example embodiments as described hereinbefore may be implemented on a computer system, such as a computer system 1400 as schematically shown in FIG. 14 as an example only.
  • The method or functional module may be implemented as software, such as a computer program being executed within the computer system 1400, and instructing the computer system 1400 to conduct the method of various example embodiments.
  • The computer system 1400 may include a computer module 1402, input modules such as a keyboard 1404 and a mouse 1406, and a plurality of output devices such as a display 1408 and a printer 1410.
  • The computer module 1402 may be connected to a computer network 1412 via a suitable transceiver device 1414, to enable access to, e.g., the Internet or other network systems.
  • The computer module 1402 in the example may include a processor 1418 for executing various instructions, a Random Access Memory (RAM) 1420, and a Read Only Memory (ROM) 1422.
  • The computer module 1402 may also include a number of Input/Output (I/O) interfaces, for example an I/O interface 1424 to the display 1408, and an I/O interface 1426 to the keyboard 1404.
  • The components of the computer module 1402 typically communicate via an interconnected bus 1428 and in a manner known to the person skilled in the relevant art.
  • Combinations such as "at least one of A, B, or C," "one or more of A, B, or C," "at least one of A, B, and C," "one or more of A, B, and C," and "A, B, C, or any combination thereof" include any combination of A, B, and/or C, and may include multiples of A, multiples of B, or multiples of C.
  • Specifically, combinations such as "at least one of A, B, or C," "one or more of A, B, or C," "at least one of A, B, and C," "one or more of A, B, and C," and "A, B, C, or any combination thereof" may be A only, B only, C only, A and B, A and C, B and C, or A and B and C, where any such combination may contain one or more member or members of A, B, or C.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method, a computer-readable medium, and an apparatus for constructing machine learning based models for semantic text coherence analysis are provided. The apparatus may automatically generate a plurality of tuples based on a text corpus. Each tuple of the plurality of tuples may include a first subject, a second subject, a first subgraph, a second subgraph, and a relationship. For each tuple of the plurality of tuples, the apparatus may normalize the tuple. For each normalized tuple, the apparatus may merge the first normalized subgraph and the second normalized subgraph to obtain a joined pattern. The apparatus may classify the plurality of tuples into a plurality of groups based on the joined pattern of each normalized tuple. For each group of the plurality of groups, the apparatus may train a machine learning based model based on the tuples classified into the group.
PCT/SG2017/050154 2017-03-24 2017-03-24 Method and apparatus for semantic coherence analysis of texts WO2018174816A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SG2017/050154 WO2018174816A1 (fr) 2017-03-24 2017-03-24 Method and apparatus for semantic coherence analysis of texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2017/050154 WO2018174816A1 (fr) 2017-03-24 2017-03-24 Method and apparatus for semantic coherence analysis of texts

Publications (1)

Publication Number Publication Date
WO2018174816A1 true WO2018174816A1 (fr) 2018-09-27

Family

ID=63584669

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2017/050154 WO2018174816A1 (fr) 2017-03-24 2017-03-24 Method and apparatus for semantic coherence analysis of texts

Country Status (1)

Country Link
WO (1) WO2018174816A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050220351A1 (en) * 2004-03-02 2005-10-06 Microsoft Corporation Method and system for ranking words and concepts in a text using graph-based ranking
US20110270604A1 (en) * 2010-04-28 2011-11-03 Nec Laboratories America, Inc. Systems and methods for semi-supervised relationship extraction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CAI J.: "Coreference Resolution via Hypergraph Partitioning", DISSERTATION, 4 April 2013 (2013-04-04), Ruprecht-Karls-Universität Heidelberg, XP055547713 [retrieved on 20170505] *
ERIC N. ET AL.: "A Semi-Supervised Information Extraction Framework for Large Redundant Corpora", THESIS, 19 December 2008 (2008-12-19), University of New Orleans, XP055547711, Retrieved from the Internet <URL:https://scholarworks.uno.edu/cgi/viewcontent.cgi?article=1857&context=td> [retrieved on 20170505] *
RALPH D.: "Multiword expressions as dependency subgraphs", PROCEEDINGS OF THE WORKSHOP ON MULTIWORD EXPRESSIONS INTEGRATING PROCESSING, MWE '04, 2004, pages 56 - 63, XP055547720, [retrieved on 20170505] *
RAZVAN C. B. ET AL.: "A Shortest Path Dependency Kernel for Relation Extraction", PROCEEDINGS OF THE CONFERENCE ON HUMAN LANGUAGE TECHNOLOGY AND EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING , HLT '05, 1 October 2005 (2005-10-01), pages 724 - 731, XP055547726, [retrieved on 20170505] *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11748571B1 (en) * 2019-05-21 2023-09-05 Educational Testing Service Text segmentation with two-level transformer and auxiliary coherence modeling
CN110457713A (zh) * 2019-06-19 2019-11-15 Tencent Technology (Shenzhen) Co., Ltd. Machine translation model-based translation method, apparatus, device, and storage medium
CN110457713B (zh) * 2019-06-19 2023-07-28 Tencent Technology (Shenzhen) Co., Ltd. Machine translation model-based translation method, apparatus, device, and storage medium
CN110688461A (zh) * 2019-09-30 2020-01-14 National University of Defense Technology Online text educational resource label generation method integrating multi-source knowledge
CN111428470B (zh) * 2020-03-23 2022-04-22 Beijing Century TAL Education Technology Co., Ltd. Text coherence determination and model training method therefor, electronic device, and readable medium
CN111428470A (zh) * 2020-03-23 2020-07-17 Beijing Century TAL Education Technology Co., Ltd. Text coherence determination and model training method therefor, electronic device, and readable medium
CN111931506A (zh) * 2020-05-22 2020-11-13 Beijing Institute of Technology Entity relation extraction method based on graph information enhancement
CN111931506B (zh) * 2020-05-22 2023-01-10 Beijing Institute of Technology Entity relation extraction method based on graph information enhancement
JP7293543B2 (ja) 2020-07-20 2023-06-20 Beijing Baidu Netcom Science Technology Co., Ltd. Training method and apparatus for a natural language processing model, electronic device, computer-readable storage medium, and program
JP2022020582A (ja) 2020-07-20 2022-02-01 Beijing Baidu Netcom Science Technology Co., Ltd. Training method, apparatus, device, and storage medium for a natural language processing model
EP3944128A1 (fr) * 2020-07-20 2022-01-26 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for training a natural language processing model, device, and storage medium
CN114722802A (zh) * 2022-04-07 2022-07-08 Ping An Technology (Shenzhen) Co., Ltd. Word vector generation method and apparatus, computer device, and storage medium
CN114722802B (zh) * 2022-04-07 2024-01-30 Ping An Technology (Shenzhen) Co., Ltd. Word vector generation method and apparatus, computer device, and storage medium

Similar Documents

Publication Publication Date Title
CN113887215B (zh) Text similarity calculation method and apparatus, electronic device, and storage medium
KR102801013B1 (ko) Dialogue system answering method based on sentence paraphrase recognition
US11210468B2 (en) System and method for comparing plurality of documents
CN109726274B (zh) Question generation method, apparatus, and storage medium
WO2018174816A1 (fr) Method and apparatus for semantic coherence analysis of texts
US10055402B2 (en) Generating a semantic network based on semantic connections between subject-verb-object units
CN112069295B (zh) Similar question recommendation method and apparatus, electronic device, and storage medium
US11170169B2 (en) System and method for language-independent contextual embedding
Gokul et al. Sentence similarity detection in Malayalam language using cosine similarity
WO2018174815A1 (fr) Method and apparatus for semantic coherence analysis of texts
Singh et al. A decision tree based word sense disambiguation system in Manipuri language
Kliegr et al. LHD 2.0: A text mining approach to typing entities in knowledge graphs
Wang et al. Structural block driven enhanced convolutional neural representation for relation extraction
CN111859858A (zh) Method and apparatus for extracting relations from text
CN114239828A (zh) Causality-based supply chain event graph construction method
CN118378631B (zh) Text review method, apparatus, device, and storage medium
US20230111052A1 (en) Self-learning annotations to generate rules to be utilized by rule-based system
SABRIYE et al. AN APPROACH FOR DETECTING SYNTAX AND SYNTACTIC AMBIGUITY IN SOFTWARE REQUIREMENT SPECIFICATION.
WO2023088278A1 (fr) Method and apparatus for verifying authenticity of an expression, device, and medium
CN113705207A (zh) Grammatical error recognition method and apparatus
Oo et al. An analysis of ambiguity detection techniques for software requirements specification (SRS)
CN117313695A (zh) Text sensitivity detection method and apparatus, electronic device, and readable storage medium
Malik et al. Named Entity Recognition on Software Requirements Specification Documents.
Arbaaeen et al. Natural language processing based question answering techniques: A survey
CN117251567A (zh) Multi-domain knowledge extraction method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17901474

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17901474

Country of ref document: EP

Kind code of ref document: A1