CN111061851B - Question generation method and system based on given facts - Google Patents
- Publication number
- CN111061851B CN111061851B CN201911276552.4A CN201911276552A CN111061851B CN 111061851 B CN111061851 B CN 111061851B CN 201911276552 A CN201911276552 A CN 201911276552A CN 111061851 B CN111061851 B CN 111061851B
- Authority
- CN
- China
- Prior art keywords
- question
- historical
- input information
- generation model
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Description
Technical Field
The present invention relates to the technical field of natural language processing, and in particular to a method and system for generating questions from given facts.
Background Art
With the rapid development of the Internet and the growing popularity of networked communication terminals, people are exposed every day to massive amounts of information spanning many fields. Knowledge base question answering helps people quickly acquire knowledge from this information, reducing the cost of human learning. However, knowledge base question answering relies heavily on manually annotated data, and annotated question-answer pairs have become a bottleneck resource constraining the development of question answering technology and systems. Question generation can effectively alleviate this problem.
The task of question generation is to automatically generate a question from a given answer and its auxiliary information, which may take the form of plain text or a structured knowledge base. Question generation serves several purposes: 1. automatically constructing question answering data resources, or reducing the workload of manually annotating question-answer pairs; 2. providing data augmentation to improve the performance of question answering systems; 3. as a typical text generation task, advancing the development of text generation technology.
However, traditional question generation methods are prone to producing questions whose predicate does not match the input. For the given input <Statue of Liberty, location, New York City> in Table 1, they may generate Q1 ("Who created the Statue of Liberty?"), a question that fails to express the given predicate. In addition, questions generated by traditional methods often admit multiple ambiguous answers; for example, Q2 ("Where is the Statue of Liberty?") has several correct answers (the United States, New York State, New York City, and so on). This makes the questions generated by traditional methods difficult to use in practice.
Table 1
Summary of the Invention
In order to solve the above problems in the prior art, namely to accurately determine a question based on a small number of given facts, the present invention provides a method and system for generating questions from given facts.
To solve the above technical problems, the present invention provides the following solutions:
A method for generating questions from given facts, the method comprising:
acquiring historical reference data, the historical reference data comprising multiple pieces of historical input information from different users;
expanding each piece of historical input information to obtain a corresponding context representation;
establishing a question generation model according to each piece of input information and its corresponding context representation; and
determining, based on the question generation model and the current input information of the current user, a question sequence corresponding to the current input information.
Optionally, the historical reference data further comprises multiple pieces of supervisory information, each piece of supervisory information comprising a manually annotated question and a reference answer corresponding to a piece of historical input information.
The question generation method further comprises:
correcting the question generation model according to the supervisory information to obtain a corrected question generation model.
Optionally, correcting the question generation model according to the supervisory information to obtain a corrected question generation model specifically comprises:
determining, based on the question generation model, a historical question sequence corresponding to each piece of historical input information;
computing a question generation loss from each historical question sequence and its corresponding manually annotated question;
computing an auxiliary answer loss from each historical question sequence and its corresponding reference answer;
wherein each reference answer includes answer type words corresponding to the historical input information, the historical question sequence includes generated words corresponding to the historical input information, A is the set of answer type words, |A| denotes the number of answer type words in the set, and l(y_t, a_n) is the loss between a generated word y_t in a question sequence and the corresponding answer type word a_n;
determining a supervisory information loss from the question generation loss and the auxiliary answer loss,
where λ denotes a reference coefficient; and
correcting the question generation model according to the supervisory information loss to obtain a corrected question generation model.
Optionally, the format of the historical input information of the different users is head entity, relation, tail entity.
Expanding each piece of historical input information to obtain a corresponding context representation specifically comprises:
for the head entity and/or the tail entity, using its type information in the knowledge base as the context representation of the head entity and/or the tail entity; and
for the relation, using at least one of its domain, range, and topic in the knowledge base, and sentences aligned by distant supervision, as the context representation of the relation.
Optionally, when an entity has multiple types in the knowledge base, the most frequently used and most discriminative type is selected as the context representation of the head entity and/or the tail entity.
Optionally, establishing a question generation model according to each piece of input information and its corresponding context representation specifically comprises:
for each pair of input information and its corresponding context representation,
training on the input information to obtain training information;
obtaining a representation sequence from the context representation based on a first sequence model;
fusing the training information and the representation sequence to obtain fused information;
encoding the fused information to obtain a hidden state sequence; and
decoding each hidden state sequence to compute a corresponding decoding sequence function, the decoding sequence function being the question generation model.
Optionally, decoding each hidden state sequence to compute a corresponding decoding sequence function specifically comprises:
decoding each hidden state sequence based on a second sequence model to obtain decoding information;
computing, from the decoding information, a knowledge base copy mode probability of copying the name corresponding to the historical input information from the knowledge base, a context copy mode probability of copying from the context representation, and a vocabulary generation mode probability of generating a word from the vocabulary;
computing the predicted probability P(y_t|s_t, y_{t-1}, F, C) of the target word from the knowledge base copy mode probability p_cpkb, the context copy mode probability p_cpctx, and the vocabulary generation mode probability p_genv:
P(y_t|s_t, y_{t-1}, F, C) = p_genv · P(y_t|genv) + p_cpkb · P(y_t|cpkb) + p_cpctx · P(y_t|cpctx),
where genv, cpkb, and cpctx denote the vocabulary generation mode, the knowledge base copy mode, and the context copy mode respectively; p_· denotes the probability of each of the three modes; P(*|*) denotes the probability of generating the target word under each mode; F and C denote the input information and the context respectively; s_t denotes the current decoding state; and y_t denotes the word generated at the current time step; and
decoding word by word according to the predicted probability P(y_t|s_t, y_{t-1}, F, C) to obtain the decoded question sequence function.
To solve the above technical problems, the present invention further provides the following solutions:
A system for generating questions from given facts, the system comprising:
an acquisition unit, configured to acquire historical reference data, the historical reference data comprising multiple pieces of historical input information from different users;
an expansion unit, configured to expand each piece of historical input information to obtain a corresponding context representation;
a modeling unit, configured to establish a question generation model according to each piece of input information and its corresponding context representation; and
a determination unit, configured to determine, based on the question generation model and the current input information of the current user, a question sequence corresponding to the current input information.
To solve the above technical problems, the present invention further provides the following solutions:
A system for generating questions from given facts, comprising:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
acquire historical reference data, the historical reference data comprising multiple pieces of historical input information from different users;
expand each piece of historical input information to obtain a corresponding context representation;
establish a question generation model according to each piece of input information and its corresponding context representation; and
determine, based on the question generation model and the current input information of the current user, a question sequence corresponding to the current input information.
To solve the above technical problems, the present invention further provides the following solutions:
A computer-readable storage medium storing one or more programs which, when executed by an electronic device comprising multiple application programs, cause the electronic device to:
acquire historical reference data, the historical reference data comprising multiple pieces of historical input information from different users;
expand each piece of historical input information to obtain a corresponding context representation;
establish a question generation model according to each piece of input information and its corresponding context representation; and
determine, based on the question generation model and the current input information of the current user, a question sequence corresponding to the current input information.
According to embodiments of the present invention, the present invention discloses the following technical effects:
The present invention establishes a question generation model from historical reference data; based on the question generation model, the question sequence corresponding to the current input information can be accurately determined from only a small amount of current input information given by the current user.
Description of the Drawings
Fig. 1 is a flowchart of the method for generating questions from given facts according to the present invention;
Fig. 2 is a schematic diagram of answer-assisted supervision;
Fig. 3 is a schematic diagram of the module structure of the system for generating questions from given facts according to the present invention.
Reference signs:
acquisition unit 1, expansion unit 2, modeling unit 3, determination unit 4.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments are used only to explain the technical principles of the present invention and are not intended to limit its scope of protection.
The object of the present invention is to provide a method for generating questions from given facts that establishes a question generation model from historical reference data; based on the question generation model, the question sequence corresponding to the current input information can be accurately determined from only a small amount of current input information given by the current user.
To make the above objects, features, and advantages of the present invention more comprehensible, the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the method for generating questions from given facts according to the present invention comprises:
Step 100: acquiring historical reference data, the historical reference data comprising multiple pieces of historical input information from different users;
Step 200: expanding each piece of historical input information to obtain a corresponding context representation;
Step 300: establishing a question generation model according to each piece of input information and its corresponding context representation;
Step 400: determining, based on the question generation model and the current input information of the current user, a question sequence corresponding to the current input information.
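Steps 100 through 400 can be sketched as a minimal driver. This is an illustrative Python sketch only; the function names, the knowledge base dictionary layout, and the stand-in model callable are assumptions, not the patent's implementation.

```python
def expand_with_context(fact, kb):
    """Step 200: attach context from the knowledge base to a
    head-relation-tail fact (entity types, relation domain).
    The dictionary layout of `kb` is an assumed simplification."""
    head, relation, tail = fact
    return {
        "fact": fact,
        "context": {
            "head_type": kb.get(head, {}).get("type", ""),
            "tail_type": kb.get(tail, {}).get("type", ""),
            "relation_domain": kb.get(relation, {}).get("domain", ""),
        },
    }

def generate_question(fact, kb, model):
    """Steps 200-400: expand the fact, then let a trained question
    generation model (here any callable) decode a question sequence."""
    expanded = expand_with_context(fact, kb)
    return model(expanded)

# Hypothetical usage with a toy knowledge base and a stub "model".
kb = {
    "Statue of Liberty": {"type": "monument"},
    "New York City": {"type": "city"},
    "location": {"domain": "place"},
}
stub_model = lambda e: (
    f"Which {e['context']['tail_type']} is the "
    f"{e['context']['head_type']} {e['fact'][0]} located in?"
)
question = generate_question(("Statue of Liberty", "location", "New York City"),
                             kb, stub_model)
```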
Further, the historical reference data further comprises multiple pieces of supervisory information, each piece of supervisory information comprising a manually annotated question and a reference answer corresponding to a piece of historical input information.
To improve determination accuracy, the method for generating questions from given facts according to the present invention further comprises:
correcting the question generation model according to the supervisory information to obtain a corrected question generation model.
Traditional methods use supervisory information only on the question side: the generated question is compared with the manually annotated question, the difference between them is taken as the loss, and an optimizer is used for training. This easily yields ambiguous questions, i.e., one question corresponding to multiple correct answers. Therefore, in addition to manually annotated questions as supervisory information, answer information is used as auxiliary supervision, with the goal that the generated question contains the type words of at least one answer.
Further, correcting the question generation model according to the supervisory information to obtain a corrected question generation model specifically comprises:
Step S1: determining, based on the question generation model, the historical question sequence corresponding to each piece of historical input information.
Step S2: computing the question generation loss from each historical question sequence and its corresponding manually annotated question.
Step S3: computing the auxiliary answer loss from each historical question sequence and its corresponding reference answer (as shown in Fig. 2),
wherein each reference answer includes answer type words corresponding to the historical input information (for example, the type words of New York City include city, administrative region, and so on), the historical question sequence includes generated words corresponding to the historical input information, A is the set of answer type words, |A| denotes the number of answer type words in the set, and l(y_t, a_n) is the loss between a generated word y_t in a question sequence and the corresponding answer type word a_n. The auxiliary answer loss is taken as the minimum of the computed values.
Step S4: determining the supervisory information loss from the question generation loss and the auxiliary answer loss,
where λ denotes a reference coefficient.
Step S5: correcting the question generation model according to the supervisory information loss to obtain the corrected question generation model.
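The loss combination in steps S2 to S4 can be sketched as follows. The formula images are not reproduced in this text, so this minimal Python sketch assumes a cross-entropy form for the question generation loss and a weighted sum L = L_q + λ·L_a for the supervisory information loss; both forms are assumptions consistent with, but not verbatim from, the patent.

```python
import math

def generation_loss(token_probs):
    """Question-side loss: mean negative log-likelihood of the
    annotated question's tokens under the model (assumed
    cross-entropy form)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def auxiliary_answer_loss(word_type_losses):
    """Answer-side loss: for each answer type word a_n, take the
    smallest per-word loss l(y_t, a_n) over the generated words y_t,
    then minimize over the answer type set A, so the question only
    needs to contain *some* answer type word."""
    return min(min(per_word) for per_word in word_type_losses.values())

def supervisory_loss(l_q, l_a, lam=0.5):
    """Combined supervisory information loss, assumed here as
    L = L_q + lambda * L_a with lambda the reference coefficient."""
    return l_q + lam * l_a
```

For instance, with per-word losses {"city": [2.0, 0.3], "region": [1.5, 0.9]}, the auxiliary loss is 0.3: the generated question is only required to match the single best answer type word.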
In this embodiment, the format of the historical input information of different users is head entity (subject), relation (predicate), tail entity (object).
Further, in step 200, expanding each piece of historical input information to obtain a corresponding context representation specifically comprises:
for the head entity and/or the tail entity, using its type information in the knowledge base as the context representation of the head entity and/or the tail entity; and
for the relation, using at least one of its domain, range, and topic in the knowledge base, and sentences aligned by distant supervision, as the context representation of the relation.
When an entity has multiple types in the knowledge base, the most frequently used and most discriminative type is selected as the context representation of the head entity and/or the tail entity.
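The type selection described above can be sketched as a small scoring function. The patent only requires the "most frequently used and most discriminative" type; the TF-IDF-style score below is an assumed instantiation of that criterion, and the count dictionaries are hypothetical inputs.

```python
import math

def select_type(candidate_types, usage_counts, entity_counts, total_entities):
    """Select one type word as an entity's context representation.
    Scores each candidate type by how often it is used (frequency)
    times an IDF-style term (rarer types discriminate entities better);
    this concrete score is an assumption, not the patent's formula."""
    def score(t):
        tf = math.log(1 + usage_counts.get(t, 0))
        idf = math.log(total_entities / (1 + entity_counts.get(t, 0)))
        return tf * idf
    return max(candidate_types, key=score)
```

With hypothetical counts, a broad type like "location" (carried by half of all entities) loses to a narrower but still common type like "city", which matches the intent of preferring discriminative types.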
In step 300, establishing a question generation model according to each piece of input information and its corresponding context representation specifically comprises:
for each pair of input information and its corresponding context representation,
training on the input information to obtain training information;
obtaining a representation sequence from the context representation based on a first sequence model;
fusing the training information and the representation sequence to obtain fused information;
encoding the fused information to obtain a hidden state sequence; and
decoding each hidden state sequence to compute a corresponding decoding sequence function, the decoding sequence function being the question generation model.
For the symbolized input facts, knowledge base representation learning methods such as TransE can be used for pre-training on a large-scale corpus, or the embeddings can be randomly initialized and trained together with the first sequence model. The context information can be modeled with the first sequence model, which may be a recurrent neural network (RNN), a gated recurrent unit (GRU), a long short-term memory network (LSTM), or a Transformer model. In this way, each element of the reference information has both a symbolic representation and a contextual representation, which can be fused through a gate; finally, the fused information can be encoded into a hidden state sequence (e.g., H_f = [h_s; h_p; h_o]).
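The gate-based fusion of symbolic and contextual representations described above can be sketched as follows. The patent states only that fusion is done through a gate; the scalar gate g = sigmoid(w·[sym; ctx] + b) and the parameter names below are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(symbolic, contextual, w_sym, w_ctx, bias=0.0):
    """Fuse the symbolic embedding of a fact element with its
    contextual representation: a learned scalar gate g weighs the two,
    h = g * sym + (1 - g) * ctx. Weight vectors are assumed learned
    parameters; here they are passed in directly for illustration."""
    z = (sum(ws * s for ws, s in zip(w_sym, symbolic))
         + sum(wc * c for wc, c in zip(w_ctx, contextual))
         + bias)
    g = sigmoid(z)
    return [g * s + (1.0 - g) * c for s, c in zip(symbolic, contextual)]
```

With zero weights the gate is 0.5 and the fusion reduces to an elementwise average, a useful sanity check on the formula.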
Further, decoding each hidden state sequence to compute a corresponding decoding sequence function specifically comprises:
decoding each hidden state sequence based on a second sequence model to obtain decoding information;
computing, from the decoding information, the knowledge base copy mode probability of copying the name corresponding to the historical input information from the knowledge base, the context copy mode probability of copying from the context representation, and the vocabulary generation mode probability of generating a word from the vocabulary;
computing the predicted probability P(y_t|s_t, y_{t-1}, F, C) of the target word from the knowledge base copy mode probability p_cpkb, the context copy mode probability p_cpctx, and the vocabulary generation mode probability p_genv:
P(y_t|s_t, y_{t-1}, F, C) = p_genv · P(y_t|genv) + p_cpkb · P(y_t|cpkb) + p_cpctx · P(y_t|cpctx),
where genv, cpkb, and cpctx denote the vocabulary generation mode, the knowledge base copy mode, and the context copy mode respectively; p_· denotes the probability of each of the three modes; P(*|*) denotes the probability of generating the target word under each mode; F and C denote the input information and the context respectively; s_t denotes the current decoding state; and y_t denotes the word generated at the current time step; and
decoding word by word according to the predicted probability P(y_t|s_t, y_{t-1}, F, C) to obtain the question sequence function.
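The three-mode weighted sum above can be sketched in a few lines. The dictionary-based distributions stand in for the model's softmax outputs and are an illustrative simplification.

```python
def mixture_probability(token, mode_weights, mode_dists):
    """Final probability of a target token as the weighted sum over
    the vocabulary generation, knowledge base copy, and context copy
    modes. `mode_weights` holds p_genv, p_cpkb, p_cpctx; `mode_dists`
    holds each mode's token distribution (absent tokens have
    probability zero under that mode)."""
    return sum(mode_weights[m] * mode_dists[m].get(token, 0.0)
               for m in ("genv", "cpkb", "cpctx"))

# Hypothetical per-step distributions for illustration.
weights = {"genv": 0.5, "cpkb": 0.3, "cpctx": 0.2}
dists = {
    "genv": {"where": 0.4, "is": 0.3},
    "cpkb": {"Statue of Liberty": 1.0},
    "cpctx": {"city": 0.6, "monument": 0.4},
}
```

Here a token such as "where" can only come from the vocabulary mode, so its final probability is 0.5 · 0.4 = 0.2, while copied entity names draw their mass from the copy modes.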
Likewise, the decoder can decode with a second sequence model, which may be a recurrent neural network (RNN), a gated recurrent unit (GRU), a long short-term memory network (LSTM), or a Transformer model. During decoding, to better capture the input information, the following copy mechanisms can be selectively applied: 1. since the head entity often appears in the generated question, the name corresponding to the head entity symbol is copied from the knowledge base; 2. the expanded context is copied; note that the input context may contain many repeated words, so a maxout pointer mechanism is used: when the same token appears multiple times, the highest copy score of that token, rather than the sum, is taken as its score in copy mode. In total there are three modes: the two copy modes and generating tokens from the vocabulary. The weighted sum of the three modes gives the final token probability; decoding word by word according to the predicted probability of the target word then yields the question generation model. The question generation model is further corrected to obtain the corrected question generation model, with which the question sequence information can be accurately determined from a small number of given facts.
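The maxout pointer step described above (keeping the maximum copy score over repeated context tokens instead of the sum) can be sketched as:

```python
def maxout_pointer_scores(context_tokens, copy_scores):
    """Maxout pointer: when a token occurs several times in the
    context, keep only the maximum copy score over its occurrences,
    rather than summing them. Returns a token -> score map."""
    best = {}
    for tok, score in zip(context_tokens, copy_scores):
        if score > best.get(tok, float("-inf")):
            best[tok] = score
    return best
```

For a context like ["new", "york", "new"] with scores [0.2, 0.5, 0.7], the repeated token "new" keeps 0.7 instead of 0.9, preventing repeated context words from dominating the copy distribution.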
The effectiveness of the present invention is verified by the following experiments.
Test corpus:
SimpleQuestions: currently the largest knowledge base question answering dataset.
Compared methods:
Template: template-based question generation;
Serban et al. (2016): question generation with a sequence-to-sequence model;
Elsahar et al. (2018): question generation with a single context introduced.
The experimental results are shown in Table 2.
Table 2
Overall performance comparison: effectiveness is illustrated by comparing the existing methods with the present invention. The performance of the present invention is clearly better than the baseline methods, and with answer-assisted supervision added (last row), performance improves further.
Table 3
Predicate coverage comparison (as shown in Table 3): whether the generated question correctly expresses the predicate of the given input was evaluated manually, and the proportion of questions correctly expressing the given predicate (predicate identification) was computed; the present invention performs best.
Table 4
Answer coverage comparison: the present invention further defines an evaluation metric Ans_cov, the proportion of generated questions containing answer type words, to evaluate how definitely a generated question determines its answer, and adjusts the weight λ of the answer-assisted supervisory information. With answer-assisted supervision, the BLEU score is higher and the improvement in Ans_cov is more pronounced.
此外,本发明还提供一种基于给定事实的问句生成系统,可基于少量给定事实,准确确定问题。In addition, the present invention also provides a question generation system based on given facts, which can accurately determine questions based on a small number of given facts.
如图3所示,本发明基于给定事实的问句生成系统包括:获取单元1、扩展单元2、建模单元3及确定单元4。As shown in FIG. 3 , the system for generating question sentences based on given facts in the present invention includes: an acquisition unit 1 , an expansion unit 2 , a modeling unit 3 and a determination unit 4 .
Specifically, the acquisition unit 1 is configured to acquire historical reference data, the historical reference data including multiple pieces of historical input information from different users;
The expansion unit 2 is configured to expand each piece of historical input information to obtain a corresponding context representation;
The modeling unit 3 is configured to build a question generation model from each piece of input information and its corresponding context representation;
The determination unit 4 is configured to determine, based on the question generation model and according to the current input information of the current user, a question sequence corresponding to the current input information.
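The interaction of the four units above can be sketched as a minimal pipeline. The class, its method names, and the stub "model" are illustrative assumptions; a real system would train a neural question generation model as described earlier rather than a lookup table.

```python
class QuestionGenerationSystem:
    """Minimal sketch of the four-unit system (illustrative names)."""

    def acquire(self):
        # Acquisition unit 1: gather historical input information
        # from different users (hard-coded here for illustration).
        return ["capital_of(France, ?)", "author_of(Hamlet, ?)"]

    def expand(self, fact):
        # Expansion unit 2: derive a context representation for a fact.
        return f"context for {fact}"

    def build_model(self, pairs):
        # Modeling unit 3: build the question generation model; stubbed
        # here as a lookup table instead of a trained neural model.
        self.model = {fact: f"question about {fact}" for fact, _ in pairs}

    def generate(self, current_input):
        # Determination unit 4: determine the question sequence for the
        # current user's input via the question generation model.
        return self.model.get(current_input, "what is this about?")

system = QuestionGenerationSystem()
facts = system.acquire()
pairs = [(f, system.expand(f)) for f in facts]
system.build_model(pairs)
print(system.generate("capital_of(France, ?)"))
# prints "question about capital_of(France, ?)"
```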
Further, the present invention also provides a question generation system based on given facts, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to:
acquire historical reference data, the historical reference data including multiple pieces of historical input information from different users;
expand each piece of historical input information to obtain a corresponding context representation;
build a question generation model from each piece of input information and its corresponding context representation;
based on the question generation model and according to the current input information of the current user, determine a question sequence corresponding to the current input information.
The present invention also provides a computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the following operations:
acquiring historical reference data, the historical reference data including multiple pieces of historical input information from different users;
expanding each piece of historical input information to obtain a corresponding context representation;
building a question generation model from each piece of input information and its corresponding context representation;
based on the question generation model and according to the current input information of the current user, determining a question sequence corresponding to the current input information.
Compared with the prior art, the question generation system based on given facts and the computer-readable storage medium of the present invention achieve the same beneficial effects as the above question generation method based on given facts, which are not repeated here.
Thus far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the accompanying drawings. However, those skilled in the art will readily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will all fall within the protection scope of the present invention.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911276552.4A CN111061851B (en) | 2019-12-12 | 2019-12-12 | Question generation method and system based on given facts |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111061851A CN111061851A (en) | 2020-04-24 |
| CN111061851B true CN111061851B (en) | 2023-08-08 |
Family
ID=70300718
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201911276552.4A Active CN111061851B (en) | 2019-12-12 | 2019-12-12 | Question generation method and system based on given facts |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111061851B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111813907A (en) * | 2020-06-18 | 2020-10-23 | 浙江工业大学 | A Question Intention Recognition Method in Natural Language Question Answering Technology |
Citations (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102737042A (en) * | 2011-04-08 | 2012-10-17 | 北京百度网讯科技有限公司 | Method and device for establishing question generation model, and question generation method and device |
| GB201419051D0 (en) * | 2014-10-27 | 2014-12-10 | Ibm | Automatic question generation from natural text |
| CN105095444A (en) * | 2015-07-24 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Information acquisition method and device |
| CN106485370A (en) * | 2016-11-03 | 2017-03-08 | 上海智臻智能网络科技股份有限公司 | A kind of method and apparatus of information prediction |
| CN106599215A (en) * | 2016-12-16 | 2017-04-26 | 广州索答信息科技有限公司 | Question generation method and question generation system based on deep learning |
| CN107329967A (en) * | 2017-05-12 | 2017-11-07 | 北京邮电大学 | Question answering system and method based on deep learning |
| CN107463699A (en) * | 2017-08-15 | 2017-12-12 | 济南浪潮高新科技投资发展有限公司 | A kind of method for realizing question and answer robot based on seq2seq models |
| CN108363743A (en) * | 2018-01-24 | 2018-08-03 | 清华大学深圳研究生院 | A kind of intelligence questions generation method, device and computer readable storage medium |
| CN109522393A (en) * | 2018-10-11 | 2019-03-26 | 平安科技(深圳)有限公司 | Intelligent answer method, apparatus, computer equipment and storage medium |
| KR20190056184A (en) * | 2017-11-16 | 2019-05-24 | 주식회사 마인즈랩 | System for generating question-answer data for maching learning based on maching reading comprehension |
| CN109992657A (en) * | 2019-04-03 | 2019-07-09 | 浙江大学 | A Conversational Question Generation Method Based on Enhanced Dynamic Reasoning |
| CN110162613A (en) * | 2019-05-27 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for generating questions |
| CN110188182A (en) * | 2019-05-31 | 2019-08-30 | 中国科学院深圳先进技术研究院 | Model training method, dialogue generation method, device, equipment and medium |
| CN110196896A (en) * | 2019-05-23 | 2019-09-03 | 华侨大学 | A kind of intelligence questions generation method towards the study of external Chinese characters spoken language |
| CN110209790A (en) * | 2019-06-06 | 2019-09-06 | 阿里巴巴集团控股有限公司 | Question and answer matching process and device |
| CN110543553A (en) * | 2019-07-31 | 2019-12-06 | 平安科技(深圳)有限公司 | question generation method and device, computer equipment and storage medium |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10572595B2 (en) * | 2017-04-13 | 2020-02-25 | Baidu Usa Llc | Global normalized reader systems and methods |
| US10902738B2 (en) * | 2017-08-03 | 2021-01-26 | Microsoft Technology Licensing, Llc | Neural models for key phrase detection and question generation |
| US11250038B2 (en) * | 2018-01-21 | 2022-02-15 | Microsoft Technology Licensing, Llc. | Question and answer pair generation using machine learning |
- 2019-12-12: CN application CN201911276552.4A, published as patent CN111061851B (status: Active)
Non-Patent Citations (1)
| Title |
|---|
| Yankai Lin et al. Neural Relation Extraction with Selective Attention over Instances. Association for Computational Linguistics, 2016 (full text). * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111061851A (en) | 2020-04-24 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109783657B (en) | Multi-step self-attention cross-media retrieval method and system based on limited text space | |
| CN108962224B (en) | Joint modeling method, dialogue method and system for spoken language understanding and language model | |
| CN108363743B (en) | Intelligent problem generation method and device and computer readable storage medium | |
| CN111078836A (en) | Machine reading comprehension method, system and device based on external knowledge enhancement | |
| CN113971394B (en) | Text repetition rewriting system | |
| US20230351153A1 (en) | Knowledge graph reasoning model, system, and reasoning method based on bayesian few-shot learning | |
| CN112100332A (en) | Word embedding expression learning method and device and text recall method and device | |
| CN110688489B (en) | Knowledge graph deduction method and device based on interactive attention and storage medium | |
| CN112100348A (en) | Knowledge base question-answer relation detection method and system of multi-granularity attention mechanism | |
| CN107862087A (en) | Sentiment analysis method, apparatus and storage medium based on big data and deep learning | |
| CN103646088A (en) | Product comment fine-grained emotional element extraction method based on CRFs and SVM | |
| CN107766320A (en) | A kind of Chinese pronoun resolution method for establishing model and device | |
| CN112668344B (en) | Diverse problem generation method with controllable complexity based on hybrid expert model | |
| CN109933792A (en) | Viewpoint type problem based on multi-layer biaxially oriented LSTM and verifying model reads understanding method | |
| CN112100342A (en) | Knowledge graph question-answering method based on knowledge representation learning technology | |
| CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
| CN113705207A (en) | Grammar error recognition method and device | |
| CN116304085A (en) | A diabetes question answering method based on deep learning and knowledge graph | |
| Amiri et al. | Spotting spurious data with neural networks | |
| CN111061851B (en) | Question generation method and system based on given facts | |
| Zhang | An assisted teaching method of college English translation using generative adversarial network | |
| CN119578398A (en) | An intelligent homework correction method based on syntax tree and semantic analysis | |
| CN107274077B (en) | Course Sequence Calculation Method and Equipment | |
| He et al. | [Retracted] Application of Grammar Error Detection Method for English Composition Based on Machine Learning | |
| CN116052048A (en) | Multi-mode reasoning and iterative optimization video description generation model and method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||