
CN117493508A - Question-answer data generation method and device, computer equipment and storage medium - Google Patents

Question-answer data generation method and device, computer equipment and storage medium

Info

Publication number
CN117493508A
Authority
CN
China
Prior art keywords
data
question
answer
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311434678.6A
Other languages
Chinese (zh)
Inventor
杨文超
于祥
黄冰亮
金亮
李彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd
Priority to CN202311434678.6A priority Critical patent/CN117493508A/en
Publication of CN117493508A publication Critical patent/CN117493508A/en
Priority to PCT/CN2024/107859 priority patent/WO2025092056A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to the field of data processing, and in particular to a question-answer data generation method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring a plurality of cut document blocks, wherein each cut document block contains a first preset amount of text data; obtaining target question data corresponding to a cut document block according to the text data in the cut document block and a target question model; acquiring full-volume data, wherein the full-volume data is content information associated with the target question data and contained in the complete document composed of the plurality of cut document blocks; obtaining target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data, and a target answer model; and generating question-answer data according to the target question data and the target answer data. The method and apparatus solve the problems of low efficiency and poor data quality encountered in the related art when mining QA data from documents.

Description

Question-answer data generation method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a method and apparatus for generating question-answer data, a computer device, and a storage medium.
Background
Most enterprises hold large amounts of text data deposited over long-term business operation. If an enterprise wishes to experiment with large-model intelligence, it may need to draw on this existing data. However, this data has been neither cleaned and organized nor annotated, so technicians mining the required QA (Question and Answer) data, i.e., question-answer data, from existing documents work inefficiently, and the quality of the mined data is poor.
Currently, to ensure that higher-quality QA data can be mined from existing documents, manual annotation is generally adopted. This, however, incurs high labor and time costs, cannot guarantee timeliness, and depends on the professional competence of the annotators, so the quality of the mined data is uneven and a fair amount of poor-quality data appears.
The related art therefore suffers from low efficiency and poor data quality when mining QA data from documents.
Disclosure of Invention
In view of the above, the present disclosure provides a question-answer data generation method and apparatus, a computer device, and a storage medium, so as to solve the problems of low efficiency and poor data quality encountered in the related art when mining QA data from documents.
In a first aspect, the present disclosure provides a question-answer data generation method, the method comprising:
acquiring a plurality of cut document blocks, wherein each cut document block contains a first preset amount of text data;
obtaining target question data corresponding to a cut document block according to the text data in the cut document block and a target question model;
acquiring full-volume data, wherein the full-volume data is content information associated with the target question data and contained in the complete document composed of the plurality of cut document blocks;
obtaining target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data, and a target answer model;
and generating question-answer data according to the target question data and the target answer data.
In the embodiment of the disclosure, a plurality of cut document blocks are acquired, each containing a first preset amount of text data; target question data corresponding to a cut document block is obtained according to the text data in the block and the target question model; full-volume data is acquired, the full-volume data being content information associated with the target question data and contained in the complete document composed of the plurality of cut document blocks; target answer data corresponding to the target question data is obtained according to the cut document block, the target question data, the full-volume data, and the target answer model; and question-answer data is generated from the target question data and the target answer data. The embodiments of the present disclosure can efficiently and automatically mine high-quality question-answer data from existing documents, greatly reducing labor and time costs and solving the low efficiency and poor data quality of manual mining encountered in the related art when mining question-answer data from documents.
In an alternative embodiment, before obtaining the target question data corresponding to the cut document block according to the text data in the cut document block and the target question model, the method further includes:
determining a question strategy according to the text data in the cut document block;
processing the text data according to the question strategy to extract a plurality of questions;
obtaining first scores of the plurality of questions according to a question scoring model;
determining a plurality of candidate questions from the plurality of questions according to the first scores;
and optimizing an initial question model according to the candidate questions to obtain the target question model.
In the embodiment of the disclosure, high-quality questions are screened out by the question scoring model, so that the initial question model continuously adapts to the mining strategies and styles of a specific enterprise or industry, improving the accuracy of the trained target question model. At the same time, dynamic optimization with the related strategies and algorithms lets the system keep learning and improving while continuously generating high-quality, faithful synthetic data.
In an alternative embodiment, determining the question strategy according to the text data in the cut document block includes:
determining a text type of the text data;
and determining the corresponding question strategy according to the text type.
In the embodiment of the disclosure, the question strategy, answer strategy, and the like are flexibly selected and adjusted according to different texts and industry requirements, making them better suited to real application scenarios.
In an alternative embodiment, before obtaining the target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data, and the target answer model, the method further includes:
determining an answer strategy according to the text data in the cut document block and the target question data;
processing the text data according to the answer strategy to obtain a plurality of answer data;
obtaining second scores of the plurality of answer data according to an answer scoring model;
determining a plurality of candidate answer data from the plurality of answer data according to the second scores;
and optimizing an initial answer model according to the candidate answer data to obtain the target answer model.
In the embodiment of the disclosure, iterative optimization and advanced machine learning methods lay a solid foundation for constructing a high-quality question-answer library. At the same time, the question strategy, answer strategy, and the like are flexibly selected and adjusted to better fit real application scenarios.
In an alternative embodiment, determining the answer strategy according to the text data in the cut document block and the target question data includes:
determining a text type of the text data;
and determining the answer strategy according to the text type and the target question data.
In an alternative embodiment, obtaining the target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data, and the target answer model includes:
obtaining a plurality of answer data according to the cut document block, the target question data, the full-volume data, and the target answer model;
splitting the answer data into minimal units to obtain a plurality of target fields;
comparing the descriptive content of the target fields across every second preset number of answer data;
retaining the descriptive content corresponding to the answer data with the highest second score;
and integrating the descriptive content to obtain the target answer data.
In the embodiment of the disclosure, synthesizing multiple pieces of answer data efficiently ensures the quality and comprehensiveness of the answers; compared with a single answer, this greatly improves the quality of the constructed QA library.
In an alternative embodiment, after obtaining the question-answer data for the cut document block, the method further comprises:
scoring the question-answer data with a question-answer data scoring model to obtain a third score of the question-answer data;
performing data cleaning on the question-answer data according to the third score to obtain target question-answer data meeting preset requirements;
and performing data sample expansion according to the target question-answer data to obtain a third preset number of question-answer data.
In the embodiment of the disclosure, strategies such as multi-task learning and reinforcement learning are applied to improve data quality. At the same time, methods such as model ensembling and data augmentation improve the robustness of the model and reduce the influence of uncertainty on model performance.
In an alternative embodiment, the cut document block includes text data obtained by picture recognition.
In a second aspect, the present disclosure provides a question-answer data generation apparatus, the apparatus including:
a first acquisition module, configured to acquire a plurality of cut document blocks, wherein each cut document block contains a first preset amount of text data;
a first obtaining module, configured to obtain target question data corresponding to the cut document block according to the text data in the cut document block and the target question model;
a second acquisition module, configured to acquire full-volume data, wherein the full-volume data is content information associated with the target question data and contained in the complete document composed of the plurality of cut document blocks;
a second obtaining module, configured to obtain target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data, and the target answer model;
and a third obtaining module, configured to generate question-answer data according to the target question data and the target answer data.
In a third aspect, the present disclosure provides a computer device comprising a memory and a processor, wherein the memory stores computer instructions and the processor executes the computer instructions to perform the question-answer data generation method of the first aspect or any corresponding embodiment thereof.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having computer instructions stored thereon, the computer instructions being configured to cause a computer to execute the question-answer data generation method of the first aspect or any corresponding embodiment thereof.
Drawings
To illustrate the embodiments of the present disclosure or the prior art more clearly, the drawings required in the detailed description or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present disclosure, and a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a flow diagram of a method of generating question-answer data according to some embodiments of the present disclosure;
FIG. 2 is a complete flow diagram of a method of generating question-answer data according to some embodiments of the present disclosure;
fig. 3 is a block diagram of a structure of a question-answer data generating apparatus according to some embodiments of the present disclosure;
fig. 4 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are plainly some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art based on the embodiments in this disclosure without inventive effort fall within the scope of protection of this disclosure.
Most enterprises hold large amounts of text data deposited over long-term business operation. If an enterprise wishes to experiment with large-model intelligence, it may need to draw on this existing data. However, this data has been neither cleaned and organized nor annotated, and at best it can only be used for pre-training. Pre-training is very expensive, and the data volume most enterprises can accumulate is not large enough, so the effect of pre-training is limited.
The best strategy an enterprise can attempt is model fine-tuning. This requires cleaning the historical business data and extracting QA from it. However, owing to data security concerns, it is difficult for an enterprise to hand its data to a third-party annotation company for processing. Therefore, to ensure that higher-quality QA data can be mined from existing documents, manual annotation is currently the usual approach, but it incurs high labor and time costs, cannot guarantee timeliness, and depends on the professional competence of the annotators, so the quality of the mined data is uneven and a fair amount of poor-quality data appears.
To solve the above problems, the embodiments of the present disclosure propose an embodiment of a question-answer data generation method. It should be noted that the steps illustrated in the flowcharts of the drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the illustrated or described steps may be executed in a different order.
In this embodiment, a question-answer data generation method is provided. Fig. 1 is a flow diagram of a question-answer data generation method according to some embodiments of the present disclosure. As shown in Fig. 1, the method may be applied on a server side, and the method flow includes the following steps:
Step S101, acquiring a plurality of cut document blocks, wherein each cut document block contains a first preset amount of text data.
Optionally, in the embodiment of the present disclosure, the server side obtains a document from which question-answer (QA) data is to be extracted (or mined) and cuts it, for example into 10 equal parts, to obtain a plurality of cut document blocks. With equal-proportion cutting, each cut document block contains the same first preset amount of text data: for example, if the document contains 100,000 characters of text and is cut into 10 parts, each cut document block contains the same 10,000 characters. A minimal sketch of this cutting step is shown below.
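As an illustration only, here is a minimal Python sketch of the equal-proportion cutting described above, assuming character-count splitting; the function and parameter names are our own, not taken from the patent.

```python
def cut_document(text: str, num_blocks: int = 10) -> list[str]:
    """Split a document into num_blocks blocks of (near-)equal size."""
    block_size = max(1, len(text) // num_blocks)
    blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    # Fold any small trailing remainder into the last block so every block
    # carries the same first preset amount of text data.
    if len(blocks) > num_blocks:
        blocks[num_blocks - 1] += "".join(blocks[num_blocks:])
        blocks = blocks[:num_blocks]
    return blocks

# 100,000 characters cut into 10 parts -> ten blocks of 10,000 characters each.
blocks = cut_document("x" * 100_000, num_blocks=10)
assert all(len(b) == 10_000 for b in blocks)
```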
In addition, before the document is cut, the text in the acquired document may exhibit quite a few problems; the text therefore needs to be preprocessed to obtain correct text in a uniform format. Table 1 describes some problems that may exist in the text and the corresponding solutions:
TABLE 1
Beyond the problems in the table above, other text problems may exist. For example, when system text is extracted automatically, grammatical errors in the document, wrongly written characters, and the like can cause misunderstandings; corresponding solutions include error detection and correction, semantic disambiguation, and so on. It should be noted that the foregoing is merely illustrative; the embodiments of the present disclosure include, but are not limited to, the text problems and data preprocessing methods described above. A hedged preprocessing sketch is shown below.
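As a hedged sketch only (Table 1's concrete rules are not reproduced in this text), the normalizations below show the general shape such preprocessing might take; each rule is an illustrative assumption, not the patent's list.

```python
import re

def preprocess(text: str) -> str:
    """Normalize raw document text into a uniform format before cutting."""
    text = text.replace("\u00a0", " ")                 # normalize non-breaking spaces
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)   # strip control characters
    text = re.sub(r"[ \t]+", " ", text)                # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)             # collapse runs of blank lines
    return text.strip()
```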
Step S102, obtaining target question data corresponding to the cut document block according to the text data in the cut document block and the target question model.
Optionally, the text data in each cut document block is input into the trained target question model to obtain the target question data corresponding to that cut document block.
Step S103, acquiring full-volume data, wherein the full-volume data is content information associated with the target question data and contained in the complete document composed of the plurality of cut document blocks.
Optionally, some answers require context to be fully understood; simply extracting questions and answers in isolation can lose important background information. Context information closely related to the questions and answers can be taken into account by enlarging the context window during processing, which enables the model to capture the necessary background and better understand questions and answers. Context can also be encoded with sentence- or paragraph-level representations; this helps the model capture long-range context and integrate it into the processing flow at prediction time.
However, obtaining question-answer data from context alone may still fall short. For example, the answer to a question in one document may depend on the content of another document, in which case full-volume data is required. The full-volume data is stored in a vector database (a database over the complete document composed of the plurality of cut document blocks). After vectorization, a question is matched against the data in the vector database, vector similarity is used to screen out the content information most relevant to the target question data (say, 10 entries), and the target question data together with the 10 related entries is fed into the model associated with the answer strategy; applying the answer strategy then yields the corresponding target answer data.
In this case, the full-volume data may be determined as follows (a retrieval sketch follows this list):
1. Cross-document retrieval: information can be retrieved and extracted across documents with an embedding model plus a vector database.
2. Document linking: if there are explicit links between documents, the documents can be crawled and processed in sequence along the links to build a contextual understanding.
3. Composite document modeling: build a multi-input model that takes several related documents at once for understanding and reasoning.
4. Answer extraction and synthesis: when the answer span is large, first extract candidate answer fragments, then synthesize the answer by splicing the fragments appropriately. In the extraction stage several answer fragments can be produced, each given a score; in the synthesis stage the highest-scoring answer is selected, or the higher-scoring answers are spliced together sensibly.
5. Paragraph-level processing: when the answer may span several paragraphs, use paragraphs as the processing unit and generate the answer from the information in multiple paragraphs. To capture a wider context, the document can be cut with a sliding window or another method.
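A minimal sketch of item 1, the embedding-model-plus-vector-database retrieval: the question vector is compared against the stored full-volume vectors by cosine similarity and the top-k entries (10 in the text's example) are kept. The embedding step itself is assumed to happen elsewhere; all names here are our own.

```python
import numpy as np

def top_k_related(question_vec: np.ndarray,
                  doc_vecs: np.ndarray,
                  docs: list[str],
                  k: int = 10) -> list[str]:
    """Return the k stored entries most similar to the question vector."""
    q = question_vec / np.linalg.norm(question_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                              # cosine similarity to every entry
    best = np.argsort(sims)[::-1][:k]         # indices of the k highest scores
    return [docs[i] for i in best]
```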
It will be appreciated that the full-volume data already contains the context information associated with the target question data.
Step S104, obtaining target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data, and the target answer model.
Optionally, when determining the target answer data corresponding to the target question data of each cut document block, the embodiment of the disclosure considers the text data of the cut document block, the target question data output by the target question model, and all full-volume data associated with the target question data, and inputs these data into the trained target answer model to obtain the target answer data corresponding to the target question data.
Step S105, generating question-answer data according to the target question data and the target answer data.
Optionally, each cut document block corresponds to one piece of target question data and one piece of target answer data. The question-answer pair composed of the two is the final question-answer data mined from that cut document block.
In addition, a QA library for storing question-answer data may be built from the final question-answer data mined from all the cut document blocks.
In the embodiment of the disclosure, a plurality of cut document blocks are acquired, each containing a first preset amount of text data; target question data corresponding to a cut document block is obtained according to the text data in the block and the target question model; full-volume data is acquired, the full-volume data being content information associated with the target question data and contained in the complete document composed of the plurality of cut document blocks; target answer data corresponding to the target question data is obtained according to the cut document block, the target question data, the full-volume data, and the target answer model; and question-answer data is generated from the target question data and the target answer data. The embodiments of the present disclosure can efficiently and automatically mine high-quality question-answer data from existing documents, greatly reducing labor and time costs and solving the low efficiency and poor data quality of manual mining encountered in the related art when mining question-answer data from documents.
In some alternative embodiments, before obtaining the target question data corresponding to the cut document block according to the text data in the cut document block and the target question model, the method further includes:
determining a question strategy according to the text data in the cut document block;
processing the text data according to the question strategy to extract a plurality of questions;
obtaining first scores of the plurality of questions according to a question scoring model;
determining a plurality of candidate questions from the plurality of questions according to the first scores;
and optimizing the initial question model according to the candidate questions to obtain the target question model.
Optionally, in the embodiment of the present disclosure, before the trained target questioning model is generated, continuous optimization processing is required on the initial questioning model to obtain the final target questioning model. The underlying QA structured data needs to be screened before optimizing the initial challenge model. The text suitable for extracting the QA structured data has the following characteristics:
a. the content is rich: the text should contain rich information points and descriptions
b. The logic is clear: the text should be structured reasonably and expressed accurately to ensure that the questions and answers extracted therefrom are correct.
c. The method has the facts and objectivity: text has objective factual properties that help generate accurate, referenced questions and answers.
d. The method comprises the following steps of: the ideal text type should be highly relevant to the specific field and task to ensure that the extracted QA data has practical application value.
In this case, a question policy for each cut document block is determined based on the text features suitable for extracting QA structured data, so that a plurality of question questions (hereinafter, also referred to as "questions") are extracted from each cut document block.
To mine high-quality questions from which to construct related answers in subsequent passes, the embodiments of the present disclosure provide a question scoring model to score the acquired questions. The scoring criteria of the question scoring model are designed as follows:
1. Difficulty of the question: pick out questions that pose a challenge.
2. Diversity of the questions: ensure that questions come from different domains and knowledge points.
3. Accuracy of the question: measure whether the question accurately describes a particular concept or fact.
4. Expression quality of the question: ensure the question has good grammar and wording.
5. Clarity of the question: whether the question is clear, unambiguous, and easy to understand.
6. Relevance of the question: whether the question is closely related to the given topic or dataset.
7. Information content of the question: whether the question involves meaningful information.
8. Complexity of the question: whether the question has a certain difficulty and cannot be answered by a simple lookup.
The plurality of questions are input into the question scoring model to obtain the score of each question, and the questions are ranked by score to pick out high-quality candidate questions. Understandably, there may be many high-quality candidates; multiple rounds of looping are therefore run while a scoring threshold is set, keeping only the questions above the threshold. The embodiments of the present disclosure can also adjust the scoring threshold as required to obtain questions of different numbers and quality.
The retained questions are then optimized based on feedback from the question scoring model, including revising the question formulation, supplementing key information, and so on.
The screened and optimized candidate questions are used as training samples and input into the initial question model for training. The initial question model is thus continuously optimized and question quality improves; after several iterations the system produces a high-quality question set, and the target question model is finally obtained. A sketch of this loop follows.
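One possible shape of this iterate-screen-optimize loop, sketched under the assumption that the extraction, scoring, and fine-tuning steps are available as callables; none of these names come from the patent.

```python
def train_target_question_model(blocks, extract_questions, score_question,
                                fine_tune, threshold: float = 0.8,
                                rounds: int = 3):
    """Repeatedly extract, score, screen, and retrain to refine the model."""
    model = None  # stands in for the initial question model
    for _ in range(rounds):
        # Extract candidate questions from every cut document block.
        questions = [q for b in blocks for q in extract_questions(b, model)]
        # Score each question and keep only those above the threshold.
        scored = sorted(((score_question(q), q) for q in questions), reverse=True)
        candidates = [q for s, q in scored if s >= threshold]
        # Optimize the model on the high-quality candidate questions.
        model = fine_tune(model, candidates)
    return model  # the target question model after several iterations
```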
In addition, for the process of training the initial question model to obtain the target question model, the embodiments of the present disclosure also suggest some optimization directions:
a. Multi-factor scoring: when designing the question scoring model, questions can be scored on a combination of factors beyond the criteria mentioned above. For example, an NLP model performs semantic analysis of the question, and relevance, information content, complexity, and other factors are combined into a composite score.
b. Adaptive threshold: in the question screening step, use an adaptive threshold strategy to adjust the scoring threshold dynamically, for example according to the number of questions, their difficulty, and their category. This screens out high-quality questions more accurately and improves the quality of the question library.
c. Balanced sample weights: in the retraining phase, balance sample weights to avoid model overfitting. Questions are weighted differently according to factors such as importance and difficulty, balancing the influence of different question types during training.
d. Introduce expert knowledge: in the question optimization stage, bring in domain-expert knowledge and optimize questions through expert feedback. Experts correct and revise the questions, improving their quality; their feedback can also further optimize the question scoring model.
e. Cross-validation: in the question collection and preprocessing stage, use cross-validation to increase the diversity of the question library. For example, when processing different datasets on the same topic, a portion of questions can be extracted from each dataset and then scored and screened. This yields a broader question set and improves the quality of the question library.
In the embodiment of the disclosure, high-quality questions are screened out by the question scoring model, so that the initial question model continuously adapts to the mining strategies and styles of a specific enterprise or industry, improving the accuracy of the trained target question model. At the same time, dynamic optimization with the related strategies and algorithms lets the system keep learning and improving while continuously generating high-quality, faithful synthetic data.
In some alternative embodiments, determining the question strategy according to the text data in the cut document block includes:
determining a text type of the text data;
and determining the corresponding question strategy according to the text type.
Optionally, analysis of the text type of the text data shows that the following text types are well suited to extracting QA structured data:
a. educational material: textbooks, lectures, courses, etc. contain rich knowledge and explanatory information.
b. Technical document: product manuals, API documents, development guidelines, and the like provide detailed technical specifications and methods of operation.
c. Research papers: research texts such as academic papers and industry reports often contain rich analyses and conclusions; the abstract in particular usually states the research problem and its conclusions.
d. Regulations and policy documents: official regulations, policy documents, and legal instruments are usually strictly structured and factual.
e. News reports and articles: these contain many descriptions of facts, events, and viewpoints, making it easy to extract event-related QA data from them.
f. Question-answer communities and forums: explicit questions and answers already exist on such platforms, and a QA library can be extracted and built from them directly.
g. Manuals and guidelines: these documents usually contain specific questions and corresponding answers, such as instructions for use, FAQs, and product manuals.
A corresponding question strategy is determined according to the text type, as shown in Table 2 (which includes both the question extraction strategy and the corresponding answer extraction strategy):
TABLE 2
Table 2 determines corresponding question strategies and answer extraction strategies only for the text types of educational materials (textbooks, lecture notes, courses, etc.), technical documents (product manuals, API documents, development guidelines, etc.), and research papers (academic papers, industry reports, etc.); strategies for regulations and policy documents, news reports, articles, and the like are likewise derived from the text type at hand. Since the table's contents are not reproduced here, the sketch below shows only the general shape of such a type-to-strategy mapping.
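Because Table 2 itself is not reproduced, the dictionary-dispatch sketch below is purely illustrative: it shows the shape of "determine the question strategy from the text type", with every strategy name a hypothetical placeholder.

```python
# Hypothetical text-type -> question-strategy mapping (placeholder names).
QUESTION_STRATEGY_BY_TYPE = {
    "educational_material": "definition_and_concept_questions",
    "technical_document":   "how_to_and_parameter_questions",
    "research_paper":       "problem_and_conclusion_questions",
    "regulation_policy":    "obligation_and_scope_questions",
    "news_report":          "event_fact_questions",
}

def pick_question_strategy(text_type: str) -> str:
    """Fall back to a generic strategy for unrecognized text types."""
    return QUESTION_STRATEGY_BY_TYPE.get(text_type, "generic_keyword_questions")
```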
When determining the corresponding question strategy according to the text type, some auxiliary strategies also apply:
a. Use different question models: depending on the text type, different question models may be needed to extract questions in a targeted way. For example, use a model pre-trained for the technical field when processing technical documents, and a model pre-trained for the legal field when processing regulatory documents.
b. Dynamically adjust the question strategy: adjust the question strategy dynamically according to the quality, domain, and background knowledge of the text, for example by adjusting the keyword-extraction threshold or using different entity and relation types.
c. Increase interpretability: keep questions interpretable during generation so that users can trace and understand the strategy by which each question was generated. To improve the interpretability of the questioning process, visualize the question strategy, provide the reasons a question was generated, and expose the model's attention to different parts of the text.
d. Adaptive strategy adjustment: when processing large amounts of text, the model may need to adapt automatically to different text characteristics and difficulty levels. Adaptive question strategies, such as dynamically adjusting the question model according to the complexity and domain of the text, achieve higher-quality question mining.
e. User interaction: to improve the effectiveness and flexibility of the question strategy, introduce user interaction so that users can take part in adjusting the strategy, provide feedback, and submit correction suggestions. For example, after a question is generated, ask the user to evaluate it and adjust the question strategy based on the feedback.
f. Combine unsupervised and supervised methods: when selecting a question strategy, combine unsupervised and supervised methods to improve the model's robustness. Unsupervised methods help extract latent structure in the text, while supervised methods generate more accurate questions from existing annotated data.
g. Balance question difficulty and diversity: when choosing a question strategy, the difficulty and diversity of questions must be balanced. Simple, direct questions may be easier for users to understand but may not cover the full content of the text; complex, diverse questions cover more dimensions but may challenge users' understanding. When constructing the QA database, it is recommended to weigh difficulty and diversity together and avoid over-weighting any one class of questions.
In the embodiment of the disclosure, the question strategy, answer strategy, and the like are flexibly selected and adjusted according to different texts and industry requirements, making them better suited to real application scenarios.
In some alternative embodiments, before obtaining the target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data, and the target answer model, the method further comprises:
determining an answer strategy according to the text data in the cut document block and the target question data;
processing the text data according to the answer strategy to obtain a plurality of answer data;
obtaining second scores of the plurality of answer data according to an answer scoring model;
determining a plurality of candidate answer data from the plurality of answer data according to the second scores;
and optimizing the initial answer model according to the candidate answer data to obtain the target answer model.
Optionally, an answer strategy (see Table 2) can already be determined from the text data in the cut document block and the target question data, after which a plurality of answer data is obtained. Collecting this answer data requires a series of preprocessing steps on the answers, including removing irrelevant words and correcting misspellings, to optimize how the answers are presented.
An answer scoring model is set up in advance to score the generated answers; high-quality answers are retained, and the scoring information is fed back to the initial answer model under training to optimize its performance. A common natural-language-processing answer scoring model can be chosen here. The goal of the answer scoring model is to rank the quality of the plurality of answer data against specific criteria, including: accuracy of the answer, i.e., whether its content is accurate and how well it matches the question; completeness of the answer, i.e., whether it captures complete information and answers the question comprehensively; and expression quality of the answer, i.e., its linguistic quality, including grammatical accuracy and understandability.
Each answer is then scored with the answer scoring model; the score reflects the overall quality of the answer.
The second scores of the plurality of answers are ranked, a threshold is set, and the candidate answer data whose scores exceed the threshold are retained. The screened and optimized candidate answer data are used as training samples, input into the initial answer model for further training, and iterated in a loop to optimize the performance of the initial answer model.
The above steps are iterated as needed. During iteration, the system continuously produces a high-quality answer set, preparing for the construction of the QA library and subsequent procedures, and finally the trained target answer model is obtained. The target answer data output by the target answer model should be accurate, complete, and well expressed.
During training of the initial answer model, some auxiliary strategies also apply:
a. Multi-task learning: during training, use a multi-task learning strategy to optimize the accuracy, completeness, and expression quality of answers simultaneously, for example via shared hidden-layer parameters or soft parameter sharing between tasks.
b. Reinforcement learning strategies: introduce reinforcement learning, for example the Actor-Critic algorithm or Deep Q-Learning, to determine the best answer and further raise quality in the answer generation step.
c. Knowledge graph integration: in the answer scoring model, reference an external knowledge graph to strengthen the accuracy and confidence of answers. The knowledge graph further verifies and supports the correctness of an answer and provides additional references where possible.
d. Real-time fine-tuning: for real-time application scenarios, adopt a real-time fine-tuning strategy, i.e., update the answer scoring model in real time based on newly generated answers, so the model adapts to shifting data distributions and answer quality improves.
e. Interpretability: add an interpretability mechanism, such as an attention mechanism or model sensitivity analysis, to the answer scoring model to monitor answer quality and provide a basis for analysis and adjustment.
In the embodiment of the disclosure, iterative optimization and advanced machine learning methods lay a solid foundation for constructing a high-quality question-answer library. At the same time, the question strategy, answer strategy, and the like are flexibly selected and adjusted to better fit real application scenarios.
In some alternative embodiments, determining the answer strategy according to the text data in the cut document block and the target question data includes:
determining a text type of the text data;
and determining the answer strategy according to the text type and the target question data.
Optionally, as in Table 2, the answer strategy can be determined from the text type and the target question data. A plurality of answers related to the target question data is then obtained according to the answer strategy; the answers are scored to build a high-quality answer set composed of multiple candidate answer data, and the initial answer model is iteratively optimized to obtain the trained target answer model. It should be noted that the target answer model ultimately outputs only one piece of target answer data.
In some alternative embodiments, obtaining the target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data, and the target answer model includes:
obtaining a plurality of answer data according to the cut document block, the target question data, the full-volume data, and the target answer model;
splitting the answer data into minimal units to obtain a plurality of target fields;
comparing the descriptive content of the target fields across every second preset number of answer data;
retaining the descriptive content corresponding to the answer data with the highest second score;
and integrating the descriptive content to obtain the target answer data.
Optionally, n answers may be generated for each question. These answers may come from different initial answer models, or from the same initial answer model under slightly varied input conditions. The screened answers are broken down into smaller information units (i.e., target fields); a unit may be a sentence or phrase describing a particular fact. Units are compared pairwise: if two units describe the same fact or detail, only the one from the higher-scored answer data is retained; if they describe different facts or details, both are kept.
The selected information units are combined to form a new answer, with additional strategies applied to decide their order, including but not limited to their order in the original answers or some logical ordering. Finally, the reconstructed answer is post-processed, including grammar checking, word-order adjustment, and spelling correction, to obtain the final target answer data. A minimal sketch of this synthesis follows.
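A minimal sketch of the synthesis step, assuming sentences as the minimal units and a same_fact() comparison supplied from elsewhere (for example, an embedding-similarity check); visiting answers in descending score order means the retained version of any duplicated fact always comes from the highest-scored answer.

```python
import re

def synthesize(answers: list[str], scores: list[float], same_fact) -> str:
    """Merge n answers, keeping one copy of each fact from the best answer."""
    units: list[str] = []
    # Visit answers from highest to lowest second score.
    for ans, _ in sorted(zip(answers, scores), key=lambda x: -x[1]):
        for sent in re.split(r"(?<=[.!?])\s+", ans.strip()):
            # Keep a sentence only if no higher-scored unit states the same fact.
            if sent and not any(same_fact(sent, kept) for kept in units):
                units.append(sent)
    return " ".join(units)  # grammar/word-order post-processing would follow
```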
In the embodiment of the disclosure, synthesizing multiple pieces of answer data efficiently ensures the quality and comprehensiveness of the answers; compared with a single answer, this greatly improves the quality of the constructed QA library.
In some alternative embodiments, after obtaining the question-answer data for the cut document block, the method further comprises:
scoring the question-answer data with a question-answer data scoring model to obtain a third score of the question-answer data;
performing data cleaning on the question-answer data according to the third score to obtain target question-answer data meeting preset requirements;
and performing data sample expansion according to the target question-answer data to obtain a third preset number of question-answer data.
Optionally, the embodiment of the disclosure sets up a question-answer data scoring model in advance to check the quality of the obtained QA. The core indicators of QA data are:
a. Completeness of questions and answers: whether the extracted questions and answers are complete, with no key parts truncated or missing, ensuring the answers provided are meaningful to users.
b. Consistency: whether similar or repeated questions and answers extracted from different parts or different documents agree with each other, ensuring answers remain consistent across documents and document parts.
c. Relevance: whether the extracted questions and answers are contextually or thematically related, ensuring the QA data is relevant to specific queries or tasks.
d. Readability: whether the questions and answers are easy to read and understand, ensuring the final model can make good use of the extracted content.
The question-answer data of each cut document block is scored against these core indicators to obtain the third score of the question-answer data. The scoring methods are:
rule-based inspection: the format of questions and answers is checked using regular expressions or other rules. The integrity of the answer is automatically checked, for example to ensure that the answer is not truncated.
Sentence embedding comparison: the questions and answers are converted into vectors using a pre-trained sentence embedding model. The embedding of the questions and answers is compared to evaluate their relevance or similarity.
Feedback loop: questions are automatically generated using the language model, and verification is performed against answers provided by the document. The automatically generated questions are compared with the actual extracted questions to evaluate their quality.
Diversity and repeatability checks: the extracted QA data is checked for repeatability using Jaccard similarity, cosine similarity, or other text similarity methods.
Statistical analysis: the frequency of occurrence of certain keywords or words is automatically counted to determine whether there are excessive duplicates or missing topics. The TF-IDF (term frequency-inverse document frequency, a commonly used weighting technique for information retrieval and data mining) or other technique is used to identify abnormal or rare words that appear in a question or answer.
Context consistency: the consistency of the answer in context is assessed using a pre-trained language model, ensuring that the answer is related to the text surrounding it.
Error analysis: possible text errors or inconsistencies are identified using automated tools, such as grammar checkers or text classifiers. And simultaneously, an automatic system is designed, and a data extraction model is optimized according to feedback and error continuous fine tuning.
Reference dataset comparison: an automatic scoring system is used to evaluate the quality of the QA data, as opposed to a quality verified seed dataset.
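Illustrative, dependency-free versions of two of these checks: Jaccard similarity for the repeatability check, with word overlap standing in for the embedding-based relevance comparison a real system would use.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def find_duplicates(questions: list[str], threshold: float = 0.8):
    """Flag index pairs of questions whose similarity exceeds the threshold."""
    return [(i, j) for i in range(len(questions))
            for j in range(i + 1, len(questions))
            if jaccard(questions[i], questions[j]) >= threshold]
```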
The embodiment of the disclosure can clean the question-answer data according to the third score, including removing low-quality question-answer data and removing sensitive or private data, to obtain target question-answer data meeting the preset requirements and guarantee the high quality of the question-answer data. In addition, data sample expansion is performed on the cleaned target question-answer data to ensure a sufficient data sample size of high-quality QA data. A hedged sketch of this cleaning step follows.
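A hedged sketch of the cleaning step: QA pairs below a score threshold are dropped, and one example category of sensitive data (phone-number-like strings) is redacted. The threshold and the regular expression are illustrative assumptions, not the patent's rules.

```python
import re

PHONE = re.compile(r"\b\d{3}[-\s]?\d{4}[-\s]?\d{4}\b")  # illustrative pattern

def clean_qa(qa_pairs: list[dict], third_scores: list[float],
             min_score: float = 0.7) -> list[dict]:
    """Drop low-quality pairs and redact sensitive strings in the rest."""
    kept = []
    for qa, score in zip(qa_pairs, third_scores):
        if score < min_score:
            continue                     # remove low-quality question-answer data
        kept.append({k: PHONE.sub("[REDACTED]", v) for k, v in qa.items()})
    return kept
```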
Meanwhile, after generating multiple pieces of QA data and obtaining a QA library, the embodiments of the present disclosure also offer some auxiliary strategies to support applying the QA library in more scenarios:
a. Incremental update strategy: as the business develops, an enterprise may generate new document data. An incremental update mechanism can be designed to periodically update and optimize the existing QA library, maintaining its freshness and relevance.
b. User feedback integration: if the QA library is used in customer support or other scenarios, user feedback on answers can be collected to continuously optimize and update the library. For example, users can rate the relevance, accuracy, and intelligibility of an answer, and the system keeps optimizing based on such feedback.
c. Hierarchical data labeling: annotate the questions and answers in the QA library hierarchically, for example by category and difficulty, so that relevant information can be matched and presented more precisely for user queries, improving the user experience.
d. Topic models: use a topic model for question clustering to maintain good question diversity in the QA library. This helps avoid over-concentration on a single topic and ensures the library covers multiple domains and knowledge points.
e. Visual data evaluation: provide visualization tools that present the process and results of QA score screening, displaying key indicators such as the distribution of questions and answers and trends in quality scores, helping the enterprise get an overview of data quality.
f. External knowledge base integration: combine the QA library with an external knowledge base to broaden the coverage of questions and answers and improve adaptability to user queries.
In the embodiment of the disclosure, strategies such as multi-task learning and reinforcement learning are applied to improve data quality. At the same time, methods such as model ensembling and data augmentation improve the robustness of the model and reduce the influence of uncertainty on model performance.
In some alternative embodiments, the cut document block includes text data obtained by picture recognition.
Optionally, in the embodiment of the present disclosure, after the server side obtains the document from which question-answer (QA) data is to be extracted (or mined), it may first determine whether the document is a picture document or a non-picture document. If it is a picture document, the picture is recognized to obtain the corresponding text content, and the document is then cut, for example into 10 equal parts, to obtain a plurality of cut document blocks, so that each cut document block contains text data obtained after picture recognition. A sketch of this branch follows.
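A sketch of the picture/non-picture branch; pytesseract is shown purely as an example OCR engine (an assumption on our part), since the text names OCR but no particular library.

```python
from PIL import Image
import pytesseract

def extract_text(path: str) -> str:
    """Recognize picture documents via OCR; read other documents directly."""
    if path.lower().endswith((".png", ".jpg", ".jpeg")):  # picture document
        return pytesseract.image_to_string(Image.open(path))
    with open(path, encoding="utf-8") as f:               # non-picture document
        return f.read()
```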
As shown in fig. 2, fig. 2 is a complete flow chart of a method for generating question-answer data according to some embodiments of the present disclosure, and the specific flow is as follows:
obtaining unstructured data, such as industry books, articles, papers and the like;
if the unstructured data is a picture, performing picture processing, extracting features of a visual encoder, extracting characters from the picture by utilizing an OCR technology, acquiring context information of the extracted characters, and performing text preprocessing and document cutting on the extracted characters and the context information together, for example, cutting the extracted characters into n parts of Wen Dangkuai with the size of less than 10 k; if the unstructured data is not a picture, preprocessing text in the document and cutting the document into n parts of Wen Dangkuai with the size of <10 k; selecting a question policy based on the text type of the words within the document block (e.g., regular document, special professional document, news story class, literature class, etc. in fig. 2); and extracting N questions by applying various question strategies and question models (namely an initial question model), inputting the N questions into a question scoring model for scoring, cycling through N rounds, screening out high-quality questions to optimize the question model, and finally outputting an optimal question to be input into a QA library and determining an answer strategy module.
When determining the answer strategy, the answer strategy is determined according to the text type of each document block (e.g., regular document, specialized professional document, news report, literature, etc. in fig. 2) and the input optimal question; the multiple answer strategies, each document block, and the full-volume data associated with the optimal question (obtained by vector processing of the unstructured data) are input into the answer model to obtain multiple answer data, which are input into the answer scoring model; through N rounds of cycling, high-quality answers are screened out to train and optimize the answer model, and the n answers are integrated to form the best answer, which is input into the QA library (a minimal sketch follows). The QA library therefore stores QA data in which the best question corresponds to the best answer.
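A minimal sketch of this answer branch, assuming hypothetical callables answer_model and answer_scorer in place of the answer model and answer scoring model; the halving schedule is again illustrative only.

    def mine_answers(block: str, question: str, full_data: str,
                     strategies: list[str], answer_model, answer_scorer,
                     n_rounds: int = 3) -> list[str]:
        # Generate one answer per strategy from the block, question and full data.
        answers = [answer_model(block, question, full_data, s) for s in strategies]
        # Over N rounds, score the answers and keep the high-quality half.
        for _ in range(n_rounds):
            answers.sort(key=answer_scorer, reverse=True)
            answers = answers[: max(1, len(answers) // 2)]
        return answers  # the surviving n answers are integrated into the best answer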
Each item of QA data in the QA library is then scored with the QA scoring model for quality inspection; meanwhile, the QA data are cleaned to remove low-quality data and sensitive private data, the cleaned QA data are expanded with additional samples, and finally all models are fine-tuned on the expanded, cleaned QA data (a cleaning-and-expansion sketch follows).
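The sketch below illustrates the cleaning and expansion step under stated assumptions: qa_scorer returns the quality score, is_sensitive flags private data, and paraphrase is a hypothetical augmentation helper; none of these names come from this disclosure.

    def clean_and_expand(qa_pairs: list[dict], qa_scorer, is_sensitive, paraphrase,
                         threshold: float = 0.7, target_size: int = 1000) -> list[dict]:
        # Remove low-quality QA data and sensitive private data.
        kept = [qa for qa in qa_pairs
                if qa_scorer(qa) >= threshold and not is_sensitive(qa)]
        # Expand the cleaned data with paraphrased samples up to the target size.
        expanded = list(kept)
        i = 0
        while kept and len(expanded) < target_size:
            expanded.append(paraphrase(kept[i % len(kept)]))
            i += 1
        return expanded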
In the above embodiments, data processing is completed locally, so enterprise data are not exposed to any third party, greatly improving data security. Through multiple rounds of training, covering the initial question model, the initial answer model, the question scoring model and the answer scoring model, the models continuously adapt to the mining strategies and styles of a specific enterprise or industry, realizing knowledge transfer and building an enterprise-specific mining model. Meanwhile, by balancing hardware resources against performance requirements and selecting a suitable model architecture and parameter settings, the demand for computing resources can be reduced without affecting the quality of the synthesized data.
This embodiment further provides a device for generating question-answer data, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a question-answer data generating device, as shown in fig. 3, including:
a first acquisition module 301, configured to acquire a plurality of cut document blocks, where each cut document block includes a first preset number of text data;
a first obtaining module 302, configured to obtain target question data corresponding to the cut document block according to the text data in the cut document block and a target question model;
a second acquisition module 303, configured to acquire full-volume data, where the full-volume data is content information, associated with the target question data, contained in the complete document composed of the plurality of cut document blocks;
a second obtaining module 304, configured to obtain target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data and a target answer model;
and a third obtaining module 305, configured to generate question-answer data according to the target question data and the target answer data.
In some alternative embodiments, the apparatus further comprises:
a first determining module, configured to determine a question strategy according to the text data in the cut document block before the target question data corresponding to the cut document block is obtained according to the text data in the cut document block and the target question model;
an extraction module, configured to process the text data according to the question strategy and extract a plurality of questions;
a third acquisition module, configured to acquire first scores of the plurality of questions according to a question scoring model;
a second determining module, configured to determine a plurality of candidate questions from the plurality of questions according to the first scores;
and a fourth obtaining module, configured to optimize the initial question model according to the candidate questions to obtain the target question model.
In some alternative embodiments, the first determining module includes:
a first determining unit, configured to determine a text type of the text data;
and a second determining unit, configured to determine the corresponding question strategy according to the text type.
In some alternative embodiments, the apparatus further comprises:
a third determining module, configured to determine an answer strategy according to the text data in the cut document block and the target question data before the target answer data corresponding to the target question data is obtained according to the cut document block, the target question data, the full-volume data and the target answer model;
a fifth obtaining module, configured to process the text data according to the answer strategy to obtain a plurality of answer data;
a fourth acquisition module, configured to acquire second scores of the plurality of answer data according to an answer scoring model;
a fourth determining module, configured to determine a plurality of candidate answer data from the plurality of answer data according to the second scores;
and a sixth obtaining module, configured to optimize the initial answer model according to the candidate answer data to obtain the target answer model.
In some alternative embodiments, the third determining module includes:
a third determining unit, configured to determine a text type of the text data;
and a fourth determining unit, configured to determine the answer strategy according to the text type and the target question data.
In some alternative embodiments, the second obtaining module 304 includes:
an obtaining unit, configured to obtain a plurality of answer data according to the cut document block, the target question data, the full-volume data and the target answer model;
a splitting unit, configured to split the answer data into minimum units to obtain a plurality of target fields;
a comparison unit, configured to compare the descriptive contents of the target fields in every second preset number of answer data;
a reservation unit, configured to reserve the descriptive content corresponding to the answer data with the highest second score;
and an integration unit, configured to integrate the descriptive contents to obtain the target answer data. A minimal field-level sketch of these units follows.
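The following is a minimal sketch of the splitting, comparison, reservation and integration units above, assuming each answer has already been split into {field: description} pairs and that answer_scorer is a hypothetical stand-in for the answer scoring model.

    def integrate_answers(split_answers: list[dict], answer_scorer) -> dict:
        # For each target field, keep the description from the answer with
        # the highest second score, then integrate the surviving descriptions.
        best: dict = {}
        for answer in split_answers:
            score = answer_scorer(answer)
            for field, description in answer.items():
                if field not in best or score > best[field][0]:
                    best[field] = (score, description)
        return {field: desc for field, (_, desc) in best.items()}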
In some alternative embodiments, the apparatus further comprises:
a scoring module, configured to score the question-answer data by using a question-answer data scoring model after the question-answer data about the cut document block are obtained, to obtain a third score of the question-answer data;
a cleaning module, configured to perform data cleaning on the question-answer data according to the third score, to obtain target question-answer data meeting preset requirements;
and an expansion module, configured to perform data sample expansion according to the target question-answer data, to obtain a third preset number of question-answer data.
In some alternative embodiments, the cut document block includes text data obtained after picture recognition.
The question-answer data generating device in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above functions.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiments of the present disclosure further provide a computer device equipped with the question-answer data generating device shown in fig. 3.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a computer device according to an alternative embodiment of the present disclosure. As shown in fig. 4, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 4.
The processor 10 may be a central processing unit, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods of the embodiments described above.
The memory 20 may include a storage program area and a storage data area, where the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created according to the use of the computer device, and the like. In addition, the memory 20 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory located remotely from the processor 10, and such remote memory may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
The embodiments of the present disclosure further provide a computer-readable storage medium. The methods described above according to the embodiments of the present disclosure may be implemented in hardware or firmware, or as computer code recorded on a storage medium, or as computer code originally stored in a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network, and then stored in a local storage medium, so that the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of the above kinds of memories. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present disclosure have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such modifications and variations are within the scope defined by the appended claims.

Claims (11)

1. A method for generating question-answer data, the method comprising:
acquiring a plurality of cut document blocks, wherein each cut document block comprises a first preset number of text data;
obtaining target question data corresponding to the cut document block according to the text data in the cut document block and a target question model;
acquiring full-volume data, wherein the full-volume data is content information, associated with the target question data, contained in a complete document composed of the plurality of cut document blocks;
obtaining target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data and a target answer model;
and generating the question-answer data according to the target question data and the target answer data.
2. The method of claim 1, wherein before said obtaining target question data corresponding to the cut document block according to the text data in the cut document block and the target question model, the method further comprises:
determining a question strategy according to the text data in the cut document block;
processing the text data according to the question strategy to extract a plurality of questions;
obtaining first scores of the plurality of questions according to a question scoring model;
determining a plurality of candidate questions from the plurality of questions according to the first scores;
and optimizing an initial question model according to the candidate questions to obtain the target question model.
3. The method of claim 2, wherein said determining a question strategy according to the text data in the cut document block comprises:
determining a text type of the text data;
and determining the corresponding question strategy according to the text type.
4. The method of claim 1, wherein before said obtaining target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data and the target answer model, the method further comprises:
determining an answer strategy according to the text data in the cut document block and the target question data;
processing the text data according to the answer strategy to obtain a plurality of answer data;
obtaining second scores of the plurality of answer data according to an answer scoring model;
determining a plurality of candidate answer data from the plurality of answer data according to the second scores;
and optimizing an initial answer model according to the candidate answer data to obtain the target answer model.
5. The method of claim 4, wherein said determining an answer strategy according to the text data in the cut document block and the target question data comprises:
determining a text type of the text data;
and determining the answer strategy according to the text type and the target question data.
6. The method of claim 4, wherein said obtaining target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data and the target answer model comprises:
obtaining a plurality of answer data according to the cut document block, the target question data, the full-volume data and the target answer model;
splitting the answer data into minimum units to obtain a plurality of target fields;
comparing the descriptive contents of the target fields in every second preset number of the answer data;
reserving the descriptive content corresponding to the answer data with the highest second score;
and integrating the descriptive contents to obtain the target answer data.
7. The method according to any one of claims 1 to 6, wherein after said obtaining question-answer data about said cut document block, the method further comprises:
scoring the question-answer data by using a question-answer data scoring model to obtain a third score of the question-answer data;
performing data cleaning on the question-answer data according to the third score to obtain target question-answer data meeting preset requirements;
and performing data sample expansion according to the target question-answer data to obtain a third preset number of question-answer data.
8. The method of any one of claims 1 to 6, wherein the cut document block includes therein text data after picture recognition.
9. A question-answer data generation apparatus, the apparatus comprising:
a first acquisition module, configured to acquire a plurality of cut document blocks, wherein each cut document block comprises a first preset number of text data;
a first obtaining module, configured to obtain target question data corresponding to the cut document block according to the text data in the cut document block and a target question model;
a second acquisition module, configured to acquire full-volume data, wherein the full-volume data is content information, associated with the target question data, contained in a complete document composed of the plurality of cut document blocks;
a second obtaining module, configured to obtain target answer data corresponding to the target question data according to the cut document block, the target question data, the full-volume data and a target answer model;
and a third obtaining module, configured to generate the question-answer data according to the target question data and the target answer data.
10. A computer device, comprising:
a memory and a processor communicatively coupled to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of generating question-answer data of any one of claims 1 to 7.
11. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of generating question-answer data of any one of claims 1 to 7.
CN202311434678.6A 2023-10-31 2023-10-31 Question-answer data generation method and device, computer equipment and storage medium Pending CN117493508A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202311434678.6A CN117493508A (en) 2023-10-31 2023-10-31 Question-answer data generation method and device, computer equipment and storage medium
PCT/CN2024/107859 WO2025092056A1 (en) 2023-10-31 2024-07-26 Question-and-answer data generation method and apparatus, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311434678.6A CN117493508A (en) 2023-10-31 2023-10-31 Question-answer data generation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117493508A true CN117493508A (en) 2024-02-02

Family

ID=89680816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311434678.6A Pending CN117493508A (en) 2023-10-31 2023-10-31 Question-answer data generation method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN117493508A (en)
WO (1) WO2025092056A1 (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532369B (en) * 2019-09-04 2022-02-01 腾讯科技(深圳)有限公司 Question and answer pair generation method and device and server
CN115114416A (en) * 2021-03-23 2022-09-27 阿里巴巴新加坡控股有限公司 A question-answer pair generation method, device, electronic device and computer storage medium
CN114817478A (en) * 2022-05-13 2022-07-29 润联软件系统(深圳)有限公司 Text-based question answering method, device, computer equipment and storage medium
CN117493508A (en) * 2023-10-31 2024-02-02 抖音视界有限公司 Question-answer data generation method and device, computer equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025092056A1 (en) * 2023-10-31 2025-05-08 抖音视界有限公司 Question-and-answer data generation method and apparatus, and computer device and storage medium
CN118939766A (en) * 2024-07-18 2024-11-12 北京深势科技有限公司 Processing method and device for adjusting answer text based on adaptive retrieval enhancement mechanism
CN119474278A (en) * 2025-01-15 2025-02-18 杭州华策影视科技有限公司 Question answering method, system, computer device and storage medium based on large model

Also Published As

Publication number Publication date
WO2025092056A1 (en) 2025-05-08

Similar Documents

Publication Publication Date Title
US11762926B2 (en) Recommending web API&#39;s and associated endpoints
US20210150130A1 (en) Methods for generating natural language processing systems
US7685082B1 (en) System and method for identifying, prioritizing and encapsulating errors in accounting data
CN117493508A (en) Question-answer data generation method and device, computer equipment and storage medium
CN119398092B (en) Construction method and device of multi-group data intelligent agent
CN118886519B (en) Model training method, data processing method, electronic device and storage medium
CN118227106A (en) Code complement method, device, electronic equipment and medium
CN113420153B (en) Topic making method, device and equipment based on topic library and event library
CN118260717A (en) Internet low-orbit satellite information mining method, system, device and medium
Yang et al. Evaluation and assessment of machine learning based user story grouping: A framework and empirical studies
CN119807387A (en) Method for generating response statement and related device
CN113934450B (en) Method, apparatus, computer device and medium for generating annotation information
CN118966343A (en) Question and answer knowledge base construction method, device, equipment and storage medium
CN117235138B (en) A cross-library API recommendation method during code migration
CN120045645A (en) Model training method, device, equipment, medium and program product
CN117473054A (en) Knowledge graph-based general intelligent question-answering method and device
Li et al. IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
Ulianovska et al. Study of the process of identifying the authorship of texts written in natural language
Watson Deep learning in software engineering
CN113326348A (en) Blog quality evaluation method and tool
CN119782507B (en) Automatic scientific and technological new report generation method and system based on database and large model
KR102841252B1 (en) Server and Method for Operating Major-Specific AI Tutor Based on Adaptive Data Retraining and Multi-Level Embedding
Harrag et al. Mining stack overflow: a recommender systems-based model
Montesuma et al. An Empirical Study of Information Retrieval and Machine Reading Comprehension Algorithms for an Online Education Platform
Alzuru Human-machine extraction of Information from biological collections

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination