CN108959412B - Method, device and equipment for generating labeled data and storage medium - Google Patents

Info

Publication number
CN108959412B
Authority
CN
China
Prior art keywords
sample
data
condition information
annotation
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810580489.2A
Other languages
Chinese (zh)
Other versions
CN108959412A (en)
Inventor
王晓雪
吴世伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mobvoi Information Technology Co ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd filed Critical Mobvoi Information Technology Co Ltd
Priority to CN201810580489.2A priority Critical patent/CN108959412B/en
Publication of CN108959412A publication Critical patent/CN108959412A/en
Application granted granted Critical
Publication of CN108959412B publication Critical patent/CN108959412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present invention disclose a method, an apparatus, a device, and a storage medium for generating labeled data. The method includes: acquiring sample condition information that is provided by a data demander and matches a demand sample, wherein the sample condition information includes the current semantic understanding protocol of the demand sample, the historical semantic understanding protocol of a historical sample associated with the demand sample, the sample type of the demand sample, and the grammar rule of the demand sample; providing the sample condition information to at least one data annotating party, and acquiring the alternative annotation samples that the data annotating party generates for the sample condition information; performing a rationality check on the alternative annotation samples according to the sample condition information to obtain a target annotation sample; and constructing structured labeled data according to the target annotation sample and the sample condition information. The method efficiently acquires the data required by a multi-round interactive system, simplifies the data acquisition process, and reduces labor cost.

Description

Method, device and equipment for generating labeled data and storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a method, a device, equipment and a storage medium for generating labeled data.
Background
Multi-round interactive systems are increasingly widely used in intelligent electronic products. For example, multi-round interaction based on a contextual dialog scene plays a very important role in intelligent question answering, where it is both an important function and a major challenge. In practical applications, the problem that an intelligent question-answering system must solve is likely to involve complex, process-oriented knowledge rather than a simple single question and answer.
At present, rule-based models are more commonly used in multi-round interactive systems. As the application scenes of the multi-turn interactive system become more and more complex, the pure rule-based model has difficulty meeting the requirements of the interactive system. Compared with a rule-based model, the statistical model is more flexible, has statistical significance, and can be suitable for complex interaction scenes. But large-scale interactive data is required to train the statistical model.
In the prior art, there are two main ways of acquiring interactive data for a multi-round interactive system: acquiring it from log files and constructing it manually. Acquiring data from log files is convenient: tracking points only need to be set in advance for the data to be collected, and the tracked data is extracted after the log file is generated. Constructing data manually requires that the requirements for the data be clearly defined first and that the data then be constructed according to those requirements.
In the process of implementing the invention, the inventors found that the prior art has the following defects: acquiring interactive data for training from log files depends excessively on the performance of the interactive system, and only interactive data for scenes that the system already supports can be acquired; manually constructing interactive data for training is complicated and inefficient and consumes a large amount of labor time.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for generating labeled data, which can be used for efficiently acquiring data of a required multi-round interactive system, simplifying a data acquisition process and reducing labor cost.
In a first aspect, an embodiment of the present invention provides a method for generating annotation data, including:
acquiring sample condition information which is provided by a data demander and matched with a demand sample;
wherein the sample condition information includes: a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample, a sample type of the demand sample, and a grammar rule of the demand sample;
providing the sample condition information to at least one data labeling party, and acquiring an alternative labeling sample generated by the data labeling party for the sample condition information;
performing rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample;
and constructing structured labeling data according to the target labeling sample and the sample condition information.
In a second aspect, an embodiment of the present invention further provides a device for generating annotation data, where the device includes:
the information acquisition module is used for acquiring sample condition information which is provided by a data demander and matched with a demand sample;
wherein the sample condition information includes: a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample, a sample type of the demand sample, and a grammar rule of the demand sample;
the sample obtaining module is used for providing the sample condition information to at least one data labeling party and obtaining an alternative labeling sample generated by the data labeling party for the sample condition information;
the sample checking module is used for checking the rationality of the alternative marked sample according to the sample condition information to obtain a target marked sample;
and the data construction module is used for constructing structured labeling data according to the target labeling sample and the sample condition information.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement any of the above-mentioned methods for generating annotation data.
In a fourth aspect, an embodiment of the present invention further provides a computer storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements any of the above-mentioned methods for generating annotation data.
In the embodiments of the present invention, the sample condition information that is provided by the data demander and matches the demand sample is provided to at least one data annotating party, and a rationality check is performed, according to the sample condition information, on the alternative annotation samples generated by the data annotating party to obtain the target annotation sample; structured labeled data is then constructed according to the target annotation sample and the sample condition information. This solves the problems of the prior art, such as the complicated process and low efficiency of acquiring data for a multi-round interactive system, achieves the technical effect of efficiently acquiring the data required by the multi-round interactive system, simplifies the data acquisition process, and reduces labor cost.
Drawings
Fig. 1 is a flowchart of a method for generating annotation data according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a method for generating annotation data according to a second embodiment of the present invention;
Fig. 3 is a flowchart of a method for generating annotation data according to a third embodiment of the present invention;
Fig. 4 is a flowchart of a method for generating annotation data according to a fourth embodiment of the present invention;
Fig. 5 is a schematic diagram of a device for generating annotation data according to a fifth embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a computer device according to a sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.
It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a method for generating annotation data according to a first embodiment of the present invention. This embodiment is applicable to efficiently acquiring annotation data for a multi-round interactive system. The method can be executed by an apparatus for generating annotation data; the apparatus can be implemented in software and/or hardware and can generally be integrated in various computer devices. The method specifically includes the following steps:
S110: Acquire sample condition information that is provided by the data demander and matches the demand sample.
Wherein the sample condition information includes: a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of the historical sample associated with the demand sample, a sample type of the demand sample, and a grammar rule of the demand sample.
The data demander generally refers to a user who builds or optimizes a multi-round interactive system. This user provides the conditions (i.e., the sample condition information) for the data (i.e., the demand samples) needed to build or optimize the multi-round interactive system, asks the data annotating party to construct the required data according to these conditions, and trains the statistical model used in the multi-round interactive system on the constructed data.
In this embodiment, the demand sample may be a data sample that meets a given interaction requirement in any domain or a specific domain and in any scene or a specific scene; optionally, the demand sample may be text data. Typically, the demand sample may include the interaction utterance (query) of the user side in the current conversation turn. The current conversation turn may specifically be the turn, in a question-and-answer voice interaction scenario (the user side asks a question, and the machine side returns the corresponding answer), in which a question entered by the user is obtained in real time. The interaction utterance in the current conversation turn may specifically be the question that the user enters in real time in that scenario.
It can be understood that, when the statistical model used by the multi-round interactive system is trained, the model is expected to correctly understand the actual semantics of the interaction utterance (query) entered at the user side and then give the user corresponding feedback based on the understood semantics, so as to meet the user's actual information need. In a specific example, the utterance entered into the multi-round interactive system is "Are there any coffee shops nearby?"; the user's actual semantics is to obtain information about "coffee shop" entities in the surrounding area. If the multi-round question-answering system correctly identifies these semantics, it can return the required information, namely: "The coffee shops near you are: coffee shop A, coffee shop B, ...". Accordingly, the current semantic understanding protocol of the demand sample may be the actual semantics or scene conditions, such as domain, intent, or semantic information, that the data demander designs the demand sample to satisfy according to actual requirements. Typically, the data demander may define the current semantic understanding protocol of the demand sample with data in JSON format. In a specific example, the current semantic understanding protocol may be defined as "domain = catering; intent = takeaway". By defining the current semantic understanding protocol, the data demander can request from the data annotating party demand samples that conform to it, and the data annotating party can accordingly construct an utterance that conforms to it, such as "I want to order takeout from a nearby restaurant".
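As an illustration of a protocol defined in JSON format, the following minimal sketch serializes the "domain = catering; intent = takeaway" example and parses it back. The key names `domain` and `intent` are assumptions chosen for illustration; the text does not fix a concrete JSON schema.

```python
import json

# Hypothetical key names: the text only states that the protocol is JSON data
# carrying information such as the domain and the intent.
current_protocol = {"domain": "catering", "intent": "takeaway"}

# Serialize the protocol so it can be handed to a data annotating party.
protocol_text = json.dumps(current_protocol, ensure_ascii=False, indent=2)
print(protocol_text)

# The receiving side parses the same text back into a dictionary.
parsed = json.loads(protocol_text)
print(parsed["domain"], parsed["intent"])
```

Keeping the protocol as plain JSON means that both a human annotator and a checking program can read the same definition.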
Further, the historical samples associated with the demand sample include the interaction utterances of the user side and/or the system side in at least one historical conversation turn associated with the current conversation turn. The historical conversation turns associated with the current conversation turn are the question-and-answer turns that were completed before the current conversation turn.
In this embodiment, the inventors consider that the historical semantic understanding protocol corresponding to the utterances in one or more historical conversation turns adjacent to the current conversation turn may contain semantic actions that are the same as or similar to those of the current semantic understanding protocol corresponding to the utterance in the current conversation turn. Therefore, the historical semantic understanding protocol corresponding to the utterances of the user side and/or the system side in at least one historical conversation turn associated with the current conversation turn is acquired together with the current semantic understanding protocol corresponding to the utterance in the current conversation turn, which helps the data annotating party generate alternative annotation samples for the sample condition information accurately and quickly. In addition, adding the historical semantic understanding protocol of the historical sample associated with the demand sample to the sample condition information allows the acquired data to be applied to complex scenes such as context-based dialog, which meets the demand for intelligent interaction.
For example, the utterance entered at the user side in a historical conversation turn is "Please recommend restaurants nearby", and the utterance returned by the machine side in that historical conversation turn is "The restaurants near you are: restaurant A, restaurant B, and restaurant C". If the utterance entered at the user side in the current conversation turn asks for Sichuan restaurants and rules out Cantonese restaurants, the machine side can recognize, based on the historical conversation turn, that the user actually wants to find the Sichuan restaurants among the nearby restaurants and to exclude the Cantonese restaurants. If the multi-round question-answering system correctly identifies these semantics, it can return the required information, namely: "Restaurant C near you is a Sichuan restaurant".
Correspondingly, in order to enable the data annotating party to accurately construct the requirement sample, the sample condition information further includes: a historical semantic understanding protocol of a historical sample associated with a demand sample. The historical semantic understanding protocol can be the actual semantics or scene conditions of the historical sample associated with the demand sample, such as information of a domain, an intention or a semantic.
Generally, in training a statistical model, negative samples are required in addition to positive samples. Correspondingly, the sample type of the demand sample in the sample condition information is specifically used for specifying that the type of the demand sample is a positive sample or a negative sample; the grammar rule of the requirement sample in the sample condition information specifically refers to a grammar habit that the requirement sample must follow, such as: the predicate and object must be included, which specific fields must be included or which specific fields must not be included, etc.
In the embodiment of the invention, optionally, the data demander only provides the sample condition information and is not responsible for operations such as labeling and verifying the required sample, generating the data and the like, so that the generation efficiency of the labeled data can be effectively improved.
In an optional embodiment of the present invention, in the current semantic understanding protocol, a first target field associated with the requirement sample and a field value corresponding to the first target field are defined in JSON format; defining a second target field associated with the history sample and a field value corresponding to the second target field in a JSON format in the history semantic understanding protocol; in the grammar rule, a third target field which must be contained in the requirement sample and a fourth target field which cannot be contained in the requirement sample are defined; the sample types of the demand sample include: a positive sample type that the requirement sample conforms to the current semantic understanding protocol context, or a negative sample type that the requirement sample does not conform to the current semantic understanding protocol context; wherein the first target field is the same as the second target field, the first target field or the second target field including at least one of: domain, intent, semantic action, and slot information.
The domain is the subject area to which the data in the demand sample relates; for example, the domain corresponding to dishes is catering, and the domain corresponding to mountain climbing is sports. The intent is the goal that the data in the demand sample is meant to achieve; for example, for "find Sichuan restaurants, exclude Cantonese restaurants", the intent of the demand sample is to find a restaurant. The semantic action is information that the machine side can understand and process; for example, if the content entered at the user side is "good morning", the semantic action that the machine side obtains after semantic understanding is a greeting. The slot information is the semantic category of each word. Specifically, a number of information slots, such as dishes, clothes, and electronic devices, may be preset, and several corresponding field values are set for each information slot; for example, the slot values of the "dish" information slot may include Hunan cuisine, Cantonese cuisine, Sichuan cuisine, and so on.
JSON is a lightweight data exchange format, is easy for human reading and writing, and is also easy for machine analysis and generation. In the embodiment of the invention, the first target field, the second target field and the corresponding field value in the current semantic understanding protocol and the historical semantic understanding protocol are defined through the JSON format, so that a data annotation party can conveniently and quickly read the target field and the field value in the semantic understanding protocol. Meanwhile, fields which must be contained and cannot be contained in the requirement sample are regulated through the grammar rule, so that the accuracy of the alternative labeling sample generated by the data labeling party can be improved. In addition, the positive sample type and the negative sample type are set for the demand sample, so that the alternative annotation sample can be further ensured to meet the demand of the demand sample.
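The four parts of the sample condition information described above can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the class names `GrammarRule` and `SampleConditionInfo`, the key names in `ALLOWED_FIELDS`, and the strings "positive" and "negative" are all hypothetical and do not come from the text.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

# Assumed key names for the four target fields named in the text:
# domain, intent, semantic action, and slot information.
ALLOWED_FIELDS = {"domain", "intent", "semantic_action", "slots"}


@dataclass
class GrammarRule:
    must_contain: List[str] = field(default_factory=list)      # "third target field"
    must_not_contain: List[str] = field(default_factory=list)  # "fourth target field"


@dataclass
class SampleConditionInfo:
    current_protocol: Dict[str, Any]
    history_protocol: Dict[str, Any]
    sample_type: str  # "positive" or "negative" (assumed encoding)
    grammar_rule: GrammarRule

    def __post_init__(self) -> None:
        if self.sample_type not in ("positive", "negative"):
            raise ValueError(f"unknown sample type: {self.sample_type!r}")
        # Both protocols draw their keys from the same set of target fields.
        for protocol in (self.current_protocol, self.history_protocol):
            unknown = set(protocol) - ALLOWED_FIELDS
            if unknown:
                raise ValueError(f"unexpected protocol fields: {sorted(unknown)}")


info = SampleConditionInfo(
    current_protocol={"domain": "catering", "intent": "find a restaurant",
                      "slots": {"dish": "Sichuan cuisine"}},
    history_protocol={"domain": "catering", "intent": "takeaway"},
    sample_type="positive",
    grammar_rule=GrammarRule(must_contain=["restaurant"], must_not_contain=["drink"]),
)

# asdict() converts nested dataclasses, so the whole object serializes to JSON.
info_json = json.dumps(asdict(info), ensure_ascii=False)
print(info_json)
```

Validating the field names when the object is created is a design choice of this sketch; it surfaces a malformed condition before it reaches any annotating party.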
For example, the current semantic understanding protocol of the demand sample in the sample condition information may be: domain = restaurant; intent = find a restaurant; semantic actions = (information slot = dish, slot value = Sichuan cuisine, user intent = Yes) and (information slot = dish, slot value = Cantonese cuisine, user intent = No); slot information = (information slot = dish, slot value = Sichuan cuisine) and (information slot = dish, slot value = Cantonese cuisine). The historical semantic understanding protocol of the historical sample associated with the demand sample may be: domain = catering; intent = takeaway; semantic action = obtain nearby restaurants that provide a takeaway service. The sample type of the demand sample may be the positive sample type or the negative sample type. The grammar rule may be that the fields that must be contained are "restaurant", "takeout", "dish", and so on, and that the fields that must not be contained are "drink", "cold drink", and so on.
Correspondingly, the demand samples of the positive sample type that the data annotating party constructs according to the sample condition information may include "Choose restaurants with strong flavors and exclude restaurants with light flavors" or "I want Sichuan-style dishes, not Cantonese-style dishes"; the demand samples of the negative sample type may be "I want to find the movie theater next to restaurant XX" or "How is dish XX made?", and so on.
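The grammar rule in the example above (fields that must and must not be contained) could be checked mechanically. The sketch below is one possible reading, not the method of the embodiment: it treats each field as a case-insensitive substring, and by default it accepts a sample when at least one required field appears. The function name `satisfies_grammar_rule`, the `require_all` switch, and the English sample sentences are hypothetical.

```python
from typing import Sequence


def satisfies_grammar_rule(
    text: str,
    must_contain: Sequence[str],
    must_not_contain: Sequence[str],
    require_all: bool = False,
) -> bool:
    """Return True if `text` respects the must-contain / must-not-contain rule."""
    lowered = text.lower()
    hits = [term.lower() in lowered for term in must_contain]
    if not must_contain:
        required_ok = True  # nothing is required
    elif require_all:
        required_ok = all(hits)
    else:
        required_ok = any(hits)  # assumption: one required field is enough
    forbidden_hit = any(term.lower() in lowered for term in must_not_contain)
    return required_ok and not forbidden_hit


must_contain = ["restaurant", "takeout", "dish"]
must_not_contain = ["drink", "cold drink"]

print(satisfies_grammar_rule(
    "Find me a restaurant that serves Sichuan dishes", must_contain, must_not_contain))
print(satisfies_grammar_rule(
    "Which restaurant sells cold drinks?", must_contain, must_not_contain))
```

Substring matching is deliberately simple; a production check on Chinese text would more likely rely on word segmentation or slot tagging.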
In an optional embodiment of the invention, the sample condition information further comprises: interactive examples of the current conversation turn matching the demand sample, and interactive examples of the historical conversation turn matching the historical sample; wherein the interactive instance of the current conversation turn is in compliance with a current semantic understanding protocol of the requirement sample; the interactive examples of the historical dialog turn are in accordance with a historical semantic understanding protocol of the historical sample.
Wherein, the interactive example of the current conversation turn and the interactive example of the historical conversation turn are both examples provided by the data demand side according to the required demand sample, and the two examples need to be respectively in accordance with the current semantic understanding protocol of the demand sample and the historical semantic understanding protocol of the historical sample. In other words, the interactive example of the current conversation turn is an illustration of a sample of the needs required by the data demander.
For example, the current semantic understanding protocol of the demand sample is: domain = navigation; semantic action = search for and locate a bus stop near Beijing University; information slot = university, slot value = Beijing University. The historical semantic understanding protocol of the historical sample is: domain = navigation; semantic action = provide a route to Beijing University; information slot = university, slot value = Beijing University.
Accordingly, an interactive example of the current conversation turn for a demand sample of the positive sample type may be "Where is a bus stop near Beijing University?", which conforms to the current semantic understanding protocol of the demand sample; an interactive example of the current conversation turn for a demand sample of the negative sample type may be "Which shopping mall near Beijing University has the best reputation?", which does not conform to the current semantic understanding protocol of the demand sample; and an interactive example of the historical conversation turn for the historical sample associated with the demand sample may be "How do I get to Beijing University?", and so on.
In the embodiment of the present invention, the data annotating party can quickly locate the requirements of the data demanding party on the current demand sample according to the two examples. Through the interactive example of the current conversation turn, the data annotation party can quickly know the context of the requirement sample and the specific requirement sample example, and construct an alternative annotation sample according to the content.
S120: Provide the sample condition information to at least one data annotating party, and acquire the alternative annotation samples that the data annotating party generates for the sample condition information.
The data annotating party generally refers to a party that constructs the annotation data required by the data demander according to the conditions provided by the data demander (the sample condition information matching the demand sample). There may be one data annotating party or several; optionally, several data annotating parties may be used to ensure that the target annotation samples and the structured labeled data meet the quantity of annotation samples that a preset model requires. An alternative annotation sample is a sample (usually one of many) that the data annotating party constructs according to the sample condition information and that should, in principle, satisfy the sample condition information.
In the embodiment of the invention, the data demander provides the data annotating party with sample condition information that reflects the requirements of the current demand sample; after acquiring this sample condition information, the data annotating party learns what the current sample requires and constructs alternative annotation samples that match the current demand sample accordingly. In short, the data demander states the requirements, and the data annotating party constructs samples in batches according to them. The data demander and the data annotating party thus divide the work and cooperate to generate the alternative annotation samples, and the remaining steps are handled by the device, which can effectively reduce labor cost, improve the efficiency of acquiring data for the multi-round interactive system, and simplify the data acquisition process.
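The hand-off in S120 can be pictured as sending one condition to several annotating parties and pooling what they return. In this minimal sketch each annotating party is modeled as a plain Python callable, which is purely an assumption for illustration (in practice these are human annotators); the name `collect_candidates` and the removal of duplicate texts are also hypothetical additions.

```python
from typing import Callable, Dict, List

# An annotating party is modeled as a function from the condition to sample texts.
Annotator = Callable[[Dict[str, object]], List[str]]


def collect_candidates(
    condition_info: Dict[str, object], annotators: List[Annotator]
) -> List[str]:
    """Send the same condition to every annotator and pool the distinct results."""
    seen = set()
    candidates: List[str] = []
    for annotate in annotators:
        for text in annotate(condition_info):
            normalized = text.strip()
            # Skip empty strings and texts another annotator already produced.
            if normalized and normalized not in seen:
                seen.add(normalized)
                candidates.append(normalized)
    return candidates


def annotator_a(info: Dict[str, object]) -> List[str]:
    return ["I want to order takeout from a nearby restaurant", "  "]


def annotator_b(info: Dict[str, object]) -> List[str]:
    return ["I want to order takeout from a nearby restaurant",
            "Which restaurants nearby deliver?"]


condition = {"domain": "catering", "intent": "takeaway"}
pooled = collect_candidates(condition, [annotator_a, annotator_b])
print(pooled)
```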
S130, performing rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample.
The rationality check specifically refers to a check operation performed on whether the alternative annotation sample meets the requirement of the current requirement sample.
In this embodiment, the inventors consider that although the data annotating party generates the alternative annotation samples according to the sample condition information, subjective factors mean that, in practice, some of the generated alternative annotation samples will inevitably fail to match the demand sample. Adding a rationality check on the alternative annotation samples therefore helps ensure high accuracy of the target annotation samples finally obtained, and it also saves the cost and time of checking the alternative annotation samples manually, which simplifies the data acquisition process and improves data acquisition efficiency.
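One way to picture the rationality check is as a field-by-field comparison between an alternative annotation sample and the current semantic understanding protocol. The sketch below assumes that every alternative sample arrives with its own annotated field values under a `fields` key; that layout and the function names are hypothetical. It covers only the positive-sample case, since a negative sample is by definition one that does not conform to the protocol and would need a different rule.

```python
from typing import Any, Dict, List


def is_reasonable(candidate_fields: Dict[str, Any], current_protocol: Dict[str, Any]) -> bool:
    """True if every field of the protocol has the same value in the candidate."""
    return all(candidate_fields.get(name) == value
               for name, value in current_protocol.items())


def select_target_samples(
    candidates: List[Dict[str, Any]], current_protocol: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep only the candidates that pass the check; these become target samples."""
    return [c for c in candidates if is_reasonable(c["fields"], current_protocol)]


protocol = {"domain": "catering", "intent": "takeaway"}
candidates = [
    {"text": "I want to order takeout from a nearby restaurant",
     "fields": {"domain": "catering", "intent": "takeaway"}},
    {"text": "How do I get to the nearest cinema?",
     "fields": {"domain": "navigation", "intent": "route"}},
]

targets = select_target_samples(candidates, protocol)
print([t["text"] for t in targets])
```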
S140: Construct structured labeled data according to the target annotation sample and the sample condition information.
The structured labeling data is data formed by taking a target labeling sample as a data source and extracting effective data of the data source according to sample condition information.
In the embodiment of the present invention, the purpose of constructing the structured data is to obtain the structural features of the target annotation sample, and optionally, the target annotation sample with the structural features may be used for training the model.
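As a rough illustration of S140, a target annotation sample can be combined with the sample condition information into one structured record per sample. The record layout below, the key names, and the choice of one JSON object per line are assumptions made for this sketch; the text does not prescribe an output format.

```python
import json
from typing import Any, Dict, List


def build_structured_record(target_text: str, condition_info: Dict[str, Any]) -> Dict[str, Any]:
    """Attach the relevant condition fields to one target annotation sample."""
    return {
        "text": target_text,
        "sample_type": condition_info["sample_type"],
        "current_protocol": condition_info["current_protocol"],
        "history_protocol": condition_info["history_protocol"],
    }


def to_json_lines(records: List[Dict[str, Any]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


condition_info = {
    "sample_type": "positive",
    "current_protocol": {"domain": "catering", "intent": "takeaway"},
    "history_protocol": {"domain": "catering", "intent": "find a restaurant"},
}
records = [
    build_structured_record("I want to order takeout from a nearby restaurant", condition_info)
]
output = to_json_lines(records)
print(output)
```

Each line is then self-describing, so a training script can read the utterance together with its labels without consulting the original condition.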
In this embodiment of the invention, the sample condition information that is provided by the data demander and matches the demand sample is provided to at least one data annotating party; rationality verification is performed, according to the sample condition information, on the alternative annotation samples generated by the data annotating party to obtain target annotation samples; and structured annotation data are constructed according to the target annotation samples and the sample condition information. This solves problems of the prior art, such as the complicated flow and low efficiency of acquiring data for a multi-round interactive system, achieves the technical effect of efficiently acquiring the data required by the multi-round interactive system, simplifies the data acquisition flow, and reduces the labor cost.
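The flow of this embodiment (providing the condition information, collecting alternative samples, checking them, and building structured data) can be sketched roughly as follows. This is only an illustrative outline, not the implementation of the invention: the class name `SampleConditionInfo`, the function `generate_annotation_data`, the dictionary keys and the "positive"/"negative" encoding are all assumptions introduced here, and the annotating parties, the rationality check and the record builder are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SampleConditionInfo:
    """Sample condition information supplied by the data demander."""
    current_protocol: dict   # current semantic understanding protocol
    history_protocol: dict   # historical semantic understanding protocol
    sample_type: str         # "positive" or "negative" (assumed encoding)
    grammar_rule: dict       # assumed keys: "must_contain", "must_not_contain"


def generate_annotation_data(
    condition: SampleConditionInfo,
    annotators: Iterable[Callable[[SampleConditionInfo], List[str]]],
    is_reasonable: Callable[[str, SampleConditionInfo], bool],
    build_record: Callable[[str, SampleConditionInfo], dict],
) -> List[dict]:
    # S120: hand the condition information to every annotating party and
    # collect the alternative annotation samples they produce.
    candidates = [s for annotate in annotators for s in annotate(condition)]
    # S130: keep only the samples that pass the rationality check.
    targets = [s for s in candidates if is_reasonable(s, condition)]
    # S140: build one structured record per target annotation sample.
    return [build_record(s, condition) for s in targets]
```

Keeping the check and the record builder as parameters mirrors the way the later embodiments refine steps S130 and S140 independently of each other.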
Example two
Fig. 2 is a flowchart of a method for generating annotation data according to a second embodiment of the present invention, which is embodied on the basis of the foregoing embodiment. In this embodiment, performing a rationality check on the alternative annotation sample according to the sample condition information to obtain a target annotation sample specifically includes: acquiring the current semantic understanding protocol of the demand sample in the sample condition information; acquiring, in the alternative annotation sample, a field value to be verified corresponding to a first target field included in the current semantic understanding protocol; and, if the field value to be verified is determined to match the field value corresponding to the first target field in the current semantic understanding protocol, determining the alternative annotation sample as the target annotation sample. Correspondingly, as shown in fig. 2, the method of the present embodiment may include:
S210, obtaining sample condition information matched with the demand sample and provided by the data demand party.
S220, providing the sample condition information to at least one data labeling party, and acquiring an alternative labeling sample generated by the data labeling party for the sample condition information.
S230, performing rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample.
Specifically, S230 may include:
S231, obtaining the current semantic understanding protocol of the demand sample in the sample condition information.
In the embodiment of the invention, when the rationality of the alternative annotation sample is checked according to the sample condition information, the rationality of the alternative annotation sample can be checked according to the relevant field information in the current semantic understanding protocol of the required sample in the sample condition information.
S232, obtaining a field value to be verified corresponding to a first target field included in the current semantic understanding protocol from the alternative labeling sample.
In one specific example, for the alternative annotation sample "Where is the bus station near Beijing University?", the current semantic understanding protocol obtained is: "domain = navigation; search for and locate a bus station near Beijing University; information slot = university, slot value = Beijing University";
accordingly, for each first target field included in the current semantic understanding protocol, the field value of that first target field in the alternative annotation sample is acquired as the field value to be verified. For example, the field values to be verified that are acquired from an alternative annotation sample for the first target fields included in the current semantic understanding protocol may be: "domain = navigation; search for and locate a bus station near Tianjin University; information slot = university, slot value = Tianjin University".
S233, determining whether the field value to be verified matches the field value corresponding to the first target field in the current semantic understanding protocol, if yes, executing S234, otherwise, executing S235.
In one specific example, for the alternative annotation sample "Where is the bus station near Tianjin University?", the acquired field values to be verified, corresponding to the first target fields included in the current semantic understanding protocol, are: "domain = navigation; search for and locate a bus station near Tianjin University; information slot = university, slot value = Tianjin University". If the first target fields and the corresponding field values of the current semantic understanding protocol of the demand sample are "domain = navigation; search for and locate a bus station near Tianjin University; information slot = university, slot value = Tianjin University", the field values to be verified match the field values corresponding to the first target fields in the current semantic understanding protocol. If the first target fields and the corresponding field values of the current semantic understanding protocol of the demand sample are instead "domain = navigation; search for and locate a bus station near Beijing University; information slot = university, slot value = Beijing University", the field values to be verified do not match the field values corresponding to the first target fields in the current semantic understanding protocol.
It should be noted that the field value to be verified and the field value corresponding to the first target field in the current semantic understanding protocol do not need to completely correspond one to one, and when the field value to be verified and the field value corresponding to the first target field in the current semantic understanding protocol are synonyms, it can also be considered that the field value to be verified is matched with the field value corresponding to the first target field in the current semantic understanding protocol.
For example, for the alternative annotation sample "restaurants nearby", the acquired field values to be verified, corresponding to the first target fields included in the current semantic understanding protocol, are: "domain = catering; semantic action = search for surrounding restaurants". If the first target fields and the corresponding field values of the current semantic understanding protocol of the demand sample are "domain = diet; semantic action = search for nearby restaurants", the field values to be verified are still considered to match the field values corresponding to the first target fields in the current semantic understanding protocol, because the corresponding values are synonyms.
S234, determining the alternative annotation sample as the target annotation sample.
In the embodiment of the invention, the alternative annotation sample which passes the rationality check is determined as the target annotation sample.
S235, deleting the alternative labeling sample.
Correspondingly, if the alternative annotation sample does not pass the rationality check, the alternative annotation sample is deleted. Or, all the alternative labeled samples which do not pass the rationality check can be summarized into a set, the set is fed back to the data labeling party, and the data labeling party checks and corrects the samples in the set. Since the data annotation party already complies with the sample condition information provided by the data demand party when generating the alternative annotation sample, the number of the alternative annotation samples which do not pass the rationality check is not excessive. Even if the set containing the alternative annotation samples which do not pass the rationality verification is fed back to the data annotation party for correction, too much workload is not added to the data annotation party.
S240, structured labeling data are constructed according to the target labeling sample and the sample condition information.
The embodiment of the invention verifies whether the candidate annotation sample meets the rationality check by verifying whether the value of the field to be verified corresponding to the first target field in the current semantic understanding protocol of the candidate annotation sample is matched with the value of the field corresponding to the first target field in the current semantic understanding protocol of the required sample, thereby ensuring the accuracy of the candidate annotation sample and reducing the labor cost.
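The protocol-based check of S231 to S235 can be illustrated with a minimal sketch. It assumes that the field values to be verified have already been extracted from the alternative annotation sample into a dictionary (the patent text does not say how this extraction is done), and that synonyms are supplied as explicit groups of equivalent strings. The function names `values_match` and `passes_protocol_check` and the dictionary keys are hypothetical.

```python
from typing import Dict, Sequence, Set


def values_match(candidate: str, expected: str,
                 synonym_groups: Sequence[Set[str]] = ()) -> bool:
    """Exact match, or both values belong to the same synonym group."""
    if candidate == expected:
        return True
    return any(candidate in g and expected in g for g in synonym_groups)


def passes_protocol_check(candidate_fields: Dict[str, str],
                          protocol_fields: Dict[str, str],
                          synonym_groups: Sequence[Set[str]] = ()) -> bool:
    """S233: every first target field of the protocol must be matched."""
    return all(
        name in candidate_fields
        and values_match(candidate_fields[name], expected, synonym_groups)
        for name, expected in protocol_fields.items()
    )


# Field values assumed to have been extracted from the sample
# "Where is the bus station near Tianjin University?".
candidate = {"domain": "navigation", "slot": "university",
             "slot_value": "Tianjin University"}
tianjin = dict(candidate)
beijing = {**candidate, "slot_value": "Beijing University"}

print(passes_protocol_check(candidate, tianjin))  # True  -> S234, keep
print(passes_protocol_check(candidate, beijing))  # False -> S235, delete
```

A candidate with "domain = catering" would likewise pass against a protocol with "domain = diet" if the group `{"catering", "diet"}` is passed in `synonym_groups`, which corresponds to the synonym case described above.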
Example three
Fig. 3 is a flowchart of a method for generating annotation data according to a third embodiment of the present invention, which is embodied on the basis of the foregoing embodiment. In this embodiment, performing a rationality check on the alternative annotation sample according to the sample condition information to obtain a target annotation sample specifically includes: obtaining the grammar rule of the demand sample in the sample condition information; searching the alternative annotation sample for a third target field and a fourth target field corresponding to the grammar rule; and, if the search result is determined to match the grammar rule, determining the alternative annotation sample as the target annotation sample. Accordingly, as shown in fig. 3, the method of the present embodiment may include:
S310, acquiring sample condition information matched with the demand sample and provided by the data demand party.
S320, providing the sample condition information to at least one data labeling party, and acquiring an alternative labeling sample generated by the data labeling party for the sample condition information.
S330, performing rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample.
S331, obtaining the grammar rule of the requirement sample in the sample condition information.
In the embodiment of the invention, when the rationality of the alternative labeled sample is checked according to the sample condition information, the rationality of the alternative labeled sample can be checked according to the grammar rule of the required sample in the sample condition information.
S332, searching a third target field and a fourth target field corresponding to the grammar rule in the alternative labeling sample.
Specifically, when the rationality of the alternative annotation sample is checked according to the grammar rule of the demand sample in the sample condition information, the third target field and the fourth target field corresponding to the grammar rule can be searched for in, and extracted from, the alternative annotation sample.
It should be noted that the third target field and the fourth target field are predefined fields, and each may be one field or multiple fields. The third target field may include one or more fields in the first target field, or may be defined separately from the fields included in the first target field. For example, the third target field may be restaurant, take-out, or dish, etc., and the fourth target field may be beverage or cold drink, etc.
S333, judging whether the search result is matched with the grammar rule, if so, executing S334, otherwise, executing S335.
In a specific example, the grammar rule corresponding to the demand sample is that the field which must be included is restaurant, take-out or dish, and the field which must not be included is beverage or cold drink. For the alternative annotation sample "which restaurants are nearby", the search finds the third target field "restaurant" corresponding to the grammar rule and determines that the sample does not contain fields such as "beverage" or "cold drink", which indicates that the search result matches the grammar rule corresponding to the demand sample. For the alternative annotation sample "which take-out cold drinks are nearby", the search finds the third target field "take-out" corresponding to the grammar rule, but also determines that the sample contains the field "cold drink", so the search result does not match the grammar rule corresponding to the demand sample.
It should be noted that the rationality check according to the grammar rule requires the field values to correspond exactly, one to one: even if synonyms of the third target field and the fourth target field are found, the search result cannot be considered to match the grammar rule. Checking the rationality of the alternative annotation samples through the grammar rule of the demand sample in the sample condition information therefore places a higher requirement on the degree of field matching.
S334, determining the candidate annotation sample as the target annotation sample.
S335, deleting the alternative labeling sample.
S340, constructing structured labeling data according to the target labeling sample and the sample condition information.
According to the embodiment of the invention, the third target field and the fourth target field corresponding to the grammatical rule of the required sample are searched in the alternative labeling sample, and whether the search result is matched with the grammatical rule of the required sample is checked to realize rationality check, so that the accuracy of the alternative labeling sample is ensured, and the labor cost is reduced.
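The grammar-rule check of S331 to S335 can be sketched as below. The sketch reads the rule in the example as "at least one of the third target fields must appear and none of the fourth target fields may appear", and it uses plain case-sensitive substring search with no synonym expansion, in line with the exact-match requirement. The function name `passes_grammar_check` and the keys `must_contain` and `must_not_contain` are hypothetical, and real text would probably need word segmentation rather than substring search.

```python
from typing import Sequence


def passes_grammar_check(sample: str,
                         must_contain: Sequence[str],
                         must_not_contain: Sequence[str]) -> bool:
    # Third target fields: at least one must literally appear in the sample.
    has_required = any(term in sample for term in must_contain)
    # Fourth target fields: none may literally appear in the sample.
    has_forbidden = any(term in sample for term in must_not_contain)
    return has_required and not has_forbidden


rule = {"must_contain": ["restaurant", "take-out", "dish"],
        "must_not_contain": ["beverage", "cold drink"]}

print(passes_grammar_check("which restaurants are nearby", **rule))         # True
print(passes_grammar_check("which take-out cold drinks are nearby", **rule))  # False
```

Because only literal occurrences count, a sample such as "which eateries are nearby" is rejected even though "eatery" is a synonym of "restaurant".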
Example four
Fig. 4 is a flowchart of a method for generating annotation data according to a fourth embodiment of the present invention, which is embodied on the basis of the foregoing embodiment, and in this embodiment, structured annotation data is constructed according to the target annotation sample and the sample condition information, specifically: acquiring a current semantic understanding protocol of a demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample and a sample type of the demand sample in the sample condition information; and combining the target labeling sample, the current semantic understanding protocol of the demand sample, the historical semantic understanding protocol of the historical sample associated with the demand sample and the sample type of the demand sample to obtain the structured labeling data. Correspondingly, as shown in fig. 4, the method of this embodiment may include:
S410, obtaining sample condition information matched with the demand sample and provided by the data demand party.
S420, providing the sample condition information to at least one data labeling party, and acquiring an alternative labeling sample generated by the data labeling party for the sample condition information.
S430, performing rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample.
S440, acquiring a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample and a sample type of the demand sample in the sample condition information.
In the embodiment of the present invention, in order to train a statistical model using a target labeled sample that passes the rationality check, data in the target labeled sample needs to be extracted and combined to obtain structured data. Because the data finally generated by the data annotation party only comprises the target annotation sample, when the structured data is constructed according to the target annotation sample, the data extraction can be performed on the target annotation sample by taking the current semantic understanding protocol of the demand sample, the historical semantic understanding protocol of the historical sample associated with the demand sample and the sample type of the demand sample as standards.
S450, combining the target labeling sample, the current semantic understanding protocol of the demand sample, the historical semantic understanding protocol of the historical sample associated with the demand sample and the sample type of the demand sample to obtain the structured labeling data.
For example, for the target annotation sample "I love eating Sichuan cuisine and do not love eating Cantonese cuisine", the structured annotation data constructed from the target annotation sample, the current semantic understanding protocol of the corresponding demand sample, the historical semantic understanding protocol of the historical sample associated with the demand sample, and the sample type of the demand sample may be: "annotation sample: "I love eating Sichuan cuisine and do not love eating Cantonese cuisine";
current semantic understanding protocol: domain = diet; intent = find a restaurant; semantic action = (information slot = dish, corresponding slot value = Sichuan cuisine, user intention = Yes), (information slot = dish, corresponding slot value = Cantonese cuisine, user intention = No); information slot = dish, corresponding slot value = Sichuan cuisine; information slot = dish, corresponding slot value = Cantonese cuisine;
historical semantic understanding protocol: domain = catering; intent = takeaway; semantic action = acquire surrounding restaurants that provide takeaway services;
sample type: positive sample".
S460, inputting the structured labeling data into a preset model for training to obtain a model for interactive semantic action recognition of the user side under the current conversation turn.
The preset model is a user-defined model that is to be trained with the target annotation samples. The model for interactive semantic action recognition of the user side in the current conversation turn is a statistical model applied in the multi-round interactive system, and can be used to recognize the semantic actions of questions that are raised by the user side and acquired while the multi-round interactive system is in use.
In the embodiment of the invention, after a sufficient number of target annotation samples are obtained, the structured annotation data formed from all the target annotation samples can be input into the preset model for training, so as to obtain the model for interactive semantic action recognition of the user side in the current conversation turn. Because the structured annotation data generated by combination is used directly as the training sample resource, a separate link for producing training sample resources can be avoided, which improves the data utilization rate and the data processing efficiency.
According to the embodiment of the invention, the target labeling sample, the current semantic understanding protocol of the demand sample, the historical semantic understanding protocol of the historical sample associated with the demand sample and the sample type of the demand sample are combined to obtain the structured labeling data, and then the structured labeling data is input into the preset model for training to obtain the model for performing semantic action recognition on the interactive mode of the user side under the current conversation turn, so that the data utilization rate and the data processing efficiency can be effectively improved.
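The combination step S450 can be sketched as a function that joins the four parts into one record. The patent does not fix a serialization for the structured annotation data, so the use of a JSON object, the function name `build_structured_record` and all key names below are assumptions made for illustration; the values follow the Sichuan/Cantonese cuisine example of this embodiment.

```python
import json


def build_structured_record(target_sample, current_protocol,
                            history_protocol, sample_type):
    """S450: combine the target sample with the demander's condition info."""
    return {
        "annotation_sample": target_sample,
        "current_semantic_understanding_protocol": current_protocol,
        "historical_semantic_understanding_protocol": history_protocol,
        "sample_type": sample_type,
    }


record = build_structured_record(
    target_sample=("I love eating Sichuan cuisine and do not love eating "
                   "Cantonese cuisine"),
    current_protocol={
        "domain": "diet",
        "intent": "find a restaurant",
        # One entry per slot, with the user's stated preference.
        "semantic_action": [
            {"slot": "dish", "value": "Sichuan cuisine", "user_intention": "Yes"},
            {"slot": "dish", "value": "Cantonese cuisine", "user_intention": "No"},
        ],
    },
    history_protocol={
        "domain": "catering",
        "intent": "takeaway",
        "semantic_action": "acquire surrounding restaurants that provide takeaway services",
    },
    sample_type="positive",
)

# One JSON object per record is a convenient unit for a training pipeline.
print(json.dumps(record, ensure_ascii=False, indent=2))
```

How such records are then fed to the preset model in S460 depends on the model chosen and is not shown here.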
It should be noted that any permutation and combination between the technical features in the above embodiments also belong to the scope of the present invention.
Example five
Fig. 5 is a schematic diagram of a device for generating annotation data according to the fifth embodiment of the present invention, which is capable of executing a method for generating annotation data according to any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the execution method.
The device comprises:
an information obtaining module 510, configured to obtain sample condition information that is provided by a data demander and matches a demand sample;
wherein the sample condition information includes: a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample, a sample type of the demand sample, and a grammar rule of the demand sample;
a sample obtaining module 520, configured to provide the sample condition information to at least one data annotating party, and obtain an alternative annotated sample generated by the data annotating party for the sample condition information;
a sample checking module 530, configured to perform rationality checking on the alternative annotation sample according to the sample condition information, to obtain a target annotation sample;
and a data constructing module 540, configured to construct structured annotation data according to the target annotation sample and the sample condition information.
In this embodiment of the invention, the sample condition information that is provided by the data demander and matches the demand sample is provided to at least one data annotating party; rationality verification is performed, according to the sample condition information, on the alternative annotation samples generated by the data annotating party to obtain target annotation samples; and structured annotation data are constructed according to the target annotation samples and the sample condition information. This solves problems of the prior art, such as the complicated flow and low efficiency of acquiring data for a multi-round interactive system, achieves the technical effect of efficiently acquiring the data required by the multi-round interactive system, simplifies the data acquisition flow, and reduces the labor cost.
Optionally, the demand sample includes: an interaction of the user side in the current conversation turn; and the history samples associated with the demand sample include: interactions of the user side and/or the system side in at least one historical conversation turn associated with the current conversation turn.
Optionally, in the current semantic understanding protocol, a first target field associated with the requirement sample and a field value corresponding to the first target field are defined in a JSON format; defining a second target field associated with the history sample and a field value corresponding to the second target field in a JSON format in the history semantic understanding protocol; in the grammar rule, a third target field which must be contained in the requirement sample and a fourth target field which cannot be contained in the requirement sample are defined; the sample types of the demand sample include: a positive sample type that the requirement sample conforms to the context of the current semantic understanding protocol, or a negative sample type that the requirement sample does not conform to the context of the current semantic understanding protocol; wherein the first target field is the same as the second target field, the first target field or the second target field including at least one of: domain, intent, semantic action, and slot information.
Optionally, the sample checking module 530 is further configured to obtain, in the sample condition information, a current semantic understanding protocol of the demand sample; in the alternative annotation sample, acquiring a field value to be verified corresponding to a first target field included in the current semantic understanding protocol; and if the field value to be verified is determined to be matched with the field value corresponding to the first target field in the current semantic understanding protocol, determining the alternative annotation sample as the target annotation sample.
Optionally, the sample checking module 530 is further configured to obtain a syntax rule of the demand sample in the sample condition information; searching a third target field and a fourth target field corresponding to the grammar rule in the alternative labeling sample; and if the search result is determined to be matched with the grammar rule, determining the alternative annotation sample as the target annotation sample.
Optionally, the sample condition information further includes: an interactive example of the current conversation turn matched with the demand sample, and an interactive example of the historical conversation turn matched with the historical sample; wherein the interactive example of the current conversation turn conforms to the current semantic understanding protocol of the demand sample, and the interactive example of the historical conversation turn conforms to the historical semantic understanding protocol of the historical sample.
Optionally, the data constructing module 540 is further configured to obtain, in the sample condition information, a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample, and a sample type of the demand sample; and combining the target labeling sample, the current semantic understanding protocol of the demand sample, the historical semantic understanding protocol of the historical sample associated with the demand sample and the sample type of the demand sample to obtain the structured labeling data.
Optionally, the apparatus further includes a model obtaining module 550, configured to input the structured annotation data into a preset model for training, so as to obtain a model for performing semantic action recognition on the user side in an interactive manner in the current conversation turn.
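As an illustration of the JSON-defined protocols, grammar rule and sample type that the information obtaining module 510 receives, the following sketch parses one possible JSON layout and applies a few basic consistency checks. The layout itself, the key names, the "positive"/"negative" encoding and the function `load_sample_condition_info` are assumptions introduced here; the text only states that the protocol fields are defined in JSON and include at least one of domain, intent, semantic action and slot information.

```python
import json

# Field names allowed in either protocol (assumed spelling of the four fields).
PROTOCOL_FIELDS = {"domain", "intent", "semantic_action", "slots"}
SAMPLE_TYPES = {"positive", "negative"}


def load_sample_condition_info(text: str) -> dict:
    """Parse sample condition information and reject inconsistent input."""
    info = json.loads(text)
    for key in ("current_protocol", "history_protocol"):
        unknown = set(info[key]) - PROTOCOL_FIELDS
        if unknown:
            raise ValueError(f"{key} has unsupported fields: {sorted(unknown)}")
    if info["sample_type"] not in SAMPLE_TYPES:
        raise ValueError(f"unknown sample type: {info['sample_type']!r}")
    rule = info["grammar_rule"]
    # A term cannot be both required (third) and forbidden (fourth).
    overlap = set(rule["must_contain"]) & set(rule["must_not_contain"])
    if overlap:
        raise ValueError(f"terms both required and forbidden: {sorted(overlap)}")
    return info


EXAMPLE = """
{
  "current_protocol": {
    "domain": "catering",
    "semantic_action": "search for nearby restaurants"
  },
  "history_protocol": {
    "domain": "catering",
    "intent": "takeaway",
    "semantic_action": "acquire surrounding restaurants that provide takeaway services"
  },
  "sample_type": "positive",
  "grammar_rule": {
    "must_contain": ["restaurant", "take-out", "dish"],
    "must_not_contain": ["beverage", "cold drink"]
  }
}
"""

info = load_sample_condition_info(EXAMPLE)
print(info["grammar_rule"]["must_contain"])
```

Validating the condition information once, when it is obtained, keeps the sample checking module 530 and the data constructing module 540 free of input-format concerns.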
The generating device of the annotation data can execute the generating method of the annotation data provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method. For details of the technology that are not described in detail in this embodiment, reference may be made to a method for generating annotation data provided in any embodiment of the present invention.
Since the above-described apparatus for generating annotation data is an apparatus capable of executing the method for generating annotation data in the embodiment of the present invention, based on the method for generating annotation data described in the embodiment of the present invention, a person skilled in the art can understand the specific implementation of the apparatus for generating annotation data in the embodiment and various variations thereof, and therefore, how to implement the method for generating annotation data in the embodiment of the present invention by the apparatus for generating annotation data is not described in detail here. The device used by those skilled in the art to implement the method for generating the annotation data in the embodiment of the present invention is within the scope of the protection of the present application.
Example six
Fig. 6 is a schematic structural diagram of a computer device according to a sixth embodiment of the present invention. FIG. 6 illustrates a block diagram of a computer device 612 suitable for use in implementing embodiments of the present invention. The computer device 612 shown in fig. 6 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in fig. 6, the computer device 612 is in the form of a general purpose computing device. Components of computer device 612 may include, but are not limited to: one or more processors 616, a storage device 628, and a bus 618 that couples the various system components including the storage device 628 and the processors 616.
Bus 618 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computer device 612 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 612 and includes both volatile and nonvolatile media, removable and non-removable media.
Storage device 628 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 630 and/or cache memory 632. The computer device 612 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 634 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact Disc-Read Only Memory (CD-ROM), a Digital Video Disc-Read Only Memory (DVD-ROM), or other optical media) may be provided. In such cases, each drive may be connected to bus 618 by one or more data media interfaces. Storage device 628 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program 636 having a set (at least one) of program modules 626 may be stored, for example, in storage device 628, such program modules 626 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may include an implementation of a network environment. Program modules 626 generally perform the functions and/or methodologies of embodiments of the invention as described herein.
Computer device 612 may also communicate with one or more external devices 614 (e.g., keyboard, pointing device, camera, display 624, etc.), with one or more devices that enable a user to interact with computer device 612, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 612 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 622. Further, computer device 612 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via network adapter 620. As shown, the network adapter 620 communicates with the other modules of the computer device 612 via the bus 618. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the computer device 612, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems, to name a few.
The processor 616 executes various functional applications and data processing by executing programs stored in the storage device 628, for example, implementing the generation method of the annotation data provided by the above-described embodiment of the present invention.
That is, the processing unit implements, when executing the program: acquiring sample condition information which is provided by a data demander and matched with a demand sample; wherein the sample condition information includes: the method comprises the steps of obtaining a current semantic understanding protocol of a demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample, a sample type of the demand sample and a grammar rule of the demand sample; providing the sample condition information to at least one data labeling party, and acquiring an alternative labeling sample generated by the data labeling party for the sample condition information; performing rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample; and constructing structured labeling data according to the target labeling sample and the sample condition information.
With this computer device, sample condition information which is provided by a data demander and matched with a demand sample is provided to at least one data labeling party; the alternative annotation sample generated by the data labeling party is subjected to rationality verification according to the sample condition information to obtain a target annotation sample; and structured annotation data is constructed according to the target annotation sample and the sample condition information. This solves the problems of complicated flow and low efficiency that exist in the prior art when acquiring data for a multi-round interactive system, achieves the technical effect of efficiently obtaining the data required by the multi-round interactive system, simplifies the data acquisition flow, and reduces labor cost.
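The four steps above can be sketched as a small pipeline. This is only an illustrative sketch, not the patented implementation: the function name `generate_annotation_data`, the dictionary keys, and the idea of modelling each data labeling party as a Python callable are all assumptions introduced here.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical types: a labeling party is any callable that turns the sample
# condition information into one alternative annotation sample (a dict), and
# a verifier decides whether that sample is acceptable.
Annotator = Callable[[Dict], Dict]
Verifier = Callable[[Dict, Dict], bool]


def generate_annotation_data(
    condition_info: Dict,
    annotators: Iterable[Annotator],
    verify: Verifier,
) -> List[Dict]:
    """Distribute the condition information, verify, and build records."""
    structured = []
    for annotate in annotators:
        # Provide the condition information and collect an alternative sample.
        candidate = annotate(condition_info)
        # Rationality verification: samples that fail are discarded.
        if not verify(candidate, condition_info):
            continue
        # Combine the target sample with parts of the condition information.
        structured.append({
            "sample": candidate,
            "current_protocol": condition_info["current_protocol"],
            "history_protocol": condition_info["history_protocol"],
            "sample_type": condition_info["sample_type"],
        })
    return structured
```

Passing the verifier in as a parameter keeps the sketch neutral about how rationality verification is actually performed.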
EXAMPLE seven
An embodiment of the present invention further provides a computer storage medium storing a computer program which, when executed by a computer processor, performs the method for generating annotation data according to any one of the above embodiments of the present invention, the method including:
acquiring sample condition information which is provided by a data demander and matched with a demand sample;
wherein the sample condition information includes: a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample, a sample type of the demand sample, and a grammar rule of the demand sample;
providing the sample condition information to at least one data labeling party, and acquiring an alternative annotation sample generated by the data labeling party for the sample condition information;
performing rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample;
and constructing structured annotation data according to the target annotation sample and the sample condition information.
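As a rough illustration of the sample condition information listed above, the four components could be held in one record and serialized to JSON, the format the claims name for the two protocols. The class name `SampleConditionInfo`, the key names `must_include` and `must_exclude`, and all example values are assumptions made for this sketch; only the four components, the positive and negative sample types, and the field kinds (domain, intent, slot information) come from the text.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SampleConditionInfo:
    """Container for the four components; all names here are illustrative."""
    current_protocol: dict   # semantic understanding protocol of the demand sample
    history_protocol: dict   # protocol of the associated historical sample
    sample_type: str         # "positive" or "negative"
    grammar_rule: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # sort_keys gives a stable serialization for storage or comparison.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


# Hypothetical example values for a music-playback dialogue.
info = SampleConditionInfo(
    current_protocol={"domain": "music", "intent": "play"},
    history_protocol={"domain": "music", "intent": "search"},
    sample_type="positive",
    grammar_rule={"must_include": ["artist"], "must_exclude": ["city"]},
)
```

Calling `info.to_json()` yields a single JSON string that could be handed to a data labeling party as is.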
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for generating annotation data, characterized by comprising:
acquiring sample condition information which is provided by a data demander and matched with a demand sample, wherein the data demander refers to a user who generates or optimizes a multi-round interactive system;
wherein the sample condition information includes: a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample, a sample type of the demand sample, and a grammar rule of the demand sample;
providing the sample condition information to at least one data labeling party, and acquiring an alternative annotation sample generated by the data labeling party for the sample condition information;
performing rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample, wherein the rationality verification refers to a verification operation on whether the alternative annotation sample meets the requirements of the current demand sample;
constructing structured annotation data according to the target annotation sample and the sample condition information.

2. The method according to claim 1, characterized in that the demand sample includes: the interaction of the user side in the current dialogue round;
the historical sample associated with the demand sample includes: the interaction of the user side and/or the system side in at least one historical dialogue round associated with the current dialogue round.

3. The method according to claim 2, characterized in that:
in the current semantic understanding protocol, a first target field associated with the demand sample and a field value corresponding to the first target field are defined in JSON format;
in the historical semantic understanding protocol, a second target field associated with the historical sample and a field value corresponding to the second target field are defined in JSON format;
in the grammar rule, a third target field that must be included in the demand sample and a fourth target field that must not be included in the demand sample are defined;
the sample type of the demand sample includes: a positive sample type, in which the demand sample conforms to the context of the current semantic understanding protocol, or a negative sample type, in which the demand sample does not conform to the context of the current semantic understanding protocol;
wherein the first target field is the same as the second target field, and the first target field or the second target field includes at least one of the following: domain, intent, semantic action, and slot information.

4. The method according to claim 3, characterized in that performing rationality verification on the alternative annotation sample according to the sample condition information to obtain the target annotation sample includes:
acquiring, from the sample condition information, the current semantic understanding protocol of the demand sample;
acquiring, from the alternative annotation sample, a field value to be verified that corresponds to the first target field included in the current semantic understanding protocol;
if it is determined that the field value to be verified matches the field value corresponding to the first target field in the current semantic understanding protocol, determining the alternative annotation sample as the target annotation sample.

5. The method according to claim 3, characterized in that performing rationality verification on the alternative annotation sample according to the sample condition information to obtain the target annotation sample includes:
acquiring, from the sample condition information, the grammar rule of the demand sample;
searching the alternative annotation sample for the third target field and the fourth target field corresponding to the grammar rule;
if it is determined that the search result matches the grammar rule, determining the alternative annotation sample as the target annotation sample.

6. The method according to any one of claims 1-5, characterized in that the sample condition information further includes:
an interaction example of the current dialogue round matched with the demand sample, and an interaction example of the historical dialogue round matched with the historical sample;
wherein the interaction example of the current dialogue round conforms to the current semantic understanding protocol of the demand sample, and the interaction example of the historical dialogue round conforms to the historical semantic understanding protocol of the historical sample.

7. The method according to claim 1, characterized in that constructing structured annotation data according to the target annotation sample and the sample condition information includes:
acquiring, from the sample condition information, the current semantic understanding protocol of the demand sample, the historical semantic understanding protocol of the historical sample associated with the demand sample, and the sample type of the demand sample;
combining the target annotation sample, the current semantic understanding protocol of the demand sample, the historical semantic understanding protocol of the historical sample associated with the demand sample, and the sample type of the demand sample to obtain the structured annotation data;
the method further includes: inputting the structured annotation data into a preset model for training, to obtain a model that performs semantic action recognition on the interaction of the user side in the current dialogue round.

8. A device for generating annotation data, characterized by comprising:
an information acquisition module, configured to acquire sample condition information which is provided by a data demander and matched with a demand sample, wherein the data demander refers to a user who generates or optimizes a multi-round interactive system;
wherein the sample condition information includes: a current semantic understanding protocol of the demand sample, a historical semantic understanding protocol of a historical sample associated with the demand sample, a sample type of the demand sample, and a grammar rule of the demand sample;
a sample acquisition module, configured to provide the sample condition information to at least one data labeling party and to acquire an alternative annotation sample generated by the data labeling party for the sample condition information;
a sample verification module, configured to perform rationality verification on the alternative annotation sample according to the sample condition information to obtain a target annotation sample, wherein the rationality verification refers to a verification operation on whether the alternative annotation sample meets the requirements of the current demand sample;
a data construction module, configured to construct structured annotation data according to the target annotation sample and the sample condition information.

9. A computer device, characterized in that the device comprises:
one or more processors;
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for generating annotation data according to any one of claims 1-7.

10. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for generating annotation data according to any one of claims 1-7.
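The two rationality checks described in claims 4 and 5 can be illustrated with a short sketch. It rests on assumptions not stated in the claims: annotation samples and protocols are modelled as plain dictionaries, the grammar rule uses the hypothetical keys `must_include` and `must_exclude`, and the fields searched for in a sample are taken to be the keys of its `slots` entry.

```python
def matches_protocol(sample: dict, protocol: dict) -> bool:
    """Claim 4 style check: each protocol field must carry the same value
    in the alternative annotation sample."""
    return all(sample.get(name) == value for name, value in protocol.items())


def matches_grammar_rule(sample: dict, rule: dict) -> bool:
    """Claim 5 style check: required fields are present and forbidden
    fields are absent (fields are read from the sample's slots here)."""
    present = set(sample.get("slots", {}))
    required = set(rule.get("must_include", []))
    forbidden = set(rule.get("must_exclude", []))
    return required <= present and present.isdisjoint(forbidden)


# Hypothetical example data.
protocol = {"domain": "music", "intent": "play"}
rule = {"must_include": ["artist"], "must_exclude": ["city"]}
sample = {
    "text": "play something by Jay Chou",
    "domain": "music",
    "intent": "play",
    "slots": {"artist": "Jay Chou"},
}
```

Both `matches_protocol(sample, protocol)` and `matches_grammar_rule(sample, rule)` accept this sample. The two functions are kept separate because claims 4 and 5 each describe their own way of determining the target annotation sample.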
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810580489.2A CN108959412B (en) 2018-06-07 2018-06-07 Method, device and equipment for generating labeled data and storage medium

Publications (2)

Publication Number Publication Date
CN108959412A CN108959412A (en) 2018-12-07
CN108959412B true CN108959412B (en) 2021-09-14

Family

ID=64493637

Country Status (1)

Country Link
CN (1) CN108959412B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 2024-11-27
Address after: 200232 room 2015, floor 2, No. 24, Lane 315, Fenggu Road, Xuhui District, Shanghai
Patentee after: SHANGHAI MOBVOI INFORMATION TECHNOLOGY Co.,Ltd.
Country or region after: China
Address before: 100094 1001, 10th floor, office building a, 19 Zhongguancun Street, Haidian District, Beijing
Patentee before: MOBVOI INFORMATION TECHNOLOGY Co.,Ltd.
Country or region before: China