
CN114186015B - Information retrieval method, device and computer readable storage medium - Google Patents


Info

Publication number
CN114186015B
Authority
CN
China
Prior art keywords
query
information retrieval
model
data
training data
Prior art date
Legal status
Active
Application number
CN202010970977.1A
Other languages
Chinese (zh)
Other versions
CN114186015A (en)
Inventor
丁磊
童毅轩
董滨
姜珊珊
张永伟
Current Assignee
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to CN202010970977.1A (CN114186015B)
Priority to JP2021149311A (JP7230979B2)
Publication of CN114186015A
Application granted
Publication of CN114186015B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The present invention provides an information retrieval method, device and computer-readable storage medium. The information retrieval method provided by the present invention comprises: obtaining first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction; removing noise in the first training data to obtain second training data; initializing an information retrieval model using the second training data; and performing information retrieval using the information retrieval model. The technical solution of the present invention can improve the accuracy of information retrieval results and improve the efficiency of information retrieval.

Description

Information retrieval method, apparatus and computer readable storage medium
Technical Field
The present invention relates to the field of information retrieval, and in particular, to an information retrieval method, apparatus, and computer readable storage medium.
Background
Information retrieval is an important technology, widely applied in search engines, question-answering systems, recommendation systems, and various other intelligent services. With better information retrieval techniques, vendors can more accurately understand customer intent and provide appropriate products or services.
At present, the mainstream approach to information retrieval judges the semantic relevance between a user query and a document with a large-scale neural network model. Training such a model requires a large amount of labeled data, but manual labeling is costly. The related art proposes constructing training annotations with a generation-based method. However, the generated data usually contains noise, and the relevance of the negative samples in the generated data is insufficient, which degrades the effect of information retrieval.
Disclosure of Invention
The technical problem to be solved by the embodiment of the invention is to provide an information retrieval method, an information retrieval device and a computer readable storage medium, which can improve the accuracy of information retrieval results and the efficiency of information retrieval.
According to an aspect of an embodiment of the present invention, there is provided an information retrieval method including:
Acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model by using the second training data;
and carrying out information retrieval by utilizing the information retrieval model.
Furthermore, in accordance with at least one embodiment of the present invention, after initializing the information retrieval model, the method further comprises:
optimizing the information retrieval model through adversarial queries.
Furthermore, in accordance with at least one embodiment of the present invention, the acquiring the first training data includes:
acquiring open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
training a query data generation model using the open data, wherein the query data generation model can generate, from an input query result, a query instruction corresponding to that query result;
And inputting the documents in the specific field into the query data generation model to generate the first training data.
Furthermore, in accordance with at least one embodiment of the present invention, the removing noise in the first training data includes:
initializing a noise classification model by using the first training data;
Training the noise classification model;
And removing noise in the first training data by using the trained noise classification model.
Furthermore, in accordance with at least one embodiment of the present invention, the training the noise classification model comprises:
performing N iterations to obtain a trained noise classification model, wherein N is a positive integer;
And in each iteration, removing noise in the first training data by using the noise classification model, training the information retrieval model by using the data after removing the noise, and updating parameters of the noise classification model by using a loss function of the information retrieval model after training.
Furthermore, in accordance with at least one embodiment of the present invention, the optimizing of the information retrieval model through adversarial queries includes:
initializing an irrelevant query generation model by using the second training data, wherein the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output of the irrelevant query generation model is a second query instruction irrelevant to the query result;
And inputting the output result of the information retrieval model into the irrelevant query generation model, and training the information retrieval model by utilizing the output result of the irrelevant query generation model.
Furthermore, in accordance with at least one embodiment of the present invention, the objective function of the irrelevant query generation model comprises:
the relevance between the second query instruction generated by the irrelevant query generation model and the query result;
and the text similarity between the second query instruction generated by the irrelevant query generation model and the first query instruction.
According to another aspect of an embodiment of the present invention, there is provided an information retrieval apparatus including:
the acquisition unit is used for acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
The noise removing unit is used for removing noise in the first training data to obtain second training data;
an initializing unit, configured to initialize an information retrieval model using the second training data;
And the information retrieval unit is used for retrieving information by utilizing the information retrieval model.
Furthermore, in accordance with at least one embodiment of the present invention, the apparatus further comprises:
and the optimizing unit, configured to optimize the information retrieval model through adversarial queries.
Furthermore, according to at least one embodiment of the present invention, the acquisition unit includes:
The acquisition subunit is used for acquiring open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
The first processing subunit, configured to train a query data generation model using the open data, the query data generation model being capable of generating, from an input query result, a query instruction corresponding to that query result;
And the second processing subunit is used for inputting the documents in the specific field into the query data generation model to generate the first training data.
Furthermore, according to at least one embodiment of the present invention, the noise removing unit includes:
A first initializing subunit, configured to initialize a noise classification model using the first training data;
The training subunit is used for training the noise classification model;
and the clearing subunit is used for clearing noise in the first training data by using the trained noise classification model.
Furthermore, according to at least one embodiment of the present invention, the optimizing unit includes:
A second initialization subunit, configured to initialize an irrelevant query generation model using the second training data, where the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output is a second query instruction irrelevant to the query result;
And the adversarial training subunit, configured to input the output of the information retrieval model into the irrelevant query generation model and to train the information retrieval model using the output of the irrelevant query generation model.
The embodiment of the invention also provides an information retrieval device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the steps of the information retrieval method when being executed by the processor.
Embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the information retrieval method as described above.
Compared with the prior art, in the information retrieval method, apparatus, and computer-readable storage medium provided by the embodiments of the present invention, after the first training data for training the information retrieval model is obtained, it is not used directly to build the model: noise in the first training data is removed first, and the information retrieval model is initialized with the noise-removed second training data. This optimizes the performance of the information retrieval model, improves the accuracy of retrieval results, and improves retrieval efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an information retrieval method according to an embodiment of the present invention;
FIG. 2 is a flowchart of acquiring first training data according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a process for removing noise from first training data according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of training a noise classification model according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of optimizing an information retrieval model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of generating an irrelevant query according to an embodiment of the invention;
FIG. 7 is a schematic diagram of an information retrieval model and an irrelevant query generation model for countermeasure training in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of an information retrieval device according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another structure of an information retrieval device according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an acquisition unit according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a noise removing unit according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an optimizing unit according to an embodiment of the present invention;
Fig. 13 is a schematic diagram of still another structure of an information retrieval device according to an embodiment of the present invention.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as specific configurations and components are provided merely to facilitate a thorough understanding of embodiments of the invention. It will be apparent to those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the following processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
To address the two problems that an information retrieval system requires a large amount of labeled data and that manual labeling is costly, a query data generation model can be trained on open data and used to generate queries for documents in the target domain; the generated queries are then paired with their documents to construct "query-result" data pairs for training the information retrieval model.
However, this approach has two problems. First, the generated data contains noise. Second, the generated queries serve only as related queries, while unrelated queries are constructed by randomly selecting queries of other documents, so the quantity and quality of the unrelated queries cannot meet the requirements. High-quality unrelated queries, which are textually similar to related queries but unrelated to the content of the query results, can effectively improve an information retrieval system. For example, for the query result "iphoneX is produced by Apple Inc.", the corresponding related query is "Who is the manufacturer of iphoneX?", a corresponding high-quality unrelated query is "What is the color of iphoneX?", and a low-quality unrelated query is "Who is AAA?".
Embodiments of the present invention provide an information retrieval method, apparatus, and computer-readable storage medium, which can improve accuracy of information retrieval results and improve efficiency of information retrieval.
Example 1
An embodiment of the present invention provides an information retrieval method, as shown in fig. 1, including:
step 101, acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
For example, "is the manufacturer of iphoneX" the query instruction, "is the query result" iphoneX produced by apple inc "corresponding thereto. Wherein the first training data is a "query instruction-query result" data pair for a particular target domain.
As shown in fig. 2, acquiring the first training data includes the steps of:
Step 1011, obtaining open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
The open data can be a publicly available dataset or can be collected from the web. For example, "question-answer" data on a question-answering website can be regarded as query instructions and their corresponding query results, and such data can be collected as training data for the query data generation model.
Unlike the first training data, the acquired open data does not have to come from the specific target domain (for example, the medical domain); it can come from other domains, such as the mechanical domain.
Step 1012, generating a query data generation model by utilizing the open data training, wherein the query data generation model can generate a query instruction corresponding to an input query result according to the query result;
The query data generation model is a neural network model trained on the acquired open data. Given an input query result, it can generate a query instruction corresponding to that result; for example, if the input query result is "AAA is the president of country B", the generated output is the query instruction "Who is AAA?".
Step 1013, inputting the documents in the specific field into the query data generation model to generate the first training data.
Documents of a specific domain, including but not limited to the medical domain and the machine-building domain, can be input into the query data generation model as needed. Inputting domain documents into the query data generation model yields a "query instruction-query result" dataset, so a large amount of domain-specific "query instruction-query result" data can be generated, alleviating the information retrieval system's need for large amounts of labeled data and the high cost of manual labeling.
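As an illustration of steps 1012 and 1013, the generation step can be sketched with an off-the-shelf sequence-to-sequence model. The checkpoint name and generation settings below are assumptions for illustration only; the patent specifies just that a neural query data generation model is trained on open data:

```python
# A minimal sketch of query generation (assumption: a Hugging Face seq2seq
# checkpoint fine-tuned on open question-answer data; the model name below
# is a hypothetical placeholder, not from the patent).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "your-org/query-generator"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_first_training_data(domain_documents, max_len=64):
    """Generate (query instruction, query result) pairs from domain documents."""
    pairs = []
    for doc in domain_documents:
        inputs = tokenizer(doc, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_length=max_len, num_beams=4)
        query = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        pairs.append((query, doc))  # query instruction plus its query result
    return pairs

# Example with the patent's sample sentence as the domain document.
print(generate_first_training_data(["iphoneX is produced by Apple Inc."]))
```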
Step 102, removing noise in the first training data to obtain second training data;
If the domain-specific "query instruction-query result" dataset generated in step 101 contains noise (i.e., incorrect data), the accuracy of the information retrieval model suffers; therefore, the noise needs to be removed before the information retrieval model is initialized and trained. In this embodiment, the noise in the first training data can be removed with a noise classification model, which can be any text classification model able to distinguish whether a piece of data is noise.
As shown in fig. 3, in this embodiment, removing noise in the first training data includes the following steps:
step 1021, initializing a noise classification model by using the first training data;
step 1022, training the noise classification model;
in this embodiment, N iterations may be performed to obtain a trained noise classification model, where N is a positive integer.
As shown in fig. 4, in each iteration, noise in the first training data is removed by using a noise classification model, an information retrieval model is trained by using the data after removing the noise, and parameters of the noise classification model are updated by using a loss function of the trained information retrieval model, so that the noise classification model is optimized.
In this embodiment, the noise classification model can predict the probability p_j that a data item is noise, as follows:
p_j = π(a = 0 | θ)
where π is the function computed by the noise classification model, a = 1 indicates that the data item is not noise, and a = 0 indicates that it is noise.
The parameter θ of the noise classification model is updated with a loss function derived from the trained information retrieval model, where U_i is the data retained after removing noise from the first training data with the noise classification model in the i-th iteration, U_{i-1} is the data retained after the (i-1)-th iteration, and F is the performance metric used to evaluate the information retrieval model.
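The loss formula itself does not survive in this text. A policy-gradient-style form consistent with the definitions of U_i, U_{i-1}, F, and π, offered here as an assumption rather than the patent's verbatim equation, would be:

```latex
% Assumed reconstruction, not the patent's verbatim equation:
% the classifier is rewarded when the data it retains improves retrieval.
L(\theta) = -\bigl( F(U_i) - F(U_{i-1}) \bigr) \sum_{j \in U_i} \log \pi(a_j = 1 \mid \theta)
```

Under this form, the classifier is rewarded for keeping data exactly when doing so improves the retrieval metric F.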
N may be a preset value, for example 50 or 100, or may be determined from the performance of the information retrieval model after each iteration: if the performance after the N-th iteration improves only marginally over the (N-1)-th iteration, the iteration stops.
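To make the loop concrete, here is a minimal sketch of step 1022. The token-overlap feature, the logistic form of π, and the REINFORCE-style update driven by F(U_i) - F(U_{i-1}) are illustrative assumptions standing in for the patent's neural models and metric, not its actual implementation:

```python
import math

def noise_prob(example, theta):
    """pi(a=0 | theta): probability that a (query, result) pair is noise.
    Placeholder feature: token overlap between query and result."""
    query, result = example
    overlap = len(set(query.lower().split()) & set(result.lower().split()))
    return 1.0 / (1.0 + math.exp(theta["w"] * overlap + theta["b"]))

def evaluate_F(retained):
    """Placeholder for F: fraction of retained pairs sharing at least one token."""
    if not retained:
        return 0.0
    hits = sum(1 for q, r in retained
               if set(q.lower().split()) & set(r.lower().split()))
    return hits / len(retained)

def train_noise_classifier(first_training_data, N=50, lr=0.1):
    theta = {"w": 0.0, "b": 1.0}  # b > 0 so the classifier starts by keeping everything
    prev_F = 0.0
    for _ in range(N):
        # Keep examples the classifier considers non-noise (a = 1).
        U_i = [ex for ex in first_training_data if noise_prob(ex, theta) < 0.5]
        # A real system would train the retrieval model on U_i here;
        # evaluate_F stands in for its evaluated performance.
        F_i = evaluate_F(U_i)
        # REINFORCE-style update: raise pi(a=1) on the kept data,
        # scaled by how much retrieval performance changed.
        reward = F_i - prev_F
        for q, r in U_i:
            overlap = len(set(q.lower().split()) & set(r.lower().split()))
            p_noise = noise_prob((q, r), theta)
            theta["w"] += lr * reward * p_noise * overlap  # d log pi(a=1) / dw
            theta["b"] += lr * reward * p_noise            # d log pi(a=1) / db
        prev_F = F_i
    return theta
```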
Step 1023, removing noise in the first training data by using the trained noise classification model.
After the noise in the first training data is removed by using the trained noise classification model, incorrect data in the first training data can be removed, and the accuracy of the information retrieval model is improved.
Step 103, initializing an information retrieval model by using the second training data;
in this embodiment, the information retrieval model is initialized by using the second training data with noise removed, so that accuracy of information retrieval results can be improved, and efficiency of information retrieval can be improved.
To further improve performance, after the information retrieval model is initialized, the method also optimizes it through adversarial queries. As shown in fig. 5, optimizing the information retrieval model through adversarial queries includes the following steps:
Step 1051, initializing an irrelevant query generation model using the second training data, wherein the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output is a second query instruction irrelevant to the query result;
The second query instruction is a high-quality irrelevant query: it must be irrelevant to the query result while remaining textually similar to the first query instruction. The objective function of the irrelevant query generation model comprises the relevance between the second query instruction it generates and the query result, and the text similarity between that second query instruction and the first query instruction. The smaller the objective function, the better.
As shown in fig. 6, the input to the irrelevant query generation model is a query result, such as "iphoneX is produced by Apple Inc.", together with a query instruction related to that result, such as "Who is the manufacturer of iphoneX?". The output is another query instruction unrelated to the query result, such as "What is the color of iphoneX?", which is textually similar to "Who is the manufacturer of iphoneX?", i.e., a high-quality irrelevant query.
During initialization, the model is trained so that the generated query instructions are irrelevant to the query results but textually similar to the related query instructions, using the following objective function:
p(a = 1 | result, generated irrelevant query) + λ · d(related query, generated irrelevant query)
where the term p(a = 1 | result, generated irrelevant query) represents the relevance of the generated irrelevant query to the query result and can be obtained from the initialized information retrieval model, and d(related query, irrelevant query) is a text-similarity measure such as edit distance. Both terms should be as small as possible. λ is a weight coefficient that adjusts the importance of the second term, and its value can be tuned as needed.
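A minimal sketch of this objective, assuming the relevance term is supplied by the initialized information retrieval model (passed in as a plain number here) and using a normalized difflib ratio as one possible stand-in for the edit-distance term d:

```python
import difflib

def text_distance(related_query: str, generated_query: str) -> float:
    """d(related, generated): a normalized stand-in for edit distance;
    smaller when the two queries are textually closer."""
    ratio = difflib.SequenceMatcher(None, related_query, generated_query).ratio()
    return 1.0 - ratio  # 0 = identical text, 1 = completely different

def objective(relevance: float, related_query: str, generated_query: str,
              lam: float = 0.5) -> float:
    """p(a=1 | result, generated query) + lambda * d(related, generated).
    Both terms should be as small as possible."""
    return relevance + lam * text_distance(related_query, generated_query)

# A high-quality irrelevant query scores low on both terms: it is
# irrelevant to the result (low relevance) yet textually close to
# the related query (low d).
print(objective(0.05, "Who is the manufacturer of iphoneX?",
                "What is the color of iphoneX?"))
```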
Step 1052, inputting the output result of the information retrieval model into the irrelevant query generation model, and training the information retrieval model by using the output result of the irrelevant query generation model.
As shown in fig. 7, the information retrieval model is trained using the output of the irrelevant query generation model as training data, and the output of the information retrieval model is fed back into the irrelevant query generation model, so that the two models undergo adversarial training that optimizes the information retrieval model.
The input of the information retrieval model is a "query instruction-query result" pair, and its output is a probability that the query result is the correct result for the query instruction. If the model were trained only on related query instructions and their corresponding query results, its accuracy would be limited. Training it additionally on the "irrelevant query instruction-query result" data output by the irrelevant query generation model pits the two models against each other, and iterating the two models mutually improves both.
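The alternation described above can be summarized schematically. The RetrievalModel and IrrelevantQueryGenerator interfaces below (generate, train, score, update) are hypothetical placeholders for the patent's two neural models, not a known API:

```python
# Schematic adversarial loop for step 1052; the two classes are
# illustrative stand-ins, not a real implementation.
class RetrievalModel:
    def train(self, positives, negatives):
        pass  # fit on relevant pairs plus generated hard negatives

    def score(self, query, result):
        return 0.5  # placeholder for p(a=1 | query, result)

class IrrelevantQueryGenerator:
    def generate(self, data):
        # Produce one (irrelevant query, result) pair per (query, result).
        return [("generated query for: " + q, r) for q, r in data]

    def update(self, negatives, feedback):
        pass  # reinforce queries the retrieval model mistook for relevant

def adversarial_training(retrieval, generator, second_training_data, rounds=10):
    for _ in range(rounds):
        negatives = generator.generate(second_training_data)
        retrieval.train(second_training_data, negatives)
        # Retrieval scores are fed back so the generator learns to produce
        # queries the retrieval model still mistakes for relevant ones.
        feedback = [retrieval.score(q, r) for q, r in negatives]
        generator.update(negatives, feedback)
    return retrieval, generator
```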
Step 104, performing information retrieval using the information retrieval model.
In this embodiment, the information retrieval model can accurately judge the semantic relevance between a user's query instruction and a document.
In this embodiment, after the first training data is obtained, it is not used directly to build the information retrieval model: noise in the first training data is removed first, and the noise-removed second training data is used to initialize the information retrieval model. This optimizes the performance of the information retrieval model, improves the accuracy of information retrieval results, and improves the efficiency of information retrieval.
Example two
The embodiment of the invention also provides an information retrieval device, as shown in fig. 8, comprising:
An obtaining unit 21, configured to obtain first training data, where the first training data includes a query instruction and a query result corresponding to the query instruction;
For example, "is the manufacturer of iphoneX" the query instruction, "is the query result" iphoneX produced by apple inc "corresponding thereto. Wherein the first training data is a "query instruction-query result" data pair for a particular target domain.
A noise removing unit 22, configured to remove noise in the first training data to obtain second training data;
If the generated domain-specific "query instruction-query result" dataset contains noise (i.e., incorrect data), the accuracy of the information retrieval model suffers; therefore, the noise needs to be removed before the information retrieval model is initialized and trained. In this embodiment, the noise in the first training data can be removed with a noise classification model, which can be any text classification model able to distinguish whether a piece of data is noise.
An initializing unit 23 for initializing an information retrieval model using the second training data;
in this embodiment, the information retrieval model is initialized by using the second training data with noise removed, so that accuracy of information retrieval results can be improved, and efficiency of information retrieval can be improved.
An information retrieval unit 24 for retrieving information using the information retrieval model.
In this embodiment, the information retrieval model can accurately judge the semantic relevance between a user's query instruction and a document.
In this embodiment, after the first training data is obtained, it is not used directly to build the information retrieval model: noise in the first training data is removed first, and the noise-removed second training data is used to initialize the information retrieval model. This optimizes the performance of the information retrieval model, improves the accuracy of information retrieval results, and improves the efficiency of information retrieval.
In some embodiments, as shown in fig. 9, the apparatus further comprises:
an optimization unit 25, configured to optimize the information retrieval model through adversarial queries.
In some embodiments, as shown in fig. 10, the acquiring unit 21 includes:
An obtaining subunit 211, configured to obtain open data, where the open data includes a query instruction and a query result corresponding to the query instruction;
A first processing subunit 212, configured to train a query data generation model using the open data, the query data generation model being capable of generating, from an input query result, a query instruction corresponding to that query result;
The open data can be a publicly available dataset or can be collected from the web. For example, "question-answer" data on a question-answering website can be regarded as query instructions and their corresponding query results, and such data can be collected as training data for the query data generation model.
Unlike the first training data, the acquired open data does not have to come from the specific target domain (for example, the medical domain); it can come from other domains, such as the mechanical domain.
The query data generation model is a neural network model trained on the acquired open data. Given an input query result, it can generate a query instruction corresponding to that result; for example, if the input query result is "AAA is the president of country B", the generated output is the query instruction "Who is AAA?".
A second processing subunit 213, configured to input the document in the specific domain into the query data generation model, and generate the first training data.
Documents of a specific domain, including but not limited to the medical domain and the machine-building domain, can be input into the query data generation model as needed. Inputting domain documents into the query data generation model yields a "query instruction-query result" dataset, so a large amount of domain-specific "query instruction-query result" data can be generated, alleviating the information retrieval system's need for large amounts of labeled data and the high cost of manual labeling.
In some embodiments, as shown in fig. 11, the noise removing unit 22 includes:
A first initializing subunit 221, configured to initialize a noise classification model with the first training data;
A training subunit 222, configured to train the noise classification model;
in this embodiment, N iterations may be performed to obtain a trained noise classification model, where N is a positive integer.
As shown in fig. 4, in each iteration, noise in the first training data is removed by using a noise classification model, an information retrieval model is trained by using the data after removing the noise, and parameters of the noise classification model are updated by using a loss function of the trained information retrieval model, so that the noise classification model is optimized.
In this embodiment, the noise classification model can predict the probability p_j that a data item is noise, as follows:
p_j = π(a = 0 | θ)
where π is the function computed by the noise classification model, a = 1 indicates that the data item is not noise, and a = 0 indicates that it is noise.
The parameter θ of the noise classification model is updated with a loss function derived from the trained information retrieval model, where U_i is the data retained after removing noise from the first training data with the noise classification model in the i-th iteration, U_{i-1} is the data retained after the (i-1)-th iteration, and F is the performance metric used to evaluate the information retrieval model.
N may be a preset value, for example 50 or 100, or may be determined from the performance of the information retrieval model after each iteration: if the performance after the N-th iteration improves only marginally over the (N-1)-th iteration, the iteration stops.
And a removing subunit 223, configured to remove noise in the first training data by using the trained noise classification model.
After the noise in the first training data is removed by using the trained noise classification model, incorrect data in the first training data can be removed, and the accuracy of the information retrieval model is improved.
In some embodiments, as shown in fig. 12, the optimizing unit 25 includes:
a second initializing subunit 251, configured to initialize an irrelevant query generation model using the second training data, where the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output is a second query instruction irrelevant to the query result;
The second query instruction is a high-quality irrelevant query: it must be irrelevant to the query result while remaining textually similar to the first query instruction. The objective function of the irrelevant query generation model comprises the relevance between the second query instruction it generates and the query result, and the text similarity between that second query instruction and the first query instruction. The smaller the objective function, the better.
As shown in fig. 6, the input to the irrelevant query generation model is a query result, such as "iphoneX is produced by Apple Inc.", together with a query instruction related to that result, such as "Who is the manufacturer of iphoneX?". The output is another query instruction unrelated to the query result, such as "What is the color of iphoneX?", which is textually similar to "Who is the manufacturer of iphoneX?", i.e., a high-quality irrelevant query.
During initialization, the model is trained so that the generated query instructions are irrelevant to the query results but textually similar to the related query instructions, using the following objective function:
p(a = 1 | result, generated irrelevant query) + λ · d(related query, generated irrelevant query)
where the term p(a = 1 | result, generated irrelevant query) represents the relevance of the generated irrelevant query to the query result and can be obtained from the initialized information retrieval model, and d(related query, irrelevant query) is a text-similarity measure such as edit distance. Both terms should be as small as possible. λ is a weight coefficient that adjusts the importance of the second term, and its value can be tuned as needed.
An adversarial training subunit 252, configured to input the output of the information retrieval model into the irrelevant query generation model and to train the information retrieval model using the output of the irrelevant query generation model.
As shown in fig. 7, the information retrieval model is trained using the output of the irrelevant query generation model as training data, and the output of the information retrieval model is fed back into the irrelevant query generation model, so that the two models undergo adversarial training that optimizes the information retrieval model.
The input of the information retrieval model is a "query instruction-query result" pair, and its output is a probability that the query result is the correct result for the query instruction. If the model were trained only on related query instructions and their corresponding query results, its accuracy would be limited. Training it additionally on the "irrelevant query instruction-query result" data output by the irrelevant query generation model pits the two models against each other, and iterating the two models mutually improves both.
Example III
The embodiment of the present invention further provides an information retrieval apparatus 30, as shown in fig. 13, including:
a processor 32, and
A memory 34, in which memory 34 computer program instructions are stored,
Wherein the computer program instructions, when executed by the processor, cause the processor 32 to perform the steps of:
Acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model by using the second training data;
and carrying out information retrieval by utilizing the information retrieval model.
Further, as shown in fig. 13, the information retrieval apparatus 30 further includes a network interface 31, an input device 33, a hard disk 35, and a display device 36.
The interfaces and devices described above may be interconnected by a bus architecture. The bus architecture may include any number of interconnected buses and bridges, linking together one or more central processing units (CPUs), represented by the processor 32, and various circuits of one or more memories, represented by the memory 34. The bus architecture may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, and thus provides connected communication among these components. Besides a data bus, it includes a power bus, a control bus, and a status signal bus, all of which are well known in the art and therefore not described in detail herein.
The network interface 31 may be connected to a network (e.g., the internet, a local area network, etc.), and may acquire related data, such as public data, etc., from the network and may be stored in the hard disk 35.
The input device 33 may receive various instructions entered by an operator and may be sent to the processor 32 for execution. The input device 33 may comprise a keyboard or a pointing device (e.g. a mouse, a trackball, a touch pad or a touch screen, etc.).
The display device 36 may display results from the execution of instructions by the processor 32.
The memory 34 is used for storing programs and data necessary for the operation of the operating system, and data such as intermediate results in the calculation process of the processor 32.
It will be appreciated that the memory 34 in embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be Read Only Memory (ROM), programmable Read Only Memory (PROM), erasable Programmable Read Only Memory (EPROM), electrically Erasable Programmable Read Only Memory (EEPROM), or flash memory, among others. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory 34 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 34 stores elements, executable modules or data structures, or a subset thereof, or an extended set thereof, operating system 341 and application programs 342.
The operating system 341 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 342 include various application programs, such as a Browser (Browser), etc., for implementing various application services. A program for implementing the method of the embodiment of the present invention may be included in the application program 342.
When calling and executing the application programs and data stored in the memory 34, specifically the programs or instructions stored in the application program 342, the processor 32: acquires first training data, where the first training data includes a query instruction and a query result corresponding to the query instruction; removes noise from the first training data to obtain second training data; initializes an information retrieval model using the second training data; and performs information retrieval using the information retrieval model.
Further, the processor 32 optimizes the information retrieval model through adversarial queries.
Further, the processor 32 acquires open data, where the open data includes a query instruction and a query result corresponding to the query instruction; trains a query data generation model using the open data, the query data generation model being capable of generating, from an input query result, a query instruction corresponding to that result; and inputs documents of a specific domain into the query data generation model to generate the first training data.
Further, the processor 32 initializes a noise classification model using the first training data, trains the noise classification model, and removes noise from the first training data using the trained noise classification model.
Further, the processor 32 performs N iterations to obtain a trained noise classification model, where N is a positive integer; in each iteration, noise is removed from the first training data using the noise classification model, the information retrieval model is trained on the noise-removed data, and the parameters of the noise classification model are updated using the loss function of the trained information retrieval model.
Further, the processor 32 initializes an irrelevant query generation model using the second training data, where the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output is a second query instruction irrelevant to the query result.
The objective function of the irrelevant query generation model comprises:
the relevance of the second query instruction generated by the uncorrelated query generation model and the query result;
And the text similarity of the second query instruction and the first query instruction generated by the irrelevant query generation model.
The method disclosed in the above embodiment of the present invention may be applied to the processor 32 or implemented by the processor 32. The processor 32 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware in processor 32 or by instructions in the form of software. The processor 32 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components that may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 34 and the processor 32 reads the information in the memory 34 and in combination with its hardware performs the steps of the method described above.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Example IV
The embodiment of the invention also provides a computer readable storage medium storing a computer program, which when being executed by a processor, causes the processor to execute the steps of:
Acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model by using the second training data;
and carrying out information retrieval by utilizing the information retrieval model.
The foregoing describes preferred embodiments of the present invention. It should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the present invention.

Claims (9)

1.一种信息检索方法,其特征在于,包括:1. An information retrieval method, comprising: 获取第一训练数据,所述第一训练数据包括查询指令和与所述查询指令对应的查询结果;Acquire first training data, where the first training data includes a query instruction and a query result corresponding to the query instruction; 清除所述第一训练数据中的噪声,得到第二训练数据;removing noise from the first training data to obtain second training data; 利用所述第二训练数据初始化信息检索模型;Initializing an information retrieval model using the second training data; 利用所述信息检索模型进行信息检索;Performing information retrieval using the information retrieval model; 其中,初始化信息检索模型之后,所述方法还包括:After initializing the information retrieval model, the method further includes: 通过对抗式查询对所述信息检索模型进行优化,包括:Optimizing the information retrieval model through adversarial querying includes: 利用所述第二训练数据初始化不相关查询生成模型,所述不相关查询生成模型的输入是查询结果和与所述查询结果相关的第一查询指令,输出是与所述查询结果不相关的第二查询指令;Initializing an irrelevant query generation model using the second training data, wherein the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output of the irrelevant query generation model is a second query instruction irrelevant to the query result; 将所述信息检索模型的输出结果输入所述不相关查询生成模型,利用所述不相关查询生成模型的输出结果对所述信息检索模型进行训练。The output result of the information retrieval model is input into the irrelevant query generation model, and the information retrieval model is trained using the output result of the irrelevant query generation model. 2.根据权利要求1所述的信息检索方法,其特征在于,所述获取第一训练数据包括:2. The information retrieval method according to claim 1, wherein obtaining the first training data comprises: 获取开放数据,所述开放数据包括查询指令和与所述查询指令对应的查询结果;Acquire open data, the open data including a query instruction and a query result corresponding to the query instruction; 利用所述开放数据训练生成查询数据生成模型,所述查询数据生成模型能够根据输入的查询结果生成与所述查询结果对应的查询指令;A query data generation model is generated by training the open data, wherein the query data generation model can generate a query instruction corresponding to the query result according to the input query result; 将特定领域的文档输入所述查询数据生成模型,生成所述第一训练数据。Documents in a specific field are input into the query data generation model to generate the first training data. 3.根据权利要求1所述的信息检索方法,其特征在于,所述清除所述第一训练数据中的噪声包括:3. The information retrieval method according to claim 1, wherein removing noise from the first training data comprises: 利用所述第一训练数据初始化噪声分类模型;Initializing a noise classification model using the first training data; 对所述噪声分类模型进行训练;Training the noise classification model; 利用训练后的噪声分类模型清除所述第一训练数据中的噪声。The noise in the first training data is removed using the trained noise classification model. 4.根据权利要求3所述的信息检索方法,其特征在于,所述对所述噪声分类模型进行训练包括:4. The information retrieval method according to claim 3, characterized in that the training of the noise classification model comprises: 进行N次迭代,得到训练后的噪声分类模型,N为正整数;Perform N iterations to obtain the trained noise classification model, where N is a positive integer; 其中,在每次迭代中,利用所述噪声分类模型清除所述第一训练数据中的噪声,利用清除噪声后的数据训练所述信息检索模型,利用训练后的所述信息检索模型的损失函数更新所述噪声分类模型的参数。In each iteration, the noise classification model is used to remove the noise in the first training data, the information retrieval model is trained using the noise-removed data, and the parameters of the noise classification model are updated using the loss function of the trained information retrieval model. 5.根据权利要求1所述的信息检索方法,其特征在于,所述不相关查询生成模型的目标函数包括:5. 
The information retrieval method according to claim 1, wherein the objective function of the irrelevant query generation model comprises: 所述不相关查询生成模型生成的第二查询指令与查询结果的相关性;The correlation between the second query instruction generated by the irrelevant query generation model and the query result; 所述不相关查询生成模型生成的第二查询指令与第一查询指令的文本相似性。The text similarity between the second query instruction generated by the unrelated query generation model and the first query instruction. 6.一种信息检索装置,其特征在于,包括:6. An information retrieval device, comprising: 获取单元,用于获取第一训练数据,所述第一训练数据包括查询指令和与所述查询指令对应的查询结果;An acquisition unit, configured to acquire first training data, wherein the first training data includes a query instruction and a query result corresponding to the query instruction; 噪声清除单元,用于清除所述第一训练数据中的噪声,得到第二训练数据;a noise removal unit, configured to remove noise from the first training data to obtain second training data; 初始化单元,用于利用所述第二训练数据初始化信息检索模型;an initialization unit, used to initialize the information retrieval model using the second training data; 信息检索单元,用于利用所述信息检索模型进行信息检索;An information retrieval unit, used for performing information retrieval using the information retrieval model; 优化单元,用于通过对抗式查询对所述信息检索模型进行优化,所述优化单元包括:An optimization unit, used to optimize the information retrieval model through adversarial query, the optimization unit comprising: 第二初始化子单元,用于利用所述第二训练数据初始化不相关查询生成模型,所述不相关查询生成模型的输入是查询结果和与所述查询结果相关的第一查询指令,输出是与所述查询结果不相关的第二查询指令;A second initialization subunit is used to initialize an irrelevant query generation model using the second training data, wherein the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output of the irrelevant query generation model is a second query instruction irrelevant to the query result; 对抗训练子单元,用于将所述信息检索模型的输出结果输入所述不相关查询生成模型,利用所述不相关查询生成模型的输出结果对所述信息检索模型进行训练。The adversarial training subunit is used to input the output result of the information retrieval model into the irrelevant query generation model, and train the information retrieval model using the output result of the irrelevant query generation model. 7.根据权利要求6所述的信息检索装置,其特征在于,所述获取单元包括:7. The information retrieval device according to claim 6, characterized in that the acquisition unit comprises: 获取子单元,用于获取开放数据,所述开放数据包括查询指令和与所述查询指令对应的查询结果;An acquisition subunit, used to acquire open data, wherein the open data includes a query instruction and a query result corresponding to the query instruction; 第一处理子单元,用于利用所述开放数据训练生成查询数据生成模型,所述查询数据生成模型能够根据输入的查询结果生成与所述查询结果对应的查询指令;A first processing sub-unit is used to generate a query data generation model by training the open data, wherein the query data generation model can generate a query instruction corresponding to the query result according to the input query result; 第二处理子单元,用于将特定领域的文档输入所述查询数据生成模型,生成所述第一训练数据。The second processing subunit is used to input documents in a specific field into the query data generation model to generate the first training data. 8.根据权利要求6所述的信息检索装置,其特征在于,所述噪声清除单元包括:8. The information retrieval device according to claim 6, characterized in that the noise removal unit comprises: 第一初始化子单元,用于利用所述第一训练数据初始化噪声分类模型;A first initialization subunit, used to initialize a noise classification model using the first training data; 训练子单元,用于对所述噪声分类模型进行训练;A training subunit, used for training the noise classification model; 清除子单元,用于利用训练后的噪声分类模型清除所述第一训练数据中的噪声。The cleaning subunit is used to clean the noise in the first training data by using the trained noise classification model. 
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the information retrieval method according to any one of claims 1 to 5.
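To make the claimed pipeline concrete, the three sketches that follow walk through its main stages in Python. First, the synthetic-data step of claim 2: a sequence-to-sequence model, assumed here to be a T5 checkpoint that has already been fine-tuned on open (query result → query instruction) pairs, generates query instructions for documents of the target domain. The checkpoint name, the "generate query:" prompt prefix, and the decoding settings are illustrative assumptions, not details fixed by the claims.

```python
# Hedged sketch of the synthetic-data step in claim 2. The seq2seq model is
# assumed to have been fine-tuned on open (query, result) pairs beforehand;
# "t5-small" stands in for that checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def generate_first_training_data(domain_documents, queries_per_doc=3):
    """Return synthetic (query instruction, query result) pairs; each domain
    document plays the role of the query result for its generated queries."""
    pairs = []
    for doc in domain_documents:
        inputs = tokenizer("generate query: " + doc,
                           return_tensors="pt", truncation=True, max_length=512)
        outputs = model.generate(**inputs,
                                 num_return_sequences=queries_per_doc,
                                 do_sample=True, top_p=0.95, max_new_tokens=32)
        for seq in outputs:
            query = tokenizer.decode(seq, skip_special_tokens=True)
            pairs.append((query, doc))
    return pairs

# Toy usage: the resulting noisy pairs are the "first training data".
docs = ["A multifunction printer supports duplex scanning and stapling."]
print(generate_first_training_data(docs, queries_per_doc=2))
```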
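Next, the denoising loop of claims 3 and 4. The claims couple a noise classification model to the retrieval model over N iterations but leave open how the retrieval loss updates the classifier's parameters; the sketch below adopts a REINFORCE-style update as one plausible reading, and every module shape, hyperparameter, and the toy bilinear retriever are assumptions.

```python
# Hedged sketch of the N-iteration denoising loop of claims 3 and 4.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 64  # assumed embedding width for pre-encoded queries and results

noise_clf = nn.Sequential(nn.Linear(2 * EMB, 32), nn.ReLU(), nn.Linear(32, 1))
retriever = nn.Bilinear(EMB, EMB, 1)  # toy stand-in for the retrieval model
opt_clf = torch.optim.Adam(noise_clf.parameters(), lr=1e-3)
opt_ret = torch.optim.Adam(retriever.parameters(), lr=1e-3)

def iterate(queries, results, labels, n_iters=5):
    for _ in range(n_iters):
        # 1) Classify each (query, result) pair as clean or noisy.
        logits = noise_clf(torch.cat([queries, results], dim=-1)).squeeze(-1)
        keep_prob = torch.sigmoid(logits)
        keep = torch.bernoulli(keep_prob)  # 1 = treat the pair as clean
        mask = keep.bool()
        if not mask.any():
            continue
        # 2) Train the retriever only on the pairs kept as clean.
        scores = retriever(queries[mask], results[mask]).squeeze(-1)
        ret_loss = F.binary_cross_entropy_with_logits(scores, labels[mask])
        opt_ret.zero_grad(); ret_loss.backward(); opt_ret.step()
        # 3) Claim 4: update the classifier from the retrieval loss —
        #    here, reward keep/drop decisions that lowered that loss.
        reward = -ret_loss.detach()
        log_prob = keep * torch.log(keep_prob + 1e-8) \
                 + (1 - keep) * torch.log(1 - keep_prob + 1e-8)
        clf_loss = -(reward * log_prob).mean()
        opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()

# Toy usage with random embeddings and binary relevance labels.
q, r = torch.randn(16, EMB), torch.randn(16, EMB)
y = torch.randint(0, 2, (16,)).float()
iterate(q, r, y)
```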
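Finally, the adversarial-query optimization of claims 1 and 5: the generator is pushed to emit a second query that stays textually close to the first query while being irrelevant to the query result (the two terms of the claim-5 objective), and the retriever learns to rank the genuine query above this hard negative. The loss weights, ranking margin, and toy encoders are assumptions for illustration only.

```python
# Hedged sketch of the adversarial loop of claims 1 and 5; model classes,
# weights, and shapes are illustrative, not the patented implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 64  # assumed embedding width

class Retriever(nn.Module):
    """Scores (query, result) relevance; stands in for the retrieval model."""
    def __init__(self):
        super().__init__()
        self.q_enc, self.r_enc = nn.Linear(EMB, EMB), nn.Linear(EMB, EMB)
    def forward(self, query, result):
        return F.cosine_similarity(self.q_enc(query), self.r_enc(result))

class IrrelevantQueryGenerator(nn.Module):
    """Maps (result, relevant first query) -> embedding of a second query."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * EMB, EMB)
    def forward(self, result, first_query):
        return self.net(torch.cat([result, first_query], dim=-1))

retriever, generator = Retriever(), IrrelevantQueryGenerator()
opt_r = torch.optim.Adam(retriever.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

def generator_step(result, first_query):
    """Claim 5 objective: low relevance to the result, high similarity to
    the first query; the 1.0/0.5 weights are assumptions."""
    second_query = generator(result, first_query)
    relevance = retriever(second_query, result).mean()                   # want low
    similarity = F.cosine_similarity(second_query, first_query).mean()   # want high
    loss = 1.0 * relevance - 0.5 * similarity
    opt_g.zero_grad(); loss.backward(); opt_g.step()

def retriever_step(result, first_query):
    """Claim 1: rank the true query above the generated irrelevant query."""
    with torch.no_grad():
        second_query = generator(result, first_query)
    pos, neg = retriever(first_query, result), retriever(second_query, result)
    loss = F.relu(0.2 - (pos - neg)).mean()  # 0.2 margin is an assumption
    opt_r.zero_grad(); loss.backward(); opt_r.step()

# Toy usage with random embeddings in place of encoded text.
result, first_query = torch.randn(8, EMB), torch.randn(8, EMB)
for _ in range(3):
    generator_step(result, first_query)
    retriever_step(result, first_query)
```

Alternating the two steps mirrors the adversarial setup of claim 1: a generator that keeps manufacturing harder negatives forces the retriever to rely on genuine relevance signals rather than surface overlap with the query text.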
CN202010970977.1A 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium Active CN114186015B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010970977.1A CN114186015B (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium
JP2021149311A JP7230979B2 (en) 2020-09-15 2021-09-14 Information retrieval method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010970977.1A CN114186015B (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114186015A CN114186015A (en) 2022-03-15
CN114186015B true CN114186015B (en) 2025-06-03

Family

ID=80539270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010970977.1A Active CN114186015B (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium

Country Status (2)

Country Link
JP (1) JP7230979B2 (en)
CN (1) CN114186015B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275972A (en) * 2019-06-17 2019-09-24 浙江工业大学 A Content-Based Instance Retrieval Method Introducing Adversarial Training
CN110413757A * 2019-07-30 2019-11-05 中国工商银行股份有限公司 Word paraphrase determination method, apparatus and system
CN110807332A (en) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 Training method of semantic understanding model, semantic processing method, semantic processing device and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3944159B2 (en) * 2003-12-25 2007-07-11 株式会社東芝 Question answering system and program
US9104733B2 (en) * 2012-11-29 2015-08-11 Microsoft Technology Licensing, Llc Web search ranking
US10394901B2 (en) * 2013-03-20 2019-08-27 Walmart Apollo, Llc Method and system for resolving search query ambiguity in a product search engine
WO2016103451A1 (en) * 2014-12-26 2016-06-30 株式会社日立製作所 Method and device for acquiring relevant information and storage medium
US11222277B2 (en) * 2016-01-29 2022-01-11 International Business Machines Corporation Enhancing robustness of pseudo-relevance feedback models using query drift minimization
CN110019658B (en) * 2017-07-31 2023-01-20 腾讯科技(深圳)有限公司 Method and related device for generating search term
US10657259B2 (en) * 2017-11-01 2020-05-19 International Business Machines Corporation Protecting cognitive systems from gradient based attacks through the use of deceiving gradients
CN108446334B (en) * 2018-02-23 2021-08-03 浙江工业大学 A Content-Based Image Retrieval Method with Unsupervised Adversarial Training
CN109697257A * 2018-12-18 2019-04-30 天罡网(北京)安全科技有限公司 Noise-resistant network information retrieval method based on pre-classification and feature learning
JP2020098521A (en) * 2018-12-19 2020-06-25 富士通株式会社 Information processing apparatus, data extraction method, and data extraction program

Also Published As

Publication number Publication date
JP7230979B2 (en) 2023-03-01
JP2022049010A (en) 2022-03-28
CN114186015A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN109800307B (en) Analysis method, device, computer equipment and storage medium for product evaluation
US11861308B2 (en) Mapping natural language utterances to operations over a knowledge graph
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
US9311823B2 (en) Caching natural language questions and results in a question and answer system
CN107133290B Personalized search method and device
CN111026319B (en) Intelligent text processing method and device, electronic equipment and storage medium
CN108897852B (en) Method, device and equipment for judging continuity of conversation content
CN114707007B Image-text retrieval method, device and computer storage medium
JP2021508391A5 (en)
CN110674306A (en) Method, device and electronic device for constructing knowledge graph
EP4575822A1 (en) Data source mapper for enhanced data retrieval
WO2025189617A1 (en) Request processing method and apparatus, device, and storage medium
CN117633518B (en) Industrial chain construction method and system
CN112988952B (en) Multi-level-length text vector retrieval method and device and electronic equipment
CN111562943A (en) Code clone detection method and device based on event embedded tree and GAT network
CN119248926A (en) Medical record text recommendation method, device, equipment and storage medium
CN116361446A (en) A method, device and electronic device for generating a text summary
CN114186015B (en) Information retrieval method, device and computer readable storage medium
CN115329762A (en) Noise word recognition method, device, electronic device and storage medium
CN114691829A (en) Information query method and device
CN113297854A (en) Method, device and equipment for mapping text to knowledge graph entity and storage medium
US20250110806A1 (en) Composed api requests with zero-shot language model interfaces
CN119719313A Search method and device for arbitration problems
CN106156141B (en) Method and device for constructing semantic query word template
CN118839768A (en) Information processing method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant