
CN114186015B - Information retrieval method, device and computer readable storage medium - Google Patents


Info

Publication number
CN114186015B
Authority
CN
China
Prior art keywords
query
information retrieval
model
data
training data
Prior art date
Legal status
Active
Application number
CN202010970977.1A
Other languages
Chinese (zh)
Other versions
CN114186015A (en)
Inventor
丁磊
童毅轩
董滨
姜珊珊
张永伟
Current Assignee
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to CN202010970977.1A (CN114186015B)
Priority to JP2021149311A (JP7230979B2)
Publication of CN114186015A
Application granted
Publication of CN114186015B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The present invention provides an information retrieval method, device and computer-readable storage medium. The information retrieval method provided by the present invention comprises: obtaining first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction; removing noise in the first training data to obtain second training data; initializing an information retrieval model using the second training data; and performing information retrieval using the information retrieval model. The technical solution of the present invention can improve the accuracy of information retrieval results and improve the efficiency of information retrieval.

Description

Information retrieval method, apparatus and computer readable storage medium
Technical Field
The present invention relates to the field of information retrieval, and in particular, to an information retrieval method, apparatus, and computer readable storage medium.
Background
Information retrieval is an important technology, widely applied in search engines, question-answering systems, recommendation systems, and various other intelligent services. With better information retrieval techniques, vendors can more accurately understand customer intent and provide appropriate products or services.
At present, the mainstream approach to information retrieval judges the semantic relevance between a user query and a document with a large-scale neural network model. Training such a model requires a large amount of labeled data, but manual labeling is costly. The related art proposes constructing training annotations with a generation-based method. However, the generated data usually contains noise, and the relevance of the negative samples in the generated data is insufficient, which degrades the effect of information retrieval.
Disclosure of Invention
The technical problem to be solved by the embodiment of the invention is to provide an information retrieval method, an information retrieval device and a computer readable storage medium, which can improve the accuracy of information retrieval results and the efficiency of information retrieval.
According to an aspect of an embodiment of the present invention, there is provided an information retrieval method including:
Acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model by using the second training data;
and carrying out information retrieval by utilizing the information retrieval model.
Furthermore, in accordance with at least one embodiment of the present invention, after initializing the information retrieval model, the method further comprises:
optimizing the information retrieval model through adversarial queries.
Furthermore, in accordance with at least one embodiment of the present invention, the acquiring the first training data includes:
acquiring open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
training a query data generation model using the open data, wherein the query data generation model can generate, from an input query result, a query instruction corresponding to that query result;
And inputting the documents in the specific field into the query data generation model to generate the first training data.
Furthermore, in accordance with at least one embodiment of the present invention, the removing noise in the first training data includes:
initializing a noise classification model by using the first training data;
Training the noise classification model;
And removing noise in the first training data by using the trained noise classification model.
Furthermore, in accordance with at least one embodiment of the present invention, the training the noise classification model comprises:
performing N iterations to obtain a trained noise classification model, wherein N is a positive integer;
And in each iteration, removing noise in the first training data by using the noise classification model, training the information retrieval model by using the data after removing the noise, and updating parameters of the noise classification model by using a loss function of the information retrieval model after training.
Furthermore, in accordance with at least one embodiment of the present invention, the optimizing of the information retrieval model through adversarial queries includes:
initializing an irrelevant query generation model by using the second training data, wherein the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output of the irrelevant query generation model is a second query instruction irrelevant to the query result;
And inputting the output result of the information retrieval model into the irrelevant query generation model, and training the information retrieval model by utilizing the output result of the irrelevant query generation model.
Furthermore, in accordance with at least one embodiment of the present invention, the objective function of the irrelevant query generation model comprises:
the relevance between the second query instruction generated by the irrelevant query generation model and the query result;
and the text similarity between the second query instruction generated by the irrelevant query generation model and the first query instruction.
According to another aspect of an embodiment of the present invention, there is provided an information retrieval apparatus including:
the acquisition unit is used for acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
The noise removing unit is used for removing noise in the first training data to obtain second training data;
an initializing unit, configured to initialize an information retrieval model using the second training data;
And the information retrieval unit is used for retrieving information by utilizing the information retrieval model.
Furthermore, in accordance with at least one embodiment of the present invention, the apparatus further comprises:
and the optimizing unit, configured to optimize the information retrieval model through adversarial queries.
Furthermore, according to at least one embodiment of the present invention, the acquisition unit includes:
The acquisition subunit is used for acquiring open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
The first processing subunit, configured to train a query data generation model using the open data, the query data generation model being capable of generating, from an input query result, a query instruction corresponding to that query result;
And the second processing subunit is used for inputting the documents in the specific field into the query data generation model to generate the first training data.
Furthermore, according to at least one embodiment of the present invention, the noise removing unit includes:
A first initializing subunit, configured to initialize a noise classification model using the first training data;
The training subunit is used for training the noise classification model;
and the clearing subunit is used for clearing noise in the first training data by using the trained noise classification model.
Furthermore, according to at least one embodiment of the present invention, the optimizing unit includes:
A second initialization subunit, configured to initialize an irrelevant query generation model using the second training data, where the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output is a second query instruction irrelevant to the query result;
And the adversarial training subunit, configured to input the output of the information retrieval model into the irrelevant query generation model and to train the information retrieval model using the output of the irrelevant query generation model.
The embodiment of the invention also provides an information retrieval device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the steps of the information retrieval method when being executed by the processor.
Embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the information retrieval method as described above.
Compared with the prior art, in the information retrieval method, apparatus, and computer-readable storage medium provided by the embodiments of the present invention, after the first training data for training the information retrieval model is obtained, it is not used directly to build the model: noise in the first training data is removed first, and the information retrieval model is initialized with the noise-removed second training data. This optimizes the performance of the information retrieval model, improves the accuracy of retrieval results, and improves retrieval efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an information retrieval method according to an embodiment of the present invention;
FIG. 2 is a flowchart of acquiring first training data according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a process for removing noise from first training data according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of training a noise classification model according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of optimizing an information retrieval model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of generating an irrelevant query according to an embodiment of the invention;
FIG. 7 is a schematic diagram of an information retrieval model and an irrelevant query generation model for countermeasure training in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of an information retrieval device according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another structure of an information retrieval device according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an acquisition unit according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a noise removing unit according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an optimizing unit according to an embodiment of the present invention;
Fig. 13 is a schematic diagram of still another structure of an information retrieval device according to an embodiment of the present invention.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as specific configurations and components are provided merely to facilitate a thorough understanding of embodiments of the invention. It will be apparent to those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the following processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
To address the two problems that an information retrieval system requires a large amount of labeled data and that manual labeling is costly, a query data generation model can be trained on open data and used to generate queries for documents in the target domain; the generated queries are then paired with their documents to construct "query-result" data pairs for training the information retrieval model.
However, this approach has two problems. First, the generated data contains noise. Second, the generated queries serve only as related queries, while unrelated queries are constructed by randomly selecting queries of other documents, so the quantity and quality of the unrelated queries cannot meet the requirements. High-quality unrelated queries, which are textually similar to related queries but unrelated to the content of the query results, can effectively improve an information retrieval system. For example, for the query result "iphoneX is produced by Apple Inc.", the corresponding related query is "Who is the manufacturer of iphoneX?", a corresponding high-quality unrelated query is "What is the color of iphoneX?", and a low-quality unrelated query is "Who is AAA?".
Embodiments of the present invention provide an information retrieval method, apparatus, and computer-readable storage medium, which can improve accuracy of information retrieval results and improve efficiency of information retrieval.
Example 1
An embodiment of the present invention provides an information retrieval method, as shown in fig. 1, including:
step 101, acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
For example, "is the manufacturer of iphoneX" the query instruction, "is the query result" iphoneX produced by apple inc "corresponding thereto. Wherein the first training data is a "query instruction-query result" data pair for a particular target domain.
As shown in fig. 2, acquiring the first training data includes the steps of:
Step 1011, obtaining open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
The open data can be a publicly available dataset or can be collected from the web. For example, "question-answer" data on a question-answering website can be regarded as query instructions and their corresponding query results, and such data can be collected as training data for the query data generation model.
Unlike the first training data, the acquired open data does not have to come from the specific target domain (for example, the medical domain); it can come from other domains, such as the mechanical domain.
Step 1012, generating a query data generation model by utilizing the open data training, wherein the query data generation model can generate a query instruction corresponding to an input query result according to the query result;
The query data generation model is a neural network model trained on the acquired open data. Given an input query result, it can generate a query instruction corresponding to that result; for example, if the input query result is "AAA is the president of country B", the generated output is the query instruction "Who is AAA?".
Step 1013, inputting the documents in the specific field into the query data generation model to generate the first training data.
Documents of a specific domain, including but not limited to the medical domain and the machine-building domain, can be input into the query data generation model as needed. Inputting domain documents into the query data generation model yields a "query instruction-query result" dataset, so a large amount of domain-specific "query instruction-query result" data can be generated, alleviating the information retrieval system's need for large amounts of labeled data and the high cost of manual labeling.
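As an illustration of steps 1012 and 1013, the generation step can be sketched with an off-the-shelf sequence-to-sequence model. The checkpoint name and generation settings below are assumptions for illustration only; the patent specifies just that a neural query data generation model is trained on open data:

```python
# A minimal sketch of query generation (assumption: a Hugging Face seq2seq
# checkpoint fine-tuned on open question-answer data; the model name below
# is a hypothetical placeholder, not from the patent).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "your-org/query-generator"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_first_training_data(domain_documents, max_len=64):
    """Generate (query instruction, query result) pairs from domain documents."""
    pairs = []
    for doc in domain_documents:
        inputs = tokenizer(doc, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_length=max_len, num_beams=4)
        query = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        pairs.append((query, doc))  # query instruction plus its query result
    return pairs

# Example with the patent's sample sentence as the domain document.
print(generate_first_training_data(["iphoneX is produced by Apple Inc."]))
```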
Step 102, removing noise in the first training data to obtain second training data;
If the domain-specific "query instruction-query result" dataset generated in step 101 contains noise (i.e., incorrect data), the accuracy of the information retrieval model suffers; therefore, the noise needs to be removed before the information retrieval model is initialized and trained. In this embodiment, the noise in the first training data can be removed with a noise classification model, which can be any text classification model able to distinguish whether a piece of data is noise.
As shown in fig. 3, in this embodiment, removing noise in the first training data includes the following steps:
step 1021, initializing a noise classification model by using the first training data;
step 1022, training the noise classification model;
in this embodiment, N iterations may be performed to obtain a trained noise classification model, where N is a positive integer.
As shown in fig. 4, in each iteration, noise in the first training data is removed by using a noise classification model, an information retrieval model is trained by using the data after removing the noise, and parameters of the noise classification model are updated by using a loss function of the trained information retrieval model, so that the noise classification model is optimized.
In this embodiment, the noise classification model can predict the probability p_j that a data item is noise, as follows:
p_j = π(a = 0 | θ)
where π is the function computed by the noise classification model, a = 1 indicates that the data item is not noise, and a = 0 indicates that it is noise.
The parameter θ of the noise classification model is updated with a loss function derived from the trained information retrieval model, where U_i is the data retained after removing noise from the first training data with the noise classification model in the i-th iteration, U_{i-1} is the data retained after the (i-1)-th iteration, and F is the performance metric used to evaluate the information retrieval model.
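The loss formula itself does not survive in this text. A policy-gradient-style form consistent with the definitions of U_i, U_{i-1}, F, and π, offered here as an assumption rather than the patent's verbatim equation, would be:

```latex
% Assumed reconstruction, not the patent's verbatim equation:
% the classifier is rewarded when the data it retains improves retrieval.
L(\theta) = -\bigl( F(U_i) - F(U_{i-1}) \bigr) \sum_{j \in U_i} \log \pi(a_j = 1 \mid \theta)
```

Under this form, the classifier is rewarded for keeping data exactly when doing so improves the retrieval metric F.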
N may be a preset value, for example 50 or 100, or may be determined from the performance of the information retrieval model after each iteration: if the performance after the N-th iteration improves only marginally over the (N-1)-th iteration, the iteration stops.
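To make the loop concrete, here is a minimal sketch of step 1022. The token-overlap feature, the logistic form of π, and the REINFORCE-style update driven by F(U_i) - F(U_{i-1}) are illustrative assumptions standing in for the patent's neural models and metric, not its actual implementation:

```python
import math

def noise_prob(example, theta):
    """pi(a=0 | theta): probability that a (query, result) pair is noise.
    Placeholder feature: token overlap between query and result."""
    query, result = example
    overlap = len(set(query.lower().split()) & set(result.lower().split()))
    return 1.0 / (1.0 + math.exp(theta["w"] * overlap + theta["b"]))

def evaluate_F(retained):
    """Placeholder for F: fraction of retained pairs sharing at least one token."""
    if not retained:
        return 0.0
    hits = sum(1 for q, r in retained
               if set(q.lower().split()) & set(r.lower().split()))
    return hits / len(retained)

def train_noise_classifier(first_training_data, N=50, lr=0.1):
    theta = {"w": 0.0, "b": 1.0}  # b > 0 so the classifier starts by keeping everything
    prev_F = 0.0
    for _ in range(N):
        # Keep examples the classifier considers non-noise (a = 1).
        U_i = [ex for ex in first_training_data if noise_prob(ex, theta) < 0.5]
        # A real system would train the retrieval model on U_i here;
        # evaluate_F stands in for its evaluated performance.
        F_i = evaluate_F(U_i)
        # REINFORCE-style update: raise pi(a=1) on the kept data,
        # scaled by how much retrieval performance changed.
        reward = F_i - prev_F
        for q, r in U_i:
            overlap = len(set(q.lower().split()) & set(r.lower().split()))
            p_noise = noise_prob((q, r), theta)
            theta["w"] += lr * reward * p_noise * overlap  # d log pi(a=1) / dw
            theta["b"] += lr * reward * p_noise            # d log pi(a=1) / db
        prev_F = F_i
    return theta
```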
Step 1023, removing noise in the first training data by using the trained noise classification model.
After the noise in the first training data is removed by using the trained noise classification model, incorrect data in the first training data can be removed, and the accuracy of the information retrieval model is improved.
Step 103, initializing an information retrieval model by using the second training data;
in this embodiment, the information retrieval model is initialized by using the second training data with noise removed, so that accuracy of information retrieval results can be improved, and efficiency of information retrieval can be improved.
To further improve performance, after the information retrieval model is initialized, the method also optimizes it through adversarial queries. As shown in fig. 5, optimizing the information retrieval model through adversarial queries includes the following steps:
Step 1051, initializing an irrelevant query generation model using the second training data, wherein the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output is a second query instruction irrelevant to the query result;
The second query instruction is a high-quality irrelevant query: it must be irrelevant to the query result while remaining textually similar to the first query instruction. The objective function of the irrelevant query generation model comprises the relevance between the second query instruction it generates and the query result, and the text similarity between that second query instruction and the first query instruction. The smaller the objective function, the better.
As shown in fig. 6, the input to the irrelevant query generation model is a query result, such as "iphoneX is produced by Apple Inc.", together with a query instruction related to that result, such as "Who is the manufacturer of iphoneX?". The output is another query instruction unrelated to the query result, such as "What is the color of iphoneX?", which is textually similar to "Who is the manufacturer of iphoneX?", i.e., a high-quality irrelevant query.
During initialization, the model is trained so that the generated query instructions are irrelevant to the query results but textually similar to the related query instructions, using the following objective function:
p(a = 1 | result, generated irrelevant query) + λ · d(related query, generated irrelevant query)
where the term p(a = 1 | result, generated irrelevant query) represents the relevance of the generated irrelevant query to the query result and can be obtained from the initialized information retrieval model, and d(related query, irrelevant query) is a text-similarity measure such as edit distance. Both terms should be as small as possible. λ is a weight coefficient that adjusts the importance of the second term, and its value can be tuned as needed.
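A minimal sketch of this objective, assuming the relevance term is supplied by the initialized information retrieval model (passed in as a plain number here) and using a normalized difflib ratio as one possible stand-in for the edit-distance term d:

```python
import difflib

def text_distance(related_query: str, generated_query: str) -> float:
    """d(related, generated): a normalized stand-in for edit distance;
    smaller when the two queries are textually closer."""
    ratio = difflib.SequenceMatcher(None, related_query, generated_query).ratio()
    return 1.0 - ratio  # 0 = identical text, 1 = completely different

def objective(relevance: float, related_query: str, generated_query: str,
              lam: float = 0.5) -> float:
    """p(a=1 | result, generated query) + lambda * d(related, generated).
    Both terms should be as small as possible."""
    return relevance + lam * text_distance(related_query, generated_query)

# A high-quality irrelevant query scores low on both terms: it is
# irrelevant to the result (low relevance) yet textually close to
# the related query (low d).
print(objective(0.05, "Who is the manufacturer of iphoneX?",
                "What is the color of iphoneX?"))
```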
Step 1052, inputting the output result of the information retrieval model into the irrelevant query generation model, and training the information retrieval model by using the output result of the irrelevant query generation model.
As shown in fig. 7, the information retrieval model is trained using the output of the irrelevant query generation model as training data, and the output of the information retrieval model is fed back into the irrelevant query generation model, so that the two models undergo adversarial training that optimizes the information retrieval model.
The input of the information retrieval model is a "query instruction-query result" pair, and its output is a probability that the query result is the correct result for the query instruction. If the model were trained only on related query instructions and their corresponding query results, its accuracy would be limited. Training it additionally on the "irrelevant query instruction-query result" data output by the irrelevant query generation model pits the two models against each other, and iterating the two models mutually improves both.
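The alternation described above can be summarized schematically. The RetrievalModel and IrrelevantQueryGenerator interfaces below (generate, train, score, update) are hypothetical placeholders for the patent's two neural models, not a known API:

```python
# Schematic adversarial loop for step 1052; the two classes are
# illustrative stand-ins, not a real implementation.
class RetrievalModel:
    def train(self, positives, negatives):
        pass  # fit on relevant pairs plus generated hard negatives

    def score(self, query, result):
        return 0.5  # placeholder for p(a=1 | query, result)

class IrrelevantQueryGenerator:
    def generate(self, data):
        # Produce one (irrelevant query, result) pair per (query, result).
        return [("generated query for: " + q, r) for q, r in data]

    def update(self, negatives, feedback):
        pass  # reinforce queries the retrieval model mistook for relevant

def adversarial_training(retrieval, generator, second_training_data, rounds=10):
    for _ in range(rounds):
        negatives = generator.generate(second_training_data)
        retrieval.train(second_training_data, negatives)
        # Retrieval scores are fed back so the generator learns to produce
        # queries the retrieval model still mistakes for relevant ones.
        feedback = [retrieval.score(q, r) for q, r in negatives]
        generator.update(negatives, feedback)
    return retrieval, generator
```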
Step 104, performing information retrieval using the information retrieval model.
In this embodiment, the information retrieval model can accurately judge the semantic relevance between a user's query instruction and a document.
In this embodiment, after the first training data is obtained, it is not used directly to build the information retrieval model: noise in the first training data is removed first, and the noise-removed second training data is used to initialize the information retrieval model. This optimizes the performance of the information retrieval model, improves the accuracy of information retrieval results, and improves the efficiency of information retrieval.
Example two
The embodiment of the invention also provides an information retrieval device, as shown in fig. 8, comprising:
An obtaining unit 21, configured to obtain first training data, where the first training data includes a query instruction and a query result corresponding to the query instruction;
For example, "is the manufacturer of iphoneX" the query instruction, "is the query result" iphoneX produced by apple inc "corresponding thereto. Wherein the first training data is a "query instruction-query result" data pair for a particular target domain.
A noise removing unit 22, configured to remove noise in the first training data to obtain second training data;
If the generated domain-specific "query instruction-query result" dataset contains noise (i.e., incorrect data), the accuracy of the information retrieval model suffers; therefore, the noise needs to be removed before the information retrieval model is initialized and trained. In this embodiment, the noise in the first training data can be removed with a noise classification model, which can be any text classification model able to distinguish whether a piece of data is noise.
An initializing unit 23 for initializing an information retrieval model using the second training data;
in this embodiment, the information retrieval model is initialized by using the second training data with noise removed, so that accuracy of information retrieval results can be improved, and efficiency of information retrieval can be improved.
An information retrieval unit 24 for retrieving information using the information retrieval model.
In this embodiment, the information retrieval model can accurately judge the semantic relevance between a user's query instruction and a document.
In this embodiment, after the first training data is obtained, it is not used directly to build the information retrieval model: noise in the first training data is removed first, and the noise-removed second training data is used to initialize the information retrieval model. This optimizes the performance of the information retrieval model, improves the accuracy of information retrieval results, and improves the efficiency of information retrieval.
In some embodiments, as shown in fig. 9, the apparatus further comprises:
an optimization unit 25, configured to optimize the information retrieval model through adversarial queries.
In some embodiments, as shown in fig. 10, the acquiring unit 21 includes:
An obtaining subunit 211, configured to obtain open data, where the open data includes a query instruction and a query result corresponding to the query instruction;
A first processing subunit 212, configured to train a query data generation model using the open data, the query data generation model being capable of generating, from an input query result, a query instruction corresponding to that query result;
The open data can be a publicly available dataset or can be collected from the web. For example, "question-answer" data on a question-answering website can be regarded as query instructions and their corresponding query results, and such data can be collected as training data for the query data generation model.
Unlike the first training data, the acquired open data does not have to come from the specific target domain (for example, the medical domain); it can come from other domains, such as the mechanical domain.
The query data generation model is a neural network model trained on the acquired open data. Given an input query result, it can generate a query instruction corresponding to that result; for example, if the input query result is "AAA is the president of country B", the generated output is the query instruction "Who is AAA?".
A second processing subunit 213, configured to input the document in the specific domain into the query data generation model, and generate the first training data.
Documents of a specific domain, including but not limited to the medical domain and the machine-building domain, can be input into the query data generation model as needed. Inputting domain documents into the query data generation model yields a "query instruction-query result" dataset, so a large amount of domain-specific "query instruction-query result" data can be generated, alleviating the information retrieval system's need for large amounts of labeled data and the high cost of manual labeling.
In some embodiments, as shown in fig. 11, the noise removing unit 22 includes:
A first initializing subunit 221, configured to initialize a noise classification model with the first training data;
A training subunit 222, configured to train the noise classification model;
in this embodiment, N iterations may be performed to obtain a trained noise classification model, where N is a positive integer.
As shown in fig. 4, in each iteration, noise in the first training data is removed by using a noise classification model, an information retrieval model is trained by using the data after removing the noise, and parameters of the noise classification model are updated by using a loss function of the trained information retrieval model, so that the noise classification model is optimized.
In this embodiment, the noise classification model can predict the probability p_j that a data item is noise, as follows:
p_j = π(a = 0 | θ)
where π is the function computed by the noise classification model, a = 1 indicates that the data item is not noise, and a = 0 indicates that it is noise.
The parameter θ of the noise classification model is updated with a loss function derived from the trained information retrieval model, where U_i is the data retained after removing noise from the first training data with the noise classification model in the i-th iteration, U_{i-1} is the data retained after the (i-1)-th iteration, and F is the performance metric used to evaluate the information retrieval model.
N may be a preset value, for example 50 or 100, or may be determined from the performance of the information retrieval model after each iteration: if the performance after the N-th iteration improves only marginally over the (N-1)-th iteration, the iteration stops.
And a removing subunit 223, configured to remove noise in the first training data by using the trained noise classification model.
After the noise in the first training data is removed by using the trained noise classification model, incorrect data in the first training data can be removed, and the accuracy of the information retrieval model is improved.
In some embodiments, as shown in fig. 12, the optimizing unit 25 includes:
a second initializing subunit 251, configured to initialize an irrelevant query generation model using the second training data, where the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output is a second query instruction irrelevant to the query result;
The second query instruction is a high-quality irrelevant query: it must be irrelevant to the query result while remaining textually similar to the first query instruction. The objective function of the irrelevant query generation model comprises the relevance between the second query instruction it generates and the query result, and the text similarity between that second query instruction and the first query instruction. The smaller the objective function, the better.
As shown in fig. 6, the input to the irrelevant query generation model is a query result, such as "iphoneX is produced by Apple Inc.", together with a query instruction related to that result, such as "Who is the manufacturer of iphoneX?". The output is another query instruction unrelated to the query result, such as "What is the color of iphoneX?", which is textually similar to "Who is the manufacturer of iphoneX?", i.e., a high-quality irrelevant query.
During initialization, the model is trained so that the generated query instructions are irrelevant to the query results but textually similar to the related query instructions, using the following objective function:
p(a = 1 | result, generated irrelevant query) + λ · d(related query, generated irrelevant query)
where the term p(a = 1 | result, generated irrelevant query) represents the relevance of the generated irrelevant query to the query result and can be obtained from the initialized information retrieval model, and d(related query, irrelevant query) is a text-similarity measure such as edit distance. Both terms should be as small as possible. λ is a weight coefficient that adjusts the importance of the second term, and its value can be tuned as needed.
An adversarial training subunit 252, configured to input the output of the information retrieval model into the irrelevant query generation model and to train the information retrieval model using the output of the irrelevant query generation model.
As shown in fig. 7, the information retrieval model is trained using the output of the irrelevant query generation model as training data, and the output of the information retrieval model is fed back into the irrelevant query generation model, so that the two models undergo adversarial training that optimizes the information retrieval model.
The input of the information retrieval model is a "query instruction-query result" pair, and its output is a probability that the query result is the correct result for the query instruction. If the model were trained only on related query instructions and their corresponding query results, its accuracy would be limited. Training it additionally on the "irrelevant query instruction-query result" data output by the irrelevant query generation model pits the two models against each other, and iterating the two models mutually improves both.
Example III
The embodiment of the present invention further provides an information retrieval apparatus 30, as shown in fig. 13, including:
a processor 32, and
A memory 34, in which memory 34 computer program instructions are stored,
Wherein the computer program instructions, when executed by the processor, cause the processor 32 to perform the steps of:
Acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model by using the second training data;
and carrying out information retrieval by utilizing the information retrieval model.
Further, as shown in fig. 13, the information retrieval apparatus 30 further includes a network interface 31, an input device 33, a hard disk 35, and a display device 36.
The interfaces and devices described above may be interconnected by a bus architecture. The bus architecture may include any number of interconnected buses and bridges, linking together one or more central processing units (CPUs), represented by the processor 32, and various circuits of one or more memories, represented by the memory 34. The bus architecture may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, and thus provides connected communication among these components. Besides a data bus, it includes a power bus, a control bus, and a status signal bus, all of which are well known in the art and therefore not described in detail herein.
The network interface 31 may be connected to a network (e.g., the internet, a local area network, etc.), and may acquire related data, such as public data, etc., from the network and may be stored in the hard disk 35.
The input device 33 may receive various instructions entered by an operator and may be sent to the processor 32 for execution. The input device 33 may comprise a keyboard or a pointing device (e.g. a mouse, a trackball, a touch pad or a touch screen, etc.).
The display device 36 may display results from the execution of instructions by the processor 32.
The memory 34 is used for storing programs and data necessary for the operation of the operating system, and data such as intermediate results in the calculation process of the processor 32.
It will be appreciated that the memory 34 in embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be Read Only Memory (ROM), programmable Read Only Memory (PROM), erasable Programmable Read Only Memory (EPROM), electrically Erasable Programmable Read Only Memory (EEPROM), or flash memory, among others. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory 34 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 34 stores elements, executable modules or data structures, or a subset thereof, or an extended set thereof, operating system 341 and application programs 342.
The operating system 341 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 342 include various application programs, such as a Browser (Browser), etc., for implementing various application services. A program for implementing the method of the embodiment of the present invention may be included in the application program 342.
When calling and executing the application programs and data stored in the memory 34, specifically the programs or instructions stored in the application program 342, the processor 32: acquires first training data, where the first training data includes a query instruction and a query result corresponding to the query instruction; removes noise from the first training data to obtain second training data; initializes an information retrieval model using the second training data; and performs information retrieval using the information retrieval model.
Further, the processor 32 optimizes the information retrieval model through adversarial queries.
Further, the processor 32 acquires open data, where the open data includes a query instruction and a query result corresponding to the query instruction; trains a query data generation model using the open data, the query data generation model being capable of generating, from an input query result, a query instruction corresponding to that result; and inputs documents of a specific domain into the query data generation model to generate the first training data.
Further, the processor 32 initializes a noise classification model using the first training data, trains the noise classification model, and removes noise from the first training data using the trained noise classification model.
Further, the processor 32 performs N iterations to obtain a trained noise classification model, where N is a positive integer; in each iteration, noise is removed from the first training data using the noise classification model, the information retrieval model is trained on the noise-removed data, and the parameters of the noise classification model are updated using the loss function of the trained information retrieval model.
Further, the processor 32 initializes an irrelevant query generation model using the second training data, where the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output is a second query instruction irrelevant to the query result.
The objective function of the irrelevant query generation model comprises:
the relevance of the second query instruction generated by the uncorrelated query generation model and the query result;
And the text similarity of the second query instruction and the first query instruction generated by the irrelevant query generation model.
The method disclosed in the above embodiment of the present invention may be applied to the processor 32 or implemented by the processor 32. The processor 32 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware in processor 32 or by instructions in the form of software. The processor 32 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components that may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 34 and the processor 32 reads the information in the memory 34 and in combination with its hardware performs the steps of the method described above.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Example IV
The embodiment of the invention also provides a computer readable storage medium storing a computer program, which when being executed by a processor, causes the processor to execute the steps of:
Acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model by using the second training data;
and carrying out information retrieval by utilizing the information retrieval model.
The foregoing describes preferred embodiments of the present invention. It should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the present invention.

Claims (9)

1.一种信息检索方法,其特征在于,包括:1. An information retrieval method, comprising: 获取第一训练数据,所述第一训练数据包括查询指令和与所述查询指令对应的查询结果;Acquire first training data, where the first training data includes a query instruction and a query result corresponding to the query instruction; 清除所述第一训练数据中的噪声,得到第二训练数据;removing noise from the first training data to obtain second training data; 利用所述第二训练数据初始化信息检索模型;Initializing an information retrieval model using the second training data; 利用所述信息检索模型进行信息检索;Performing information retrieval using the information retrieval model; 其中,初始化信息检索模型之后,所述方法还包括:After initializing the information retrieval model, the method further includes: 通过对抗式查询对所述信息检索模型进行优化,包括:Optimizing the information retrieval model through adversarial querying includes: 利用所述第二训练数据初始化不相关查询生成模型,所述不相关查询生成模型的输入是查询结果和与所述查询结果相关的第一查询指令,输出是与所述查询结果不相关的第二查询指令;Initializing an irrelevant query generation model using the second training data, wherein the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output of the irrelevant query generation model is a second query instruction irrelevant to the query result; 将所述信息检索模型的输出结果输入所述不相关查询生成模型,利用所述不相关查询生成模型的输出结果对所述信息检索模型进行训练。The output result of the information retrieval model is input into the irrelevant query generation model, and the information retrieval model is trained using the output result of the irrelevant query generation model. 2.根据权利要求1所述的信息检索方法,其特征在于,所述获取第一训练数据包括:2. The information retrieval method according to claim 1, wherein obtaining the first training data comprises: 获取开放数据,所述开放数据包括查询指令和与所述查询指令对应的查询结果;Acquire open data, the open data including a query instruction and a query result corresponding to the query instruction; 利用所述开放数据训练生成查询数据生成模型,所述查询数据生成模型能够根据输入的查询结果生成与所述查询结果对应的查询指令;A query data generation model is generated by training the open data, wherein the query data generation model can generate a query instruction corresponding to the query result according to the input query result; 将特定领域的文档输入所述查询数据生成模型,生成所述第一训练数据。Documents in a specific field are input into the query data generation model to generate the first training data. 3.根据权利要求1所述的信息检索方法,其特征在于,所述清除所述第一训练数据中的噪声包括:3. The information retrieval method according to claim 1, wherein removing noise from the first training data comprises: 利用所述第一训练数据初始化噪声分类模型;Initializing a noise classification model using the first training data; 对所述噪声分类模型进行训练;Training the noise classification model; 利用训练后的噪声分类模型清除所述第一训练数据中的噪声。The noise in the first training data is removed using the trained noise classification model. 4.根据权利要求3所述的信息检索方法,其特征在于,所述对所述噪声分类模型进行训练包括:4. The information retrieval method according to claim 3, characterized in that the training of the noise classification model comprises: 进行N次迭代,得到训练后的噪声分类模型,N为正整数;Perform N iterations to obtain the trained noise classification model, where N is a positive integer; 其中,在每次迭代中,利用所述噪声分类模型清除所述第一训练数据中的噪声,利用清除噪声后的数据训练所述信息检索模型,利用训练后的所述信息检索模型的损失函数更新所述噪声分类模型的参数。In each iteration, the noise classification model is used to remove the noise in the first training data, the information retrieval model is trained using the noise-removed data, and the parameters of the noise classification model are updated using the loss function of the trained information retrieval model. 5.根据权利要求1所述的信息检索方法,其特征在于,所述不相关查询生成模型的目标函数包括:5. 
The information retrieval method according to claim 1, wherein the objective function of the irrelevant query generation model comprises: 所述不相关查询生成模型生成的第二查询指令与查询结果的相关性;The correlation between the second query instruction generated by the irrelevant query generation model and the query result; 所述不相关查询生成模型生成的第二查询指令与第一查询指令的文本相似性。The text similarity between the second query instruction generated by the unrelated query generation model and the first query instruction. 6.一种信息检索装置,其特征在于,包括:6. An information retrieval device, comprising: 获取单元,用于获取第一训练数据,所述第一训练数据包括查询指令和与所述查询指令对应的查询结果;An acquisition unit, configured to acquire first training data, wherein the first training data includes a query instruction and a query result corresponding to the query instruction; 噪声清除单元,用于清除所述第一训练数据中的噪声,得到第二训练数据;a noise removal unit, configured to remove noise from the first training data to obtain second training data; 初始化单元,用于利用所述第二训练数据初始化信息检索模型;an initialization unit, used to initialize the information retrieval model using the second training data; 信息检索单元,用于利用所述信息检索模型进行信息检索;An information retrieval unit, used for performing information retrieval using the information retrieval model; 优化单元,用于通过对抗式查询对所述信息检索模型进行优化,所述优化单元包括:An optimization unit, used to optimize the information retrieval model through adversarial query, the optimization unit comprising: 第二初始化子单元,用于利用所述第二训练数据初始化不相关查询生成模型,所述不相关查询生成模型的输入是查询结果和与所述查询结果相关的第一查询指令,输出是与所述查询结果不相关的第二查询指令;A second initialization subunit is used to initialize an irrelevant query generation model using the second training data, wherein the input of the irrelevant query generation model is a query result and a first query instruction related to the query result, and the output of the irrelevant query generation model is a second query instruction irrelevant to the query result; 对抗训练子单元,用于将所述信息检索模型的输出结果输入所述不相关查询生成模型,利用所述不相关查询生成模型的输出结果对所述信息检索模型进行训练。The adversarial training subunit is used to input the output result of the information retrieval model into the irrelevant query generation model, and train the information retrieval model using the output result of the irrelevant query generation model. 7.根据权利要求6所述的信息检索装置,其特征在于,所述获取单元包括:7. The information retrieval device according to claim 6, characterized in that the acquisition unit comprises: 获取子单元,用于获取开放数据,所述开放数据包括查询指令和与所述查询指令对应的查询结果;An acquisition subunit, used to acquire open data, wherein the open data includes a query instruction and a query result corresponding to the query instruction; 第一处理子单元,用于利用所述开放数据训练生成查询数据生成模型,所述查询数据生成模型能够根据输入的查询结果生成与所述查询结果对应的查询指令;A first processing sub-unit is used to generate a query data generation model by training the open data, wherein the query data generation model can generate a query instruction corresponding to the query result according to the input query result; 第二处理子单元,用于将特定领域的文档输入所述查询数据生成模型,生成所述第一训练数据。The second processing subunit is used to input documents in a specific field into the query data generation model to generate the first training data. 8.根据权利要求6所述的信息检索装置,其特征在于,所述噪声清除单元包括:8. The information retrieval device according to claim 6, characterized in that the noise removal unit comprises: 第一初始化子单元,用于利用所述第一训练数据初始化噪声分类模型;A first initialization subunit, used to initialize a noise classification model using the first training data; 训练子单元,用于对所述噪声分类模型进行训练;A training subunit, used for training the noise classification model; 清除子单元,用于利用训练后的噪声分类模型清除所述第一训练数据中的噪声。The cleaning subunit is used to clean the noise in the first training data by using the trained noise classification model. 
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the information retrieval method according to any one of claims 1 to 5.
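To make the claimed pipeline concrete, the three sketches that follow walk through its main stages in Python. First, the synthetic-data step of claim 2: a sequence-to-sequence model, assumed here to be a T5 checkpoint that has already been fine-tuned on open (query result → query instruction) pairs, generates query instructions for documents of the target domain. The checkpoint name, the "generate query:" prompt prefix, and the decoding settings are illustrative assumptions, not details fixed by the claims.

```python
# Hedged sketch of the synthetic-data step in claim 2. The seq2seq model is
# assumed to have been fine-tuned on open (query, result) pairs beforehand;
# "t5-small" stands in for that checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def generate_first_training_data(domain_documents, queries_per_doc=3):
    """Return synthetic (query instruction, query result) pairs; each domain
    document plays the role of the query result for its generated queries."""
    pairs = []
    for doc in domain_documents:
        inputs = tokenizer("generate query: " + doc,
                           return_tensors="pt", truncation=True, max_length=512)
        outputs = model.generate(**inputs,
                                 num_return_sequences=queries_per_doc,
                                 do_sample=True, top_p=0.95, max_new_tokens=32)
        for seq in outputs:
            query = tokenizer.decode(seq, skip_special_tokens=True)
            pairs.append((query, doc))
    return pairs

# Toy usage: the resulting noisy pairs are the "first training data".
docs = ["A multifunction printer supports duplex scanning and stapling."]
print(generate_first_training_data(docs, queries_per_doc=2))
```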
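Next, the denoising loop of claims 3 and 4. The claims couple a noise classification model to the retrieval model over N iterations but leave open how the retrieval loss updates the classifier's parameters; the sketch below adopts a REINFORCE-style update as one plausible reading, and every module shape, hyperparameter, and the toy bilinear retriever are assumptions.

```python
# Hedged sketch of the N-iteration denoising loop of claims 3 and 4.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 64  # assumed embedding width for pre-encoded queries and results

noise_clf = nn.Sequential(nn.Linear(2 * EMB, 32), nn.ReLU(), nn.Linear(32, 1))
retriever = nn.Bilinear(EMB, EMB, 1)  # toy stand-in for the retrieval model
opt_clf = torch.optim.Adam(noise_clf.parameters(), lr=1e-3)
opt_ret = torch.optim.Adam(retriever.parameters(), lr=1e-3)

def iterate(queries, results, labels, n_iters=5):
    for _ in range(n_iters):
        # 1) Classify each (query, result) pair as clean or noisy.
        logits = noise_clf(torch.cat([queries, results], dim=-1)).squeeze(-1)
        keep_prob = torch.sigmoid(logits)
        keep = torch.bernoulli(keep_prob)  # 1 = treat the pair as clean
        mask = keep.bool()
        if not mask.any():
            continue
        # 2) Train the retriever only on the pairs kept as clean.
        scores = retriever(queries[mask], results[mask]).squeeze(-1)
        ret_loss = F.binary_cross_entropy_with_logits(scores, labels[mask])
        opt_ret.zero_grad(); ret_loss.backward(); opt_ret.step()
        # 3) Claim 4: update the classifier from the retrieval loss —
        #    here, reward keep/drop decisions that lowered that loss.
        reward = -ret_loss.detach()
        log_prob = keep * torch.log(keep_prob + 1e-8) \
                 + (1 - keep) * torch.log(1 - keep_prob + 1e-8)
        clf_loss = -(reward * log_prob).mean()
        opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()

# Toy usage with random embeddings and binary relevance labels.
q, r = torch.randn(16, EMB), torch.randn(16, EMB)
y = torch.randint(0, 2, (16,)).float()
iterate(q, r, y)
```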
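Finally, the adversarial-query optimization of claims 1 and 5: the generator is pushed to emit a second query that stays textually close to the first query while being irrelevant to the query result (the two terms of the claim-5 objective), and the retriever learns to rank the genuine query above this hard negative. The loss weights, ranking margin, and toy encoders are assumptions for illustration only.

```python
# Hedged sketch of the adversarial loop of claims 1 and 5; model classes,
# weights, and shapes are illustrative, not the patented implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 64  # assumed embedding width

class Retriever(nn.Module):
    """Scores (query, result) relevance; stands in for the retrieval model."""
    def __init__(self):
        super().__init__()
        self.q_enc, self.r_enc = nn.Linear(EMB, EMB), nn.Linear(EMB, EMB)
    def forward(self, query, result):
        return F.cosine_similarity(self.q_enc(query), self.r_enc(result))

class IrrelevantQueryGenerator(nn.Module):
    """Maps (result, relevant first query) -> embedding of a second query."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * EMB, EMB)
    def forward(self, result, first_query):
        return self.net(torch.cat([result, first_query], dim=-1))

retriever, generator = Retriever(), IrrelevantQueryGenerator()
opt_r = torch.optim.Adam(retriever.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

def generator_step(result, first_query):
    """Claim 5 objective: low relevance to the result, high similarity to
    the first query; the 1.0/0.5 weights are assumptions."""
    second_query = generator(result, first_query)
    relevance = retriever(second_query, result).mean()                   # want low
    similarity = F.cosine_similarity(second_query, first_query).mean()   # want high
    loss = 1.0 * relevance - 0.5 * similarity
    opt_g.zero_grad(); loss.backward(); opt_g.step()

def retriever_step(result, first_query):
    """Claim 1: rank the true query above the generated irrelevant query."""
    with torch.no_grad():
        second_query = generator(result, first_query)
    pos, neg = retriever(first_query, result), retriever(second_query, result)
    loss = F.relu(0.2 - (pos - neg)).mean()  # 0.2 margin is an assumption
    opt_r.zero_grad(); loss.backward(); opt_r.step()

# Toy usage with random embeddings in place of encoded text.
result, first_query = torch.randn(8, EMB), torch.randn(8, EMB)
for _ in range(3):
    generator_step(result, first_query)
    retriever_step(result, first_query)
```

Alternating the two steps mirrors the adversarial setup of claim 1: a generator that keeps manufacturing harder negatives forces the retriever to rely on genuine relevance signals rather than surface overlap with the query text.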
CN202010970977.1A 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium Active CN114186015B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010970977.1A CN114186015B (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium
JP2021149311A JP7230979B2 (en) 2020-09-15 2021-09-14 Information retrieval method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010970977.1A CN114186015B (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114186015A CN114186015A (en) 2022-03-15
CN114186015B true CN114186015B (en) 2025-06-03

Family

ID=80539270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010970977.1A Active CN114186015B (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium

Country Status (2)

Country Link
JP (1) JP7230979B2 (en)
CN (1) CN114186015B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275972A (en) * 2019-06-17 2019-09-24 浙江工业大学 A Content-Based Instance Retrieval Method Introducing Adversarial Training
CN110413757A * 2019-07-30 2019-11-05 中国工商银行股份有限公司 Word paraphrase determination method, apparatus and system
CN110807332A (en) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 Training method of semantic understanding model, semantic processing method, semantic processing device and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3944159B2 (en) * 2003-12-25 2007-07-11 株式会社東芝 Question answering system and program
US9104733B2 (en) * 2012-11-29 2015-08-11 Microsoft Technology Licensing, Llc Web search ranking
US10394901B2 (en) * 2013-03-20 2019-08-27 Walmart Apollo, Llc Method and system for resolving search query ambiguity in a product search engine
WO2016103451A1 (en) * 2014-12-26 2016-06-30 株式会社日立製作所 Method and device for acquiring relevant information and storage medium
US11222277B2 (en) * 2016-01-29 2022-01-11 International Business Machines Corporation Enhancing robustness of pseudo-relevance feedback models using query drift minimization
CN110019658B (en) * 2017-07-31 2023-01-20 腾讯科技(深圳)有限公司 Method and related device for generating search term
US10657259B2 (en) * 2017-11-01 2020-05-19 International Business Machines Corporation Protecting cognitive systems from gradient based attacks through the use of deceiving gradients
CN108446334B (en) * 2018-02-23 2021-08-03 浙江工业大学 A Content-Based Image Retrieval Method with Unsupervised Adversarial Training
CN109697257A * 2018-12-18 2019-04-30 天罡网(北京)安全科技有限公司 Noise-resistant network information retrieval method based on pre-classification and feature learning
JP2020098521A (en) * 2018-12-19 2020-06-25 富士通株式会社 Information processing apparatus, data extraction method, and data extraction program

Also Published As

Publication number Publication date
JP7230979B2 (en) 2023-03-01
JP2022049010A (en) 2022-03-28
CN114186015A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN109800307B (en) Analysis method, device, computer equipment and storage medium for product evaluation
US11861308B2 (en) Mapping natural language utterances to operations over a knowledge graph
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
US9311823B2 (en) Caching natural language questions and results in a question and answer system
CN107133290B Personalized search method and device
CN111026319B (en) Intelligent text processing method and device, electronic equipment and storage medium
CN108897852B (en) Method, device and equipment for judging continuity of conversation content
CN114707007B Image-text retrieval method, device and computer storage medium
JP2021508391A5 (en)
CN110674306A (en) Method, device and electronic device for constructing knowledge graph
EP4575822A1 (en) Data source mapper for enhanced data retrieval
WO2025189617A1 (en) Request processing method and apparatus, device, and storage medium
CN117633518B (en) Industrial chain construction method and system
CN112988952B (en) Multi-level-length text vector retrieval method and device and electronic equipment
CN111562943A (en) Code clone detection method and device based on event embedded tree and GAT network
CN119248926A (en) Medical record text recommendation method, device, equipment and storage medium
CN116361446A (en) A method, device and electronic device for generating a text summary
CN114186015B (en) Information retrieval method, device and computer readable storage medium
CN115329762A (en) Noise word recognition method, device, electronic device and storage medium
CN114691829A (en) Information query method and device
CN113297854A (en) Method, device and equipment for mapping text to knowledge graph entity and storage medium
US20250110806A1 (en) Composed api requests with zero-shot language model interfaces
CN119719313A Search method and device for arbitration problems
CN106156141B (en) Method and device for constructing semantic query word template
CN118839768A (en) Information processing method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant