CN118673101B - Data retrieval method, device, electronic equipment and storage medium - Google Patents
- Publication number
- CN118673101B (application CN202411162580.4A)
- Authority
- CN
- China
- Prior art keywords
- matching
- matching field
- database
- extended
- field
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data retrieval method, a device, electronic equipment and a storage medium, and relates to the technical field of electric digital data processing. The matching fields of a query statement are obtained from its feature information, so that matching follows the query intent, key information and context information, which improves the accuracy of the matching field groups. The matching field groups are obtained from a field-topic mapping table, so that the query statement is located quickly, which improves the efficiency and accuracy of data retrieval. Multidimensional retrieval is performed on the extended matching field groups, so that retrieval follows the entities of the matching field groups, which improves the accuracy and efficiency of data retrieval. The retrieval results are weighted and fused according to the weights of the extended matching field groups, so that the weight of each retrieval result is obtained accurately and the user can quickly find the retrieval data most relevant to the query statement.
Description
Technical Field
The present invention relates to the technical field of electric digital data processing, and in particular to a data retrieval method, a data retrieval device, an electronic device and a storage medium.
Background
As the digital age advances, the volume of data is growing explosively and data structures are becoming increasingly complex. Traditional data retrieval methods struggle with large-scale, multi-field databases. Although retrieval-augmented generation (RAG) systems have made progress by combining retrieval and generation techniques, they remain significantly limited in handling complex queries and multidimensional data.
Currently, mainstream RAG systems generally vectorize only the main text of the database and treat the other attributes simply as metadata for whole-word matching. This approach causes a series of problems: retrieval flexibility is limited, and users often need to specify the matching content explicitly. In addition, existing systems fall short in query understanding and have difficulty capturing the user's real intent, particularly for complex, multi-faceted queries. Finally, most data retrieval systems lack effective mechanisms for evaluating and optimizing results.
These problems make existing data retrieval inefficient.
Disclosure of Invention
The invention provides a data retrieval method, a data retrieval device, electronic equipment and a storage medium, which address the low data retrieval efficiency of the prior art and improve data retrieval efficiency.
The invention provides a data retrieval method comprising the following steps: obtaining feature information of a user's query statement based on a language model's semantic analysis result for the query statement; obtaining, based on the feature information, a plurality of matching field groups of the query statement in a field-topic mapping table of a database, wherein the field-topic mapping table comprises mapping relations between preset topics of the database and fields of the database, and the matching field groups correspond one-to-one with the preset topics; extending the matching fields in each matching field group to obtain a plurality of extended matching field groups of the query statement; performing multidimensional retrieval in the database based on the plurality of extended matching field groups to obtain a retrieval result for each extended matching field group; and performing weighted fusion on the retrieval results based on the weights of the extended matching field groups to obtain retrieval data of the query statement.
According to the data retrieval method provided by the invention, obtaining the plurality of matching field groups of the query statement in the field-topic mapping table of the database based on the feature information comprises: obtaining a plurality of matching field sequences of the query statement in the field-topic mapping table based on the feature information; obtaining the relevance between the matching fields in each matching field sequence and the query statement, and sorting the matching fields in descending order of relevance to obtain sorted matching field sequences; and taking at least one top-ranked matching field in each sorted matching field sequence as the matching field group.
According to the data retrieval method provided by the invention, after performing weighted fusion on the retrieval results based on the weights of the extended matching field groups to obtain the retrieval data of the query statement, the method further comprises: when the retrieval score of the retrieval data is lower than a set score, updating the matching field groups based on each sorted matching field sequence; performing iterative retrieval based on the updated matching field groups until the retrieval score is greater than or equal to the set score or the number of retrieval iterations reaches a set number; and taking the retrieval data with the highest retrieval score as the final retrieval data.
According to the data retrieval method provided by the invention, extending the matching fields in each matching field group to obtain the plurality of extended matching field groups of the query statement comprises: performing named entity recognition analysis on each matching field of the matching field group to obtain the associated fields of the matching field; and obtaining each extended matching field group based on all the matching fields in each matching field group and the associated fields of those matching fields.
According to the data retrieval method provided by the invention, the weight of an extended matching field group is determined by the following steps: obtaining a base weight of each extended matching field group based on the preset topic; determining an importance level of each extended matching field group based on the feature information; and adjusting the base weight based on the importance level to obtain the weight of the extended matching field group.
According to the data retrieval method provided by the invention, the database is determined by the following steps: performing data cleaning on the initial texts in an initial database that participate in retrieval, so as to unify the format of the initial texts; converting the cleaned initial texts into text vectors; if the number of repetitions of a text vector in the initial database is lower than a set number of repetitions, constructing an index structure for the text vector; and obtaining the database based on the index structure and the text vectors.
According to the data retrieval method provided by the invention, performing multidimensional retrieval in the database based on the plurality of extended matching field groups to obtain a retrieval result for each extended matching field group comprises: performing retrieval in the database based on each of the plurality of extended matching field groups; if an index structure matching an extended matching field exists in the database, determining the retrieval result based on the similarity between the text vectors of the index structure and the extended matching field; and if an unstructured text vector matching the extended matching field exists in the database, performing fuzzy matching on the unstructured text vector to obtain the retrieval result.
The invention further provides a data retrieval device comprising a feature information determining module, a matching module, an extension module, a retrieval module and a fusion module. The feature information determining module is used for obtaining feature information of a user's query statement based on a language model's semantic analysis result for the query statement, wherein the feature information comprises query intent, key information and context information. The matching module is used for obtaining, based on the feature information, a plurality of matching field groups of the query statement in a field-topic mapping table of a database, wherein the field-topic mapping table comprises mapping relations between preset topics of the database and fields of the database, and the matching field groups correspond one-to-one with the preset topics. The extension module is used for extending the matching fields in each matching field group to obtain a plurality of extended matching field groups of the query statement. The retrieval module is used for performing multidimensional retrieval in the database based on the plurality of extended matching field groups to obtain a retrieval result for each extended matching field group. The fusion module is used for performing weighted fusion on the retrieval results based on the weights of the extended matching field groups to obtain retrieval data of the query statement.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements any one of the data retrieval methods described above when executing the computer program.
The invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the data retrieval methods described above.
According to the data retrieval method, device, electronic equipment and storage medium provided by the invention, the matching fields of the query statement are obtained from the feature information, so that matching follows the query intent, key information and context information, which improves the accuracy of the matching field groups. The matching field groups are obtained from the field-topic mapping table, so that the query statement is located quickly, which improves the efficiency and accuracy of data retrieval. Multidimensional retrieval is performed on the extended matching field groups, so that retrieval follows the entities of the matching field groups, which improves the accuracy and efficiency of data retrieval. The retrieval results are weighted and fused according to the weights of the extended matching field groups, so that the weight of each retrieval result is obtained accurately and the user can quickly find the retrieval data most relevant to the query statement.
Drawings
To illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a data retrieval method provided by the invention.
Fig. 2 is a schematic structural diagram of a data retrieval device provided by the present invention.
Fig. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
The data retrieval method, apparatus, electronic device and storage medium of the present invention are described below with reference to fig. 1 to 3.
Fig. 1 is a schematic flow chart of the data retrieval method provided by the invention. As shown in Fig. 1, the data retrieval method includes steps S100 to S500, described below.
S100: Obtain feature information of the query statement based on the language model's semantic analysis result for the user's query statement, where the feature information includes query intent, key information and context information.
A language model is a model capable of understanding and expressing language and is used to process natural language. It processes the semantic information of text using machine learning and deep learning techniques. Examples include the Chat Generative Pre-trained Transformer (ChatGPT), the Tongyi Qianwen language model, the Doubao language model, and the like.
Basic text cleaning and normalization are first performed on the user's query statement. A language model is then used to analyze the query intent, key information and context information of the query statement. The query intent includes factual queries, relational queries and exploratory queries. The key information includes the keywords, entities and relations in the query statement. The context information includes the latent semantics and the surrounding context of the query statement.
S200: Obtain a plurality of matching field groups of the query statement in the field-topic mapping table of the database based on the feature information.
The field-topic mapping table comprises mapping relations between a plurality of preset topics of the database and the fields of the database, and the matching field groups correspond one-to-one with the preset topics.
The field-topic mapping table of the database is constructed in advance. It describes the semantics and applicable scenarios of each field in the database. The field-topic mapping table includes a number of preset topics, such as title, body, author, date and journal quality. The title topic includes a plurality of title fields, used for generalized queries and keyword matching. The body topic includes a plurality of body fields, used for detailed content queries and full-text search. The author topic includes a plurality of author fields, used for queries related to a specific person. The date topic includes a plurality of date fields, used for time-related queries.
With the language model acting as a router, the user's query statement is matched against the field-topic mapping table and a plurality of matching preset topics are selected. The fields most relevant to both the query statement and a matched preset topic are selected as a matching field group.
For example, the query statement is: "I want to find review articles published in top journals in the last five years on the application of artificial intelligence in agricultural modernization. I am particularly interested in studies using computer vision and deep learning techniques, preferably by authors from institution A or institution B. It would be even better if the papers discuss the difficulties of applying the technology among smallholder farmers or environmental sustainability issues." Here the query intent is to find matching documents. The key information is artificial intelligence, agricultural modernization, computer vision and deep learning. The context information is carried by the expressions "want to find", "particularly interested in", "preferably" and "even better if".
Based on the feature information (query intent, key information and context information), a plurality of matching field groups of the query statement are obtained in the field-topic mapping table: a time matching field group (last five years), a type matching field group (review articles), a title matching field group (artificial intelligence, agricultural modernization), a body matching field group (computer vision, deep learning), an author matching field group (authors from institution A or institution B), a journal quality matching field group (top journals), and an other matching field group (smallholder application difficulties, environmental sustainability).
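As an illustration only, a field-topic mapping table and the selection of one matching field group per preset topic could be sketched as follows. The topic names follow the text above; the database field names, the `TopicEntry` type and the `select_match_groups` helper are hypothetical, and the language-model routing step is replaced by the assumption that the key information has already been tagged by topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicEntry:
    fields: tuple[str, ...]      # database fields mapped to the preset topic
    use_cases: tuple[str, ...]   # scenarios the topic is suited for


# Hypothetical mapping table: topics come from the description, field names are invented.
FIELD_TOPIC_MAP = {
    "title": TopicEntry(("title", "subtitle"), ("generalized query", "keyword matching")),
    "body": TopicEntry(("abstract", "full_text"), ("detailed content query", "full-text search")),
    "author": TopicEntry(("author_name", "affiliation"), ("person-related query",)),
    "date": TopicEntry(("publication_date",), ("time-related query",)),
}


def select_match_groups(key_info: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return one matching field group per preset topic.

    Assumes key_info is already keyed by topic (the routing that the patent
    assigns to a language model is not implemented here). Topics that are not
    in the mapping table, or that have no terms, are dropped.
    """
    return {
        topic: terms
        for topic, terms in key_info.items()
        if topic in FIELD_TOPIC_MAP and terms
    }
```

Because the result is keyed by topic, the one-to-one correspondence between matching field groups and preset topics holds by construction in this sketch.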
S300: Extend the matching fields in each matching field group to obtain a plurality of extended matching field groups of the query statement.
Specifically, named entity recognition analysis is performed on each matching field of a matching field group to obtain the associated fields of that matching field, and each extended matching field group is obtained from all the matching fields in the matching field group together with their associated fields.
Named entity recognition (NER) analysis identifies the entities in a matching field, such as person names, organization names, place names, dates and numbers. In the invention, NER analysis is used to accurately extract the entities associated with the matching fields, thereby improving retrieval accuracy.
For example, for the matching field "last five years", the extended matching field obtained after NER analysis is 2019 to 2024. Similarly, the extended matching fields for "review articles" are survey, commentary, review and meta-analysis. The extended matching fields for "artificial intelligence" are artificial intelligence, intelligent technology, machine learning and smart agriculture. The extended matching fields for "agricultural modernization" are agricultural modernization, agriculture 4.0, smart agriculture and precision agriculture. The extended matching fields for "computer vision" are computer vision, image recognition, object detection and remote sensing image analysis. The extended matching fields for "deep learning" are neural network, convolutional neural network and recurrent neural network. The extended matching fields for "authors from institution A or institution B" are institution A, institution B, affiliates of institution A and affiliates of institution B. The extended matching fields for "top journals" are journal C, journal D and journal E. The extended matching fields for "smallholder application difficulties" are smallholder farmers, technology adoption and cost issues. The extended matching fields for "environmental sustainability" are environment, ecology and green agriculture.
All the matching fields in a matching field group and their associated fields are taken as extended matching fields, which yields the extended matching field group.
Because the associated fields of the matching fields are obtained through named entity recognition analysis, the entities of the matching fields are mined, the accuracy and relevance of the associated fields are improved, and retrieval precision is improved.
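The extension step can be illustrated with a minimal sketch. It does not perform named entity recognition: a hand-written association table (`ASSOCIATED`, hypothetical) stands in for the NER analysis, and `last_n_years` is a hypothetical helper that resolves a relative time expression into a year range, assuming the current year is supplied by the caller.

```python
# Hypothetical association table standing in for the NER analysis.
ASSOCIATED = {
    "computer vision": [
        "image recognition",
        "object detection",
        "remote sensing image analysis",
    ],
    "agricultural modernization": [
        "agriculture 4.0",
        "smart agriculture",
        "precision agriculture",
    ],
}


def last_n_years(n: int, current_year: int) -> tuple[int, int]:
    """Resolve 'last n years' into an inclusive (start, end) year range."""
    return (current_year - n, current_year)


def expand_group(match_fields: list[str]) -> list[str]:
    """Return the matching fields plus their associated fields, without duplicates."""
    expanded: list[str] = []
    for field in match_fields:
        for term in [field, *ASSOCIATED.get(field, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded
```

With a current year of 2024, `last_n_years(5, 2024)` gives the range 2019 to 2024 used in the example above. Fields without an entry in the table are kept unchanged.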
S400: Perform multidimensional retrieval in the database based on the plurality of extended matching field groups to obtain a retrieval result for each extended matching field group.
Retrieval is performed along one dimension for each extended matching field group. For example, there are several extended matching field groups: a time extended matching field group, a type extended matching field group, a title extended matching field group, a body extended matching field group, an author extended matching field group, a journal quality extended matching field group and an other extended matching field group. The extended matching field group corresponding to each dimension is retrieved separately to obtain its retrieval result.
For example, if the time extended matching field group is 2019 to 2024, all documents from 2019 to 2024 are matched, yielding a document list (the retrieval result).
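A minimal sketch of per-dimension retrieval over an in-memory document list is given below. The documents, the field names and the plain substring matching are all assumptions made for illustration; the patent's vector similarity and fuzzy matching are not reproduced here.

```python
# Hypothetical in-memory corpus.
DOCS = [
    {"id": "d1", "year": 2021, "title": "Artificial intelligence in smart agriculture", "type": "review"},
    {"id": "d2", "year": 2016, "title": "Machine learning for crop yield", "type": "article"},
    {"id": "d3", "year": 2023, "title": "Soil chemistry survey", "type": "review"},
]


def search_time(docs, start, end):
    """Time dimension: documents whose year lies in the inclusive range."""
    return {d["id"] for d in docs if start <= d["year"] <= end}


def search_terms(docs, field, terms):
    """Text dimension: documents whose field contains any extended matching field."""
    lowered = [t.lower() for t in terms]
    return {d["id"] for d in docs if any(t in d[field].lower() for t in lowered)}


def multidimensional_search(docs, groups):
    """Run one retrieval per extended matching field group and keep results separate."""
    results = {}
    for name, group in groups.items():
        if name == "time":
            results[name] = search_time(docs, *group)
        else:
            results[name] = search_terms(docs, name, group)
    return results
```

Keeping one result set per dimension, rather than intersecting them, is what allows the later fusion step to weight each dimension separately.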
S500: Perform weighted fusion on the retrieval results based on the weights of the extended matching field groups to obtain the retrieval data of the query statement.
The weight of each extended matching field group is obtained. For example, the weight of the title extended matching field group is 1, the body extended matching field group 0.8, the type extended matching field group 0.7, the time extended matching field group 0.6, the journal quality extended matching field group 0.5, the author extended matching field group 0.4, and the other extended matching field group 0.3.
The fused weight of each retrieval result is determined from the weights of the extended matching field groups it matches, and the retrieval results are sorted by fused weight, for example in descending order of the weighted sum, to obtain the retrieval data.
For example, if document A matches the time, type, title, body, author, journal quality and other extended matching field groups, its fused weight is 1 + 0.8 + 0.7 + 0.6 + 0.5 + 0.4 + 0.3 = 4.3 (the maximum weight), and document A is ranked first.
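The worked example can be reproduced with a short sketch. The weights are the ones given in the text; the `fuse` function and the dictionary layout (group name mapped to the set of matched document identifiers) are assumptions for illustration.

```python
WEIGHTS = {
    "title": 1.0,
    "body": 0.8,
    "type": 0.7,
    "time": 0.6,
    "journal_quality": 0.5,
    "author": 0.4,
    "other": 0.3,
}


def fuse(results, weights):
    """Sum the weights of every group a document matches, then sort descending."""
    scores = {}
    for group, doc_ids in results.items():
        for doc_id in doc_ids:
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[group]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Document A matches every group; document B matches only title and time.
results = {group: {"A"} for group in WEIGHTS}
results["title"] = {"A", "B"}
results["time"] = {"A", "B"}
ranked = fuse(results, WEIGHTS)
```

Here document A accumulates all seven weights (4.3 up to floating-point rounding) and is ranked ahead of document B, which accumulates 1 + 0.6 = 1.6.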
According to the data retrieval method provided by this embodiment of the invention, the matching fields of the query statement are obtained from the feature information, so that matching follows the query intent, key information and context information, which improves the accuracy of the matching field groups. The matching field groups are obtained from the field-topic mapping table, so that the query statement is located quickly, which improves the efficiency and accuracy of data retrieval. Multidimensional retrieval is performed on the extended matching field groups, so that retrieval follows the entities of the matching field groups, which improves the accuracy and efficiency of data retrieval. The retrieval results are weighted and fused according to the weights of the extended matching field groups, so that the weight of each retrieval result is obtained accurately and the user can quickly find the retrieval data most relevant to the query statement.
Based on the above embodiment, obtaining a plurality of matching field groups of the query statement in the field-topic mapping table of the database based on the feature information includes steps S210 to S230, described below.
S210: Obtain a plurality of matching field sequences of the query statement in the field-topic mapping table based on the feature information.
S220: Obtain the relevance between the matching fields in each matching field sequence and the query statement, and sort the matching fields in descending order of relevance to obtain a sorted matching field sequence.
S230: In each sorted matching field sequence, take at least one top-ranked matching field as the matching field group.
A plurality of matching field sequences that match the query statement are obtained in the field-topic mapping table according to the query intent, key information and context information of the query statement. For example, the query statement is "highly cited papers published in the last five years on the application of artificial intelligence in the medical field". The matching field sequences include a title and keyword matching field sequence (artificial intelligence and medical field; artificial intelligence; medical field), a citation count matching field sequence (cited more than 100 times; cited between 10 and 100 times; cited fewer than 10 times), and a time matching field sequence (last 1 year; last 3 years; last 5 years; last 10 years).
Within each matching field sequence, the matching fields are sorted in descending order of relevance to obtain the sorted matching field sequence. For example, the sorted citation count matching field sequence is: cited more than 100 times, cited between 10 and 100 times, cited fewer than 10 times.
In each sorted matching field sequence, at least one top-ranked matching field is taken as the matching field group. For example, if the first-ranked matching field is taken, "cited more than 100 times" becomes the matching field group.
By obtaining a plurality of matching field sequences of the query statement, the invention first narrows the range of the query statement. Determining the matching field groups by relevance increases how closely the matching field groups relate to the query statement, which improves retrieval efficiency and accuracy.
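Steps S210 to S230 amount to a sort followed by a top-k cut, which can be sketched as below. The relevance values are invented for illustration; in the patent the relevance to the query statement would come from the language model, and `top_match_group` is a hypothetical helper name.

```python
def top_match_group(sequence, relevance, k=1):
    """Sort a matching field sequence by descending relevance and take the top k.

    `relevance` is any callable mapping a matching field to a number.
    Returns the sorted sequence and the matching field group.
    """
    ordered = sorted(sequence, key=relevance, reverse=True)
    return ordered, ordered[:k]


# Hypothetical relevance scores for the citation count sequence.
scores = {
    "cited fewer than 10 times": 0.2,
    "cited more than 100 times": 0.9,
    "cited between 10 and 100 times": 0.6,
}
ordered, group = top_match_group(list(scores), scores.get, k=1)
```

The sorted sequence is kept alongside the selected group because the later iterative retrieval steps fall back to lower-ranked fields of the same sequence.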
Based on the above embodiment, after the retrieval results are weighted and fused based on the weights of the extended matching field groups to obtain the retrieval data of the query statement, the method further includes steps S600 to S800, described below.
S600: When the retrieval score of the retrieval data is lower than the set score, update the matching field groups based on each sorted matching field sequence.
S700: Perform iterative retrieval based on the updated matching field groups until the retrieval score is greater than or equal to the set score or the number of retrieval iterations reaches the set number.
S800: Take the retrieval data with the highest retrieval score as the final retrieval data.
The language model scores the semantic relevance of the retrieval data to obtain the retrieval score for the query statement. The retrieval score is compared with the set score to decide whether optimization is needed. If the retrieval score is lower than the set score, the current retrieval data is unsatisfactory and retrieval must be performed again.
The matching field groups are updated based on each sorted matching field sequence. For example, the sorted matching field sequence is: cited more than 100 times, cited between 10 and 100 times, cited fewer than 10 times. The matching field group in the current retrieval is "cited more than 100 times". The updated matching field group in the next retrieval is "cited between 10 and 100 times".
Retrieval is repeated with all the updated matching field groups and stops when the retrieval score is greater than or equal to the set score or the number of retrieval iterations reaches the set number. The retrieval data with the highest retrieval score is taken as the final retrieval data.
By updating the matching field groups and retrieving iteratively, the invention improves the final retrieval precision.
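The control flow of steps S600 to S800 can be sketched as a bounded loop. The `retrieve` and `score` callables are placeholders for the retrieval pipeline and the language-model scoring, and two choices are assumptions of this sketch rather than statements of the patent: the initial retrieval counts as the first iteration, and a sequence that runs out of fields keeps reusing its last field.

```python
def iterative_retrieve(ordered_sequences, retrieve, score, set_score, max_iters):
    """Retry retrieval with lower-ranked matching fields until the score is good enough.

    ordered_sequences: topic name -> matching fields sorted by descending relevance
    retrieve: callable taking {topic: matching field} and returning retrieval data
    score: callable taking retrieval data and returning a number
    """
    best_data, best_score = None, float("-inf")
    for i in range(max_iters):
        # Iteration i uses the i-th ranked field of each sequence (assumption).
        groups = {
            name: seq[min(i, len(seq) - 1)]
            for name, seq in ordered_sequences.items()
        }
        data = retrieve(groups)
        current = score(data)
        if current > best_score:
            best_data, best_score = data, current
        if current >= set_score:
            break
    return best_data, best_score
```

Tracking the best result across iterations matters because the loop may stop on the iteration limit, in which case the last retrieval is not necessarily the highest-scoring one.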
Based on the above embodiment, the weight of the extended matching field group is determined based on steps S510 to S530, and each step is specifically as follows.
And S510, acquiring the basic weight of each extension matching field group based on a preset theme.
And S520, determining the importance level of each extension matching field group based on the characteristic information.
And S530, adjusting the basic weight based on the importance level to obtain the weight of the extended matching field group.
The basic weight of each extended matching field group is preset, for example, the basic weight of the header extended matching field group is 1, the basic weight of the text extended matching field group is 0.8, the basic weight of the type extended matching field group is 0.7, the basic weight of the time extended matching field group is 0.6, and the basic weight of the author extended matching field group is 0.4.
The importance level of each extended matching field group is determined based on the feature information, which includes query intention, key information, and context information. For example, if the feature information indicates that the query statement is time-sensitive, the time extended matching field group is assigned the highest importance level, its basic weight is adjusted to 1 accordingly, and its final weight is 1.
By adjusting the basic weight according to the feature information, the invention adapts the weights to the actual need expressed by the query statement, so that the retrieval result meets the user's actual need as far as possible, which helps improve the accuracy of data retrieval.
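Steps S510 to S530 can be illustrated with a short sketch that uses the example basic weights given above. The dictionary keys are shorthand for the five example groups, and the adjustment rule is an assumption: the text only states that the highest importance level raises the weight to 1, so every other level is left at its basic weight here.

```python
from typing import Dict

# Example basic weights from the description; the keys are shorthand names
# chosen for this sketch.
BASE_WEIGHTS: Dict[str, float] = {
    "title": 1.0,
    "body": 0.8,
    "type": 0.7,
    "time": 0.6,
    "author": 0.4,
}


def adjust_weights(
    base_weights: Dict[str, float], importance: Dict[str, str]
) -> Dict[str, float]:
    """Return the final weight of each extended matching field group.

    Assumed rule: a group marked "highest" gets weight 1.0, and any other
    group keeps its basic weight.
    """
    adjusted = {}
    for group, base in base_weights.items():
        level = importance.get(group, "normal")
        adjusted[group] = 1.0 if level == "highest" else base
    return adjusted


# A time-sensitive query raises the time group to the highest importance level.
weights = adjust_weights(BASE_WEIGHTS, {"time": "highest"})
```

A real system would likely map several importance levels to different adjustments; the two-level rule is only the simplest case consistent with the example.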
Based on the above embodiment, the database is built through steps S410 to S440, as follows.
S410, performing data cleaning on the initial texts in the initial database that participate in retrieval, so as to unify the format of the initial texts.
S420, converting the cleaned initial text into a text vector.
S430, if the number of repetitions of the text vector in the initial database is lower than the set number of repetitions, constructing an index structure of the text vector.
S440, obtaining the database based on the index structure and the text vector.
All initial text in the initial database that may participate in retrieval is identified, including but not limited to title, body, abstract, keywords, author, and date. The initial text is cleaned to remove special characters and unify its format. The cleaned initial text is converted into text vectors using a pre-trained language model, for example the BGE-M3 language model. A relatively long initial text is first split into segments, and each segment is converted into a text vector.
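The cleaning and segmenting steps can be sketched with the Python standard library alone. The function names, the set of punctuation kept, and the 200-character segment limit are assumptions for illustration; the conversion of each segment into a text vector by a language model such as BGE-M3 is deliberately left out.

```python
import re
import unicodedata
from typing import List


def clean_text(text: str) -> str:
    """Unify the format of an initial text and drop special characters."""
    # NFKC folds full-width characters and unusual spaces into standard forms.
    text = unicodedata.normalize("NFKC", text)
    # Keep word characters, whitespace and common punctuation only.
    text = re.sub(r"[^\w\s.,;:!?'\"()-]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def split_into_segments(text: str, max_chars: int = 200) -> List[str]:
    """Split a long text on whitespace into segments of at most max_chars.

    A single word longer than max_chars becomes its own, longer segment.
    """
    segments: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = word
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Splitting on whitespace suits space-delimited languages; text without spaces, such as Chinese, would need a character-based or tokenizer-based splitter instead.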
If the number of repetitions of the text vector in the initial database is lower than the set number of repetitions, the text vector is not heavily repeated, as is the case for body and title. An index structure is constructed for such text vectors, for example using the Hierarchical Navigable Small World (HNSW) algorithm. The HNSW algorithm can complete an approximate nearest neighbor search in roughly logarithmic time complexity, which improves the retrieval speed on large-scale vector data. The HNSW algorithm also supports dynamic insertion of new vectors and is therefore suitable for continuously growing data sets. The index structure obtained with the HNSW algorithm is relatively compact and makes effective use of memory resources.
If the number of repetitions of the text vector in the initial database is greater than the set number of repetitions, the text vector is heavily repeated, as is the case for author or tag. No index structure is built for such heavily repeated text vectors; they are matched directly at query time.
By constructing the index structure only for text vectors whose number of repetitions is lower than the set number of repetitions, the invention improves both the retrieval speed of the database and the utilization of its memory resources.
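The decision of which fields receive an index structure can be sketched as below. Two simplifications are assumed: exact string equality of field values stands in for equality of text vectors, and the number of repetitions of a field is taken as the count of its most frequent value. The function name and record layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def split_fields_by_repetition(
    records: List[Dict[str, str]], set_repetitions: int
) -> Tuple[List[str], List[str]]:
    """Return (fields to index, fields matched directly at query time)."""
    values_by_field: Dict[str, List[str]] = {}
    for record in records:
        for field, value in record.items():
            values_by_field.setdefault(field, []).append(value)

    indexed: List[str] = []
    direct: List[str] = []
    for field, values in values_by_field.items():
        max_repeats = max(Counter(values).values())
        # Below the set number of repetitions: build an index (e.g. HNSW).
        # Otherwise: no index. Treating the equal case as "no index" is an
        # assumption, since the description does not cover it.
        if max_repeats < set_repetitions:
            indexed.append(field)
        else:
            direct.append(field)
    return indexed, direct


records = [
    {"title": "Query expansion survey", "author": "Li"},
    {"title": "Vector index design", "author": "Li"},
    {"title": "Fuzzy matching notes", "author": "Li"},
]
indexed_fields, direct_fields = split_fields_by_repetition(records, set_repetitions=2)
```

With these toy records the title values never repeat, so `title` would be indexed, while `author` repeats three times and would be matched directly.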
By vectorizing all initial texts in the database that participate in matching or retrieval, the invention greatly extends the semantic understanding capability of the database and lays a foundation for subsequent accurate retrieval. To support efficient retrieval, the database also builds an efficient index structure (for example, HNSW) for vectorized fields that are not heavily repeated. Heavily repeated category fields are not indexed, which improves efficiency: categories can be added or removed quickly without frequently rebuilding an index structure.
Based on the above embodiment, performing multidimensional retrieval in the database based on the plurality of extended matching field groups to obtain the retrieval result of each extended matching field group includes steps S540 to S550, as follows.
S540, performing retrieval in one dimension in the database based on one extended matching field group.
S550, performing multidimensional retrieval in the database based on the plurality of extended matching field groups, where, if an index structure matching the extended matching field exists in the database, the retrieval result is determined based on the similarity between the text vectors of the index structure and the extended matching field, and, if an unstructured text vector matching the extended matching field exists in the database, fuzzy matching is performed on the unstructured text vector to obtain the retrieval result.
Retrieval in one dimension is performed in the database based on one extended matching field group, so retrieval based on the plurality of extended matching field groups is multidimensional. If an extended matching field in an extended matching field group hits an index structure, a nearest neighbor algorithm (for example, HNSW) is used to compute the similarity between the extended matching field and the text vectors in the index structure, and the text vector with the highest similarity is taken as the retrieval result of that extended matching field. If the extended matching field hits an unstructured text vector, fuzzy matching is performed on the unstructured text vector. Fuzzy matching tolerates a certain degree of error in string comparison: text vectors that differ through misspellings, homonyms, hyponyms, and the like are still treated as matches, which improves retrieval flexibility and recall. If the extended matching field matches a category field, exact or fuzzy matching is performed directly.
By performing multidimensional retrieval over the plurality of extended matching field groups, the invention improves retrieval speed and comprehensiveness. Introducing similarity matching and fuzzy matching into the retrieval improves retrieval precision.
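The two matching paths can be sketched with the standard library. This is only an illustration under stated assumptions: a brute-force cosine comparison stands in for an HNSW index, `difflib` stands in for whatever fuzzy matcher an implementation would use, and the function names and the 0.8 cutoff are hypothetical.

```python
import difflib
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query_vector: Sequence[float], indexed_vectors: Dict[str, Sequence[float]]
) -> str:
    """Return the key of the indexed text vector closest to the query.

    Exhaustive search is used here in place of an HNSW index.
    """
    return max(
        indexed_vectors,
        key=lambda key: cosine_similarity(query_vector, indexed_vectors[key]),
    )


def fuzzy_match(term: str, candidates: List[str], cutoff: float = 0.8) -> List[str]:
    """Return the candidates that match term within a tolerated error."""
    return difflib.get_close_matches(
        term, candidates, n=max(1, len(candidates)), cutoff=cutoff
    )


best = most_similar([1.0, 0.0], {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]})
authors = fuzzy_match("Zhang Wei", ["Zhang Wei", "Zhagn Wei", "Li Na"])
```

In the toy data, the misspelled "Zhagn Wei" is still accepted as a match for "Zhang Wei", while "Li Na" is rejected, which is the tolerance to spelling errors that fuzzy matching is meant to provide.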
The data retrieval device provided by the invention is described below. The data retrieval device described below and the data retrieval method described above may be referred to in correspondence with each other.
As shown in fig. 2, a data retrieval apparatus includes a feature information determination module 201 for acquiring feature information of a query sentence of a user based on a semantic analysis result of the query sentence by a language model, the feature information including a query intention, key information, and context information.
The matching module 202 is configured to obtain a plurality of matching field groups of the query statement from a field-topic mapping table of the database based on the feature information, where the field-topic mapping table includes mapping relationships between a plurality of preset topics of the database and fields of the database, and the matching field groups are in one-to-one correspondence with the preset topics.
The expansion module 203 is configured to expand the matching fields in each matching field group to obtain a plurality of extended matching field groups of the query statement.
The retrieval module 204 is configured to perform multidimensional retrieval in the database based on the plurality of extended matching field groups to obtain a retrieval result of each extended matching field group.
The fusion module 205 is configured to perform weighted fusion on the retrieval results based on the weights of the extended matching field groups to obtain the search data of the query statement.
With the data retrieval device provided by the embodiment of the invention, the matching fields of the query statement are obtained from the feature information, so that matching follows the query intention, key information, and context information, which improves the accuracy of the matching field groups. Obtaining the matching field groups from the field-topic mapping table locates the fields relevant to the query statement quickly, which improves the efficiency and accuracy of data retrieval. Performing multidimensional retrieval with the extended matching field groups allows retrieval by the entities of the matching field groups, which further improves accuracy and efficiency. Weighting and fusing the retrieval results by the weights of the extended matching field groups reflects the contribution of each retrieval result accurately, so that the user can quickly find the search data most relevant to the query statement.
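The weighted fusion performed by the fusion module 205 can be sketched as a weighted sum. The description does not fix the fusion formula, so the rule used here (the fused score of a document is the sum, over groups, of the group weight times the similarity) is an assumption, as are the function name and the data layout.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_fusion(
    results_by_group: Dict[str, List[Tuple[str, float]]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Fuse per-group retrieval results into one ranked list.

    results_by_group maps a group name to (document id, similarity) pairs.
    Groups without a weight contribute nothing.
    """
    fused: Dict[str, float] = defaultdict(float)
    for group, hits in results_by_group.items():
        weight = weights.get(group, 0.0)
        for doc_id, similarity in hits:
            fused[doc_id] += weight * similarity
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


ranked = weighted_fusion(
    {"title": [("d1", 0.9), ("d2", 0.5)], "time": [("d2", 0.8)]},
    {"title": 1.0, "time": 1.0},
)
```

Here `d2` is found by both groups and therefore outranks `d1`, even though `d1` has the single highest similarity; a document supported by several dimensions is favored by this kind of fusion.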
In one embodiment, the matching module 202 is configured to obtain a plurality of matching field sequences of the query statement from the field-topic mapping table based on the feature information, obtain the relevance between each matching field in each matching field sequence and the query statement, sort the matching fields in descending order of relevance to obtain sorted matching field sequences, and take at least one top-ranked matching field in each sorted matching field sequence as a matching field group.
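The sorting and selection performed by the matching module 202 amounts to a top-k selection, sketched below. The function name, the `(field, relevance)` pair layout, and the `top_k` parameter are assumptions; how the relevance values are computed is outside this sketch.

```python
from typing import List, Tuple


def build_matching_field_group(
    sequence: List[Tuple[str, float]], top_k: int = 1
) -> List[str]:
    """Sort a matching field sequence by relevance and keep the top fields."""
    ordered = sorted(sequence, key=lambda item: item[1], reverse=True)
    return [field for field, _ in ordered[:top_k]]


group = build_matching_field_group(
    [("abstract", 0.6), ("title", 0.9), ("keywords", 0.3)], top_k=2
)
```

For the toy sequence, the group contains `title` followed by `abstract`, the two fields most relevant to the query statement.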
In one embodiment, the fusion module 205 is further configured to update the set of matching fields based on each ordered sequence of matching fields when the search score of the search data is lower than the set score, perform iterative search based on the updated set of matching fields until the search score is greater than or equal to the set score, or the number of iterative searches reaches the set number of times, and take the search data with the highest search score as the final search data.
In one embodiment, the expansion module 203 is configured to perform named entity recognition analysis on each matching field of the matching field groups to obtain associated fields of the matching fields, and obtain each expanded matching field group based on all matching fields in each matching field group and associated fields of the matching fields.
In one embodiment, the fusion module 205 is configured to obtain the basic weight of each extended matching field group based on the preset topic, determine the importance level of each extended matching field group based on the feature information, and adjust the basic weight based on the importance level to obtain the weight of the extended matching field group.
In one embodiment, the retrieval module 204 is further configured to perform data cleansing on the initial text participating in retrieval in the initial database to unify the format of the initial text, convert the cleansed initial text into a text vector, construct an index structure of the text vector if the number of repetitions of the text vector in the initial database is lower than a set number of repetitions, and obtain the database based on the index structure and the text vector.
In one embodiment, the retrieval module 204 is configured to perform retrieval in one dimension in the database based on one extended matching field group, and to perform multidimensional retrieval in the database based on a plurality of extended matching field groups, where, if an index structure matching the extended matching field exists in the database, the retrieval result is determined based on the similarity between the text vectors of the index structure and the extended matching field, and, if an unstructured text vector matching the extended matching field exists in the database, fuzzy matching is performed on the unstructured text vector to obtain the retrieval result.
Fig. 3 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 3, the electronic device may include a processor 310, a communication interface (Communications Interface) 320, a memory 330, and a communication bus 340, where the processor 310, the communication interface 320, and the memory 330 communicate with each other via the communication bus 340. The processor 310 may call logic instructions in the memory 330 to execute a data retrieval method, the method including: obtaining feature information of a query statement of a user based on a semantic analysis result of the query statement by a language model, where the feature information includes query intention, key information, and context information; obtaining a plurality of matching field groups of the query statement from a field-topic mapping table of a database based on the feature information, where the field-topic mapping table includes mapping relationships between a plurality of preset topics of the database and fields of the database, and the matching field groups correspond to the preset topics one to one; expanding the matching fields in each matching field group to obtain a plurality of extended matching field groups of the query statement; performing multidimensional retrieval in the database based on the plurality of extended matching field groups to obtain a retrieval result of each extended matching field group; and performing weighted fusion on the retrieval results based on the weights of the extended matching field groups to obtain the search data of the query statement.
Further, the logic instructions in the memory 330 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the data retrieval method provided above, the method including: obtaining feature information of a query statement of a user based on a semantic analysis result of the query statement by a language model, where the feature information includes query intention, key information, and context information; obtaining a plurality of matching field groups of the query statement from a field-topic mapping table of a database based on the feature information, where the field-topic mapping table includes mapping relationships between a plurality of preset topics of the database and fields of the database, and the matching field groups correspond to the preset topics one to one; expanding the matching fields in each matching field group to obtain a plurality of extended matching field groups of the query statement; performing multidimensional retrieval in the database based on the plurality of extended matching field groups to obtain a retrieval result of each extended matching field group; and performing weighted fusion on the retrieval results based on the weights of the extended matching field groups to obtain the search data of the query statement.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411162580.4A CN118673101B (en) | 2024-08-23 | 2024-08-23 | Data retrieval method, device, electronic equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118673101A (en) | 2024-09-20 |
| CN118673101B (en) | 2025-01-07 |
Family
ID=92724814
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411162580.4A Active CN118673101B (en) | 2024-08-23 | 2024-08-23 | Data retrieval method, device, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118673101B (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112015762A (en) * | 2019-05-30 | 2020-12-01 | 广州慧睿思通信息科技有限公司 | Case retrieval method and device, computer equipment and storage medium |
| CN112035598A (en) * | 2020-11-03 | 2020-12-04 | 北京淇瑀信息科技有限公司 | Intelligent semantic retrieval method and system and electronic equipment |
| CN118193714A (en) * | 2024-05-17 | 2024-06-14 | 山东浪潮科学研究院有限公司 | Dynamic adaptation question-answering system and method based on hierarchical structure and retrieval enhancement |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102090237B1 (en) * | 2018-07-31 | 2020-03-17 | 주식회사 포티투마루 | Method, system and computer program for knowledge extension based on triple-semantic |
| CN116881436B (en) * | 2023-08-09 | 2025-08-19 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Knowledge graph-based document retrieval method, system, terminal and storage medium |
| CN117633202A (en) * | 2023-11-23 | 2024-03-01 | 中国船舶集团有限公司系统工程研究院 | An unstructured data processing method, device, equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118673101A (en) | 2024-09-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||