CN118468814A - Document data extraction method and system based on large model and extraction target - Google Patents
Document data extraction method and system based on large model and extraction target
- Publication number
- CN118468814A (application number CN202410670078.8A)
- Authority
- CN
- China
- Prior art keywords
- document
- text
- extraction
- data
- document tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/154—Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to the technical field of natural language processing, and in particular to a document data extraction method and system based on a large model and an extraction target, which aim to solve the poor universality and flexibility of knowledge extraction caused by relying on a large number of different, non-universal models, processes and rules. To this end, the application performs unified format conversion and division on the document data, generates a corresponding document tree, uses a matching algorithm on the document tree to obtain the document tree nodes matching the extraction target, and extracts the document data based on the large model, the extraction target and the document tree nodes. The application thereby extracts document data in multiple formats in a unified way, reduces the extraction noise and workload of the large model, and solves the poor universality and flexibility of knowledge extraction caused by having to train different algorithm models to extract different types of data.
Description
Technical Field
The application relates to the technical field of natural language processing, in particular to a document data extraction method and system based on a large model and an extraction target.
Background
With the development of artificial intelligence, using large models to extract required information from massive documents has become a research hotspot. However, documents contain not only text but also other types of data such as pictures and tables, and extracting different types of data usually requires training different algorithm models. The resulting large number of different, non-universal models, processes and rules leads to poor universality and flexibility in knowledge extraction.
Accordingly, there is a need in the art for a new extraction target-based document data extraction scheme to solve the above-described problems.
Disclosure of Invention
The application aims to solve the above technical problem, namely the poor universality and flexibility of knowledge extraction caused by a large number of different, non-universal models, processes and rules.
In a first aspect, the present application provides a document data extraction method based on a large model and an extraction target, the method comprising: carrying out unified format conversion and division on the document data to generate a corresponding document tree; acquiring a document tree node matched with the extraction target by using a matching algorithm based on the document tree; document data is extracted based on the large model, the extraction target, and the document tree nodes.
In one technical scheme of the above document data extraction method based on the large model and the extraction target, any document tree node comprises a corresponding document number, a node number, a parent node number, leaf node information and text information.
In one technical scheme of the document data extraction method based on the large model and the extraction target, the document tree comprises a text document tree, the document data is subjected to unified format conversion and division, and a corresponding document tree is generated, and the method comprises the following steps: uniformly converting the document data into text data; dividing text data according to a preset document structure; a text document tree is generated based on the divided text data.
In one technical scheme of the document data extraction method based on the large model and the extraction target, generating a text document tree based on the divided text data includes: extracting corresponding summary information according to the divided text data; a text document tree is generated based on the partitioned text data and the corresponding summary information.
In one technical scheme of the document data extraction method based on the large model and the extraction target, the document tree further comprises a vector document tree, and the method further comprises: the text document tree is converted into a vector document tree.
In one technical scheme of the document data extraction method based on the large model and the extraction target, obtaining a document tree node matched with the extraction target by using a matching algorithm based on a document tree includes: generating text matching scores of the extraction targets and nodes of the text document tree by using a matching algorithm; generating vector matching scores of the extraction targets and all nodes of the vector document tree by using a matching algorithm; calculating the matching score of each node based on the text matching score of each node and the vector matching score of each node; and acquiring text document tree nodes related to the extraction target according to the matching score based on a preset strategy.
In one technical scheme of the above document data extraction method based on the large model and the extraction target, extracting document data based on the large model, the extraction target and the document tree nodes comprises the following steps: acquiring text information corresponding to the text document tree nodes related to the extraction target; and extracting the document data according to the extraction target, the text information and the preset extraction requirement by using the large model.
In a second aspect, the present application provides a document data extraction system based on a large model and an extraction target, the system comprising: the document tree generating module is used for carrying out unified format conversion and division on document data to generate a corresponding document tree; the node matching module is used for acquiring the document tree nodes matched with the extraction target by utilizing a matching algorithm based on the document tree; and the data extraction module is used for extracting the document data based on the large model, the extraction target and the document tree node.
In one technical scheme of the above document data extraction system based on the large model and the extraction target, any document tree node comprises a corresponding document number, a node number, a parent node number, leaf node information and text information.
In one technical solution of the above document data extraction system based on a large model and an extraction target, the document tree includes a text document tree, and the document tree generating module includes: the conversion unit is used for uniformly converting the document data into text data; the dividing unit is used for dividing the text data according to a preset document structure; and a text document tree generating unit for generating a text document tree based on the divided text data.
In one aspect of the document data extraction system based on the large model and the extraction target, the text document tree generating unit includes: an extraction subunit, configured to extract corresponding summary information according to the divided text data; and the generation subunit is used for generating a text document tree based on the divided text data and the corresponding summary information.
In one technical solution of the above document data extraction system based on the large model and the extraction target, the document tree further includes a vector document tree, and the system further includes: a document tree conversion module for converting the text document tree into a vector document tree.
In one technical scheme of the document data extraction system based on the large model and the extraction target, the node matching module includes: the first matching unit is used for generating text matching scores of the extraction targets and all nodes of the text document tree by using a matching algorithm; the second matching unit is used for generating vector matching scores of the extraction targets and all nodes of the vector document tree by using a matching algorithm; a matching score calculation unit for calculating a matching score of each node based on the text matching score of each node and the vector matching score of each node; the node extraction unit is used for acquiring text document tree nodes related to the extraction target according to the matching score based on a preset strategy.
In one technical solution of the above document data extraction system based on a large model and an extraction target, the data extraction module includes: an acquisition unit for acquiring text information corresponding to the text document tree nodes related to the extraction target; and an extraction unit for extracting the document data according to the extraction target, the text information and the preset extraction requirement by using the large model.
In a third aspect, the present application provides a computer-readable storage medium having stored therein a plurality of program codes adapted to be loaded and executed by a processor to perform the document data extraction method based on the large model and the extraction target in the above first aspect or any one of its corresponding aspects.
In a fourth aspect, the present application provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores a computer program, and the computer program when executed by at least one processor implements the document data extraction method based on the large model and the extraction target in the first aspect or any of the corresponding aspects.
One or more of the above technical solutions of the present application have at least one or more of the following beneficial effects:
in the technical solution of the application, performing unified format conversion and division on the document data unifies the document format and splits the document data, and generating the corresponding document tree from the converted and divided data makes the document data convenient to retrieve and access. Document tree nodes matching the extraction target are obtained with a matching algorithm based on the document tree, and the document data is then extracted according to these document tree nodes and the extraction target, so that document data in multiple formats can be extracted in a unified way, solving the poor universality and flexibility of knowledge extraction caused by having to train different algorithm models to extract different types of data.
In the technical scheme of the application, the text document tree is converted into the vector document tree, and the node matched with the extraction target is determined by utilizing the matching algorithm based on the text document tree and the vector document tree, so that the purposes of improving the accuracy and the reliability of node extraction are realized, and the influence of noise on knowledge extraction is reduced.
In the technical scheme of the application, the summary information corresponding to the divided text data is extracted, and the text document tree is generated by utilizing the summary information and the divided text data, so that the purposes of improving the quality of the generated text document tree and further reducing the influence of noise on knowledge extraction are realized.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow diagram of the main steps of a document data extraction method based on a large model and extraction targets according to one embodiment of the application;
FIG. 2 is a flow chart of the main steps of a document data extraction method based on a large model and extraction targets according to another embodiment of the application;
FIG. 3 is a schematic block diagram of the primary architecture of a large model and extraction target based document data extraction system according to one embodiment of the application;
FIG. 4 is a schematic block diagram of the primary structure of a large model and extraction target based document data extraction system according to another embodiment of the application;
Fig. 5 is a schematic diagram of the connection between a processor and a memory of an electronic device according to one embodiment of the application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "mounted," "connected," "coupled," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; the two components can be directly connected or indirectly connected through an intermediate medium, or can be communicated inside the two components, or can be connected wirelessly or in a wired way.
Further, "module," "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, memory, or software components, such as program code, or a combination of software and hardware. The processor may be a central processor, a microprocessor, an image processor, a digital signal processor, or any other suitable processor. The processor has data and/or signal processing functions. The processor may be implemented in software, hardware, or a combination of both. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random access memory, and the like.
The following explains some terms related to the present application:
Large model: also known as large language models (Large Language Model, LLM), which are deep learning models with a large parametric scale and complexity, can generate natural language text or understand the meaning of language text.
Noise: referring to errors or outliers contained in the data can interfere with the learning and prediction capabilities of the model, resulting in reduced performance of the model.
Knowledge extraction: and extracting knowledge in the document through recognition, understanding, screening and formatting, and storing the knowledge in a knowledge base in a certain form.
Named entity Recognition (NAMED ENTITY Reconnaissance, NER): the entity with specific meaning in the identification text mainly comprises a person name, a place name, an organization name, a proper noun and the like.
Relation extraction (relation-extraction): refers to extracting (subject, relationship, object) such triples from a piece of text in order to identify target relationships in the text entity.
Event extraction: refers to extracting events of interest to a user from unstructured information and presenting the events to the user in a structured form.
Key-value pair (k-v pair): a data concept that implements a mapping in programming languages; keys serve as indexes to elements, and values represent the data that is stored and read.
Vector model (embedding): the process of mapping high-dimensional data, e.g., text, pictures or audio, to a low-dimensional space; an embedding vector is typically a vector of real numbers that represents the input data as a point in a continuous numerical space.
OCR (Optical Character Recognition): a technology that optically converts the printed characters in a document into a black-and-white dot-matrix image file and then converts the characters in the image into a text format with recognition software, for further editing and processing by word processing software.
Image-text tasks: tasks that can be processed by a visual language model, including image caption generation, image-text retrieval and visual question answering.
Prompt word: refers to text entered into the large model to prompt or guide the large model to give an output that meets expectations.
Root node, parent node, leaf node: in a tree structure, a node with no higher-level node above it is called the root node; the node directly above the current node is called its parent node; a node with no node below it is called a leaf node, i.e., a lowest-level node.
Relational database: a database that organizes data with a relational model, storing data in rows and columns for ease of understanding; rows and columns form tables, and multiple tables form a database.
Vector library (vector database): a special type of database used to store and process vector data. Its main feature is the ability to efficiently perform search and comparison operations in vector space, such as nearest neighbor search.
Normalized exponential function (Softmax function): a generalization of the logistic function that "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) such that each element ranges from 0 to 1 and all elements sum to 1; it is used in multi-classification problems.
It should be noted that, although the following embodiments describe the steps in a specific order, it will be understood by those skilled in the art that, in order to achieve the effects of the present application, the steps are not necessarily performed in such an order, and may be performed simultaneously (in parallel) or in other orders, and these variations are within the scope of the present application.
According to an aspect of an embodiment of the present application, there is provided a document data extraction method based on a large model and an extraction target, referring to fig. 1, fig. 1 is a schematic flow chart of main steps of the document data extraction method based on a large model and an extraction target according to an embodiment of the present application, the method mainly includes the following steps S1 to S3:
Step S1, carrying out unified format conversion and division on document data to generate a corresponding document tree.
In the present embodiment, document data refers to data in a plurality of formats in a document, and the document data includes, but is not limited to: picture data, form data, text data. Unified format conversion is to convert document data in multiple formats into a unified format, such as unified conversion into text, vector, or a format that facilitates analysis and extraction of the document data. The division is to divide the converted document data into a plurality of parts, and may be according to a document title, paragraph, label, etc. Finally, a document tree is generated according to the document data after the transformation and division, wherein the document tree can be used for storing and organizing the document data with a hierarchical structure. For example, the document tree may be a multi-tree, in which each node may have a plurality of child nodes, and the multi-level nodes represent a plurality of levels of the document, respectively, corresponding to storing multi-level titles and document data under the multi-level titles. Hierarchical storage and convenient access of document data can be realized based on the document tree structure.
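As an illustration only, a node of such a document tree could be modeled as follows (a minimal Python sketch; the field names follow the node information described later in this disclosure, and the example values are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DocTreeNode:
    nid: str                          # unique node id
    doc_id: str                       # id of the document the node belongs to
    parent_id: Optional[str] = None   # None for the root node
    is_leaf: bool = False             # whether this node is a leaf node
    leaf_ids: List[str] = field(default_factory=list)  # ids of the node's leaf nodes
    content: str = ""                 # original or parsed text of this part of the document
    summary: str = ""                 # summary of the content
    key_information: str = ""         # e.g. node level, keywords, title name

# A three-layer example: root (document) -> parent (heading) -> leaf (body text)
root = DocTreeNode(nid="n0", doc_id="d1", content="Annual report 2022")
heading = DocTreeNode(nid="n1", doc_id="d1", parent_id="n0", content="1. Operating income")
body = DocTreeNode(nid="n2", doc_id="d1", parent_id="n1", is_leaf=True,
                   content="Total operating income was 1,720,951,997.61 ...")
root.leaf_ids.append(body.nid)
heading.leaf_ids.append(body.nid)
```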
In one embodiment, the document to be extracted may be a web article, a paper, a journal, a magazine, a news, etc., and the document acquisition source may be a web or real life, which is not particularly limited in this embodiment.
In one embodiment, the document data is divided by comprehensively considering a specified maximum number of words, text logic and the like of each node. Illustratively, the division is performed while keeping the integrity of a sentence as much as possible and keeping the logic among multiple sentences.
And S2, obtaining the document tree nodes matched with the extraction targets by using a matching algorithm based on the document tree.
In the present embodiment, the extraction target is a task object of document data extraction (also may be referred to as knowledge extraction). Alternatively, the extraction targets include, but are not limited to, one or more keywords, one or more sentences of text, such as a question, a plurality of keywords extracted from a question, and the like. Specifically, step S2 converts the extraction target into the same format as the document data after unified conversion in step S1, such as text, vector, and the like, and then matches the document data stored based on the document tree structure with the extraction target to obtain a document tree node matched with the extraction target.
In one embodiment, the matching algorithm includes, but is not limited to, matching algorithm cosine similarity (Cosine Similarity), jaccard (Jaccard) similarity coefficients, string matching (KMP) algorithm. Depending on the specific task, an appropriate algorithm may be selected to perform the matching calculation.
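As a minimal sketch (not part of the original disclosure), two of the listed similarity measures could be computed as follows; for Chinese text a word segmenter such as jieba would replace the whitespace split:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard coefficient over the word sets of two texts."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def cosine_similarity(u, v) -> float:
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(x * x for x in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```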
In one embodiment, the extraction target and the document data can be converted into multiple formats, the matching calculation is performed under the multiple formats, and finally, the document tree node matched with the extraction target is obtained according to the result of the matching calculation under the multiple formats.
And step S3, extracting the document data based on the large model, the extraction target and the document tree nodes.
In this embodiment, the large model refers to a deep learning model with a huge parameter scale and complexity, and can generate a natural language text or understand the meaning of the language text. The extraction target can also be used as a prompt word of the large model for prompting or guiding the large model to give out the expected output, namely the document data needing to be extracted. Specifically, the extraction target and the document data stored corresponding to the document tree node obtained in the step S2 are input into a large model, and the document data meeting the extraction target can be further extracted by using the large model.
Through the above steps S1 to S3, performing unified format conversion and division on the document data unifies the document format and splits the document data, and generating the corresponding document tree from the converted and divided data makes the document data convenient to retrieve and access. Document tree nodes matching the extraction target are obtained with a matching algorithm based on the document tree, and the large model then extracts the document data according to these document tree nodes and the extraction target, so that document data in multiple formats can be extracted in a unified way, the extraction noise and workload of the large model are reduced, and the poor universality and flexibility of knowledge extraction caused by having to train different algorithm models to extract different types of data is resolved.
Steps S1 to S3 are further described below.
In one implementation of the embodiment of the present application, the document tree includes a text document tree, and step S1 may further include the following steps S10 to S12:
Step S10, uniformly converting the document data into text data.
In this embodiment, the document types include, but are not limited to docx, pdf, pptx, txt, etc., the document data includes data in various formats such as text, table, picture, etc., and in this embodiment, taking unified conversion of the document data into text data as an example, the following specific conversion (parsing) methods are given:
tables can be divided into text tables and picture tables. Tables that appear as pictures in docx, pptx and pdf files, i.e., picture tables, can first be parsed into text tables by OCR (Optical Character Recognition) and then converted into text data (plain text).
A text table is composed of multiple rows and columns of text; in this embodiment, the rows and columns of the text table are delimited by several identifiers so that the table can be converted into text data.
As an example, <row>[table row information]</row> represents one row of the table, the columns within the same row are separated by "|", and a null value is represented by "-".
Table 1. Text table example
| Project | Notes | 2022 | 2021 |
| --- | --- | --- | --- |
| 1. Total operating income | - | 1,720,951,997.61 | 1,667,467,958.09 |
| Including: operating income | 5, 32 | 1,720,951,997.61 | 1,667,467,958.09 |
| 2. Total operating cost | - | 1,776,345,688.55 | 1,547,736,424.89 |
| Including: operating cost | 5, 32 | 812,544,687.24 | 777,945,353.96 |
For the text table shown in Table 1, the text data obtained after conversion (parsing) is as follows:
<row>Project|Notes|2022|2021</row><row>1. Total operating income|-|1,720,951,997.61|1,667,467,958.09</row><row>Including: operating income|5, 32|1,720,951,997.61|1,667,467,958.09</row><row>2. Total operating cost|-|1,776,345,688.55|1,547,736,424.89</row><row>Including: operating cost|5, 32|812,544,687.24|777,945,353.96</row>
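A minimal sketch of this table-to-text serialization (the function name and row data are illustrative, taken from Table 1):

```python
def table_to_text(rows):
    """Serialize a table (list of rows, each a list of cell strings) into the
    <row>...</row> text format described above; empty cells become '-'."""
    parts = []
    for row in rows:
        cells = [cell.strip() if cell and cell.strip() else "-" for cell in row]
        parts.append("<row>" + "|".join(cells) + "</row>")
    return "".join(parts)

rows = [
    ["Project", "Notes", "2022", "2021"],
    ["1. Total operating income", "", "1,720,951,997.61", "1,667,467,958.09"],
    ["Including: operating income", "5, 32", "1,720,951,997.61", "1,667,467,958.09"],
]
print(table_to_text(rows))
# -> <row>Project|Notes|2022|2021</row><row>1. Total operating income|-|...</row>...
```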
Picture data is mainly converted into text data by means of image-to-text models and OCR models. Image understanding tools based on artificial intelligence can automatically generate textual descriptions from the scenes, people, objects, relations and other information in a picture. Usable image-to-text tools include, but are not limited to, donut-base, blip-image-captioning, vit-gpt2-image-captioning and the like.
According to the extraction purpose, picture data is divided into two classes: picture-class and text-class. A picture-class picture has non-text content as its main content, while a text-class picture has text as its main content. For picture-class pictures, the content description is mainly extracted; for example, an image-to-text tool analyzes the scenes, people, objects, relations and other information in the picture to obtain the corresponding text data. For text-class pictures, the text in the picture is mainly extracted, and the corresponding text data is obtained by parsing with an OCR model.
It should be noted that this embodiment does not limit the format of the document data to be converted; for example, a video may be converted into text data by parsing it into multiple frames of pictures, and speech may be converted into text data by a speech-to-text tool.
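As an illustration of the two picture classes, the sketch below uses an OCR library (pytesseract) for text-class pictures and a Hugging Face image-to-text pipeline for picture-class pictures; the specific libraries and model name are assumptions, not requirements of the method:

```python
from transformers import pipeline
import pytesseract
from PIL import Image

def picture_to_text(path: str, is_text_picture: bool) -> str:
    """Convert a picture into text data, choosing OCR or captioning by picture class."""
    image = Image.open(path)
    if is_text_picture:
        # text-class picture: extract the characters with OCR
        return pytesseract.image_to_string(image)
    # picture-class picture: generate a natural-language description of the content
    captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
    return captioner(image)[0]["generated_text"]
```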
Step S11, dividing the text data according to a preset document structure.
In this embodiment, the preset document structure includes, but is not limited to: various levels of titles, section characters, HTML tags, etc. of the document.
In one embodiment, documents to be extracted are divided (cut) by levels of titles. As one example, a document includes a primary title, a secondary title, and a tertiary title, text data may be divided into three portions.
Step S12, generating a text document tree based on the divided text data.
In this embodiment, the hierarchy of the text document tree may be set according to a specific situation, and it is assumed that the text document tree is a three-layer multi-way tree, and from top to bottom, the first layer is a root node, where the content stored in correspondence with the root node includes, but is not limited to, information such as a name, time, and keywords of the document; the second layer is a father node, and the content stored by the father node includes but is not limited to titles and keywords of all levels of documents; the third layer is a leaf node, and the content stored by the leaf node includes, but is not limited to, specific content under each level of titles of the document, such as text, words parsed from a table or a picture, and the like.
In one embodiment, the text document tree may be generated by summarizing the divided text data, and generating the text document tree using the summarization information together with the divided text data. Wherein the summary information may be generated using a large model.
In one embodiment, the text document tree may be generated by using information such as a document name, a multi-level title name, a document creation time, a document author name, a subject class or category to which the document belongs, which is not limited in this embodiment.
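A minimal sketch of steps S11 and S12 under the assumption that markdown-style "#" headings mark the document structure; the three-layer layout (root, heading nodes, body leaves) follows the description above, and all names are illustrative:

```python
import re

def build_text_document_tree(doc_id: str, doc_name: str, text: str):
    """Split text on '#', '##', '###' headings and build a three-layer tree
    as a list of node dicts: root -> heading nodes -> body-text leaves."""
    root = {"nid": f"{doc_id}-root", "doc_id": doc_id, "parent_id": None,
            "is_leaf": False, "content": doc_name, "leaf_ids": []}
    nodes = [root]
    sections = re.split(r"\n(?=#{1,3} )", text)   # keep each heading with its body
    for i, section in enumerate(sections):
        lines = section.strip().splitlines()
        if not lines:
            continue
        heading = lines[0].lstrip("# ").strip()
        body = "\n".join(lines[1:]).strip()
        pid, lid = f"{doc_id}-h{i}", f"{doc_id}-l{i}"
        nodes.append({"nid": pid, "doc_id": doc_id, "parent_id": root["nid"],
                      "is_leaf": False, "content": heading, "leaf_ids": [lid]})
        nodes.append({"nid": lid, "doc_id": doc_id, "parent_id": pid,
                      "is_leaf": True, "content": body, "leaf_ids": []})
        root["leaf_ids"].append(lid)
    return nodes
```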
In one embodiment, the content corresponding to text document tree nodes may be stored in a relational database, such as MySQL or another SQL database. Because storing a tree structure in a relational database requires considering the hierarchical relationships and connections among nodes, this embodiment designs the information that each text document tree node stores in the relational database, thereby realizing document data storage based on the tree structure. This storage approach helps convey the hierarchical structure of the document, facilitates retrieval and access of the document data, avoids invalid data access and improves retrieval efficiency.
The information stored by the text document tree node in the relational database is shown in table 2:
Table 2. Information stored by a text document tree node in the relational database
| Field | Data type | Nullable | Meaning |
| --- | --- | --- | --- |
| nid | character | no | unique node id (primary key) |
| parent_id | character | yes | id of the parent node |
| content | character | yes | original or parsed text of the node |
| summary | character | yes | summary of the text content |
| key_information | character | yes | node level, keywords/key sentences, title name, etc. |
| is_leaf | boolean | no | whether the node is a leaf node |
| leaf_id | character | yes | list of the node's leaf node ids |
| doc_id | character | no | unique id of the document |
In particular, the information that any node in the text document tree may store in the relational database includes, but is not limited to:
Node number: nid (node id), the data type is character type, is the unique id primary key of node, is used for distinguishing a plurality of storage nodes, does not allow to be null value.
Parent node number: parent_id (parent node id), the data type is character type, which is used to represent the parent node of the current storage node, and the parent_id may be null value because not all nodes have parent nodes.
Text information: content, i.e., the original text of the document or the parsed text data; summary, i.e., a summary of the text content; key_information, which may include the node level, keywords/key sentences, title name, etc. The data type of each text information field is character, and they may be null. It should be noted that the keywords/key sentences may be extracted based on, but not limited to, natural language processing techniques such as TF-IDF, a classification model, or a large model.
Leaf node information: is_leaf (whether leaf node) and the data type is boolean, which is used for indicating whether the current storage node is a leaf node or not, and null value is not allowed; leaf_id (leaf node id), the data type is character type, is used for recording the list set of all leaf node ids of the current storage node, and leaf_id can be null value because not all nodes have leaf nodes.
Document number: doc_id (document id), the data type is character type, which means that each document has a unique id, and null value is not allowed.
When the information of the text document tree node is stored in the relational database, the information is stored as one line of data of the table, and a plurality of nodes may store a plurality of lines of information correspondingly.
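A minimal sketch of this row-per-node storage, using SQLite as a stand-in for the relational database (the table and column names follow Table 2; nothing here is mandated by the method):

```python
import json
import sqlite3

conn = sqlite3.connect("document_tree.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS doc_tree_node (
        nid TEXT PRIMARY KEY,          -- node id, not null
        parent_id TEXT,                -- may be NULL for the root node
        content TEXT,
        summary TEXT,
        key_information TEXT,
        is_leaf INTEGER NOT NULL,      -- boolean stored as 0/1
        leaf_id TEXT,                  -- JSON-encoded list of leaf node ids, may be NULL
        doc_id TEXT NOT NULL
    )
""")

def insert_node(node: dict):
    """Store one document tree node as one row of the table."""
    conn.execute(
        "INSERT INTO doc_tree_node VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (node["nid"], node.get("parent_id"), node.get("content", ""),
         node.get("summary", ""), node.get("key_information", ""),
         int(node["is_leaf"]), json.dumps(node.get("leaf_ids", [])), node["doc_id"]),
    )
    conn.commit()
```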
In one embodiment, after step S12, further comprising:
step S13, converting the text document tree into a vector document tree.
In the present embodiment, the text document tree generated as described above is converted into a vector document tree. Illustratively, a text document tree is converted to a vector document tree using a word embedding (Embedding) model, the tree structure is maintained, and the converted vector content is stored in a vector database.
In one embodiment, for each node, one or more of its non-empty "content", "summary" and "key_information" fields are selected and converted into corresponding vectors by a vector model (e.g., a word vector model, sentence vector model or passage vector model), and the vectors are stored in a vector database according to the original tree structure, where the vector database includes, but is not limited to, Milvus, Annoy, Elasticsearch, Faiss, etc.
It should be noted that a vector database is a special type of database for storing and processing vector data. A vector database can efficiently perform search and comparison operations on vectors in a vector space, such as nearest neighbor search and similarity calculation, and also supports high-dimensional search and custom indexing; using a vector database for retrieval has the advantages of scalability, flexibility and efficiency.
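A minimal sketch of step S13 under the assumption that sentence-transformers provides the vector model and FAISS serves as the vector library; any of the databases listed above could be substituted, and the model name is only an example:

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def build_vector_document_tree(nodes):
    """Embed the non-empty "content" of each node and store the vectors in a
    FAISS index; the node ids are kept so the tree structure can be recovered."""
    texts = [n["content"] for n in nodes if n.get("content")]
    ids = [n["nid"] for n in nodes if n.get("content")]
    vectors = model.encode(texts, normalize_embeddings=True).astype("float32")
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine for normalized vectors
    index.add(vectors)
    return index, ids
```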
In one implementation of the embodiment of the present application, step S2 may further include the following steps S20 to S23:
and S20, generating text matching scores of the extraction targets and nodes of the text document tree by using a matching algorithm.
In this embodiment, the matching algorithm is used to calculate the text similarity (score) of the extraction target and the text of each node of the text document tree, such as one or more of "content", "summary", "key_information".
In one embodiment, suppose the extraction target is a sentence such as "What is the contact address of Party A?". The length of the longest common substring can then be calculated between the extraction target and the "content" stored at each node of the text document tree. For example, for "What is the contact address of Party A?" and a node whose "content" is "The address he provided is Tiannan Lu, Shanghua Street", the longest common substring is "address", whose length is 2 (two characters in the original Chinese text). It should be noted that, since text may contain a large number of meaningless words, such as stop words (prepositions, conjunctions, pronouns, etc.), the longest-common-substring distinction may not be obvious. Therefore, keywords may first be extracted from the extraction target by word segmentation, for example "Party A" and "contact address", and the length of the longest common substring is then calculated between the extracted keywords and the "content" of each node, thereby reducing the influence of noise.
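A minimal sketch of the longest-common-substring score used above (dynamic programming; each Chinese character counts as one unit, so "地址" gives length 2):

```python
def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best
```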
It will be appreciated that matching any one extraction target to a plurality of nodes results in a plurality of text match scores, and therefore the resulting text match score is a vector comprising the matching scores of the plurality of nodes to the extraction target. The matching scores of the plurality of nodes in the vector can be ordered according to nid of the nodes or ordered according to the score size.
In one embodiment, when there are multiple extraction targets, a plurality of matching scores (vectors) corresponding to the multiple extraction targets may be summed or averaged and the like to obtain a text matching score between the extraction target and each node of the text document tree.
In one embodiment, a text matching score between each node of the text document tree and the extraction target may also be obtained by counting how many of the keywords extracted from the extraction target hit the "key_information" of each node, where a hit means that the two compared words are identical and the hit count is the number of such identical pairs.
In one embodiment, the matching score of the extraction target and each node text can be calculated by one or more text matching methods, and then the text matching score of each node of the text document tree and the extraction target can be calculated by summing or averaging the text matching scores obtained by the multiple matching methods.
In one embodiment, the text matching score computed between the extraction target and each node of the text document tree may be converted into a probability score between [0,1] using the Softmax algorithm. Illustratively, the Softmax probability transfer function is as follows:
σ(z)_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j), i = 1, …, K
where z_i represents the i-th element of the text matching score vector between the extraction target and the nodes of the text document tree, and exp represents the exponential function with the natural constant e as base. Assuming three text document tree nodes whose text matching score vector with the extraction target is [2, 1, 3], the Softmax conversion gives [0.24, 0.09, 0.67].
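A quick check of the example above with a minimal Softmax sketch:

```python
import numpy as np

def softmax(scores):
    """Convert raw matching scores into probabilities in [0, 1] that sum to 1."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

print(np.round(softmax([2, 1, 3]), 2))  # -> [0.24 0.09 0.67]
```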
And S21, generating vector matching scores of the extraction targets and all nodes of the vector document tree by using a matching algorithm.
In this embodiment, the extraction target is converted into a vector, and then a vector matching algorithm is used to calculate a vector matching score, which is the similarity between the converted vector and the vector of each node of the vector document tree. For example, one or more vectors of "content", "summary", "key_information" may be selected according to specific requirements, and vector matching scores may be calculated for vectors corresponding to the extraction targets.
In one embodiment, the method of vector similarity calculation includes, but is not limited to, cosine (cosine) and matrix dot product methods.
In one embodiment, when there are multiple sets of vector matching scores, the multiple sets of vector matching scores may be summed or averaged to obtain a vector matching score for each node of the extraction target and vector document tree. Please refer to other embodiments in detail, which are not described herein.
In one embodiment, for the calculated vector matching score of the extraction target and each node of the vector document tree, it may be converted to a probability score between [0,1] using the Softmax algorithm. Please refer to other embodiments in detail, which are not described herein.
Step S22, calculating the matching score of each node based on the text matching score of each node and the vector matching score of each node.
In this embodiment, the matching score of each node is calculated by the sum or weighted sum of the text matching score and the vector matching score.
In one embodiment, the weights of the weighted sums may be set manually or may be determined by training the labeled samples.
In one embodiment, setting the text similarity weight to 0.23 and the vector similarity weight to 0.83 gives the calculated matching scores a better matching result.
Step S23, obtaining text document tree nodes related to the extraction targets according to the matching scores based on a preset strategy.
In this embodiment, extracting text document tree nodes based on a preset policy includes, but is not limited to: selecting text document tree nodes with the matching score larger than a preset threshold value in the step S22; or after sorting the matching scores, selecting the top N text document tree nodes.
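A minimal sketch of steps S22 and S23, combining the two scores with the example weights given above and then applying either the threshold or the top-N strategy (all names here are illustrative):

```python
def select_nodes(node_ids, text_scores, vector_scores,
                 w_text=0.23, w_vector=0.83, top_n=3, threshold=None):
    """Weighted sum of text and vector scores per node, then keep either the
    nodes above a threshold or the N highest-scoring nodes."""
    combined = [w_text * t + w_vector * v
                for t, v in zip(text_scores, vector_scores)]
    scored = sorted(zip(node_ids, combined), key=lambda x: x[1], reverse=True)
    if threshold is not None:
        return [(nid, s) for nid, s in scored if s > threshold]
    return scored[:top_n]
```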
Based on the steps S20 to S23, the present embodiment converts the text document tree into a vector document tree, and determines the node matched with the extraction target based on the text document tree and the vector document tree by using the matching algorithm, thereby achieving the purpose of improving the accuracy and reliability of node extraction, and reducing the influence of noise on knowledge extraction.
In one implementation of the embodiment of the present application, step S3 may further include the following steps S30 and S31:
Step S30, obtaining text information corresponding to the text document tree nodes related to the extraction targets.
In this embodiment, based on the text document tree node extracted in step S23, corresponding text information is acquired, where the text information includes "content", "summary", and "key_information".
Step S31, extracting the document data according to the extraction target, the text information and the preset extraction requirement by using the large model.
In this embodiment, the preset extraction requirement is the additional information provided when extracting information with the large model. To ensure that no information is missed during extraction, "content" may be selected as the text information input into the large model.
In one embodiment, when obtaining the text document tree nodes related to the extraction target, if the number of documents is large, the first two layers of the document tree may be mainly considered, i.e., the document name, time and keywords, and the titles and keywords at each level; if the number of documents is small, the lower two layers of the document tree may be mainly considered, i.e., the titles and keywords at each level and the text parsed from the original document.
In one embodiment, based on the text document tree and the vector document tree, the leaf node which is most relevant to the extraction target, namely "content", is recalled by using a similarity algorithm, and if the root node or the father node enters a recall range, all leaf nodes which belong to the root node can be recalled.
In one embodiment, assume the extraction target is: "What are the requirements on the office area for the daily monitoring of the operation monitoring unit?" The text information corresponding to the text document tree node extracted from the document data according to this extraction target is as follows:
"5.1.1 The daily monitoring of the operation monitoring unit shall meet the following requirements:
1. Data analysts and monitoring duty personnel shall be staffed in light of the engineering scale of the urban lifeline, the daily monitoring work content, and other factors;
2. A 24-hour uninterrupted duty system shall be established;
3. An office area capable of accommodating at least 2 people shall be provided;
4. Data analysts and monitoring duty personnel shall wear uniform dress according to unit requirements or standardized dress according to industry requirements, and information such as name and post shall be clearly displayed on a light board."
Further, the following is input into the large model:
"Please answer the question based on the information provided below; if it cannot be answered, answer [UNKNOWN]:
5.1.1 The daily monitoring of the operation monitoring unit shall meet the following requirements:
1. Data analysts and monitoring duty personnel shall be staffed in light of the engineering scale of the urban lifeline, the daily monitoring work content, and other factors;
2. A 24-hour uninterrupted duty system shall be established;
3. An office area capable of accommodating at least 2 people shall be provided;
4. Data analysts and monitoring duty personnel shall wear uniform dress according to unit requirements or standardized dress according to industry requirements, and information such as name and post shall be clearly displayed on a light board.
The question is: What are the requirements on the office area for the daily monitoring of the operation monitoring unit?
Your answer is: "
The contents other than the extraction question and the text information are the preset extraction requirement, which can be defined according to specific needs. The extraction question, the text information and the preset extraction requirement together form the prompt of the large model, which is used to prompt or guide the large model to give the expected output.
For the above prompt, the output of the large model, i.e., the extracted document data, is as follows: "The office area for the daily monitoring of the operation monitoring unit should be able to accommodate at least 2 people."
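As an illustration only, the prompt above could be assembled as follows; the template wording, function names and the large-model call are hypothetical, not part of the original disclosure:

```python
PROMPT_TEMPLATE = (
    "Please answer the question based on the information provided below; "
    "if it cannot be answered, answer [UNKNOWN]:\n"
    "{context}\n\n"
    "The question is: {target}\n"
    "Your answer is: "
)

def build_prompt(extraction_target: str, node_texts: list) -> str:
    """Splice the recalled node text, the extraction target and the preset
    extraction requirement (the fixed template text) into one prompt."""
    return PROMPT_TEMPLATE.format(context="\n".join(node_texts), target=extraction_target)

# prompt = build_prompt("What are the requirements on the office area ...?",
#                       [node["content"] for node in recalled_nodes])
# answer = large_model.generate(prompt)   # hypothetical large-model call
```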
Based on the above steps S30 and S31, the present embodiment extracts summary information corresponding to the divided text data, and generates a text document tree using the summary information and the divided text data, thereby achieving the purposes of improving the quality of the generated text document tree and further reducing the influence of noise on knowledge extraction.
Alternatively, fig. 2 is a schematic flow chart of main steps of a document data extraction method based on a large model and an extraction target according to another embodiment of the present application. As shown in fig. 2, the method mainly comprises the following steps:
Step S201, document data is parsed. Namely, the table is converted into text in the form of key value pairs (k-v pairs) by the parsing algorithm, and the picture is converted into text description, please refer to step S10 of other embodiments in detail, which is not described herein.
Step S202, text document tree generation. That is, the document is cut into a document tree according to the titles at each level of the document, and the contents of the document tree nodes include, but are not limited to: word segmentation results, content summaries, titles, etc., which are stored in a relational database. Please refer to steps S11 and S12 in other embodiments, which are not described herein.
Step S203, vector document tree generation. That is, the text content of each node is converted into a vector through a vector algorithm and stored in a vector database. Please refer to step S13 in other embodiments, which is not described herein.
Step S204, recall. I.e., recall nodes associated with the extraction target from the relational database and the vector database using a matching algorithm (similarity algorithm). Please refer to the steps S20 to S23 in other embodiments, which are not described herein.
Step S205, extraction. And constructing a prompt word of the large model according to the extraction target and the recalled content, and further extracting the document data. Please refer to the step S30 and the step S31 in other embodiments, which are not described herein.
Because the input length of the large model is limited, the method performs unified format conversion on document data in multiple formats to construct hierarchical document data with multi-way tree structures, searches for the nodes matching the extraction target, and finally inputs the extraction question, the text information and the preset extraction requirement into the large model together. Knowledge extraction over document data in multiple formats is thereby achieved, while the amount of input data and the input noise of the large model are reduced and its retrieval efficiency and accuracy are improved.
Further, the application also provides a document data extraction system based on the large model and the extraction target, as shown in fig. 3, the system comprises: the document tree generating module 301 is configured to perform unified format conversion and division on document data, and generate a corresponding document tree; the node matching module 302 is configured to obtain, based on the document tree, a document tree node that matches the extraction target by using a matching algorithm; the data extraction module 303 is configured to extract document data based on the large model, the extraction target, and the document tree node.
In one technical scheme of the above document data extraction system based on the large model and the extraction target, any document tree node comprises a corresponding document number, a node number, a parent node number, leaf node information and text information.
In one technical solution of the above document data extraction system based on a large model and an extraction target, the document tree includes a text document tree, and the document tree generating module includes: the conversion unit is used for uniformly converting the document data into text data; the dividing unit is used for dividing the text data according to a preset document structure; and a text document tree generating unit for generating a text document tree based on the divided text data.
In one aspect of the document data extraction system based on the large model and the extraction target, the text document tree generating unit includes: an extraction subunit, configured to extract corresponding summary information according to the divided text data; and the generation subunit is used for generating a text document tree based on the divided text data and the corresponding summary information.
In one technical solution of the above document data extraction system based on the large model and the extraction target, the document tree further includes a vector document tree, and the system further includes: a document tree conversion module for converting the text document tree into a vector document tree.
In one technical scheme of the document data extraction system based on the large model and the extraction target, the node matching module includes: the first matching unit is used for generating text matching scores of the extraction targets and all nodes of the text document tree by using a matching algorithm; the second matching unit is used for generating vector matching scores of the extraction targets and all nodes of the vector document tree by using a matching algorithm; a matching score calculation unit for calculating a matching score of each node based on the text matching score of each node and the vector matching score of each node; the node extraction unit is used for acquiring text document tree nodes related to the extraction target according to the matching score based on a preset strategy.
In one technical solution of the above document data extraction system based on a large model and an extraction target, the data extraction module includes: an acquisition unit for acquiring text information corresponding to the text document tree nodes related to the extraction target; and an extraction unit for extracting the document data according to the extraction target, the text information and the preset extraction requirement by using the large model.
The above-mentioned document data extraction system based on the large model and the extraction target is used for executing the embodiment of the document data extraction method based on the large model and the extraction target shown in fig. 1, and the technical principles of the two are similar, the technical problems to be solved and the technical effects to be produced are similar, and those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process and related description of the document data extraction system based on the large model and the extraction target may refer to the description of the embodiment of the document data extraction method based on the large model and the extraction target, and will not be repeated herein.
Optionally, the current knowledge extraction task mainly includes: named entity recognition, relationship extraction, and event extraction. For different types of entities, relationships, events and different types of data sources (text, tables and pictures) in documents, different algorithm models often need to be trained, so that a large number of different non-universal models, flows and rules are integrated in a traditional knowledge extraction system, and the universality and the flexibility are poor.
The embodiment provides a document data extraction system based on a large model and an extraction target, and the document data extraction system uses a similarity calculation technology to recall content related to the extraction target, namely document tree nodes matched with the extraction target, from document data so as to reduce the influence of noise on knowledge extraction precision, and builds a prompt word of the large model based on the recalled content so as to complete the knowledge extraction task, thereby solving the problems of insufficient generality and flexibility of the traditional knowledge extraction system.
Specifically, FIG. 4 is a schematic block diagram of a primary architecture of a large model and extraction target based document data extraction system according to another embodiment of the application. As shown in fig. 4, the system includes:
The pre-parsing module 401 is configured to parse the document into text, tables and pictures according to its structure. Specifically, a table is converted into text in the form of key-value (k-v) pairs, where the key is a header and the value is the value in a cell; a picture whose main content is non-text is converted into text through an image-to-text model, and text information in pictures is extracted through an OCR algorithm.
A document tree generating module 402, configured to generate a document tree according to the data parsed by the pre-parsing module 401.
A recall module 403, configured to recall a node corresponding to the extraction target according to the document tree.
The extraction module 404 is configured to concatenate the node text content recalled by the recall module 403, the extraction target and the preset extraction requirement into a prompt, and to input the prompt into the large model to obtain the extraction result.
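The prompt assembly performed by the extraction module 404 can be sketched as follows. The prompt template and the call_large_model stub are illustrative assumptions and do not correspond to any specific model interface.

```python
def build_prompt(extraction_target, node_texts, requirement):
    """Concatenate recalled node text, the extraction target, and the extraction requirement."""
    context = "\n\n".join(node_texts)
    return (
        f"Reference content:\n{context}\n\n"
        f"Extraction target: {extraction_target}\n"
        f"Extraction requirement: {requirement}\n"
        "Return only the extracted result."
    )


def extract(extraction_target, node_texts, requirement, call_large_model):
    """call_large_model is an assumed callable wrapping whatever large model is deployed."""
    return call_large_model(build_prompt(extraction_target, node_texts, requirement))
```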
This embodiment has the following advantages:
The system solves the problems of insufficient universality and flexibility caused by the need to integrate a large number of non-universal models, flows and rules when a traditional knowledge extraction system faces many different types of extraction tasks and data sources. For different types of extraction tasks, the content related to the extraction target is recalled through similarity calculation, which reduces the interference of noise on the large model. Tables and pictures in the document are converted into text by the parsing algorithm, so that the large model can uniformly extract knowledge from text, tables and pictures.
Further, it should be understood that, since the modules are merely set to illustrate the functional units of the apparatus of the present application, the physical devices corresponding to these modules may be the processor itself, or part of the software, part of the hardware, or part of a combination of software and hardware in that processor. Accordingly, the number of modules shown in the figure is merely illustrative.
Those skilled in the art will appreciate that the modules in the system may be adaptively split or combined. Such splitting or combining of specific modules does not cause the technical solution to deviate from the principle of the present application; therefore, the technical solutions obtained after such splitting or combining all fall within the protection scope of the present application.
It will be appreciated by those skilled in the art that the present application may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program. The computer program may be stored in a computer readable storage medium, and when executed by a processor, implements the steps of each of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable storage medium may include any entity or device capable of carrying the computer program code, such as a medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunication signal or a software distribution medium. It should be noted that the content contained in the computer readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, the computer readable storage medium does not include electrical carrier signals and telecommunication signals.
Further, the application also provides a computer readable storage medium. In one embodiment of the computer readable storage medium according to the present application, the computer readable storage medium may be configured to store a program for executing the document data extraction method based on a large model and an extraction target of the above method embodiment, and the program may be loaded and executed by a processor to implement the method described above. For convenience of explanation, only the portions relevant to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method portions of the embodiments of the present application. The computer readable storage medium may be a storage device formed by various electronic devices; optionally, the computer readable storage medium in the embodiments of the present application is a non-transitory computer readable storage medium.
Further, the present application also provides an electronic device including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to execute, by means of the computer program, the document data extraction method based on a large model and an extraction target of the above method embodiment. Referring to FIG. 5, FIG. 5 is a schematic diagram illustrating the connection between the processor and the memory of an electronic device according to an embodiment of the present application. As shown in FIG. 5, the memory and the processor are illustratively communicatively coupled via a bus. For convenience of description, only the portions relevant to the embodiments of the present application are shown; for specific technical details that are not disclosed, reference is made to the method or apparatus portions of the embodiments of the present application.
Thus far, the technical solution of the present application has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present application is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present application, and such modifications and substitutions will fall within the scope of the present application.
Claims (10)
1. A document data extraction method based on a large model and an extraction target, the method comprising:
carrying out unified format conversion and division on the document data to generate a corresponding document tree;
acquiring a document tree node matched with an extraction target by using a matching algorithm based on the document tree;
extracting document data based on the large model, the extraction target and the document tree node.
2. The document data extraction method based on the large model and the extraction target according to claim 1, wherein any document tree node includes a corresponding document number, node number, parent node number, leaf node information, and text information.
3. The document data extraction method based on the large model and the extraction target according to claim 2, wherein the document tree includes a text document tree, and the carrying out unified format conversion and division on the document data to generate a corresponding document tree includes:
uniformly converting the document data into text data;
dividing the text data according to a preset document structure;
generating a text document tree based on the divided text data.
4. The document data extraction method based on the large model and the extraction target according to claim 3, wherein the generating a text document tree based on the divided text data includes:
extracting corresponding summary information according to the divided text data;
generating a text document tree based on the divided text data and the corresponding summary information.
5. The document data extraction method based on a large model and an extraction target according to claim 3, wherein the document tree further includes a vector document tree, the method further comprising:
converting the text document tree into a vector document tree.
6. The document data extraction method based on the large model and the extraction target according to claim 5, wherein the obtaining, based on the document tree, the document tree node matching the extraction target using a matching algorithm includes:
generating a text matching score between the extraction target and each node of the text document tree by using a matching algorithm;
generating a vector matching score between the extraction target and each node of the vector document tree by using a matching algorithm;
calculating the matching score of each node based on the text matching score of each node and the vector matching score of each node;
and acquiring text document tree nodes related to the extraction target according to the matching score based on a preset strategy.
7. The document data extraction method based on the large model and the extraction target according to claim 6, wherein the extracting document data based on the large model, the extraction target, and the document tree node comprises:
acquiring text information corresponding to the text document tree nodes related to the extraction target;
extracting the document data according to the extraction target, the text information and a preset extraction requirement by using the large model.
8. A document data extraction system based on a large model and an extraction target, the system comprising:
a document tree generating module, configured to carry out unified format conversion and division on document data to generate a corresponding document tree;
a node matching module, configured to acquire a document tree node matched with an extraction target by using a matching algorithm based on the document tree; and
a data extraction module, configured to extract document data based on the large model, the extraction target and the document tree node.
9. A computer readable storage medium having stored therein a plurality of program codes, wherein the program codes are adapted to be loaded and executed by a processor to perform the large model and extraction target based document data extraction method of any one of claims 1 to 7.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory has stored therein a computer program which, when executed by the at least one processor, implements the large model and extraction target based document data extraction method of any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410670078.8A CN118468814A (en) | 2024-05-27 | 2024-05-27 | Document data extraction method and system based on large model and extraction target |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118468814A true CN118468814A (en) | 2024-08-09 |
Family
ID=92153317
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410670078.8A Pending CN118468814A (en) | 2024-05-27 | 2024-05-27 | Document data extraction method and system based on large model and extraction target |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118468814A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119476269A (en) * | 2024-10-16 | 2025-02-18 | 易方达基金管理有限公司 | A method, device, terminal device and storage medium for constructing a document title tree |
| CN119625767A (en) * | 2024-12-02 | 2025-03-14 | 舜恒智能科技(山东)有限公司 | A method and system for managing electronic archives |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |