Background
      Existing document recognition schemes generally chain an OCR (optical character recognition) model and an LLM (large language model) in series to extract and structure information. The procedure is generally divided into two main steps:
       1. Extracting the full text with the OCR model.
      At this stage, the OCR model recognizes and extracts all the text in the document. The process includes: (1) Character localization and recognition: the OCR model scans the document image and extracts its character information, including text content, numbers, and punctuation marks, and generally has to cope with differences in font, font size, character spacing, and print quality. (2) Image-to-text conversion: the OCR model converts scanned or photographed images into editable text data. It can recognize not only standard printed text but also more complex content such as handwriting, seals, labels, and signatures. (3) Basic layout understanding: some advanced OCR models can also partially understand the layout of a document, identifying structures such as titles, paragraphs, and tables. The main goal of this stage is to ensure that all text is extracted, providing the underlying data for the next stage of analysis.
      2. Summarizing and structuring the text content with the LLM.
      After the OCR model has extracted the original text, the LLM processes it further. Its functions include: (1) Text understanding and semantic analysis: using natural language processing, the LLM analyzes the extracted text in depth and understands its context, grammatical structure, and meaning, so that it can identify key information such as the invoice number, date, amount, customer name, and product information. (2) Information induction and summarization: the LLM can organize and summarize the extracted text according to preset rules or user requirements. For example, from a contract containing multiple fields, key data such as "signing date", "contract amount", and "signing party" are extracted and output in structured form.
      However, the existing document identification scheme has the following general problems:
       1. Weak generalization (high customized-development cost).
      Many current document recognition schemes (especially those based on conventional OCR or deep learning models) are custom-developed for a specific domain or a specific type of document. For example, one system may be dedicated to invoice recognition, while another may be dedicated to medical documents. Although this approach can achieve high accuracy on a particular task, it has two problems. (1) High customized-development cost: because each type of document has its own format, field layout, and semantic content, a conventional model often has to be trained for each document type, so every new document type requires re-development or retraining, which raises the maintenance and update costs of the system. (2) Weak generalization: existing document recognition models often do not transfer well to new, unseen document types. For example, when processing a new kind of supply chain document or an invoice in a different format, an existing model may fail to adapt to the change, and recognition accuracy drops.
      2. Weak robustness (poor ability to distinguish noise such as watermarks and seals).
      In many practical applications, documents (such as invoices, contracts, and certificates) are subject to various interference factors. For example, watermarks, anti-counterfeiting watermarks, and corporate logos may prevent an OCR system from recognizing text accurately, and seals, signatures, or handwritten text differ from printed text, which degrades recognition.
      Existing document recognition schemes tend to be vulnerable to these interference factors (i.e., noise): once the OCR result has been generated, watermarks and seals are easily confused with normal text, so even simple noise can significantly reduce recognition accuracy.
      3. A VLM (vision-language model) cannot precisely extract long numbers (which lowers document recognition accuracy).
      VLMs have made remarkable progress in recent years; they can handle joint image-text tasks and are suitable for tasks such as image recognition and natural language understanding. However, VLMs still face challenges in OCR, especially in recognizing long numbers. Many documents (e.g., bank bills, invoices, and contracts) contain long digit strings (e.g., identification card numbers, bank account numbers, and ticket numbers), and current VLMs tend to recognize such strings with low accuracy.
      Therefore, in view of the above-mentioned drawbacks in the prior art, there is a need to develop a new document identification method, system, device and medium.
    
    
      Disclosure of Invention
      In order to overcome the defects of the prior art, the invention provides a document identification method, a system, equipment and a medium based on a multi-mode large model, which have strong generalization, can adapt to various types of documents and can provide efficient and accurate identification results.
      In order to achieve the above object, the present invention provides the following technical solutions:
       A document identification method based on a multimodal large model, characterized by comprising the following steps:
       1) Preprocessing: preprocessing the document image input by the user;
       2) Multimodal large model inference: having the multimodal large model perform inference based on the preprocessed document image, a configured JSON template, and a prompt template, to obtain a JSON result;
       3) OCR recognition: recognizing the document image input by the user by using OCR technology to obtain an OCR recognition result;
       4) Verification: comparing the similarity between the OCR recognition result and the JSON result, and determining a document recognition result based on the similarity comparison result.
      Preferably, the step 1) specifically includes:
       11) Using a PaddleOCR model to perform inference on the document image input by the user to generate an inference result, the inference result including all text on the document image, the confidence values of the text, and the coordinates of the text anchor boxes;
       12) Determining the rotation angle and the crop size of the document image input by the user according to the inference result, and rotating and cropping the document image accordingly to obtain a cropped document image;
       13) Using a line detection algorithm to determine whether the cropped document image is skewed, and correcting the cropped document image when it is skewed.
      Preferably, the step 12) specifically includes:
       121) Rotating the document image input by the user through the four right-angle rotations to obtain four rotated document images, performing inference on each of the four rotated document images to obtain four inference results, determining the text direction from the shapes of the text anchor boxes in the four inference results, and determining two candidate rotated document images based on the text direction;
       122) Computing the sum of the confidence values of the text in the inference result of each of the two candidate rotated document images, and taking the document image with the largest sum as the correctly oriented document image;
       123) Determining the crop size from the outermost edges of all the text anchor boxes in the correctly oriented document image, and cropping the correctly oriented document image according to the crop size to obtain the cropped document image.
      Preferably, the step 2) specifically includes:
       21) Configuring the JSON template, i.e., configuring the fields of the document whose text needs to be extracted;
       22) Splicing the configured JSON template with the prompt template, and inputting the result, together with the preprocessed document image, into the multimodal large model for inference to generate an inference result;
       23) Using a JSON verification module to verify and correct the inference result to obtain the JSON result.
      Preferably, the OCR recognition result in the step 3) includes all text on the document image input by the user, the confidence values of the text, and the coordinates of the text anchor boxes.
      Preferably, the step 4) specifically includes:
       41) Performing similarity matching between the text in the OCR recognition result and each field in the JSON result; if the similarity is 1, the recognition is correct, and if the similarity is not 1 but is greater than a threshold, the field is listed as a verification object;
       42) Using the PaddleOCR model to identify the coordinates of the verification object and cropping the image to obtain a field image;
       43) Using an SVTR V2 model to recognize the field image, and writing the recognition result into the JSON result as the final value of the field, to obtain the document recognition result.
      Preferably, the threshold is 0.7.
      In addition, the invention also provides a document identification system based on a multimodal large model, characterized by comprising:
       an image preprocessing module, used for preprocessing the document image input by the user;
       a multimodal large model inference module, used for having the multimodal large model perform inference based on the preprocessed document image, the configured JSON template, and the prompt template to obtain a JSON result;
       an OCR recognition module, used for recognizing the document image input by the user by using OCR technology to obtain an OCR recognition result;
       and a verification module, used for comparing the similarity between the OCR recognition result and the JSON result and determining a document recognition result based on the similarity comparison result.
      Moreover, the present invention also provides a document identification apparatus based on a multimodal big model, characterized by comprising:
       one or more processors; 
       a memory for storing one or more programs; 
       wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the document identification method based on a multimodal large model as described above.
      Finally, the present invention also provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the multi-modal large model-based document identification method as described above.
      Compared with the prior art, the document identification method, system, equipment and medium based on the multi-mode large model have one or more of the following beneficial technical effects:
       1. The invention uses the general capability of the VLM, generalizes well, adapts to different document types, and ensures cold-start accuracy.
      Conventional document recognition systems typically rely on development customized for a particular type of document, so when a new type of document is encountered the model must be retrained or adjusted, which increases development cost and prevents the various document types found in practice from being processed efficiently. The invention uses a general prompt template and applies the cross-modal (image-text) understanding capability of the VLM, so it can process many types of documents. This template design has the following characteristics: (1) Strong generality: no model needs to be trained separately for each document type, and documents in different formats, such as invoices, contracts, customs declaration forms, and transfer vouchers, are handled automatically. (2) High accuracy: an accuracy of more than 80 percent can be maintained across document types, with good performance in practical applications. (3) Reduced dependence on customized development: high processing efficiency and accuracy can still be maintained for unknown or new document types.
      2. Through field configuration, the invention improves field extraction accuracy at low cost according to customer requirements.
      Conventional document recognition technology can extract only the preset fields of a document and cannot meet a customer's custom requirements for specific fields or information, so the customer often needs additional development or manual intervention to complete such tasks. Through field configuration, the invention supports field extraction customized to the customer's specific requirements: the customer specifies the names of the fields to be extracted, and the target information is extracted accurately by identifying these fields and matching them with the actual content of the document. The specific advantages are: (1) Efficient customization: the customer can flexibly choose the fields to be recognized according to actual needs, such as extracting the "invoice number" and "amount" from an invoice, or the "signing date" and "terms" from a contract. (2) Improved accuracy: explicitly specifying the fields lets the model concentrate on extracting specific content and reduces the probability of misrecognition. This function not only makes recognition more flexible but also lets the system adapt to the individual requirements of different customers and to diverse business scenarios.
      3. The invention combines the VLM with a state-of-the-art text recognition model, thereby improving recognition accuracy.
      Most existing document recognition technologies rely on OCR alone or on models trained on visual features, and lack joint understanding of images and text. Relying on OCR alone gives poor results on documents with complex structure, especially on table information and long numbers. The invention combines the VLM with state-of-the-art text recognition technology to improve the overall understanding of documents. The VLM can process image information and can also learn jointly from the text in the document, which achieves the following effects: (1) The VLM understands the visual content and the text content of the image at the same time, which improves recognition; for example, by combining the table structure in the image with the field information in the text, it can automatically extract structured data from the document. (2) Complex text is processed accurately: combined with modern OCR technology, complex characters and long numbers, such as amounts on invoices and account numbers on transfer vouchers, are processed more accurately, avoiding the misrecognition problem of a conventional VLM.
    
    
      Detailed Description
      Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including" or "having" and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Unless specified or limited otherwise, the terms "mounted," "connected," "supported," and "coupled" and variations thereof are used broadly and encompass both direct and indirect mountings, connections, supports, and couplings. Furthermore, "connected" and "coupled" are not restricted to physical or mechanical connections or couplings.
      Also, in the present disclosure, the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience and to simplify the description, and do not denote or imply that the apparatus or element in question must have a particular orientation or be constructed and operated in a particular orientation; these terms should therefore not be construed as limiting the invention. In addition, the terms "a" and "an" should be construed as "at least one" or "one or more"; that is, the number of an element may be one in one embodiment and more than one in another embodiment, and the term "a" should not be construed as limiting the number.
      Before describing the present invention in detail, some technical terms used in the present invention will be briefly described so as to facilitate a better understanding of the present invention by those skilled in the art.
      1. LLM (Large Language Model): a model trained with deep learning techniques to understand and generate natural language. It can process text data, answer questions, write articles, translate between languages, and so on. Examples of LLMs include GPT and GLM.
      2. VLM (Vision-Language Model, also called a multimodal large model): a model that can understand and process different types of data, such as text and images, at the same time. For example, a VLM can not only analyze the content of a picture but also understand the text related to it, and it is widely applied in fields such as image-text matching and image captioning. Common models include CogVLM.
      3. OCR (Optical Character Recognition): a technique for converting handwritten or printed text in a scanned or photographed document into editable digital text. For example, after a book is scanned, OCR can extract the text from the scanned images and convert it into an editable document.
      4. Document identification: identifying and extracting the information on documents by automated techniques. The documents may be contracts, invoices, certificates, and so on, and document identification typically uses OCR and other AI techniques to identify the key information on a document for further processing.
      5. Generalization: how well a large model performs when it encounters new data. A model that generalizes well performs well not only on its training data but also on unseen data. In other words, the stronger the generalization, the wider the range of applications of the model and the better it adapts to different situations in practice.
      The present invention is described in detail below. FIG. 1 shows a flowchart of the document identification method based on a multimodal large model of the present invention. As shown in FIG. 1, the method comprises the following steps:
       1. Image preprocessing.
      To identify documents, the document image input by the user needs to be preprocessed to facilitate subsequent identification.
      In the invention, preprocessing the document image input by the user specifically comprises the following steps:
       1. An existing OCR model, such as a PaddleOCR model, is used to perform inference on the document image input by the user and generate an inference result. The inference result comprises all the text on the document image, the confidence values of the text (the confidence values are related to the direction of the text), and the coordinates of the text anchor boxes.
      2. The rotation angle and the crop size of the document image input by the user are determined from the inference result, and the document image is rotated and cropped accordingly to obtain the cropped document image.
      Specifically, the document image input by the user is first rotated through the four right-angle rotations (i.e., rotated four times, by 90° each time) to obtain four rotated document images, and inference is performed on each of them to obtain four inference results.
      Next, the text direction is determined from the shapes of the text anchor boxes in the four inference results, and two candidate rotated document images are determined based on the text direction. Specifically, the shape of a text anchor box can be determined from its coordinates. For normally oriented text, the text anchor box should be a horizontal rectangle, so the two rotated document images whose text anchor boxes are horizontal rectangles are taken as the candidate rotated document images.
      Then, the sum of the confidence values of the text in the inference result of each of the two candidate rotated document images is computed (the larger sum indicates the correct image direction), and the rotated document image with the largest sum is taken as the correctly oriented document image.
      Finally, the crop size is determined from the outermost edges of all the text anchor boxes in the correctly oriented document image, and the correctly oriented document image is cropped to that size to remove useless information and obtain the cropped document image.
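      The orientation and cropping logic described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the `TextBox` type, the function names, and the majority rule used to decide that a rotation has "horizontal" boxes are assumptions, and the OCR results are taken as already computed for each of the four rotations.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBox:
    """One OCR detection: axis-aligned anchor box plus recognition confidence."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    confidence: float


def is_horizontal(boxes: List[TextBox]) -> bool:
    """Assumed rule: a rotation is a candidate when most boxes are wider than tall."""
    wide = sum(1 for b in boxes if (b.x_max - b.x_min) >= (b.y_max - b.y_min))
    return wide * 2 > len(boxes)


def pick_orientation(results: Dict[int, List[TextBox]]) -> int:
    """Return the rotation angle whose OCR result looks upright.

    `results` maps each rotation angle (0, 90, 180, 270) to the boxes
    detected on the image rotated by that angle.
    """
    candidates = [angle for angle, boxes in results.items() if is_horizontal(boxes)]
    if not candidates:
        # Shape test was inconclusive: fall back to comparing all rotations.
        candidates = list(results)
    # Among the candidates, the largest confidence sum marks the correct direction.
    return max(candidates, key=lambda a: sum(b.confidence for b in results[a]))


def crop_bounds(boxes: List[TextBox]) -> Tuple[float, float, float, float]:
    """Rectangle spanned by the outermost edges of all text anchor boxes."""
    return (
        min(b.x_min for b in boxes),
        min(b.y_min for b in boxes),
        max(b.x_max for b in boxes),
        max(b.y_max for b in boxes),
    )
```

      In this sketch the shape test narrows four rotations down to the two with horizontal boxes (upright and upside down), and the confidence sum separates those two, because an upside-down image is usually recognized with lower confidence.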
      3. A line detection algorithm is used to determine whether the cropped document image is skewed, and the image is corrected when it is skewed.
      Here an existing line detection algorithm, such as Hough line detection, is used to determine whether the cropped document image is skewed and to correct it when it is, which yields the final preprocessed document image.
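      One way to turn detected lines into a skew decision is sketched below. It assumes the line segments have already been produced by a Hough detector (for example OpenCV's `cv2.HoughLinesP`); the function names, the use of the median, and the 0.5° tolerance are illustrative assumptions, not details given in the source.

```python
import math
from statistics import median
from typing import List, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def estimate_skew_degrees(segments: List[Segment], max_abs_angle: float = 45.0) -> float:
    """Median angle of the near-horizontal segments, in degrees (0.0 if none qualify)."""
    angles = []
    for x1, y1, x2, y2 in segments:
        if x1 == x2 and y1 == y2:
            continue  # a degenerate segment carries no direction
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        # Fold into (-90, 90] so the direction of traversal does not matter.
        if angle > 90:
            angle -= 180
        elif angle <= -90:
            angle += 180
        # Ignore steep lines such as table column rules.
        if abs(angle) <= max_abs_angle:
            angles.append(angle)
    return median(angles) if angles else 0.0


def needs_deskew(segments: List[Segment], tolerance_degrees: float = 0.5) -> bool:
    """Assumed rule: correct the image only when the skew exceeds a small tolerance."""
    return abs(estimate_skew_degrees(segments)) > tolerance_degrees
```

      The estimated angle would then be used to rotate the image back by the same amount; the median is chosen here because it is less sensitive than the mean to a few stray lines from seals or logos.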
      2. Multi-modal large model reasoning.
      After image preprocessing, the present invention uses a multimodal large model (in the present invention, Zhipu's closed-source CogVLM-plus model) to perform inference, so that the model generates a JSON file containing the recognition result, i.e., the JSON result. That is, the multimodal large model performs inference based on the preprocessed document image, the configured JSON template, and the prompt template to obtain the JSON result, which specifically includes:
       1. Configuring the JSON template.
      When the multimodal large model is used for inference, a JSON template is configured first, i.e., the fields of the document whose text needs to be extracted are configured. The configured JSON template lists these fields. For example, the following JSON can be configured for medical imaging documents: {"name": "", "physical examination time (year-month-day)": "", "report type": "", "report representation/view": "", "diagnosis/opinion": ""}.
      By setting the JSON template, field information to be identified can be provided for the multi-modal large model, so that the multi-modal large model can output a more complete value, and the overall document identification accuracy is improved.
      2. Splicing the configured JSON template with the prompt template, and inputting the result, together with the preprocessed document image, into the multimodal large model for inference to generate an inference result.
      The prompt template is a general template of the kind used in the prior art when a multimodal large model performs inference and text recognition on documents; its purpose is to guide and control the multimodal large model as it recognizes the document image and generates the recognition result. This part belongs to the prior art and is not described in detail here for the sake of brevity.
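      Configuring the fields and splicing them with the prompt can be sketched as below. The wording of `PROMPT_TEMPLATE` is hypothetical, since the source does not give the actual prompt, and the function names are illustrative; the call that sends the prompt and the image to the multimodal large model is deliberately left out.

```python
import json
from typing import Dict, List

# Hypothetical generic prompt; the source does not disclose the real wording.
PROMPT_TEMPLATE = (
    "Read the document image and fill in every field of the JSON template below. "
    "Return only valid JSON with exactly these keys; use an empty string for "
    "fields that do not appear in the document.\n"
    "JSON template:\n{json_template}"
)


def build_json_template(field_names: List[str]) -> Dict[str, str]:
    """Turn the configured field names into a template with empty values."""
    return {name: "" for name in field_names}


def build_prompt(field_names: List[str]) -> str:
    """Splice the configured JSON template into the generic prompt."""
    template = build_json_template(field_names)
    # ensure_ascii=False keeps non-ASCII field names (e.g. Chinese) readable.
    return PROMPT_TEMPLATE.format(
        json_template=json.dumps(template, ensure_ascii=False, indent=2)
    )
```

      For example, `build_prompt(["name", "report type", "diagnosis/opinion"])` yields one prompt string that would be sent to the model together with the preprocessed image; changing the extracted fields only requires changing the list.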
      3. Verifying and correcting the inference result with a JSON verification module to obtain the JSON result.
      After the multimodal large model generates the inference result, the JSON verification module verifies and corrects it, so as to obtain a JSON result that can be parsed directly.
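      The source does not say which checks the JSON verification module performs. The sketch below shows one plausible set, stated as assumptions: strip any text surrounding the JSON object, parse it, keep only the configured keys, and fill missing keys with empty strings. The function name `repair_json_result` is hypothetical.

```python
import json
from typing import Dict


def repair_json_result(raw_output: str, template: Dict[str, str]) -> Dict[str, str]:
    """Return a directly parseable result that has exactly the template's keys."""
    # Keep only the outermost {...} span, dropping code fences or extra chatter.
    start, end = raw_output.find("{"), raw_output.rfind("}")
    parsed = {}
    if start != -1 and end > start:
        try:
            candidate = json.loads(raw_output[start:end + 1])
            if isinstance(candidate, dict):
                parsed = candidate
        except json.JSONDecodeError:
            parsed = {}  # unparseable output falls back to an empty result
    result = {}
    for key in template:
        value = parsed.get(key, "")
        # Normalize every value to a trimmed string; unknown keys are discarded.
        result[key] = "" if value is None else str(value).strip()
    return result
```

      With this shape, downstream steps can rely on every configured field being present, even when the model adds commentary or omits a field.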
      In this way, through the configured JSON template, the invention supports field extraction customized to the customer's specific requirements: the customer specifies the names of the fields to be extracted, and the target information is extracted accurately by identifying these fields and matching them with the actual content of the document. The specific advantages are: (1) Efficient customization: the customer can flexibly choose the fields to be recognized according to actual needs, such as extracting the "invoice number" and "amount" fields from an invoice, or the "date signed" and "terms" fields from a contract. (2) Improved accuracy: explicitly specifying the fields lets the multimodal large model concentrate on extracting specific content and reduces the probability of misrecognition. This function not only makes recognition more flexible but also lets the system adapt to the individual requirements of different customers and to diverse business scenarios.
      3. OCR recognition.
      The document image input by the user is recognized by using OCR technology to obtain an OCR recognition result.
      In the present invention, an existing OCR model, such as a PaddleOCR model, can be used to recognize the document image input by the user and obtain the OCR recognition result, which includes all the text on the document image input by the user, the confidence values of the text, and the coordinates of the text anchor boxes.
      4. Verification.
      The similarity between the OCR recognition result and the JSON result is compared, and the document recognition result is determined based on the comparison, which specifically includes:
       1. Similarity matching is performed between the text in the OCR recognition result and the content of each field in the JSON result. If the similarity is 1, the recognition is correct. If the similarity is not 1 but is greater than a threshold (for example, 0.7), the content of the field in the JSON result is listed as a verification object. If the similarity is smaller than the threshold, the text in the OCR recognition result is considered unrelated to the content of the field in the JSON result, and no further processing is needed.
      When calculating the similarity, the invention may traverse the content of the fields in the JSON result and extract all content containing digit strings, then traverse the OCR recognition result and extract all content containing digit strings, and use the ratio of the lengths of the two digit strings as the similarity value.
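      The three-way decision above can be sketched as follows. The source describes the similarity only as a ratio of digit-string lengths; as a stand-in, this sketch assumes Python's `difflib.SequenceMatcher.ratio()`, which is twice the number of matching characters divided by the total length of both strings and equals 1 only for identical strings. The function names and the handling of a similarity exactly equal to the threshold are likewise assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List


def digits_only(text: str) -> str:
    """Keep only the digit characters of a string."""
    return re.sub(r"\D", "", text)


def best_similarity(field_value: str, ocr_texts: List[str]) -> float:
    """Highest similarity between the field's digits and any OCR line's digits."""
    target = digits_only(field_value)
    if not target:
        return 0.0  # fields without digits are outside this digit-based check
    best = 0.0
    for text in ocr_texts:
        candidate = digits_only(text)
        if candidate:
            # Assumed stand-in metric; not the formula stated in the source.
            best = max(best, SequenceMatcher(None, target, candidate).ratio())
    return best


def classify(similarity: float, threshold: float = 0.7) -> str:
    """Map a similarity value to the three outcomes described in the text."""
    if similarity == 1.0:
        return "correct"    # OCR and JSON agree exactly
    if similarity > threshold:
        return "verify"     # close but different: list as a verification object
    return "unrelated"      # no corresponding OCR text: leave the field alone
```

      A field such as a 16-digit account number with one wrong digit scores well above 0.7 but below 1, so it is routed to the re-recognition step instead of being accepted or ignored.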
      2. The PaddleOCR model is used to identify the coordinates of the verification object, and the image is cropped to obtain a field image.
      3. The field image is recognized by an existing high-precision Chinese scene text recognition model, such as the SVTR V2 model, which recognizes long digit strings well, and the recognition result is written into the JSON result as the final value of the field to obtain the document recognition result.
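      The crop-and-re-recognize step can be sketched as below. The `recognize_field` callable is a hypothetical stand-in for a field-level recognizer such as SVTR V2, whose actual API is not described in the source; the image is modeled as a plain list of pixel rows only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[int]]          # row-major grayscale pixels, for illustration only


def crop(image: Image, box: Box) -> Image:
    """Cut the field image out of the document image using the anchor box."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def recheck_fields(
    image: Image,
    json_result: Dict[str, str],
    boxes_to_verify: Dict[str, Box],
    recognize_field: Callable[[Image], str],
) -> Dict[str, str]:
    """Re-recognize each verification object and update its value in the result."""
    updated = dict(json_result)  # leave the caller's dictionary untouched
    for field, box in boxes_to_verify.items():
        text = recognize_field(crop(image, box))
        if text:  # assumed rule: keep the VLM value if the recognizer returns nothing
            updated[field] = text
    return updated
```

      Only the fields flagged as verification objects are re-recognized, so the cost of the second model is paid only where the VLM and OCR outputs disagree.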
      The invention combines the VLM with state-of-the-art text recognition technology to improve the overall understanding of documents. The VLM can not only process image information but also learn jointly from the text in the document, which achieves the following effects: (1) The VLM understands the visual content and the text content of the image at the same time, which improves recognition; for example, by combining the table structure in the image with the field information in the text, it can automatically extract structured data from the document. (2) Complex text is processed accurately: combined with modern OCR technology, such as the SVTR V2 model, complex characters and long numbers, such as amounts on invoices and account numbers on transfer vouchers, are processed more accurately, avoiding the misrecognition problem of a conventional VLM.
      FIG. 2 shows a schematic diagram of the structure of the document identification system based on a multimodal large model of the present invention. As shown in FIG. 2, the system includes:
       1. An image preprocessing module.
      The image preprocessing module is used for preprocessing the document image input by the user.
      2. A multimodal large model inference module.
      The multimodal large model inference module is used for having the multimodal large model perform inference based on the preprocessed document image, the configured JSON template, and the prompt template to obtain a JSON result.
      3. An OCR recognition module.
      The OCR recognition module is used for recognizing the document image input by the user by using OCR technology to obtain an OCR recognition result.
      4. A verification module.
      The verification module is used for comparing the similarity between the OCR recognition result and the JSON result and determining a document recognition result based on the similarity comparison result.
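      The way the four modules fit together can be sketched as a small composition. The class name, the attribute names, and the callable signatures are illustrative assumptions; each callable stands in for one module described above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class DocumentRecognitionSystem:
    """Wires the four modules together; each attribute is a pluggable callable."""
    preprocess: Callable[[Any], Any]
    infer_json: Callable[[Any, Dict[str, str]], Dict[str, str]]
    run_ocr: Callable[[Any], Any]
    verify: Callable[[Any, Dict[str, str], Any], Dict[str, str]]

    def recognize(self, image: Any, json_template: Dict[str, str]) -> Dict[str, str]:
        prepared = self.preprocess(image)
        # The multimodal model sees the preprocessed image and the field template.
        json_result = self.infer_json(prepared, json_template)
        # Per the description, OCR runs on the document image input by the user.
        ocr_result = self.run_ocr(image)
        return self.verify(image, json_result, ocr_result)
```

      Keeping each module behind a callable makes it possible to swap the OCR model or the multimodal model without touching the rest of the pipeline.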
      The invention further relates to a document identification device based on a multimodal large model, which comprises one or more processors and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the document identification method based on a multimodal large model described above.
      Finally, the invention also relates to a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the document identification method based on a multimodal large model as described above.
      The document identification method, system, device, and storage medium based on a multimodal large model of the invention generalize well and can be used to extract and recognize structured information from a wide range of general documents (which may contain complex table information). Specific scenarios include the following:
       1. Extracting and recognizing the structured information of customs declaration forms.
      Customs declaration forms typically involve complex data structures, such as the name, category, quantity, and amount of the goods, the place of import and export, the bill number, the invoice number, and the tariffs. Because the format of customs declaration forms is not uniform and they contain a large number of data fields, traditional manual entry is error-prone and inefficient.
      2. Extracting and recognizing the structured information of transfer vouchers.
      A transfer voucher, as a record of a financial transaction, contains a large amount of important information, such as the transaction date, account number, transaction amount, bank name, payment method, and payer and payee information. These vouchers may be printed or handwritten and sometimes carry interfering elements such as bank watermarks and signatures, and the traditional way of processing them relies on manual entry and is error-prone.
      3. Extracting and recognizing the structured information of medical examination reports.
      Medical examination reports include, for example, physical examination reports, laboratory test reports, and diagnostic reports, which typically contain the patient's personal information, the examination items, the examination results, and the doctor's advice. Medical documents come in many different formats; some contain complex tables, and others contain handwritten content.
      Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention, and are not intended to limit the scope of the present invention. Modifications and equivalent substitutions can be made by those skilled in the art based on the present teachings without departing from the spirit and scope of the present teachings.