CN120356231A - Document processing method and device, equipment and storage medium - Google Patents
Document processing method and device, equipment and storage medium
- Publication number
- CN120356231A (application CN202510842371.2A)
- Authority
- CN
- China
- Prior art keywords
- document
- description information
- content
- target
- images
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Processing Or Creating Images (AREA)
Abstract
Embodiments of the present application disclose a document processing method, a device, equipment, and a storage medium. The method includes: acquiring a document image and requirement information of a user; segmenting the document image to obtain m document areas; obtaining, according to the m document areas and a description information generation model, description information corresponding to each of n document areas, wherein the description information generation model is trained on a plurality of sample images and a plurality of pieces of sample description information, and n is less than or equal to m; and obtaining, according to the requirement information and the n pieces of description information, target document content corresponding to target description information, wherein the target description information is the description information among the n pieces whose matching similarity with the requirement information exceeds a preset similarity threshold. The method can process the document image of a document to be processed based on the user's requirement information, obtain the target document content that meets the user's needs from the content of the document to be processed, and improve the accuracy of document processing and the efficiency of content extraction.
Description
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a document processing method, a document processing device, document processing equipment and a storage medium.
Background
With the spread of digital office work, the number of documents that businesses and individuals need to process is increasing. These documents typically contain various types of content, such as text, images, and tables. In some scenarios, such as writing a work summary or compiling a report from existing documents, users often need to select specific document content from a large number of documents to meet different business needs.
In the related art, documents are usually pre-screened by keyword matching or type filtering, after which the user manually reviews the pre-screened documents to locate the required content. This process is not only time-consuming and labor-intensive, but the manual review step is also prone to missing required document content, which reduces the efficiency and accuracy of document content acquisition.
Disclosure of Invention
In view of the above, the document processing method, device, equipment, and storage medium provided by the embodiments of the present application can process the document image corresponding to a document to be processed based on the user's requirement information, obtain the target document content that meets the user's needs from the content of the document to be processed, and improve the accuracy of document processing and the efficiency of content extraction. The document processing method, device, equipment, and storage medium provided by the embodiments of the present application are implemented as follows:
The first aspect of the present application provides a document processing method, including:
Acquiring a document image and requirement information of a user, wherein the document image comprises document content of a document to be processed, and the document content comprises at least one type of content in a text type, an image type or a form type;
dividing the document image to obtain m document areas, wherein each document area comprises one type of content in the document content, and m is an integer greater than or equal to 1;
Obtaining description information corresponding to each document region in n document regions according to the m document regions and the description information generation model, wherein the description information generation model is obtained by training according to a plurality of sample images and a plurality of sample description information, n is an integer greater than or equal to 1, and n is less than or equal to m;
and obtaining target document content corresponding to target description information according to the requirement information and the n pieces of description information, wherein the target description information comprises description information, of which the matching similarity with the requirement information is larger than a preset similarity threshold, in the n pieces of description information.
In the above technical solution, a document image and the user's requirement information are first acquired. The document image is then segmented into m relatively independent document areas, each containing a single type of content, which improves the accuracy of area division. Next, a pre-trained description information generation model generates description information for each of n document areas, so that different types of document content receive description information in a uniform format that can be matched against the user's requirement information. Finally, target description information is determined from the requirement information and the n pieces of description information, and the corresponding target document content is obtained; the target description information may be the description information among the n pieces whose matching similarity with the requirement information exceeds a preset similarity threshold. This reduces the operations the user must perform to find specified document content and improves the accuracy of document processing and the efficiency of content extraction.
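As a concrete illustration, the final matching step can be sketched in a few lines of Python. The word-overlap (Jaccard) similarity below is a toy stand-in — the embodiment does not specify the similarity metric — and the function names and threshold are hypothetical:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Toy word-overlap similarity; the embodiment leaves the metric unspecified."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_target_descriptions(descriptions, requirement, threshold=0.3):
    """Return indices of the description information whose matching
    similarity with the requirement information exceeds the threshold."""
    return [i for i, d in enumerate(descriptions)
            if jaccard_similarity(d, requirement) > threshold]

descs = ["table of quarterly financial data",
         "photo of product A",
         "introductory paragraph about company history"]
matched = select_target_descriptions(descs, "financial data table")
```

The document areas whose indices survive this filter are then the ones whose content is returned to the user.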
As a possible implementation of the first aspect of the present application, there are a plurality of document images, and the segmenting of the document images to obtain the m document areas includes:
judging whether any two adjacent document images have page-crossing document contents according to the sequence of a plurality of document images, wherein the page-crossing document contents comprise one type of contents in the document contents;
Performing splicing processing on any two adjacent document images with the page-crossing document content to obtain spliced document images;
And carrying out the segmentation processing on the spliced document image and the document image which is not spliced to obtain the m document areas.
In this technical solution, there may be a plurality of document images, and any two adjacent document images may share page-crossing document content, such as a text paragraph or table that spans pages. Any two adjacent document images having page-crossing document content are stitched to obtain a stitched document image, and the stitched document image and the remaining unstitched document images are then segmented to obtain the m document areas. This preserves the integrity of page-crossing content and avoids the content splitting that would result from segmenting each document image separately.
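A minimal sketch of the stitching step, assuming grayscale page images as NumPy arrays; the embodiment does not prescribe a stitching method, and vertical concatenation with white padding is one simple choice:

```python
import numpy as np

def stitch_pages(top: np.ndarray, bottom: np.ndarray) -> np.ndarray:
    """Vertically concatenate two adjacent page images so that a
    page-crossing table or paragraph becomes one contiguous region
    before segmentation. Narrower pages are padded with white (255)."""
    w = max(top.shape[1], bottom.shape[1])
    def pad(img: np.ndarray) -> np.ndarray:
        return np.pad(img, ((0, 0), (0, w - img.shape[1])), constant_values=255)
    return np.vstack([pad(top), pad(bottom)])

page1 = np.zeros((4, 3), dtype=np.uint8)   # stand-in for the earlier page
page2 = np.zeros((2, 3), dtype=np.uint8)   # stand-in for the later page
stitched = stitch_pages(page1, page2)
```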
As a possible implementation of the first aspect of the present application, the determining, according to the order of the plurality of document images, whether any two adjacent document images have page-crossing document content includes:
determining whether an association exists between first document content and second document content in a case where the type of the first document content at the bottom of a first document image is the same as the type of the second document content at the top of a second document image, wherein the first document image is the earlier of the two adjacent document images in the sequence, and the second document image is the later;
And under the condition that the association relation exists between the first document content and the second document content, determining that the page-crossing document content exists in any two adjacent document images.
In this technical solution, whether the two pieces of document content are associated is checked only when the type of the first document content is the same as that of the second document content, which effectively improves the efficiency of recognizing page-crossing document content. When the first and second document contents are of different types, no association check is needed, which reduces unnecessary computation and improves overall processing efficiency.
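The type-gated check described above can be sketched as follows; the `related` predicate stands in for the unspecified association test (for a table, it might compare column counts), and the field names are illustrative:

```python
def has_cross_page_content(first_bottom: dict, second_top: dict, related) -> bool:
    """Return True if the content at the bottom of the earlier page and
    the top of the later page form one page-crossing unit. The cheap type
    comparison runs first; the costlier association check only runs
    when the types match."""
    if first_bottom["type"] != second_top["type"]:
        return False  # different types: skip the association check entirely
    return related(first_bottom, second_top)

# Illustrative association test for tables: same number of columns.
def same_columns(a: dict, b: dict) -> bool:
    return a.get("columns") == b.get("columns")

split_table = has_cross_page_content(
    {"type": "table", "columns": 4},
    {"type": "table", "columns": 4},
    same_columns,
)
```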
As a possible implementation of the first aspect of the present application, the obtaining, according to the m document areas and the description information generation model, of the description information corresponding to each of the n document areas includes:
According to the target type in the demand information and the type of the document content corresponding to each document region in the m document regions, obtaining n document regions with the same type of the corresponding document content as the target type in the m document regions;
and obtaining the n pieces of description information according to the description information generation model and the n document areas.
In this technical solution, according to the target type in the user's requirement information (for example, the requirement information may indicate that the user wants to extract form-type content from the document), the n document areas whose content type matches the target type are selected from the m document areas. This reduces unnecessary computation and ensures that the generated description information better fits the user's needs.
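For illustration, the type-based pre-filtering might look like the following sketch, where each document area is a dict carrying the content type inferred during segmentation (the field names are assumptions):

```python
def filter_by_target_type(areas: list[dict], target_type: str) -> list[dict]:
    """Keep only the document areas whose content type matches the
    target type taken from the user's requirement information."""
    return [a for a in areas if a["type"] == target_type]

areas = [
    {"id": 1, "type": "text"},
    {"id": 2, "type": "table"},
    {"id": 3, "type": "image"},
    {"id": 4, "type": "table"},
]
tables = filter_by_target_type(areas, "table")  # n = 2 areas out of m = 4
```

Only these n areas are then passed to the description information generation model.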
As a possible implementation manner, in the first aspect of the present application, the obtaining, according to the requirement information and the n pieces of description information, target document content corresponding to the target description information includes:
calculating the matching similarity between the requirement information and each of the n pieces of description information, and determining the description information with the matching similarity larger than the preset similarity threshold as the target description information;
And converting the document area corresponding to the target description information to obtain the target document content in a target format.
According to the technical scheme, the document content corresponding to the description information highly related to the user requirement can be screened out by calculating the matching similarity between the requirement information and the description information, so that the accuracy of a screening result is ensured, and the efficiency and reliability of document processing are improved.
As a possible implementation of the first aspect of the present application, after the description information corresponding to each of the n document areas is obtained, the method further includes:
According to the type of the document content corresponding to each document region in the n document regions, carrying out grouping processing on the n description information to obtain at least one group, wherein each group in the at least one group comprises at least one description information with the same type of the corresponding document content;
displaying a catalog interface, wherein the catalog interface comprises a grouping control corresponding to each grouping in the at least one grouping;
responding to triggering operation of a target grouping control, and displaying a grouping interface corresponding to the target grouping control, wherein the grouping interface comprises a description control corresponding to each piece of description information in a group corresponding to the target grouping control;
And responding to the triggering operation of the target description control, and obtaining the document content of the description information corresponding to the target description control.
In this technical solution, by displaying the catalog interface and responding to the user's trigger operations, the user can first select a suitable type according to their own needs and then select specific description information, quickly obtaining the required document content. This hierarchical interaction design improves retrieval efficiency, noticeably improves the user experience, and reduces operational complexity.
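The grouping behind the catalog interface can be sketched as a type-keyed dictionary — one group per grouping control, with the interface widgets themselves out of scope here (entry field names are assumptions):

```python
from collections import defaultdict

def group_descriptions(entries: list[dict]) -> dict:
    """Group the n pieces of description information by the content type
    of their document areas, one group per catalog grouping control."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["type"]].append(entry["description"])
    return dict(groups)

entries = [
    {"type": "table", "description": "quarterly revenue table"},
    {"type": "image", "description": "photo of product A"},
    {"type": "table", "description": "headcount by department"},
]
catalog = group_descriptions(entries)
```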
As a possible implementation manner, in the first aspect of the present application, the acquiring the document image and the requirement information of the user includes:
Acquiring the to-be-processed document and voice data of the user;
performing voice recognition on voice data to obtain the requirement information;
And converting the document to be processed to obtain the document image.
According to the technical scheme, the voice data of the user and the document processing flow are combined, convenient document processing experience is provided, the user can acquire the required document content without complex operation, and the working efficiency and the user experience are remarkably improved.
A second aspect of the present application provides a document processing apparatus comprising:
The acquisition module is used for acquiring a document image and requirement information of a user, wherein the document image comprises document content of a document to be processed, and the document content comprises at least one type of content in a text type, an image type or a form type;
The segmentation module is used for carrying out segmentation processing on the document image to obtain m document areas, wherein each document area comprises one type of content in the document content, and m is an integer greater than or equal to 1;
The description module is used for obtaining, according to the m document areas and the description information generation model, the description information corresponding to each document area in the n document areas, wherein the description information generation model is trained according to a plurality of sample images and a plurality of sample description information, n is an integer greater than or equal to 1, and n is less than or equal to m;
and the matching module is used for obtaining target document content corresponding to target description information according to the requirement information and the n pieces of description information, wherein the target description information comprises description information, of which the matching similarity with the requirement information is larger than a preset similarity threshold, in the n pieces of description information.
A third aspect of the application provides a computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, the processor implementing the method provided by the first aspect of the application when executing the program.
A fourth aspect of the application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method provided by the first aspect of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic view of an application scenario of a document processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a document processing method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of another document processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of stitching adjacent document images having page-crossing document content in the document processing method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a document processing method according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of extracting document content according to a triggering operation of a user in the document processing method according to the embodiment of the present application;
FIG. 7 is a schematic diagram of a document processing method for displaying a directory interface according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a document processing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a document processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the specific technical solutions of the present application will be described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following examples are illustrative of the application and are not intended to limit the scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
It should be noted that the term "first/second/third" in relation to embodiments of the present application is used to distinguish between similar or different objects, and does not represent a particular ordering of the objects, it being understood that the "first/second/third" may be interchanged with a particular order or sequencing, as permitted, to enable embodiments of the present application described herein to be implemented in an order other than that illustrated or described herein.
In some scenarios, when a user needs to obtain specific content in a document, a preliminary screening may be performed by means of keyword matching or type screening. For example, the user may use keywords such as "financial data" or "product specification" to filter out the content of the document containing these words, or select one or more types of forms, images or text in the document by a type filtering function, so as to narrow the document to be checked. However, although the primary screening can reduce the number of documents, the subsequent manual secondary screening still requires the user to check the content of the screened documents one by one. This is time and effort consuming and also tends to result in the desired document content being missed, thereby affecting the efficiency and accuracy of the document content acquisition.
In view of the above, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for processing a document image corresponding to a document to be processed based on user requirement information, so as to obtain target document content meeting user requirements in document content of the document to be processed, and improve accuracy and content extraction efficiency of document processing.
The document processing method provided by the embodiments of the present application can be applied to electronic devices such as mobile phones, wearable devices (e.g., smart watches, smart bracelets, smart glasses), tablet computers, notebook computers, vehicle-mounted terminals, and personal computers (PCs), without limitation. The functions performed by the method may be implemented by a processor in the electronic device calling program code, which may be stored in a computer storage medium; the electronic device therefore includes at least a processor and a storage medium.
In order to make the purpose and the technical scheme of the application clearer and more visual, the application scene of the document processing method provided by the application is introduced by combining the attached drawings.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a document processing method according to an embodiment of the present application, where a scenario indicated by the schematic view of the application scenario includes a document image 10, where the document image 10 may be obtained by converting a format of an electronic document to be processed, or may be obtained by scanning a paper document to be processed, or may be an image obtained by directly capturing the paper document to be processed, and the like, which is not limited herein.
Note that the document image 10 may include various types of document content, and as shown in fig. 1, the document image 10 may include, but is not limited to, text document content 11 (e.g., paragraph words, titles, etc.), image document content 12 (e.g., photographs, illustrations, etc.), and form document content 13 (e.g., data forms, etc.), among other types of document content. The image document content 12 may be a visual chart such as a pie chart, a bar chart, a line chart, etc., or may be various images such as a product picture, a flow chart, a schematic, etc., which is not limited herein.
By the method provided by the embodiment of the application, the target document content meeting the user requirement can be rapidly and accurately positioned and extracted according to the document image 10 and the acquired requirement information of the user, wherein the target document content is a document part highly related to the user requirement, and for example, the target document content can be the image document content 12. Therefore, a user can acquire required contents without turning over a large amount of document contents, the efficiency and the accuracy of document processing are improved, and the user experience is improved.
In order to facilitate understanding how to process a document image corresponding to a document to be processed based on user demand information, to improve accuracy of document processing and efficiency of content extraction, an embodiment of the document processing method provided by the present application is described below.
Referring to fig. 2, fig. 2 is a schematic flow chart of a document processing method according to an embodiment of the present application, and as shown in fig. 2, the method may include the following steps:
s201, acquiring a document image and requirement information of a user.
In an embodiment of the present application, the document image includes document content of a document to be processed, the document content including at least one type of content of a text type, an image type, or a form type. The user's demand information indicates the document content that the user wishes to obtain from the document to be processed, e.g., the user may need to find some form data, a particular picture, or some piece of textual description in the document to be processed.
The document image may be an image obtained by scanning or photographing a paper document to be processed by a scanning device or a photographing device, or may be an image obtained by converting an electronic document to be processed.
For example, when the document to be processed is an electronic document in a format such as PPT, word, excel, the document to be processed can be stored as a document image in a format such as PNG, JPEG, etc., so that the obtained document image is subjected to segmentation processing in a subsequent step, and the document content of the document to be processed is divided into a plurality of document areas containing a single content type, thereby improving the accuracy of extracting the content required by the user.
In some possible embodiments, the user's demand information may be natural language text, such as "extract financial data form in document" or "find photo of product a" etc., so that the user can express his demand in an intuitive, convenient way. In an exemplary case where the user's requirement information is a natural language text, the requirement information of the user may be obtained through an input device of the electronic device, such as a mouse, a keyboard, or a touch screen, by applying the method of the present application, or may be obtained through an instruction or data sent by another device connected to the electronic device in a communication manner, which is not limited herein.
In some possible embodiments, the user's requirement information may also be a structured instruction, for example a formatted command defined in JSON, XML, or a specific grammar (e.g., {"action": "extract_table", "keyword": "financial data"}), or a standardized operation template generated through a graphical interface (e.g., checking a "form extraction" option and entering the keyword "financial data"), which is not limited herein. Such structured instructions can be generated in batches through an application programming interface (API) and are suitable for automated processing scenarios.
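A structured instruction of this kind is straightforward to validate before use; the field names and allowed actions below mirror the illustrative JSON command above and are not mandated by the embodiment:

```python
import json

ALLOWED_ACTIONS = {"extract_table", "extract_image", "extract_text"}

def parse_requirement(raw: str) -> dict:
    """Parse and minimally validate a JSON-formatted requirement instruction."""
    cmd = json.loads(raw)
    if cmd.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {cmd.get('action')!r}")
    return cmd

cmd = parse_requirement('{"action": "extract_table", "keyword": "financial data"}')
```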
S202, segmentation processing is carried out on the document image, and m document areas are obtained.
In the embodiment of the present application, each document area contains one type of content in the document content, and m is an integer greater than or equal to 1. In other words, any document area obtained by segmentation contains only text-type content, only image-type content, only table-type content, or another single content type. This division facilitates subsequent recognition and processing, ensures that each document area is uniform and accurately delimited, and supports the user's need for precise extraction and processing of different types of document content.
In some possible embodiments, the document image may be subjected to the segmentation process by a segmentation algorithm, such as a segmentation algorithm based on edge detection, a segmentation algorithm based on region growing, or a semantic segmentation algorithm based on deep learning (such as convolutional neural network), which is not limited herein.
Illustratively, the segmentation algorithm based on edge detection distinguishes text, images, tables, and the like from other content by detecting lines and contours in the document image. This approach exploits the differences in visual characteristics of different content types, e.g. text, form areas typically have more regular edges, whereas image areas may contain more complex contours. The segmentation algorithm based on region growing may be extended step by step starting from the seed point to identify consecutive regions. This method is suitable for identifying content blocks with similar features by analyzing the pixel attributes of the image, grouping similar pixels together to form regions. The semantic segmentation algorithm based on deep learning, for example, a Convolutional Neural Network (CNN), can learn the characteristics of document images and realize automatic classification and segmentation of different types of contents. This approach more intelligently recognizes and distinguishes between different content types by training models to understand semantic information in the image.
The document image is divided to obtain m document areas, which can be a plurality of different types of content areas or the same type of content areas, depending on the actual content of the document image. For example, in the case where the document image is the document image 10 shown in fig. 1, the document image is subjected to the division processing, and 4 document areas including 2 document areas including text type contents, 1 document area including image type contents, and 1 document area including form type contents can be obtained. The segmentation method can ensure that the content of each document area is single and complete, and is convenient for subsequent identification and processing.
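As a toy illustration of layout analysis — far simpler than the edge-detection, region-growing, or CNN-based methods mentioned above, but with the same goal — a projection profile can split a grayscale page into vertical bands separated by runs of blank rows:

```python
import numpy as np

def split_by_blank_rows(page: np.ndarray, blank_level=250, min_gap=2):
    """Split a grayscale page (white ~255) into (start, end) row bands
    separated by runs of at least `min_gap` near-blank rows."""
    blank = page.mean(axis=1) >= blank_level
    bands, start, gap = [], None, 0
    for y, is_blank in enumerate(blank):
        if not is_blank:
            if start is None:
                start = y          # open a new content band
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # gap long enough: close the band
                bands.append((start, y - gap + 1))
                start, gap = None, 0
    if start is not None:
        bands.append((start, len(blank)))
    return bands

page = np.full((10, 5), 255, dtype=np.uint8)
page[1:3] = 0   # first content block (e.g. a paragraph)
page[6:8] = 0   # second content block (e.g. a table)
bands = split_by_blank_rows(page)
```

Each band would then still need a classifier to label it as text, image, or table.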
In some possible embodiments, after the document image is segmented to obtain m document areas, the document processing method provided by the present application may further include:
Judging whether a plurality of text paragraphs exist in the document content corresponding to the target document area under the condition that the type of the document content corresponding to the target document area is a text type, wherein the target document area is one document area in m document areas;
and under the condition that a plurality of text paragraphs exist in the document content corresponding to the target document area, dividing the target document area to obtain the document area corresponding to each text paragraph.
It can be appreciated that, compared with the document content of the image, table and other types, the document content of the text type generally comprises a plurality of continuous paragraphs, and the text content is segmented into finer document areas, so that the requirements of users on detailed processing and accurate extraction of the text content can be better met. Illustratively, the boundary of the text paragraph can be determined by detecting the paragraph spacing, the sign, the number of indentation characters, etc., and dividing the target document region.
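A sketch of paragraph-boundary detection on recognized text, using blank lines and leading indentation as the cues mentioned above (the exact cues and thresholds are implementation choices):

```python
def split_paragraphs(text: str) -> list[str]:
    """Split recognized text into paragraphs: a blank line always ends
    a paragraph, and an indented line starts a new one."""
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:                            # blank line: close paragraph
            if current:
                paragraphs.append(" ".join(current))
                current = []
        elif line[:1] in (" ", "\t") and current:   # indent: start new paragraph
            paragraphs.append(" ".join(current))
            current = [stripped]
        else:
            current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

text = "First paragraph line one.\nline two.\n\n  Second paragraph, indented."
paras = split_paragraphs(text)
```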
S203, generating a model according to the m document areas and the description information to obtain the description information corresponding to each document area in the n document areas.
In the embodiment of the application, the description information generation model is trained according to a plurality of sample images and a plurality of sample description information, n is an integer greater than or equal to 1, and n is less than or equal to m.
Among the m document areas obtained, some areas may correspond to blank space, decorative graphics, page numbers, or other non-substantive parts of the document to be processed. Screening the document areas so that only those containing substantive content are retained reduces the number of areas to be processed and improves the efficiency and accuracy of subsequent processing.
In some possible embodiments, if the m document areas all include substantial content, the m document areas may be input into the description information generating model to obtain description information corresponding to each document area, that is, the value of n may be the same as the value of m at this time.
In some possible embodiments, the m document regions may be screened by means of an optical character recognition (OCR) algorithm, by setting a pixel area threshold, or the like.
For example, in the case where the document content corresponding to the document region is text content, the text may be recognized by an OCR algorithm. If the number of characters of the recognized text is smaller than a preset character-number threshold (such as 5 or 10 characters), or the recognized text contains preset text content (such as a header, a footer, copyright information and the like), the document area is determined to be an insubstantial content area and needs no further processing.
In the case that the document content corresponding to the document region is image content, the pixel area of the document region may be obtained. If the obtained pixel area is smaller than the preset area threshold, the document region may be regarded as a decorative pattern or background rather than a substantial content region, and needs no further processing.
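The two screening rules above may be combined as in the following hypothetical sketch. The `ocr_text` callable stands in for any OCR engine, and the thresholds and region format are assumptions mirroring the example values in the text.

```python
# Illustrative sketch of the screening step: keep only regions with
# substantive content. Text regions are filtered by character count and
# boilerplate keywords; image regions by pixel area.

BOILERPLATE = ("header", "footer", "copyright")

def is_substantive(region, ocr_text, min_chars=5, min_area=1000):
    """region: dict with 'type' ('text' or 'image'); image regions also
    carry 'width'/'height' in pixels. ocr_text(region) returns recognized text."""
    if region["type"] == "text":
        text = ocr_text(region).strip()
        if len(text) < min_chars:
            return False
        return not any(tag in text.lower() for tag in BOILERPLATE)
    if region["type"] == "image":
        return region["width"] * region["height"] >= min_area
    return True  # other types (e.g. tables) pass through unfiltered here

def screen_regions(regions, ocr_text):
    return [r for r in regions if is_substantive(r, ocr_text)]
```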
In the embodiment of the application, the description information generation model may take a document area as the input of the model and output the description information of the document area.
In some possible embodiments, the descriptive information generation model may be a multi-modal model, which may be trained by:
First, training data of the description information generation model is acquired, and the training data may include a plurality of sample images and a plurality of sample description information, and the plurality of sample images may include a plurality of types of document contents such as text, images, tables, and the like.
Then, preprocessing the acquired training data, such as downsampling, noise reduction and the like, classifying each document area, and marking the content type of each document area so that the subsequent model can perform feature extraction and learning according to different types of content.
Finally, a preset initial model is trained according to the preprocessed training data, wherein the initial model may be a deep learning model, such as a convolutional neural network (CNN) or a Transformer-based model. In the training process, model parameters can be adjusted by setting a loss function, adopting an optimization algorithm, or the like, so as to minimize the difference between the predicted description information and the sample description information and thereby optimize the performance of the model. The finally trained multi-modal model, namely the description information generation model, can process document areas containing various document content types such as text, images and tables, and generate description information in a unified format for subsequent matching with the user demand information.
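The gradient-based parameter adjustment described above may be illustrated in miniature. This is purely a toy stand-in: a one-parameter least-squares model replaces the real multi-modal network, and the learning rate and epoch count are arbitrary assumptions, but the loop structure (compute loss gradient, update parameters to minimize the prediction/target difference) mirrors the training procedure described.

```python
# Minimal stand-in for the training loop: adjust a model parameter to
# minimise the difference between predictions and targets.

def train(samples, targets, lr=0.1, epochs=100):
    """Fit y ≈ w * x by gradient descent on the mean squared error loss."""
    w = 0.0
    for _ in range(epochs):
        # Loss: mean((w*x - y)^2); gradient w.r.t. w: mean(2*(w*x - y)*x)
        grad = sum(2 * (w * x - y) * x
                   for x, y in zip(samples, targets)) / len(samples)
        w -= lr * grad  # step against the gradient
    return w
```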
In some possible embodiments, the description information generation model may also be a hybrid architecture model comprising a plurality of sub-models. For example, the sub-models may be machine learning algorithms, such as support vector machines or decision trees, or deep learning models, such as CNN or Transformer-based models. For example, conventional machine learning algorithms may be used to process document regions of specific document content types, such as the text or form type, while deep learning models may be used to process document regions of complex document content, such as the image type, and may generate image description information based on prompts. Through the hybrid architecture, the description information generation model can combine the efficiency of traditional machine learning algorithms with the representational capability of deep learning models, realize comprehensive processing of the document areas, and obtain the description information of each document area.
In the embodiment of the application, the description information corresponding to each document area may be natural language text, for example, "this area contains the abstract section of the 2024 financial report" or "this area contains a live photo of the product launch event". Such text descriptions intuitively reflect the content of the document area and are convenient for users to understand and recognize. The description information may also be a structured instruction, such as { "action": "extract_table", "keys": "financial quarter" }, which is not limited herein.
It should be noted that, in order to ensure the efficiency and accuracy of document processing, the requirement information of the user and the description information corresponding to each document area may be in the same form, so as to improve matching efficiency. For example, if the user's demand information is natural language text, the description information should also be generated as natural language text; if the user's demand information is a structured instruction, the description information should also be generated in a structured instruction format. In this way, the target document content required by the user can be quickly and accurately identified.
S204, obtaining target document content corresponding to the target description information according to the requirement information and the n description information.
In the embodiment of the application, the target description information comprises description information, of which the matching similarity with the requirement information is larger than a preset similarity threshold, in the n description information. The target document content refers to the document content of the document region corresponding to the target description information.
In some possible embodiments, in the case that the type of the target document content is a type other than the image type, such as the text type or the form type, the target document content in a target format may be obtained by converting the document area corresponding to the target description information. For example, for document content of the text type, an editable text file may be obtained through text extraction and formatting processing, while for document content of the form type, table structure reconstruction and data extraction may be performed to obtain structured form data. Thus, the user can acquire the target document content in a required format, and the flexibility and practicability of document processing are improved.
In some possible embodiments, in the case where the requirement information and the description information are both natural language text or both structured instructions, the matching similarity between the requirement information and each of the n pieces of description information may be calculated by a similarity calculation method, such as cosine similarity, Jaccard similarity, or semantic-based similarity, so as to obtain n matching similarities, and the description information whose matching similarity is greater than a preset similarity threshold is determined as the target description information. The preset similarity threshold may be set according to the actual application scenario and accuracy requirements, and may be set to 0.7 or 0.8, for example.
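The matching step above may be sketched as follows, assuming both the requirement information and the description information are plain natural language text. Jaccard similarity over word sets is chosen here only because it needs no dependencies; the 0.7 threshold mirrors the example value in the text.

```python
# Sketch of similarity matching: compute Jaccard similarity between the
# requirement and each description, then keep those above the threshold.

def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def match_descriptions(requirement, descriptions, threshold=0.7):
    """Return (index, similarity) pairs whose similarity exceeds the threshold."""
    sims = [(i, jaccard(requirement, d)) for i, d in enumerate(descriptions)]
    return [(i, s) for i, s in sims if s > threshold]
```

When several descriptions exceed the threshold, the caller can either take the highest-scoring one or present all of them to the user, matching the two strategies described below.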
In some possible embodiments, when n is an integer greater than or equal to 2, if the matching similarity between the requirement information and each of 2 or more pieces of description information among the n pieces of description information is greater than the preset similarity threshold, the description information with the highest matching similarity may be determined as the target description information.
In some possible embodiments, when n is an integer greater than or equal to 2, there may be cases where matching similarity between 2 or more pieces of description information and the requirement information in the n pieces of description information is greater than a preset similarity threshold. In this case, the description information with the matching similarity greater than the preset similarity threshold may be determined as target description information, and a plurality of corresponding document contents may be provided for the user to select according to the target description information. Thus, the user can acquire a plurality of potentially relevant document contents, and the flexibility and the user experience of document content extraction are further improved.
The document processing method provided by the embodiment of the application firstly obtains the document image and the requirement information of the user. Then, the document image is segmented into m relatively independent document areas, the document content corresponding to each document area is ensured to be of a single type, and the accuracy of area segmentation is improved. And then, generating respective corresponding description information for different document areas by using a pre-trained description information generation model, namely obtaining n pieces of description information corresponding to n document areas, so that different types of document contents can obtain description information with uniform formats, and the description information is matched with the requirement information of a user. And finally, determining target description information according to the requirement information of the user and the obtained n pieces of description information to obtain target document content corresponding to the target description information, wherein the target description information can be the description information of which the matching similarity with the requirement information is larger than a preset similarity threshold value in the n pieces of description information, so that the operation required by the user for searching the designated document content is reduced, and the accuracy of document processing and the content extraction efficiency are improved.
The manner in which the document image is segmented in the document processing method will be described with reference to the accompanying drawings to better understand the implementation of the document processing method.
Referring to fig. 3, fig. 3 is another flow chart of a document processing method according to an embodiment of the present application, as shown in fig. 3, the method may include the following steps:
S301, acquiring a document image and requirement information of a user.
S302, judging whether any two adjacent document images have page-crossing document contents according to the sequence of the plurality of document images.
In some possible embodiments, the number of document images of the document to be processed may be plural, and the plural document images may be sorted in a certain order, such as page order, photographing time order, etc., for the convenience of subsequent processing and analysis.
Document images that are adjacent in the sequence may contain page-crossing document content. Here, the page-crossing document content includes one type of content in the document content and indicates that a complete content unit, such as a table, a piece of text, or an image, spans two or more document images. If the page-crossing document content is not processed and each document image is directly segmented, the generated description information may be inaccurate, or the finally obtained target content may be incomplete or broken.
In order to solve the problem possibly caused by the page-crossing document content, the document processing method provided by the application can judge whether any two adjacent document images have page-crossing document content or not according to the sequence of the document images. Illustratively, the determination method may be based on a continuity analysis of the bottom and top contents of the page, such as checking whether the text at the bottom of the page is an incomplete sentence, whether there is a significant truncation of the form, or the like.
For text type content, it may be checked whether the text line at the bottom of the page ends with a hyphen or an incomplete word, or whether the text line at the top of the page starts with an incomplete sentence, for example. For table type content, it can be checked whether the table row at the bottom of the page is truncated or whether the table column at the top of the page does not match the column of the previous page. For the image type content, it may be checked whether the image at the bottom or top of the page is truncated, for example, a part of the image is at the bottom of one page and another part is at the top of the next page, or whether the boundary feature of the document area is detected, and whether it is a different part of the same image is determined, which is not limited herein.
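The per-type checks listed above may be sketched as simple heuristics. The specific rules below (hyphen or unfinished sentence at the page bottom, lowercase continuation at the page top, matching table column counts) are illustrative assumptions, not an exhaustive detector.

```python
# Heuristic sketch of the cross-page continuity checks for text and
# table content.

def text_continues(bottom_line, top_line):
    """True if a text paragraph appears to span the page break:
    either the bottom line ends with a hyphen, or it lacks terminal
    punctuation and the next page starts mid-sentence (lowercase)."""
    return bottom_line.rstrip().endswith("-") or (
        not bottom_line.rstrip().endswith((".", "!", "?"))
        and top_line[:1].islower()
    )

def table_continues(bottom_cols, top_cols):
    """True if the table rows at the top of the next page have the same
    column count as the truncated table at the bottom of this page."""
    return bottom_cols > 0 and bottom_cols == top_cols
```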
In some possible embodiments, determining whether any two adjacent document images have page-crossing document content according to the order of the plurality of document images includes:
judging whether the first document content and the second document content have an association relationship under the condition that the type of the first document content at the bottom of the first document image is the same as the type of the second document content at the top of the second document image, wherein the first document image is a sequentially preceding image in any two adjacent document images, and the second document image is a sequentially following image in any two adjacent document images;
Under the condition that the association relation exists between the first document content and the second document content, determining that any two adjacent document images exist page-crossing document content.
It can be understood that the page-crossing document content in adjacent document images is of the same type of document content, and the efficiency of judging whether page-crossing document content exists can be improved by comparing the type of the first document content at the bottom of the first document image with the type of the second document content at the top of the second document image. For example, when the type of the first document content is different from the type of the second document content, it is directly determined that the two adjacent document images do not have page-crossing document content, without performing the judgment of the association relationship.
Wherein the association relationship can indicate whether the first document content and the second document content belong to the same content, such as a piece of text, the same image, the same table, or the like. In the case where the type of the first document content at the bottom of the first document image and the type of the second document content at the top of the second document image are the same, it is possible to determine whether or not there is an association relationship between the first document content and the second document content by the above-described determination means such as continuity check of the text lines, continuity check of the form lines, integrity check of the image, and the like.
In some possible embodiments, the types of document content in each document region may be generated by a descriptive information generation model, or may be determined by other image recognition or text recognition techniques. For example, text content can be identified by OCR technology, image or table content can be identified by an image identification algorithm, accuracy of type information of a document area is ensured, and reliable basis is provided for subsequent processing.
S303, performing splicing processing on any two adjacent document images with the page-crossing document content to obtain spliced document images.
Referring to fig. 4, fig. 4 is a schematic diagram of performing splicing processing on adjacent document images having page-crossing document content in the document processing method according to an embodiment of the present application. As shown in fig. 4, the type of the first document content 41 at the bottom of the first document image and the type of the second document content 42 at the top of the second document image are the same. If it is detected that the first document content 41 and the second document content 42 belong to the same table, the first document image and the second document image may be subjected to splicing processing to obtain a spliced document image. Thus, when the spliced document image is subjected to subsequent segmentation processing, the complete document region 43 can be obtained, and the accuracy and integrity of content extraction can be improved.
In some possible embodiments, when any two adjacent document images with page-crossing document content are subjected to splicing processing, the quality of the spliced document images after the splicing processing can be improved through technologies such as image alignment, image fusion and the like. For example, the splicing position 44 shown in fig. 4 can be transited naturally by cutting a blank area, adjusting the brightness and contrast of the image, or carrying out gradual blurring on the splicing position, so as to improve the visual effect of the spliced document image and the accuracy of subsequent processing.
In some possible embodiments, after the splicing is completed, it may further be checked whether the spliced document image still has the page-crossing document content with other adjacent document images, and if so, the above-mentioned splicing process needs to be repeated until all page-crossing document contents are completely spliced, so as to ensure the integrity of the document contents. Therefore, in the process of processing the document to be processed containing a large amount of page-crossing content, such as a table containing a large number of lines, the mode of multi-time splicing can ensure the integrity and accuracy of the finally obtained target document content.
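The "cutting a blank area" step of the splicing process may be sketched as follows. Images are represented here as plain lists of pixel rows purely for illustration; a real system would operate on image arrays and would also perform the alignment and fusion steps mentioned above.

```python
# Minimal splicing sketch: trim blank rows at the seam, then vertically
# concatenate the first page above the second.

def trim_blank(rows, blank=255, from_top=False):
    """Drop fully-blank rows from one edge of the image."""
    rows = list(rows)
    while rows and all(p == blank for p in (rows[0] if from_top else rows[-1])):
        rows.pop(0 if from_top else -1)
    return rows

def splice(page_a, page_b):
    """Concatenate page_a above page_b, dropping blank margins at the seam."""
    return trim_blank(page_a) + trim_blank(page_b, from_top=True)
```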
S304, segmentation processing is carried out on the spliced document image and the non-spliced document image, so that m document areas are obtained.
In some possible embodiments, when the document to be processed corresponds to a plurality of document images and any two adjacent document images contain page-crossing document content, segmentation processing may be performed on the spliced document image and the non-spliced document images that did not participate in the splicing processing among the plurality of document images, so as to obtain m document areas.
For example, suppose the document to be processed corresponds to 4 document images in sequence, namely document image 1, document image 2, document image 3 and document image 4. If document image 2 and document image 3 have page-crossing document content, the two are spliced to obtain a spliced document image 5. At this time, the required document regions can be obtained by performing segmentation processing on the spliced document image 5 together with the non-spliced document image 1 and document image 4.
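The sequence handling in this example may be sketched as a single pass over the ordered pages, merging each run of consecutive pages linked by page-crossing content before segmentation. The `crosses` and `join` callables are assumed stand-ins for the detector of step S302 and the splicing of step S303.

```python
# Sketch: walk the ordered pages, splicing runs of adjacent pages that
# share page-crossing content; the repeated inner check handles content
# spanning three or more pages.

def merge_pages(pages, crosses, join):
    merged, i = [], 0
    while i < len(pages):
        current = pages[i]
        # Keep splicing while the next page continues the current content.
        while i + 1 < len(pages) and crosses(current, pages[i + 1]):
            current = join(current, pages[i + 1])
            i += 1
        merged.append(current)
        i += 1
    return merged
```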
S305, generating a model according to the m document areas and the description information to obtain the description information corresponding to each document area in the n document areas.
S306, obtaining target document content corresponding to the target description information according to the requirement information and the n description information.
By implementing the technical scheme, the page-crossing document content can be effectively identified, incomplete or broken document content caused by page-crossing problem is avoided, continuity and integrity of the document content are ensured, and efficiency and accuracy of document processing are improved.
Referring to fig. 5, fig. 5 is a schematic flow chart of a document processing method according to an embodiment of the present application, and as shown in fig. 5, the method may include the following steps:
S501, acquiring a document image and requirement information of a user.
In some possible embodiments, acquiring the document image and the user's demand information includes:
Acquiring a document to be processed and voice data of a user;
performing voice recognition on the voice data to obtain the requirement information;
And converting the document to be processed to obtain a document image.
For example, the speech signal may be converted to text by a deep learning based speech recognition model. By combining the voice data of the user with the document processing flow, a convenient document processing experience is provided, the user can acquire the required document content without complex operation, and the working efficiency and the user experience are remarkably improved.
S502, carrying out segmentation processing on the document image to obtain m document areas.
S503, according to the target type in the demand information and the type of the document content corresponding to each document region in the m document regions, obtaining n document regions with the same type as the target type corresponding to the document content in the m document regions.
In some possible embodiments, the demand information may be expressed as natural language text such as "a financial form is required", or as a structured instruction such as { "action": "extract_table", "keywords": "financial quarter" }, and the like. If it is detected that the requirement information contains a target type designated by the user (such as table or extract_table), all document areas whose document content type is the form type can be screened out from the m document areas, so as to obtain the n document areas.
It can be understood that if the target type in the requirement information of the user indicates that the form type in the document needs to be extracted, the system screens out all document areas with the document content type of the form from m document areas as n document areas. Thus, unnecessary calculation is reduced, and generated description information is ensured to be more fit with the requirements of users.
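The type filter of S503 may be sketched as follows. The structured-instruction key names follow the example in the text, while the region representation and the keyword list for natural language requirements are assumptions for illustration.

```python
# Sketch of the S503 type filter: keep only regions whose content type
# matches the target type parsed from the requirement information.

def filter_by_target_type(regions, requirement):
    """regions: list of dicts with a 'type' key ('table', 'image', 'text').
    requirement: a structured-instruction dict or a natural language string."""
    if isinstance(requirement, dict):              # structured instruction
        target = requirement.get("action", "").removeprefix("extract_")
    else:                                          # natural language text
        target = next((t for t in ("table", "image", "text")
                       if t in requirement.lower()), None)
    return [r for r in regions if target and r["type"] == target]
```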
S504, generating a model and n document areas according to the description information to obtain n description information.
S505, calculating the matching similarity between the requirement information and each of the n pieces of description information, and determining the description information with the matching similarity larger than a preset similarity threshold as the target description information.
In some possible embodiments, matching similarity between the requirement information and each of the n pieces of description information may be calculated by a similarity calculation method, such as cosine similarity, jaccard similarity, or semantic-based similarity, and the description information with the matching similarity greater than the preset similarity threshold may be determined as the target description information.
S506, converting the document area corresponding to the target description information to obtain target document content in a target format.
Through the implementation of the technical scheme, the user demand information and the document image can be conveniently acquired through voice recognition and document conversion, and the user operation steps are simplified. And meanwhile, according to the target type in the requirement information of the user, n document areas with the same corresponding document content type as the target type are screened from the m document areas, so that the efficiency and accuracy of document content extraction are improved.
Referring to fig. 6, fig. 6 is a schematic flow chart of extracting document content according to a triggering operation of a user in the document processing method according to the embodiment of the present application, and as shown in fig. 6, the method may include the following steps:
S601, grouping n pieces of description information according to the type of the document content corresponding to each document area in the n document areas to obtain at least one group.
In some possible embodiments, after the description information corresponding to each of the n document areas is obtained according to the m document areas and the description information generation model, the document processing method provided by the application can provide a more intuitive and convenient interaction mode for extracting document content for users through a display interface and corresponding controls.
It should be noted that each of the at least one group includes at least one piece of description information whose corresponding document content is of the same type. For example, in the case where the document content of the document to be processed includes the text type, the image type, and the form type, the description information corresponding to all text-type areas may be divided into one group, the description information corresponding to all image-type areas may be divided into one group, and the description information corresponding to all form-type areas may be divided into one group.
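The grouping of S601 amounts to bucketing the description information by the content type of its document region, as in this minimal sketch (the pair representation of the entries is an assumption):

```python
# Sketch of S601: group description entries by content type, preserving
# the order in which each description is encountered.

def group_descriptions(entries):
    """entries: list of (content_type, description) pairs.
    Returns {content_type: [descriptions]}."""
    groups = {}
    for content_type, description in entries:
        groups.setdefault(content_type, []).append(description)
    return groups
```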
S602, displaying a catalog interface.
In some possible embodiments, the directory interface includes a group control corresponding to each of the at least one group.
Referring to fig. 7, fig. 7 is a schematic diagram of a document processing method according to an embodiment of the present application, where, as shown in fig. 7, a directory interface 70 includes 3 grouping controls, namely, a text grouping control 71, an image grouping control 72, and a form grouping control 73.
In some possible embodiments, the number of pieces of description information contained in each group, that is, the number of document areas of the corresponding category, may also be displayed at the position of the group control. As shown in fig. 7, 9 document areas of the text category, 4 document areas of the image category, and 3 document areas of the form category are extracted in total. Thus, a user can quickly know how many document areas exist under each category, and the efficiency of document processing and the user experience are improved.
S603, responding to the triggering operation of the target grouping control, and displaying a grouping interface corresponding to the target grouping control.
In some possible embodiments, the grouping interface includes a description control corresponding to each piece of description information in the group corresponding to the target grouping control.
Referring to fig. 8, fig. 8 is a schematic diagram of a grouping interface displayed in the document processing method according to the embodiment of the present application, in some possible embodiments, in a case where a target grouping control triggered by a user is a form grouping control in a plurality of grouping controls, a grouping interface 80 shown in fig. 8 may be displayed, where the grouping interface 80 includes 3 description controls corresponding to 3 description information in the grouping, which are a first description control 81 corresponding to "current year financial form", a second description control 82 corresponding to "calendar year financial form", and a third description control 83 corresponding to "service growth form", respectively.
Through displaying the description control corresponding to each piece of description information, a user can quickly locate the required document content, and the efficiency and user experience of document content extraction are improved.
In some possible embodiments, the grouping interface may further include document content corresponding to each of the description information, so that the user can more intuitively understand the content of each document region. For example, in the grouping interface, in addition to displaying the description control, a complete image or a thumbnail of a document area corresponding to each description information may be displayed separately. Therefore, a user can quickly browse and screen the required content without opening the document, and the efficiency and convenience of document processing are further improved.
S604, responding to the triggering operation of the target description control, and obtaining the document content of the description information corresponding to the target description control.
In some possible embodiments, the target description control may be one of a plurality of description controls in the grouping interface, and by triggering the target description control, the user can flexibly select the required document content, and the obtained document content may be document content in an image form or may be converted into a format corresponding to the type of the document content, which is not limited herein.
By implementing the technical scheme, the user can select a proper type according to the self requirement by displaying the catalog interface and according to the triggering operation of the user, and then select specific description information, so that the required document content can be acquired rapidly. The hierarchical interaction design not only improves the retrieval efficiency, but also obviously improves the user experience and reduces the complexity of operation.
It should be understood that, although the steps in the flowcharts described above are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages, which are not necessarily performed at the same time, but may be performed at different times; the execution order of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least a part of the sub-steps or stages of other steps.
Based on the foregoing embodiments, the embodiments of the present application provide a document processing apparatus, where the document processing apparatus includes each module included and each unit included in each module may be implemented by a processor, or may of course be implemented by a specific logic circuit, and in the implementation process, the processor may be a Central Processing Unit (CPU), a Microprocessor (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a document processing apparatus according to an embodiment of the present application, where, as shown in fig. 9, the document processing apparatus includes an obtaining module 901, a dividing module 902, a describing module 903, and a matching module 904, where:
An acquisition module 901, configured to acquire a document image and requirement information of a user, where the document image includes document content of a document to be processed, and the document content includes at least one type of content of a text type, an image type, or a form type.
A segmentation module 902, configured to perform segmentation processing on a document image to obtain m document areas, where each document area includes one type of content in document contents, and m is an integer greater than or equal to 1.
The description module 903 is configured to obtain description information corresponding to each of n document areas according to m document areas and a description information generating model, where the description information generating model is trained according to a plurality of sample images and a plurality of sample description information, n is an integer greater than or equal to 1, and n is less than or equal to m.
And the matching module 904 is configured to obtain, according to the requirement information and the n pieces of description information, target document content corresponding to the target description information, where the target description information includes description information, of the n pieces of description information, having a matching similarity with the requirement information greater than a preset similarity threshold.
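For illustration only, the cooperation of the four modules above can be sketched in Python. The description information generation model and the matching-similarity measure are not specified by the embodiment, so both are replaced here with hypothetical stand-ins (a template description and a word-overlap Jaccard score); `DocumentRegion`, `segment`, `describe`, and `match` are illustrative names, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class DocumentRegion:
    content_type: str   # "text", "image", or "form"
    content: str        # region payload (placeholder for pixels / table cells)

def segment(document_image: list) -> list:
    """Hypothetical segmentation: each (type, content) pair stands in for one region."""
    return [DocumentRegion(t, c) for t, c in document_image]

def describe(region: DocumentRegion) -> str:
    """Stand-in for the trained description information generation model."""
    return f"{region.content_type}: {region.content}"

def match(requirement: str, regions: list, threshold: float = 0.5) -> list:
    """Return content whose description matches the requirement above the threshold."""
    def similarity(a: str, b: str) -> float:
        # Word-overlap Jaccard score as a stand-in for the matching similarity.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0
    return [r.content for r in regions
            if similarity(requirement, describe(r)) > threshold]
```

A real implementation would substitute a trained image-to-text model for `describe` and a learned similarity measure for the Jaccard score.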
In some possible embodiments, the number of document images is multiple, and the segmentation module 902 is further configured to: determine, according to the order of the multiple document images, whether any two adjacent document images contain page-crossing document content, where the page-crossing document content includes one type of content in the document content; perform stitching processing on any two adjacent document images containing the page-crossing document content to obtain a stitched document image; and perform segmentation processing on the stitched document image and the un-stitched document images to obtain the m document areas.
In some possible embodiments, the segmentation module 902 is further configured to: determine whether an association relationship exists between first document content and second document content if the type of the first document content at the bottom of a first document image is the same as the type of the second document content at the top of a second document image, where the first document image is the earlier image of any two adjacent document images and the second document image is the later image of the two; and, when the association relationship exists between the first document content and the second document content, determine that page-crossing document content exists in the two adjacent document images.
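A minimal sketch of this page-crossing judgment and the subsequent stitching, under the assumption that the association relationship can be approximated by checking whether the earlier fragment ends mid-sentence (the embodiment leaves the association test unspecified); all names are hypothetical:

```python
def has_cross_page_content(page_a: dict, page_b: dict) -> bool:
    """page_a['bottom'] / page_b['top'] are (type, text) pairs for the edge regions."""
    bottom_type, bottom_text = page_a["bottom"]
    top_type, _top_text = page_b["top"]
    if bottom_type != top_type:
        return False
    # Hypothetical association check: the fragment at the bottom of the earlier
    # page continues onto the next page if it ends mid-sentence.
    return not bottom_text.rstrip().endswith((".", "!", "?"))

def stitch(pages: list) -> list:
    """Merge adjacent pages that share page-crossing content; pass the rest through."""
    merged, i = [], 0
    while i < len(pages):
        if i + 1 < len(pages) and has_cross_page_content(pages[i], pages[i + 1]):
            merged.append({"stitched": (pages[i], pages[i + 1])})
            i += 2
        else:
            merged.append(pages[i])
            i += 1
    return merged
```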
In some possible embodiments, the description module 903 is further configured to: obtain, according to the target type in the requirement information and the type of the document content corresponding to each of the m document areas, the n document areas whose corresponding document content is of the same type as the target type; and obtain the n pieces of description information according to the description information generation model and the n document areas.
In some possible embodiments, the matching module 904 is further configured to calculate a matching similarity between the requirement information and each of the n pieces of description information, determine that the description information with the matching similarity greater than a preset similarity threshold is the target description information, and convert a document region corresponding to the target description information to obtain target document content in the target format.
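As one concrete reading of this matching step, the sketch below uses a bag-of-words cosine as the matching similarity; the embodiment does not fix the similarity measure, and a production system would likely use learned text embeddings instead:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine; a stand-in for a learned similarity measure."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def select_targets(requirement: str, descriptions: list, threshold: float) -> list:
    """Indices of descriptions whose similarity to the requirement exceeds the threshold."""
    return [i for i, d in enumerate(descriptions)
            if cosine_similarity(requirement, d) > threshold]
```

The document regions at the selected indices would then be converted into the target format to produce the target document content.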
In some possible embodiments, the document processing apparatus further includes a display module, configured to: perform grouping processing on the n pieces of description information according to the type of document content corresponding to each of the n document areas to obtain at least one group, where each group includes at least one piece of description information whose corresponding document content is of the same type; display a directory interface, where the directory interface includes a group control corresponding to each of the at least one group; display, in response to a triggering operation on a target group control, a group interface corresponding to the target group control, where the group interface includes a description control corresponding to each piece of description information in the group corresponding to the target group control; and obtain, in response to a triggering operation on a target description control, the document content of the description information corresponding to the target description control.
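The grouping step can be sketched as follows; each group preserves the order in which content types first appear, mirroring the order of group controls in a directory interface (names are illustrative, not part of the disclosure):

```python
from collections import OrderedDict

def group_descriptions(regions: list) -> "OrderedDict[str, list]":
    """Group description strings by the content type of their source region.

    Each region is a (content_type, description) pair; group keys keep
    first-appearance order, like a directory of group controls.
    """
    groups: "OrderedDict[str, list]" = OrderedDict()
    for content_type, description in regions:
        groups.setdefault(content_type, []).append(description)
    return groups
```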
In some possible embodiments, the obtaining module 901 is configured to obtain the to-be-processed document and the voice data of the user, perform voice recognition on the voice data to obtain the requirement information, and convert the to-be-processed document to obtain the document image.
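A hedged sketch of this acquisition step: the speech recognizer is injected as a callable because the embodiment does not name a specific engine, and document-to-image conversion is stubbed as one entry per page:

```python
def acquire_inputs(pending_document: list, voice_data: bytes, recognize) -> tuple:
    """Turn raw inputs into (document images, requirement information).

    `recognize` is a hypothetical speech-to-text callable; a real system
    would render each page of the document to an actual image.
    """
    requirement = recognize(voice_data)
    document_images = [{"page": i, "content": page}
                       for i, page in enumerate(pending_document)]
    return document_images, requirement
```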
The description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, please refer to the description of the embodiments of the method of the present application.
It should be noted that, in the embodiment of the present application, the division of the document processing apparatus shown in fig. 9 into modules is merely a division of logical functions, and another division manner may be adopted in actual implementation. In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware, in software functional units, or in a combination of software and hardware.
It should be noted that, in the embodiment of the present application, if the method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the related art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing an electronic device to execute all or part of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
The embodiment of the application provides a computer device, which may be a server; its internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data, and the network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method described above.
An embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method provided in the above-described embodiment.
Embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the method provided by the method embodiments described above.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
In one embodiment, the document processing apparatus provided by the present application may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 10. The memory of the computer device may store the program modules that make up the apparatus, and the computer program composed of these program modules causes the processor to carry out the steps of the methods of the embodiments of the application described in this specification. It should be noted here that the description of the storage medium and apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects. For technical details not disclosed in the storage medium and apparatus embodiments of the present application, please refer to the description of the method embodiments of the present application.
It should be appreciated that reference throughout this specification to "one embodiment," "an embodiment," or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment," "in an embodiment," or "in some embodiments" in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments. The foregoing embodiment numbers of the present application are merely for description and do not represent the relative merits of the embodiments. The foregoing description of the various embodiments is intended to highlight the differences between them; for the parts that are the same or similar, the embodiments may refer to one another, and details are not repeated herein for brevity.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments are merely illustrative; for example, the division of the modules is merely a division of logical functions, and other division manners may be used in actual implementation, e.g., multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or modules may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and components displayed as modules may or may not be physical modules, may be located in one place or distributed on a plurality of network units, and may select some or all of the modules according to actual needs to achieve the purpose of the embodiment. In addition, each functional module in each embodiment of the present application may be integrated in one processing unit, or each module may be separately used as a unit, or two or more modules may be integrated in one unit, where the integrated modules may be implemented in hardware or in a form of hardware plus a software functional unit.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above method embodiments may be completed by program instructions running on relevant hardware. The above program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the storage medium includes various media capable of storing program code, such as a removable storage device, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Alternatively, the above integrated units of the application, if implemented in the form of software functional modules and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the related art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing an electronic device to execute all or part of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The methods disclosed in the method embodiments provided by the application can be arbitrarily combined under the condition of no conflict to obtain a new method embodiment. The features disclosed in the several product embodiments provided by the application can be combined arbitrarily under the condition of no conflict to obtain new product embodiments. The features disclosed in the embodiments of the method or the apparatus provided by the application can be arbitrarily combined without conflict to obtain new embodiments of the method or the apparatus.
The foregoing is merely an embodiment of the present application, but the scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present application, and such changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A document processing method, comprising:
Acquiring a document image and requirement information of a user, wherein the document image comprises document content of a document to be processed, and the document content comprises at least one type of content in a text type, an image type or a form type;
dividing the document image to obtain m document areas, wherein each document area comprises one type of content in the document content, and m is an integer greater than or equal to 1;
Obtaining description information corresponding to each document region in n document regions according to the m document regions and the description information generation model, wherein the description information generation model is obtained by training according to a plurality of sample images and a plurality of sample description information, n is an integer greater than or equal to 1, and n is less than or equal to m;
and obtaining target document content corresponding to target description information according to the requirement information and the n pieces of description information, wherein the target description information comprises description information, of which the matching similarity with the requirement information is larger than a preset similarity threshold, in the n pieces of description information.
2. The method according to claim 1, wherein the number of the document images is plural, and the dividing of the document images to obtain the m document areas comprises:
judging whether any two adjacent document images have page-crossing document contents according to the sequence of a plurality of document images, wherein the page-crossing document contents comprise one type of contents in the document contents;
Performing splicing processing on any two adjacent document images with the page-crossing document content to obtain spliced document images;
And carrying out the segmentation processing on the spliced document image and the document image which is not spliced to obtain the m document areas.
3. The method of claim 2, wherein the determining whether any two adjacent document images have page-crossing document content according to an order of the plurality of document images comprises:
Judging whether an association relationship exists between first document contents and second document contents or not under the condition that the types of the first document contents at the bottom of the first document images and the types of the second document contents at the top of the second document images are the same, wherein the first document images are images with the front sequence in any two adjacent document images, and the second document images are images with the rear sequence in any two adjacent document images;
And under the condition that the association relation exists between the first document content and the second document content, determining that the page-crossing document content exists in any two adjacent document images.
4. The method according to claim 1, wherein the obtaining, according to the m document areas and the description information generation model, the description information corresponding to each document area in the n document areas comprises:
According to the target type in the requirement information and the type of the document content corresponding to each document region in the m document regions, obtaining the n document regions whose corresponding document content is of the same type as the target type in the m document regions;
and obtaining the n pieces of description information according to the description information generation model and the n document areas.
5. The method of claim 1, wherein the obtaining, according to the requirement information and the n pieces of description information, the target document content corresponding to the target description information includes:
calculating the matching similarity between the requirement information and each of the n pieces of description information, and determining the description information with the matching similarity larger than the preset similarity threshold as the target description information;
And converting the document area corresponding to the target description information to obtain the target document content in a target format.
6. The method according to claim 1, wherein after the obtaining, according to the m document areas and the description information generation model, the description information corresponding to each document area in the n document areas, the method further comprises:
According to the type of the document content corresponding to each document region in the n document regions, carrying out grouping processing on the n description information to obtain at least one group, wherein each group in the at least one group comprises at least one description information with the same type of the corresponding document content;
displaying a catalog interface, wherein the catalog interface comprises a grouping control corresponding to each grouping in the at least one grouping;
responding to triggering operation of a target grouping control, and displaying a grouping interface corresponding to the target grouping control, wherein the grouping interface comprises a description control corresponding to each piece of description information in a group corresponding to the target grouping control;
And responding to the triggering operation of the target description control, and obtaining the document content of the description information corresponding to the target description control.
7. The method of claim 1, wherein the acquiring the document image and the user's demand information comprises:
Acquiring the to-be-processed document and voice data of the user;
performing voice recognition on the voice data to obtain the requirement information;
And converting the document to be processed to obtain the document image.
8. A document processing apparatus, comprising:
The system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a document image and requirement information of a user, wherein the document image comprises document content of a document to be processed, and the document content comprises at least one type of content in a text type, an image type or a form type;
The segmentation module is used for carrying out segmentation processing on the document image to obtain m document areas, wherein each document area comprises one type of content in the document content, and m is an integer greater than or equal to 1;
The description module is used for obtaining, according to the m document areas and a description information generation model, the description information corresponding to each document area in n document areas, wherein the description information generation model is trained according to a plurality of sample images and a plurality of sample description information, n is an integer greater than or equal to 1, and n is less than or equal to m;
and the matching module is used for obtaining target document content corresponding to target description information according to the requirement information and the n pieces of description information, wherein the target description information comprises description information, of which the matching similarity with the requirement information is larger than a preset similarity threshold, in the n pieces of description information.
9. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the program is executed.
10. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510842371.2A CN120356231A (en) | 2025-06-23 | 2025-06-23 | Document processing method and device, equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120356231A true CN120356231A (en) | 2025-07-22 |
Family
ID=96411357
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510842371.2A Pending CN120356231A (en) | 2025-06-23 | 2025-06-23 | Document processing method and device, equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120356231A (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114863408A (en) * | 2021-06-10 | 2022-08-05 | 四川医枢科技有限责任公司 | Document content classification method, system, device, and computer-readable storage medium |
| CN119598971A (en) * | 2024-10-16 | 2025-03-11 | 上海人工智能创新中心 | PDF extraction method and system based on deep learning and layout analysis |
| CN119621675A (en) * | 2024-11-11 | 2025-03-14 | 立心通智科技(北京)有限公司 | A chart analysis method, device and electronic equipment |
| EP4531005A1 (en) * | 2023-09-29 | 2025-04-02 | 2Jdb | Method for extracting data from a structured graphic document, program product and recording medium for implementing such a method |
| CN120179886A (en) * | 2025-03-04 | 2025-06-20 | 平安科技(深圳)有限公司 | Information retrieval method and device, electronic device and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |