CN111209865A - File content extraction method and device, electronic equipment and storage medium

Info

Publication number: CN111209865A
Application number: CN202010012359.6A
Authority: CN
Inventors: 刘小康; 李健铨
Original assignee: Dinfo Beijing Science Development Co ltd
Current assignee: Dinfo Beijing Science Development Co ltd
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2020-05-29

Abstract

The invention relates to a file content extraction method and device, electronic equipment and a storage medium, and belongs to the field of text processing. The method comprises the following steps: the electronic device acquires a file to be extracted; segments the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognizes each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes the content inside each text box, the influence of interference factors outside the text boxes on recognition accuracy is reduced, and overall recognition accuracy can be improved.

Description

File content extraction method and device, electronic equipment and storage medium

Technical Field

The application belongs to the field of text processing, and particularly relates to a file content extraction method and device, electronic equipment and a storage medium.

Background

In recent years, character recognition and character understanding for image texts have become active research topics.

Optical Character Recognition (OCR) is one of the most important approaches to text recognition and can achieve high recognition accuracy when scanning simple texts (for example, texts with a single background and an orderly layout). In actual application scenarios, however, the texts to be recognized are often complex: text formats vary, and wrinkles, shadows, and the like may be present. The recognition effect obtained when OCR is applied to such scenarios is therefore poor, and OCR cannot meet the actual requirement of extracting text content.

Disclosure of Invention

In view of the above, an object of the present application is to provide a file content extraction method, an apparatus, an electronic device, and a storage medium, so as to provide a file content extraction scheme that can adapt to the complexity of an actual application scenario.

The embodiment of the application is realized as follows:

In a first aspect, an embodiment of the present application provides a file content extraction method, where the method includes:

acquiring a file to be extracted; segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognizing each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes the content inside each text box, the influence of interference factors outside the text boxes on recognition accuracy is reduced, and overall recognition accuracy can be improved.

With reference to the embodiment of the first aspect, in a possible implementation manner, the file to be extracted is a red-header file that includes a red separation line, and the method further includes: determining, from the file to be extracted, the position for representing the red separation line; determining a file header and a file body of the red-header file by taking the position for representing the red separation line as a reference; and respectively outputting the text content of the file header and the text content of the file body.

With reference to the embodiment of the first aspect, in a possible implementation manner, after obtaining the plurality of text boxes containing text and before recognizing each text box through the text recognition model, the method further includes: calculating the border height of each text box; and merging the text boxes that are located on the same line and whose border heights differ by less than a threshold into one text box.

With reference to the embodiment of the first aspect, in a possible implementation manner, after acquiring the file to be extracted and before segmenting the file to be extracted through the text segmentation model to obtain a plurality of text boxes containing text, the method further includes: removing interference factors in the file to be extracted to obtain a preprocessed file.

Correspondingly, segmenting the file to be extracted through the text segmentation model to obtain a plurality of text boxes containing text includes: segmenting the preprocessed file through the text segmentation model to obtain a plurality of text boxes containing text.

With reference to the embodiment of the first aspect, in a possible implementation manner, removing the interference factors in the file to be extracted includes: removing the red content at a preset position of the file to be extracted.

With reference to the embodiment of the first aspect, in a possible implementation manner, the method further includes: correcting the text content in each text box through a pre-stored text error correction model.

In a second aspect, an embodiment of the present application provides a file content extraction apparatus, including an acquisition module, a segmentation module and a recognition module. The acquisition module is used for acquiring a file to be extracted; the segmentation module is used for segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and the recognition module is used for recognizing each text box through a text recognition model to obtain the text content in each text box.

With reference to the second aspect, in a possible implementation manner, the file to be extracted is a red-header file that includes a red separation line, and the file content extraction apparatus further includes a determination module and an output module. The determination module is used for determining, from the file to be extracted, the position for representing the red separation line; the determination module is further configured to determine a file header and a file body of the red-header file based on the position for representing the red separation line; and the output module is used for respectively outputting the text content of the file header and the text content of the file body.

With reference to the second aspect, in a possible implementation manner, the file content extraction apparatus further includes a calculation module and a merging module. The calculation module is used for calculating the border height of each text box; and the merging module is used for merging the text boxes that are located on the same line and whose border heights differ by less than a threshold into one text box.

With reference to the second aspect, in a possible implementation manner, the file content extraction apparatus further includes a removing module, configured to remove interference factors in the file to be extracted to obtain a preprocessed file.

Correspondingly, the segmentation module is configured to segment the preprocessed file through the text segmentation model to obtain a plurality of text boxes containing text.

With reference to the second aspect, in a possible implementation manner, the removing module is configured to remove red content at a preset position of the file to be extracted.

With reference to the second aspect, in a possible implementation manner, the file content extracting apparatus further includes an error correction module, configured to correct the text content in each text box through a pre-stored text correction model.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory and a processor, the memory and the processor connected; the memory is used for storing programs; the processor calls a program stored in the memory to perform the method of the first aspect embodiment and/or any possible implementation manner of the first aspect embodiment.

In a fourth aspect, the present application further provides a non-volatile computer-readable storage medium (hereinafter, referred to as a storage medium), on which a computer program is stored, where the computer program is executed by a computer to perform the method in the foregoing first aspect and/or any possible implementation manner of the first aspect.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort. The foregoing and other objects, features and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily drawn to scale; emphasis is instead placed on illustrating the subject matter of the present application.

Fig. 1 shows a first flowchart of a file content extraction method provided in an embodiment of the present application.

Fig. 2 shows an operation diagram of a PixelLink model provided in an embodiment of the present application.

Fig. 3 shows a second flowchart of a file content extraction method provided in an embodiment of the present application.

Fig. 4 shows a block diagram of a file content extracting apparatus according to an embodiment of the present application.

Fig. 5 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be noted that like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used herein solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Further, the term "and/or" in the present application merely describes an association relationship between associated objects and means that three kinds of relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.

The embodiment of the application provides a file content extraction method and device, electronic equipment and a storage medium, so that file content applied to an actual scene can be extracted conveniently. The technology can be realized by adopting corresponding software, hardware and a combination of software and hardware. The following describes embodiments of the present application in detail.

The following description will be directed to a file content extraction method provided in the present application.

Referring to fig. 1, an embodiment of the present application provides a file content extraction method applied to an electronic device. The steps involved will be described below with reference to fig. 1.

Step S110: acquiring the file to be extracted.

In the embodiment of the present application, the file to be extracted may be in a picture format or in PDF (Portable Document Format).

In addition, the file to be extracted can be an ordinary file or a red-header file issued by an official body.

As an optional implementation manner, in order to improve the identification accuracy of the subsequent identification process, after the file to be extracted is obtained, the file to be extracted may be preprocessed to obtain a preprocessed file.

The preprocessing includes, but is not limited to, at least one of: removing a watermark from the file to be extracted; removing shadows caused by lighting problems when the text image was captured; and correcting the skew of the file to be extracted.

Watermark removal and shadow removal can be realized by applying Gaussian blur to the file to be extracted and then dynamically adjusting its binarization threshold.

To correct skew in the file to be extracted, a frequency-domain image of the file is first obtained through a Fourier transform; the inclination angle of the straight line in the frequency domain is then obtained through a Hough line transform; and that inclination angle is then adjusted, thereby correcting the skew in the file to be extracted.

In addition, when the file to be extracted is a red-header file, the preprocessing may further include removing red content in a preset area, where the red content includes, but is not limited to, a red stamp included in the file to be extracted.

In a red-header file, the red stamp is generally located in the lower left corner or the lower right corner of the file. The red channel of a preset area (for example, the lower left corner and the lower right corner) of the file to be extracted can therefore be detected, and the red channel value of the preset area is then adjusted to 0, thereby removing the red stamp.
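The preset-area idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the image is a nested list of (R, G, B) tuples, the redness test `is_red` and its thresholds are hypothetical, and the sketch overwrites red pixels with a white fill value rather than setting the red-channel value to 0 as the description states, so that the effect on a white page is easy to see.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255
Image = List[List[Pixel]]     # rows of pixels, row 0 at the top


def is_red(pixel: Pixel, min_red: int = 150, margin: int = 60) -> bool:
    """Hypothetical redness test: red is strong and clearly dominates G and B."""
    r, g, b = pixel
    return r >= min_red and r - max(g, b) >= margin


def remove_red_in_region(
    image: Image,
    region: Tuple[int, int, int, int],
    fill: Pixel = (255, 255, 255),
) -> Image:
    """Return a copy of `image` with red pixels inside `region` replaced.

    `region` is (top, left, bottom, right) with bottom/right exclusive.
    Replacing with `fill` (white by default) is an assumption of this
    sketch; the description speaks of setting the red-channel value to 0.
    """
    top, left, bottom, right = region
    out = [list(row) for row in image]  # copy so the input stays untouched
    for y in range(max(top, 0), min(bottom, len(out))):
        for x in range(max(left, 0), min(right, len(out[y]))):
            if is_red(out[y][x]):
                out[y][x] = fill
    return out
```

Restricting the operation to a preset area keeps other red content, such as the red separation line of a red-header file, available for later steps.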

It should be noted that the above preprocessing processes are all prior art, and detailed implementation thereof is not described again.

In addition, it is worth pointing out that before the subsequent operation is performed on the file to be extracted, if the file to be extracted is preprocessed to obtain the preprocessed file, the preprocessed file can be subsequently used as a processing object to perform the subsequent operation.

Step S120: segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text.

In the embodiment of the present application, the text segmentation model is a PixelLink model, and the PixelLink model has a text and non-text classification function.

Specifically, the PixelLink model uses a CNN (convolutional neural network) to predict, for each pixel, whether it is text or non-text, and to predict whether a link exists in each of the pixel's 8 neighborhood directions; as shown in fig. 2, the eight heatmaps in the dashed box represent the link predictions for the eight directions. The PixelLink model then applies the OpenCV minAreaRect (minimum-area bounding rectangle) operation to the connected domains to obtain text connected domains of different sizes. After the text connected domains of different sizes are obtained, the PixelLink model performs a noise filtering operation on them, and a plurality of bounded text boxes is then obtained through a union-find (disjoint-set) data structure. Each text box contains different text.
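The grouping step can be illustrated with a small union-find sketch in plain Python. It is not the PixelLink implementation: it assumes the per-pixel text/non-text predictions are already available as a 0/1 mask, treats every pair of 8-adjacent text pixels as linked (a real model predicts each link), returns axis-aligned boxes instead of the rotated rectangles produced by minAreaRect, and uses a hypothetical `min_pixels` parameter as the noise filter.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def find(parent: List[int], i: int) -> int:
    """Return the root of i, halving the path on the way up."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def group_text_pixels(mask: List[List[int]], min_pixels: int = 2) -> List[Box]:
    """Group 8-connected text pixels and return one bounding box per group."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    parent = list(range(h * w))

    # Visiting left, up-left, up and up-right covers every 8-neighbour pair once.
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                    a = find(parent, y * w + x)
                    b = find(parent, ny * w + nx)
                    if a != b:
                        parent[a] = b

    # Accumulate [left, top, right, bottom, pixel_count] per root.
    groups: Dict[int, List[int]] = {}
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                g = groups.setdefault(find(parent, y * w + x), [x, y, x, y, 0])
                g[0], g[1] = min(g[0], x), min(g[1], y)
                g[2], g[3] = max(g[2], x), max(g[3], y)
                g[4] += 1

    # Noise filter: drop groups with too few pixels.
    return sorted((g[0], g[1], g[2], g[3]) for g in groups.values() if g[4] >= min_pixels)
```

For example, a mask with one three-pixel group, one diagonal two-pixel group and one isolated pixel yields two boxes with the default `min_pixels=2`, because the isolated pixel is filtered out as noise.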

In the process of text segmentation of a file to be extracted or a preprocessed file by a text segmentation model, the PixelLink model may segment characters located in the same line into a plurality of text boxes. For a bill text, such as an invoice, if the text in the same line is divided into a plurality of text boxes, the subsequent text box recognition process may be affected.

To avoid this, in an optional embodiment, after obtaining the plurality of text boxes, the electronic device calculates the border height of each text box (the distance between its upper border and its lower border) and the vertical position of each text box (the vertical distance from its lower border to the bottom of the file to be extracted). Text boxes that are located on the same line (two text boxes are on the same line when their vertical positions are equal) and whose border heights differ by less than a threshold are then merged into one text box. The threshold can be set according to the actual situation.
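The merging rule can be sketched as follows. The `Box` type, the function name and the `line_tol` parameter are assumptions made for illustration; the description only requires equal vertical positions and a border-height difference below the threshold. Image coordinates are assumed, with y growing downward, so two boxes whose lower borders share the same `bottom` value are equally far from the bottom of the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int

    @property
    def border_height(self) -> int:
        """Distance between the upper and lower borders."""
        return self.bottom - self.top


def merge_same_line(boxes: List[Box], height_threshold: int, line_tol: int = 0) -> List[Box]:
    """Merge boxes on the same line whose border heights differ by < threshold.

    `line_tol` (hypothetical) relaxes the "equal vertical position" test;
    the default 0 follows the description literally.
    """
    merged: List[Box] = []
    for box in sorted(boxes, key=lambda b: (b.bottom, b.left)):
        if merged:
            last = merged[-1]
            same_line = abs(box.bottom - last.bottom) <= line_tol
            similar = abs(box.border_height - last.border_height) < height_threshold
            if same_line and similar:
                # Replace the previous box with the union of the two.
                merged[-1] = Box(
                    min(last.left, box.left),
                    min(last.top, box.top),
                    max(last.right, box.right),
                    max(last.bottom, box.bottom),
                )
                continue
        merged.append(box)
    return merged
```

With a threshold of 3, two boxes of heights 20 and 19 on the same line are merged, while a box of height 30 on that line stays separate.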

Step S130: recognizing each text box through a text recognition model to obtain the text content in each text box.

In the embodiment of the present application, the text recognition model is a CRNN (convolutional recurrent neural network).

Generally, a CRNN includes convolutional layers, recurrent layers, and an output layer.

The convolutional layers extract a feature sequence from the input image; the recurrent layers predict the label distribution of the feature sequence obtained from the convolutional layers; and the output layer converts the label distribution obtained from the recurrent layers into the final recognition result through operations such as de-duplication and integration. Of course, the CRNN needs to be trained in advance so that it can fully learn the features of various text contents.
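The de-duplication performed by the output layer can be illustrated with a greedy decoder of the kind commonly paired with CRNN models. That the model uses CTC-style decoding with a blank label is an assumption here, as are the function name, the blank index 0 and the example alphabet; the description only mentions "de-duplication and integration".

```python
from typing import Sequence

BLANK = 0  # assumed index of the blank label


def greedy_decode(label_dist: Sequence[Sequence[float]], alphabet: str) -> str:
    """Turn per-step label distributions into text.

    Class 0 is the blank; class i (i >= 1) maps to alphabet[i - 1].
    Consecutive repeats are collapsed first, then blanks are dropped.
    """
    # Most likely class at every time step.
    best = [max(range(len(step)), key=step.__getitem__) for step in label_dist]

    chars = []
    prev = BLANK
    for idx in best:
        if idx != BLANK and idx != prev:
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)
```

A blank between two identical labels is what allows genuinely doubled characters to survive the collapse: the label sequence a, a, blank, a decodes to "aa", not "a".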

In the embodiment of the application, the CRNN identifies the content in each text box, so that the influence of interference factors outside the text boxes on the identification accuracy can be reduced, and the overall identification accuracy can be improved.

In addition, to further improve the accuracy of the finally obtained text content, in an optional implementation manner, a text error correction model can be trained in advance and stored in the electronic device; the text error correction model has learned in advance a large number of text expression rules that conform to normal grammar. After the text content in each text box is recognized by the CRNN, the obtained text content may be input into the text error correction model, which identifies text content that may have been misrecognized so that it can be corrected.

In addition, a red-header file generally includes a red separation line that separates the file header from the file body. Therefore, when the file to be extracted is a red-header file, as an optional implementation manner, the file header and the file body of the red-header file can also be distinguished on the basis of the red separation line it contains. In this embodiment, referring to fig. 3, the method further includes:

Step S140: determining, from the file to be extracted, the position for representing the red separation line.

Because the length and color of the red separation line are generally similar across different red-header files, the red part can be extracted by searching the color channels of the red-header file; boundary detection is then performed on the extracted red part with the Canny edge detection algorithm, and the straight lines in the boundary detection result are detected with the Hough transform. After a plurality of straight lines is obtained, short noise lines are removed and adjacent lines are merged, and the longest remaining straight line is determined to be the red separation line, which gives the position for representing the red separation line.

Step S150: determining the file header and the file body of the red-header file by taking the position for representing the red separation line as a reference.
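The "longest red line" criterion can be illustrated with a deliberately simplified stand-in. The sketch below does not implement Canny edge detection or the Hough transform; it replaces both with a row scan that looks for the longest horizontal run of red pixels, which assumes the page has already been deskewed so that the separation line is horizontal. The redness test, its thresholds and the `min_fraction` noise filter are hypothetical.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


def is_red(pixel: Pixel, min_red: int = 150, margin: int = 60) -> bool:
    """Hypothetical redness test: red is strong and dominates G and B."""
    r, g, b = pixel
    return r >= min_red and r - max(g, b) >= margin


def longest_red_run(row: List[Pixel]) -> int:
    """Length of the longest run of consecutive red pixels in one row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if is_red(pixel) else 0
        best = max(best, current)
    return best


def find_separator_row(image: List[List[Pixel]], min_fraction: float = 0.5) -> Optional[int]:
    """Return the row index of the longest horizontal red run, or None.

    Runs shorter than `min_fraction` of the page width are treated as noise
    (for example, fragments of a stamp or red title text).
    """
    best_y, best_len = None, 0
    for y, row in enumerate(image):
        run = longest_red_run(row)
        if run > best_len:
            best_y, best_len = y, run
    if not image or best_len < min_fraction * len(image[0]):
        return None
    return best_y
```

The returned row index can serve as the position for representing the red separation line in the following step.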

After the position of the red separation line is determined, the electronic device may determine, based on the red separation line, the text content above the red separation line as a file header, and determine the content below the red separation line as a file body.
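A minimal sketch of this partition is given below. It assumes that each recognized text box carries its coordinates and text (the `TextBox` type is hypothetical), that y grows downward in image coordinates, and that a box is assigned by the vertical centre of its frame, a choice the description does not specify.

```python
from typing import List, NamedTuple, Tuple


class TextBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int
    text: str


def split_header_body(
    boxes: List[TextBox], separator_y: float
) -> Tuple[List[TextBox], List[TextBox]]:
    """Partition boxes into (header, body) relative to the separator row.

    Boxes whose vertical centre lies above `separator_y` go to the header;
    all others go to the body. Each part is returned in reading order.
    """
    header: List[TextBox] = []
    body: List[TextBox] = []
    for box in sorted(boxes, key=lambda b: (b.top, b.left)):
        centre_y = (box.top + box.bottom) / 2
        (header if centre_y < separator_y else body).append(box)
    return header, body
```

Using the centre rather than the top edge keeps a box that slightly overlaps the line on the side where most of its area lies.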

Step S160: respectively outputting the text content of the file header and the text content of the file body.

In this embodiment, steps S140 to S150 may be performed after step S130; that is, after the text content in each text box is determined, the text content of the file header and the text content of the file body of the red-header file are determined according to the determined position of the red separation line, and the two are then output respectively.

In another optional embodiment, before step S130 is executed, the position of the red separation line is determined according to the flow of steps S140 to S150, and the region where the file header of the red-header file is located and the region where the file body is located are then determined according to the red separation line. After the two regions are determined, the text content in each text box included in each region can be recognized according to actual requirements and then output. For example, in one embodiment, the user cares only about the text content of the file header; in that case, after the region where the file header is located is determined, the text content in each text box included in that region may be recognized, and only the text content of the file header is output.

According to the file content extraction method provided by the embodiment of the application, the electronic device acquires a file to be extracted; segments the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognizes each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes the content inside each text box, the influence of interference factors outside the text boxes on recognition accuracy is reduced, and overall recognition accuracy can be improved.

As shown in fig. 4, an embodiment of the present application further provides a file content extraction apparatus 400, where the file content extraction apparatus 400 may include: an acquisition module 410, a segmentation module 420, and a recognition module 430.

The acquisition module 410 is configured to acquire a file to be extracted;

the segmentation module 420 is configured to segment the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text;

and the recognition module 430 is configured to recognize each text box through a text recognition model to obtain the text content in each text box.

In a possible implementation manner, the file to be extracted is a red-header file that includes a red separation line, and the file content extraction apparatus 400 further includes a determination module and an output module.

The determination module is used for determining, from the file to be extracted, the position for representing the red separation line;

the determination module is further configured to determine a file header and a file body of the red-header file based on the position for representing the red separation line;

and the output module is used for respectively outputting the text content of the file header and the text content of the file body.

In a possible implementation manner, the file content extraction apparatus 400 further includes a calculation module and a merging module.

The calculation module is used for calculating the border height of each text box;

and the merging module is used for merging the text boxes that are located on the same line and whose border heights differ by less than a threshold into one text box.

In a possible implementation manner, the file content extraction apparatus 400 further includes a removing module, configured to remove an interference factor in the file to be extracted, so as to obtain a preprocessed file;

correspondingly, the segmentation module 420 is configured to segment the preprocessed file through the text segmentation model to obtain a plurality of text boxes including a text.

In a possible implementation manner, the removing module is configured to remove red content at a preset position of the file to be extracted.

In a possible implementation manner, the file content extracting apparatus 400 further includes an error correction module, configured to perform error correction on the text content in each text box through a pre-stored text error correction model.

The file content extraction apparatus 400 provided in the embodiment of the present application has the same implementation principle and technical effect as the foregoing method embodiments. For brevity, for matters not mentioned in the apparatus embodiment, reference may be made to the corresponding contents of the foregoing method embodiments.

In addition, the embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a computer, the steps included in the file content extraction method as described above are executed.

In addition, referring to fig. 5, an embodiment of the present application further provides an electronic device 100 for implementing the file content extraction method and apparatus of the embodiments of the present application, where the electronic device 100 may include a processor 110 and a memory 120.

Optionally, the electronic device 100 may be, but is not limited to, a personal computer (PC), a smart phone, a tablet computer, or a mobile Internet device (MID).

It should be noted that the components and structure of electronic device 100 shown in FIG. 5 are exemplary only, and not limiting, and electronic device 100 may have other components and structures as desired.

The processor 110, memory 120, and other components that may be present in the electronic device 100 are electrically connected to each other, directly or indirectly, to enable the transfer or interaction of data. For example, the processor 110, the memory 120, and other components that may be present may be electrically coupled to each other via one or more communication buses or signal lines.

The memory 120 is used for storing a program, such as a program corresponding to the foregoing file content extraction method or the foregoing file content extraction apparatus 400. Optionally, when the file content extraction apparatus 400 is stored in the memory 120, it includes at least one software functional module that can be stored in the memory 120 in the form of software or firmware.

Optionally, the software functional modules included in the file content extraction apparatus 400 may also be built into the operating system (OS) of the electronic device 100.

The processor 110 is used to execute executable modules stored in the memory 120, such as the software functional modules or computer programs included in the file content extraction apparatus 400. When the processor 110 receives an execution instruction, it may execute the computer program, for example, to perform: acquiring a file to be extracted; segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognizing each text box through a text recognition model to obtain the text content in each text box.

Of course, the method disclosed in any of the embodiments of the present application can be applied to the processor 110, or implemented by the processor 110.

In summary, in the file content extraction method, the file content extraction apparatus, the electronic device and the storage medium provided by the embodiments of the present invention, the electronic device acquires the file to be extracted; segments the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognizes each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes the content inside each text box, the influence of interference factors outside the text boxes on recognition accuracy is reduced, and overall recognition accuracy can be improved.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description covers only specific embodiments of the present application, and the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the scope of protection of the present application.

Claims

1. A file content extraction method, characterized in that the method comprises:

acquiring a file to be extracted;

segmenting the file to be extracted by a text segmentation model to obtain a plurality of text boxes containing text;

recognizing each text box by a text recognition model to obtain the text content in each text box.
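For illustration only, the three steps of claim 1 can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the implementation of the application: the names `extract_file_content` and `crop`, the `(left, top, right, bottom)` box layout, and the representation of the file as a row-major grid of pixel values are all hypothetical, and the two models are passed in as plain callables because the claim does not fix their form.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical box layout: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]
# Hypothetical image type: row-major grid of pixel values.
Image = Sequence[Sequence[object]]


def crop(image: Image, box: Box) -> List[list]:
    """Cut the region covered by one text box out of the page image."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def extract_file_content(
    image: Image,
    segment: Callable[[Image], List[Box]],
    recognize: Callable[[List[list]], str],
) -> List[Tuple[Box, str]]:
    """Segment the page into text boxes, then recognize each box separately.

    `segment` stands in for the text segmentation model and `recognize`
    for the text recognition model; both are assumptions of this sketch.
    """
    boxes = segment(image)
    # Recognition only ever sees the pixels inside a box, which is what
    # shields it from interference elsewhere on the page.
    return [(box, recognize(crop(image, box))) for box in boxes]
```

In practice `segment` and `recognize` would wrap trained models; here they can be any functions with the stated signatures, which also makes the pipeline easy to test with stubs.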

2. The method according to claim 1, wherein the file to be extracted is a red header file, the red header file includes a red dividing line, and the method further comprises:

determining, from the file to be extracted, a position characterizing the red dividing line;

determining the file header and the file body of the red header file based on the position characterizing the red dividing line;

outputting the text content of the file header and the text content of the file body separately.
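One possible way to locate the red dividing line and split the recognized text into header and body is sketched below. The claim does not say how the position is determined, so everything here is an assumption of the sketch: the RGB thresholds in `is_red`, the rule that the dividing line is the pixel row with the largest share of red pixels (at least `min_fraction`), and the rule that a box belongs to the header when its vertical centre lies above that row.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]        # (r, g, b), each 0..255
Box = Tuple[int, int, int, int]     # (left, top, right, bottom), hypothetical


def is_red(pixel: Pixel, min_red: int = 150, max_other: int = 100) -> bool:
    """Crude red test on RGB values; the thresholds are illustrative."""
    r, g, b = pixel
    return r >= min_red and g <= max_other and b <= max_other


def find_red_line_row(
    image: Sequence[Sequence[Pixel]], min_fraction: float = 0.6
) -> Optional[int]:
    """Return the row index with the largest share of red pixels, or None."""
    best_row, best_fraction = None, 0.0
    for y, row in enumerate(image):
        if not row:
            continue
        fraction = sum(is_red(p) for p in row) / len(row)
        if fraction >= min_fraction and fraction > best_fraction:
            best_row, best_fraction = y, fraction
    return best_row


def split_header_body(
    results: List[Tuple[Box, str]], line_row: int
) -> Tuple[List[str], List[str]]:
    """Assign each recognized box to header or body by its vertical centre."""
    header, body = [], []
    for (left, top, right, bottom), text in results:
        centre = (top + bottom) / 2
        (header if centre < line_row else body).append(text)
    return header, body
```

A row-wise red fraction is used because a dividing line spans most of the page width, whereas red title text usually covers only part of a row; a real document may need a more robust colour model.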

3. The method according to claim 1 or 2, characterized in that, after obtaining the plurality of text boxes containing text and before recognizing each text box by the text recognition model, the method further comprises:

calculating the frame height of each text box;

merging text boxes that are in the same row and whose frame heights differ by less than a threshold into one text box.
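The merging step of claim 3 can be sketched as follows. The claim does not define "the same row" or the threshold value, so this sketch assumes that two boxes are in the same row when their vertical overlap exceeds half the height of the shorter box, and uses an illustrative default threshold of 5 pixels; the function names and the `(left, top, right, bottom)` box layout are likewise hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), hypothetical


def height(box: Box) -> int:
    """Frame height of a text box."""
    return box[3] - box[1]


def same_row(a: Box, b: Box) -> bool:
    """Assumed row test: vertical overlap exceeds half the shorter height."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    return overlap > 0.5 * min(height(a), height(b))


def merge_row_boxes(boxes: List[Box], height_threshold: int = 5) -> List[Box]:
    """Merge boxes in the same row whose heights differ by less than the threshold."""
    merged: List[Box] = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, other in enumerate(merged):
            close_height = abs(height(box) - height(other)) < height_threshold
            if same_row(box, other) and close_height:
                # Replace the existing box with the union of the two.
                merged[i] = (
                    min(box[0], other[0]),
                    min(box[1], other[1]),
                    max(box[2], other[2]),
                    max(box[3], other[3]),
                )
                break
        else:
            merged.append(box)
    return merged
```

The height check keeps text of different sizes apart even when the boxes overlap vertically, for example a large title next to a smaller annotation.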

4. The method according to claim 1, characterized in that, after acquiring the file to be extracted and before segmenting the file to be extracted by the text segmentation model to obtain the plurality of text boxes containing text, the method further comprises:

removing interference factors in the file to be extracted to obtain a preprocessed file;

correspondingly, segmenting the file to be extracted by the text segmentation model to obtain the plurality of text boxes containing text comprises:

segmenting the preprocessed file by the text segmentation model to obtain the plurality of text boxes containing text.

5. The method according to claim 4, wherein removing the interference factors in the file to be extracted comprises:

removing red content at a preset position of the file to be extracted.
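As a rough illustration of claim 5, red content such as a seal could be removed by repainting red pixels inside a preset region with the background colour. This is only a sketch under assumptions: the RGB thresholds, the white background, the rectangular `(left, top, right, bottom)` region, and the function names are all hypothetical and not taken from the application.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]          # (r, g, b), each 0..255
Region = Tuple[int, int, int, int]    # (left, top, right, bottom), hypothetical


def is_red(pixel: Pixel, min_red: int = 150, max_other: int = 100) -> bool:
    """Crude red test on RGB values; the thresholds are illustrative."""
    r, g, b = pixel
    return r >= min_red and g <= max_other and b <= max_other


def remove_red_in_region(
    image: Sequence[Sequence[Pixel]],
    region: Region,
    background: Pixel = (255, 255, 255),
) -> List[List[Pixel]]:
    """Return a copy of the image with red pixels in the region repainted."""
    left, top, right, bottom = region
    cleaned = [list(row) for row in image]  # leave the input untouched
    for y in range(max(top, 0), min(bottom, len(cleaned))):
        row = cleaned[y]
        for x in range(max(left, 0), min(right, len(row))):
            if is_red(row[x]):
                row[x] = background
    return cleaned
```

Restricting the repaint to a preset region leaves red content elsewhere, such as the dividing line of a red header file, available for later steps.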

6. The method according to claim 1, wherein the method further comprises:

performing error correction on the text content in each text box by a pre-stored text error correction model.

7. A file content extraction device, wherein the file content extraction device comprises:

an acquisition module, configured to acquire a file to be extracted;

a segmentation module, configured to segment the file to be extracted by a text segmentation model to obtain a plurality of text boxes containing text;

a recognition module, configured to recognize each text box by a text recognition model to obtain the text content in each text box.

8. The device according to claim 7, wherein the file to be extracted is a red header file, the red header file includes a red dividing line, and the file content extraction device further comprises a determination module and an output module;

the determination module is configured to determine, from the file to be extracted, a position characterizing the red dividing line;

the determination module is further configured to determine the file header and the file body of the red header file based on the position characterizing the red dividing line;

the output module is configured to output the text content of the file header and the text content of the file body separately.

9. An electronic device, comprising a memory and a processor, wherein the memory is connected to the processor;

the memory is configured to store a program;

the processor is configured to invoke the program stored in the memory to perform the method according to any one of claims 1-6.

10. A storage medium, characterized in that a computer program is stored thereon, and the computer program, when run by a computer, performs the method according to any one of claims 1-6.