
CN111209865A - File content extraction method and device, electronic equipment and storage medium - Google Patents

File content extraction method and device, electronic equipment and storage medium

Info

Publication number
CN111209865A
Authority
CN
China
Prior art keywords
file
text
extracted
red
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010012359.6A
Other languages
Chinese (zh)
Inventor
刘小康
李健铨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dinfo Beijing Science Development Co ltd
Original Assignee
Dinfo Beijing Science Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dinfo Beijing Science Development Co ltd
Priority to CN202010012359.6A
Publication of CN111209865A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a file content extraction method and device, electronic equipment and a storage medium, and belongs to the field of text processing. In the method, the electronic equipment acquires a file to be extracted, segments the file through a text segmentation model to obtain a plurality of text boxes containing text, and then recognizes each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes only the content inside each text box, the influence of interference factors outside the text boxes on the recognition accuracy can be reduced, and the overall recognition accuracy can be improved.

Description

File content extraction method and device, electronic equipment and storage medium
Technical Field
The application belongs to the field of text processing, and particularly relates to a file content extraction method and device, electronic equipment and a storage medium.
Background
In recent years, character recognition and character understanding for image text have become active research topics.
Optical Character Recognition (OCR) is one of the most important text recognition techniques. It achieves high recognition accuracy on simple scanned text (for example, text with a uniform background and an orderly layout). In practical application scenarios, however, the text to be recognized is often complex: text formats vary, and the page may be wrinkled or shadowed. OCR therefore performs poorly in such scenarios and cannot meet the practical requirement of extracting text content.
Disclosure of Invention
In view of the above, an object of the present application is to provide a file content extraction method, an apparatus, an electronic device, and a storage medium, so as to provide a file content extraction scheme that can adapt to the complexity of an actual application scenario.
The embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides a file content extraction method, where the method includes:
acquiring a file to be extracted; segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognizing each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes only the content inside each text box, the influence of interference factors outside the text boxes on the recognition accuracy can be reduced, and the overall recognition accuracy can be improved.
With reference to the embodiment of the first aspect, in a possible implementation manner, the file to be extracted is a red-headed file that includes a red separation line, and the method further includes: determining the position of the red separation line in the file to be extracted; determining a file header and a file body of the red-headed file by taking the position of the red separation line as a reference; and outputting the text content of the file header and the text content of the file body respectively.
With reference to the embodiment of the first aspect, in a possible implementation manner, after obtaining the plurality of text boxes containing text and before recognizing each text box through the text recognition model, the method further includes: calculating the border height of each text box; and merging text boxes that are located on the same line and whose border heights differ by less than a threshold into one text box.
With reference to the embodiment of the first aspect, in a possible implementation manner, after acquiring the file to be extracted and before segmenting the file to be extracted through the text segmentation model to obtain a plurality of text boxes containing text, the method further includes: removing interference factors from the file to be extracted to obtain a preprocessed file;
correspondingly, the segmenting of the file to be extracted through the text segmentation model to obtain a plurality of text boxes containing text includes: segmenting the preprocessed file through the text segmentation model to obtain a plurality of text boxes containing text.
With reference to the embodiment of the first aspect, in a possible implementation manner, the removing of the interference factors from the file to be extracted includes: removing red content at a preset position of the file to be extracted.
With reference to the embodiment of the first aspect, in one possible implementation manner, the method further includes: correcting the text content in each text box through a pre-stored text error correction model.
In a second aspect, an embodiment of the present application provides a file content extraction apparatus, including an acquisition module, a segmentation module and a recognition module. The acquisition module is used for acquiring a file to be extracted; the segmentation module is used for segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and the recognition module is used for recognizing each text box through a text recognition model to obtain the text content in each text box.
With reference to the second aspect, in a possible implementation manner, the file to be extracted is a red-headed file that includes a red separation line, and the file content extraction apparatus further includes a determination module and an output module. The determination module is used for determining the position of the red separation line in the file to be extracted, and is further configured to determine a file header and a file body of the red-headed file based on the position of the red separation line; and the output module is used for outputting the text content of the file header and the text content of the file body respectively.
With reference to the second aspect, in a possible implementation manner, the file content extraction apparatus further includes a calculation module and a merging module. The calculation module is used for calculating the border height of each text box; and the merging module is used for merging text boxes that are located on the same line and whose border heights differ by less than a threshold into one text box.
With reference to the second aspect, in a possible implementation manner, the file content extraction apparatus further includes a removing module, configured to remove interference factors from the file to be extracted, so as to obtain a preprocessed file;
correspondingly, the segmentation module is configured to segment the preprocessed file through the text segmentation model to obtain a plurality of text boxes containing text.
With reference to the second aspect, in a possible implementation manner, the removing module is configured to remove red content at a preset position of the file to be extracted.
With reference to the second aspect, in a possible implementation manner, the file content extracting apparatus further includes an error correction module, configured to correct the text content in each text box through a pre-stored text correction model.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory and a processor, the memory and the processor connected; the memory is used for storing programs; the processor calls a program stored in the memory to perform the method of the first aspect embodiment and/or any possible implementation manner of the first aspect embodiment.
In a fourth aspect, the present application further provides a non-volatile computer-readable storage medium (hereinafter, referred to as a storage medium), on which a computer program is stored, where the computer program is executed by a computer to perform the method in the foregoing first aspect and/or any possible implementation manner of the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort. The foregoing and other objects, features and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily drawn to scale; emphasis is instead placed on illustrating the subject matter of the present application.
Fig. 1 shows a first flowchart of a file content extraction method provided in an embodiment of the present application.
Fig. 2 shows an operation diagram of a PixelLink model provided in an embodiment of the present application.
Fig. 3 shows a second flowchart of a file content extraction method provided in the embodiment of the present application.
Fig. 4 shows a block diagram of a file content extracting apparatus according to an embodiment of the present application.
Fig. 5 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that like reference numbers and letters refer to like items in the following figures; once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, relational terms such as "first" and "second" may be used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application describes only an association relationship between associated objects and means that three kinds of relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
The embodiment of the application provides a file content extraction method and device, electronic equipment and a storage medium, so that file content applied to an actual scene can be extracted conveniently. The technology can be realized by adopting corresponding software, hardware and a combination of software and hardware. The following describes embodiments of the present application in detail.
The following description will be directed to a file content extraction method provided in the present application.
Referring to fig. 1, an embodiment of the present application provides a file content extraction method applied to an electronic device. The steps involved will be described below with reference to fig. 1.
Step S110: acquiring the file to be extracted.
In the embodiment of the present application, the file to be extracted may be in a picture format or in PDF (Portable Document Format).
In addition, the file to be extracted may be an ordinary file or a red-headed file issued by an official body.
As an optional implementation manner, in order to improve the identification accuracy of the subsequent identification process, after the file to be extracted is obtained, the file to be extracted may be preprocessed to obtain a preprocessed file.
The preprocessing includes, but is not limited to, at least one of: removing a watermark from the file to be extracted; removing shadows caused by lighting when the text image was captured; and correcting the tilt of the file to be extracted.
Watermark removal and shadow removal can be realized by applying Gaussian blur to the file to be extracted and then dynamically adjusting its binarization.
Tilt correction can be realized as follows: an image of the file to be extracted in the frequency domain is first obtained through a Fourier transform; the tilt angle of a straight line in the frequency domain is then obtained through a Hough line transform; and the tilt of the file to be extracted is then corrected by adjusting for that angle.
In addition, when the file to be extracted is a red-headed file, the preprocessing may further include removing red content in a preset area, where the red content includes, but is not limited to, a red stamp in the file to be extracted.
In a red-headed file, the red stamp is generally located at the lower left corner or the lower right corner of the file. The red channel of a preset area (for example, the lower left corner and the lower right corner) of the file to be extracted can therefore be detected, and the red channel value of the preset area is then set to 0, thereby removing the red stamp.
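The red-channel adjustment described above can be sketched as follows. This is only a minimal illustration that follows the description literally, not the applicant's implementation: the image is assumed to be a nested list of [r, g, b] values, and the function names, the region format and the 25% corner size are hypothetical choices made for the example.

```python
def bottom_corner_regions(height, width, fraction=0.25):
    """Hypothetical helper: lower-left and lower-right corner boxes.

    Each region is (top, left, bottom, right) with bottom/right exclusive.
    The 25% default is an assumption, not a value from the source.
    """
    h = max(1, int(height * fraction))
    w = max(1, int(width * fraction))
    return [
        (height - h, 0, height, w),              # lower-left corner
        (height - h, width - w, height, width),  # lower-right corner
    ]


def remove_red_in_regions(image, regions):
    """Set the red channel to 0 inside each preset region.

    image: list of rows, each row a list of [r, g, b] values (0-255).
    Returns a new image; the input is left unchanged.
    """
    result = [[list(pixel) for pixel in row] for row in image]
    for top, left, bottom, right in regions:
        for y in range(top, bottom):
            for x in range(left, right):
                result[y][x][0] = 0  # red channel value adjusted to 0
    return result
```

Note that setting the channel to 0 also changes non-red pixels inside the region, so a practical pipeline might restrict the change to pixels in which red dominates; that refinement is not described in the source.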
It should be noted that the above preprocessing processes are all prior art, and detailed implementation thereof is not described again.
In addition, it is worth pointing out that before the subsequent operation is performed on the file to be extracted, if the file to be extracted is preprocessed to obtain the preprocessed file, the preprocessed file can be subsequently used as a processing object to perform the subsequent operation.
Step S120: segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text.
In the embodiment of the present application, the text segmentation model is a PixelLink model, which has a text/non-text classification function.
Specifically, the PixelLink model uses a CNN (convolutional neural network) to predict, for each pixel, whether the pixel is text or non-text, and to predict whether a link exists in each of the 8 neighborhood directions of the pixel. As shown in fig. 2, the eight heatmaps in the dashed box represent the link predictions for the eight directions. The PixelLink model then performs the minAreaRect (minimum-area bounding rectangle) operation of OpenCV on the connected domains to obtain text connected domains of different sizes. After the text connected domains of different sizes are obtained, the PixelLink model performs a noise filtering operation on them, and a plurality of text boxes with border boundaries are then obtained through a union-find (disjoint-set) data structure. Each text box contains different text.
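The grouping of linked text pixels with a disjoint-set structure can be illustrated with the sketch below. It is a simplified stand-in, not the PixelLink implementation: the pixel mask and the link set are assumed to come from the network's predictions, the helper `neighbor_links` simply assumes a positive link between any two 8-adjacent text pixels, and axis-aligned boxes are returned in place of the minimum-area rectangles mentioned above. All names are hypothetical.

```python
def find(parent, node):
    """Return the root of node, compressing the path along the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def union(parent, a, b):
    """Merge the sets that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def neighbor_links(mask):
    """Assumed stand-in for link predictions: link all 8-adjacent text pixels."""
    links = set()
    h, w = len(mask), len(mask[0])
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                        links.add(((y, x), (ny, nx)))
    return links


def text_boxes(mask, links):
    """Group text pixels joined by links; return (top, left, bottom, right) boxes."""
    h, w = len(mask), len(mask[0])
    parent = {(y, x): (y, x) for y in range(h) for x in range(w) if mask[y][x]}
    for a, b in links:
        if a in parent and b in parent:
            union(parent, a, b)
    boxes = {}
    for y, x in parent:
        root = find(parent, (y, x))
        t, l, b, r = boxes.get(root, (y, x, y, x))
        boxes[root] = (min(t, y), min(l, x), max(b, y), max(r, x))
    return sorted(boxes.values())
```

For a mask with two separate groups of text pixels, `text_boxes(mask, neighbor_links(mask))` returns one inclusive bounding box per group.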
In the process of text segmentation of a file to be extracted or a preprocessed file by a text segmentation model, the PixelLink model may segment characters located in the same line into a plurality of text boxes. For a bill text, such as an invoice, if the text in the same line is divided into a plurality of text boxes, the subsequent text box recognition process may be affected.
To avoid this, in an alternative embodiment, after obtaining the plurality of text boxes, the electronic device calculates the border height of each text box (the distance between its upper border and its lower border) and the height of each text box (the vertical distance from its lower border to the bottom of the file to be extracted). Text boxes that are located on the same line (two text boxes are on the same line when their heights are equal) and whose border heights differ by less than a threshold are then merged into one text box. The threshold can be set according to the actual situation.
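The merging rule above can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are (left, top, right, bottom) tuples in image coordinates with y growing downward, `page_height` is the height of the file to be extracted, and the function names are hypothetical.

```python
def merge_same_line_boxes(boxes, page_height, threshold):
    """Merge boxes on the same line whose outline (border) heights are similar.

    boxes: list of (left, top, right, bottom), y grows downward.
    Same line: equal distance from the lower border to the page bottom.
    Similar: outline heights differ by less than threshold.
    """
    def box_height(box):
        return page_height - box[3]  # lower border to bottom of the page

    def outline_height(box):
        return box[3] - box[1]       # upper border to lower border

    merged = []
    # Top-to-bottom, then left-to-right, so same-line boxes are adjacent.
    for box in sorted(boxes, key=lambda b: (b[3], b[0])):
        if merged:
            last = merged[-1]
            same_line = box_height(box) == box_height(last)
            similar = abs(outline_height(box) - outline_height(last)) < threshold
            if same_line and similar:
                merged[-1] = (
                    min(last[0], box[0]),
                    min(last[1], box[1]),
                    max(last[2], box[2]),
                    max(last[3], box[3]),
                )
                continue
        merged.append(box)
    return merged
```

With `page_height=300` and `threshold=5`, the boxes `(10, 100, 50, 120)` and `(60, 101, 120, 120)` share a lower border and differ in outline height by 1, so they are merged into `(10, 100, 120, 120)`.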
Step S130: recognizing each text box through a text recognition model to obtain the text content in each text box.
In the embodiment of the present application, the text recognition model is a CRNN (convolutional recurrent neural network).
Generally, a CRNN includes a convolutional layer, a recurrent layer, and an output layer.
The convolutional layer extracts a feature sequence from the input image; the recurrent layer predicts the label distribution of the feature sequence obtained from the convolutional layer; and the output layer converts the label distribution obtained from the recurrent layer into the final recognition result through operations such as de-duplication and integration. Of course, the CRNN needs to be trained in advance so that it can fully learn the features of various text contents.
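The de-duplication performed by the output layer can be illustrated with a greedy, CTC-style decoder. The source does not name the decoding scheme; CTC is assumed here because it is the transcription commonly paired with CRNN. The function name, the score format and the convention that index 0 is the blank label are assumptions made for the example.

```python
def greedy_decode(frame_scores, alphabet, blank=0):
    """Collapse per-frame label scores into a string.

    frame_scores: list of per-frame score lists, one score per label.
    alphabet: sequence mapping label index to character; alphabet[blank]
              is a placeholder that is never emitted.
    Steps: take the best label per frame, drop consecutive repeats,
    then drop blank labels.
    """
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    chars = []
    previous = blank
    for index in best:
        if index != previous and index != blank:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)
```

A blank between two identical labels keeps them as two characters, which is why the frame sequence a, a, blank, a, b, b decodes to "aab" rather than "ab".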
In the embodiment of the application, the CRNN recognizes only the content inside each text box, so that the influence of interference factors outside the text boxes on the recognition accuracy can be reduced, and the overall recognition accuracy can be improved.
In addition, in order to further improve the accuracy of the finally obtained text content, in an optional implementation manner, a text error correction model can be trained and stored in the electronic device in advance, and a large number of text expression rules corresponding to normal grammars are learned in advance by the text error correction model. After the text content in each text box is recognized based on the CRNN, the obtained text content may be input into a text error correction model, so that the text error correction model identifies text content that may have been erroneously recognized, thereby facilitating error correction of the erroneously recognized text content.
In addition, the red header file generally includes a red separation line for distinguishing a file header and a file body of the text, and therefore, when the file to be extracted is the red header file, as an optional implementation manner, the file header and the file body of the red header file can also be distinguished with respect to the red separation line included in the red header file. In this embodiment, referring to fig. 3, the method further comprises:
Step S140: determining the position of the red separation line in the file to be extracted.
Because the length and the color of the red separation line are generally similar across different red-headed files, the red part can be extracted by searching the color channels of the red-headed file. Boundary detection is then performed on the extracted red part with the Canny edge detection algorithm, and the straight lines in the boundary detection result are detected with the Hough transform. After a plurality of straight lines are obtained, short noise lines are removed and adjacent lines are merged, and the longest remaining straight line in the red part is determined to be the red separation line, which gives the position of the red separation line.
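The final selection step (removing short noise lines, merging adjacent lines and keeping the longest) can be sketched as below. The sketch assumes the Canny and Hough stages have already produced roughly horizontal segments in the form (x_start, x_end, y); the function name and the default values of `min_length`, `max_gap` and `y_tol` are hypothetical and would need tuning.

```python
def pick_separator(segments, min_length=20, max_gap=10, y_tol=2):
    """Return the longest merged segment, or None if nothing survives.

    segments: list of (x_start, x_end, y) horizontal line segments.
    min_length: segments shorter than this are treated as noise.
    max_gap: horizontal gap below which neighbouring segments are merged.
    y_tol: vertical tolerance for treating segments as the same line.
    """
    kept = [s for s in segments if s[1] - s[0] >= min_length]
    merged = []
    for x0, x1, y in sorted(kept):  # left to right
        for i, (m0, m1, my) in enumerate(merged):
            if abs(y - my) <= y_tol and x0 <= m1 + max_gap:
                merged[i] = (m0, max(m1, x1), my)  # extend the existing line
                break
        else:
            merged.append((x0, x1, y))
    return max(merged, key=lambda s: s[1] - s[0], default=None)
```

The y coordinate of the returned segment can then serve as the position of the red separation line.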
Step S150: determining the file header and the file body of the red-headed file by taking the position of the red separation line as a reference.
After the position of the red separation line is determined, the electronic device may determine, based on the red separation line, the text content above the red separation line as a file header, and determine the content below the red separation line as a file body.
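A minimal sketch of this split is given below. It assumes each recognized text box is a (text, top, bottom) tuple in image coordinates with y growing downward, and it assigns a box by its vertical center, which is a choice made for the example and not stated in the source. The function name is hypothetical.

```python
def split_by_separator(boxes, line_y):
    """Split recognized boxes into header and body texts.

    boxes: list of (text, top, bottom), y grows downward.
    line_y: vertical position of the red separation line.
    A box whose vertical center lies above the line goes to the header.
    """
    header, body = [], []
    for text, top, bottom in boxes:
        center = (top + bottom) / 2
        (header if center < line_y else body).append(text)
    return header, body
```

When only the file header is of interest, the body list can simply be ignored.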
Step S160: outputting the text content of the file header and the text content of the file body respectively.
In this embodiment, steps S140 to S150 may be performed after step S130. That is, after the text content in each text box is determined, the text content of the file header and the text content of the file body of the red-headed file are determined according to the position of the red separation line, and the text content of the file header and the text content of the file body are then output respectively.
In addition, in another alternative embodiment, before step S130 is executed, the position of the red separation line is determined according to the flow of steps S140 to S150, and the region where the file header of the red-headed file is located and the region where the file body is located are then determined according to the red separation line. After these two regions are determined, the text content in each text box included in each region can be recognized according to actual requirements and then output. For example, in one embodiment, the user cares only about the text content of the file header; in that case, after the region where the file header is located is determined, only the text boxes in that region are recognized, and only the text content of the file header is output.
According to the file content extraction method provided by the embodiment of the application, the electronic equipment acquires a file to be extracted, segments the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text, and then recognizes each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes only the content inside each text box, the influence of interference factors outside the text boxes on the recognition accuracy can be reduced, and the overall recognition accuracy can be improved.
As shown in fig. 4, an embodiment of the present application further provides a file content extraction apparatus 400, where the file content extraction apparatus 400 may include: an acquisition module 410, a segmentation module 420, and a recognition module 430.
The acquisition module 410 is configured to acquire a file to be extracted;
the segmentation module 420 is configured to segment the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text;
and the recognition module 430 is configured to recognize each text box through a text recognition model to obtain the text content in each text box.
In a possible implementation manner, the file to be extracted is a red-headed file that includes a red separation line, and the file content extraction apparatus 400 further includes a determination module and an output module.
The determination module is used for determining the position of the red separation line in the file to be extracted;
the determination module is further configured to determine a file header and a file body of the red-headed file based on the position of the red separation line;
and the output module is used for outputting the text content of the file header and the text content of the file body respectively.
In a possible implementation, the file content extracting apparatus 400 further includes a calculating module and a merging module.
The calculation module is used for calculating the height of the frame line of each text box;
and the merging module is used for merging the text boxes which are positioned in the same line and have the height difference of the frame lines smaller than the threshold value into one text box.
In a possible implementation manner, the file content extraction apparatus 400 further includes a removing module, configured to remove an interference factor in the file to be extracted, so as to obtain a preprocessed file;
correspondingly, the segmentation module 420 is configured to segment the preprocessed file through the text segmentation model to obtain a plurality of text boxes including a text.
In a possible implementation manner, the removing module is configured to remove red content at a preset position of the file to be extracted.
In a possible implementation manner, the file content extracting apparatus 400 further includes an error correction module, configured to perform error correction on the text content in each text box through a pre-stored text error correction model.
The file content extraction apparatus 400 provided in the embodiment of the present application has the same implementation principle and technical effect as the foregoing method embodiments. For the sake of brevity, for anything not mentioned in the apparatus embodiment, reference may be made to the corresponding contents in the foregoing method embodiments.
In addition, the embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a computer, the steps included in the file content extraction method as described above are executed.
In addition, please refer to fig. 5, an embodiment of the present application further provides an electronic device 100 for implementing the file content extracting method and apparatus of the embodiment of the present application, where the electronic device 100 may include: a processor 110, a memory 120.
Optionally, the electronic device 100 may be, but is not limited to, a Personal Computer (PC), a smart phone, a tablet computer, or a Mobile Internet Device (MID).
It should be noted that the components and structure of electronic device 100 shown in FIG. 5 are exemplary only, and not limiting, and electronic device 100 may have other components and structures as desired.
The processor 110, memory 120, and other components that may be present in the electronic device 100 are electrically connected to each other, directly or indirectly, to enable the transfer or interaction of data. For example, the processor 110, the memory 120, and other components that may be present may be electrically coupled to each other via one or more communication buses or signal lines.
The memory 120 is used for storing a program, such as a program corresponding to the foregoing file content extraction method or the foregoing file content extraction apparatus 400. Optionally, when the file content extraction apparatus 400 is stored in the memory 120, the file content extraction apparatus 400 includes at least one software functional module that can be stored in the memory 120 in the form of software or firmware.
Alternatively, the software functional modules included in the file content extraction apparatus 400 may also be embedded in an operating system (OS) of the electronic device 100.
The processor 110 is used to execute executable modules stored in the memory 120, such as the software functional modules or computer programs included in the file content extraction apparatus 400. On receiving an execution instruction, the processor 110 may execute the computer program, for example, to: acquire a file to be extracted; segment the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognize each text box through a text recognition model to obtain the text content in each text box.
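The three steps executed by the processor can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions: the names `TextBox`, `crop`, and `extract_file_content` are hypothetical, the page is modeled as a plain grid of pixel values, and the text segmentation model and text recognition model are passed in as ordinary callables because the source does not specify their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a page image as rows of pixel values.
Image = Sequence[Sequence[int]]


@dataclass(frozen=True)
class TextBox:
    """Axis-aligned text box (hypothetical type, not defined in the source)."""
    x: int
    y: int
    width: int
    height: int


def crop(image: Image, box: TextBox) -> List[List[int]]:
    """Return the pixels of `image` that fall inside `box`."""
    rows = image[box.y:box.y + box.height]
    return [list(row[box.x:box.x + box.width]) for row in rows]


def extract_file_content(
    image: Image,
    segment: Callable[[Image], List[TextBox]],
    recognize: Callable[[List[List[int]]], str],
) -> List[Tuple[TextBox, str]]:
    """Segment the page into text boxes, then recognize each box separately.

    `segment` stands in for the text segmentation model and `recognize`
    for the text recognition model; both are assumptions of this sketch.
    """
    boxes = segment(image)
    # Only the pixels inside each box are handed to the recognizer, so
    # content outside the boxes cannot affect the recognition result.
    return [(box, recognize(crop(image, box))) for box in boxes]
```

Passing the models in as callables keeps the sketch independent of any particular detection or OCR library; any pair of functions with these shapes could be substituted.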
Of course, the method disclosed in any of the embodiments of the present application can be applied to the processor 110, or implemented by the processor 110.
In summary, in the file content extraction method and apparatus, the electronic device, and the storage medium provided by the embodiments of the present application, the electronic device acquires the file to be extracted, segments it through a text segmentation model to obtain a plurality of text boxes containing text, and then recognizes each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes only the content inside each text box, the influence of interference factors outside the text boxes on recognition accuracy can be reduced, which can improve the overall recognition accuracy.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as separate products, they may be stored in a storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application.

Claims (10)

1. A file content extraction method, characterized in that the method comprises: acquiring a file to be extracted; segmenting the file to be extracted by a text segmentation model to obtain a plurality of text boxes containing text; and recognizing each text box by a text recognition model to obtain the text content in each text box.

2. The method according to claim 1, characterized in that the file to be extracted is a red header file, the red header file includes a red dividing line, and the method further comprises: determining, from the file to be extracted, a position used to characterize the red dividing line; determining a file header and a file body of the red header file, taking the position used to characterize the red dividing line as a reference; and outputting the text content of the file header and the text content of the file body respectively.

3. The method according to claim 1 or 2, characterized in that, after obtaining the plurality of text boxes containing text and before recognizing each text box by the text recognition model, the method further comprises: calculating a border height of each text box; and merging text boxes that are located in the same row and whose border heights differ by less than a threshold into one text box.

4. The method according to claim 1, characterized in that, after acquiring the file to be extracted and before segmenting the file to be extracted by the text segmentation model to obtain the plurality of text boxes containing text, the method further comprises: removing interference factors from the file to be extracted to obtain a preprocessed file; correspondingly, segmenting the file to be extracted by the text segmentation model to obtain the plurality of text boxes containing text comprises: segmenting the preprocessed file by the text segmentation model to obtain the plurality of text boxes containing text.

5. The method according to claim 4, characterized in that removing the interference factors from the file to be extracted comprises: removing red content at a preset position of the file to be extracted.

6. The method according to claim 1, characterized in that the method further comprises: performing error correction on the text content in each text box by a pre-stored text error correction model.

7. A file content extraction apparatus, characterized in that the file content extraction apparatus comprises: an acquisition module, configured to acquire a file to be extracted; a segmentation module, configured to segment the file to be extracted by a text segmentation model to obtain a plurality of text boxes containing text; and a recognition module, configured to recognize each text box by a text recognition model to obtain the text content in each text box.

8. The apparatus according to claim 7, characterized in that the file to be extracted is a red header file, the red header file includes a red dividing line, and the file content extraction apparatus further comprises a determination module and an output module; the determination module is configured to determine, from the file to be extracted, a position used to characterize the red dividing line; the determination module is further configured to determine a file header and a file body of the red header file, taking the position used to characterize the red dividing line as a reference; and the output module is configured to output the text content of the file header and the text content of the file body respectively.

9. An electronic device, characterized by comprising a memory and a processor, the memory being connected to the processor; the memory is used to store a program; and the processor invokes the program stored in the memory to perform the method according to any one of claims 1-6.

10. A storage medium, characterized in that a computer program is stored thereon, and the computer program, when run by a computer, performs the method according to any one of claims 1-6.
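The merging step recited in claim 3 (compute each text box's height, then merge boxes in the same row whose heights differ by less than a threshold) can be sketched as follows. This is an illustrative sketch, not the applicant's implementation: the names `Box`, `same_row`, `union`, and `merge_row_boxes` are hypothetical, and the "same row" test used here (each box's vertical center lies within the other box's vertical extent) is an assumption, since the claims do not define how rows are determined.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned text box in pixel coordinates (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def height(self) -> int:
        return self.bottom - self.top


def same_row(a: Box, b: Box) -> bool:
    # Assumption: two boxes share a row when each box's vertical center
    # falls inside the other box's vertical extent.
    centre_a = (a.top + a.bottom) / 2
    centre_b = (b.top + b.bottom) / 2
    return b.top <= centre_a <= b.bottom and a.top <= centre_b <= a.bottom


def union(a: Box, b: Box) -> Box:
    """Smallest box that encloses both `a` and `b`."""
    return Box(min(a.left, b.left), min(a.top, b.top),
               max(a.right, b.right), max(a.bottom, b.bottom))


def merge_row_boxes(boxes: List[Box], height_threshold: int) -> List[Box]:
    """Greedily merge same-row boxes whose heights differ by < threshold."""
    merged: List[Box] = []
    # Visit boxes from left to right so each row grows in reading order.
    for box in sorted(boxes, key=lambda b: (b.left, b.top)):
        for i, candidate in enumerate(merged):
            close_height = abs(candidate.height - box.height) < height_threshold
            if same_row(candidate, box) and close_height:
                merged[i] = union(candidate, box)
                break
        else:
            # No compatible box found: keep this one as a separate box.
            merged.append(box)
    return sorted(merged, key=lambda b: (b.top, b.left))
```

The height check keeps text of clearly different sizes in one row, such as a large title next to small annotations, in separate boxes even though they share a row under the assumed test.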
CN202010012359.6A 2020-01-06 2020-01-06 File content extraction method and device, electronic equipment and storage medium Pending CN111209865A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010012359.6A CN111209865A (en) 2020-01-06 2020-01-06 File content extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010012359.6A CN111209865A (en) 2020-01-06 2020-01-06 File content extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111209865A true CN111209865A (en) 2020-05-29

Family

ID=70786609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010012359.6A Pending CN111209865A (en) 2020-01-06 2020-01-06 File content extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111209865A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Zone B, 19 / F, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co.,Ltd.

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200529
