CN111209865A - File content extraction method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN111209865A (CN202010012359.6A)
- Authority
- CN
- China
- Prior art keywords
- file
- text
- extracted
- red
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a file content extraction method and device, electronic equipment and a storage medium, and belongs to the field of text processing. The method comprises the following steps: the electronic equipment acquires a file to be extracted; segments the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognizes each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes the content inside each text box, the influence of interference factors outside the text boxes on recognition accuracy is reduced, and overall recognition accuracy can be improved.
Description
Technical Field
The application belongs to the field of word processing, and particularly relates to a file content extraction method and device, electronic equipment and a storage medium.
Background
In recent years, character recognition and text understanding for image text have become active research topics.
Optical Character Recognition (OCR) is one of the most important approaches to text recognition and achieves high recognition accuracy on simple scanned text (for example, text with a uniform background and an orderly layout). In practical application scenarios, however, the text to be recognized is often complex: formats vary, and the page may be wrinkled or shadowed. The recognition results obtained when OCR is applied to such scenarios are poor, so OCR cannot meet the practical need to extract text content.
Disclosure of Invention
In view of the above, an object of the present application is to provide a file content extraction method, an apparatus, an electronic device, and a storage medium, so as to provide a file content extraction scheme that can adapt to the complexity of an actual application scenario.
The embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides a file content extraction method, where the method includes:
acquiring a file to be extracted; segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text; and recognizing each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes the content inside each text box, the influence of interference factors outside the text boxes on recognition accuracy is reduced, and overall recognition accuracy can be improved.
With reference to the embodiment of the first aspect, in a possible implementation manner, the file to be extracted is a red-headed file, and the red-headed file includes a red separation line, and the method further includes: determining the position for representing the red separation line from the file to be extracted; determining a file header and a file main body of the red header file by taking the position for representing the red separation line as a reference; and respectively outputting the text content of the file header and the text content of the file main body.
With reference to the embodiment of the first aspect, in a possible implementation manner, after obtaining the plurality of text boxes containing text and before recognizing each text box through the text recognition model, the method further includes: calculating the frame-line height of each text box; and merging text boxes that are located on the same line and whose frame-line height difference is smaller than a threshold into one text box.
With reference to the embodiment of the first aspect, in a possible implementation manner, after the obtaining the file to be extracted, before the segmenting the file to be extracted by using the text segmentation model to obtain a plurality of text boxes including a text, the method further includes: removing interference factors in the file to be extracted to obtain a preprocessed file;
correspondingly, the segmenting the file to be extracted through the text segmentation model to obtain a plurality of text boxes containing characters includes: and segmenting the preprocessed file through the text segmentation model to obtain a plurality of text boxes containing texts.
With reference to the embodiment of the first aspect, in a possible implementation manner, the removing the interference factor in the file to be extracted includes: and removing the red content of the preset position of the file to be extracted.
With reference to the embodiment of the first aspect, in one possible implementation manner, the method further includes: and correcting the text content in each text box through a pre-stored text correction model.
In a second aspect, an embodiment of the present application provides a file content extracting apparatus, including: the device comprises an acquisition module, a segmentation module and an identification module. The acquisition module is used for acquiring a file to be extracted; the segmentation module is used for segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing texts; and the recognition module is used for recognizing each text box through the text recognition model to obtain the text content in each text box.
With reference to the second aspect, in a possible implementation manner, the file to be extracted is a red-headed file, the red-headed file includes a red separation line, and the file content extraction apparatus further includes a determination module and an output module. The determining module is used for determining the position for representing the red separation line from the file to be extracted; the determining module is further configured to determine a file header and a file body of the red-header file based on the position for representing the red separation line; and the output module is used for respectively outputting the text content of the file header and the text content of the file main body.
With reference to the second aspect, in a possible implementation manner, the file content extracting apparatus further includes a calculating module and a merging module. The calculation module is used for calculating the height of the frame line of each text box; and the merging module is used for merging the text boxes which are positioned in the same line and have the height difference of the frame lines smaller than the threshold value into one text box.
With reference to the second aspect, in a possible implementation manner, the file content extracting apparatus further includes a removing module, configured to remove an interference factor in the file to be extracted, so as to obtain a preprocessed file;
correspondingly, the segmentation module is configured to segment the preprocessed file through the text segmentation model to obtain a plurality of text boxes including a text.
With reference to the second aspect, in a possible implementation manner, the removing module is configured to remove red content at a preset position of the file to be extracted.
With reference to the second aspect, in a possible implementation manner, the file content extracting apparatus further includes an error correction module, configured to correct the text content in each text box through a pre-stored text correction model.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory and a processor, the memory and the processor connected; the memory is used for storing programs; the processor calls a program stored in the memory to perform the method of the first aspect embodiment and/or any possible implementation manner of the first aspect embodiment.
In a fourth aspect, the present application further provides a non-volatile computer-readable storage medium (hereinafter, referred to as a storage medium), on which a computer program is stored, where the computer program is executed by a computer to perform the method in the foregoing first aspect and/or any possible implementation manner of the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort. The foregoing and other objects, features and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily drawn to scale; emphasis is instead placed on illustrating the subject matter of the present application.
Fig. 1 shows a first flowchart of a file content extraction method provided in an embodiment of the present application.
Fig. 2 shows an operation diagram of a PixelLink model provided in an embodiment of the present application.
Fig. 3 shows a second flowchart of a file content extraction method provided in the embodiment of the present application.
Fig. 4 shows a block diagram of a file content extracting apparatus according to an embodiment of the present application.
Fig. 5 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that like reference numbers and letters refer to like items in the following figures; once an item is defined in one figure, it need not be defined and explained again in subsequent figures. Relational terms such as "first" and "second" may be used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between them. The terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
The embodiment of the application provides a file content extraction method and device, electronic equipment and a storage medium, so that file content applied to an actual scene can be extracted conveniently. The technology can be realized by adopting corresponding software, hardware and a combination of software and hardware. The following describes embodiments of the present application in detail.
The following description will be directed to a file content extraction method provided in the present application.
Referring to fig. 1, an embodiment of the present application provides a file content extraction method applied to an electronic device. The steps involved will be described below with reference to fig. 1.
Step S110: acquire the file to be extracted.
In the embodiment of the present application, the file to be extracted may be in a picture format or in PDF (Portable Document Format).
In addition, the file to be extracted can be a common file or a red header file issued by an official party.
As an optional implementation manner, in order to improve the identification accuracy of the subsequent identification process, after the file to be extracted is obtained, the file to be extracted may be preprocessed to obtain a preprocessed file.
The preprocessing includes, but is not limited to, at least one of: removing a watermark from the file to be extracted; removing shadows caused by lighting when the text image was captured; and correcting the skew of the file to be extracted.
Watermark removal and shadow removal can be realized by applying Gaussian blur to the file to be extracted and then dynamically adjusting its binarization value.
Skew correction can proceed as follows: first obtain the frequency-domain image of the file to be extracted through a Fourier transform, then obtain the inclination angle of the straight line in the frequency domain through a Hough line transform, and finally compensate for that inclination angle so that the skew in the file to be extracted is corrected.
In addition, when the file to be extracted is a red-headed file, the preprocessed content may further include removing red content in the preset area, where the red content includes, but is not limited to, a red stamp included in the file to be extracted.
For a red-headed file, the red stamp of the file is generally positioned at the lower left corner or the lower right corner of the file, so that a red channel of a preset area (for example, the lower left corner and the lower right corner) of the file to be extracted can be detected, and then the red channel value of the preset area is adjusted to 0, thereby achieving the purpose of removing the red stamp.
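The corner-region step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the image is assumed to be a nested list of (r, g, b) tuples, and the helper names `bottom_corners` and `zero_red_in_regions`, the 25% corner size, and the `min_red` and `margin` thresholds are all hypothetical choices.

```python
def bottom_corners(width, height, frac=0.25):
    """Return the lower-left and lower-right corner boxes as (x0, y0, x1, y1), end-exclusive."""
    w, h = int(width * frac), int(height * frac)
    return [(0, height - h, w, height), (width - w, height - h, width, height)]


def zero_red_in_regions(img, regions, min_red=150, margin=50):
    """Return a copy of img with the red channel set to 0 for red-dominant pixels in regions.

    img is a list of rows; each row is a list of (r, g, b) tuples.
    A pixel counts as red-dominant when r >= min_red and r exceeds both
    g and b by at least margin (both thresholds are assumptions).
    """
    out = [list(row) for row in img]  # copy rows so the input is not modified
    for x0, y0, x1, y1 in regions:
        for y in range(y0, y1):
            for x in range(x0, x1):
                r, g, b = out[y][x]
                if r >= min_red and r - max(g, b) >= margin:
                    # Follows the text literally: the red channel value becomes 0.
                    out[y][x] = (0, g, b)
    return out
```

Restricting the operation to the preset regions leaves red content elsewhere on the page, such as the red separation line, untouched.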
It should be noted that the above preprocessing processes are all prior art, and detailed implementation thereof is not described again.
In addition, it is worth pointing out that before the subsequent operation is performed on the file to be extracted, if the file to be extracted is preprocessed to obtain the preprocessed file, the preprocessed file can be subsequently used as a processing object to perform the subsequent operation.
Step S120: segment the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text.
In the embodiment of the present application, the text segmentation model is a PixelLink model, and the PixelLink model has a text and non-text classification function.
Specifically, the PixelLink model uses a CNN (convolutional neural network) to predict, for each pixel, whether it is text or non-text, and to predict whether a link exists in each of the pixel's 8 neighborhood directions; as shown in fig. 2, the eight heatmaps in the dashed box represent the link predictions for the eight directions. The PixelLink model then performs the OpenCV minAreaRect (minimum-area bounding rectangle) operation on the connected domains to obtain text connected domains of different sizes. After these are obtained, the PixelLink model applies noise filtering to the connected domains and then obtains a plurality of text boxes with border boundaries through a union-find (disjoint-set) data structure. Each text box contains different text.
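How pixel and link predictions can be grouped into boxes with a disjoint-set structure may be easier to see in code. The sketch below is illustrative only: it takes already-thresholded boolean predictions as input, uses axis-aligned bounding boxes in place of OpenCV's minAreaRect, and treats a small pixel count as noise. The names `text_boxes`, `NEIGHBOURS`, and `min_pixels`, and the order of the eight link channels, are assumptions.

```python
# Offsets (dy, dx) of the 8 neighbours; the channel order is an assumption.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def find(parent, i):
    """Return the root of i, halving the path on the way up."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def text_boxes(text_mask, links, min_pixels=1):
    """Group linked text pixels and return sorted (x0, y0, x1, y1) boxes, end-inclusive.

    text_mask[y][x] is True for text pixels.
    links[y][x][k] is True when pixel (y, x) predicts a link to neighbour k.
    """
    h, w = len(text_mask), len(text_mask[0])
    parent = list(range(h * w))
    for y in range(h):
        for x in range(w):
            if not text_mask[y][x]:
                continue
            for k, (dy, dx) in enumerate(NEIGHBOURS):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < h and 0 <= nx < w
                if inside and text_mask[ny][nx] and links[y][x][k]:
                    # Union the two pixels' sets.
                    root_a = find(parent, y * w + x)
                    root_b = find(parent, ny * w + nx)
                    parent[root_b] = root_a
    groups = {}
    for y in range(h):
        for x in range(w):
            if text_mask[y][x]:
                groups.setdefault(find(parent, y * w + x), []).append((x, y))
    boxes = []
    for pixels in groups.values():
        if len(pixels) < min_pixels:
            continue  # simple noise filter: drop tiny components
        xs = [p[0] for p in pixels]
        ys = [p[1] for p in pixels]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)
```

Because every text pixel is visited, two neighbouring text pixels end up in the same set when either of them predicts a link to the other.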
In the process of text segmentation of a file to be extracted or a preprocessed file by a text segmentation model, the PixelLink model may segment characters located in the same line into a plurality of text boxes. For a bill text, such as an invoice, if the text in the same line is divided into a plurality of text boxes, the subsequent text box recognition process may be affected.
To avoid this, in an optional embodiment, after obtaining the plurality of text boxes, the electronic device calculates, for each text box, the frame-line height (the distance between the upper and lower frame lines) and the box height (the vertical distance from the lower frame line to the bottom of the file to be extracted). It then merges into one text box those text boxes that are located on the same line (two text boxes are on the same line when their box heights are equal) and whose frame-line height difference is smaller than a threshold. The threshold can be set according to the actual situation.
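The merge rule can be sketched as below, assuming boxes are (x0, y0, x1, y1) tuples in image coordinates with y growing downward. The function name `merge_same_line` and the `line_tolerance` parameter are hypothetical; the default tolerance of 0 corresponds to the "equal heights" condition in the text.

```python
def merge_same_line(boxes, page_height, height_threshold, line_tolerance=0):
    """Merge boxes that sit on the same line and have similar frame-line heights."""
    merged = []
    for box in sorted(boxes):  # tuple sort: left to right by x0
        for i, m in enumerate(merged):
            # Distance from each lower frame line to the bottom of the page.
            same_line = abs((page_height - box[3]) - (page_height - m[3])) <= line_tolerance
            # Difference between upper-to-lower frame-line distances.
            similar = abs((box[3] - box[1]) - (m[3] - m[1])) < height_threshold
            if same_line and similar:
                merged[i] = (min(m[0], box[0]), min(m[1], box[1]),
                             max(m[2], box[2]), max(m[3], box[3]))
                break
        else:
            merged.append(box)  # no compatible box found: start a new one
    return merged
```

For example, with a page height of 100 and a threshold of 5, the boxes (0, 10, 40, 30) and (50, 12, 90, 30) share a lower frame line and differ in frame-line height by 2, so they merge into (0, 10, 90, 30).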
Step S130: recognize each text box through a text recognition model to obtain the text content in each text box.
In the embodiment of the present application, the text recognition model is a CRNN (convolutional recurrent neural network).
Generally, a CRNN includes a convolutional layer, a recurrent layer, and an output layer.
The convolutional layer extracts a feature sequence from the input image; the recurrent layer predicts the label distribution of the feature sequence obtained from the convolutional layer; and the output layer converts the label distribution obtained from the recurrent layer into the final recognition result through operations such as de-duplication and merging. Of course, the CRNN needs to be trained in advance so that it can fully learn the features of various text contents.
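The de-duplication performed by the output layer can be illustrated with greedy decoding in the style of CTC (connectionist temporal classification), the scheme commonly paired with CRNN. The text does not name CTC, so treating the output layer this way is an assumption, as are the function name `ctc_greedy_decode` and the convention that index 0 is the blank label.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Turn per-frame label scores into a string.

    frame_scores: one list of scores per time step, indexed by label.
    alphabet: label index -> character; the entry at `blank` is a placeholder.
    """
    # Pick the highest-scoring label for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    chars = []
    previous = None
    for label in best:
        # Keep a label only when it differs from the previous frame and is not blank.
        if label != previous and label != blank:
            chars.append(alphabet[label])
        previous = label
    return "".join(chars)
```

With the alphabet `["-", "a", "b"]`, the per-frame labels a, a, blank, a, b, b decode to "aab": repeated frames collapse, and the blank keeps the two separate occurrences of "a" apart.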
In the embodiment of the application, the CRNN identifies the content in each text box, so that the influence of interference factors outside the text boxes on the identification accuracy can be reduced, and the overall identification accuracy can be improved.
In addition, in order to further improve the accuracy of the finally obtained text content, in an optional implementation manner, a text error correction model can be trained and stored in the electronic device in advance, and a large number of text expression rules corresponding to normal grammars are learned in advance by the text error correction model. After the text content in each text box is recognized based on the CRNN, the obtained text content may be input into a text error correction model, so that the text error correction model identifies text content that may have been erroneously recognized, thereby facilitating error correction of the erroneously recognized text content.
In addition, a red-header file generally includes a red separation line that separates the file header from the file body. Therefore, when the file to be extracted is a red-header file, as an optional implementation manner, the file header and the file body can also be distinguished based on the red separation line. In this embodiment, referring to fig. 3, the method further comprises:
Step S140: determine the position representing the red separation line in the file to be extracted.
Because the length and color of the red separation line are generally similar across different red-header files, the red part can be extracted by searching the color channels of the red-header file; boundary detection is then performed on the extracted red part with the Canny edge detection algorithm, and the straight lines in the boundary detection result are detected by the Hough transform. After a plurality of straight lines are obtained, short noise lines are removed and adjacent lines are merged, and the longest remaining straight line is determined to be the red separation line, which fixes the position representing the red separation line.
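The post-processing of the detected lines (drop short noise lines, merge adjacent lines, keep the longest) can be sketched as follows. The input is assumed to be a list of roughly horizontal segments (x0, y0, x1, y1), such as a Hough-style detector might return; the function name `pick_separator` and the `min_length`, `max_gap`, and `max_dy` values are hypothetical.

```python
import math


def _length(seg):
    x0, y0, x1, y1 = seg
    return math.hypot(x1 - x0, y1 - y0)


def pick_separator(segments, min_length=20, max_gap=10, max_dy=3):
    """Return the longest merged segment, or None when nothing survives filtering."""
    # Orient every segment left to right and drop short noise segments.
    kept = []
    for x0, y0, x1, y1 in segments:
        if x0 > x1:
            x0, y0, x1, y1 = x1, y1, x0, y0
        if _length((x0, y0, x1, y1)) >= min_length:
            kept.append((x0, y0, x1, y1))
    kept.sort()
    # Merge a segment into an earlier one when it starts near that one's right end.
    merged = []
    for seg in kept:
        for i, m in enumerate(merged):
            close_in_y = abs(seg[1] - m[3]) <= max_dy
            close_in_x = seg[0] - m[2] <= max_gap
            if close_in_y and close_in_x:
                if seg[2] > m[2]:
                    merged[i] = (m[0], m[1], seg[2], seg[3])
                break
        else:
            merged.append(seg)
    if not merged:
        return None
    return max(merged, key=_length)
```

The y coordinates of the returned segment give the position that later steps use as the reference for splitting header and body.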
Step S150: determine the file header and the file body of the red-header file, using the position representing the red separation line as a reference.
After the position of the red separation line is determined, the electronic device may determine, based on the red separation line, the text content above the red separation line as a file header, and determine the content below the red separation line as a file body.
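A minimal sketch of this split, assuming each recognized item is a `(bbox, text)` pair with `bbox = (x0, y0, x1, y1)` in image coordinates (y grows downward) and that the separation line is summarized by a single `separator_y` value. The function name `split_header_body` and the use of the box's vertical centre are assumptions.

```python
def split_header_body(items, separator_y):
    """Split (bbox, text) items into header (above the line) and body (below it)."""
    header, body = [], []
    for bbox, text in items:
        centre_y = (bbox[1] + bbox[3]) / 2
        target = header if centre_y < separator_y else body
        target.append((bbox, text))
    # Sort each part top to bottom, then left to right.
    reading_order = lambda item: (item[0][1], item[0][0])
    return sorted(header, key=reading_order), sorted(body, key=reading_order)
```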
Step S160: output the text content of the file header and the text content of the file body separately.
In this embodiment, steps S140 to S150 may be performed after step S130; that is, after the text content in each text box has been determined, the text content of the file header and of the file body of the red-header file is determined according to the position of the red separation line, and the two are then output separately.
In addition, in another optional embodiment, before step S130 is executed, the position of the red separation line is determined according to the flow of steps S140 to S150, and the region where the file header of the red-header file is located and the region where the file body is located are then determined according to the red separation line. After these two regions are determined, the text content in each text box included in each region can be recognized according to actual requirements and then output. For example, in one embodiment, the user cares only about the text content of the file header; after the region where the file header is located is determined, only the text boxes in that region are recognized, and only the text content of the file header is output.
According to the file content extraction method provided by the embodiment of the application, the electronic equipment obtains a file to be extracted; then, segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing texts; and then, identifying each text box through a text identification model to obtain the text content in each text box. Because the text recognition model recognizes the content in each text box, the influence of interference factors outside the text boxes on the recognition accuracy can be reduced, and the overall recognition accuracy can be improved.
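The overall flow summarized above (acquire, segment, recognize, with optional preprocessing and error correction) can be expressed as a small pipeline sketch. The segmentation and recognition models are passed in as plain callables because their concrete interfaces are not given in the text; the function name `extract_file_content` and all parameter names are hypothetical.

```python
def extract_file_content(file, segment, recognize, preprocess=None, correct=None):
    """Run the extraction flow and return a list of (box, text) pairs.

    segment(image) -> iterable of text boxes
    recognize(image, box) -> text found inside that box
    preprocess(file) -> cleaned image (optional)
    correct(text) -> corrected text (optional)
    """
    # Optional preprocessing: the result replaces the file for all later steps.
    image = preprocess(file) if preprocess else file
    results = []
    for box in segment(image):
        # Recognition is applied per text box, not to the whole page.
        text = recognize(image, box)
        if correct:
            text = correct(text)
        results.append((box, text))
    return results
```

Keeping the models behind callables lets the same flow run with stand-in functions during testing and with trained segmentation and recognition models in deployment.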
As shown in fig. 4, an embodiment of the present application further provides a file content extracting apparatus 400, where the file content extracting apparatus 400 may include: an acquisition module 410, a segmentation module 420, and a recognition module 430.
The acquisition module 410 is configured to acquire a file to be extracted;
the segmentation module 420 is configured to segment the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text;
and the recognition module 430 is configured to recognize each text box through a text recognition model to obtain the text content in each text box.
In a possible implementation manner, the file to be extracted is a red-head file, the red-head file includes a red separation line, and the file content extraction apparatus 400 further includes a determination module and an output module.
The determining module is used for determining the position for representing the red separation line from the file to be extracted;
the determining module is further configured to determine a file header and a file body of the red-header file based on the position for representing the red separation line;
and the output module is used for respectively outputting the text content of the file header and the text content of the file main body.
In a possible implementation, the file content extracting apparatus 400 further includes a calculating module and a merging module.
The calculation module is used for calculating the height of the frame line of each text box;
and the merging module is used for merging the text boxes which are positioned in the same line and have the height difference of the frame lines smaller than the threshold value into one text box.
In a possible implementation manner, the file content extraction apparatus 400 further includes a removing module, configured to remove an interference factor in the file to be extracted, so as to obtain a preprocessed file;
correspondingly, the segmentation module 420 is configured to segment the preprocessed file through the text segmentation model to obtain a plurality of text boxes including a text.
In a possible implementation manner, the removing module is configured to remove red content at a preset position of the file to be extracted.
In a possible implementation manner, the file content extracting apparatus 400 further includes an error correction module, configured to perform error correction on the text content in each text box through a pre-stored text error correction model.
The file content extracting apparatus 400 provided in the embodiment of the present application has the same implementation principle and technical effect as the foregoing method embodiments. For brevity, anything not mentioned in the apparatus embodiment can be found in the corresponding content of the foregoing method embodiments.
In addition, the embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a computer, the steps included in the file content extraction method as described above are executed.
In addition, please refer to fig. 5, an embodiment of the present application further provides an electronic device 100 for implementing the file content extracting method and apparatus of the embodiment of the present application, where the electronic device 100 may include: a processor 110, a memory 120.
Optionally, the electronic device 100 may be, but is not limited to, a Personal Computer (PC), a smart phone, a tablet computer, or a Mobile Internet Device (MID).
It should be noted that the components and structure of electronic device 100 shown in FIG. 5 are exemplary only, and not limiting, and electronic device 100 may have other components and structures as desired.
The processor 110, memory 120, and other components that may be present in the electronic device 100 are electrically connected to each other, directly or indirectly, to enable the transfer or interaction of data. For example, the processor 110, the memory 120, and other components that may be present may be electrically coupled to each other via one or more communication buses or signal lines.
The memory 120 is used for storing a program, such as a program corresponding to the foregoing file content extracting method or the foregoing file content extracting apparatus 400. Optionally, when the file content extracting apparatus 400 is stored in the memory 120, the file content extracting apparatus includes at least one software functional module that can be stored in the memory 120 in the form of software or firmware (firmware).
Alternatively, the software function module included in the file content extraction apparatus 400 may also be solidified in an Operating System (OS) of the electronic device 100.
The processor 110 is used to execute executable modules stored in the memory 120, such as software functional modules or computer programs included in the file content extraction apparatus 400. When the processor 110 receives the execution instruction, it may execute the computer program, for example, to perform: acquiring a file to be extracted; segmenting the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing texts; and identifying each text box through a text identification model to obtain the text content in each text box.
Of course, the method disclosed in any of the embodiments of the present application can be applied to the processor 110, or implemented by the processor 110.
In summary, in the file content extraction method, the file content extraction apparatus, the electronic device and the storage medium provided by the embodiments of the present application, the electronic device acquires the file to be extracted, segments the file to be extracted through a text segmentation model to obtain a plurality of text boxes containing text, and then recognizes each text box through a text recognition model to obtain the text content in each text box. Because the text recognition model recognizes only the content inside each text box, the influence of interference factors outside the text boxes on recognition accuracy is reduced, and the overall recognition accuracy can be improved.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatuses, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code that comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010012359.6A (CN111209865A) | 2020-01-06 | 2020-01-06 | File content extraction method and device, electronic equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111209865A (en) | 2020-05-29 |
Family ID: 70786609
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010012359.6A (CN111209865A), pending | File content extraction method and device, electronic equipment and storage medium | 2020-01-06 | 2020-01-06 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111209865A (en) |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105528604A (en) * | 2016-01-31 | 2016-04-27 | 华南理工大学 | Bill automatic identification and processing system based on OCR |
| CN107798321A (en) * | 2017-12-04 | 2018-03-13 | 海南云江科技有限公司 | A kind of examination paper analysis method and computing device |
| CN108280389A (en) * | 2017-01-06 | 2018-07-13 | 南通艾思达智能科技有限公司 | Medical bill ICR identifying systems and its medical bank slip recognition method |
| CN109635627A (en) * | 2018-10-23 | 2019-04-16 | 中国平安财产保险股份有限公司 | Pictorial information extracting method, device, computer equipment and storage medium |
| CN109992765A (en) * | 2017-12-29 | 2019-07-09 | 北京京东尚科信息技术有限公司 | Text error correction method and device, storage medium and electronic equipment |
| CN110188762A (en) * | 2019-04-23 | 2019-08-30 | 山东大学 | Method, system, equipment and medium for identifying Chinese and English mixed merchant store names |
| CN110211048A (en) * | 2019-05-28 | 2019-09-06 | 湖北华中电力科技开发有限责任公司 | A kind of complicated archival image Slant Rectify method based on convolutional neural networks |
| CN110276352A (en) * | 2019-06-28 | 2019-09-24 | 拉扎斯网络科技(上海)有限公司 | Identification recognition method and device, electronic equipment and computer readable storage medium |
| CN110543810A (en) * | 2019-06-28 | 2019-12-06 | 南京智录信息科技有限公司 | Technology for completely identifying header and footer of PDF (Portable document Format) file |
| CN110619333A (en) * | 2019-08-15 | 2019-12-27 | 平安国际智慧城市科技股份有限公司 | Text line segmentation method, text line segmentation device and electronic equipment |
Non-Patent Citations (2)
| Title |
|---|
| CHENGHAOY: "opencv 实现特定颜色线条提取与定", Retrieved from the Internet <URL:https://blog.csdn.net/chenghaoy/article/details/86509950> * |
| 王昌杰: "红头文件检测关键技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 06, pages 138 - 2195 * |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111652176A (en) * | 2020-06-11 | 2020-09-11 | 商汤国际私人有限公司 | Information extraction method, device, equipment and storage medium |
| CN111652176B (en) * | 2020-06-11 | 2024-05-21 | 商汤国际私人有限公司 | Information extraction method, device, equipment and storage medium |
| CN112215235A (en) * | 2020-10-16 | 2021-01-12 | 深圳市华付信息技术有限公司 | Scene text detection method aiming at large character spacing and local shielding |
| CN112215235B (en) * | 2020-10-16 | 2024-04-26 | 深圳华付技术股份有限公司 | Scene text detection method aiming at large character spacing and local shielding |
| CN113095061A (en) * | 2021-03-31 | 2021-07-09 | 京华信息科技股份有限公司 | Method, system and device for extracting document header and storage medium |
| CN113095061B (en) * | 2021-03-31 | 2023-08-29 | 京华信息科技股份有限公司 | Method, system, device and storage medium for extracting document header |
| CN113343797A (en) * | 2021-05-25 | 2021-09-03 | 中国平安人寿保险股份有限公司 | Information extraction method and device, terminal equipment and computer readable storage medium |
| CN114267047A (en) * | 2021-11-30 | 2022-04-01 | 高新兴科技集团股份有限公司 | Electronic file text detection method, device, medium and equipment based on deep learning |
| CN116863479A (en) * | 2023-07-25 | 2023-10-10 | 山东浪潮科学研究院有限公司 | Red lead file auditing method, device, equipment and storage medium |
| CN116863479B (en) * | 2023-07-25 | 2025-09-26 | 山东浪潮科学研究院有限公司 | A method, device, equipment and storage medium for reviewing red-headed documents |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10853638B2 (en) | | System and method for extracting structured information from image documents |
| CN111209865A (en) | | File content extraction method and device, electronic equipment and storage medium |
| US10706320B2 (en) | | Determining a document type of a digital document |
| US11106891B2 (en) | | Automated signature extraction and verification |
| JP6366024B2 (en) | | Method and apparatus for extracting text from an imaged document |
| US8494273B2 (en) | | Adaptive optical character recognition on a document with distorted characters |
| CN110598686B (en) | | Invoice identification method, system, electronic equipment and medium |
| US11600088B2 (en) | | Utilizing machine learning and image filtering techniques to detect and analyze handwritten text |
| US9965695B1 (en) | | Document image binarization method based on content type separation |
| US10643094B2 (en) | | Method for line and word segmentation for handwritten text images |
| JP2003515230A (en) | | Method and system for separating categorizable symbols of video stream |
| Demilew et al. | | Ancient Geez script recognition using deep learning |
| Kaundilya et al. | | Automated text extraction from images using OCR system |
| CN109389115B (en) | | Text recognition method, device, storage medium and computer equipment |
| WO2021051553A1 (en) | | Certificate information classification and positioning method and apparatus |
| CN103606220A (en) | | Check printed number recognition system and check printed number recognition method based on white light image and infrared image |
| CN109508716B (en) | | Image character positioning method and device |
| Malik et al. | | An efficient skewed line segmentation technique for cursive script OCR |
| CN115984859B (en) | | Image character recognition method, device and storage medium |
| Bukhari et al. | | Layout analysis of Arabic script documents |
| CN110737364A (en) | | Control method for touch writing acceleration under android systems |
| US8891822B2 (en) | | System and method for script and orientation detection of images using artificial neural networks |
| Kaur et al. | | Page segmentation in OCR system-a review |
| US20150186718A1 (en) | | Segmentation of Overwritten Online Handwriting Input |
| KR101048399B1 (en) | | Character detection method and apparatus |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: Zone B, 19/F, Building A1, 3333 Xiyou Road, Hi-tech Zone, Hefei City, Anhui Province. Applicant after: Dingfu Intelligent Technology Co.,Ltd. Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing. Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd. |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200529 |