
CN119129529A - PDF document conversion method, device, equipment, storage medium and product - Google Patents

PDF document conversion method, device, equipment, storage medium and product

Info

Publication number
CN119129529A
Authority
CN
China
Prior art keywords
page
information
pdf document
web page
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411167727.9A
Other languages
Chinese (zh)
Inventor
王恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Glodon Co Ltd
Original Assignee
Glodon Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Glodon Co Ltd
Priority to CN202411167727.9A
Publication of CN119129529A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/174Form filling; Merging
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/177Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/183Tabulation, i.e. one-dimensional positioning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to the technical field of computers and discloses a method, device, equipment, storage medium and product for converting a PDF document. The method comprises: obtaining a PDF document to be converted and page information of the PDF document; parsing the PDF document page by page based on the page information to obtain web page elements corresponding to the content types of each page in the PDF document; and rendering based on the web page elements to generate a web page corresponding to the PDF document. Parsing the PDF document page by page avoids the processing concurrency caused by full-text parsing and helps ensure the accuracy of both the parsing result and the subsequent rendering result, so that the web page faithfully represents the PDF document. In addition, a web page with structured markup is generated from the web page elements, so that semantically related content can be displayed on the web page, which facilitates retrieval and comparison display of similar content.

Description

PDF document conversion method, device, equipment, storage medium and product
Technical Field
The invention relates to the technical field of computers, in particular to a method, a device, equipment, a storage medium and a product for converting PDF documents.
Background
Files are commonly distributed as PDF documents to prevent modification and to avoid format display problems. In some scenarios, displaying PDF documents also requires retrieval and comparison display of their content. It is therefore desirable to provide a method for converting a PDF document so that the converted content can be used for display.
Disclosure of Invention
In view of the above, the present invention provides a method, apparatus, device, storage medium and product for converting PDF documents, so as to solve the problem of converting PDF documents.
In a first aspect, the present invention provides a method for converting a PDF document, where the method includes:
acquiring a PDF document to be converted and page information of the PDF document;
Analyzing the PDF document page by page based on the page information to obtain webpage elements corresponding to content types of each page in the PDF document, wherein the content types comprise at least one of texts, vector graphics, images and tables, and the webpage elements are used for forming webpage pages;
and rendering based on the webpage elements to generate a webpage corresponding to the PDF document.
According to the method for converting the PDF document, the PDF document is parsed page by page to obtain the web page elements corresponding to the content types of each page, that is, a corresponding web page element is obtained for each content type, and the parsed web page elements are rendered to generate the web page corresponding to the PDF document. Parsing page by page avoids the processing concurrency caused by full-text parsing and helps ensure the accuracy of both the parsing result and the subsequent rendering result, so that the web page faithfully represents the PDF document. Because the PDF document is represented by a web page, retrieval and comparison display of similar content can conveniently be built on top of it.
In an optional implementation manner, if the content type includes text, the corresponding web page element includes a text element, and the page-by-page parsing is performed on the PDF document based on the page information to obtain a web page element corresponding to the content type of each page, including:
Determining a current page to be analyzed based on the page information;
parsing the text in the current page to be parsed, and determining the ligature information and the sentence breaking information of the text to obtain the text element.
According to the method for converting the PDF document provided by the embodiment of the invention, the current page to be parsed is determined from the page information, so the PDF page currently to be processed can be obtained accurately. In addition, determining the ligature information and the sentence breaking information of the text during parsing further ensures that the text element is displayed consistently with the text in the PDF.
In an optional implementation manner, the parsing the text in the current page to be parsed, determining the ligature information and the sentence breaking information of the text, and obtaining the text element includes:
determining characteristic information of each character of each line of characters in the text;
and joining or breaking the characters in the same row and in adjacent rows based on the characteristic information, and determining the ligature information and sentence breaking information to obtain the text element.
According to the PDF document conversion method provided by the embodiment of the invention, the characteristic information of the characters is used to join or break characters within the same row and across adjacent rows separately, which ensures that the text is displayed accurately.
In an optional implementation manner, the joining or breaking of the characters in the same row and in adjacent rows based on the characteristic information, and determining the ligature information and sentence breaking information to obtain the text element, includes:
for characters in the same row, calculating the association degree between two adjacent characters based on the characteristic information;
if the association degree is larger than a preset association value, the two adjacent characters form ligature information; otherwise, a sentence break is made between the two adjacent characters to form sentence breaking information;
for characters in adjacent rows, calculating, based on the characteristic information, a first margin of the first character of a first row in the adjacent rows and a second margin of the last character of a second row in the adjacent rows, wherein the second row is located above the first row;
if the difference between the first margin and the second margin is smaller than a preset margin value, the first character of the first row and the last character of the second row form ligature information; otherwise, a sentence break is made between the first character of the first row and the last character of the second row to form sentence breaking information.
According to the method for converting the PDF document provided by the embodiment of the invention, joining and sentence breaking are handled differently for the same row and for adjacent rows, which further ensures the accuracy of the resulting ligature information and sentence breaking information.
In an optional implementation manner, if the content type includes vector graphics, the web page element includes vector graphics elements, and the page-by-page parsing is performed on the PDF document based on the page information to obtain a web page element corresponding to the content type of each page in the PDF document, where the web page element includes:
Extracting line information of the vector graphics in a current page to be analyzed, wherein the current page to be analyzed is determined based on the page information;
Determining a filling area of the vector graphics based on the line information;
and performing color filling in the filling area to obtain the vector graphic element.
According to the method for converting the PDF document, disclosed by the embodiment of the invention, aiming at the vector graphics, the corresponding filling area is obtained by extracting the line information of the vector graphics, and the color filling is carried out on the basis, so that the vector graphic elements consistent with the vector graphics in the PDF can be obtained.
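As a minimal sketch of the three vector graphics steps (line information, filling area, color filling), the following Python function turns the vertices of a closed outline into a filled SVG path. The function name `vector_element`, the choice of SVG as the web page element, and the assumption that the line information is a list of vertices are all illustrative and are not stated in the source.

```python
def vector_element(points, fill):
    """Build a filled SVG <path> from outline vertices (hypothetical helper).

    points: list of (x, y) vertices describing a closed outline.
    fill:   CSS colour string used to fill the enclosed area.
    """
    if len(points) < 3:
        raise ValueError("a filled region needs at least three vertices")
    # "M" moves to the first vertex, "L" draws straight lines, "Z" closes the path.
    d = "M " + " L ".join(f"{x:g} {y:g}" for x, y in points) + " Z"
    return f'<path d="{d}" fill="{fill}"/>'
```

For a triangle, `vector_element([(0, 0), (10, 0), (10, 5)], "#ff0000")` yields a path whose closed area is filled with red.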
In an alternative embodiment, if the content type includes an image, the web page element includes an image element, and the analyzing the PDF document page by page based on the page information to obtain the web page element corresponding to the content type of each page in the PDF document includes:
Extracting an image in a current page to be resolved, and determining an image format of the image, wherein the current page to be resolved is determined based on the page information;
and analyzing the image according to the image format to obtain image elements.
According to the method for converting the PDF document, provided by the embodiment of the invention, aiming at the image in the PDF, the image element is obtained by determining the image format and analyzing the image based on the image format, so that the consistency of the image element and the PDF document is ensured.
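One way to picture "determining the image format, then parsing according to that format" is a check of the leading signature bytes of the extracted image data. The sketch below is an assumption for illustration only: the source does not say how the format is determined, and the helper names `detect_format` and `to_image_element` and the use of a data URI are hypothetical.

```python
import base64

# Well-known leading byte signatures of common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]

def detect_format(data):
    """Return the image format name, or None if the signature is unknown."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None

def to_image_element(data):
    """Wrap image bytes in an <img> element using a data URI (illustrative)."""
    fmt = detect_format(data)
    if fmt is None:
        raise ValueError("unsupported image format")
    encoded = base64.b64encode(data).decode("ascii")
    return f'<img src="data:image/{fmt};base64,{encoded}">'
```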
In an optional implementation manner, if the content type includes a table, the web page element includes a table element, and the analyzing the PDF document page by page based on the page information to obtain the web page element of each page includes:
extracting line segments of the table in a current page to be analyzed, and determining the position information of each line segment in the current page to be analyzed, wherein the current page to be analyzed is determined based on the page information;
And generating the table element of the current page to be analyzed based on the position information of each line segment.
According to the method for converting the PDF document, in the case of a table, the text inside the table is parsed in the same way as other text, while the line segments of the table do not take part in similarity comparison. Only the position information of the line segments therefore needs to be determined accurately, which ensures that the table element is consistent with the table in the PDF document.
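The idea of generating a table element from the positions of its line segments can be sketched as follows. This is a simplified illustration under strong assumptions that the source does not make: all segments are axis-aligned, the table is a regular grid without merged cells, and the function name `table_from_segments` and the tolerance `tol` are hypothetical.

```python
def table_from_segments(segments, tol=0.5):
    """Derive an empty HTML table grid from ruling lines.

    segments: list of (x0, y0, x1, y1) line segments in page coordinates.
    Returns (row_count, column_count, html).
    """
    def distinct(values):
        # Collapse coordinates that differ by no more than tol into one ruling.
        out = []
        for v in sorted(values):
            if not out or v - out[-1] > tol:
                out.append(v)
        return out

    # Horizontal segments give row boundaries, vertical ones column boundaries.
    ys = distinct(y0 for x0, y0, x1, y1 in segments if abs(y0 - y1) <= tol)
    xs = distinct(x0 for x0, y0, x1, y1 in segments if abs(x0 - x1) <= tol)
    rows = max(len(ys) - 1, 0)
    cols = max(len(xs) - 1, 0)
    body = "".join("<tr>" + "<td></td>" * cols + "</tr>" for _ in range(rows))
    return rows, cols, "<table>" + body + "</table>"
```

Three horizontal and three vertical rulings produce a 2 x 2 grid; the cell text would be filled in by the text parsing described earlier.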
In an optional implementation manner, the generating, based on the webpage element, a webpage corresponding to the PDF document includes:
Determining a rendering tag of the webpage element based on the characteristic information of the webpage element, wherein the characteristic information of the webpage element comprises a character style and sentence characteristics, and the sentence characteristics are determined based on the ligature information and the sentence breaking information of the text;
structuring the webpage element based on the rendering tag to obtain structured information;
rendering is carried out according to the structural information, and the webpage is generated.
According to the method for converting the PDF document, provided by the embodiment of the invention, the rendering tag is determined based on the characteristic information of the webpage element, so that the determined rendering tag can represent the characteristic of the webpage element, and the consistency of the structured processing result with the PDF document can be ensured.
In an optional implementation manner, the structuring the webpage element based on the rendering tag to obtain structured information includes:
detecting whether similar elements exist in the webpage elements;
when the webpage element has similar elements, generating marking information of the webpage element;
and structuring the webpage element based on the marking information and the rendering tag to obtain structured information.
According to the method for converting the PDF document, in the process of generating the webpage by the webpage elements, similarity comparison is further carried out, if the similar elements exist, the marking information of the webpage elements is generated, and the webpage elements are structured by combining the rendering labels, so that the identification of the similarity comparison can be displayed in the webpage.
In an alternative embodiment, the marking information comprises marking position and marking content, and when the web page element has similar elements, the marking information of the web page element is generated, comprising:
when the webpage element has similar elements, associating first identification information corresponding to the webpage element with second identification information corresponding to the similar elements;
determining the marking position and marking content of the webpage element in the webpage based on the first identification information and the second identification information;
the first identification information is used for representing the uniqueness of the webpage element, and the second identification information is used for representing the uniqueness of the similar comparison element.
According to the method for converting the PDF document, provided by the embodiment of the invention, the webpage elements and the similar elements are displayed in a comparison manner by adopting the first identification information and the second identification information in the webpage, so that a user can conveniently and quickly locate similar contents in the document.
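The association between a web page element and its similar elements can be illustrated with a small sketch. The source does not specify how similarity is measured; the use of `difflib.SequenceMatcher`, the threshold of 0.8, and the function name `mark_similar` are assumptions made only for this example. The element ids play the role of the first and second identification information.

```python
from difflib import SequenceMatcher
from itertools import combinations

def mark_similar(elements, threshold=0.8):
    """Associate ids of text elements whose similarity reaches the threshold.

    elements: dict mapping a unique element id to its text.
    Returns a dict mapping each id to the ids of its similar elements.
    """
    marks = {}
    for (id_a, text_a), (id_b, text_b) in combinations(elements.items(), 2):
        if SequenceMatcher(None, text_a, text_b).ratio() >= threshold:
            # Record the association in both directions.
            marks.setdefault(id_a, []).append(id_b)
            marks.setdefault(id_b, []).append(id_a)
    return marks
```

A renderer could then use each entry to decide where to place a mark and which counterpart to link to; that step is left out here because the source does not detail it.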
In a second aspect, the present invention provides a PDF document conversion apparatus, including:
the acquisition module is used for acquiring the PDF document to be converted and page information of the PDF document;
The analysis module is used for carrying out page-by-page analysis on the PDF document based on the page information to obtain webpage elements corresponding to content types of all pages in the PDF document, wherein the content types comprise at least one of texts, vector graphics, images and tables, and the webpage elements are used for forming webpage pages;
and the rendering module is used for rendering based on the webpage elements and generating a webpage corresponding to the PDF document.
In a third aspect, the present invention provides a computer device, including a memory and a processor, where the memory and the processor are communicatively connected to each other, and the memory stores computer instructions, and the processor executes the computer instructions, so as to execute the method for converting a PDF document according to the first aspect or any embodiment corresponding to the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the method for converting a PDF document of the first aspect or any one of the embodiments corresponding thereto.
In a fifth aspect, the present disclosure provides a computer program product comprising computer instructions for causing a computer to perform the method of converting a PDF document of the first aspect or any of its corresponding embodiments described above.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method of converting a PDF document according to an embodiment of the invention;
FIG. 2 is a flow chart of another method of converting a PDF document according to an embodiment of the invention;
FIG. 3 is a flow chart of a further method of converting a PDF document according to an embodiment of the invention;
FIG. 4 is a schematic illustration of a web page according to an embodiment of the present invention;
FIG. 5 is a block diagram of a structure of a converting apparatus of a PDF document according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related art, a PDF document is converted by converting it into text, by treating the entire PDF document as a picture and performing image recognition, or by a hybrid of the two.
However, converting a PDF document into text handles complex typesetting (e.g., multi-column text, mixed text and graphics) poorly, and it has difficulty processing non-text content (e.g., images, watermarks, bars, filled graphics).
Performing image recognition on the entire PDF document as a picture is slow and extremely demanding on computing resources, because treating the whole document as one complete image requires a huge amount of computation.
Converting the PDF document in the hybrid manner extracts text and pictures separately, so the various elements in the PDF document can be processed more comprehensively, but the original style of the PDF document is lost and the conversion result is inconsistent with the format of the original PDF document.
Based on the above, the method for converting the PDF document provided by the embodiment of the invention parses the PDF document page by page to obtain the web page elements corresponding to each page, and then renders the web page elements to obtain a web page. The PDF document is displayed by converting it into a web page, and page-by-page parsing helps ensure the accuracy of both the parsing result and the subsequent rendering result, so that the web page faithfully represents the PDF document.
According to an embodiment of the present invention, there is provided an embodiment of a method of converting a PDF document, it being noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that herein.
In this embodiment, a method for converting a PDF document is provided, which may be used in an electronic device such as a computer, a mobile phone or a tablet computer. FIG. 1 is a flowchart of a method for converting a PDF document according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
step S101, acquiring a PDF document to be converted and page information of the PDF document.
The PDF document to be converted is used for representing the PDF document which is required to be converted currently, and the PDF document can be uploaded to the electronic device by a user through interaction with the electronic device, can be stored in the electronic device and obtained from the electronic device when conversion is required, or can be obtained after interaction between the electronic device and third party equipment.
The page information of the PDF document represents the page numbers of the PDF document, which makes it easy to keep track of which page of the PDF document is currently being parsed. The page information may be obtained by parsing the catalog of the PDF document, by reading the page information stored in the PDF document, or in other ways.
The way in which the PDF document and its page information are acquired is not limited here and may be chosen according to actual requirements.
Step S102, analyzing the PDF document page by page based on the page information to obtain the webpage elements corresponding to the content types of the pages in the PDF document.
Wherein the content type includes at least one of text, vector graphics, images, and tables, and the web page elements are used to form a web page.
The PDF document is parsed page by page, for example starting from page 1. Each time a page is fetched for parsing, the page information of the current parse is updated, which makes progress easy to track.
After the PDF document is acquired, the content types of its pages may be determined first during parsing. For example, a type classification model may be used to identify the content types: each page of the PDF document is input into the type classification model, which outputs the content types included in that page. Of course, ways other than the type classification model may also be used to determine the content types included in each page of the PDF document.
The content types of each page in the PDF document are then parsed in different ways to obtain the web page elements corresponding to the content types of each page. Specifically, parsing a content type can be understood as converting that content type into the corresponding web page element, which facilitates subsequent rendering and display.
And step S103, rendering is carried out based on the webpage elements, and a webpage corresponding to the PDF document is generated.
The web page elements are rendered so that they are displayed in the display format of the PDF document. For example, the fonts in the web page need to match the corresponding fonts in the PDF document in size, color and so on, and the number of characters in each line of the web page needs to equal the number of characters in the corresponding line of the PDF document.
Further, if similarity comparison is required for the generated web page, similar content is marked in the web page when it is displayed, so that it can be compared and located quickly.
According to the method for converting the PDF document, the PDF document is parsed page by page to obtain the web page elements corresponding to the content types of each page, that is, a corresponding web page element is obtained for each content type, and the parsed web page elements are rendered to generate the web page corresponding to the PDF document. Parsing page by page avoids the processing concurrency caused by full-text parsing and helps ensure the accuracy of both the parsing result and the subsequent rendering result, so that the web page faithfully represents the PDF document. Because the PDF document is represented by a web page, retrieval and comparison display of similar content can conveniently be built on top of it.
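The flow of steps S101 to S103 can be sketched in a few lines of Python. This is only an illustration of the page-by-page structure and not the implementation described by the patent: the `Page` and `Element` types, the helpers `parse_page`, `render` and `convert`, and the in-memory page representation are all hypothetical, and a real system would read pages from a PDF parsing library.

```python
from dataclasses import dataclass, field
from html import escape

@dataclass
class Element:
    kind: str      # "text", "vector", "image" or "table"
    payload: str   # already-parsed content for this element

@dataclass
class Page:
    number: int
    items: list = field(default_factory=list)  # (content_type, raw_content) pairs

def parse_page(page):
    """Step S102 for one page: map each content item to a web page element."""
    return [Element(kind, raw) for kind, raw in page.items]

def render(elements_by_page):
    """Step S103: emit one <section> per PDF page, in page order."""
    parts = []
    for number in sorted(elements_by_page):
        body = "".join(
            f'<div class="{e.kind}">{escape(e.payload)}</div>'
            for e in elements_by_page[number]
        )
        parts.append(f'<section data-page="{number}">{body}</section>')
    return "\n".join(parts)

def convert(pages):
    """Steps S101 to S103: parse one page at a time, then render."""
    elements_by_page = {}
    for page in pages:
        elements_by_page[page.number] = parse_page(page)
    return render(elements_by_page)
```

Keeping the parsed elements keyed by page number mirrors the role of the page information: at any moment it is clear which page is being parsed and where its elements belong in the output.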
In this embodiment, a method for converting a PDF document is provided, which may be used in an electronic device such as a computer, a mobile phone or a tablet computer. FIG. 2 is a flowchart of a method for converting a PDF document according to an embodiment of the present invention, in which the content types include text. As shown in FIG. 2, the flow includes the following steps:
Step S201, a PDF document to be converted and page information of the PDF document are acquired. For details, see step S101 of the embodiment shown in FIG. 1; they are not repeated here.
Step S202, analyzing the PDF document page by page based on the page information to obtain the webpage elements corresponding to the content types of the pages in the PDF document.
Wherein the content type includes at least one of text, vector graphics, images, and tables, and the web page elements are used to form a web page.
Specifically, if the content type includes text, the corresponding web page element includes a text element. Based on this, the step S202 includes:
Step S2021, the current page to be parsed is determined based on the page information.
As described above, the page information may be used to indicate the page number currently being parsed, from which the current page to be parsed in the PDF document can be determined accurately.
Step S2022, analyzing the text in the current page to be analyzed, and determining the ligature information and the sentence breaking information of the text to obtain text elements.
Because the text comprises characters and punctuation, the ligature information and the sentence breaking information of the text need to be determined during parsing in order to keep the web page elements consistent with the text. Furthermore, from the perspective of semantic understanding, determining the ligature information and the sentence breaking information also helps keep the marked content complete in the subsequent similarity comparison.
In some alternative embodiments, step S2022 described above comprises:
Step a1, determining feature information of each character of each line of characters in the text.
Step a2, joining or breaking characters in the same row and in adjacent rows based on the feature information, and determining ligature information and sentence breaking information to obtain text elements.
The text in the PDF document may come from a text paragraph, from a table, and so on; the source of the text is not limited here. For each line of text in the current page to be parsed, the feature information of each character is determined, including but not limited to the position, size and format of the character.
For characters in the same row, whether adjacent characters are joined or separated by a sentence break can be determined by calculating the distance between them, or by determining the angle between them.
For characters in different rows, a paragraph boundary may fall between the rows. The distance between the first character of the lower row of two adjacent rows and the page margin can therefore be calculated; if the distance exceeds a preset value, a paragraph boundary exists and a sentence break is needed. Of course, other ways of determining the ligature information and the sentence breaking information may also be used to obtain the text element.
The feature information of the characters is thus used to join or break characters within the same row and across adjacent rows separately, which ensures that the text is displayed accurately.
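The distance-based variant for characters in the same row can be sketched as follows. The `Char` type, the function name `split_row` and the threshold `max_gap` are hypothetical names introduced for illustration; the source only states that the distance between adjacent characters may be used, without fixing a formula.

```python
from dataclasses import dataclass

@dataclass
class Char:
    text: str
    x: float      # left edge of the character box
    width: float  # width of the character box

def split_row(chars, max_gap):
    """Join neighbouring characters of one row unless the gap exceeds max_gap."""
    groups = []
    current = ""
    prev = None
    for ch in sorted(chars, key=lambda c: c.x):
        if prev is not None:
            # Horizontal gap between the previous box's right edge and this box.
            gap = ch.x - (prev.x + prev.width)
            if gap > max_gap:
                groups.append(current)  # break here
                current = ""
        current += ch.text
        prev = ch
    if current:
        groups.append(current)
    return groups
```

With a `max_gap` of 2.0, two characters 0.5 units apart stay joined, while a character 19.5 units further along starts a new group.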
In some alternative embodiments, step a2 includes:
Step a21, for characters in the same row, calculating the association degree between two adjacent characters based on the feature information.
Step a22, if the association degree is larger than the preset association value, the two adjacent characters form ligature information; otherwise, a sentence break is made between the two adjacent characters to form sentence breaking information.
Step a23, for characters in adjacent rows, calculating, based on the feature information, a first margin of the first character of a first row in the adjacent rows and a second margin of the last character of a second row in the adjacent rows, the second row being located above the first row.
Step a24, if the difference between the first margin and the second margin is smaller than the preset margin value, the first character of the first row and the last character of the second row form ligature information; otherwise, a sentence break is made between the first character of the first row and the last character of the second row to form sentence breaking information.
For characters in the same line, the feature information of each character is already known from the description above; when the association degree is calculated, the feature information of the two adjacent characters is processed together.
For example, the association degree is characterized by the angle between adjacent characters. For the t-th character, the feature information forms a feature vector v_t = [x_t, y_t, w_t, h_t, Style_t], where (x_t, y_t) identifies the coordinates of the lower-left corner of the t-th character, (w_t, h_t) represents its width and height, and Style_t is a one-hot encoding representing its style. For the adjacent (t+1)-th character, the feature information likewise forms a feature vector v_{t+1} = [x_{t+1}, y_{t+1}, w_{t+1}, h_{t+1}, Style_{t+1}]. The angle between these two feature vectors is then calculated. If the angle is smaller than a preset angle, the two feature vectors are considered to be parallel, i.e., the two characters should be joined together. The preset angle provides tolerance for minor differences between characters. When the angle represents the association degree, the smaller the angle, the larger the association degree of the two adjacent characters.
Based on the above, if the association degree of two adjacent characters in the same row is larger than the preset association degree, the two adjacent characters are considered to be connected together to form the ligature information, otherwise, the existence of a sentence break between the two adjacent characters is represented to form the sentence break information.
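The angle-based association check described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names and the 5-degree threshold are assumptions, and the treatment of a zero-length vector is a choice made only for this sketch.

```python
import math


def feature_vector(x, y, w, h, style_one_hot):
    """Build [x, y, w, h, Style] with Style given as a one-hot list."""
    return [x, y, w, h, *style_one_hot]


def angle_deg(u, v):
    """Angle between two feature vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 90.0  # assumed: a degenerate vector is treated as unrelated
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def is_ligature(u, v, max_angle_deg=5.0):
    """Smaller angle means larger association; join below the preset angle."""
    return angle_deg(u, v) < max_angle_deg
```

For instance, two characters of the same style at x = 100 and x = 110 on the same baseline give an angle well under one degree and are joined, whereas a character far to the right on the same baseline exceeds the assumed threshold and produces a sentence break.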
For characters in adjacent lines, since text is processed from left to right and from top to bottom, the lower of the two adjacent lines is referred to as the first line, and the upper of the two adjacent lines is referred to as the second line. The first character of the first line and the last character of the second line are determined. The distance between that first character and its nearest margin (the left margin) is then determined to obtain the first margin, and the distance between the last character of the second line and the margin in the text direction (the right margin) is determined to obtain the second margin.
The difference between the first margin and the second margin is calculated. If the difference is smaller than the preset margin value, the two characters need to be joined into a sentence, i.e., the first character of the first line and the last character of the second line form ligature information; otherwise, a sentence break is performed between the first character of the first line and the last character of the second line to form sentence breaking information. The preset margin value may be the larger of the widths of the two characters.
For example, if the difference between the coordinate y_t of the first character of the first line and the coordinate y_{t-1} of the last character of the second line is greater than max{w_t, w_{t-1}}, it is then determined whether the difference between the left margin of the current character t and the right margin of the previous character t-1 is less than max{w_t, w_{t-1}}. If it is, the two characters are joined into a sentence; otherwise, a sentence break is made.
The processing of the continuous word or the sentence breaking is carried out in different modes aiming at the same row and adjacent rows, so that the accuracy of the obtained continuous word information and sentence breaking information can be further ensured.
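The adjacent-line rule above can be sketched as follows. The dictionary layout of a character, the `page_left` and `page_right` parameters, and the use of an absolute difference are assumptions made for this illustration.

```python
def joins_across_lines(first_char, last_char, page_left, page_right):
    """Decide whether the last character of the upper line and the first
    character of the lower line belong to the same sentence.

    first_char / last_char are dicts with keys "x" (left edge) and "w" (width).
    """
    # First margin: first character of the lower line to the left margin.
    first_margin = first_char["x"] - page_left
    # Second margin: last character of the upper line to the right margin.
    second_margin = page_right - (last_char["x"] + last_char["w"])
    # Preset margin value: the larger of the two character widths.
    threshold = max(first_char["w"], last_char["w"])
    return abs(first_margin - second_margin) < threshold
```

With a text area from x = 50 to x = 550, an upper line that runs to the right margin followed by a lower line that starts at the left margin is joined, while a lower line indented by 30 units is treated as the start of a new paragraph.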
In some alternative embodiments, if the content type includes vector graphics, the web page element includes a vector graphics element. Based on this, the step S202 includes:
Step a1, extracting line information of the vector graphics in the current page to be analyzed.
The current page to be analyzed is determined based on the page information.
Step a2, determining a filling area of the vector graphics based on the line information.
Step a3, performing color filling in the filling area to obtain the vector graphic element.
Vector graphics are generally used to represent watermarks in the current page to be analyzed. After the line information of the vector graphics is extracted, the closed area enclosed by the lines can be determined using the even-odd rule or the non-zero winding rule, which gives the filling area of the vector graphics. Color filling is then performed in the filling area to obtain the vector graphic element. For example, if the watermark is filled with gray, the filling area is filled with gray, so that the format of the vector graphic element is consistent with the format of the vector graphics in the PDF document.
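As a minimal sketch of the even-odd rule mentioned above, the following function tests whether a point lies in the filling area of a closed polygon by counting how many edges a horizontal ray from the point crosses. The function name and the polygon representation (a list of (x, y) vertices) are assumptions for illustration; curved segments would first need to be flattened into straight edges.

```python
def inside_even_odd(point, polygon):
    """Return True if point lies inside polygon under the even-odd rule."""
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > px:
                inside = not inside  # an odd number of crossings means inside
    return inside
```

A point inside a square crosses the boundary once and is reported as inside; a point to the right of it crosses no edge and is reported as outside.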
In some alternative embodiments, if the content type includes an image, the web page element includes an image element. Based on this, the step S202 includes:
Step b1, extracting an image in the current page to be analyzed and determining the image format of the image, wherein the current page to be analyzed is determined based on the page information.
Step b2, parsing the image according to the image format to obtain the image element.
Different images in the current page to be analyzed may have different image formats, including but not limited to jpg, png, wmf, and so on. Each image is parsed according to its own format to obtain the image element.
Further, the extracted image is compressed before it is parsed, which reduces the storage space occupied by the image and reduces the bandwidth pressure on the image similarity analysis service when the subsequent similarity comparison is carried out.
For an image in the PDF, the image element is obtained by determining the image format and parsing the image based on that format, which ensures that the image element is consistent with the PDF document.
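One possible way to determine the image format before format-specific parsing is to inspect the leading bytes of the extracted image data, as sketched below. The `detect_image_format` function and the `SIGNATURES` table are hypothetical names; the embodiment does not state how the format is determined, so this is only an assumed approach. The byte signatures themselves are the standard ones for PNG, JPEG, and placeable WMF files.

```python
# (leading bytes, format name) pairs checked in order
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"\xd7\xcd\xc6\x9a", "wmf"),  # placeable WMF header
]


def detect_image_format(data: bytes) -> str:
    """Return the format name for the given image bytes, or "unknown"."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

The returned name can then be used to select the parser for that format.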
In some alternative embodiments, if the content type includes a table, the web page element includes a table element. Based on this, the step S202 includes:
Step c1, extracting line segments of a table in a current page to be analyzed, and determining position information of each line segment in the current page to be analyzed, wherein the current page to be analyzed is determined based on page information.
Step c2, generating a table element of the current page to be analyzed based on the position information of each line segment.
A table in the current page to be analyzed is parsed as line segments; in practice, no table element in the true sense exists. Specifically, the text in the table is parsed using the text processing described above, and the line segments of the table do not need to undergo similarity comparison. Therefore, the position information of the line segments of the table needs to be determined accurately, so that the table element is consistent with the table in the PDF document.
Based on the above, for the line segment of the table in the current to-be-resolved page, the position information of the line segment in the current to-be-resolved page needs to be determined in the resolving process, so that the table element of the current to-be-resolved page is generated based on the position information of each line segment.
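As a rough sketch of how positioned line segments might be turned into part of a table element, the function below maps one axis-aligned segment to an absolutely positioned HTML block. The function name, the `thickness` parameter, and the choice of a `<div>` per segment are assumptions; the sketch also assumes PDF coordinates with the origin at the bottom-left corner of the page, which are flipped to the top-left origin used in web pages.

```python
def segment_to_div(seg, page_height, thickness=1.0):
    """Convert an axis-aligned segment (x1, y1, x2, y2) to a positioned <div>."""
    x1, y1, x2, y2 = seg
    left = min(x1, x2)
    top = page_height - max(y1, y2)  # flip the vertical axis
    width = max(abs(x2 - x1), thickness)
    height = max(abs(y2 - y1), thickness)
    return (
        f'<div style="position:absolute;left:{left:g}px;top:{top:g}px;'
        f'width:{width:g}px;height:{height:g}px;background:black"></div>'
    )
```

A horizontal segment becomes a thin wide block and a vertical segment a thin tall block, so the set of blocks reproduces the table grid at the original positions.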
Step S203, rendering is carried out based on the webpage elements, and a webpage corresponding to the PDF document is generated. For details, refer to step S103 in the embodiment shown in fig. 1; they are not repeated here.
According to the method for converting the PDF document, the current page to be analyzed is determined through the page information, so the PDF page currently to be processed can be obtained accurately. Meanwhile, determining the ligature information and the sentence breaking information of the text during parsing further ensures that the text elements are displayed consistently with the text in the PDF. The feature information of the characters is used to perform ligature or sentence-break processing on the same line and on adjacent lines respectively, and this processing ensures that the text is displayed accurately.
In this embodiment, a method for converting a PDF document is provided, which may be used in an electronic device such as a computer, a mobile phone, or a tablet computer. Fig. 3 is a flowchart of a method for converting a PDF document according to an embodiment of the present invention. As shown in fig. 3, the flow includes the following steps:
Step S301, acquiring a PDF document to be converted and page information of the PDF document. For details, refer to step S101 in the embodiment shown in fig. 1; they are not repeated here.
Step S302, analyzing the PDF document page by page based on the page information to obtain the webpage elements corresponding to the content types of the pages in the PDF document.
The content type includes at least one of text, vector graphics, images, and tables, and the web page elements are used to form a web page. For details, refer to step S202 in the embodiment shown in fig. 2; they are not repeated here.
Step S303, rendering is carried out based on the webpage elements, and a webpage corresponding to the PDF document is generated.
Specifically, the step S303 includes:
Step S3031, determining a rendering tag of the webpage element based on the feature information of the webpage element.
The feature information of the webpage element includes character styles and sentence features, and the sentence features are determined based on the ligature information and the sentence breaking information of the text.
And determining rendering labels of the webpage elements by combining the characteristic information of the webpage elements so as to ensure that typesetting and formats in the webpage displayed after rendering are consistent with those in the PDF document. Rendering tags are used to characterize characteristic information of web page elements, including but not limited to character styles, sentence features, and the like.
For example, the content of a complete sentence is wrapped with a <span> tag that is given a unique ID. Suppose the document contains the following two sentences:
The instrument winch and the wellhead are kept on a straight line, the glass window of the operation room is clean, the line of sight is wide, and communication is good. For construction work at night, the well site should ensure illumination.
Complete sentences need to be characterized in the rendering tags, and each complete sentence has a unique ID. That is, the rendering tag corresponding to the sentence "The instrument winch and the wellhead are kept on a straight line, the glass window of the operation room is clean, the line of sight is wide, and communication is good." may be <span id="1">, and the rendering tag corresponding to the sentence "For construction work at night, the well site should ensure illumination." may be <span id="2">.
For another example, if the font color of "a straight line" in the first sentence is red, the corresponding rendering tag may be <span style="color: red">, and if the remaining text is black, the corresponding rendering tag may be <span style="color: black">.
For another example, if the two sentences belong to the same paragraph, the paragraph can be characterized by a <div> tag.
Step S3032, the webpage elements are structured based on the rendering tag, and structured information is obtained.
The structuring process may be considered as fusing the rendering tag with the web page element, i.e. wrapping the web page element with the rendering tag to obtain the structured information.
Continuing with the above example, when the sentences are wrapped by rendering tags, the resulting structured information may be:
<span id="1">The instrument winch and the wellhead are kept on a straight line, the glass window of the operation room is clean, the line of sight is wide, and communication is good.</span><span id="2">For construction work at night, the well site should ensure illumination.</span>
On the premise of sentence wrapping, content with the same style is wrapped by a <span> tag combined with the font color, and the resulting structured information may be:
<span id="1"><span style="color: black">The instrument winch and the wellhead are kept on </span><span style="color: red">a straight line</span><span style="color: black">, the glass window of the operation room is clean, the line of sight is wide, and communication is good.</span></span><span id="2"><span style="color: black">For construction work at night, the well site should ensure illumination.</span></span>
The <span> tags that wrap a complete sentence have ID attributes, while the <span> tags that distinguish different styles do not contain ID attributes.
Further, a complete paragraph is wrapped with a <div> tag that has a unique ID attribute. On this basis, the resulting structured information may be:
<div id="p_1"><span id="1"><span style="color: black">The instrument winch and the wellhead are kept on </span><span style="color: red">a straight line</span><span style="color: black">, the glass window of the operation room is clean, the line of sight is wide, and communication is good.</span></span><span id="2"><span style="color: black">For construction work at night, the well site should ensure illumination.</span></span></div>
Further, the complete content of one page is wrapped using a <div> tag, and each page has a unique ID attribute. Image content is referenced by an <img> tag and has a unique ID.
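The nesting described above (style runs inside sentences inside paragraphs) can be sketched as follows. The helper names and the input layout, a list of (sentence ID, list of (text, color) runs) pairs, are hypothetical and chosen only for this illustration; text is escaped so that characters such as "<" in the PDF text do not break the generated markup.

```python
from html import escape


def style_span(text, color):
    """Wrap a run of same-style text; style spans carry no ID attribute."""
    return f'<span style="color: {escape(color)}">{escape(text)}</span>'


def sentence_span(sentence_id, runs):
    """Wrap one complete sentence; sentence spans carry a unique ID."""
    inner = "".join(style_span(text, color) for text, color in runs)
    return f'<span id="{sentence_id}">{inner}</span>'


def paragraph_div(paragraph_id, sentences):
    """Wrap the sentences of one paragraph in a <div> with a unique ID."""
    inner = "".join(sentence_span(sid, runs) for sid, runs in sentences)
    return f'<div id="{escape(paragraph_id)}">{inner}</div>'
```

Calling `paragraph_div("p_1", ...)` with two sentences, the first split into black, red, and black runs, yields markup of the same shape as the paragraph example above.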
In some optional embodiments, step S3032 includes:
Step d1, detecting whether similar elements exist in the webpage elements.
Step d2, when similar elements exist for the webpage elements, generating marking information of the webpage elements.
Step d3, structuring the webpage elements based on the marking information and the rendering tags to obtain structured information.
During rendering of the webpage elements, a database or a data source can also be searched to determine whether similar elements exist. The database stores the documents to be compared, and the data source may be a document source that can be obtained over a network or another data source; no limitation is imposed here.
If an element similar to a webpage element is found, i.e., a similar element exists for the webpage element, marking information of the webpage element is generated. The marking information is used to mark that a similar element exists for the webpage element. The marking information includes, but is not limited to, highlighting, or different fonts, colors, etc.
After the marking information is determined, the marking information and the rendering tags are combined when the webpage elements are structured, and the structured information is obtained.
In the process of generating the webpage from the webpage elements, similarity comparison is also carried out. If similar elements exist, the marking information of the webpage elements is generated and the webpage elements are structured in combination with the rendering tags, so that the marks of the similarity comparison can be displayed in the webpage.
In some alternative embodiments, step d2 includes:
Step d21, when the web page element has similar elements, associating the first identification information corresponding to the web page element with the second identification information corresponding to the similar elements.
Step d22, determining the marking position and marking content of the webpage element in the webpage based on the first identification information and the second identification information.
The first identification information is used for representing the uniqueness of the webpage elements, and the second identification information is used for representing the uniqueness of the similar contrast elements.
When a similar element exists for a webpage element, the first identification information corresponding to the webpage element is associated with the second identification information corresponding to the similar element, so that the webpage element and the similar element are displayed in association with each other, which facilitates comparison and viewing.
The marking position and marking content of the webpage element in the webpage are determined based on the first identification information and the second identification information, so that when the webpage is displayed, the webpage element and the similar element are displayed at the same time using the corresponding identification information.
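A minimal sketch of similarity detection and marking is given below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, which is an assumption; the embodiment does not specify the comparison algorithm. The `class="similar"` and `data-similar-to` attributes, the 0.8 threshold, and the `corpus` mapping from second identification information to reference text are likewise hypothetical.

```python
from difflib import SequenceMatcher
from html import escape


def find_similar(text, corpus, threshold=0.8):
    """Return the ID of the most similar reference text, or None."""
    best_id, best_score = None, 0.0
    for ref_id, ref_text in corpus.items():
        score = SequenceMatcher(None, text, ref_text).ratio()
        if score > best_score:
            best_id, best_score = ref_id, score
    return best_id if best_score >= threshold else None


def mark_sentence(sentence_id, text, corpus, threshold=0.8):
    """Wrap a sentence; link it to the similar element when one is found."""
    ref_id = find_similar(text, corpus, threshold)
    if ref_id is None:
        return f'<span id="{sentence_id}">{escape(text)}</span>'
    # First identification info: id; second identification info: data-similar-to.
    return (
        f'<span id="{sentence_id}" class="similar" '
        f'data-similar-to="{escape(str(ref_id))}">{escape(text)}</span>'
    )
```

A stylesheet rule for the assumed `similar` class could then highlight the marked sentences, and the `data-similar-to` value could be used to display the matching reference element alongside.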
Step S3033, rendering is carried out according to the structured information, and a webpage is generated.
After the structured information is determined, it is rendered so that the web page is displayed on the browser interface.
According to the method for converting the PDF document, the rendering tag is determined based on the characteristic information of the webpage element, so that the determined rendering tag can represent the characteristic of the webpage element, and the consistency of the structured processing result with the PDF document can be ensured.
As a specific application embodiment of the present invention, in the field of bidding, the PDF document to be converted may be a PDF document submitted in a bid. After the PDF document is uploaded, the PDF document conversion method provided by the embodiment of the present invention converts the document and searches for similar elements. Finally, when the result is displayed in the browser interface, the webpage elements and their corresponding similar elements can be displayed on the same webpage at the same time and marked in the same manner, which facilitates comparison and viewing.
For example, FIG. 4 shows one example of a web page. The web page displays the comparison similarity of the PDF document to be converted and the number of webpage elements that have similar elements. In addition, the webpage elements and their corresponding similar elements are displayed at the same time. In FIG. 4, the webpage elements and their corresponding similar elements are marked with text boxes.
The method achieves high-fidelity preservation of the original layout and style of the PDF document, including elements such as text, images, and watermarks. It ensures that the presentation of the document in the browser is highly consistent with the original PDF document, which makes it convenient for users to view and operate. A comprehensive content extraction and comparison algorithm is adopted to independently extract and analyze the similarity of various content types such as text, images, and watermarks. Comprehensively analyzing these results can improve the accuracy of similarity detection, which makes the method suitable for detecting similar content in documents. The web page in the browser provides quick marking and positioning functions, so that a user can quickly find similar content in the document. The optimized algorithm design takes computational efficiency into account while ensuring high-fidelity rendering and accurate content detection. Through reasonable algorithm optimization and resource utilization, complex documents can be processed in a short time, which meets the processing requirements of large-scale documents. The method supports comprehensive analysis of multiple content types (such as text, images, and watermarks) in the document and combines the feature information of the elements to make similarity judgments. This multi-element comprehensive analysis improves the recognition capability and accuracy for complex documents.
The scheme is not only suitable for the bidding field described above, but can also be applied to other scenarios that need high-fidelity rendering of PDF documents and detection of similar content, such as academic paper review and copyright protection. The method is therefore highly adaptable and practical and has broad application prospects.
This embodiment also provides a device for converting a PDF document, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a conversion apparatus for PDF documents, as shown in fig. 5, including:
The acquiring module 501 is configured to acquire a PDF document to be converted and page information of the PDF document.
The parsing module 502 is configured to parse the PDF document page by page based on the page information, to obtain web page elements corresponding to content types of each page in the PDF document, where the content types include at least one of text, vector graphics, images, and tables, and the web page elements are used to form a web page.
The rendering module 503 is configured to render based on the webpage elements and generate a webpage corresponding to the PDF document.
In some alternative embodiments, if the content type includes text, the corresponding web page element includes a text element, and the parsing module 502 includes:
The page determining unit is configured to determine the current page to be analyzed based on the page information.
The text parsing unit is configured to parse the text in the current page to be analyzed and determine the ligature information and the sentence breaking information of the text to obtain the text element.
In some alternative embodiments, the text parsing unit includes:
The feature information determining subunit is configured to determine the feature information of each character of each line of characters in the text.
The text element determining subunit is configured to perform ligature or sentence-break processing on characters in the same line and in adjacent lines based on the feature information, and to determine the ligature information and the sentence breaking information to obtain the text element.
In some alternative embodiments, the text element determination subunit includes:
The association degree calculating subunit is configured to calculate, for characters in the same line, the association degree between two adjacent characters based on the feature information.
The first determining subunit is configured to form ligature information from the two adjacent characters if the association degree is larger than a preset association value, and otherwise to perform a sentence break between the two adjacent characters to form sentence breaking information.
The margin determining subunit is configured to calculate, for characters in adjacent lines and based on the feature information, a first margin of the first character of a first line of the adjacent lines and a second margin of the last character of a second line of the adjacent lines, the second line being located above the first line.
The second determining subunit is configured to form ligature information from the first character of the first line and the last character of the second line if the difference between the first margin and the second margin is smaller than a preset margin value, and otherwise to perform a sentence break between the first character of the first line and the last character of the second line to form sentence breaking information.
In some alternative embodiments, if the content type includes a vector graphic, the web page element includes a vector graphic element, and the parsing module 502 includes:
The line information extraction unit is configured to extract line information of the vector graphics in the current page to be analyzed, where the current page to be analyzed is determined based on the page information.
The filling area determining unit is configured to determine a filling area of the vector graphics based on the line information.
The color filling unit is configured to perform color filling in the filling area to obtain the vector graphic element.
In some alternative embodiments, if the content type includes an image, the web page element includes an image element, and the parsing module 502 includes:
The image extraction unit is configured to extract an image in the current page to be analyzed and determine the image format of the image, where the current page to be analyzed is determined based on the page information.
The image parsing unit is configured to parse the image according to the image format to obtain the image element.
In some alternative embodiments, if the content type includes a table, the web page element includes a table element, and the parsing module 502 includes:
The line segment extraction unit is configured to extract the line segments of the table in the current page to be analyzed and determine the position information of each line segment in the current page to be analyzed, where the current page to be analyzed is determined based on the page information.
The table element generating unit is configured to generate the table element of the current page to be analyzed based on the position information of each line segment.
In some alternative embodiments, the rendering module 503 includes:
The rendering tag determining unit is configured to determine a rendering tag of the webpage element based on the feature information of the webpage element, where the feature information of the webpage element includes character styles and sentence features, and the sentence features are determined based on the ligature information and the sentence breaking information of the text.
The structuring processing unit is configured to structure the webpage elements based on the rendering tags to obtain structured information.
The information rendering unit is configured to render according to the structured information to generate a webpage.
In some alternative embodiments, the structured processing unit comprises:
The detection subunit is configured to detect whether similar elements exist for the webpage elements.
The marking information generation subunit is configured to generate marking information of the webpage elements when similar elements exist for the webpage elements.
The structuring processing subunit is configured to structure the webpage elements based on the marking information and the rendering tags to obtain structured information.
In some alternative embodiments, the marking information includes a marking position and marking content, and the marking information generation subunit includes:
The association subunit is configured to associate the first identification information corresponding to the webpage element with the second identification information corresponding to the similar element when a similar element exists for the webpage element.
The third determining subunit is configured to determine the marking position and marking content of the webpage element in the webpage based on the first identification information and the second identification information.
The first identification information is used for representing the uniqueness of the webpage elements, and the second identification information is used for representing the uniqueness of the similar contrast elements.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The PDF document conversion device in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and a memory that execute one or more software or fixed programs, and/or other devices that can provide the above functions.
The embodiment of the invention also provides computer equipment, which is provided with the device for converting the PDF document shown in the figure 5.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 6, the computer device includes one or more processors 10, a memory 20, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is taken as an example in fig. 6.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to implement the methods shown in the above embodiments.
The memory 20 may include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 20 may comprise volatile memory, such as random access memory, or nonvolatile memory, such as flash memory, hard disk or solid state disk, or the memory 20 may comprise a combination of the above types of memory.
The computer device further comprises an input device 30 and an output device 40. The processor 10, memory 20, input device 30, and output device 40 may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 6.
The input device 30 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer device. Examples of the input device include a touch screen, a keypad, a mouse, a trackpad, a pointer stick, one or more mouse buttons, a trackball, a joystick, and the like. The output device 40 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light-emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code that is originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded through a network, and stored on a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random-access memory, a flash memory, a hard disk, a solid state disk, or the like; the storage medium may also include a combination of the above types of memory. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a storage component that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the method shown in the above embodiments.
Portions of the present disclosure may be implemented as a computer program product, such as computer program instructions which, when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present disclosure through the operation of the computer. Those skilled in the art will appreciate that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, installation package files, and the like. Accordingly, the manner in which computer program instructions are executed by a computer includes, but is not limited to, the computer directly executing the instructions, the computer compiling the instructions and then executing the corresponding compiled program, the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Herein, a computer-readable medium may be any available computer-readable storage medium or communication medium that can be accessed by a computer.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (12)

1. A method for converting a PDF document, characterized in that the method comprises:
obtaining a PDF document to be converted and page information of the PDF document;
parsing the PDF document page by page based on the page information to obtain web page elements corresponding to the content type of each page in the PDF document, wherein the content type includes at least one of text, vector graphics, images, and tables, and the web page elements are used to form a web page;
rendering based on the web page elements to generate a web page corresponding to the PDF document.

2. The method according to claim 1, characterized in that, if the content type includes text, the corresponding web page element includes a text element, and the parsing of the PDF document page by page based on the page information to obtain the web page elements corresponding to the content type of each page comprises:
determining a current page to be parsed based on the page information;
parsing the text in the current page to be parsed and determining hyphenation information and sentence segmentation information of the text, to obtain the text element.

3. The method according to claim 2, characterized in that the parsing of the text in the current page to be parsed and determining of the hyphenation information and sentence segmentation information of the text, to obtain the text element, comprises:
determining feature information of each character in each line of the text;
joining characters in the same line and in adjacent lines, or placing sentence breaks between them, based on the feature information, thereby determining the hyphenation information and the sentence segmentation information and obtaining the text element.

4. The method according to claim 3, characterized in that the joining of text in the same line and in adjacent lines, or placing of sentence breaks, based on the feature information, thereby determining the hyphenation information and the sentence segmentation information and obtaining the text element, comprises:
for characters in the same line, calculating a degree of association between two adjacent characters based on the feature information;
if the degree of association is greater than a preset association value, forming hyphenation information from the two adjacent characters; otherwise, placing a sentence break between the two adjacent characters to form sentence segmentation information;
for characters in adjacent lines, calculating, based on the feature information, a first page margin of the first character of a first line of the adjacent lines and a second page margin of the last character of a second line of the adjacent lines, the second line being located above the first line;
if the difference between the first page margin and the second page margin is less than a preset margin value, forming hyphenation information from the first character of the first line and the last character of the second line; otherwise, placing a sentence break between the first character of the first line and the last character of the second line to form sentence segmentation information.

5. The method according to claim 1, characterized in that, if the content type includes vector graphics, the web page element includes a vector graphic element, and the parsing of the PDF document page by page based on the page information to obtain the web page elements corresponding to the content type of each page in the PDF document comprises:
extracting line information of the vector graphic in a current page to be parsed, wherein the current page to be parsed is determined based on the page information;
determining a fill area of the vector graphic based on the line information;
performing color filling in the fill area to obtain the vector graphic element.

6. The method according to claim 1, characterized in that, if the content type includes an image, the web page element includes an image element, and the parsing of the PDF document page by page based on the page information to obtain the web page elements corresponding to the content type of each page in the PDF document comprises:
extracting an image from a current page to be parsed and determining an image format of the image, wherein the current page to be parsed is determined based on the page information;
parsing the image according to the image format to obtain the image element.

7. The method according to claim 1, characterized in that, if the content type includes a table, the web page element includes a table element, and the parsing of the PDF document page by page based on the page information to obtain the web page elements of each page comprises:
extracting line segments of the table in a current page to be parsed and determining position information of each line segment in the current page to be parsed, wherein the current page to be parsed is determined based on the page information;
generating the table element of the current page to be parsed based on the position information of each line segment.

8. The method according to any one of claims 1 to 7, characterized in that the rendering based on the web page elements to generate the web page corresponding to the PDF document comprises:
determining a rendering tag of the web page element based on feature information of the web page element, wherein the feature information of the web page element includes character style and sentence features, and the sentence features are determined based on the hyphenation information and sentence segmentation information of the text;
structuring the web page element based on the rendering tag to obtain structured information;
rendering according to the structured information to generate the web page.

9. The method according to claim 8, characterized in that the structuring of the web page element based on the rendering tag to obtain structured information comprises:
detecting whether a similar element exists for the web page element;
when a similar element exists for the web page element, generating marking information of the web page element;
structuring the web page element based on the marking information and the rendering tag to obtain the structured information.

10. The method according to claim 9, characterized in that the marking information includes a marking position and marking content, and the generating of the marking information of the web page element when a similar element exists for the web page element comprises:
when a similar element exists for the web page element, associating first identification information corresponding to the web page element with second identification information corresponding to the similar element;
determining the marking position and marking content of the web page element in the web page based on the first identification information and the second identification information;
wherein the first identification information is used to represent the uniqueness of the web page element, and the second identification information is used to represent the uniqueness of the similar element under comparison.

11. A device for converting a PDF document, characterized in that the device comprises:
an acquisition module, configured to obtain a PDF document to be converted and page information of the PDF document;
a parsing module, configured to parse the PDF document page by page based on the page information to obtain web page elements corresponding to the content type of each page in the PDF document, wherein the content type includes at least one of text, vector graphics, images, and tables, and the web page elements are used to form a web page;
a rendering module, configured to render based on the web page elements to generate a web page corresponding to the PDF document.

12. A computer device, characterized by comprising:
a memory and a processor, wherein the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor performs the method for converting a PDF document according to any one of claims 1 to 10 by executing the computer instructions.
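
As a non-limiting illustration, the joining and sentence-breaking rules recited in claim 4 might be sketched in Python as follows. The claims do not define how the degree of association is calculated, so the gap-based formula below, the `Char` fields, the function names, the threshold values, and the use of the absolute margin difference are all assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Feature information of one character (hypothetical fields)."""
    text: str
    x0: float    # left edge of the glyph, measured from the left page edge
    x1: float    # right edge of the glyph
    size: float  # font size


def association(a: Char, b: Char) -> float:
    """Assumed association degree: 1.0 when the glyphs touch,
    falling linearly to 0.0 once the gap reaches one font size."""
    gap = max(0.0, b.x0 - a.x1)
    return max(0.0, 1.0 - gap / max(a.size, b.size))


def split_line(chars: List[Char], preset_association: float = 0.5) -> List[str]:
    """Same-line rule: adjacent characters are joined when their association
    exceeds the preset value; otherwise a break is placed between them."""
    segments: List[str] = []
    current = ""
    prev = None
    for ch in chars:
        if prev is not None and association(prev, ch) <= preset_association:
            segments.append(current)
            current = ""
        current += ch.text
        prev = ch
    if current:
        segments.append(current)
    return segments


def lines_continue(upper: List[Char], lower: List[Char],
                   page_width: float, preset_margin: float = 5.0) -> bool:
    """Adjacent-line rule: compare the left margin of the first character of
    the lower line with the right margin of the last character of the upper
    line; a small difference is treated as a continuation (join)."""
    first_margin = lower[0].x0
    second_margin = page_width - upper[-1].x1
    return abs(first_margin - second_margin) < preset_margin
```

Under these assumptions, a full-width line that ends close to the right margin is joined with the line below it, while a short line ending far from the right margin is followed by a sentence break.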

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411167727.9A CN119129529A (en) 2024-08-23 2024-08-23 PDF document conversion method, device, equipment, storage medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411167727.9A CN119129529A (en) 2024-08-23 2024-08-23 PDF document conversion method, device, equipment, storage medium and product

Publications (1)

Publication Number Publication Date
CN119129529A true CN119129529A (en) 2024-12-13

Family

ID=93769320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411167727.9A Pending CN119129529A (en) 2024-08-23 2024-08-23 PDF document conversion method, device, equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN119129529A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119338669A (en) * 2024-12-24 2025-01-21 北京数科网维技术有限责任公司 A method, device and equipment for converting target graphic effects in document conversion

Similar Documents

Publication Publication Date Title
CN111723807B (en) End-to-end deep learning recognition machine for typing characters and handwriting characters
US10915788B2 (en) Optical character recognition using end-to-end deep learning
US20200117961A1 (en) Two-dimensional document processing
US8539342B1 (en) Read-order inference via content sorting
JP4461769B2 (en) Document retrieval / browsing technique and document retrieval / browsing device
CN113642584B (en) Character recognition method, device, equipment, storage medium and intelligent dictionary pen
JP4945813B2 (en) Print structured documents
US20140225928A1 (en) Manipulation of textual content data for layered presentation
US9910841B2 (en) Annotation data generation and overlay for enhancing readability on electronic book image stream service
CN115917613A (en) Semantic representation of text in a document
JP2022052716A (en) Query of semantic data from unstructured document
US11934774B2 (en) Systems and methods for generating social assets from electronic publications
US20240303880A1 (en) Method of generating image sample, method of recognizing text, device and medium
CN115659917A (en) Document format restoration method and device, electronic equipment and storage equipment
CN107590288B (en) Method and device for extracting webpage image-text blocks
US20130124684A1 (en) Visual separator detection in web pages using code analysis
CN119129529A (en) PDF document conversion method, device, equipment, storage medium and product
CN116245052A (en) A drawing migration method, device, equipment and storage medium
Hu et al. Analysis of documents born digital
CN112417826A (en) PDF online editing method and device, electronic equipment and readable storage medium
CN115481599A (en) Document processing method and device, electronic equipment and storage medium
WO2025107898A1 (en) Document processing method and apparatus, content generation method and apparatus, and electronic device
CN113886582B (en) Document processing method and device, and image data extraction method and device
CN118228690A (en) Method, device, computer equipment and storage medium for processing tables in PDF documents
CN110457659B (en) Clause document generation method and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination