CN119938908B

CN119938908B - Intelligent abstract extraction method and system for bidding document based on image recognition

Info

Publication number: CN119938908B
Application number: CN202510421859.8A
Authority: CN
Inventors: 杨雨薇; 季倩倩; 范兴健; 刘良; 丁婕尧; 曹丽; 王中洋
Original assignee: NANTONG INSTITUTE OF TECHNOLOGY
Current assignee: NANTONG INSTITUTE OF TECHNOLOGY
Priority date: 2025-04-07
Filing date: 2025-04-07
Publication date: 2025-07-01
Anticipated expiration: 2045-04-07
Also published as: CN119938908A

Abstract

The invention relates to the technical field of intelligent document extraction and discloses an intelligent abstract extraction method and system for bidding documents based on image recognition. The method comprises: according to the application field to which a bidding document belongs, entering the bidding document information through image scanning; establishing an abstract extraction model to extract bidding document keywords; building a bidding document abstract by ranking the keywords; and checking, by a tuning correction unit, whether the abstract reads smoothly and conforms to the semantics of the initial bidding document. If the abstract reads smoothly, the check of whether it conforms to the initial bidding document semantics continues; if it does not, synonym replacement and root association are applied to the keywords and the keywords are reordered. If the initial bidding document semantics are met, intelligent abstract extraction of the bidding document is complete; if they are not met, the keywords are checked again for correctness. Intelligent abstract extraction is thereby realized with complete algorithm logic, and the accuracy of abstract extraction for bidding documents is ensured.

Description

Intelligent abstract extraction method and system for bidding document based on image recognition

Technical Field

The invention relates to the technical field of intelligent document extraction, and in particular to an intelligent abstract extraction method and system for bidding documents based on image recognition.

Background

Image recognition technology has developed rapidly in recent years; in particular, deep learning algorithms have markedly improved recognition accuracy. Intelligent abstract extraction of bidding documents is one application of image recognition technology: key information is automatically extracted from scanned documents or documents in image form, either to generate a document abstract or to support subsequent data processing. Many shortcomings remain, however. Bidding documents come in a variety of formats and layouts and include complex elements such as tables and charts, which pose challenges to image recognition. High-quality training data is crucial to building an accurate model, yet in practice a large amount of well-labeled training data is not easy to obtain. In addition, bidding documents in different industries differ in terminology, format and the like, so a single model can hardly cover all scenarios and must be adjusted or retrained for each field.

For example, the Chinese patent application with publication number CN118504559A discloses an intelligent extraction method and system for legal and regulatory annotation documents. The method comprises: collecting the text of an original legal and regulatory annotation document as input text; performing data preprocessing on the text; constructing features based on feature engineering, forming at least title features, table text features, non-table text features and symbol features; extracting key information from the text using the constructed features; and, from the extracted key information, automatically identifying key entity information in the legal and regulatory document through text scanning, splitting, feature comparison, regular-expression matching and the like. Compared with the prior art, the method and the device can improve the accuracy of 2D gaze point estimation.

The training data sets of the model in the above patent may not be sufficiently diverse, so the model may perform poorly when faced with new types or formats of documents; different documents may, for example, differ in terminology habits and format layout. Natural language processing techniques have made great progress but still have limitations in understanding and extracting complex semantics. Legal and regulatory texts often contain many technical terms and complex logical relationships, which place high demands on machine understanding. Methods that rely on fixed rules for information extraction may fail when text formats are flexible and varied: regular expressions, for example, are very useful in certain scenarios but become inflexible in the face of complex text structures.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the application and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description of the application and in the title of the application, which may not be used to limit the scope of the application.

In order to solve the technical problems, the main object of the present invention is to provide an intelligent abstract extraction method for bidding documents based on image recognition, comprising:

S1, selecting the application field to which a bidding document belongs according to the bidding document (the application fields are illustrated by example in the detailed description);

S2, entering bidding document information through image scanning (the image scanning itself can be implemented by conventional technical means), and establishing an abstract extraction model to extract bidding document keywords;

S3, establishing a bidding document abstract through keyword sequencing, and checking whether the abstract is smooth and accords with the semantics of an initial bidding document by a tuning correction unit;

S4, if the abstract is smooth, continuously checking whether the abstract accords with the initial bidding document semantics, if the abstract is not smooth, carrying out synonymous replacement and root association on the keywords, and reordering the keywords;

S5, if the initial bidding document semantics are met, intelligent abstract extraction of the bidding document is completed, and if the initial bidding document semantics are not met, whether the keywords are correct is checked again.
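
The control flow of steps S3 to S5 can be sketched as follows. This is a minimal sketch rather than the patented implementation: the callables `is_fluent`, `is_consistent`, `repair` and `reextract`, the `build_abstract` helper and the `max_rounds` cap are all hypothetical stand-ins for the tuning correction unit, the semantic check, the synonym and root repair step, and keyword re-screening.

```python
from typing import Callable, List


def build_abstract(keywords: List[str]) -> str:
    # Hypothetical abstract builder: joins the ranked keywords in order.
    return ", ".join(keywords)


def extract_abstract(
    keywords: List[str],
    is_fluent: Callable[[str], bool],
    is_consistent: Callable[[str], bool],
    repair: Callable[[List[str]], List[str]],
    reextract: Callable[[], List[str]],
    max_rounds: int = 10,  # assumed safety cap, not stated in the source
) -> str:
    for _ in range(max_rounds):
        abstract = build_abstract(keywords)   # S3: build abstract from ranking
        if not is_fluent(abstract):           # S4: not smooth, repair and reorder
            keywords = repair(keywords)
            continue
        if is_consistent(abstract):           # S5: semantics met, extraction done
            return abstract
        keywords = reextract()                # S5: semantics not met, re-check keywords
    raise RuntimeError("no acceptable abstract within max_rounds")
```

The loop makes the two failure paths explicit: a fluency failure changes the wording or order of the existing keywords, while a semantic failure sends the process back to keyword screening.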

As a preferable scheme of the intelligent abstract extraction method of the bidding document based on image recognition, the invention comprises the following steps:

the image scanning input bidding document information converts the document into an editable electronic text format through image recognition and natural language processing;

the image recognition is used for extracting text content from an editable electronic text format;

the natural language processing is used for primarily understanding the extracted text.

The abstract extraction model comprises a word density analysis unit, a word priority screening unit, a word priority sorting unit and an abstract generation and inspection unit;

The word density analysis unit is used for extracting and checking the occurrence density of words and extracting word characteristics through a characteristic extraction function;

the word priority screening unit calculates the priority weight of the words by setting word priority ordering logic;

The word priority ranking unit is used for receiving the word priority weights and ranking the word priorities;

The abstract generation and inspection unit is used for generating a bidding document abstract according to word priority order and performing integrity inspection on the bidding document abstract.

The word density analysis unit extracts the text content in the editable electronic text format and performs region segmentation: a two-dimensional coordinate system of size a × a is established, the grids are numbered from left to right and from top to bottom, and the occurrence density of words in each grid is calculated. The calculation expression of the word density analysis unit is as follows:

wherein $D_{ij}$ is the word density of the grid with abscissa $i$ and ordinate $j$ in the electronic text format, $w_k$ is the weight of the $k$-th word, $l_k$ is the length of the $k$-th word, $d_k(i,j)$ is the distance factor of the $k$-th word in grid $(i, j)$, and $N$ is the total number of words;

the distance factor calculation expression is:

wherein $e$ is the exponential constant, $x_k$ is the abscissa position of word $k$, $x_c$ is the abscissa of the central position of the grid, $y_k$ is the ordinate position of word $k$, $y_c$ is the ordinate of the central position of the grid, and $\alpha$ is a distance coefficient related to word spacing;

The distance factor is a value between 0 and 1, indicating that the importance of the word varies with position.
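
The grid density computation can be illustrated with a short sketch. The exact expressions are not reproduced in this text, so the forms used here are assumptions chosen only to agree with the stated definitions: the distance factor is taken as an exponential decay in the Euclidean distance between the word and the grid centre, and the density of a grid as the sum of weight × length × distance factor over the words in that grid, divided by the total number of words. The `Word` class, the `cell` size parameter and both function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: float       # abscissa of the word on the page (assumed non-negative)
    y: float       # ordinate of the word on the page (assumed non-negative)
    weight: float  # word weight, assumed to be supplied by an earlier step


def distance_factor(word: Word, cx: float, cy: float, alpha: float) -> float:
    # Assumed form: exp(-alpha * distance to grid centre); lies in (0, 1] for alpha >= 0.
    return math.exp(-alpha * math.hypot(word.x - cx, word.y - cy))


def grid_density(words: List[Word], a: int, cell: float, alpha: float = 1.0) -> List[List[float]]:
    # Returns an a x a matrix; density[j][i] is the grid with abscissa i and ordinate j.
    density = [[0.0] * a for _ in range(a)]
    n = len(words)
    if n == 0:
        return density
    for w in words:
        i = min(int(w.x // cell), a - 1)          # clamp words on the far edge
        j = min(int(w.y // cell), a - 1)
        cx, cy = (i + 0.5) * cell, (j + 0.5) * cell
        density[j][i] += w.weight * len(w.text) * distance_factor(w, cx, cy, alpha) / n
    return density
```

With this assumed form, a word sitting exactly at a grid centre contributes its full weight × length, and the contribution decays toward zero as the word moves away from the centre, which matches the statement that the distance factor lies between 0 and 1.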

The word priority screening unit collects the density and position of the grid words from the region segmentation, performs feature extraction on the collected density and position, calculates weights from the extracted features, and finally sorts and outputs the grid words according to their priority weights;

The feature extraction extracts the term frequency-inverse document frequency (TF-IDF) index of each grid word, the length of the grid word and the position weight of the grid word.

The calculation expression of the weight calculation is as follows:

wherein $P_k$ is the grid word priority weight, $T_k$ is the TF-IDF index of the grid word, $l_k$ is the length of word $k$, and $q_k$ is the position weight of word $k$;

the position weight calculation expression is as follows:

wherein $\beta_1$ is the position weight coefficient for a grid word in a title, $\beta_2$ is the position weight coefficient for a grid word in the first sentence, $\beta_3$ is the position weight coefficient for a grid word within a sentence, $\mathrm{pos}_k$ is the position of the $k$-th word in the bidding document, and a further value applies in all other cases;

The grid word priority weight, the TF-IDF index of the grid word, the word length and the word position weight are normalized by min-max normalization;

The min-max normalization calculation expression is as follows:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

wherein $x'$ is the data after min-max normalization, $x$ is the input data to be normalized, min is the minimum function, and max is the maximum function;

the word priority ranking unit receives the word priority weights after min-max normalization and ranks the word priorities.
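
The normalization and ranking steps can be sketched as below. Only the min-max normalization itself is taken from the text; the rest is assumed for illustration: the coefficient values in `POSITION_WEIGHT` and `OTHER_WEIGHT` are hypothetical, and the three normalized features are simply summed because the combination rule is not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical position weight coefficients; the text names the cases but not the values.
POSITION_WEIGHT = {"title": 1.0, "first_sentence": 0.8, "in_sentence": 0.5}
OTHER_WEIGHT = 0.2


def position_weight(position: str) -> float:
    return POSITION_WEIGHT.get(position, OTHER_WEIGHT)


def min_max(values: List[float]) -> List[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all inputs equal: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def rank_words(features: Dict[str, Tuple[float, str]]) -> List[Tuple[str, float]]:
    # features maps each word to (TF-IDF score, position label).
    words = list(features)
    tfidf = min_max([features[w][0] for w in words])
    length = min_max([float(len(w)) for w in words])
    pos = min_max([position_weight(features[w][1]) for w in words])
    # Assumed combination: plain sum of the three normalized features.
    scores = [t + l + p for t, l, p in zip(tfidf, length, pos)]
    return sorted(zip(words, scores), key=lambda ws: ws[1], reverse=True)
```

Normalizing each feature to the range 0 to 1 before combining them keeps a feature with a large raw scale, such as word length, from dominating the ranking.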

The keyword ranking sorts the extracted grid words by weight from high to low;

The establishing of the bidding document abstract comprises the steps of selecting the first N keywords from the ordered keyword list, and constructing a logically-consistent sentence or paragraph as the abstract;

setting a screening threshold value through the word priority ranking, and screening out high-priority words as the keywords;
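
Keyword selection by threshold and top-N cut-off can be sketched as follows. The function name `select_keywords` and the inclusive comparison against the threshold are assumptions; the text specifies only that a screening threshold is set and that the first N keywords are taken from the ranked list.

```python
from typing import List, Tuple


def select_keywords(ranked: List[Tuple[str, float]], n: int, threshold: float) -> List[str]:
    # ranked: (word, weight) pairs already sorted from high to low weight.
    kept = [word for word, weight in ranked if weight >= threshold]
    return kept[:n]  # first N high-priority words become the keywords
```

For example, `select_keywords([("project", 0.9), ("amount", 0.7), ("the", 0.1)], n=2, threshold=0.5)` keeps `"project"` and `"amount"` and drops the low-weight word.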

and constructing a correction function and a pruning function through the tuning correction unit, and carrying out grammar checking and semantic consistency checking on the abstract, wherein the correction function is used for correcting symmetry, compactness, vanishing distance and orthogonality problems during image feature extraction, and the pruning function compensates the word density analysis unit through a database and text image information.

If the abstract reads smoothly, its semantic consistency is checked again with a semantic similarity algorithm; if it does not, synonym replacement, root association and reordering are performed until it reads smoothly;

If the abstract does not conform to the initial bidding document semantics, the keywords of the bidding document are screened again by the word density analysis unit and the word priority ranking unit, and rechecked for missing or wrong keywords;

The semantic similarity algorithm measures the similarity of two vectors through cosine similarity, and the calculation expression is as follows:

$$S = \cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert}$$

wherein $S$ is the similarity between the abstract and the text, $\cos\theta$ is the cosine similarity, $\mathbf{A}$ is the vector representation of the abstract, and $\mathbf{B}$ is the vector representation of the bidding document;
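
A minimal sketch of the cosine similarity check is given below. The text does not say how the abstract and the bidding document are turned into vectors, so the bag-of-words counting in `bow_vectors` is an assumption made only to keep the example self-contained; any other vector representation could be substituted.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def bow_vectors(text_a: str, text_b: str) -> Tuple[List[int], List[int]]:
    # Assumed representation: word counts over the shared vocabulary.
    count_a = Counter(text_a.lower().split())
    count_b = Counter(text_b.lower().split())
    vocab = sorted(set(count_a) | set(count_b))
    return [count_a[t] for t in vocab], [count_b[t] for t in vocab]
```

In use, the similarity would be compared against a preset acceptance threshold to decide whether the abstract conforms to the semantics of the initial bidding document; the value of that threshold is not given in the text.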

For an extracted keyword k, the synonym replacement matches a group of synonyms from a preset synonym library and selects the synonym with the highest semantic matching degree to replace the original keyword;

For a specific word t, the root association takes its root, root(t), and retrieves synonyms and derivatives of that root from a preset lexicon for replacement.
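
Synonym replacement and root association can be sketched as below. Everything concrete in the sketch is hypothetical: the `SYNONYMS` library and `LEXICON` contents are toy data, the `score` callable stands in for whatever semantic matching measure is used, and `root` is a naive suffix stripper standing in for a real stemmer.

```python
from typing import Callable, Dict, List

# Hypothetical preset synonym library and lexicon (toy data for illustration).
SYNONYMS: Dict[str, List[str]] = {"deadline": ["due date", "time limit"]}
LEXICON: List[str] = ["construct", "construction", "constructor", "contract"]


def replace_synonym(keyword: str, score: Callable[[str, str], float]) -> str:
    # Pick the candidate with the highest matching score; keep the keyword if none exist.
    candidates = SYNONYMS.get(keyword, [])
    if not candidates:
        return keyword
    return max(candidates, key=lambda c: score(keyword, c))


def root(word: str) -> str:
    # Naive suffix stripping, a stand-in for a real stemmer.
    for suffix in ("ion", "or", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def root_associates(word: str) -> List[str]:
    # Retrieve other lexicon entries that share the root of the given word.
    r = root(word)
    return [w for w in LEXICON if w != word and w.startswith(r)]
```

Passing the scoring function in as a parameter keeps the replacement step independent of how semantic matching is measured.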

The intelligent abstract extraction system of the bidding document based on image recognition comprises:

The file identification module comprises a scanning unit for scanning bidding file information and a classification unit for defining the technical field of bidding files;

the input module comprises an input unit for image scanning input, an image recognition unit for extracting text content from an editable electronic text format, and a natural language processing unit for primarily understanding the extracted text;

the feedback module comprises a tuning correction unit, a semantic similarity algorithm, synonym replacement and root association.

As a preferable scheme of the intelligent abstract extraction system of the bidding document based on image recognition, the invention comprises the following steps:

The invention has the beneficial effects that:

The abstract extraction model addresses the problem that different documents may differ in terminology habits. Through semantic intelligent segmentation and intelligent feature extraction based on image recognition, key information is extracted from bidding documents, the understanding of document images is enhanced, extraction is made intelligent and automatic, and extraction accuracy is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is a flow chart of a bid document intelligent abstract extraction method based on image recognition;

FIG. 2 is a workflow diagram of a word density analysis unit of the intelligent abstract extraction method of bidding documents based on image recognition of the present invention;

FIG. 3 is a system diagram of the intelligent abstract extraction system of bidding documents based on image recognition;

FIG. 4 is a general block diagram of the intelligent abstract extraction system of bidding documents based on image recognition of the present invention;

FIG. 5 is an internal block diagram of the intelligent abstract extraction system of bidding documents based on image recognition of the present invention;

FIG. 6 is a diagram of the intelligent abstract extraction system structure of the bidding document based on image recognition;

FIG. 7 is a block diagram of an intelligent abstract extraction system for bidding documents based on image recognition.

Reference numerals: 1, cabinet body; 2, total layer cabin door; 3, lower layer cabin door; 4, floor with rollers; 5, wireless communication module; 6, keyboard; 7, identification camera; 8, laser printer; 9, printer paper outlet supporting plate; 10, host; 11, file identification module; 12, input module; 13, lightning protection and power distribution unit; 14, data processing unit; 15, storage drawer.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

Example 1

As shown in fig. 1, the intelligent abstract extraction method of the bidding document based on image recognition comprises the following steps:

S1, selecting an application field to which a bidding document belongs according to the bidding document;

In this embodiment, the bidding document may belong to any of several fields, such as construction engineering, information technology, medical equipment, energy development and environmental protection projects. The field can be preliminarily determined by analyzing key information in the bidding document such as the project description, technical requirements and contract terms. For example, a bidding document that contains a large amount of content on "building structural design", "construction material selection" and "construction progress plan" can be determined to belong to the field of construction engineering. Once the application field is determined, subsequent information processing and abstract extraction can be targeted more accurately at that field.
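
The field selection in S1 can be illustrated with a simple term-counting sketch. The `FIELD_TERMS` table and the rule of picking the field with the most term occurrences are assumptions for illustration; the embodiment states only that the field is determined by analyzing key information in the document.

```python
from typing import Dict, List

# Hypothetical indicative terms per application field.
FIELD_TERMS: Dict[str, List[str]] = {
    "construction engineering": [
        "building structural design",
        "construction material",
        "construction progress",
    ],
    "medical equipment": ["medical device", "sterilization", "clinical"],
}


def classify_field(text: str) -> str:
    lowered = text.lower()
    counts = {
        field: sum(lowered.count(term) for term in terms)
        for field, terms in FIELD_TERMS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"
```

Returning `"unknown"` when no term matches avoids forcing a document into a field on no evidence.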

S2, entering bidding document information through image scanning, and establishing an abstract extraction model to extract bidding document keywords;

In this embodiment, the bidding document information is entered through image scanning. Modern OCR (optical character recognition) technology has high recognition accuracy, and the image scanning of the invention uses OCR to convert paper bidding documents, or electronic bidding documents in PDF and other formats, into an editable text format. OCR converts them into computer-readable text data by recognizing the shapes and arrangement of the characters in the image. Entering the bidding document information through image scanning greatly reduces the time and cost of manual entry. An abstract extraction model is then established to extract the keywords of the bidding document: the text content is analyzed to extract keywords that summarize the main information of the document, condensing a large amount of text into a small number of keywords so that readers can quickly understand the main content. For example, keywords such as "project name", "amount", "technical requirement" and "construction period" are extracted from a bidding document.

In this embodiment, the keywords are ranked by importance and relevance and then combined into an abstract according to a certain logical structure (such as chronological order or logical hierarchy). The ranking can be based on the frequency and position of each keyword in the document, its degree of association with other keywords, and the like. Reasonable ranking and combination make the abstract easier to understand; for example, the extracted keywords are combined into the abstract in the order of project name, bid amount, technical requirements and construction period.

Specifically, the tuning correction unit checks the generated abstract for grammar, semantics and logic. By comparing the abstract with the text content of the initial bidding document, it judges whether the abstract reads smoothly and whether it accurately conveys the semantics of the initial bidding document. Correcting the abstract ensures that it conveys those semantics accurately and avoids misunderstanding or ambiguity caused by an inaccurate abstract. For example, if the technical requirements part of the abstract deviates from the description in the initial bidding document, the tuning correction unit corrects the abstract to ensure its accuracy.

In this embodiment, when the abstract does not read smoothly or does not conform to the initial bidding document semantics, synonym replacement and root association can be used to optimize the keywords. Synonym replacement replaces an inappropriate keyword with a synonym or near-synonym; root association associates other related words from the root or suffix of a keyword so as to enrich the content of the abstract. Together they allow the content of the abstract to be adjusted flexibly so that it reads more smoothly and is easier to understand. For example, if the term "construction period" makes the abstract read awkwardly, it can be replaced with an equivalent term.

Checking again ensures that the keywords in the abstract are correct, that the quality of the abstract meets the requirements and that the semantics of the initial bidding document are accurately conveyed, which makes the abstract more reliable and improves the overall quality of the bidding document.

Further, the image scanning input bidding document information converts the document into an editable electronic text format through image recognition and natural language processing, wherein the image recognition is used for extracting text content from the editable electronic text format, and the natural language processing is used for primarily understanding the extracted text.

In this embodiment, the paper bidding document is converted into digital images with a scanner or a camera. The images are usually high-resolution so that the characters are clearly recognizable. The images are preprocessed, including denoising, binarization (converting the image to black and white so that characters are easier to identify) and correction (such as rotation correction so that the characters are aligned). High-quality scanning captures every detail of the document accurately. The document is then converted into an editable electronic text format, which is easy to store and process: OCR, an image recognition technology, extracts the text content from the preprocessed images and converts the characters in the images into editable text for subsequent processing and analysis. The extracted text content is preliminarily understood through natural language processing, which can identify keywords, phrases and sentence structures in the document and thereby give a preliminary understanding of its subject and gist.
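
The binarization step of the preprocessing can be sketched in a few lines. This is an illustrative sketch only: the image is assumed to be a non-empty grayscale image stored as nested lists of 0 to 255 intensities, and the mean-intensity default threshold is an assumption; a production pipeline would normally rely on an image processing library and an adaptive thresholding method.

```python
from typing import List, Optional


def binarize(gray: List[List[int]], threshold: Optional[float] = None) -> List[List[int]]:
    # gray: grayscale image as rows of 0-255 intensities (assumed non-empty).
    pixels = [p for row in gray for p in row]
    if not pixels:
        return [list(row) for row in gray]
    if threshold is None:
        threshold = sum(pixels) / len(pixels)  # mean intensity as a simple default
    # Pixels brighter than the threshold become white (255), the rest black (0).
    return [[255 if p > threshold else 0 for p in row] for row in gray]
```

Reducing the image to two levels separates dark characters from the lighter page background before character recognition is attempted.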

Further, the abstract extraction model comprises a word density analysis unit, a word priority screening unit, a word priority sorting unit and an abstract generation and inspection unit;

Further, as shown in fig. 2, the word density analysis unit extracts the text content in the editable electronic text format and performs region segmentation: a two-dimensional coordinate system of size a × a is established, the grids are numbered from left to right and from top to bottom, and the occurrence density of words in each grid is calculated. The calculation expression of the word density analysis unit is as follows:

Further, $W_{ij}$ denotes the set of all words contained in grid $(i, j)$;

the distance factor calculation expression is:

Further, the word priority screening unit collects the density and position of the grid words from the region segmentation, performs feature extraction on the collected density and position, calculates weights from the extracted features, and finally sorts and outputs the grid words according to their priority weights;

In this embodiment, the word priority screening unit first performs region segmentation on the text or image content, dividing the whole content into a plurality of grids, and identifies and collects all grid words in each grid. Region segmentation allows the content to be analyzed at a finer granularity and captures more detail. From the collected grid words, the unit extracts several features, including the word frequency and inverse document frequency (TF-IDF) index, the grid word length and the grid word position weight. Extracting several features allows the importance and priority of each grid word to be evaluated more comprehensively, and screening on their combination identifies key information more accurately. A priority weight is calculated for each grid word from the extracted features, which quantifies the importance of the grid word as a specific value that is convenient to compare and sort. All grid words are then sorted by priority weight in descending order, so that the grid word with the highest weight comes first, and output in that order. The sorted result shows intuitively which information is more important and makes it easy for a user to obtain the key information.
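
The TF-IDF feature mentioned above can be computed as in the following sketch. Treating each grid region as one "document" for the inverse document frequency is an assumption, as is the unsmoothed `log(n / df)` form; the text names the feature but does not give its formula.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(regions: List[List[str]]) -> List[Dict[str, float]]:
    # regions: one token list per grid region; returns one {term: score} dict per region.
    n = len(regions)
    # Document frequency: number of regions in which each term appears.
    df = Counter(term for region in regions for term in set(region))
    scores = []
    for region in regions:
        tf = Counter(region)
        scores.append(
            {term: (count / len(region)) * math.log(n / df[term]) for term, count in tf.items()}
        )
    return scores
```

A term that appears in every region receives a score of zero under this form, which is the intended effect: words common to the whole document carry little information about any one region.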

Further, the calculation expression of the weight calculation is as follows:

the position weight calculation expression is as follows:

The maximum normalized calculation expression is as follows:

Further, the keyword ranking sorts the extracted grid words by weight from high to low;

In this embodiment, the keywords are ranked by weight from high to low, so that keywords with higher weights are processed first, which improves the efficiency of information processing and text analysis. The first N keywords (N is a preset value) are selected from the ranked keyword list as the core content of the abstract, and a logically coherent, informationally complete sentence or paragraph is constructed from them as the abstract. When the abstract is constructed, the relevance among the keywords and the contextual logic must be considered to ensure its accuracy and readability. A screening threshold is set in the word priority ranking to screen out the keywords with higher priority; keywords with lower weights are removed by the threshold, which improves the quality of the keyword list. The tuning correction unit performs grammar checking and semantic consistency checking on the abstract by constructing a correction function and a pruning function. The correction function corrects symmetry, compactness, vanishing distance and orthogonality problems during image feature extraction, ensuring the accuracy of the image information in the abstract; the pruning function compensates the word density analysis unit through a database and text image information, ensuring the integrity of the abstract.

Further, if the abstract reads smoothly, its semantic consistency is checked again with a semantic similarity algorithm; if it does not, synonym replacement, root association and reordering are performed until it reads smoothly;

Example two

As shown in fig. 3, the intelligent abstract extraction system for bidding documents based on image recognition comprises:

The feedback module comprises a tuning correction unit, a semantic similarity algorithm, synonym replacement and root association;

the correction function is used for correcting symmetry, compactness, vanishing distance and orthogonality problems during image feature extraction, and the pruning function compensates the word density analysis unit through a database and text image information;

Example III

As shown in fig. 4, a structural schematic of the image-recognition-based intelligent abstract extraction system for bidding documents, the system comprises a cabinet body 1, a total layer cabin door 2, a lower layer cabin door 3 and a floor 4 with rollers, wherein the cabinet body 1 is used for supporting a display, and the lower layer cabin door 3 is used for storing the lightning protection and power distribution unit 13, the data processing unit 14 and the storage drawer 15 shown in fig. 5;

Further, the total layer cabin door 2 is used for storing a host 10, a file identification module 11 and an input module 12;

As shown in fig. 5, the device also comprises a wireless communication module 5, a keyboard 6, an identification camera 7, a laser printer 8 and a printer paper outlet supporting plate 9.

The keyboard 6 is used for controlling bidding document entry, the identification camera 7 is used for recognizing the content of paper bidding documents, and the laser printer 8 is used for printing the generated abstract.

Fig. 6 is a diagram of a bid document intelligent abstract extraction system structure based on image recognition.

As shown in fig. 7, an office supplies tender is entered into the intelligent abstract extraction system. On entry, the system analyzes the document size, records the entry time and shows a preview of the original text. Document scanning and keyword extraction are marked as completed once the tender has undergone preliminary recognition processing; the processing progress display monitors the abstract generation process; and the abstract result finally presents the generated abstract of the office supplies tender.

It is important to note that the construction and arrangement of the application as shown in the various exemplary embodiments is illustrative only. Although only two embodiments have been described in detail in this disclosure, those skilled in the art who review this disclosure will readily appreciate that many modifications are possible, for example, variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters (e.g., temperature, pressure, etc.), mounting arrangements, use of materials, colors, orientations, etc., without materially departing from the novel teachings and advantages of the subject matter described in this application. For example, elements shown as integrally formed may be constructed of multiple parts or elements, the position of elements may be reversed or otherwise varied, and the nature or number of discrete elements or positions may be altered or varied. Accordingly, all such modifications are intended to be included within the scope of present application. The order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments. Any means-plus-function clause is intended to cover the structures described herein as performing the function and not only structural equivalents but also equivalent structures. Other substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the exemplary embodiments without departing from the scope of the present applications. Therefore, the application is not limited to the specific embodiments, but extends to various modifications that nevertheless fall within the scope of the appended claims.

Furthermore, in an effort to provide a concise description of the exemplary embodiments, all features of an actual implementation may not be described (i.e., those not associated with the best mode presently contemplated for carrying out the invention, or those not associated with practicing the invention).

It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions may be made. Such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.

It should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made to the technical solution without departing from its spirit and scope, and all such modifications and substitutions are intended to be covered by the scope of the claims of the present invention.

Claims

1. A method for extracting intelligent abstracts from bidding documents based on image recognition, characterized by comprising:

S1. Select the application field to which the bidding document belongs according to the bidding document;

S2. Input the bidding document information through image scanning, and establish an abstract extraction model to extract keywords of the bidding document;

The abstract extraction model includes a word density analysis unit, a word priority screening unit, a word priority sorting unit and an abstract generation verification unit;

The word density analysis unit is used to extract and check the density of word occurrences, and extract word features through a feature extraction function;

The word priority screening unit calculates the priority weights of the words by setting the word priority sorting logic;

The word priority sorting unit is used to receive word priority weights and sort the word priorities;

The abstract generation verification unit is used to generate a summary of the bidding document according to the word priority ranking, and to perform integrity verification on the summary of the bidding document;

S3. Establish a summary of the bidding document by sorting the keywords, and use a tuning correction unit to check whether the summary is fluent and conforms to the semantics of the initial bidding document;

S4. If the summary is fluent, continue to check whether it conforms to the semantics of the initial bidding document; if the summary is not fluent, perform synonym replacement and root association on the keywords and re-order the keywords;

S5. If the summary conforms to the semantics of the initial bidding document, the intelligent abstract extraction of the bidding document is completed; if it does not conform, recheck whether the keywords are correct.
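
The control flow of steps S3 to S5 can be sketched as a loop. This is a minimal illustration only: the function name `extract_summary`, the four callables passed in, and the `max_rounds` guard are hypothetical and do not appear in the claims, and joining keywords with spaces stands in for whatever summary construction the method actually uses.

```python
from typing import Callable, List


def extract_summary(
    keywords: List[str],
    is_fluent: Callable[[str], bool],
    matches_source: Callable[[str], bool],
    repair_fluency: Callable[[List[str]], List[str]],
    recheck_keywords: Callable[[List[str]], List[str]],
    max_rounds: int = 10,
) -> str:
    """Repeat the S3-S5 checks until both pass or max_rounds is reached."""
    for _ in range(max_rounds):
        # S3: build a candidate summary from the ordered keywords.
        summary = " ".join(keywords)
        # S4: a non-fluent summary triggers replacement and re-ordering.
        if not is_fluent(summary):
            keywords = repair_fluency(keywords)
            continue
        # S5: a fluent summary that matches the source semantics is final.
        if matches_source(summary):
            return summary
        # S5: otherwise the keywords themselves are rechecked.
        keywords = recheck_keywords(keywords)
    raise RuntimeError("summary did not pass both checks within max_rounds")
```

The round limit is a design choice of the sketch, added so that a pair of checks that can never both pass does not loop forever.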

2. The method for extracting intelligent abstracts from bidding documents based on image recognition according to claim 1 is characterized in that:

Inputting the bidding document information through image scanning converts the document into an editable electronic text format by means of image recognition and natural language processing;

The image recognition is used to extract text content from the editable electronic text format;

The natural language processing is used to perform a preliminary understanding of the extracted text.

3. The method for extracting intelligent abstracts from bidding documents based on image recognition according to claim 2 is characterized in that:

The word density analysis unit extracts text content from the editable electronic text format and performs regional segmentation: it establishes a two-dimensional coordinate system of size a×a, numbers the grid cells from left to right and from top to bottom, and calculates the density of word occurrences in each grid cell. The calculation expression of the word density analysis unit is as follows:

;

where the symbols denote, in order: the word density of the grid cell with horizontal coordinate i and vertical coordinate j in the electronic text format; the weight of the k-th word; the length of the k-th word; the distance factor of the k-th word in grid cell (i, j); and the total number of words;

The distance factor calculation expression is:

;

where the symbols denote, in order: the distance factor; the exponential constant e; the horizontal coordinate of word k; the horizontal coordinate of the grid center; the vertical coordinate of word k; the vertical coordinate of the grid center; and the distance coefficient α;

The distance factor is a value between 0 and 1, indicating that the importance of a word changes with its position.
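
The two expressions of this claim are not reproduced in this text, so the following sketch fixes one form that is consistent with the definitions given and should be read as an assumption, not as the claimed formula. It assumes the distance factor is `exp(-alpha * d)`, with `d` the Euclidean distance between the word and the grid center, and that the density of a cell is the sum of weight × length × distance factor over all words, divided by the total number of words. The names `grid_density` and `distance_factor` and the tuple layout are hypothetical.

```python
import math
from typing import List, Tuple

# One word: (x position, y position, weight, length), positions in grid units.
Word = Tuple[float, float, float, int]


def distance_factor(x: float, y: float, cx: float, cy: float, alpha: float) -> float:
    """Assumed form: exponential decay with Euclidean distance to the cell center."""
    return math.exp(-alpha * math.hypot(x - cx, y - cy))


def grid_density(words: List[Word], a: int, alpha: float = 1.0) -> List[List[float]]:
    """Return an a-by-a grid of word densities, indexed as grid[row][column]."""
    grid = [[0.0] * a for _ in range(a)]
    n = len(words)
    if n == 0:
        return grid
    for j in range(a):        # rows, top to bottom
        for i in range(a):    # columns, left to right
            cx, cy = i + 0.5, j + 0.5  # center of cell (i, j)
            total = sum(
                w * length * distance_factor(x, y, cx, cy, alpha)
                for x, y, w, length in words
            )
            grid[j][i] = total / n
    return grid
```

With a non-negative `alpha`, the assumed distance factor lies in (0, 1], matching the statement that it takes a value between 0 and 1, and words closer to a cell center contribute more to that cell.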

4. The method for extracting intelligent abstracts from bidding documents based on image recognition according to claim 3 is characterized by:

The word priority screening unit collects the density and position of the grid words from the regional segmentation, performs feature extraction on the collected density and position, calculates weights based on the extracted features, and finally sorts and outputs the grid words according to the priority weights of all the grid words;

The feature extraction is used to extract the term frequency-inverse document frequency (TF-IDF) index of the grid word, the grid word length and the grid word position weight.
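
As an illustration of the frequency feature, the sketch below computes a common textbook variant of the term frequency and inverse document frequency index. The claims do not state which variant is used, so the choices here are assumptions: each grid cell is treated as one "document", term frequency is the count divided by the cell length, and inverse document frequency is the natural log of the number of cells over the number of cells containing the word. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(cells: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: score} mapping per grid cell."""
    n_cells = len(cells)
    # Number of cells in which each word occurs at least once.
    cell_freq: Counter = Counter()
    for words in cells:
        cell_freq.update(set(words))

    scores = []
    for words in cells:
        counts = Counter(words)
        total = len(words)
        scores.append({
            word: (count / total) * math.log(n_cells / cell_freq[word])
            for word, count in counts.items()
        })
    return scores
```

Under this variant a word that occurs in every cell scores zero, which is why position weight and word length are useful as additional features.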

5. The method for extracting intelligent abstracts from bidding documents based on image recognition according to claim 4 is characterized in that:

The calculation expression of the weight calculation is as follows:

;

where the symbols denote, in order: the grid word priority weight; the term frequency-inverse document frequency (TF-IDF) index of the grid word; the length of word k; and the position weight of word k;

The position weight calculation expression is as follows:

;

where the symbols denote, in order: the position weight coefficient when the grid word is in the title; the position weight coefficient when the grid word is in the first sentence; the position weight coefficient when the grid word is within a sentence; and the position of the k-th word in the bidding document; the remaining branch of the expression applies to all other cases;

The grid word priority weight, the grid word TF-IDF index, the word length and the word position weight are normalized by maximum normalization;

The maximum normalization calculation expression is as follows:

$$x' = \frac{x - \min[x]}{\max[x] - \min[x]};$$

where $x'$ is the data after maximum normalization processing, $x$ is the input data that needs to be normalized, min[] is the minimum value function, and max[] is the maximum value function;

The word priority sorting unit is used to receive the word priority weights after maximum normalization and sort the word priorities.
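
The normalization and sorting steps can be sketched as follows, assuming the usual min-max form suggested by the min[] and max[] functions named above. The function names and the handling of the degenerate case where all values are equal are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: the ratio is undefined, so return zeros (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_words(weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Normalize the priority weights and sort words from highest to lowest."""
    words = list(weights)
    normalized = min_max_normalize([weights[w] for w in words])
    return sorted(zip(words, normalized), key=lambda pair: pair[1], reverse=True)
```

For example, `rank_words({"bid": 2.0, "price": 6.0, "date": 4.0})` places "price" first with a normalized weight of 1.0 and "bid" last with 0.0.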

6. The method for extracting intelligent abstracts from bidding documents based on image recognition according to claim 5 is characterized in that:

The keyword sorting is based on the extracted grid words and their weights, and the grid words are sorted in descending order of weight;

The step of establishing the bid document summary includes selecting the first N keywords from the sorted keyword list and constructing a logically coherent sentence or paragraph as the summary;

A screening threshold is set on the word priority ranking, and the high-priority words that pass the threshold are selected as the keywords;

The tuning correction unit constructs a correction function and a pruning function to perform a grammar check and a semantic consistency check on the summary. The correction function is used to correct the symmetry, compactness, vanishing distance and orthogonality problems that arise when extracting image features. The pruning function compensates the word density analysis unit using the database and the text image information.
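
The keyword selection in this claim can be sketched as a sort, a threshold filter and a cut to the first N entries. The function name `select_keywords` is hypothetical, and applying the threshold before taking the first N is an assumption, since the claim does not fix the order of the two operations.

```python
from typing import Dict, List


def select_keywords(weights: Dict[str, float], threshold: float, n: int) -> List[str]:
    """Return at most n words whose weight is at least threshold, highest first."""
    # Sort grid words in descending order of weight.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    # Keep only words that pass the screening threshold, then take the first n.
    return [word for word, weight in ranked if weight >= threshold][:n]
```

Building a logically coherent sentence from the selected keywords is a separate step that this sketch does not attempt.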

7. The method for extracting intelligent abstracts from bidding documents based on image recognition according to claim 6 is characterized in that:

If the abstract is fluent, its semantic consistency is then checked through the semantic similarity algorithm; if the abstract is not fluent, synonym replacement, root association and re-ordering are performed to make the abstract fluent;

If the abstract does not conform to the semantics of the initial bidding document, the keywords of the bidding document are screened again through the word density analysis unit and the word priority sorting unit to recheck whether any keywords are missing or wrong;

The semantic similarity algorithm measures the similarity of two vectors by cosine similarity. The calculation expression is as follows:

$$\mathrm{Sim}(A, B) = \cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|};$$

where $\mathrm{Sim}(A, B)$ is the similarity between the abstract and the text, $\cos(A, B)$ is the cosine similarity, $A$ is the vector representation of the summary, and $B$ is the vector representation of the bidding document;

For the extracted keyword k, the synonym replacement matches a group of synonyms from a preset synonym library, and selects the synonym with the highest semantic matching degree to replace the original keyword;

For a specific word t, the root association is based on the root root(t), and its synonyms and derivatives are retrieved from the preset word library for replacement.
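
The similarity check and the synonym replacement can be sketched together. Cosine similarity itself is standard; everything around it here is an assumption made for illustration: the bag-of-words vectorization, the use of cosine similarity against a reference text as the "semantic matching degree", and the names `bag_of_words`, `best_synonym` and `synonym_library`.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(text: str) -> Dict[str, float]:
    """Assumed vectorization: lower-cased whitespace tokens with raw counts."""
    return {term: float(n) for term, n in Counter(text.lower().split()).items()}


def best_synonym(
    keyword: str,
    synonym_library: Dict[str, List[str]],
    reference: Dict[str, float],
) -> str:
    """Pick the candidate synonym most similar to the reference vector."""
    candidates = synonym_library.get(keyword, [])
    if not candidates:
        return keyword  # nothing to replace with
    return max(candidates, key=lambda s: cosine_similarity(bag_of_words(s), reference))
```

A production system would more likely use learned word or sentence embeddings for both vectors; the counting scheme above only keeps the example self-contained.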

8. A system for intelligent abstract extraction of bidding documents based on image recognition, used to implement the method for intelligent abstract extraction of bidding documents based on image recognition as claimed in any one of claims 1 to 7, characterized in that it comprises:

A document identification module, including a scanning unit for scanning the bidding document information and a classification unit for defining the technical field to which the bidding document belongs;

An input module, including an input unit for image scanning and input, an image recognition unit for extracting text content from an editable electronic text format, and a natural language processing unit for preliminary understanding of the extracted text;

A summary extraction model, including a word density analysis unit, a word priority screening unit, a word priority sorting unit and a summary generation verification unit;

A feedback module, including a tuning correction unit, a semantic similarity algorithm, synonym replacement, and root association.

9. The bidding document intelligent summary extraction system based on image recognition according to claim 8 is characterized by:

The summary generation verification unit is used to generate a summary of the bidding document according to the word priority ranking, and to perform integrity verification on the summary of the bidding document.