
CN119938908B - Intelligent abstract extraction method and system for bidding document based on image recognition - Google Patents


Publication number
CN119938908B
Authority
CN
China
Prior art keywords
word
abstract
bidding
grid
bidding document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510421859.8A
Other languages
Chinese (zh)
Other versions
CN119938908A (en)
Inventor
杨雨薇
季倩倩
范兴健
刘良
丁婕尧
曹丽
王中洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Economic And Technological Development Zone Public Resource Trading Center Nantong Economic And Technological Development Zone Government Service Agency Center Nantong Public Resource Trading Center Development Branch Center
NANTONG INSTITUTE OF TECHNOLOGY
Original Assignee
Nantong Economic And Technological Development Zone Public Resource Trading Center Nantong Economic And Technological Development Zone Government Service Agency Center Nantong Public Resource Trading Center Development Branch Center
NANTONG INSTITUTE OF TECHNOLOGY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Economic And Technological Development Zone Public Resource Trading Center Nantong Economic And Technological Development Zone Government Service Agency Center Nantong Public Resource Trading Center Development Branch Center and NANTONG INSTITUTE OF TECHNOLOGY
Priority to CN202510421859.8A
Publication of CN119938908A
Application granted
Publication of CN119938908B
Active legal status
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of intelligent document extraction and discloses an intelligent abstract extraction method and system for bidding documents based on image recognition. The method comprises: selecting the application field to which a bidding document belongs; entering the bidding document information through image scanning and establishing an abstract extraction model to extract bidding document keywords; and establishing a bidding document abstract through keyword sequencing, with a tuning correction unit checking whether the abstract is smooth and accords with the semantics of the initial bidding document. If the abstract is smooth, the check of whether it accords with the initial bidding document semantics continues; if it is not smooth, synonymous replacement and root association are carried out on the keywords and the keywords are reordered. If the initial bidding document semantics are met, intelligent abstract extraction of the bidding document is complete; if not, the keywords are checked again for correctness. The method thereby provides complete algorithm logic for intelligent abstract extraction and ensures the accuracy of abstract extraction from bidding documents.

Description

Intelligent abstract extraction method and system for bidding document based on image recognition
Technical Field
The invention relates to the technical field of intelligent document extraction, and in particular to an intelligent abstract extraction method and system for bidding documents based on image recognition.
Background
Image recognition technology has developed rapidly in recent years, and deep learning algorithms in particular have markedly improved recognition accuracy. Intelligent abstract extraction from bidding documents is one application of image recognition: key information is automatically extracted from a scanned document, or a document in picture form, to generate a document abstract or to feed subsequent data processing. Several shortcomings remain. Bidding documents come in many formats and layouts and include complex elements such as tables and graphs, which challenge image recognition. High-quality training data is crucial to building an accurate model, yet in practice a large amount of well-labeled training data is hard to obtain. Bidding documents from different industries also differ in terminology and format, so a single model can hardly cover all scenarios and must be adjusted or retrained for each field.
For example, Chinese patent application publication No. CN118504559A discloses an intelligent extraction method and system for legal and regulatory documents. The method collects the original text of laws and regulations as input text; performs data preprocessing on that text; constructs features through feature engineering, forming at least title features, table text features, non-table text features and symbol features; extracts key information from the text using the constructed features; and, based on the extracted key information, automatically identifies key entity information in the legal and regulatory documents through text scanning, splitting, feature comparison, regular-expression matching and the like. Compared with the prior art, the method and the device can improve the accuracy of 2D gaze point estimation.
The data sets used to train the model in the above patent may not be sufficiently diverse, so the model performs poorly when it meets new types or formats of documents; different documents may, for example, follow different terminology habits and layouts. Natural language processing has made great progress but still has limitations in understanding and extracting complex semantics. Legal and regulatory texts often contain a large number of specialized terms and complex logical relationships, which place high demands on machine understanding. Methods that rely on fixed rules for information extraction may fail when text formats are flexible and varied: regular expressions, for example, are very useful in certain scenarios but become inflexible when faced with complex text structures.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the application and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description of the application and in the title of the application, which may not be used to limit the scope of the application.
In order to solve the technical problems, the main object of the present invention is to provide an intelligent abstract extraction method for bidding documents based on image recognition, comprising:
S1, selecting an application field (the application field is described in detail as an example) to which a bidding document belongs according to the bidding document;
S2, entering bidding document information through image scanning (specifically, the image scanning can be realized by conventional technical means), and establishing an abstract extraction model to extract bidding document keywords;
S3, establishing a bidding document abstract through keyword sequencing, and checking whether the abstract is smooth and accords with the semantics of an initial bidding document by a tuning correction unit;
S4, if the abstract is smooth, continuously checking whether the abstract accords with the initial bidding document semantics, if the abstract is not smooth, carrying out synonymous replacement and root association on the keywords, and reordering the keywords;
S5, if the initial bidding document semantics are met, intelligent abstract extraction of the bidding document is completed, and if the initial bidding document semantics are not met, whether the keywords are correct is checked again.
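The control flow of steps S3 to S5 can be sketched as a loop. This is a minimal illustration only: every callable passed in (`build`, `is_fluent`, `matches_source`, `revise`, `rescreen`) is a hypothetical stand-in for a unit the text names, and the `max_rounds` bound is an assumption added so the loop always terminates; the source does not specify one.

```python
from typing import Callable, List


def extract_abstract(
    keywords: List[str],
    build: Callable[[List[str]], str],
    is_fluent: Callable[[str], bool],
    matches_source: Callable[[str], bool],
    revise: Callable[[List[str]], List[str]],
    rescreen: Callable[[List[str]], List[str]],
    max_rounds: int = 10,  # assumed bound, not from the source
) -> str:
    """Run the S3-S5 check loop; all callables are hypothetical stand-ins."""
    for _ in range(max_rounds):
        abstract = build(keywords)        # S3: order keywords into an abstract
        if not is_fluent(abstract):       # S4: not smooth, so replace and reorder
            keywords = revise(keywords)
            continue
        if matches_source(abstract):      # S5: semantics agree, extraction is done
            return abstract
        keywords = rescreen(keywords)     # S5: semantics differ, recheck keywords
    raise RuntimeError("no acceptable abstract within max_rounds")
```

Passing the checks in as functions keeps the sketch independent of how fluency and semantic agreement are actually measured.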
As a preferable scheme of the intelligent abstract extraction method of the bidding document based on image recognition, the invention comprises the following steps:
the image scanning input bidding document information converts the document into an editable electronic text format through image recognition and natural language processing;
the image recognition is used for extracting text content from an editable electronic text format;
the natural language processing is used for primarily understanding the extracted text.
As a preferable scheme of the intelligent abstract extraction method of the bidding document based on image recognition, the invention comprises the following steps:
The abstract extraction model comprises a word density analysis unit, a word priority screening unit, a word priority sorting unit and an abstract generation and inspection unit;
The word density analysis unit is used for extracting and checking the occurrence density of words and extracting word characteristics through a characteristic extraction function;
the word priority screening unit calculates the priority weight of the words by setting word priority ordering logic;
The word priority ranking unit is used for receiving the word priority weights and ranking the word priorities;
The abstract generation and inspection unit is used for generating a bidding document abstract according to word priority order and performing integrity inspection on the bidding document abstract.
As a preferable scheme of the intelligent abstract extraction method of the bidding document based on image recognition, the invention comprises the following steps:
The word density analysis unit extracts the text content in the editable electronic text format and performs region segmentation: a two-dimensional coordinate system of size a × a is established, the grids are numbered from left to right and from top to bottom, and the density of word occurrence in each grid is calculated. The calculation expression of the word density analysis unit is as follows:
[expression not reproduced in this text]
wherein the quantities are: the word density of the grid with abscissa i and ordinate j in the electronic text format; the weight of the kth word; the length of the kth word; the distance factor of the kth word in the grid (i, j); and the total number of words;
the distance factor calculation expression is:
[expression not reproduced in this text]
wherein the quantities are: e, the exponential constant; the abscissa position of word k; the abscissa of the central position of the grid; the ordinate position of word k; the ordinate of the central position of the grid; and α, a distance coefficient related to word spacing;
The distance factor is a value between 0 and 1, indicating that the importance of the word varies with position.
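Because the density and distance-factor expressions are not reproduced in this text, the following is only one plausible reading of the definitions above, not the patented formula. It assumes that the distance factor decays exponentially with the Euclidean distance between a word and the grid centre (which keeps it between 0 and 1), and that the grid density is the average of weight × length × distance factor over the words. The function names and the tuple layout are illustrative.

```python
import math
from typing import List, Tuple

# (weight, length, x, y) for one word; this layout is an assumption
Word = Tuple[float, int, float, float]


def distance_factor(x: float, y: float, cx: float, cy: float, alpha: float) -> float:
    """Assumed form: exp(-alpha * distance to the grid centre), in (0, 1]."""
    return math.exp(-alpha * math.hypot(x - cx, y - cy))


def grid_density(words: List[Word], cx: float, cy: float, alpha: float = 1.0) -> float:
    """Assumed form: mean of weight * length * distance factor over the grid's words."""
    if not words:
        return 0.0
    total = sum(w * l * distance_factor(x, y, cx, cy, alpha) for w, l, x, y in words)
    return total / len(words)
```

A word at the grid centre gets a factor of 1, and the factor shrinks as the word moves away, which matches the statement that importance varies with position.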
As a preferable scheme of the intelligent abstract extraction method of the bidding document based on image recognition, the invention comprises the following steps:
The word priority screening unit collects the density and position of the grid words in the region segmentation, performs feature extraction on the collected density and position, calculates weights from the extracted features, and finally sorts all grid words by priority weight and outputs the result;
The feature extraction is used for extracting word frequency and inverse text frequency index of the grid word, the length of the grid word and the position weight of the grid word.
As a preferable scheme of the intelligent abstract extraction method of the bidding document based on image recognition, the invention comprises the following steps:
The calculation expression of the weight calculation is as follows:
[expression not reproduced in this text]
wherein the quantities are: the grid word priority weight; the word frequency and inverse text frequency index of the grid word; the length of word k; and the position weight of word k;
the position weight calculation expression is as follows:
[expression not reproduced in this text]
wherein the position weight takes, according to the position of the kth word in the bidding document: a position weight coefficient when the grid word is in a title; a position weight coefficient when the grid word is in a first sentence; a position weight coefficient when the grid word is within a sentence; and a further value in other cases;
The grid word priority weight, the word frequency and inverse text frequency index of the grid word, the word length and the word position weight are normalized through min-max normalization;
The min-max normalization calculation expression is as follows:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
wherein x' is the data after the normalization process, x is the input data to be normalized, min is the minimum function, and max is the maximum function;
the word priority ranking unit is used for receiving the word priority weights after the min-max normalization processing and ranking the word priorities.
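The normalization described above, which uses a minimum and a maximum function, corresponds to standard min-max scaling. A minimal sketch follows; returning zeros for a constant input is an assumption made here to avoid division by zero, since the source does not address that case.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed handling of a constant input; not specified in the source.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max_normalize([2, 4, 6])` yields `[0.0, 0.5, 1.0]`, putting features with different units on a common scale before ranking.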
As a preferable scheme of the intelligent abstract extraction method of the bidding document based on image recognition, the invention comprises the following steps:
The keyword sequencing is used for sequencing the grid words according to the extracted grid words and the weights thereof and the sequence from high to low;
The establishing of the bidding document abstract comprises the steps of selecting the first N keywords from the ordered keyword list, and constructing a logically-consistent sentence or paragraph as the abstract;
setting a screening threshold value through the word priority ranking, and screening out high-priority words as the keywords;
and constructing a correction function and a pruning function through the tuning correction unit, and carrying out grammar checking and semantic consistency checking on the abstract, wherein the correction function is used for correcting symmetry, compactness, vanishing distance and orthogonality problems during image feature extraction, and the pruning function compensates the word density analysis unit through a database and text image information.
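The ranking, threshold screening and first-N selection described above can be sketched as follows. The function name, the dictionary input and the example weights are illustrative assumptions; the source does not give threshold or N values.

```python
from typing import Dict, List


def select_keywords(weights: Dict[str, float], threshold: float, top_n: int) -> List[str]:
    """Sort words by weight (high to low), keep those at or above the
    threshold, and return at most the first top_n of them."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, score in ranked if score >= threshold][:top_n]
```

With weights `{"budget": 0.9, "the": 0.1, "deadline": 0.7, "scope": 0.6}`, a threshold of 0.5 and N = 2, the selected keywords are `["budget", "deadline"]`.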
As a preferable scheme of the intelligent abstract extraction method of the bidding document based on image recognition, the invention comprises the following steps:
if the abstract is in order, checking the semantic consistency of the abstract again through a semantic similarity algorithm, and if the abstract is not in order, performing synonymous replacement, root association and reordering until the abstract is in order;
if the keyword does not accord with the initial bidding document semantics, screening the keywords of the bidding document through a word density analysis unit and a word priority ranking unit, and rechecking whether missing and wrong keywords exist;
The semantic similarity algorithm measures the similarity of two vectors through cosine similarity, and the calculation expression is as follows:
$$\mathrm{Sim}(S, D) = \cos(\mathbf{v}_S, \mathbf{v}_D) = \frac{\mathbf{v}_S \cdot \mathbf{v}_D}{\lVert \mathbf{v}_S \rVert \, \lVert \mathbf{v}_D \rVert}$$
wherein Sim(S, D) is the similarity of the abstract and the text, cos is the cosine similarity, v_S is the vector representation of the abstract, and v_D is the vector representation of the bidding document;
for the extracted keyword k, matching a group of synonyms from a preset synonym library by the synonym replacement, and selecting the synonym with the highest semantic matching degree to replace the original keyword;
and for a specific word t, the root association is based on root (t), and synonyms and derivative words of the root association are retrieved from a preset word stock for replacement.
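The cosine-similarity check between the abstract and the bidding document can be illustrated with a small sketch. The source does not say how the vector representations are obtained; the bag-of-words term counts used here are an assumption chosen to keep the example self-contained.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between bag-of-words count vectors (assumed vectorization)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0
```

Identical texts score close to 1 and texts with no shared terms score 0, so a threshold on this value could serve as the semantic consistency test.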
The intelligent abstract extraction system of the bidding document based on image recognition comprises:
The file identification module comprises a scanning unit for scanning bidding file information and a classification unit for defining the technical field of bidding files;
the input module comprises an input unit for image scanning input, an image recognition unit for extracting text content from an editable electronic text format, and a natural language processing unit for primarily understanding the extracted text;
The abstract extraction model comprises a word density analysis unit, a word priority screening unit, a word priority sorting unit and an abstract generation and inspection unit;
the feedback module comprises a tuning correction unit, a semantic similarity algorithm, synonym replacement and root association.
As a preferable scheme of the intelligent abstract extraction system of the bidding document based on image recognition, the invention comprises the following steps:
The word density analysis unit is used for extracting and checking the occurrence density of words and extracting word characteristics through a characteristic extraction function;
the word priority screening unit calculates the priority weight of the words by setting word priority ordering logic;
The word priority ranking unit is used for receiving the word priority weights and ranking the word priorities;
The abstract generation and inspection unit is used for generating a bidding document abstract according to word priority order and performing integrity inspection on the bidding document abstract.
The invention has the beneficial effects that:
The problem that different documents possibly have different terms in use habits is solved through the abstract extraction model, key information extraction of bidding documents is realized through semantic intelligent segmentation and intelligent feature extraction of image recognition, the understanding capability of document images is enhanced, the intellectualization and automation of extraction are realized, and the extraction accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart of a bid document intelligent abstract extraction method based on image recognition;
FIG. 2 is a workflow diagram of a word density analysis unit of the intelligent abstract extraction method of bidding documents based on image recognition of the present invention;
FIG. 3 is a system diagram of the intelligent abstract extraction system of bidding documents based on image recognition;
FIG. 4 is a general block diagram of the intelligent abstract extraction system of bidding documents based on image recognition of the present invention;
FIG. 5 is an internal block diagram of the intelligent abstract extraction system of bidding documents based on image recognition of the present invention;
FIG. 6 is a diagram of the intelligent abstract extraction system structure of the bidding document based on image recognition;
FIG. 7 is a block diagram of an intelligent abstract extraction system for bidding documents based on image recognition.
The reference numerals denote: the cabinet body, the total layer cabin door, the lower layer cabin door, the floor with rollers, the wireless communication module, the keyboard, the recognition camera, the laser printer, the printer paper outlet supporting plate, the host, the file recognition module, the recording module, the lighting protection and distribution unit, the data processing unit, and the storage drawer.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Example 1
As shown in fig. 1, the intelligent abstract extraction method of the bidding document based on image recognition comprises the following steps:
S1, selecting an application field to which a bidding document belongs according to the bidding document;
In this embodiment, the bidding document may belong to fields such as construction engineering, information technology, medical equipment, energy development and environmental protection projects. The field can be determined preliminarily by analyzing key information in the bidding document such as the project description, technical requirements and contract terms. For example, a bidding document that contains a large amount of content about "building structural design", "construction material selection" and "construction progress plan" can be determined to belong to the field of construction engineering. Determining the application field allows the subsequent information processing and abstract extraction to be targeted more accurately at that field.
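The field determination in S1 can be approximated by counting field-specific phrases. This is a hedged sketch: the `DOMAIN_TERMS` table is hypothetical (only the construction phrases echo the example in the text), and simple substring counting stands in for whatever analysis an implementation would use.

```python
from typing import Dict, List, Optional

# Hypothetical phrase lists; only the construction entries echo the text's example.
DOMAIN_TERMS: Dict[str, List[str]] = {
    "construction engineering": [
        "building structural design",
        "construction material selection",
        "construction progress plan",
    ],
    "information technology": ["software", "network", "server"],
}


def classify_domain(text: str, domain_terms: Dict[str, List[str]] = DOMAIN_TERMS) -> Optional[str]:
    """Return the field whose phrases occur most often, or None if none occur."""
    lowered = text.lower()
    scores = {d: sum(lowered.count(t) for t in terms) for d, terms in domain_terms.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```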
S2, entering bidding document information through image scanning, and establishing an abstract extraction model to extract bidding document keywords;
In this embodiment, the bidding document information is entered through image scanning. Modern OCR (optical character recognition) achieves high recognition accuracy, and the image scanning of the invention uses OCR to convert paper bidding documents, or electronic bidding documents in PDF and other formats, into an editable text format. OCR recognizes the shapes and arrangement of the characters in an image and converts them into computer-readable text data, so entering bidding document information by image scanning greatly reduces the time and cost of manual entry. An abstract extraction model is then established to extract the keywords of the bidding document: the text content is analyzed to extract keywords that summarize the main information of the document, condensing a large amount of text into a small number of keywords so that readers can quickly understand the main content. For example, keywords such as "project name", "amount", "technical requirement" and "construction period" are extracted from a bidding document.
S3, establishing a bidding document abstract through keyword sequencing, and checking whether the abstract is smooth and accords with the semantics of an initial bidding document by a tuning correction unit;
In this embodiment, the keywords are ranked by importance and relevance and then combined into an abstract according to a logical structure (such as time sequence or logical hierarchy). The ranking can be based on each keyword's frequency of occurrence in the document, its position, and its degree of relevance to other keywords. Reasonable ranking and combination make the abstract easier to understand; for example, the extracted keywords are combined into the abstract in the order of project name, bid amount, technical requirements and construction period.
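Combining extracted keywords in a fixed logical order, as in the example above, can be sketched as follows. The `FIELD_ORDER` list follows the order named in the text; the output format and the sample values are illustrative assumptions.

```python
from typing import Dict, List

# Order taken from the example in the text.
FIELD_ORDER: List[str] = [
    "project name",
    "bid amount",
    "technical requirements",
    "construction period",
]


def compose_abstract(fields: Dict[str, str]) -> str:
    """Join the available fields in FIELD_ORDER, skipping any that are missing."""
    parts = [f"{name}: {fields[name]}" for name in FIELD_ORDER if name in fields]
    return ("; ".join(parts) + ".") if parts else ""
```

The input dictionary may list fields in any order; the template decides the order in the abstract.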
Specifically, the tuning correction unit checks the grammar, semantics and logic of the generated abstract. By comparing the abstract with the text content of the initial bidding document, it judges whether the abstract is smooth and whether it accurately conveys the semantics of the initial bidding document, and corrects the abstract where it does not, which avoids misunderstanding or ambiguity caused by an inaccurate abstract. For example, if the technical requirement part of the abstract deviates from the description in the initial bidding document, the tuning correction unit corrects the abstract to ensure its accuracy.
S4, if the abstract is smooth, continuously checking whether the abstract accords with the initial bidding document semantics, if the abstract is not smooth, carrying out synonymous replacement and root association on the keywords, and reordering the keywords;
In this embodiment, when the abstract is not smooth or does not conform to the initial bidding document semantics, synonymous replacement and root association can be used to optimize the keywords. Synonymous replacement substitutes a synonym or near-synonym for an inappropriate keyword. Root association uses the root or suffix of a keyword to associate other related words, enriching the content of the abstract. Together they allow the content of the abstract to be adjusted flexibly so that it reads more smoothly and is easier to understand. For example, if the term "construction period" makes the abstract read poorly, it can be replaced with an equivalent term.
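Synonymous replacement and root association can be sketched with two small helpers. Everything here is illustrative: the synonym table, the scoring function and the root function are supplied by the caller and are hypothetical stand-ins for the preset synonym library, the semantic matching degree and root(t) mentioned in the text.

```python
from typing import Callable, Dict, List


def replace_with_synonym(
    keyword: str,
    synonyms: Dict[str, List[str]],
    score: Callable[[str], float],
) -> str:
    """Pick the candidate synonym with the highest score, or keep the keyword."""
    candidates = synonyms.get(keyword, [])
    return max(candidates, key=score) if candidates else keyword


def root_associates(word: str, lexicon: List[str], root: Callable[[str], str]) -> List[str]:
    """Return the other lexicon words that share the given word's root."""
    target = root(word)
    return [w for w in lexicon if w != word and root(w) == target]
```

In practice the score would come from a semantic matching measure and the root function from a stemmer or morphological dictionary; neither is specified in the source.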
S5, if the initial bidding document semantics are met, intelligent abstract extraction of the bidding document is completed, and if the initial bidding document semantics are not met, whether the keywords are correct is checked again.
By checking again, the keywords in the abstract are confirmed to be correct, the quality of the abstract is ensured to meet the requirement, and the semantics of the initial bidding document are accurately conveyed, so that the abstract is more reliable and the overall quality of the bidding document is improved.
Further, the image scanning input bidding document information converts the document into an editable electronic text format through image recognition and natural language processing, wherein the image recognition is used for extracting text content from the editable electronic text format, and the natural language processing is used for primarily understanding the extracted text.
In this embodiment, the paper bidding document is converted into digital images using a scanner or a camera; the images are usually high-resolution so that the characters are clearly recognizable. The images are preprocessed by denoising, binarization (converting the image to black and white so that characters are identified more clearly) and correction (such as rotation correction so that the characters are arranged in order). High-quality scanning captures every detail of the document accurately. The document is then converted into an editable electronic text format, which is easy to store and process: OCR, an image recognition technique, extracts the character content from the preprocessed images and converts the characters in the images into editable text for subsequent processing and analysis. Finally, natural language processing is applied to the extracted character content to identify keywords, phrases and sentence structures in the document, giving a preliminary understanding of its subject and gist.
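Of the preprocessing steps above, binarization is the simplest to illustrate. The sketch below applies a fixed global threshold to a grayscale image stored as nested lists; the threshold value of 128 and the list representation are assumptions, and a production pipeline would more likely use an image library with an adaptive threshold before handing the result to an OCR engine.

```python
from typing import List


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map each 0-255 grayscale pixel to 0 (black) or 255 (white)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]
```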
Further, the abstract extraction model comprises a word density analysis unit, a word priority screening unit, a word priority sorting unit and an abstract generation and inspection unit;
The word density analysis unit is used for extracting and checking the occurrence density of words and extracting word characteristics through a characteristic extraction function;
the word priority screening unit calculates the priority weight of the words by setting word priority ordering logic;
The word priority ranking unit is used for receiving the word priority weights and ranking the word priorities;
The abstract generation and inspection unit is used for generating a bidding document abstract according to word priority order and performing integrity inspection on the bidding document abstract.
Further, as shown in fig. 2, the word density analysis unit extracts the text content in the editable electronic text format and performs region segmentation: a two-dimensional coordinate system of size a × a is established, the grids are numbered from left to right and from top to bottom, and the density of word occurrence in each grid is calculated. The calculation expression of the word density analysis unit is as follows:
[expression not reproduced in this text]
wherein the quantities are: the word density of the grid with abscissa i and ordinate j in the electronic text format; the weight of the kth word; the length of the kth word; the distance factor of the kth word in the grid (i, j); and the total number of words;
further, a set symbol in the expression represents all words contained in the grid;
the distance factor calculation expression is:
[expression not reproduced in this text]
wherein the quantities are: e, the exponential constant; the abscissa position of word k; the abscissa of the central position of the grid; the ordinate position of word k; the ordinate of the central position of the grid; and α, a distance coefficient related to word spacing;
The distance factor is a value between 0 and 1, indicating that the importance of the word varies with position.
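The a × a segmentation with left-to-right, top-to-bottom numbering can be sketched as a mapping from a word's page coordinates to a grid number. The top-left origin, the 1-based numbering and the parameter names are assumptions; the source only fixes the numbering direction.

```python
def grid_number(x: float, y: float, width: float, height: float, a: int) -> int:
    """Return the 1-based number of the grid containing (x, y) on a page of the
    given size split into a * a grids, numbered left to right, top to bottom."""
    col = min(int(x / width * a), a - 1)   # clamp points on the right edge
    row = min(int(y / height * a), a - 1)  # clamp points on the bottom edge
    return row * a + col + 1
```

On a 100 × 100 page with a = 4, the point (30, 60) falls in column 1 and row 2 (0-based), which is grid 10.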
Further, the word priority screening unit collects the density and position of the grid words in the region segmentation, performs feature extraction on the collected density and position, calculates weights from the extracted features, and finally sorts all grid words by priority weight and outputs the result;
The feature extraction is used for extracting word frequency and inverse text frequency index of the grid word, the length of the grid word and the position weight of the grid word.
In this embodiment, the word priority screening unit firstly performs region segmentation on text or image content, divides the whole content into a plurality of grids, identifies and collects all grid words in each grid, through region segmentation, can analyze text or image content more carefully, captures more details, extracts a plurality of features including word frequency, inverse text frequency index, grid word length and grid word position weight from the collected grid words, can evaluate the importance and priority of the grid words more comprehensively by extracting the plurality of features, can screen by combining the plurality of features, can identify key information more accurately, calculates a priority weight for each grid word according to the extracted features, can quantify the importance of the grid word into a specific numerical value by weight calculation, is convenient to compare and order, sorts all grid words according to the calculated priority weights, and outputs the grid according to the sorting result, wherein the sorting result adopts a descending method, namely the grid word with the highest weight is arranged in front, and the sorted result can intuitively display the key information which is more important and is convenient for a user to acquire the key information.
Further, the priority weight of a grid word is calculated from the term frequency-inverse document frequency (TF-IDF) of the grid word, the length of word k, and the position weight of word k;
the position weight is a piecewise expression over the position of the k-th word in the bidding document: it takes a title position weight coefficient when the grid word is in a title, a first-sentence position weight coefficient when the grid word is in a first sentence, an in-sentence position weight coefficient when the grid word is within a sentence, and a separate value in other cases;
The grid word priority weight, the TF-IDF of the grid word, the word length and the word position weight are normalized through min-max normalization;
The min-max normalization calculation expression is as follows:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
wherein $x'$ is the data after normalization, $x$ is the input data to be normalized, min is the minimum function, and max is the maximum function;
the word priority ranking unit is used for receiving the normalized word priority weights and ranking the word priorities.
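The weight calculation and normalization described above can be sketched as follows. This is a minimal illustration, not the patented expression: the text names the three features (TF-IDF, word length, position weight) but does not reproduce how they are combined, so the sketch assumes a plain sum of the normalized features. The position coefficients, the value used for other cases, and the function names are all hypothetical.

```python
# Hypothetical position weight coefficients; the text names the cases, not the values.
POSITION_COEFF = {"title": 1.0, "first_sentence": 0.8, "in_sentence": 0.5}
OTHER_COEFF = 0.2  # assumed value for "other cases"


def min_max(values):
    """Min-max normalization: (x - min) / (max - min), mapped into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: avoid division by zero and return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priority_weights(words, tfidf, positions):
    """Return {word: normalized priority weight}.

    tfidf maps each word to a precomputed TF-IDF score; positions maps each
    word to "title", "first_sentence", "in_sentence", or anything else.
    """
    t = min_max([tfidf[w] for w in words])
    l = min_max([len(w) for w in words])
    p = min_max([POSITION_COEFF.get(positions.get(w), OTHER_COEFF) for w in words])
    # Assumed combination rule: sum of the three normalized features.
    raw = [a + b + c for a, b, c in zip(t, l, p)]
    return dict(zip(words, min_max(raw)))
```

For example, calling `priority_weights(["tender", "price", "the"], {"tender": 0.9, "price": 0.5, "the": 0.1}, {"tender": "title", "price": "first_sentence", "the": "in_sentence"})` ranks "tender" highest and "the" lowest, because "tender" leads on all three features.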
Further, the keyword ranking sorts the extracted grid words in descending order of their weights;
The establishing of the bidding document abstract comprises selecting the first N keywords from the sorted keyword list and constructing a logically coherent sentence or paragraph as the abstract;
a screening threshold is set through the word priority ranking and is used to screen out high-priority words as the keywords;
and a correction function and a pruning function are constructed by the tuning correction unit, and grammar checking and semantic consistency checking are performed on the abstract, wherein the correction function is used for correcting symmetry, compactness, vanishing distance and orthogonality problems during image feature extraction, and the pruning function compensates the word density analysis unit through a database and text image information.
In this embodiment, the keywords are ranked by weight from high to low, so that keywords with higher weights are processed first, which improves the efficiency of information processing and text analysis. The first N keywords (N being a preset value) are selected from the ranked keyword list as the core content of the abstract, and a logically coherent sentence or paragraph carrying complete information is constructed from these keywords as the abstract; when constructing the abstract, the relevance among the keywords and the context logic need to be considered to ensure the accuracy and readability of the abstract. A screening threshold is set in the word priority ranking to screen out the keywords with higher priority; keywords with lower weights are removed by the threshold, which improves the quality of the keyword list. The tuning correction unit performs grammar checking and semantic consistency checking on the abstract by constructing a correction function and a pruning function: the correction function addresses the symmetry, compactness, vanishing distance and orthogonality problems arising during image feature extraction, ensuring the accuracy of the image information used in the abstract, and the pruning function compensates the word density analysis unit through a database and the text image information, ensuring the integrity of the abstract.
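The ranking, threshold screening and top-N selection described above can be sketched as follows. The function name `select_keywords` and its parameters are hypothetical, and the sketch assumes that the threshold is applied to the normalized priority weights before the first N keywords are taken.

```python
def select_keywords(weights, n, threshold=0.0):
    """Return up to n keywords whose weight is at least threshold, highest first.

    weights maps each grid word to its priority weight.
    """
    # Sort in descending order of weight (ties keep their original order).
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    # Drop low-priority words, then keep the first n.
    return [word for word, score in ranked if score >= threshold][:n]
```

With weights `{"tender": 1.0, "price": 0.59, "the": 0.0, "deadline": 0.7}`, a threshold of 0.5 and N = 2, the selected keywords are "tender" and "deadline".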
Further, if the abstract is fluent, the semantic consistency of the abstract is checked again through a semantic similarity algorithm; if the abstract is not fluent, synonym replacement, root association and reordering are performed until the abstract is fluent;
if the abstract does not conform to the semantics of the initial bidding document, the keywords of the bidding document are screened again through the word density analysis unit and the word priority ranking unit, and rechecked for missing and wrong keywords;
The semantic similarity algorithm measures the similarity of two vectors through cosine similarity, and the calculation expression is as follows:
$$\mathrm{Sim}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
wherein $\mathrm{Sim}(A, B)$ is the similarity of the abstract and the text, $\cos\theta$ is the cosine similarity, $A$ is the vector representation of the abstract, and $B$ is the vector representation of the bidding document;
for an extracted keyword k, the synonym replacement matches a group of synonyms from a preset synonym library and selects the synonym with the highest semantic matching degree to replace the original keyword;
and for a specific word t, the root association is based on the root root(t): synonyms and derivative words of that root are retrieved from a preset word stock for replacement.
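The cosine similarity check and the synonym replacement described above can be sketched as follows. The text does not state how the abstract and the bidding document are vectorized or how the semantic matching degree is measured, so this sketch assumes simple bag-of-words count vectors and uses cosine similarity against the document vector as the matching degree. The function names and the layout of the synonym library (a dictionary from keyword to candidate synonyms) are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Assumed vectorization: whitespace tokens counted per word."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors: A.B / (|A| |B|)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def replace_with_best_synonym(keyword, synonym_library, document_vec):
    """Pick the candidate synonym that best matches the document vector."""
    candidates = synonym_library.get(keyword, [])
    if not candidates:
        return keyword  # nothing to replace with
    return max(candidates,
               key=lambda s: cosine_similarity(bag_of_words(s), document_vec))
```

In practice a dense sentence embedding would likely replace the bag-of-words vectors; the cosine formula itself is unchanged by that choice.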
Example two
As shown in fig. 3, the intelligent abstract extraction system for bidding documents based on image recognition comprises:
The file identification module comprises a scanning unit for scanning bidding file information and a classification unit for defining the technical field of bidding files;
the input module comprises an input unit for image scanning input, an image recognition unit for extracting text content from an editable electronic text format, and a natural language processing unit for primarily understanding the extracted text;
The abstract extraction model comprises a word density analysis unit, a word priority screening unit, a word priority sorting unit and an abstract generation and inspection unit;
The feedback module comprises a tuning correction unit, a semantic similarity algorithm, synonym replacement and root association;
the correction function is used for correcting symmetry, compactness, vanishing distance and orthogonality problems during image feature extraction, and the pruning function compensates the word density analysis unit through a database and text image information;
The word density analysis unit is used for extracting and checking the occurrence density of words and extracting word characteristics through a characteristic extraction function;
the word priority screening unit calculates the priority weight of the words by setting word priority ordering logic;
The word priority ranking unit is used for receiving the word priority weights and ranking the word priorities;
The abstract generation and inspection unit is used for generating a bidding document abstract according to word priority order and performing integrity inspection on the bidding document abstract.
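The interaction between the abstract extraction model and the feedback module can be sketched as a control loop. This is an architectural illustration only: the fluency check, the semantic check, the repair step (synonym replacement, root association, reordering) and the keyword rescreening step are passed in as callables because their internals are not specified here, and the `max_rounds` limit is an added assumption to guarantee termination.

```python
def extract_abstract(keywords, build, is_fluent, is_consistent,
                     repair, rescreen, max_rounds=5):
    """Run the generate / check / correct loop and return the abstract, or None.

    build(keywords)        -> abstract text
    is_fluent(abstract)    -> bool, grammar and fluency check
    is_consistent(abstract)-> bool, semantic check against the bidding document
    repair(keywords)       -> keywords after synonym replacement, root
                              association and reordering
    rescreen(keywords)     -> keywords after rerunning density analysis
                              and priority ranking
    """
    for _ in range(max_rounds):
        abstract = build(keywords)
        if not is_fluent(abstract):
            # Not fluent: adjust the keywords and try again.
            keywords = repair(keywords)
            continue
        if is_consistent(abstract):
            # Fluent and semantically consistent: extraction is complete.
            return abstract
        # Fluent but inconsistent: recheck for missing or wrong keywords.
        keywords = rescreen(keywords)
    return None
```

Keeping the checks as injected callables lets each unit of the feedback module be replaced or tested independently of the loop.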
Example three
As shown in fig. 4, which is a structural schematic of the image recognition-based intelligent abstract extraction system for bidding documents, the system comprises a cabinet body 1, a total layer cabin door 2, a lower layer cabin door 3 and a floor 4 with rollers, wherein the cabinet body 1 is used for supporting a display, and the lower layer cabin door 3 is used for storing the lightning protection and power distribution unit 13, the data processing unit 14 and the storage drawer 15 shown in fig. 5;
Further, the total layer cabin door 2 is used for storing a host 10, a file identification module 11 and an input module 12;
As shown in fig. 5, the system also comprises a wireless communication module 5, a keyboard 6, an identification camera 7, a laser printer 8 and a printer paper outlet supporting plate 9.
The keyboard 6 is used for controlling the input of bidding documents, the identification camera 7 is used for recognizing the content of paper bidding documents, and the laser printer 8 is used for printing the abstract generation result.
Fig. 6 is a diagram of a bid document intelligent abstract extraction system structure based on image recognition.
As shown in fig. 7, an office supplies tender is input into the intelligent abstract extraction system for bidding documents. On input, the system analyzes the document size and records the input time, and the original text can be previewed. Preliminary identification processing is performed on the office supplies tender, with document scanning and keyword extraction marked as completed; the processing progress is used to monitor the abstract generation process of the bidding document; and the final abstract generation result for the office supplies tender is presented as the summary result.
It is important to note that the construction and arrangement of the application as shown in the various exemplary embodiments is illustrative only. Although only two embodiments have been described in detail in this disclosure, those skilled in the art who review this disclosure will readily appreciate that many modifications are possible, for example, variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters (e.g., temperature, pressure, etc.), mounting arrangements, use of materials, colors, orientations, etc., without materially departing from the novel teachings and advantages of the subject matter described in this application. For example, elements shown as integrally formed may be constructed of multiple parts or elements, the position of elements may be reversed or otherwise varied, and the nature or number of discrete elements or positions may be altered or varied. Accordingly, all such modifications are intended to be included within the scope of present application. The order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments. Any means-plus-function clause is intended to cover the structures described herein as performing the function and not only structural equivalents but also equivalent structures. Other substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the exemplary embodiments without departing from the scope of the present applications. Therefore, the application is not limited to the specific embodiments, but extends to various modifications that nevertheless fall within the scope of the appended claims.
Furthermore, in an effort to provide a concise description of the exemplary embodiments, all features of an actual implementation may not be described (i.e., those not associated with the best mode presently contemplated for carrying out the invention, or those not associated with practicing the invention).
It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions may be made. Such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.

Claims (9)

1. An intelligent abstract extraction method for bidding documents based on image recognition, characterized by comprising:
S1. selecting, according to the bidding document, the application field to which the bidding document belongs;
S2. inputting bidding document information through image scanning, and establishing an abstract extraction model to extract bidding document keywords;
the abstract extraction model comprises a word density analysis unit, a word priority screening unit, a word priority ranking unit and an abstract generation and inspection unit;
the word density analysis unit is used for extracting and checking the occurrence density of words and extracting word features through a feature extraction function;
the word priority screening unit calculates the priority weights of the words by setting word priority ordering logic;
the word priority ranking unit is used for receiving the word priority weights and ranking the word priorities;
the abstract generation and inspection unit is used for generating a bidding document abstract according to the word priority ranking and performing integrity inspection on the bidding document abstract;
S3. establishing the bidding document abstract through keyword ranking, and checking, by a tuning correction unit, whether the abstract is fluent and whether it conforms to the semantics of the initial bidding document;
S4. if the abstract is fluent, continuing to check whether it conforms to the semantics of the initial bidding document; if the abstract is not fluent, performing synonym replacement and root association on the keywords and reordering the keywords;
S5. if the abstract conforms to the semantics of the initial bidding document, completing the intelligent abstract extraction of the bidding document; if it does not conform, rechecking whether the keywords are correct.

2. The intelligent abstract extraction method for bidding documents based on image recognition according to claim 1, characterized in that:
the inputting of bidding document information through image scanning converts the document into an editable electronic text format through image recognition and natural language processing;
the image recognition is used for extracting text content from the editable electronic text format;
the natural language processing is used for preliminary understanding of the extracted text.

3. The intelligent abstract extraction method for bidding documents based on image recognition according to claim 2, characterized in that:
the word density analysis unit performs region segmentation on the text content extracted from the editable electronic text format, establishes a two-dimensional coordinate system of size a×a, numbers the grids from left to right and from top to bottom, and calculates the density of word occurrence within each grid;
the word density of the grid with abscissa i and ordinate j in the electronic text format is calculated from the weight of the k-th word, the length of the k-th word, the distance factor of the k-th word in grid (i, j), and the total number of words;
the distance factor is an exponential expression in the constant e, determined by the abscissa position of word k, the abscissa of the central position of the grid, the ordinate position of word k, the ordinate of the central position of the grid, and a distance coefficient α;
the distance factor is a value between 0 and 1, indicating that the importance of a word varies with its position.

4. The intelligent abstract extraction method for bidding documents based on image recognition according to claim 3, characterized in that:
the word priority screening unit collects the density and position of the grid words in the region segmentation, performs feature extraction on the collected density and position of the grid words, performs weight calculation according to the extracted features, and finally sorts and outputs the grid words according to the priority weights of all grid words;
the feature extraction is used for extracting the term frequency-inverse document frequency (TF-IDF) of the grid word, the length of the grid word and the position weight of the grid word.

5. The intelligent abstract extraction method for bidding documents based on image recognition according to claim 4, characterized in that:
the weight calculation computes the grid word priority weight from the TF-IDF of the grid word, the length of word k, and the position weight of word k;
the position weight is a piecewise expression over the position of the k-th word in the bidding document: it takes a title position weight coefficient when the grid word is in a title, a first-sentence position weight coefficient when the grid word is in a first sentence, an in-sentence position weight coefficient when the grid word is within a sentence, and a separate value in other cases;
the grid word priority weight, the TF-IDF of the grid word, the word length and the word position weight are normalized through min-max normalization;
the min-max normalization calculation expression is as follows:
$$x' = \frac{x - \min[x]}{\max[x] - \min[x]}$$
wherein $x'$ is the data after normalization, $x$ is the input data to be normalized, min[] is the minimum function, and max[] is the maximum function;
the word priority ranking unit is used for receiving the normalized word priority weights and ranking the word priorities.

6. The intelligent abstract extraction method for bidding documents based on image recognition according to claim 5, characterized in that:
the keyword ranking sorts the extracted grid words in descending order of their weights;
the establishing of the bidding document abstract comprises selecting the first N keywords from the sorted keyword list and constructing a logically coherent sentence or paragraph as the abstract;
a screening threshold is set through the word priority ranking and is used to screen out high-priority words as the keywords;
a correction function and a pruning function are constructed by the tuning correction unit, and grammar checking and semantic consistency checking are performed on the abstract, wherein the correction function is used for correcting symmetry, compactness, vanishing distance and orthogonality problems during image feature extraction, and the pruning function compensates the word density analysis unit through a database and text image information.

7. The intelligent abstract extraction method for bidding documents based on image recognition according to claim 6, characterized in that:
if the abstract is fluent, the semantic consistency of the abstract is checked again through a semantic similarity algorithm; if the abstract is not fluent, synonym replacement, root association and reordering are performed until the abstract is fluent;
if the abstract does not conform to the semantics of the initial bidding document, the keywords of the bidding document are screened again through the word density analysis unit and the word priority ranking unit, and rechecked for missing and wrong keywords;
the semantic similarity algorithm measures the similarity of two vectors through cosine similarity, and the calculation expression is as follows:
$$\mathrm{Sim}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
wherein $\mathrm{Sim}(A, B)$ is the similarity of the abstract and the text, $\cos\theta$ is the cosine similarity, $A$ is the vector representation of the abstract, and $B$ is the vector representation of the bidding document;
for an extracted keyword k, the synonym replacement matches a group of synonyms from a preset synonym library and selects the synonym with the highest semantic matching degree to replace the original keyword;
for a specific word t, the root association is based on the root root(t), and synonyms and derivative words of the root are retrieved from a preset word stock for replacement.

8. An intelligent abstract extraction system for bidding documents based on image recognition, used for implementing the intelligent abstract extraction method for bidding documents based on image recognition according to any one of claims 1 to 7, characterized by comprising:
a file identification module, comprising a scanning unit for scanning bidding document information and a classification unit for defining the technical field to which the bidding document belongs;
an input module, comprising an input unit for image scanning input, an image recognition unit for extracting text content from the editable electronic text format, and a natural language processing unit for preliminary understanding of the extracted text;
an abstract extraction model, comprising a word density analysis unit, a word priority screening unit, a word priority ranking unit and an abstract generation and inspection unit;
a feedback module, comprising a tuning correction unit, a semantic similarity algorithm, synonym replacement and root association.

9. The intelligent abstract extraction system for bidding documents based on image recognition according to claim 8, characterized in that:
the word density analysis unit is used for extracting and checking the occurrence density of words and extracting word features through a feature extraction function;
the word priority screening unit calculates the priority weights of the words by setting word priority ordering logic;
the word priority ranking unit is used for receiving the word priority weights and ranking the word priorities;
the abstract generation and inspection unit is used for generating a bidding document abstract according to the word priority ranking and performing integrity inspection on the bidding document abstract.
CN202510421859.8A 2025-04-07 2025-04-07 Intelligent abstract extraction method and system for bidding document based on image recognition Active CN119938908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510421859.8A CN119938908B (en) 2025-04-07 2025-04-07 Intelligent abstract extraction method and system for bidding document based on image recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510421859.8A CN119938908B (en) 2025-04-07 2025-04-07 Intelligent abstract extraction method and system for bidding document based on image recognition

Publications (2)

Publication Number Publication Date
CN119938908A CN119938908A (en) 2025-05-06
CN119938908B true CN119938908B (en) 2025-07-01

Family

ID=95548890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510421859.8A Active CN119938908B (en) 2025-04-07 2025-04-07 Intelligent abstract extraction method and system for bidding document based on image recognition

Country Status (1)

Country Link
CN (1) CN119938908B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116362927A (en) * 2023-03-15 2023-06-30 鹰谷睿科(重庆)数据科技有限公司 Course target achievement condition evaluation rationality evaluation method based on semantic analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100905434B1 (en) * 2008-08-08 2009-07-02 (주)이스트소프트 File upload method with real-time index information extraction and web storage system using the same

Also Published As

Publication number Publication date
CN119938908A (en) 2025-05-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant