
CN119339387B - Image text segmentation method, device, electronic device and readable storage medium - Google Patents

Image text segmentation method, device, electronic device and readable storage medium

Info

Publication number
CN119339387B
Authority
CN
China
Prior art keywords
target
features
feature
text
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411884460.5A
Other languages
Chinese (zh)
Other versions
CN119339387A (en)
Inventor
段杰
石雅洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Longhu Longzhi Manufacturing Engineering Construction Management Co ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202411884460.5A priority Critical patent/CN119339387B/en
Publication of CN119339387A publication Critical patent/CN119339387A/en
Application granted granted Critical
Publication of CN119339387B publication Critical patent/CN119339387B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and provides an image text segmentation method, an image text segmentation apparatus, an electronic device and a readable storage medium. The method comprises: acquiring an image to be segmented, and performing feature extraction on it to obtain text edge features, text skeleton features and multi-level visual features; performing feature screening on the text edge features and the text skeleton features by using the multi-level visual features to obtain target edge features and target skeleton features; performing feature conversion on the multi-level visual features to obtain target visual features; acquiring a target query for guiding image text segmentation, and performing fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target query to obtain a target decoding result; and determining a text segmentation result according to the target decoding result and the target visual features. The invention segments text from an image more accurately and improves the accuracy of text segmentation.

Description

Image text segmentation method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to an image text segmentation method, an image text segmentation device, an electronic device, and a readable storage medium.
Background
The core goal of image text segmentation is to accurately extract the text portions of an image and thereby provide a clearer, more reliable basis for subsequent text recognition and understanding. Specifically, image text segmentation techniques analyse the pixels and features of an image to identify which regions contain text and to separate those regions from the background, and current techniques already use text-related supervisory information to some extent to improve segmentation accuracy. However, existing methods still suffer from inaccurate text segmentation.
Disclosure of Invention
In view of the above, embodiments of the present invention provide an image text segmentation method, an apparatus, an electronic device and a readable storage medium, so as to solve the problem of inaccurate text segmentation in the prior art.
In a first aspect of an embodiment of the present invention, there is provided an image text segmentation method, including:
The method comprises: acquiring an image to be segmented, and performing feature extraction on the image to be segmented to obtain text edge features, text skeleton features and multi-level visual features; performing feature screening on the text edge features and the text skeleton features by using the multi-level visual features to obtain target edge features and target skeleton features; performing feature conversion on the multi-level visual features to obtain target visual features; acquiring a target query for guiding image text segmentation, and performing fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target query to obtain a target decoding result; and determining a text segmentation result according to the target decoding result and the target visual features.
In a second aspect of an embodiment of the present invention, there is provided an image text segmentation apparatus, including:
The apparatus comprises a feature extraction module, a feature screening module, a feature conversion module, a fusion decoding module and a segmentation module. The feature extraction module is configured to acquire an image to be segmented and perform feature extraction on it to obtain text edge features, text skeleton features and multi-level visual features; the feature screening module is configured to perform feature screening on the text edge features and the text skeleton features by using the multi-level visual features to obtain target edge features and target skeleton features; the feature conversion module is configured to perform feature conversion on the multi-level visual features to obtain target visual features; the fusion decoding module is configured to acquire a target query for guiding image text segmentation and perform fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target query to obtain a target decoding result; and the segmentation module is configured to determine a text segmentation result according to the target decoding result and the target visual features.
In a third aspect of the embodiments of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present invention, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
An image to be segmented is acquired, and feature extraction is performed on it to obtain text edge features, text skeleton features and multi-level visual features; extracting the text edge features and the text skeleton features introduces text edge awareness and text skeleton awareness into text segmentation. The multi-level visual features are used to screen the text edge features and the text skeleton features, strengthening both kinds of awareness, so that the resulting target edge features and target skeleton features reflect the text information in the image and the segmentation process can focus on the edges and skeletons of the text to accurately identify and segment text regions. Feature conversion is further performed on the multi-level visual features to obtain target visual features; a target query for guiding image text segmentation is acquired, and the target edge features, the target skeleton features and the target visual features are fused and decoded according to the target query to obtain a target decoding result; and a text segmentation result is determined according to the target decoding result and the target visual features. Because the decoding result fuses multiple guiding features, text can be segmented from the image more accurately, improving the accuracy of text segmentation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of an image text segmentation method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an image text segmentation apparatus according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In the field of image text processing, the image text segmentation task occupies a key position: its core requirement is to accurately segment text regions from an image. This operation is an important cornerstone for subsequent efficient text editing and complete text removal, and it directly affects the quality and efficiency of the entire text processing flow.
Existing text segmentation technology has developed continuously, and existing methods improve performance to a certain extent by means of various supervision signals closely related to text. However, these techniques share a clear shortcoming: they often fail to fully exploit the decisive role that text edges and text skeletons play in the segmentation process. Text edges outline the contour boundary of a text region and provide key clues for accurately delimiting the extent of the text, while the text skeleton reflects the morphological characteristics of the text at the level of its internal structure; both are indispensable for high-precision text segmentation. For example, conventional edge detection algorithms, while able to detect text edges well, cannot distinguish text regions from non-text regions, so edges detected in non-text regions may interfere with the performance of a text segmentation model. In addition, existing text segmentation methods still have limitations when segmenting text edge regions, and the results are even worse when text edges are blurred or the background is complex. On this basis, the invention provides an image text segmentation method that segments text from an image more accurately and improves the accuracy of text segmentation.
An image text segmentation method, an apparatus, an electronic device, and a readable storage medium according to embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of an image text segmentation method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
S101, acquiring an image to be segmented, and performing feature extraction on the image to be segmented to obtain text edge features, text skeleton features and multi-level visual features;
S102, performing feature screening on the text edge features and the text skeleton features by using the multi-level visual features to obtain target edge features and target skeleton features;
S103, performing feature conversion on the multi-level visual features to obtain target visual features;
S104, acquiring a target query for guiding image text segmentation, and performing fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target query to obtain a target decoding result;
S105, determining a text segmentation result according to the target decoding result and the target visual features.
Specifically, the image text segmentation method of this embodiment may be executed by a client or a server, or by both; the following description takes one execution subject as an example. The image to be segmented is the original image input to the image text segmentation task. After it is obtained, the image may be preprocessed, for example with smoothing techniques such as Gaussian filtering, median filtering, bilateral filtering or mean filtering, to reduce noise. Feature extraction is then performed on the preprocessed image. First, an edge detection algorithm such as the Canny, Prewitt or Roberts edge detector can be used to extract text edge features. Second, text skeleton features can be obtained with a skeleton extraction algorithm such as the Hilditch algorithm or the Zhang-Suen algorithm. Then a visual encoder can extract visual features from the image; the visual encoder may use a deep learning model such as ResNet as its backbone network to extract multi-level visual features from the image to be segmented.
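As a concrete sketch of this extraction step, the snippet below pairs Canny edge detection with Zhang-Suen thinning; it assumes opencv-contrib-python is installed (for `cv2.ximgproc`), and the blur kernel and Canny thresholds are illustrative values, not ones given by the patent.

```python
# Sketch of the classical feature-extraction step described above:
# Canny for text edge maps and Zhang-Suen thinning for text skeletons.
import cv2
import numpy as np

def extract_edge_and_skeleton(image_path: str):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Gaussian smoothing to suppress noise before edge detection
    smoothed = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.0)
    # Canny edge map: rough outlines of text strokes (and background clutter)
    edges = cv2.Canny(smoothed, threshold1=50, threshold2=150)
    # Binarize, then thin strokes to a one-pixel-wide skeleton (Zhang-Suen)
    _, binary = cv2.threshold(smoothed, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    skeleton = cv2.ximgproc.thinning(
        binary, thinningType=cv2.ximgproc.THINNING_ZHANGSUEN)
    return edges, skeleton
```

In a full pipeline these two maps would be fed alongside the backbone's multi-level visual features, as the following paragraphs describe.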
It can be appreciated that text edge features clearly delineate the boundary between text and background. In the image, the edges of the text carry its contour information, such as the start and end positions of strokes and their degree of curvature, which is critical for accurately locating text in the image. Extracting text edge features helps the subsequent segmentation algorithm separate the text from complex backgrounds or other interfering elements. The text skeleton feature reflects the internal structure of the text: it can be regarded as the "central axis" of a text stroke, a simplified and abstract representation of the text's shape. Extracting skeleton features helps in understanding the basic shape and structure of the text; for example, in handwriting recognition or in text segmentation with large font deformation, skeleton features provide connectivity and topological structure information about the strokes. The multi-level visual features, in turn, contain information extracted from different levels and perspectives of the image, including low-level pixel features such as color and brightness as well as higher-level semantic features such as texture and shape combinations.
In some examples, the multi-level visual features are used to screen the text edge features and the text skeleton features to obtain target edge features and target skeleton features. For the target edge features, screening with the multi-level visual features removes false edges possibly caused by noise or non-text factors (such as decorative lines in the image or spurious edges produced by image compression), providing more reliable boundary information for text segmentation and improving its precision. For the target skeleton features, the screening process eliminates false skeleton branches and redundant information that do not match the actual structure of the text, so the target skeleton features better reflect the real internal structure of the text and provide more accurate shape guidance for subsequent segmentation, improving its quality and accuracy.
In some examples, the multi-level visual features are subjected to feature conversion to obtain target visual features, the multi-level visual features represent feature information of different layers of the image to be segmented, the features may have differences in format, dimension and the like, and the features of different formats and dimensions can be unified into a proper form through feature conversion, so that the text can be predicted conveniently.
In some examples, a target query for guiding image text segmentation is acquired, and the target edge features, the target skeleton features and the target visual features are fused and decoded according to the target query to obtain a target decoding result. Specifically, the target query is a learnable query that can learn the common positions of text in different scenes, the typical characteristics of text, and the associations between text and other elements of the image; through this learning process the query acts like a "knowledge base" for image text segmentation and provides an accurate guiding direction for the subsequent segmentation operation. Under the guidance of the target query, the target edge features, the target skeleton features and the target visual features are fused and decoded: like a commander, the target query decides, according to the learned text regularities, how to fuse the boundary information of the target edge features, the structural information of the target skeleton features and the comprehensive description carried by the target visual features, so that the resulting target decoding result reflects the true appearance and accurate position of the text in the image as far as possible, laying a foundation for an accurate text segmentation result. Finally, a more accurate text segmentation result is obtained from the target decoding result and the target visual features; how this is done is described in detail in the following embodiments and is not repeated here.
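The "learnable query" described here matches the DETR-style pattern of a learned query embedding optimized jointly with the network. Below is a minimal sketch under that assumption; `num_queries` and `dim` are chosen purely for illustration.

```python
import torch
import torch.nn as nn

# A learnable target query: a parameter tensor trained with the model.
# num_queries and dim are illustrative, not values specified by the patent.
num_queries, dim = 100, 256
target_query = nn.Embedding(num_queries, dim)

# At inference, the same learned weights are broadcast for each image:
batch_size = 2
queries = target_query.weight.unsqueeze(0).expand(batch_size, -1, -1)
print(queries.shape)  # torch.Size([2, 100, 256])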
According to the technical scheme provided by the embodiment of the invention, an image to be segmented is acquired and feature extraction is performed on it to obtain text edge features, text skeleton features and multi-level visual features; extracting the text edge features and the text skeleton features introduces text edge awareness and text skeleton awareness into text segmentation. The multi-level visual features are used to screen the text edge features and the text skeleton features respectively, strengthening both kinds of awareness, so that the resulting target edge features and target skeleton features reflect the text information in the image while the segmentation process focuses on the edges and skeletons of the text to accurately identify and segment text regions. Feature conversion is further performed on the multi-level visual features to obtain target visual features; a target query for guiding image text segmentation is acquired, and the target edge features, the target skeleton features and the target visual features are fused and decoded according to the target query to obtain a target decoding result; and a text segmentation result is determined according to the target decoding result and the target visual features. Because the decoding result fuses multiple guiding features, text is segmented from the image more accurately and the accuracy of text segmentation is improved.
In some embodiments, performing feature screening on the text edge features and the text skeleton features by using the multi-level visual features to obtain the target edge features and the target skeleton features comprises: splicing the multi-level visual features and using a target convolutional network as a text detection head; inputting the spliced multi-level visual features into the target convolutional network and determining the text regions of the multi-level visual features with it to obtain a text region detection frame; and screening the text edge features and the text skeleton features according to the text region detection frame to obtain the target edge features and the target skeleton features.
Specifically, the multi-level visual features are features extracted at different levels, ranging from low-level edges and textures to high-level object and scene information, so they are spliced and fused into a unified feature representation. This representation is then input to the target convolutional network, which in this embodiment is preferably a 1x1 convolutional layer. A 1x1 convolutional layer is a special convolutional layer whose kernel size is 1x1, so it does not change the spatial dimensions of the input features; however, by adjusting the number of convolution kernels it can change the number of feature channels, thereby reducing or increasing the feature dimensionality.
It can be understood that the 1x1 convolutional layer in this embodiment serves as the text detection head: it further processes and transforms the spliced features to predict the text region, yielding a text region detection frame (denoted Mask M). Mask M is a binary matrix of the same size as the input image, in which pixels with value 1 represent text regions and pixels with value 0 represent non-text regions. The 1x1 convolutional layer can predict Mask M accurately, thereby accomplishing the text detection task.
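As a concrete illustration of this detection head, the sketch below resizes and concatenates the multi-level feature maps, then applies a 1x1 convolution to produce Mask M. This is a minimal PyTorch reading of the description, not the patent's implementation; the resizing strategy, the 0.5 threshold and the channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextDetectionHead(nn.Module):
    """Concatenate multi-level features and predict a text-region mask
    with a 1x1 convolution, as described above."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # Resize every level to the first level's spatial size so the
        # maps can be concatenated along the channel dimension.
        h, w = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=(h, w), mode="bilinear",
                               align_corners=False) for f in feats]
        fused = torch.cat(feats, dim=1)
        # The 1x1 conv keeps spatial dims and maps channels -> 1 logit;
        # sigmoid + threshold yields the binary Mask M.
        return (torch.sigmoid(self.head(fused)) > 0.5).float()
```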
It can be appreciated that the text edge features extracted earlier delineate the boundary between text and background, but the feature extraction process may introduce noise or false edges caused by non-text factors. With the text region detection frame available, attention can be restricted to the portion inside it: screening the text edge features with the detection frame removes interfering edges unrelated to the text and keeps only the valid edge features that truly belong to text inside the frame, yielding target edge features that more accurately represent the real boundaries of the text. The same principle applies when the text region detection frame screens the text skeleton features.
In this way, the multi-level visual features are used to effectively screen the text edge features and the text skeleton features, more accurate target edge features and target skeleton features are obtained, and a more reliable basis is provided for subsequent tasks such as image text segmentation.
In some embodiments, screening the text edge features and the text skeleton features according to the text region detection frame to obtain the target edge features and the target skeleton features comprises: multiplying the text edge features element by element by the text region detection frame to obtain the target edge features, and multiplying the text skeleton features element by element by the text region detection frame to obtain the target skeleton features.
Specifically, the text region detection frame Mask M is a matrix matching the image size whose elements generally take two values: at positions inside the text region the value is 1 (or another specific value meaning "belongs to the text region"), and at positions outside it the value is 0 (or a specific value meaning "does not belong to the text region"). It can be thought of as a "mask" that exactly covers the text region of the image and distinguishes text regions from non-text regions.
When the text edge features are multiplied element by element by Mask M, the "screening" effect of Mask M comes into play: each element of the text edge features is multiplied by the element at the corresponding position in Mask M. Because Mask M is 1 inside the text region and 0 outside it, after the multiplication the edge feature elements inside the text region remain unchanged (multiplied by 1) and constitute the valid edge features that truly belong to text, while the edge feature elements outside the text region become zero (multiplied by 0), so potentially interfering edges outside the text region are filtered out. The result of this element-by-element multiplication is the target edge features. Compared with the original text edge features, the target edge features contain only the valid edge features of text inside the text region and exclude false edges outside it that may be caused by background, noise and other factors, thereby providing more accurate text boundary information for subsequent tasks such as text segmentation.
It can be understood that the operation and principle of multiplying the text skeleton features element by element by the text region detection frame to obtain the target skeleton features are similar and are not repeated here.
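In code, this screening step reduces to a broadcasted element-wise product; the tensor shapes below are illustrative.

```python
import torch

# Feature screening by element-wise multiplication with Mask M.
# The single-channel mask is broadcast across all feature channels.
edge_feat = torch.randn(1, 64, 128, 128)      # text edge features
skeleton_feat = torch.randn(1, 64, 128, 128)  # text skeleton features
mask_m = (torch.rand(1, 1, 128, 128) > 0.5).float()  # text-region Mask M

target_edge = edge_feat * mask_m          # zeroes edges outside text regions
target_skeleton = skeleton_feat * mask_m  # zeroes spurious skeleton branches
```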
In some embodiments, performing fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target query to obtain the target decoding result comprises: using the target query as a first parameter and the target edge features, the target skeleton features and the target visual features as a second parameter; inputting the first and second parameters into a target decoder that comprises a plurality of successively arranged decoding submodules; performing fusion decoding on the second parameter with the first decoding submodule according to the first parameter to obtain a first decoding result; using the first decoding result as a new first parameter and performing fusion decoding on the second parameter with the second decoding submodule according to the new first parameter to obtain a second decoding result; and repeating the fusion decoding operation of each decoding submodule until the target decoding result output by the last decoding submodule is obtained.
Specifically, the first parameter (the target query) and the second parameter (the target edge features, target skeleton features and target visual features) are input together into the target decoder. In this embodiment the target decoder preferably adopts a Transformer architecture comprising a plurality of successively arranged Transformer decoding submodules, because the self-attention mechanism in a Transformer decoder can effectively process long input sequences: when handling a complex second-parameter sequence containing the target edge features, target skeleton features and target visual features, it can capture the relationships between different positions in the sequence, and it can automatically assign different weights to different features in the second parameter according to the input first parameter. For example, when the target query focuses more on skeletal structure, the Transformer decoder can assign higher weights to the target skeleton features through attention, fusing those features more effectively into the decoding result. This ability to dynamically adjust feature weights according to task requirements makes the Transformer decoder flexible and efficient when fusing multiple types of features (e.g. edges, skeletons and visual features).
Further, according to the first parameter, the first decoding submodule performs fusion decoding on the second parameter to obtain a first decoding result; that is, the first parameter (the target query) and the second parameter are input together into the first decoding submodule for fusion decoding. The first decoding result then serves as the new first parameter and is input together with the second parameter into the second decoding submodule for fusion decoding, and the fusion decoding operation of each decoding submodule is repeated until the preset number of fusion decoding steps is reached, giving the final target decoding result.
Thus, the final target decoding result is obtained after the information has been processed and continuously fused by all decoding submodules of the target decoder in sequence, and it satisfies the requirements expressed by the initially set target query.
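A minimal sketch of this iterative scheme follows; the layer interface is hypothetical, and each element of `layers` stands for one decoding submodule (one such submodule is sketched after the next paragraph).

```python
import torch

def run_target_decoder(query, visual, edge, skeleton, layers):
    """Iterative fusion decoding: each submodule consumes the current
    query (first parameter) and the fixed feature set (second parameter),
    and its output becomes the next submodule's query."""
    for layer in layers:
        query = layer(query, visual, edge, skeleton)
    return query  # target decoding result from the last submodule
```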
In some embodiments, each decoding submodule comprises a self-attention layer, a pixel-aware attention layer, an edge-aware attention layer, a skeleton-aware attention layer and a feedforward network layer, and the fusion decoding operation of each decoding submodule comprises: processing the first parameter with the self-attention layer; fusing the target visual features with the processed first parameter in the pixel-aware attention layer to obtain first fused features; fusing the target edge features with the first fused features in the edge-aware attention layer to obtain second fused features; fusing the target skeleton features with the second fused features in the skeleton-aware attention layer to obtain a third fusion result; and applying a nonlinear transformation to the third fusion result in the feedforward network layer to obtain the decoding result corresponding to the decoding submodule.
In particular, the self-attention layer processes the first parameter with a self-attention mechanism, which captures the relationships between different parts of the first parameter and automatically assigns different weights according to their correlations; its output is the processed first parameter. The pixel-aware attention layer then fuses the target visual features with this processed first parameter to obtain the first fused features. These features contain the key query information carried by the first parameter after self-attention and integrate the visual detail provided by the target visual features, so subsequent processing can proceed on a more comprehensive, integrated visual basis.
Further, the edge-aware attention layer fuses the target edge features with the previously obtained first fused features to generate the second fused features. These features further integrate the contour information of the target edge features with the content of the first fused features, making them richer and more comprehensive and aiding accurate understanding and processing of the target.
Further, the skeleton-aware attention layer fuses the target skeleton features with the second fused features to obtain a third fusion result. Because the first parameter has by now been processed in multiple steps and fused in turn with the target visual, edge and skeleton features, this result synthesizes information from all of these aspects and provides a complete basis for the final decoding result.
Further, the feedforward network layer applies a nonlinear transformation to the third fusion result obtained in the previous step. The feedforward network layer typically comprises several neuron layers, and the nonlinear transformation is realized by activation functions and the like; it further processes the third fusion result to mine the hidden relationships and features within it and converts it into a form better suited as a decoding result. After this nonlinear transformation, the decoding result corresponding to each decoding submodule is obtained.
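The sketch below renders one such submodule in PyTorch, using standard multi-head attention for each perception layer. The residual connections, layer normalization, head count and hidden sizes are conventional Transformer choices assumed here, not details stated in the patent; feature maps are assumed to be flattened into token sequences.

```python
import torch
import torch.nn as nn

class DecodingSubmodule(nn.Module):
    """One decoding submodule: self-attention on the query, then
    cross-attention to visual, edge and skeleton features in turn,
    then a feedforward network."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pixel_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.edge_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.skel_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(5)])

    def forward(self, q, visual, edge, skeleton):
        # q: (B, N, C) queries; visual/edge/skeleton: (B, L, C) tokens
        q = self.norms[0](q + self.self_attn(q, q, q)[0])
        q = self.norms[1](q + self.pixel_attn(q, visual, visual)[0])     # 1st fusion
        q = self.norms[2](q + self.edge_attn(q, edge, edge)[0])          # 2nd fusion
        q = self.norms[3](q + self.skel_attn(q, skeleton, skeleton)[0])  # 3rd fusion
        return self.norms[4](q + self.ffn(q))  # nonlinear transformation
```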
In some embodiments, determining the text segmentation result according to the target decoding result and the target visual features comprises: performing pixel-point prediction on the target decoding result to obtain a predicted classification result of the pixel points, and multiplying the predicted classification result element by element by the target visual features to obtain the segmented text result.
Specifically, in determining the text segmentation result, pixel prediction is performed on the target decoding result to predict the probability that each pixel belongs to text or background. The target decoding result may be input into a Mask head with a multi-layer perceptron structure to predict the classification result of each pixel, and the predicted classification result is then multiplied element by element by the target visual features to strengthen the features of the text region and suppress the features of the background region. In this way a more accurate text segmentation result is obtained. Finally, the contour of the text region can be constructed from the predicted pixel classifications, effectively segmenting the text in the image. This not only improves segmentation accuracy but also shows better robustness for text with complex backgrounds and varying font sizes.
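One common way to realize this step, assumed here in the style of MaskFormer-type mask heads, is to project the per-query decoding result with an MLP and take its per-pixel product with the high-resolution target visual features; the exact wiring is an assumption, not the patent's stated design.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """MLP mask head: project the target decoding result (per-query
    embeddings) and correlate it with the target visual features to
    score every pixel."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, decoded, visual):
        # decoded: (B, N, C) query embeddings; visual: (B, C, H, W)
        mask_embed = self.mlp(decoded)
        # Per-pixel logits: the dot product of each query with each pixel,
        # i.e. an element-wise weighting of the visual features.
        return torch.einsum("bnc,bchw->bnhw", mask_embed, visual)
```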
In some embodiments, feature conversion is performed on the multi-level visual features to obtain target visual features, including upsampling the multi-level visual features to convert low resolution visual features in the multi-level visual features to high resolution visual features, and taking the upsampled multi-level visual features as the target visual features.
Specifically, in the feature conversion process, the multi-level visual features are first upsampled to increase the resolution of the feature maps, so that details that might be missed at lower resolutions can be captured and used. Upsampling can be realized in several ways, for example bilinear interpolation, bicubic interpolation, or transposed convolutional layers in a convolutional neural network. Through upsampling, the low-resolution visual features are converted into high-resolution visual features while the multi-level visual features are gradually restored to the size of the image to be segmented; this facilitates the subsequent pixel prediction and enables accurate pixel-level prediction, since high-resolution feature maps provide richer spatial information.
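A one-call sketch of the bilinear variant, with illustrative tensor sizes:

```python
import torch
import torch.nn.functional as F

# Bilinear upsampling of a low-resolution feature map back toward the
# input image size, one of the options named above.
low_res = torch.randn(1, 256, 32, 32)
high_res = F.interpolate(low_res, size=(256, 256),
                         mode="bilinear", align_corners=False)
print(high_res.shape)  # torch.Size([1, 256, 256, 256])
```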
It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution sequence, and the execution sequence of each process should be determined by the function and the internal logic of each process, and should not be construed as limiting the process in the embodiment of the present invention.
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present invention, which is not described herein.
The following are examples of the apparatus of the present invention that may be used to perform the method embodiments of the present invention. For details not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the method of the present invention.
Fig. 2 is a schematic structural diagram of an image text segmentation apparatus according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes:
The feature extraction module 201 is configured to acquire an image to be segmented and perform feature extraction on it to obtain text edge features, text skeleton features and multi-level visual features;
the feature screening module 202 is configured to perform feature screening on the text edge features and the text skeleton features by using the multi-level visual features to obtain target edge features and target skeleton features;
the feature conversion module 203 is configured to perform feature conversion on the multi-level visual features to obtain target visual features;
The fusion decoding module 204 is configured to obtain a target query for guiding image text segmentation, and fusion decode the target edge feature, the target skeleton feature and the target visual feature according to the target query to obtain a target decoding result;
the segmentation module 205 is configured to determine a text segmentation result based on the target decoding result and the target visual features.
In some embodiments, the feature screening module 202 is further configured to splice the multi-level visual features and use the target convolutional network as a text detection head, input the spliced multi-level visual features to the target convolutional network, determine text regions of the multi-level visual features by using the target convolutional network to obtain a text region detection frame, and screen the text edge features and the text skeleton features according to the text region detection frame to obtain target edge features and target skeleton features.
In some embodiments, the feature screening module 202 is further configured to multiply the text edge feature element by element with the text region detection box to obtain the target edge feature, and multiply the text skeleton feature element by element with the text region detection box to obtain the target skeleton feature.
In some embodiments, the fusion decoding module 204 is further configured to use the target query as a first parameter and the target edge features, the target skeleton features and the target visual features as a second parameter, and to input the first and second parameters into a target decoder that comprises a plurality of successively arranged decoding submodules; to perform fusion decoding on the second parameter with the first decoding submodule according to the first parameter to obtain a first decoding result; to use the first decoding result as a new first parameter and perform fusion decoding on the second parameter with the second decoding submodule according to the new first parameter to obtain a second decoding result; and to repeat the fusion decoding operation of each decoding submodule until the target decoding result output by the last decoding submodule is obtained.
In some embodiments, the fusion decoding module 204 is further configured to process the first parameter by using the self-attention layer, fuse the target visual feature with the processed first parameter by using the pixel-aware attention layer to obtain a first fused feature, fuse the target edge feature with the first fused feature by using the edge-aware attention layer to obtain a second fused feature, fuse the target skeleton feature with the second fused feature by using the skeleton-aware attention layer to obtain a third fused result, and perform nonlinear transformation on the third fused result by using the feedforward network layer to obtain a decoding result corresponding to each decoding submodule.
In some embodiments, the segmentation module 205 is further configured to perform pixel-point prediction on the target decoding result to obtain a predicted classification result of the pixel points, and to multiply the predicted classification result element by element by the target visual features to obtain the segmented text result.
In some embodiments, the feature conversion module 203 is further configured to upsample the multi-level visual features to convert low resolution visual features in the multi-level visual features to high resolution visual features, and to take the upsampled multi-level visual features as target visual features.
According to the apparatus provided by the embodiment of the invention, an image to be segmented is acquired and feature extraction is performed on it to obtain text edge features, text skeleton features and multi-level visual features; extracting the text edge features and the text skeleton features introduces text edge awareness and text skeleton awareness into text segmentation. The multi-level visual features are used to screen the text edge features and the text skeleton features respectively, strengthening both kinds of awareness, so that the resulting target edge features and target skeleton features reflect the text information in the image while the segmentation process focuses on the edges and skeletons of the text to accurately identify and segment text regions. Feature conversion is further performed on the multi-level visual features to obtain target visual features; a target query for guiding image text segmentation is acquired, and the target edge features, the target skeleton features and the target visual features are fused and decoded according to the target query to obtain a target decoding result; and a text segmentation result is determined according to the target decoding result. Because the decoding result fuses multiple guiding features, text is segmented from the image accurately and the accuracy of text segmentation is improved.
Fig. 3 is a schematic diagram of an electronic device 3 according to an embodiment of the present invention. As shown in fig. 3, the electronic device 3 of this embodiment comprises a processor 301, a memory 302 and a computer program 303 stored in the memory 302 and executable on the processor 301. The steps of the various method embodiments described above are implemented when the processor 301 executes the computer program 303. Or the processor 301 when executing the computer program 303 performs the functions of the modules/units in the above-described device embodiments.
The electronic device 3 may be an electronic device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The electronic device 3 may include, but is not limited to, a processor 301 and a memory 302. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the electronic device 3 and is not limiting of the electronic device 3 and may include more or fewer components than shown, or different components.
The Processor 301 may be a Central Processing Unit (CPU) or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 302 may be an internal storage unit of the electronic device 3, for example a hard disk or memory of the electronic device 3. The memory 302 may also be an external storage device of the electronic device 3, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the electronic device 3. The memory 302 may also include both internal storage units and external storage devices of the electronic device 3. The memory 302 is used to store computer programs and the other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units may be stored in a readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, which may be stored in a readable storage medium; when executed by a processor, the computer program can implement the steps of the method embodiments described above. The computer program may comprise computer program code in source code form, object code form, an executable file or some intermediate form. The readable storage medium may include any entity or device that can carry computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
The foregoing embodiments are merely for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or substituted for some of the technical features thereof, and that these modifications or substitutions should not depart from the spirit and scope of the technical solution of the embodiments of the present invention and should be included in the protection scope of the present invention.

Claims (6)

1.一种图像文本分割方法,其特征在于,包括:1. A method for image text segmentation, comprising: 获取待分割图像,对所述待分割图像进行特征提取,得到文字边缘特征、文字骨架特征以及多级视觉特征;Acquire an image to be segmented, perform feature extraction on the image to be segmented, and obtain text edge features, text skeleton features, and multi-level visual features; 利用所述多级视觉特征对所述文字边缘特征和所述文字骨架特征分别进行特征筛选,得到目标边缘特征和目标骨架特征;Using the multi-level visual features, feature screening is performed on the text edge features and the text skeleton features to obtain target edge features and target skeleton features; 对所述多级视觉特征进行特征转换,得到目标视觉特征;Performing feature conversion on the multi-level visual features to obtain target visual features; 获取用于指导图像文本分割的目标查询,并根据所述目标查询,对所述目标边缘特征、所述目标骨架特征以及所述目标视觉特征进行融合解码,得到目标解码结果;Obtaining a target query for guiding image text segmentation, and fusing and decoding the target edge features, the target skeleton features, and the target visual features according to the target query to obtain a target decoding result; 根据所述目标解码结果和所述目标视觉特征,确定文本分割结果;Determining a text segmentation result according to the target decoding result and the target visual feature; 利用所述多级视觉特征对所述文字边缘特征和所述文字骨架特征分别进行特征筛选,得到目标边缘特征和目标骨架特征,包括:Using the multi-level visual features to perform feature screening on the text edge features and the text skeleton features respectively to obtain target edge features and target skeleton features, including: 将所述多级视觉特征进行拼接,并将目标卷积网络作为文本检测头;Concatenating the multi-level visual features and using the target convolutional network as a text detection head; 将拼接后的多级视觉特征输入至所述目标卷积网络,利用所述目标卷积网络确定多级视觉特征的文本区域,得到文本区域检测框;所述目标卷积网络为1*1的卷积层;Input the spliced multi-level visual features into the target convolutional network, use the target convolutional network to determine the text area of the multi-level visual features, and obtain a text area detection frame; the target convolutional network is a 1*1 convolutional layer; 将所述文字边缘特征与所述文本区域检测框进行逐元素相乘,得到所述目标边缘特征;Multiplying the text edge feature by the text area detection frame element by element to obtain the target edge feature; 将所述文字骨架特征与所述文本区域检测框进行逐元素相乘,得到所述目标骨架特征;Multiplying the character skeleton feature by the text area detection frame element by element to obtain the target skeleton feature; 所述根据所述目标查询,对所述目标边缘特征、所述目标骨架特征以及所述目标视觉特征进行融合解码,得到目标解码结果,包括:According to the target query, the target edge feature, the target skeleton feature and the target visual feature are fused and decoded to obtain a target decoding result, including: 将所述目标查询作为第一参数,并将所述目标边缘特征、所述目标骨架特征以及所述目标视觉特征作为第二参数;Using the target query as a first parameter, and using the target edge feature, the target skeleton feature, and the target visual feature as a second parameter; 将所述第一参数和第二参数输入至目标解码器,所述目标解码器包括多个连续设置的解码子模块;Inputting the first parameter and the second parameter into a target decoder, wherein the target decoder comprises a plurality of decoding submodules arranged in series; 根据所述第一参数,利用第一解码子模块对所述第二参数进行融合解码,得到第一解码结果;According to the first parameter, using a first decoding submodule to perform fusion decoding on the second parameter to obtain a first decoding result; 将所述第一解码结果作为新的第一参数,并根据所述新的第一参数,利用第二解码子模块对所述第二参数进行融合解码,得到第二解码结果;Using the first decoding result as a new first parameter, and using a second decoding submodule to perform fusion decoding on the second parameter according to the new first parameter to obtain a second decoding result; 重复执行每一解码子模块的融合解码操作,直至得到最后一解码子模块输出的目标解码结果;Repeat the fusion decoding operation of each decoding submodule until the target decoding result output by the last decoding submodule is obtained; 
2. The method according to claim 1, wherein each decoding submodule comprises a self-attention layer, a pixel-aware attention layer, an edge-aware attention layer, a skeleton-aware attention layer, and a feedforward network layer, and the fusion decoding operation of each decoding submodule comprises:
processing the first parameter with the self-attention layer, and fusing the target visual features with the processed first parameter through the pixel-aware attention layer, to obtain a first fused feature;
fusing the target edge features with the first fused feature through the edge-aware attention layer, to obtain a second fused feature;
fusing the target skeleton features with the second fused feature through the skeleton-aware attention layer, to obtain a third fusion result; and
performing a nonlinear transformation on the third fusion result with the feedforward network layer, to obtain a decoding result corresponding to each decoding submodule.
3. The method according to claim 1, wherein determining the text segmentation result according to the target decoding result and the target visual features comprises:
performing pixel prediction on the target decoding result, to obtain a predicted classification result for each pixel; and
multiplying the predicted classification result of the pixels by the target visual features element by element, to obtain the segmented text result.
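Claim 2's decoding submodule chains four attention stages and a feedforward layer, and claim 1 stacks several such submodules, feeding each output back as the next query. Below is a minimal sketch under the assumption that each "aware" layer is standard multi-head cross-attention over features flattened to (batch, tokens, dim); residual connections and normalization, which a production decoder would likely include, are omitted, and all names and dimensions are assumptions.

```python
# Sketch of one decoding submodule (claim 2) and the chained target decoder
# (claim 1). nn.MultiheadAttention is one plausible realization of the
# pixel-/edge-/skeleton-aware layers, not the patent's stated architecture.
import torch
import torch.nn as nn

class DecodingSubmodule(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pixel_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.edge_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.skeleton_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, query, visual, edge, skeleton):
        # Self-attention processes the first parameter (the query).
        q, _ = self.self_attn(query, query, query)
        # Pixel-aware attention fuses the target visual features (first fusion).
        q, _ = self.pixel_attn(q, visual, visual)
        # Edge-aware attention fuses the target edge features (second fusion).
        q, _ = self.edge_attn(q, edge, edge)
        # Skeleton-aware attention fuses the target skeleton features (third fusion).
        q, _ = self.skeleton_attn(q, skeleton, skeleton)
        # The feedforward layer applies the nonlinear transformation.
        return self.ffn(q)

class TargetDecoder(nn.Module):
    def __init__(self, dim: int, num_submodules: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            DecodingSubmodule(dim) for _ in range(num_submodules))

    def forward(self, query, visual, edge, skeleton):
        # Each submodule's output becomes the new first parameter; the last
        # submodule's output is the target decoding result.
        for block in self.blocks:
            query = block(query, visual, edge, skeleton)
        return query

# Toy usage: 5 queries attend over 64*64 = 4096 flattened feature tokens.
dec = TargetDecoder(dim=32)
out = dec(torch.randn(1, 5, 32), torch.randn(1, 4096, 32),
          torch.randn(1, 4096, 32), torch.randn(1, 4096, 32))
```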
4. An image text segmentation device, comprising:
a feature extraction module configured to acquire an image to be segmented and perform feature extraction on the image to be segmented, to obtain text edge features, text skeleton features, and multi-level visual features;
a feature screening module configured to perform feature screening on the text edge features and the text skeleton features respectively by using the multi-level visual features, to obtain target edge features and target skeleton features, wherein the feature screening comprises: concatenating the multi-level visual features and using a target convolutional network as a text detection head; inputting the concatenated multi-level visual features into the target convolutional network and determining a text region of the multi-level visual features with the target convolutional network, to obtain a text region detection box, the target convolutional network being a 1*1 convolutional layer; multiplying the text edge features by the text region detection box element by element, to obtain the target edge features; and multiplying the text skeleton features by the text region detection box element by element, to obtain the target skeleton features;
a feature conversion module configured to perform feature conversion on the multi-level visual features to obtain target visual features, wherein the feature conversion comprises: upsampling the multi-level visual features, to convert low-resolution visual features in the multi-level visual features into high-resolution visual features; and taking the upsampled multi-level visual features as the target visual features;
a fusion decoding module configured to obtain a target query for guiding image text segmentation and, according to the target query, perform fusion decoding on the target edge features, the target skeleton features, and the target visual features, to obtain a target decoding result, wherein the fusion decoding comprises: taking the target query as a first parameter, and taking the target edge features, the target skeleton features, and the target visual features as a second parameter; inputting the first parameter and the second parameter into a target decoder comprising a plurality of decoding submodules arranged in series; performing fusion decoding on the second parameter with a first decoding submodule according to the first parameter, to obtain a first decoding result; taking the first decoding result as a new first parameter, and performing fusion decoding on the second parameter with a second decoding submodule according to the new first parameter, to obtain a second decoding result; and repeating the fusion decoding operation of each decoding submodule until the target decoding result output by a last decoding submodule is obtained; and
a segmentation module configured to determine a text segmentation result according to the target decoding result and the target visual features.
5. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 3.
6. A readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 3.
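Read together, the feature conversion of claim 1 and the segmentation step of claim 3 amount to upsampling the visual features, classifying pixels from the decoded queries, and taking an element-wise product. The sketch below uses a linear mask head plus an einsum over pixels, which is one plausible query-style reading rather than the patent's verbatim formulation; function names and shapes are illustrative only.

```python
# Sketch of claim 1's feature conversion (upsampling) and claim 3's
# segmentation (per-pixel prediction, then element-wise multiplication).
import torch
import torch.nn as nn
import torch.nn.functional as F

def segment_text(decoded, visual_feats, mask_head: nn.Linear):
    # decoded:      (B, Q, C)    target decoding result, one row per query
    # visual_feats: (B, C, H, W) target visual features after upsampling
    mask_embed = mask_head(decoded)                       # (B, Q, C)
    # Per-pixel prediction: dot each query embedding with each pixel feature.
    logits = torch.einsum("bqc,bchw->bqhw", mask_embed, visual_feats)
    pred = torch.sigmoid(logits)                          # pixel classification
    # Element-wise multiply the prediction with the target visual features.
    return pred.unsqueeze(2) * visual_feats.unsqueeze(1)  # (B, Q, C, H, W)

# Toy usage: upsample a low-resolution feature map first (the claimed
# feature conversion), then produce the segmented text result.
low_res = torch.randn(1, 32, 16, 16)
visual = F.interpolate(low_res, scale_factor=4, mode="bilinear",
                       align_corners=False)               # (1, 32, 64, 64)
decoded = torch.randn(1, 5, 32)                           # 5 hypothetical queries
result = segment_text(decoded, visual, nn.Linear(32, 32))
```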
CN202411884460.5A 2024-12-20 2024-12-20 Image text segmentation method, device, electronic device and readable storage medium Active CN119339387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411884460.5A CN119339387B (en) 2024-12-20 2024-12-20 Image text segmentation method, device, electronic device and readable storage medium

Publications (2)

Publication Number Publication Date
CN119339387A (en) 2025-01-21
CN119339387B (en) 2025-04-18

Family

ID=94271664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411884460.5A Active CN119339387B (en) 2024-12-20 2024-12-20 Image text segmentation method, device, electronic device and readable storage medium

Country Status (1)

Country Link
CN (1) CN119339387B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912837A * 2023-06-29 2023-10-20 Shandong University A detail- and boundary-driven reference target image segmentation method and system
CN116978030A * 2023-07-05 2023-10-31 Tencent Technology (Shenzhen) Co., Ltd. Text information recognition method and training method of text information recognition model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119152564A * 2024-09-05 2024-12-17 Beijing Zhongke Linghong Technology Co., Ltd. Iris segmentation model training method, iris segmentation system and medium
CN119049058B * 2024-09-06 2025-07-25 Guangdong University of Petrochemical Technology Reference image segmentation method and system based on multi-level feature fusion


Similar Documents

Publication Publication Date Title
JP4806230B2 (en) Deterioration dictionary generation program, method and apparatus
CN111914654B (en) Text layout analysis method, device, equipment and medium
CN112070649B (en) Method and system for removing specific character string watermark
CN112733857B (en) Image character detection model training method and device for automatically segmenting character area
CN111899292A (en) Character recognition method and device, electronic equipment and storage medium
CN115376118B (en) A street view text recognition method, system, device and medium
EP4632664A1 (en) Image processing method and apparatus, and device and medium
CN111932577B (en) Text detection method, electronic device and computer readable medium
CN115909378A (en) Training method of receipt text detection model and receipt text detection method
CN117576699A (en) Locomotive work order information intelligent recognition method and system based on deep learning
CN112801960A (en) Image processing method and device, storage medium and electronic equipment
CN113496223B (en) Method and device for establishing text region detection model
CN118941675A (en) Image generation method, electronic device, storage medium, and computer program product
CN118968142A (en) A method, device, terminal equipment and storage medium for extracting photovoltaic areas based on roofs
CN119339387B (en) Image text segmentation method, device, electronic device and readable storage medium
CN110378167B (en) Bar code image correction method based on deep learning
Rani et al. Object Detection in Natural Scene Images Using Thresholding Techniques
CN118297790A (en) Training method and device for font image generation model, electronic equipment and medium
US6983071B2 (en) Character segmentation device, character segmentation method used thereby, and program therefor
CN116798041A (en) Image recognition method and device and electronic equipment
CN115439850A (en) Image-text character recognition method, device, equipment and storage medium based on examination sheet
CN116824129A (en) Portrait matting method, device, equipment and storage medium
CN111340137A (en) Image recognition method, device and storage medium
CN113591831B (en) Font identification method, system and storage medium based on deep learning
CN120162727B (en) Unsupervised anomaly detection method, system and readable storage medium based on diffusion model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250806

Address after: No. 30, First Floor, Building 16, No. 51 Qingci Road, Huayuan Street, Xinjin District, Chengdu City, Sichuan Province, 611400

Patentee after: Chengdu Longhu Longzhi Manufacturing Engineering Construction Management Co., Ltd.

Country or region after: China

Address before: Cable Information Transmission Building 25F-2504, No. 3369 Binhai Avenue, Haizhu Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, 518054

Patentee before: Shenzhen Xumi Yuntu Space Technology Co., Ltd.

Country or region before: China