CN119339387B

CN119339387B - Image text segmentation method, device, electronic device and readable storage medium

Info

Publication number: CN119339387B
Application number: CN202411884460.5A
Authority: CN
Inventors: 段杰; 石雅洁
Original assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Current assignee: Chengdu Longhu Longzhi Manufacturing Engineering Construction Management Co ltd
Priority date: 2024-12-20
Filing date: 2024-12-20
Publication date: 2025-04-18
Anticipated expiration: 2044-12-20
Also published as: CN119339387A

Abstract

The invention relates to the technical field of artificial intelligence, and provides an image text segmentation method, an image text segmentation device, electronic equipment and a readable storage medium. The method comprises the steps of obtaining an image to be segmented, carrying out feature extraction on the image to be segmented to obtain character edge features, character skeleton features and multi-stage visual features, carrying out feature screening on the character edge features and the character skeleton features by utilizing the multi-stage visual features to obtain target edge features and target skeleton features, carrying out feature conversion on the multi-stage visual features to obtain target visual features, obtaining target inquiry for guiding image text segmentation, carrying out fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target inquiry to obtain target decoding results, and determining text segmentation results according to the target decoding results and the target visual features. The invention can more accurately segment the text from the image and improve the accuracy of text segmentation.

Description

Image text segmentation method and device, electronic equipment and readable storage medium

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular, to an image text segmentation method, an image text segmentation device, an electronic device, and a readable storage medium.

Background

The core goal of image text segmentation techniques is to accurately extract text portions from an image, thereby providing a clearer, more accurate basis for subsequent text recognition and understanding. Specifically, image text segmentation techniques identify which regions contain text information by analyzing pixels and features in an image, and current techniques have been able to utilize text-related supervisory information to some extent to improve the accuracy of text segmentation and to separate these regions from the background. However, the existing method has the problem of inaccurate text segmentation.

Disclosure of Invention

In view of the above, the embodiments of the present invention provide a method, an apparatus, an electronic device, and a readable storage medium for segmenting an image text, so as to solve the problem of inaccurate text segmentation in the prior art.

In a first aspect of an embodiment of the present invention, there is provided an image text segmentation method, including:

The method comprises the steps of obtaining an image to be segmented, carrying out feature extraction on the image to be segmented to obtain character edge features, character skeleton features and multi-stage visual features, carrying out feature screening on the character edge features and the character skeleton features by utilizing the multi-stage visual features to obtain target edge features and target skeleton features, carrying out feature conversion on the multi-stage visual features to obtain target visual features, obtaining target inquiry for guiding image text segmentation, carrying out fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target inquiry to obtain target decoding results, and determining text segmentation results according to the target decoding results and the target visual features.

In a second aspect of an embodiment of the present invention, there is provided an image text segmentation apparatus, including:

The device comprises a feature extraction module, a feature filtering module, a feature conversion module, a fusion decoding module and a segmentation module, wherein the feature extraction module is used for obtaining an image to be segmented, carrying out feature extraction on the image to be segmented to obtain character edge features, character skeleton features and multi-stage visual features, the feature filtering module is used for carrying out feature filtering on the character edge features and the character skeleton features by utilizing the multi-stage visual features to obtain target edge features and target skeleton features, the feature conversion module is used for carrying out feature conversion on the multi-stage visual features to obtain target visual features, the fusion decoding module is used for obtaining target inquiry for guiding image text segmentation, carrying out fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target inquiry to obtain target decoding results, and the segmentation module is used for determining text segmentation results according to the target decoding results and the target visual features.

In a third aspect of the embodiments of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.

In a fourth aspect of the embodiments of the present invention, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.

Compared with the prior art, the embodiment of the invention has the beneficial effects that:

The method comprises the steps of obtaining an image to be segmented, carrying out feature extraction on the image to be segmented to obtain text edge features, text skeleton features and multi-stage visual features, extracting the text edge features and the text skeleton features, leading in text edge perception and text skeleton perception in text segmentation, carrying out feature screening on the text edge features and the text skeleton features by utilizing the multi-stage visual features to enhance the text edge perception and the text skeleton perception, enabling the obtained target edge features and the obtained target skeleton features to reflect text information in the image, focusing on edges and skeletons of texts in the segmentation process to accurately identify and segment text areas, further carrying out feature conversion on the multi-stage visual features to obtain target visual features, obtaining target query for guiding the text segmentation of the image, carrying out fusion decoding on the target edge features, the target skeleton features and the target visual features according to the target query, determining text segmentation results according to the target decoding results and the target visual features, and carrying out text segmentation from the image more accurately by utilizing fusion decoding results containing a plurality of guide features, and improving the accuracy of text segmentation.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flow chart of an image text segmentation method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an image text segmentation apparatus according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

In the field of image text processing, an image text segmentation task is in a very critical position, and a core requirement of the image text segmentation task is to accurately segment text areas from an image. This operation is an important cornerstone for efficient text editing and complete text removal to follow, directly related to the quality and efficiency of the entire text processing flow.

The existing text segmentation technology is continuously developed and evolved, and the existing method improves the performance of the method to a certain extent by means of various supervision signals closely related to texts. However, these techniques generally have a distinct short panel, i.e., often fail to fully emphasize text edges and decisive role that text skeletons play in the segmentation process. The text edge can outline the outline boundary of the text region and provide key clues for accurately defining the text range, and the text skeleton reflects the morphological characteristics of the text from the internal structure level, which are both indispensable for realizing high-precision text segmentation. For example, conventional edge detection algorithms, while able to distinguish text edges well, fail to distinguish text regions from non-text regions, resulting in edges detected in non-text regions that may interfere with the performance of the text segmentation model. In addition, the existing text segmentation method still has certain limitations when facing segmentation of text edge regions, and particularly, the obtained result is worse under the condition of blurred text edges or complex background. Based on the above, the invention provides an image text segmentation method which can more accurately segment the text from the image and improve the accuracy of text segmentation.

An image text segmentation method, an apparatus, an electronic device, and a readable storage medium according to embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Fig. 1 is a flow chart of an image text segmentation method according to an embodiment of the present invention. As shown in fig. 1, the method includes:

S101, acquiring an image to be segmented, and extracting features of the image to be segmented to obtain character edge features, character skeleton features and multi-level visual features;

s102, performing feature screening on character edge features and character skeleton features by utilizing multi-level visual features to obtain target edge features and target skeleton features;

s103, performing feature conversion on the multi-level visual features to obtain target visual features;

S104, acquiring target inquiry for guiding image text segmentation, and carrying out fusion decoding on target edge characteristics, target skeleton characteristics and target visual characteristics according to the target inquiry to obtain a target decoding result;

s105, determining a text segmentation result according to the target decoding result and the target visual characteristics.

Specifically, the image text segmentation method of the present embodiment may be executed by a client or a server, or may be executed by both the client and the server, and the following description will be given by taking an execution subject as an example. The image to be segmented is an original image input when an image text segmentation task is performed, after the image to be segmented is obtained, the image to be segmented can be preprocessed, for example, smoothing technologies such as Gaussian filtering, median filtering, bilateral filtering, mean filtering and the like are adopted for preprocessing the image to be segmented, so that noise is reduced. Further, feature extraction is performed on the preprocessed image to be segmented, specifically, firstly, an edge detection algorithm such as a Canny edge detection algorithm, a Prewitt edge detection algorithm or a Roberts edge detection algorithm can be used for performing feature extraction on the image to be segmented to obtain character edge features, secondly, a character skeleton feature can be obtained by performing feature extraction in a character skeleton extraction algorithm such as a hildrich algorithm or a Zhang-Suen algorithm, and then, a visual encoder can be used for performing visual feature extraction on the image to be segmented, and the visual encoder can use a depth learning model such as ResNet as a backbone network of the visual encoder to extract multi-stage visual features in the image to be segmented.

It can be appreciated that text edge features can clearly delineate the boundary between text and background. In the image, the edges of the text include contour information of the text, such as the start and stop positions, the degree of bending, and the like of the strokes. This is critical to accurately locating the text in the image. By extracting the character edge features, the subsequent segmentation algorithm can be helped to better separate the characters from complex backgrounds or other interfering elements. The character skeleton feature reflects the internal structure of the character. It can be regarded as the "central axis" of a text stroke, a simplified and abstract representation of the shape of the text. Extracting skeleton features of text is helpful for understanding basic shapes and structures of text, such as in handwriting recognition or text segmentation with large font deformation, and skeleton features can provide connectivity and topological structure information of text strokes. The multi-level visual features then contain information extracted from the different levels and angles of the image. It may include low-level pixel-level features such as basic visual information of color, brightness, etc., as well as high-level semantic features such as more abstract information of texture, shape combinations, etc.

In some examples, the characteristic screening is performed on the character edge characteristic and the character skeleton characteristic by using the multi-level visual characteristic to obtain the target edge characteristic and the target skeleton characteristic, specifically, by using the multi-level visual characteristic to screen the target edge characteristic, false edges possibly caused by noise or non-character factors (such as a decoration line in an image, a false edge generated by image compression and the like) in the character edge characteristic can be removed, more reliable boundary information can be provided for character segmentation, and the precision of character segmentation is improved. For the target skeleton feature, the screening process can eliminate false skeleton branches or redundant information which do not accord with the actual structure of the characters, the target skeleton feature can better reflect the actual internal structure of the characters, and more accurate shape guidance is provided for subsequent character segmentation, so that the quality and accuracy of character segmentation are improved.

In some examples, the multi-level visual features are subjected to feature conversion to obtain target visual features, the multi-level visual features represent feature information of different layers of the image to be segmented, the features may have differences in format, dimension and the like, and the features of different formats and dimensions can be unified into a proper form through feature conversion, so that the text can be predicted conveniently.

In some examples, the target query for guiding the image text segmentation is obtained, and the target edge feature, the target skeleton feature and the target visual feature are fused and decoded according to the target query to obtain a target decoding result, specifically, the target query is a learnable query which can learn common positions of texts in different scenes, typical features of the texts and association relations between the texts and other elements of the image, and through the learning process, the learnable query is just like having a 'knowledge base' for image text segmentation, so that an accurate guiding direction can be provided for subsequent segmentation operation. Further, under the guidance of the target query, the target edge features, the target skeleton features and the target visual features are fused and decoded. The target inquiry is just like a commander, and decides how to fuse the boundary information of the target edge characteristics, the structural information of the target skeleton characteristics and the comprehensive description information of the target visual characteristics according to the learned text rule, so that the finally obtained target decoding result can embody the real condition and the accurate position of characters in the image to the greatest extent, and lays a foundation for further obtaining an accurate text segmentation result. Finally, according to the target decoding result and the target visual feature, a more accurate text segmentation result is obtained by using the target decoding result and the target visual feature segmentation, and how to obtain the text segmentation result according to the target decoding result and the target visual feature will be described in detail in the following embodiments, which will not be repeated here.

According to the technical scheme provided by the embodiment of the invention, an image to be segmented is obtained, characteristic extraction is carried out on the image to be segmented to obtain text edge characteristics, text skeleton characteristics and multi-level visual characteristics, text edge perception and text skeleton perception can be introduced in text segmentation by extracting the text edge characteristics and the text skeleton characteristics, characteristic screening is carried out on the text edge characteristics and the text skeleton characteristics respectively by utilizing the multi-level visual characteristics to enhance the text edge perception and the text skeleton perception, the obtained target edge characteristics and the target skeleton characteristics can reflect text information in the image, meanwhile, the edge and skeleton of a text are focused in the segmentation process so as to accurately identify and segment text areas, characteristic conversion is further carried out on the multi-level visual characteristics to obtain target visual characteristics, target inquiry for guiding the text segmentation of the image is obtained, fusion decoding is carried out on the target edge characteristics, the target skeleton characteristics and the target visual characteristics according to target inquiry, and a target decoding result is obtained, and the text segmentation result is determined according to the target decoding result and the target visual characteristics, so that text segmentation is carried out from the image more accurately by utilizing the fusion decoding result containing a plurality of guiding characteristics, and the text segmentation accuracy is improved.

In some embodiments, the method for obtaining the target edge features and the target skeleton features by utilizing the multi-level visual features to conduct feature screening on the text edge features and the text skeleton features respectively comprises the steps of splicing the multi-level visual features, taking a target convolution network as a text detection head, inputting the spliced multi-level visual features into the target convolution network, determining text areas of the multi-level visual features by utilizing the target convolution network to obtain text area detection frames, and screening the text edge features and the text skeleton features according to the text area detection frames to obtain the target edge features and the target skeleton features.

Specifically, in the process of performing feature screening on character edge features and character skeleton features by using multi-level visual features, the multi-level visual features refer to features extracted at different levels, and include object and scene information from low-level edges and textures to high-level edges and textures, so that the multi-level visual features are spliced and can be fused into a unified feature representation. This feature representation is then input to the target convolutional network, which in this embodiment preferably employs a 1x1 convolutional layer. A 1x1 convolutional layer is a special convolutional layer whose convolution kernel is 1x1 in size so that it does not change the spatial dimensions of the input features. However, the number of the characteristic channels can be changed by adjusting the number of the convolution kernels, so that the dimension reduction or dimension increase of the characteristic is realized.

It can be understood that the convolution layer of 1x1 in this embodiment is used as a text detection head, and the spliced features are further processed and transformed to predict the text region, so as to obtain a text region detection box (denoted as Mask M). Mask M is a binary matrix of the same size as the input image, where pixels with a value of 1 represent text regions and pixels with a value of 0 represent non-text regions. The 1x1 convolution layer can accurately predict Mask M, so that the task of text detection is realized.

It can be appreciated that text edge features are previously extracted features that delineate the boundary between text and background. However, during feature extraction, there may be some noise or false edges caused by non-literal factors. Now with the text region detection box, only the portion within the detection box can be focused on. The text region detection frame is used for screening the character edge characteristics, so that the interference edges irrelevant to the text can be removed, only the effective edge characteristics truly belonging to the characters in the detection frame are reserved, and the target edge characteristics are obtained, which more accurately represent the real boundaries of the characters in the text region. Similarly, the text region detection box also has the same principle of screening the character skeleton characteristics.

Therefore, the multi-level visual features are utilized to effectively screen the character edge features and the character skeleton features, more accurate target edge features and target skeleton features are obtained, and a more reliable basis is provided for subsequent tasks such as image text segmentation.

In some embodiments, the text edge feature and the text skeleton feature are respectively screened according to the text region detection frame to obtain a target edge feature and a target skeleton feature, and the text edge feature and the text region detection frame are subjected to element-by-element multiplication to obtain the target edge feature, and the text skeleton feature and the text region detection frame are subjected to element-by-element multiplication to obtain the target skeleton feature.

Specifically, since the text region detection box Mask M is a matrix corresponding to the image size, the element values in the matrix are generally in two cases, that is, in a position within the text region, the element value is 1 (or other specific value indicating "belongs to the text region"), and in a position outside the text region, the element value is 0 (or specific value indicating "does not belong to the text region"). It can be thought of as a "mask" that accurately overlays the text region of the image, distinguishing the text region from the non-text region.

When the text edge feature is multiplied element by element with the text region detection box Mask M, it is actually the "screening" effect of Mask M that is utilized. And for each element in the character edge feature, carrying out multiplication operation with the element at the corresponding position in the Mask M. Since Mask M has a value of 1 (or corresponding representation value) in the text region and a value of 0 (or corresponding representation value) outside the text region, then after multiplication, the text edge feature elements in the text region remain unchanged because of multiplication with elements with a value of 1 (or corresponding representation value) in Mask M, and these remaining elements constitute valid edge features that truly belong to the text in the text region. The text edge feature elements outside the text region, which result from multiplication with elements with a value of 0 (or corresponding representation value) in Mask M, are "filtered" out of these potentially disturbing edge features outside the text region. After the element-by-element multiplication operation, the obtained result is the target edge feature. Compared with the original text edge feature, the target edge feature more accurately only contains the effective edge feature of the text in the text region, and removes false edge features which are possibly caused by background, noise and other factors and are located outside the text region, so that more accurate text boundary information is provided for subsequent text segmentation and other tasks.

It can be understood that the operation and principle of multiplying the text skeleton feature by the text region detection box element by element are similar to the operation of obtaining the target skeleton feature, and are not repeated here.

In some embodiments, fusion decoding is performed on target edge features, target skeleton features and target visual features according to target query to obtain a target decoding result, wherein the target query is used as a first parameter, the target edge features, the target skeleton features and the target visual features are used as second parameters, the first parameters and the second parameters are input to a target decoder, the target decoder comprises a plurality of continuously arranged decoding submodules, fusion decoding is performed on the second parameters by using a first decoding submodule according to the first parameters to obtain a first decoding result, the first decoding result is used as a new first parameter, fusion decoding is performed on the second parameters by using a second decoding submodule according to the new first parameter to obtain a second decoding result, and the fusion decoding operation of each decoding submodule is repeatedly performed until the target decoding result output by the last decoding submodule is obtained.

Specifically, the first parameter (target query) and the second parameter (target edge feature, target skeleton feature, target visual feature) are input together to the target decoder. In this embodiment, the target decoder preferably adopts a decoder with a transducer structure, and the decoder with the transducer structure includes a plurality of transducer decoding submodules arranged in succession, because the self-attention mechanism in the transducer decoder can effectively process the input long-sequence data, and when processing a complex second parameter sequence including the target edge feature, the target skeleton feature, the target visual feature and the like, the self-attention mechanism can capture the relationship between different positions in the sequence, and secondly, can automatically assign different weights to different features in the second parameter according to the input first parameter. For example, when the target query focuses more on the skeletal structure of the target, the transducer decoder may assign higher weights to the target skeletal features by a self-attention mechanism, thereby fusing these features more effectively to generate the decoding result. This ability to dynamically adjust feature weights according to task requirements makes the Transformer decoder more flexible and efficient when fusing multiple types of features (e.g., edges, skeletons, and visual features).

Further, according to the first parameter, the first decoding submodule is utilized to perform fusion decoding on the second parameter to obtain a first decoding result, namely the first parameter (target query) and the second parameter are input into the first decoding submodule together to perform fusion decoding, and the first decoding result is obtained. And then the first decoding result is used as a new first parameter, the new first parameter and the new second parameter are input into the second decoding submodule together for fusion decoding, and the fusion decoding operation of each decoding submodule is repeatedly executed until the preset fusion decoding step number is reached, so that the final target fusion result is obtained.

Thus, the final target decoding result is the final result which is obtained by sequentially processing and continuously fusing information through all decoding submodules in the whole target decoder and can meet the related requirements of the target query which is originally set.

In some embodiments, each decoding submodule comprises a self-attention layer, a pixel-aware attention layer, an edge-aware attention layer, a skeleton-aware attention layer and a feed-forward network layer, and the fusion decoding operation of each decoding submodule comprises the steps of processing a first parameter by the self-attention layer, fusing a target visual characteristic with the processed first parameter by the pixel-aware attention layer to obtain a first fusion characteristic, fusing the target edge characteristic with the first fusion characteristic by the edge-aware attention layer to obtain a second fusion characteristic, fusing the target skeleton characteristic with the second fusion characteristic by the skeleton-aware attention layer to obtain a third fusion result, and performing nonlinear transformation on the third fusion result by the feed-forward network layer to obtain a decoding result corresponding to each decoding submodule.

In particular, the self-attention layer processes the first parameter using a self-attention mechanism. The self-attention mechanism can capture the relation between different parts in the first parameter, different weights are automatically distributed according to the correlation of the parts, the processed result of the self-attention layer is the processed first parameter, the pixel perception attention layer is focused on fusing the target visual characteristics and the first parameter processed by the self-attention layer, and the first fusion characteristics are obtained through the fusion operation. The feature contains key information about the target query carried by the first parameter after the self-attention layer processing, and integrates the detail of the visual aspect provided by the target visual feature, so that the subsequent processing can be performed based on a more comprehensive and integrated visual information basis.

Further, the edge-aware attention layer then fuses the target edge feature with the first fused feature obtained previously, and through this fusion, a second fused feature is generated. The features at this time further integrate the contour information of the edge features of the target and the content contained in the previous first fusion feature, so that the features are richer and more comprehensive, and the accurate understanding and processing of the target are facilitated.

Further, the skeleton perception attention layer is responsible for fusing the target skeleton feature and the second fusion feature, and a third fusion result is obtained after the fusion is completed. The result is that the first parameter is processed in multiple steps, and a plurality of characteristics such as target vision, edges, frameworks and the like are fused in sequence, so that the result is a result of comprehensive information in multiple aspects, and a complete information basis is provided for a final decoding result.

Further, the feedforward network layer performs nonlinear transformation on the third fusion result obtained in the previous step. The feed forward network layer typically comprises a plurality of neuron layers, and the nonlinear transformation is implemented by means of an activation function or the like. The third fusion result can be further processed to mine hidden relationships and features therein and convert it to a form more suitable as a decoding result. And finally obtaining a decoding result corresponding to each decoding submodule through nonlinear transformation of the feedforward network layer.

In some embodiments, determining the text segmentation result according to the target decoding result and the target visual feature comprises performing pixel prediction on the target decoding result to obtain a prediction classification result of the pixel, and multiplying the prediction classification result of the pixel with the target visual feature element by element to obtain a segmented text result.

Specifically, in the process of determining the text segmentation result, pixel prediction is performed on the target decoding result, and the probability that each pixel belongs to the text or the background is predicted. Specifically, the target decoding result may be input into a Mask header of a multi-layer perceptron structure, the classification result of the pixel point is predicted, and then the predicted classification result and the target visual feature are subjected to element-by-element multiplication operation, so as to strengthen the feature of the text region and inhibit the feature of the background region. In this way, a more accurate text segmentation result can be obtained. Finally, according to the prediction classification result of the pixel points, the outline of the text region can be constructed, so that the text in the image is effectively segmented. Thus, not only can the accuracy of text segmentation be improved, but also better robustness can be shown when processing texts with complex backgrounds and different font sizes.

In some embodiments, feature conversion is performed on the multi-level visual features to obtain target visual features, including upsampling the multi-level visual features to convert low resolution visual features in the multi-level visual features to high resolution visual features, and taking the upsampled multi-level visual features as the target visual features.

Specifically, in the process of feature conversion, multi-level visual features are first up-sampled to enhance the resolution of the feature map so that details that might otherwise be ignored at lower resolutions can be captured and utilized. Upsampling may be achieved by a variety of methods, such as bilinear interpolation, bicubic interpolation, or by transposed convolutional layers in convolutional neural networks. By upsampling, the low resolution visual features are converted to high resolution visual features while gradually restoring the multi-level visual features to the size of the image to be segmented, which facilitates the subsequent pixel prediction process, achieving accurate pixel level prediction, as the high resolution feature map can provide more rich spatial information.

It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution sequence, and the execution sequence of each process should be determined by the function and the internal logic of each process, and should not be construed as limiting the process in the embodiment of the present invention.

Any combination of the above optional solutions may be adopted to form an optional embodiment of the present invention, which is not described herein.

The following are examples of the apparatus of the present invention that may be used to perform the method embodiments of the present invention. For details not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the method of the present invention.

Fig. 2 is a schematic structural diagram of an image text segmentation apparatus according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes:

The feature extraction module 201 is configured to obtain an image to be segmented, and perform feature extraction on the image to be segmented to obtain character edge features, character skeleton features and multi-level visual features;

The feature screening module 202 is configured to perform feature screening on the character edge feature and the character skeleton feature by using the multi-level visual features to obtain a target edge feature and a target skeleton feature;

the feature conversion module 203 is configured to perform feature conversion on the multi-level visual features to obtain target visual features;

The fusion decoding module 204 is configured to obtain a target query for guiding image text segmentation, and fusion decode the target edge feature, the target skeleton feature and the target visual feature according to the target query to obtain a target decoding result;

the segmentation module 205 is configured to determine a text segmentation result based on the target decoding result and the target visual features.

In some embodiments, the feature screening module 202 is further configured to splice the multi-level visual features and use the target convolutional network as a text detection head, input the spliced multi-level visual features to the target convolutional network, determine text regions of the multi-level visual features by using the target convolutional network to obtain a text region detection frame, and screen the text edge features and the text skeleton features according to the text region detection frame to obtain target edge features and target skeleton features.

In some embodiments, the feature screening module 202 is further configured to multiply the text edge feature element by element with the text region detection box to obtain the target edge feature, and multiply the text skeleton feature element by element with the text region detection box to obtain the target skeleton feature.

In some embodiments, the fusion decoding module 204 is further configured to use the target query as a first parameter and use the target edge feature, the target skeleton feature and the target visual feature as a second parameter, input the first parameter and the second parameter to a target decoder, where the target decoder includes a plurality of continuously set decoding submodules, perform fusion decoding on the second parameter by using the first decoding submodule according to the first parameter to obtain a first decoding result, use the first decoding result as a new first parameter, and perform fusion decoding on the second parameter by using the second decoding submodule according to the new first parameter to obtain a second decoding result, and repeatedly perform the fusion decoding operation of each decoding submodule until the target decoding result output by the last decoding submodule is obtained.

In some embodiments, the fusion decoding module 204 is further configured to process the first parameter by using the self-attention layer, fuse the target visual feature with the processed first parameter by using the pixel-aware attention layer to obtain a first fused feature, fuse the target edge feature with the first fused feature by using the edge-aware attention layer to obtain a second fused feature, fuse the target skeleton feature with the second fused feature by using the skeleton-aware attention layer to obtain a third fused result, and perform nonlinear transformation on the third fused result by using the feedforward network layer to obtain a decoding result corresponding to each decoding submodule.

In some embodiments, the segmentation module 205 is further configured to predict the pixel points of the target decoding result to obtain a prediction classification result of the pixel points, and multiply the prediction classification result of the pixel points with the target visual features element by element to obtain a segmented text result.

In some embodiments, the feature conversion module 203 is further configured to upsample the multi-level visual features to convert low resolution visual features in the multi-level visual features to high resolution visual features, and to take the upsampled multi-level visual features as target visual features.

According to the device provided by the embodiment of the invention, an image to be segmented is obtained, characteristic extraction is carried out on the image to be segmented to obtain text edge characteristics, text skeleton characteristics and multi-level visual characteristics, text edge perception and text skeleton perception can be introduced in text segmentation by extracting the text edge characteristics and the text skeleton characteristics, characteristic screening is carried out on the text edge characteristics and the text skeleton characteristics respectively by utilizing the multi-level visual characteristics to enhance the text edge perception and the text skeleton perception, the obtained target edge characteristics and the target skeleton characteristics can reflect text information in the image, meanwhile, the edge and skeleton of a text are focused in the segmentation process so as to accurately identify and segment text areas, characteristic conversion is further carried out on the multi-level visual characteristics to obtain target visual characteristics, target inquiry for guiding the text segmentation of the image is obtained, fusion decoding is carried out on the target edge characteristics, the target skeleton characteristics and the target visual characteristics according to target inquiry to obtain target decoding results, and the text segmentation results are determined according to the target decoding results, and the fusion decoding results comprising a plurality of guiding characteristics are used for accurately carrying out text segmentation from the image, and the text segmentation accuracy is improved.

Fig. 3 is a schematic diagram of an electronic device 3 according to an embodiment of the present invention. As shown in fig. 3, the electronic device 3 of this embodiment comprises a processor 301, a memory 302 and a computer program 303 stored in the memory 302 and executable on the processor 301. The steps of the various method embodiments described above are implemented when the processor 301 executes the computer program 303. Or the processor 301 when executing the computer program 303 performs the functions of the modules/units in the above-described device embodiments.

The electronic device 3 may be an electronic device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The electronic device 3 may include, but is not limited to, a processor 301 and a memory 302. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the electronic device 3 and is not limiting of the electronic device 3 and may include more or fewer components than shown, or different components.

The Processor 301 may be a central processing unit (Central Processing Unit, CPU) or other general purpose Processor, digital signal Processor (DIGITAL SIGNAL Processor, DSP), application SPECIFIC INTEGRATED Circuit (ASIC), field-Programmable gate array (Field-Programmable GATE ARRAY, FPGA) or other Programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.

The memory 302 may be an internal storage unit of the electronic device 3, for example, a hard disk or a memory of the electronic device 3. The memory 302 may also be an external storage device of the electronic device 3, for example, a plug-in hard disk provided on the electronic device 3, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD), or the like. The memory 302 may also include both internal storage units and external storage devices of the electronic device 3. The memory 302 is used to store computer programs and other programs and data required by the electronic device.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.

The integrated modules/units may be stored in a readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, and the computer program may be stored in a readable storage medium, where the computer program may implement the steps of the method embodiments described above when executed by a processor. The computer program may comprise computer program code, which may be in source code form, object code form, executable file or in some intermediate form, etc. The readable storage medium may include any entity or device that can carry computer program code, recording media, USB flash disk, removable hard disk, magnetic disk, optical disk, computer Memory, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), electrical carrier signals, telecommunications signals, and software distribution media, among others.

The foregoing embodiments are merely for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or substituted for some of the technical features thereof, and that these modifications or substitutions should not depart from the spirit and scope of the technical solution of the embodiments of the present invention and should be included in the protection scope of the present invention.

Claims

1. A method for image text segmentation, comprising:

Acquire an image to be segmented, perform feature extraction on the image to be segmented, and obtain text edge features, text skeleton features, and multi-level visual features;

Using the multi-level visual features, feature screening is performed on the text edge features and the text skeleton features to obtain target edge features and target skeleton features;

Performing feature conversion on the multi-level visual features to obtain target visual features;

Obtaining a target query for guiding image text segmentation, and fusing and decoding the target edge features, the target skeleton features, and the target visual features according to the target query to obtain a target decoding result;

Determining a text segmentation result according to the target decoding result and the target visual feature;

Using the multi-level visual features to perform feature screening on the text edge features and the text skeleton features respectively to obtain target edge features and target skeleton features, including:

Concatenating the multi-level visual features and using the target convolutional network as a text detection head;

Input the spliced multi-level visual features into the target convolutional network, use the target convolutional network to determine the text area of the multi-level visual features, and obtain a text area detection frame; the target convolutional network is a 1*1 convolutional layer;

Multiplying the text edge feature by the text area detection frame element by element to obtain the target edge feature;

Multiplying the character skeleton feature by the text area detection frame element by element to obtain the target skeleton feature;

According to the target query, the target edge feature, the target skeleton feature and the target visual feature are fused and decoded to obtain a target decoding result, including:

Using the target query as a first parameter, and using the target edge feature, the target skeleton feature, and the target visual feature as a second parameter;

Inputting the first parameter and the second parameter into a target decoder, wherein the target decoder comprises a plurality of decoding submodules arranged in series;

According to the first parameter, using a first decoding submodule to perform fusion decoding on the second parameter to obtain a first decoding result;

Using the first decoding result as a new first parameter, and using a second decoding submodule to perform fusion decoding on the second parameter according to the new first parameter to obtain a second decoding result;

Repeat the fusion decoding operation of each decoding submodule until the target decoding result output by the last decoding submodule is obtained;

The step of performing feature conversion on the multi-level visual features to obtain target visual features includes:

Upsampling the multi-level visual features to convert low-resolution visual features in the multi-level visual features into high-resolution visual features;

The upsampled multi-level visual features are used as the target visual features.

2. The method according to claim 1, characterized in that each of the decoding submodules includes a self-attention layer, a pixel-aware attention layer, an edge-aware attention layer, a skeleton-aware attention layer and a feedforward network layer, and the fusion decoding operation of each decoding submodule includes:

Using the self-attention layer to process the first parameter, and using the pixel-aware attention layer to fuse the target visual feature with the processed first parameter to obtain a first fused feature;

Using the edge-aware attention layer to fuse the target edge feature with the first fused feature to obtain a second fused feature;

Using the skeleton-aware attention layer to fuse the target skeleton feature with the second fusion feature to obtain a third fusion result;

The third fusion result is nonlinearly transformed by using the feedforward network layer to obtain a decoding result corresponding to each decoding submodule.

3. The method according to claim 1, characterized in that the determining of the text segmentation result according to the target decoding result and the target visual feature comprises:

Perform pixel prediction on the target decoding result to obtain a predicted classification result of the pixel;

The predicted classification result of the pixel point is multiplied element by element by the target visual feature to obtain the segmented text result.

4. An image text segmentation device, characterized by comprising:

A feature extraction module is configured to obtain an image to be segmented, perform feature extraction on the image to be segmented, and obtain text edge features, text skeleton features, and multi-level visual features;

The feature screening module is configured to use the multi-level visual features to perform feature screening on the text edge features and the text skeleton features respectively to obtain target edge features and target skeleton features; and use the multi-level visual features to perform feature screening on the text edge features and the text skeleton features respectively to obtain target edge features and target skeleton features, including:

The feature conversion module is configured to perform feature conversion on the multi-level visual features to obtain target visual features; performing feature conversion on the multi-level visual features to obtain target visual features, including: upsampling the multi-level visual features to convert low-resolution visual features in the multi-level visual features into high-resolution visual features; and using the upsampled multi-level visual features as the target visual features;

A fusion decoding module is configured to obtain a target query for guiding image text segmentation, and according to the target query, fuse and decode the target edge feature, the target skeleton feature and the target visual feature to obtain a target decoding result; according to the target query, fuse and decode the target edge feature, the target skeleton feature and the target visual feature to obtain a target decoding result, including: taking the target query as a first parameter, and taking the target edge feature, the target skeleton feature and the target visual feature as a second parameter; inputting the first parameter and the second parameter into a target decoder, the target decoder including a plurality of decoding submodules arranged in succession; according to the first parameter, using the first decoding submodule to fuse and decode the second parameter to obtain a first decoding result; using the first decoding result as a new first parameter, and according to the new first parameter, using the second decoding submodule to fuse and decode the second parameter to obtain a second decoding result; repeatedly performing the fusion decoding operation of each decoding submodule until the target decoding result output by the last decoding submodule is obtained;

The segmentation module is configured to determine a text segmentation result according to the target decoding result and the target visual feature.

5. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 3 when executing the computer program.

6. A readable storage medium storing a computer program, wherein the computer program implements the steps of the method according to any one of claims 1 to 3 when executed by a processor.