Disclosure of Invention
The application aims to provide a quality detection method and apparatus for text images, a computer-readable medium, and an electronic device, and at least alleviates, to some extent, technical problems in the related art such as inaccurate text line recognition.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a quality detection method of a text image, including:
detecting the character scale of the text image, the character scale of the text image being the average scale of the characters of the text image;
when the character scale of the text image is smaller than a first preset scale, enlarging the text image so that the character scale of the text image falls in a preset scale range, and when the character scale of the text image is larger than a second preset scale, reducing the text image so that the character scale of the text image falls in the preset scale range, the preset scale range being the scale range larger than the first preset scale and smaller than the second preset scale;
inputting the text image into a first neural network, and performing feature extraction and mapping processing on the text image through the first neural network to detect one or more text regions in the text image, wherein a sensing window is configured in the first neural network, the sensing window is in the preset scale range, and the sensing window is used to move over the text image to perform feature extraction on the text image;
performing quality score prediction on the text regions in the text image to obtain the quality score of each text region;
and acquiring a preset weight corresponding to each text region, and performing weighted summation on the quality scores of the text regions based on the preset weights to obtain the quality score of the text image.
According to an aspect of an embodiment of the present application, there is provided a quality detection apparatus of a text image, the quality detection apparatus including:
a character scale detection module configured to detect a character scale of the text image, the character scale of the text image being an average scale of characters of the text image;
a scaling module configured to enlarge the text image so that the character scale of the text image falls in a preset scale range when the character scale of the text image is smaller than a first preset scale, and to reduce the text image so that the character scale of the text image falls in the preset scale range when the character scale of the text image is larger than a second preset scale, the preset scale range being the scale range larger than the first preset scale and smaller than the second preset scale;
a text region detection module configured to input the text image into a first neural network and perform feature extraction and mapping processing on the text image through the first neural network to detect one or more text regions in the text image, wherein a sensing window is configured in the first neural network, the sensing window is in the preset scale range, and the sensing window is used to move over the text image to perform feature extraction on the text image;
a quality score prediction module configured to perform quality score prediction on the text regions in the text image to obtain the quality score of each text region;
and a weighted summation module configured to acquire a preset weight corresponding to each text region and perform weighted summation on the quality scores of the text regions based on the preset weights to obtain the quality score of the text image.
In some embodiments of the present application, based on the above technical solution, the weighted summation module includes:
a character count acquisition unit configured to acquire the number of characters in each text region and sum the numbers of characters of the text regions to obtain the total number of characters of the text image;
and a preset weight calculation unit configured to acquire, for each text region, the ratio of its number of characters to the total number of characters of the text image, and use the ratio as the preset weight corresponding to that text region.
In some embodiments of the present application, based on the above technical solution, the text region is a text line, a text line being a continuous image region in the text image containing one or more serially arranged characters, and the quality score prediction module includes:
a length-to-height ratio detection unit configured to detect text lines in the text image and the length-to-height ratio of each text line, the length of a text line being the length of the extension line along which the characters in the text line are arranged, and the height of the text line being its height perpendicular to the extension line;
a text line segmentation unit configured to segment a text line whose length-to-height ratio is greater than a preset value into a plurality of text lines whose length-to-height ratios are less than or equal to the preset value;
and a quality score prediction unit configured to perform feature extraction and quality score prediction separately on the text lines whose length-to-height ratios are less than or equal to the preset value, to obtain the quality scores of those text lines.
In some embodiments of the present application, based on the above technical solution, the text line segmentation unit includes:
a text line projection subunit configured to project a text line whose length-to-height ratio is greater than the preset value onto the length direction of the text line to form a one-dimensional set of projection points, wherein the projections of the pixels at the positions of the characters onto the length direction of the text line form real-point projections, the projections of the pixels at positions other than those of the characters form virtual-point projections, and the real-point projections and the virtual-point projections together form the set of projection points;
and a text line segmentation subunit configured to take segmentation points on the line segments into which the virtual-point projections aggregate and segment the text line at those segmentation points, so as to segment the text line whose length-to-height ratio is greater than the preset value into a plurality of text lines whose length-to-height ratios are less than or equal to the preset value.
In some embodiments of the present application, based on the above technical solution, the quality score prediction module may include:
an input unit configured to input a text region in the text image into a second neural network;
a feature extraction unit configured to perform feature extraction on the text region through a convolution layer of the second neural network to obtain plane features;
a dimension reduction processing unit configured to perform dimension reduction on the plane features through a pooling layer of the second neural network to obtain a feature vector;
and a fully connected calculation unit configured to perform fully connected calculation on the feature vector through a fully connected layer of the second neural network to obtain the predicted quality score of the text region.
In some embodiments of the present application, based on the above technical solution, the quality score prediction module may further include:
a data set acquisition unit configured to acquire a text recognition data set including text images and the recognition accuracy of each text image, the recognition accuracy of a text image being the ratio of its number of recognizable characters to its actual number of characters, the number of recognizable characters being the number of characters in the text image that a character recognition model can correctly recognize, and the actual number of characters being the number of characters the text image actually contains;
a data set labeling unit configured to perform equal-ratio conversion on the recognition accuracy of the text image to obtain the quality score of the text image, and to label the quality score of the text image into the text recognition data set;
and a neural network training unit configured to input the text recognition data set into the second neural network and train the second neural network.
In some embodiments of the present application, based on the above technical solution, the text region is a single word region, and the single word region is a continuous image region including one character in the text image.
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a quality detection method of a text image as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic device including a processor, and a memory for storing executable instructions of the processor, wherein the processor is configured to perform the quality detection method of a text image as in the above technical solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the quality detection method of a text image as in the above technical solution.
In the technical solution provided by the embodiments of the application, a text image whose character scale falls outside the preset scale range is enlarged or reduced so that its character scale falls within the preset scale range, and the sensing window of the first neural network is also within the preset scale range. The sensing window of the first neural network therefore matches the character scale of the text image more closely, which improves the accuracy of detecting text regions in the text image, improves the accuracy of the quality score predictions for the text regions and for the text image, and at the same time improves the robustness of the text image quality detection scheme of the embodiments of the application.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many different forms and should not be construed as limited to the examples set forth herein, but rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Computer Vision (CV) is the science of studying how to make a machine "see": more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition, tracking, and measurement on a target, and further performs graphics processing so that the computer produces an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies theories and techniques for building artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
As shown in fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. Terminal device 110 may include various electronic devices such as smart phones, tablet computers, notebook computers, desktop computers, and the like. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, and may be, for example, a wired communication link or a wireless communication link.
The system architecture in embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by the terminal device 110 and the server 130 together, which is not limited in particular.
For example, the server 130 may be equipped with the text image quality detection method of the embodiments of the present application, so that after a user uploads a text image through the terminal device 110, the server can carry out the method. The sensing window of the first neural network then matches the character scale of the text image more closely, which improves the accuracy of detecting text regions in the text image, improves the accuracy of the quality score predictions for the text regions and for the text image, and improves the robustness of the text image quality detection scheme of the embodiments of the application.
In addition, in the technical solution provided by the embodiments of the application, quality score prediction is performed on the text regions in the text image to obtain the quality score of each text region; the preset weight corresponding to each text region is then acquired, and the quality scores of the text regions are weighted and summed based on the preset weights to obtain the quality score of the text image. The problem of predicting the quality score of a text image is thus converted into predicting the quality scores of its text regions and obtaining the weight of each text region. Compared with predicting a quality score directly on the whole text image, predicting quality scores on text regions eliminates the influence of regions that contain no characters on the prediction result, so the quality score prediction is more accurate. At the same time, different text regions can carry different preset weights, and weighting the quality scores of the text regions by these preset weights further improves the robustness of the text image quality detection scheme of the embodiments of the application.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted device, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which the present application does not limit.
With the widespread use of smart devices in daily life, mobile-captured text images often have to be submitted in company business processes, so the number of text images is growing rapidly and intelligent document recognition is becoming increasingly important for business process automation. Intelligent document recognition is very sensitive to text image quality: unavoidable distortions during capture can lower the quality of a text image, reducing recognition accuracy and severely hampering subsequent business processes. For example, in an online insurance underwriting service, if low-quality document images submitted for a claim are not detected as soon as possible, immediate recapture is required; once the paper file is lost or the user does not cooperate, the image of the underwriting file can no longer be obtained, and key information may be lost from the business process. Since the quality of the text images uploaded by users varies, it is necessary to evaluate their quality in advance and reject text images of low quality.
The quality detection method provided by the application is described in detail below with reference to the specific embodiments.
Fig. 2 schematically illustrates a flow chart of the steps of a quality detection method of some embodiments of the present application. The execution subject of the quality detection method may be a terminal device, a server, or the like, which the present application does not limit. As shown in fig. 2, the quality detection method mainly comprises the following steps S210 to S250.
S210, detecting the character scale of the text image, wherein the character scale of the text image is the average scale of characters of the text image;
S220, when the character scale of the text image is smaller than a first preset scale, amplifying the text image to enable the character scale of the text image to be in a preset scale range, and when the character scale of the text image is larger than a second preset scale, reducing the text image to enable the character scale of the text image to be in the preset scale range, wherein the preset scale range is a scale range which is larger than the first preset scale and smaller than the second preset scale;
S230, inputting a text image into a first neural network, and performing feature extraction and mapping processing on the text image through the first neural network to detect one or more text areas in the text image, wherein a sensing window is configured in the first neural network, the sensing window is in a preset scale range and is used for moving on the text image to perform feature extraction on the text image, and the text areas are continuous image areas consisting of partial or all characters in the text image;
S240, performing quality score prediction on the text regions in the text image to obtain the quality score of each text region;
S250, obtaining preset weights corresponding to the text areas, and carrying out weighted summation processing on the quality scores of the text areas based on the preset weights to obtain the quality scores of the text images.
Here the text image is an image containing text, i.e., an image containing characters. The character scale of the text image is the average scale of the characters of the text image. Specifically, the scale may be a height, an area, or the like. In some embodiments, the average scale of the characters of the text image may be the average of the character heights of the characters in the text image; in other embodiments, it may be the average of the character areas of the characters in the text image. In a specific embodiment, an MSER (Maximally Stable Extremal Regions) detector may be employed as a single-character detector to detect the individual characters in the text image. The average height of all characters in the text image is then obtained, and the original image is adaptively scaled according to the calculated average height, as described in step S220, so that the character scale of the text image falls in the preset scale range and thus matches the scale of the sensing window of the first neural network, which improves the accuracy of the first neural network's text region detection. The sensing window is the matrix with which a convolution layer of the convolutional neural network convolves the data, i.e., the receptive field of the convolutional neural network.
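As an illustration only, the adaptive scaling described above might be sketched in Python with OpenCV's MSER detector as follows. The scale bounds FIRST_SCALE, SECOND_SCALE, and TARGET_SCALE are assumed values, since the application does not fix concrete thresholds, and the helper names are hypothetical.

```python
import cv2
import numpy as np

FIRST_SCALE, SECOND_SCALE = 16, 48          # assumed preset scale bounds (pixels)
TARGET_SCALE = (FIRST_SCALE + SECOND_SCALE) / 2

def average_character_height(gray):
    """Detect single characters with an MSER detector and return their mean height."""
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    heights = [cv2.boundingRect(pts)[3] for pts in regions]
    return float(np.mean(heights)) if heights else None

def rescale_to_preset_range(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    scale = average_character_height(gray)
    if scale is None:
        return image                        # no characters detected; leave unchanged
    if FIRST_SCALE < scale < SECOND_SCALE:
        return image                        # character scale already in the preset range
    factor = TARGET_SCALE / scale           # enlarges small text, shrinks large text
    return cv2.resize(image, None, fx=factor, fy=factor,
                      interpolation=cv2.INTER_LINEAR)
```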
In some embodiments, after the text image is input into the first neural network, a feature extraction unit of the first neural network performs feature extraction on the text image, and a mapping processing unit of the first neural network then performs mapping processing, so that one or more text regions in the text image can be detected. In a specific embodiment, the first neural network may be a PSE (Progressive Scale Expansion) segmentation network, i.e., PSENet. Multi-direction, multi-angle, and multi-scale detection of the text image can thus be realized: large-scale and small-scale characters in a text image whose character scales differ by several times can be detected and distinguished, as can text images whose characters are arranged obliquely, along curves, in reverse, and so on. Referring to fig. 3, fig. 3 schematically illustrates a comparison of the detection effects of an embodiment of the present application and the related art. As shown in fig. 3, text regions (text lines in this embodiment) are detected in the same text image using the related art and using some embodiments of the present application. Text image 310 shows that the related art can detect text lines in only a single direction (the lateral direction in fig. 3), cannot detect text lines in the longitudinal direction, and suffers some edge loss for the larger-scale characters (rendered in translation as "respectfully"), which are detected incompletely. Text image 320 shows that the technical solution of the application can detect the text image in multiple directions (lateral and longitudinal), multiple angles (horizontal and vertical), and multiple scales (large and small), can detect text lines completely, and can greatly reduce the edge loss of text lines and characters during detection. Besides a PSE segmentation network, the first neural network may also be another CNN (Convolutional Neural Network)-based network for detecting text regions in text images, which is not limited herein.
It will be appreciated that, unlike a scene image, a text image is by nature focused on its text. Therefore, in the embodiments of the application, the quality score of the text image is obtained by weighted summation of the quality scores of the text regions based on the preset weights, which accurately reflects the quality of the text image.
Fig. 4 schematically shows a flowchart of the steps for quality score prediction of text regions in a text image to obtain quality scores of the text regions in an embodiment of the application. As shown in fig. 4, based on the above embodiment, the quality score prediction is performed on the text region in the text image in step S240 to obtain the quality score of the text region, which may further include the following steps S410 to S440.
S410, inputting a text region in the text image into a second neural network;
S420, extracting features of the text region through a convolution layer of the second neural network to obtain plane features;
S430, performing dimension reduction processing on the plane characteristics through a pooling layer of the second neural network to obtain characteristic vectors;
S440, performing full-connection calculation on the feature vector through a full-connection layer of the second neural network to obtain the prediction quality score of the text region.
In some embodiments, the structure of the second neural network may include a convolutional layer, a pooling layer, and a fully-connected layer.
Specifically, fig. 5 schematically illustrates a process of performing feature extraction, dimension reduction processing and full-connection calculation on a text region by using the second neural network according to an embodiment of the present application, and obtaining a predicted quality score of the text region. Referring to fig. 5, the convolution layer of the second neural network performs feature extraction on the text region through convolution, maximum pooling and convolution sub-steps to obtain the first plane feature. Then, the pooling layer of the second neural network obtains the corresponding feature vector by pooling the maximum value and the minimum value of the first plane feature. And then, the full connection layer of the second neural network obtains the quality score corresponding to the text region through full connection calculation of the feature vector.
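By way of illustration, the structure just described (convolution layers producing plane features, max/min pooling producing a feature vector, a fully connected head producing the quality score, cf. fig. 5) might be sketched in PyTorch as follows; the layer widths are assumptions, not values taken from the application.

```python
import torch
import torch.nn as nn

class QualityScoreNet(nn.Module):
    """A minimal sketch of the second neural network; sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(       # convolution layer: plane features
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(2 * 64, 1)     # fully connected layer: quality score

    def forward(self, x):
        f = self.features(x)                 # plane features, shape (B, 64, H, W)
        # Pooling layer: max and min over the spatial dimensions, concatenated
        # into a fixed-length feature vector (cf. fig. 5).
        v = torch.cat([torch.amax(f, dim=(2, 3)),
                       torch.amin(f, dim=(2, 3))], dim=1)
        return torch.sigmoid(self.head(v)).squeeze(1)   # predicted quality score
```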
In some implementations, the text region may be a text line, and the second neural network may be a DIQA (Deep CNN-Based Blind Image Quality Predictor) framework operating on text lines. In a specific embodiment, the second neural network may be constructed based on ResNet (residual network); since ResNet has excellent feature representation capability, it can extract features from the text image more accurately, which improves the accuracy of the quality score prediction. The second neural network may also be another CNN such as a VGG (Visual Geometry Group) network.
In some embodiments, a text line whose length-to-height ratio is greater than a preset value may be segmented into a plurality of text lines whose length-to-height ratios are less than or equal to the preset value. Notably, the size of a text image is typically much larger than the image size accepted by a deep convolutional neural network. If, to satisfy the relatively fixed input size of the deep convolutional neural network, the text image were resized by downsampling before detection, text with smaller characters might become blurred or even illegible after downsampling, ultimately making the detection of text lines inaccurate and the quality score prediction of the text image inaccurate as well.
Therefore, to avoid the degradation caused by downsampling text lines, in the embodiments of the present application the first neural network performs feature extraction and mapping processing on the text image, and after a text region in the text image is detected, the text region is input into the second neural network, i.e., the deep convolutional neural network, for quality score prediction. By detecting the text regions in the text image and inputting them into the second neural network for quality score prediction, the text image is decomposed into one or more text regions before being fed to the second neural network, which avoids the oversized-input problem of feeding the whole text image directly into the network and avoids the image degradation caused by downsampling the text image before detection. In addition, the input images are text regions, and most of the information in a text region image is textual, which facilitates the second neural network's feature extraction and analysis and improves the accuracy and robustness of its image quality prediction. Based on the above effects, it can be understood that the text image quality detection method of the embodiments of the application gives more accurate quality detection results both for business document images such as insurance underwriting documents and contracts, and in natural text image quality evaluation scenarios with more interference and harder-to-recognize characters, such as photographs of bank cards, medical record sheets, and invoices.
In the second neural network, the L2 loss may be employed as the estimation loss describing the difference between the predicted and actual quality scores. Specifically, the estimation loss $L_2$ is defined as:

$$L_2 = (q - q_{gt})^2$$

where $q$ is the predicted quality score and $q_{gt}$ is the quality score labeled on the text image in the text recognition data set, i.e., the text recognition accuracy of the text image when the conversion ratio of the equal-ratio conversion is 1.
Before training the second neural network, the parameters of the fully connected layer may be randomly initialized from a uniform distribution over the range (-0.1, 0.1).
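Continuing the sketch above, the L2 estimation loss and the uniform initialization of the fully connected layer could look as follows; QualityScoreNet is the assumed sketch from earlier, not the application's actual network.

```python
import torch.nn as nn

def init_fc(module):
    # Randomly initialize fully connected parameters uniformly in (-0.1, 0.1).
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, -0.1, 0.1)
        nn.init.uniform_(module.bias, -0.1, 0.1)

model = QualityScoreNet()
model.apply(init_fc)        # only the nn.Linear (fully connected) modules are touched

criterion = nn.MSELoss()    # L2 estimation loss: mean of (q - q_gt)^2
```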
Fig. 6 schematically shows a partial step flow diagram of a quality detection method according to an embodiment of the application, before entering text regions in a text image into a second neural network. As shown in fig. 6, before inputting the text region in the text image into the second neural network in step S410, the following steps S610 to S630 may be further included on the basis of the above embodiments.
S610, acquiring a text recognition data set, wherein the text recognition data set comprises a text image and the recognition accuracy of the text image, the recognition accuracy of the text image is the ratio of the number of recognizable characters of the text image to the actual number of characters, the number of recognizable characters is the number of characters of the text image which can be correctly recognized by a character recognition model, and the actual number of characters is the number of characters actually included in the text image;
S620, performing equal-ratio conversion on the recognition accuracy of the text image to obtain the quality score of the text image, and labeling the quality score of the text image into the text recognition data set;
S630, inputting the text recognition data set into a second neural network, and training the second neural network.
Specifically, performing equal-ratio conversion on the recognition accuracy of the text image to obtain its quality score may mean, for example, converting the recognition accuracy into a percentage quality score, which is then labeled into the text recognition data set. The text recognition data set is used to train the second neural network to improve the accuracy of its quality score predictions.
The conversion ratio of the equal ratio conversion may be 0.1, 1, 10, 100, or the like, and the present application is not limited thereto.
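As a small illustration, the equal-ratio conversion amounts to multiplying the recognition accuracy by the chosen conversion ratio; the helper below is hypothetical, not part of the application.

```python
def quality_label(recognizable_chars: int, actual_chars: int, ratio: float = 1.0) -> float:
    """Equal-ratio conversion of recognition accuracy into a quality-score label.
    With ratio = 1 the label equals the recognition accuracy itself; ratio = 100
    would yield a percentage-style quality score."""
    return ratio * (recognizable_chars / actual_chars)
```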
In some embodiments, the text images in the text recognition data set may be text line images, i.e., images containing one or more serially arranged characters. Serially arranged characters are characters arranged one after another in a sequence; specifically, they may be a single row or a single column of characters. Training the second neural network on text line images reduces background clutter and noise in the images, which reduces their influence on the quality score and improves the accuracy of the quality score prediction. Moreover, if the second neural network is trained on single-line text images, the properties of its training inputs are close to the properties of its inputs at prediction time, which are likewise images containing one or more serially arranged characters; this helps improve the accuracy of the quality scores the second neural network predicts for text regions.
In some embodiments, the text images in the text recognition data set may be artificial images synthesized using blur and noise algorithms, in which case the quality scores of the text images in the data set can be generated automatically during synthesis.
In other embodiments, the text image in the text recognition dataset may be derived from a real text image and the quality score of the text image in the text recognition dataset may be derived from a manual score.
In still other embodiments, the text images in the text recognition data set may be derived from the recognition results of an OCR (Optical Character Recognition) model on real text images in a text recognition task. In this embodiment, the recognition accuracy of each text image undergoes equal-ratio conversion to obtain its quality score, which is labeled into the text recognition data set. Compared with manually labeled quality scores, this has the advantage of being more objective; and because the accuracy is a continuous real number rather than a manually labeled discrete score, it can improve the training of the second neural network's parameter values and help the training task converge better. Moreover, subjectively scoring a text image is difficult and data sets of image quality scores are scarce, whereas data sets of recognition accuracy from text recognition tasks are widely available; this scheme therefore derives the quality scores of text images from their recognition accuracy. At the same time, real text images match actual scenarios better than artificial images synthesized with blur and noise algorithms, which benefits the parameter training of the second neural network and thus the accuracy of the quality scores it predicts for text regions.
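Tying steps S610 to S630 together, a minimal training-loop sketch might look as follows, assuming a dataset object that yields (text line image, quality score label) pairs prepared as described above and reusing the model and criterion sketches from earlier; batch size, learning rate, and epoch count are assumptions.

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    for images, q_gt in loader:
        q = model(images)             # predicted quality scores
        loss = criterion(q, q_gt)     # L2 estimation loss against the labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```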
Fig. 7 schematically shows a flowchart of the steps for predicting the quality score of a text region in a text image to obtain the quality score of the text region according to an embodiment of the present application. As shown in fig. 7, based on the above embodiment, the text region is a text line, a text line being a continuous image region in the text image containing one or more serially arranged characters, and the quality score prediction performed on the text regions in the text image in step S240 to obtain their quality scores may further include the following steps S710 to S730.
S710, detecting text lines in a text image and the length-to-height ratio of the text lines, wherein the length-to-height ratio is the ratio of the length to the height of the text lines, the length of the text lines is the length of extension lines of the text lines extending along the character arrangement direction in the text lines, and the height of the text lines is the height of the text lines perpendicular to the extension lines;
S720, dividing a text line with the length-height ratio larger than a preset value into a plurality of text lines with the length-height ratio smaller than or equal to the preset value;
S730, respectively performing feature extraction and quality score prediction on the plurality of text lines whose length-to-height ratios are less than or equal to the preset value, to obtain the quality scores of those text lines.
By segmenting a text line whose length-to-height ratio is greater than the preset value into a plurality of text lines whose length-to-height ratios are less than or equal to the preset value, a long text line with many characters can be divided into shorter text lines whose length-to-height ratios are at most the preset value. Long text is thus segmented adaptively, overly long text lines are prevented from being input into the second neural network, and the prediction accuracy of the second neural network can be greatly improved.
In some embodiments, the text images in the text recognition data set may be text line images, i.e., images containing one or more serially arranged characters; more specifically, the length-to-height ratio of each text line image may be less than or equal to the preset value. The images processed in both the training stage (using the text recognition data set) and the quality score prediction stage then have length-to-height ratios at most the preset value, which facilitates the optimization of the second neural network, reduces its computation, and improves the efficiency of quality score prediction for text.
Fig. 8 schematically illustrates a flowchart of the steps for dividing a text line with a length to height ratio greater than a predetermined value into a plurality of text lines with a length to height ratio less than or equal to the predetermined value in an embodiment of the present application. As shown in fig. 8, on the basis of the above embodiment, the step S720 of dividing the text line with the length-height ratio greater than the preset value into a plurality of text lines with the length-height ratio less than or equal to the preset value may further include the following steps S810 to S820.
S810, projecting a text line whose length-to-height ratio is greater than the preset value onto the length direction of the text line to form a one-dimensional set of projection points, wherein the projections of the pixels at the positions of the characters onto the length direction of the text line form real-point projections, the projections of the pixels at positions other than those of the characters form virtual-point projections, and the real-point projections and the virtual-point projections together form the set of projection points;
S820, taking segmentation points on the line segments into which the virtual-point projections aggregate, and segmenting the text line at those segmentation points, so as to segment the text line whose length-to-height ratio is greater than the preset value into a plurality of text lines whose length-to-height ratios are less than or equal to the preset value.
Specifically, a real-point projection may be a black pixel and a virtual-point projection a white pixel, so the one-dimensional set of projection points may be a line segment of black pixels, a line segment of white pixels, or a line segment of both. The segmentation points are taken on the line segments into which the virtual-point projections aggregate, and the text line is segmented at those points. It can be understood that these line segments are the projections of pixels at positions other than those of the characters, so taking segmentation points on them means the text line is cut only at positions where no character lies; this avoids splitting a single character across two different text lines, which would make the characters harder to recognize and reduce the accuracy of the quality score prediction.
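A minimal sketch of this projection-based segmentation, assuming dark characters on a light background and a grayscale text-line crop; the threshold and the greedy cutting heuristic are assumptions, not details from the application.

```python
import numpy as np

def split_long_line(line_img, max_ratio=10.0, thresh=128):
    """Split a text line whose length-to-height ratio exceeds max_ratio at the
    centers of gaps (runs of virtual-point projections) between characters."""
    h, w = line_img.shape
    if w / h <= max_ratio:
        return [line_img]
    # Project onto the length direction: a column containing any character pixel
    # is a real point; an all-background column is a virtual point.
    real = (line_img < thresh).any(axis=0)
    # Candidate segmentation points: centers of runs of virtual points.
    splits, start = [], None
    for x, is_real in enumerate(real):
        if not is_real and start is None:
            start = x
        elif is_real and start is not None:
            splits.append((start + x) // 2)
            start = None
    # Greedy heuristic: cut once a piece approaches the ratio limit. A production
    # implementation would choose cut points so every piece satisfies the limit.
    pieces, prev = [], 0
    for s in splits:
        if (s - prev) / h > 0.8 * max_ratio:
            pieces.append(line_img[:, prev:s])
            prev = s
    pieces.append(line_img[:, prev:])
    return pieces
```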
Fig. 9 schematically illustrates a flowchart of the steps for obtaining preset weights corresponding to respective text regions in an embodiment of the present application. As shown in fig. 9, on the basis of the above embodiment, the acquiring of the preset weights corresponding to the respective text regions in step S250 may further include the following steps S910 to S920.
S910, acquiring the number of characters in each text region, and summing the numbers of characters of the text regions to obtain the total number of characters of the text image;
S920, acquiring, for each text region, the ratio of its number of characters to the total number of characters of the text image, and using the ratio as the preset weight corresponding to that text region.
It can be understood that longer text lines occupy more of the content of the text image, so their quality has a greater influence on the quality of the text image. Taking the ratio of the number of characters in each text region to the total number of characters of the text image as that region's preset weight therefore gives longer text lines larger preset weights, so the quality detection method of the embodiments of the application matches the judgment logic of actual quality assessment, which improves both the accuracy and the efficiency of text image quality detection.
In a specific embodiment, the quality score $\hat{q}$ of the text image is:

$$\hat{q} = \sum_j w_j q_j \qquad (1)$$

where $q_j$ is the predicted quality score of the $j$-th text line and $w_j$ is the preset weight of the $j$-th text line in the text image. In some embodiments, $\hat{q}$ may be equal to the sum over all text lines of each predicted quality score multiplied by its corresponding preset weight. $w_j$ may be defined as:

$$w_j = \frac{R(j)}{\sum_k R(k)}$$

where $R(j)$ is the number of characters in the $j$-th text line and $\sum_k R(k)$ is the total number of characters over the $k$ text lines of the text image. In some embodiments, the number of characters in a text line may be detected by a single-character detector such as an MSER detector. In other embodiments, the number of characters $R(k)$ in a text line is approximated from the length-to-height ratio of the text line, for example:

$$R(k) \approx \frac{line\_w}{line\_h}$$

where $line\_w$ is the length of the text line and $line\_h$ is its height, the length of a text line being the length of the extension line along which its characters are arranged and its height being the height perpendicular to that extension line. Approximating the character count of a text line by its length-to-height ratio thus simplifies the detection flow of the quality detection method of the embodiments of the application and reduces the amount of detection.
Or in some embodiments, acquiring preset weights corresponding to the text regions comprises acquiring the length-to-height ratio of each text region, summing the length-to-height ratios of each text region to obtain a total length-to-height ratio, and acquiring the ratio of the length-to-height ratio of each text region to the total length-to-height ratio as the preset weight corresponding to each text region. Therefore, the preset weights corresponding to the text areas are directly calculated by adopting the length-height ratio of the text lines, so that the calculated amount can be reduced, and the detection efficiency of the quality detection method can be improved.
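As a small numeric sketch of this weighted summation, with hypothetical line tuples (line_w, line_h, predicted quality score):

```python
def image_quality_score(lines):
    """Weighted sum of per-line quality scores, with preset weights taken from
    the length-to-height ratios of the text lines (R(k) ~ line_w / line_h)."""
    ratios = [w / h for (w, h, _) in lines]
    total = sum(ratios)
    return sum((r / total) * q for r, (_, _, q) in zip(ratios, lines))

print(image_quality_score([(220, 50, 0.8), (440, 50, 0.6)]))  # about 0.667
```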
Fig. 10 schematically illustrates a scene of detecting the text lines in a text image and the quality score and character count of each text line according to an embodiment of the present application. Referring to fig. 10, text line detection is performed on the text image, quality score prediction and character count detection are then performed on each text line, and the results are displayed below the corresponding text line. For example, below the text line rendered in translation as "women", the displayed character count is 2 and the predicted quality score is 0.8. The predicted quality score of the text image can then be calculated by formula (1) from the quality score predictions and character counts of the text lines.
Fig. 11 schematically illustrates a scene of detecting the text lines in a text image and the quality score and length-to-height ratio of each text line according to an embodiment of the application. Referring to fig. 11, text line detection is performed on the text image, quality score prediction and length-to-height ratio detection are then performed on each text line, and the results are displayed below the corresponding text line. For example, the text line "ultrasound description:" has a length-to-height ratio of 4.4 and a predicted quality score of 0.848. The predicted quality score of the text image can then be calculated by formula (1) from the quality scores and length-to-height ratios of the text lines.
In some embodiments, the text region is a single-word region, a single-word region being a continuous image region in the text image containing one character. Quality score prediction is performed on the single-word regions in the text image to obtain the quality score of each single-word region; the preset weight corresponding to each single-word region is then acquired, and the quality scores of the single-word regions are weighted and summed based on the preset weights to obtain the quality score of the text image. The preset weights of the single-word regions may all be 1. Alternatively, the preset weight of a single-word region may be determined by the proportion of its area to the sum of the areas of all single-word regions, for example:
the quality score $\hat{q}$ of the text image is:

$$\hat{q} = \sum_j v_j q_j$$

where $q_j$ is the predicted quality score of the $j$-th single-word region and $v_j$ is the preset weight of the $j$-th single-word region in the text image. In some embodiments, $\hat{q}$ may be equal to the sum over all single-word regions of each predicted quality score multiplied by its corresponding preset weight. $v_j$ may be defined as:

$$v_j = \frac{S(j)}{\sum_k S(k)}$$

where $S(j)$ is the area of the $j$-th single-word region and $\sum_k S(k)$ is the sum of the areas of the $k$ single-word regions in the text image. It can therefore be understood that characters occupying a larger area in the text image are more important, and their quality has a greater influence on the quality of the text image; obtaining the preset weight of each single-word region in this way and weighting the quality scores accordingly gives larger-area single-word regions larger preset weights, so the quality detection method of the embodiments of the application matches the judgment logic of actual quality assessment and improves the accuracy of text image quality detection.
In some embodiments, the first neural network and the second neural network may be combined into the same deep neural network, and feature values obtained by feature extraction of the text image by the first neural network are shared into the second neural network. Specifically, the quality detection method of the text image may further include:
inputting the feature values obtained by the first neural network's feature extraction on the text image into the second neural network, the feature values being used to help the second neural network obtain, through its convolution layer, one or more of the plane features, the feature vector, and the predicted quality score of the text region.
Therefore, the calculated amount of the quality detection method of the text image in the embodiment of the application can be reduced, and the quality detection efficiency of the text image can be improved.
It should be noted that although the steps of the methods of the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes an embodiment of the apparatus of the present application, which can be used to perform the quality detection method of a text image in the above-described embodiment of the present application. Fig. 12 schematically shows a block diagram of a text image quality detecting apparatus provided by an embodiment of the present application. As shown in fig. 12, the quality detection apparatus 1200 of a text image includes:
a character scale detection module 1210 configured to detect a character scale of the text image, the character scale of the text image being an average scale of characters of the text image;
a scaling module 1220 configured to enlarge the text image so that its character scale falls in a preset scale range when the character scale of the text image is smaller than a first preset scale, and to reduce the text image so that its character scale falls in the preset scale range when the character scale of the text image is larger than a second preset scale, the preset scale range being the scale range larger than the first preset scale and smaller than the second preset scale;
a text region detection module 1230 configured to input the text image into a first neural network and perform feature extraction and mapping processing on the text image through the first neural network to detect one or more text regions in the text image, wherein a sensing window is configured in the first neural network, the sensing window is in the preset scale range, and the sensing window is used to move over the text image to perform feature extraction on the text image;
a quality score prediction module 1240 configured to perform quality score prediction on the text regions in the text image to obtain the quality score of each text region;
and a weighted summation module 1250 configured to acquire a preset weight corresponding to each text region and perform weighted summation on the quality scores of the text regions based on the preset weights to obtain the quality score of the text image.
In some embodiments of the application, based on the above embodiments, the weighted summation module comprises:
a character count acquisition unit configured to acquire the number of characters in each text region and sum the numbers of characters of the text regions to obtain the total number of characters of the text image;
and a preset weight calculation unit configured to acquire, for each text region, the ratio of its number of characters to the total number of characters of the text image, and use the ratio as the preset weight corresponding to that text region.
In some embodiments of the present application, based on the above embodiments, the text region is a text line, a text line being a continuous image region in the text image containing one or more serially arranged characters, and the quality score prediction module includes:
a length-to-height ratio detection unit configured to detect text lines in the text image and the length-to-height ratio of each text line, the length of a text line being the length of the extension line along which its characters are arranged, and its height being the height perpendicular to the extension line;
a text line segmentation unit configured to segment a text line whose length-to-height ratio is greater than a preset value into a plurality of text lines whose length-to-height ratios are less than or equal to the preset value;
and a quality score prediction unit configured to perform feature extraction and quality score prediction separately on the text lines whose length-to-height ratios are less than or equal to the preset value, to obtain their quality scores.
In some embodiments of the present application, based on the above embodiments, the text line segmentation unit includes:
a text line projection subunit configured to project a text line whose length-to-height ratio is greater than the preset value onto the length direction of the text line to form a one-dimensional set of projection points, wherein the projections of the pixels at the positions of the characters onto the length direction of the text line form real-point projections, the projections of the pixels at positions other than those of the characters form virtual-point projections, and the real-point projections and the virtual-point projections together form the set of projection points;
and a text line segmentation subunit configured to take segmentation points on the line segments into which the virtual-point projections aggregate and segment the text line at those segmentation points, so as to segment the text line whose length-to-height ratio is greater than the preset value into a plurality of text lines whose length-to-height ratios are less than or equal to the preset value.
In some embodiments of the present application, based on the above embodiments, the quality score prediction module may include:
an input unit configured to input a text region of the text image into a second neural network;
the feature extraction unit is configured to extract features of the text region through a convolution layer of the second neural network to obtain plane features;
the dimension reduction processing unit is configured to perform dimension reduction processing on the plane characteristics through a pooling layer of the second neural network to obtain characteristic vectors;
and the full-connection calculation unit is configured to perform full-connection calculation on the feature vector through a full-connection layer of the second neural network to obtain the prediction quality score of the text region.
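Purely to make these three stages concrete, the following is a toy PyTorch sketch of such a network; the layer widths, the adaptive average pooling, and the sigmoid that bounds the output are illustrative assumptions rather than details fixed by the embodiment.

    import torch
    import torch.nn as nn

    class QualityScorer(nn.Module):
        """Illustrative stand-in for the second neural network: convolution
        layers extract plane features, a pooling layer reduces them to a
        feature vector, and a fully connected layer maps the vector to a
        single quality score. All layer sizes are arbitrary choices."""

        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(            # convolution layers -> plane features
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.pool = nn.AdaptiveAvgPool2d(1)       # pooling layer -> dimension reduction
            self.head = nn.Linear(32, 1)              # fully connected layer -> quality score

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            planes = self.features(x)                 # (N, 32, H, W) plane features
            vector = self.pool(planes).flatten(1)     # (N, 32) feature vector
            return torch.sigmoid(self.head(vector))   # score bounded to (0, 1)

    # One grayscale text-region crop, batch of one:
    score = QualityScorer()(torch.rand(1, 1, 32, 256))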
In some embodiments of the present application, based on the above embodiments, the quality score prediction module may further include:
The data set acquisition unit is configured to acquire a text recognition data set, wherein the text recognition data set comprises a text image and the recognition accuracy of the text image, the recognition accuracy of the text image is the ratio of the number of recognizable characters of the text image to the actual number of characters, the number of recognizable characters is the number of characters of the text image which can be correctly recognized by the character recognition model, and the actual number of characters is the number of characters actually included in the text image;
The data set labeling unit is configured to proportionally convert the recognition accuracy of a text image into a quality score and to label the quality score of the text image in the text recognition data set;
And a neural network training unit configured to input the text recognition data set into a second neural network, and train the second neural network.
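A hedged sketch of this training setup, reusing the QualityScorer sketched above, might look as follows: recognition accuracy is proportionally converted into a quality label and the network is regressed against it with a mean-squared-error loss; the scale factor, optimizer, and learning rate are all illustrative choices.

    import torch
    import torch.nn as nn

    def accuracy_to_score(recognizable: int, actual: int, scale: float = 1.0) -> float:
        """Proportionally convert recognition accuracy into a quality label,
        e.g. 45 of 50 characters recognized -> accuracy 0.9 -> score 0.9
        with scale 1.0 (the scale factor is an illustrative choice)."""
        return scale * recognizable / actual

    # One regression step on a labelled crop (model as sketched above):
    model = QualityScorer()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    crop = torch.rand(1, 1, 32, 256)                      # text-region image
    target = torch.tensor([[accuracy_to_score(45, 50)]])  # quality label in (0, 1)
    loss = nn.functional.mse_loss(model(crop), target)
    loss.backward()
    optimizer.step()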
In some embodiments of the present application, based on the above embodiments, the text region is a single character region, and the single character region is a continuous image region containing one character in the text image.
Specific details of the text image quality detection device provided in the embodiments of the present application have been described in the corresponding method embodiments and are not repeated here.
Fig. 13 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application.
It should be noted that the computer system 1300 of the electronic device shown in Fig. 13 is only an example and should not impose any limitation on the functions or scope of application of the embodiments of the present application.
As shown in Fig. 13, the computer system 1300 includes a central processing unit 1301 (Central Processing Unit, CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory 1302 (Read-Only Memory, ROM) or a program loaded from a storage section 1308 into a random access memory 1303 (Random Access Memory, RAM). Various programs and data necessary for system operation are also stored in the random access memory 1303. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output interface 1305 (I/O interface) is also connected to the bus 1304.
The following components are connected to the input/output interface 1305: an input section 1306 including a keyboard, a mouse, and the like; an output section 1307 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 1308 including a hard disk and the like; and a communication section 1309 including a network interface card such as a local area network card or a modem. The communication section 1309 performs communication processing via a network such as the Internet. A drive 1310 is also connected to the input/output interface 1305 as needed. A removable medium 1311, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1310 as needed, so that a computer program read therefrom is installed into the storage section 1308 as needed.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the present application. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 1309 and/or installed from the removable medium 1311. When executed by the central processing unit 1301, the computer program performs the various functions defined in the system of the present application.
It should be noted that the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of the embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.