Detailed Description
So that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and in the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
First, some of the terms or terminology appearing in the course of describing the embodiments of the present application are explained as follows:
OCR: Optical Character Recognition, which may refer to the recognition of optical characters by image processing and pattern recognition techniques.
CTC: Connectionist Temporal Classification, which can be used to solve the problem that the elements of an input sequence and an output sequence are difficult to align one to one.
CRNN: Convolutional Recurrent Neural Network, which may be a combination of a convolutional neural network (CNN) and a recurrent neural network (RNN); its network architecture mainly consists of three parts: a convolutional layer, a recurrent layer, and a transcription layer.
Attention Mechanism: the attention mechanism can be used to improve the effect of RNN-based encoder-decoder models. By giving a different weight to each word in a sentence, it makes the learning of the neural network model more flexible, and the attention weights can also serve to explain the alignment relationship between the input and output sentences of a translation.
BLSTM: Bidirectional Long Short-Term Memory neural network, which can use information from past and future moments simultaneously. Its network structure consists of two ordinary RNNs: a forward RNN that uses past information and a backward RNN that uses future information.
RESNET: residual Neural Network, a residual neural network, by adding a direct connection channel in the network, allows the original input information to be directly transmitted to the following layer, and the neural network of the layer does not need to learn the whole output, but only needs to learn the residual of the last network output.
Mask: may be a string of binary codes that performs a bitwise AND operation with a target field, masking the current input bits.
Existing text recognition algorithms can be mainly divided into two types: CTC-based text recognition algorithms and Attention-based text recognition algorithms.
A CTC-based text recognition algorithm consists of three parts. The first part is a convolutional neural network that extracts the feature sequence of an image; the second part is a recurrent neural network that learns the context information of the feature sequence of the text data; and the third part is a CTC decoder, which solves the alignment problem of sequence recognition by introducing a blank class and can decode the probability distribution output by the preceding recurrent neural network into a final recognition result. In such a method, the CNN of the first part usually reduces the height of the image to 1, so the height of the input image must be fixed, which makes it impossible for the model to process images containing vertically arranged text data.
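For reference, the behavior of the CTC decoder can be illustrated with a minimal sketch of greedy (best-path) decoding, in which repeated labels are collapsed and the blank class is removed. The blank index, the alphabet, and the probability table below are illustrative assumptions, not part of the present application.

import numpy as np

BLANK = 0  # index reserved for the CTC blank class (an assumed convention)

def ctc_greedy_decode(probs: np.ndarray, charset: str) -> str:
    """probs: (T, C) per-timestep class probabilities from the recurrent layer."""
    best_path = probs.argmax(axis=1)                      # most likely class per timestep
    collapsed = [k for i, k in enumerate(best_path)
                 if i == 0 or k != best_path[i - 1]]      # collapse repeated labels
    return "".join(charset[k - 1] for k in collapsed if k != BLANK)

# Toy example: 5 timesteps over {blank, 'a', 'b'} decode to "ab".
charset = "ab"
probs = np.array([[0.1, 0.8, 0.1],    # 'a'
                  [0.1, 0.8, 0.1],    # 'a' (repeat, collapsed)
                  [0.9, 0.05, 0.05],  # blank
                  [0.1, 0.1, 0.8],    # 'b'
                  [0.9, 0.05, 0.05]]) # blank
print(ctc_greedy_decode(probs, charset))  # -> "ab"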
An Attention-based text recognition algorithm also consists of three major parts, the first two of which are the same as in the CTC-based algorithm, but the third part uses an attention mechanism (Attention Mechanism) for decoding and outputs the recognition results one time step at a time. Unlike CTC methods, the CNN output of an Attention-based algorithm may be a two-dimensional feature map, so in theory such algorithms can process images containing text data arranged in both the horizontal and the vertical direction. However, since the number of Chinese character classes is far greater than that of English, the Attention algorithm, which classifies per time step, performs poorly on Chinese text line recognition; in addition, its forward pass takes longer than that of the CTC algorithm, so the Attention algorithm is not suitable for the present text recognition scenario.
Since a conventional text line recognition algorithm can only process text line images arranged in one direction, the application range of the model is greatly limited. The simplest solution is to train two models separately, but this requires storing both models online, which wastes resources.
To solve these problems, the present application optimizes and improves the text recognition algorithm to provide a model capable of simultaneously recognizing text data arranged in both the horizontal and the vertical direction, which not only avoids storing two models but also improves the recognition effect of the model.
The text recognition algorithm can be applied to various fields that need text recognition, for example, the design-assistance field of an online shopping platform. To help a merchant design a product detail page, the merchant may upload a design template; the algorithm can recognize text data arranged in different directions in the design template and replace it according to the merchant's needs, so as to obtain the merchant's own product detail page. Further, for ease of use, the algorithm can be deployed on a cloud server as a Software as a Service (SaaS) offering, and users can recognize text data arranged in different directions over the Internet as required.
Example 1
According to an embodiment of the present application, an image recognition method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described herein.
The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing the image recognition method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. The computer terminal 10 may further include a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the above electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the present application, the data processing circuit acts as a kind of processor control (e.g., the selection of the path of a variable resistance termination connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the image recognition method in the embodiments of the present application. The processor 102 executes the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the image recognition method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is configured to receive or transmit data via a network. Specific examples of the above network may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module configured to communicate with the Internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that, in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both. It should be noted that fig. 1 is only one specific example, intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
In the above-described operating environment, the present application provides an image recognition method as shown in fig. 2. Fig. 2 is a flowchart of an image recognition method according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:
Step S202, a first image and a second image are obtained, wherein the first image contains first text data, the second image contains second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction;
The first image and the second image in the above step may refer to text images extracted from an image to be processed, where the texts contained in the two images are arranged in different directions, namely the first direction and the second direction respectively. The image to be processed may be, but is not limited to, a template image uploaded by a merchant in the design-assistance field of an online shopping platform, or an image uploaded by a user and received by the SaaS service. In practical applications, the characters in the image to be processed are usually arranged in the horizontal direction, but may also be arranged in the vertical direction; in the embodiments of the present application, the first direction is taken as the horizontal direction and the second direction as the vertical direction by way of example, but the present application is not limited thereto.
In an alternative embodiment, an image to be processed containing both horizontal text and vertical text can be processed by an image text extraction method to extract a first image containing the horizontal text and a second image containing the vertical text; the two images are then processed simultaneously by one model to obtain the final recognition results.
Step S204, processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image;
The text recognition model is configured to input the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, to input the first feature sequence into a first recognition model to obtain the first recognition result, and to input the second feature sequence into a second recognition model to obtain the second recognition result.
Optionally, the feature extraction model comprises a plurality of sequentially connected convolutional layers, excitation layers, and pooling layers, wherein the kernel of the last pooling layer is 2×1, and the first recognition model or the second recognition model comprises a convolutional neural network, a bidirectional long short-term memory neural network, and connectionist temporal classification (CTC).
To recognize text data arranged in a plurality of different directions simultaneously with one model, the present application improves on the CTC-based text recognition algorithm. Since the low-level features of horizontally arranged text are similar to those of vertically arranged text, most of the existing convolutional neural network can be shared as a feature extraction model, thereby reducing the number of parameters. In addition, text data arranged in different directions can be recognized by different recognition models built on top of the shared model, and the structures of these recognition models are completely identical.
The main structure of the feature extraction model can be obtained by trimming VGG16 (Visual Geometry Group Network); it is used to extract the feature information of a text line image for subsequent recognition. The first recognition model and the second recognition model may have identical structures, each consisting of a CNN, a BLSTM, and CTC.
It should be noted that, to enhance the feature extraction capability, ResNet may also be used as the feature extraction model, but the present application is not limited thereto, and other structures may be used as well.
Since the sequence recognition model in a conventional CTC-based text recognition algorithm needs to reduce the height of the image to 1 while keeping sufficient width, the model contains pooling layers with a 2×1 kernel. The shared feature extraction model therefore ends where the 2×1 pooling layers begin, and the first recognition model and the second recognition model each start from the first 2×1-shaped pooling layer.
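As a small illustration (with assumed tensor sizes) of why the 2×1 kernel is used, the following PyTorch snippet shows that such a pooling layer halves the feature-map height while preserving the width that serves as the CTC time axis:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=(2, 1))  # pools along the height only
x = torch.randn(1, 256, 4, 100)          # (batch, channels, H=4, W=100), assumed sizes
print(pool(x).shape)                     # torch.Size([1, 256, 2, 100]): H halved, W kept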
For example, taking text data arranged in the horizontal and vertical directions as an example, as shown in fig. 3, the overall framework of the text recognition model of the present application can be divided into three parts: the first part is data processing; the second part is the feature extraction model, a shared CNN layer for extracting the feature information of the first text data and the second text data contained in the images; and the third part is the branch recognition networks (including the above first recognition model and second recognition model) for recognizing the text data arranged in the horizontal direction and the vertical direction, respectively.
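Purely as an illustrative sketch of this three-part framework, and not the exact network of the present application, the following PyTorch code wires a shared CNN trunk to two structurally identical branches (CNN with 2×1 pooling, BLSTM, and a linear projection whose outputs feed CTC). All layer depths, channel counts, and the handling of the vertical sequence axis are assumptions.

import torch
import torch.nn as nn

class Branch(nn.Module):
    """One recognition branch: CNN with 2x1 pooling -> BLSTM -> per-timestep logits for CTC."""
    def __init__(self, in_ch: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                  # halve height, keep width (time axis)
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.blstm = nn.LSTM(in_ch, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)   # num_classes includes the blank

    def forward(self, feat: torch.Tensor) -> torch.Tensor:  # feat: (B, C, H, W)
        f = self.cnn(feat)
        f = f.mean(dim=2)                # collapse the remaining height
        f = f.permute(0, 2, 1)           # (B, W, C): width steps become timesteps
        out, _ = self.blstm(f)
        return self.fc(out)              # (B, W, num_classes), fed to a CTC decoder/loss

class TwoDirectionRecognizer(nn.Module):
    """Shared CNN trunk plus two structurally identical branches, one per text direction."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Shared trunk; 4 input channels = RGB image spliced with its mask matrix.
        self.shared = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.horizontal = Branch(128, num_classes)
        self.vertical = Branch(128, num_classes)

    def forward(self, x: torch.Tensor, direction: str) -> torch.Tensor:
        feat = self.shared(x)
        if direction == "vertical":
            # One possible arrangement (not fixed by the application): let the
            # height of the vertical text line act as the sequence axis.
            return self.vertical(feat.transpose(2, 3))
        return self.horizontal(feat)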
Based on the scheme provided by this embodiment of the present application, for images containing text data in two different arrangement directions, a single text recognition model can be used to obtain the recognition results for both arrangement directions, thereby achieving the purpose of image recognition. It is easy to see that, by recognizing horizontal and vertical text lines simultaneously with one model through two recognition branches, storing two recognition models online is avoided, the technical effect of saving computation and storage resources is achieved, and the technical problem in the related art that recognizing text data arranged in multiple directions wastes computation and storage resources is solved.
In this embodiment of the present application, before the first image and the second image are processed by the text recognition model, the method further comprises: obtaining the position of each first character in the first image and the position of each second character in the second image; generating a first mask matrix corresponding to the first image based on the position of each first character, and a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used to represent the arrangement direction of the first text data and the second mask matrix is used to represent the arrangement direction of the second text data; splicing the first image, the first mask matrix, the second image, and the second mask matrix to obtain an input matrix; and processing the input matrix by the text recognition model to obtain the first recognition result and the second recognition result.
Optionally, the first mask matrix and the second mask matrix in the above steps may be asymmetric matrices capable of identifying direction information; the values of a mask matrix are related only to the positions and the arrangement direction of the characters in the text data. In this embodiment of the present application, for text data arranged in the horizontal direction, the values of the first mask matrix increase sequentially from left to right and from top to bottom. Therefore, the arrangement direction of the text data can be determined from the values of the mask matrix.
In an alternative embodiment, since the accuracy requirement on character positions during recognition is not high, the position of each character in the text data can be obtained by a simple projection segmentation method, and a mask matrix is generated based on the character positions and the arrangement direction. The information of the mask matrix is added to the original first image and second image, which are then processed by the text recognition model. This enhances the model's ability to distinguish between horizontally and vertically arranged text data, allows the shared CNN layer to learn, in a targeted manner, feature information applicable to texts in different directions, and finally improves the recognition effect of the model.
For example, still taking text data arranged in the horizontal and vertical directions as an example, as shown in fig. 3, the data processing may be to generate a mask matrix from the character positions and then splice each image with its corresponding mask in the channel dimension; this four-channel matrix is used as the input of the model.
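The following NumPy sketch illustrates this data-processing step for the horizontal case: a mask matrix whose values increase with character position is built and spliced onto the image in the channel dimension to form the four-channel model input. The character boxes and image sizes are assumptions chosen to match the worked example below.

import numpy as np

def horizontal_mask(h: int, w: int, char_boxes):
    """char_boxes: list of (x0, x1) column spans of the characters, left to right."""
    mask = np.zeros((h, w), dtype=np.float32)
    for idx, (x0, x1) in enumerate(char_boxes, start=1):
        mask[:, x0:x1] = idx          # values grow from left to right
    return mask

image = np.random.rand(3, 32, 400)    # (C, H, W): an RGB text-line image (assumed size)
boxes = [(i * 50, (i + 1) * 50) for i in range(8)]   # eight 32x50 character regions
mask = horizontal_mask(32, 400, boxes)
model_input = np.concatenate([image, mask[None, ...]], axis=0)  # splice in channel dim
print(model_input.shape)              # (4, 32, 400): the four-channel model input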
In this embodiment of the present application, generating the second mask matrix corresponding to the second image based on the position of each second character comprises: rotating the second image to obtain a rotated image, wherein the arrangement direction of the second text data in the rotated image is the first direction; generating a preset matrix based on the position of each second character; and rotating the preset matrix to obtain the second mask matrix.
The preset matrix in the above step may be the mask matrix corresponding to the rotated image. In this embodiment of the present application, similarly to the first mask matrix for text data arranged in the horizontal direction, the values of the preset matrix may increase sequentially from left to right and from top to bottom.
In an alternative embodiment, the second image may be rotated so that the arrangement direction of its text data is the same as that of the first image; a mask matrix corresponding to the rotated image is then generated from the character positions, and the mask matrix corresponding to the second image (i.e., the second mask matrix described above) is obtained by rotating the generated mask matrix back.
It should be noted that the rotation direction of the second image is opposite to the rotation direction of the preset matrix.
For example, still taking text data arranged in the horizontal and vertical directions as an example, as shown in fig. 3, the image containing the vertically arranged text data may first be rotated so that the text data becomes horizontal; a mask matrix is then generated from the character positions, and this mask matrix is rotated counterclockwise by 90 degrees to obtain the mask matrix corresponding to the original image containing the vertically arranged text data. Finally, each image is spliced with its corresponding mask in the channel dimension, and the four-channel matrix is used as the input of the model.
As shown in fig. 4, the text data "standardization technical commission" arranged in the horizontal direction contains 8 characters, so a mask region corresponding to each character can be generated, with the values increasing sequentially from left to right, thereby obtaining an image to which the mask matrix is added. Note that the size of each mask region is the same as the size of the corresponding character region; for example, if the region corresponding to the character "standard" is 32×50, the mask region corresponding to that character is also 32×50. If the region corresponding to each character is 32×50, the mask matrix corresponding to the image is 32×400, i.e., the mask matrix has the same size as the image.
For the text data "economic check team" arranged in the vertical direction, which contains 6 characters, the text data can be rotated into horizontally arranged text data, a mask region corresponding to each character is generated with values increasing sequentially from left to right, and the mask matrix is then rotated counterclockwise by 90 degrees to obtain the final mask matrix, from which the image with the mask matrix added is obtained.
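The rotation trick can be sketched as follows. Per the passage above, the image is rotated in one direction (clockwise here, since the mask is later rotated back counterclockwise, the opposite direction, as noted earlier); all image sizes and character box spans are illustrative assumptions.

import numpy as np

def horizontal_mask(h, w, char_boxes):   # same helper as in the previous sketch
    mask = np.zeros((h, w), dtype=np.float32)
    for idx, (x0, x1) in enumerate(char_boxes, start=1):
        mask[:, x0:x1] = idx
    return mask

vertical_image = np.random.rand(3, 300, 32)            # (C, H, W): six stacked 50x32 characters
rotated = np.rot90(vertical_image, k=-1, axes=(1, 2))  # rotate clockwise -> (3, 32, 300)
boxes = [(i * 50, (i + 1) * 50) for i in range(6)]
mask_h = horizontal_mask(32, 300, boxes)               # mask for the rotated (horizontal) image
mask_v = np.rot90(mask_h, k=1)                         # rotate back counterclockwise -> (300, 32)
model_input = np.concatenate([vertical_image, mask_v[None, ...]], axis=0)
print(model_input.shape)                               # (4, 300, 32)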
In this embodiment of the present application, the method further comprises: obtaining a plurality of training images, wherein the arrangement direction of the text data contained in each training image is the first direction or the second direction; and training an initial model by using the plurality of training images to obtain the text recognition model.
The training data in the above step may include images of text data arranged in different directions.
In an alternative embodiment, images of text data arranged in different directions can be mixed, and the mixed images are used as training data to train the model. During joint training, the training data in different arrangement directions complement each other, which enhances the feature extraction capability of the encoder and improves the generalization and robustness of the text recognition model; the recognition effect of the resulting model is better than that of a model trained on training data arranged in a single direction.
For example, still taking text data arranged in the horizontal and vertical directions as an example, the text recognition model shown in fig. 3 can be trained with a mixture of horizontally and vertically arranged text data, thereby obtaining the trained text recognition model.
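A hedged sketch of one joint-training step follows, reusing the TwoDirectionRecognizer sketch given earlier; the charset size, learning rate, and data shapes are illustrative assumptions, and CTC loss is applied to whichever branch matches each batch's direction.

import torch
import torch.nn as nn

model = TwoDirectionRecognizer(num_classes=5000)   # e.g. a Chinese charset plus the blank
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, direction, targets, target_lens):
    """One training step on a batch of horizontally or vertically arranged samples."""
    logits = model(images, direction)                   # (B, T, num_classes)
    log_probs = logits.log_softmax(2).permute(1, 0, 2)  # (T, B, C), as CTCLoss expects
    input_lens = torch.full((images.size(0),), log_probs.size(0), dtype=torch.long)
    loss = ctc(log_probs, targets, input_lens, target_lens)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Mixed training: alternate (or shuffle) batches of the two directions so that the
# shared trunk sees both, e.g.
# loss_h = train_step(h_images, "horizontal", h_targets, h_target_lens)
# loss_v = train_step(v_images, "vertical", v_targets, v_target_lens)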
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
From the description of the above embodiments, it will be clear to those skilled in the art that the methods according to the above embodiments may be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) that includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods according to the embodiments of the present application.
Example 2
According to an embodiment of the present application, there is further provided an image recognition apparatus for implementing the above image recognition method, as shown in fig. 5, the apparatus 500 includes a first acquisition module 502 and a processing module 504.
The first obtaining module 502 is configured to obtain a first image and a second image, where the first image contains first text data, the second image contains second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction. The processing module 504 is configured to process the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image, where the text recognition model is configured to input the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, input the first feature sequence into a first recognition model to obtain the first recognition result, and input the second feature sequence into a second recognition model to obtain the second recognition result.
It should be noted that the first obtaining module 502 and the processing module 504 correspond to steps S202 to S204 in Example 1; the examples and application scenarios implemented by the two modules are the same as those of the corresponding steps, but are not limited to what is disclosed in Example 1. It should also be noted that the above modules may run, as a part of the apparatus, in the computer terminal 10 provided in Example 1.
In the embodiment of the application, the device further comprises a second acquisition module, a generation module and a splicing module.
The second acquisition module is configured to acquire the position of each first character in the first image and the position of each second character in the second image; the generation module is configured to generate a first mask matrix corresponding to the first image based on the position of each first character and a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used to represent the arrangement direction of the first text data and the second mask matrix is used to represent the arrangement direction of the second text data; and the splicing module is configured to splice the first image, the first mask matrix, the second image, and the second mask matrix to obtain an input matrix, and to process the input matrix by using the text recognition model to obtain the first recognition result and the second recognition result.
In the above embodiment of the present application, the generating module includes a first rotating unit, a generating unit, and a second rotating unit.
The first rotating unit is configured to rotate the second image to obtain a rotated image, wherein the arrangement direction of the second text data in the rotated image is the first direction; the generating unit is configured to generate a preset matrix based on the position of each second character; and the second rotating unit is configured to rotate the preset matrix to obtain the second mask matrix.
In the above embodiment of the present application, the apparatus further includes a third acquisition module and a training module.
The third acquisition module is used for acquiring a plurality of training images, wherein the arrangement direction of text data contained in each training image is a first direction or a second direction; the training module is used for training the initial model by utilizing a plurality of training images to obtain a text recognition model.
It should be noted that the preferred implementations, application scenarios, and implementation processes of this example are the same as those provided in Example 1, but are not limited to what is provided in Example 1.
Example 3
According to the embodiment of the application, an image recognition method is also provided.
Fig. 6 is a flowchart of another image recognition method according to an embodiment of the present application. As shown in fig. 6, the method includes the steps of:
Step S602, a first image and a second image are obtained, wherein the first image contains first text data, the second image contains second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction;
The first image and the second image in the above step may refer to text images extracted from an image to be processed, where the texts contained in the two images are arranged in different directions, namely the first direction and the second direction respectively. The image to be processed may be, but is not limited to, a template image uploaded by a merchant in the design-assistance field of an online shopping platform, or an image uploaded by a user and received by the SaaS service. In practical applications, the characters in the image to be processed are usually arranged in the horizontal direction, but may also be arranged in the vertical direction; in the embodiments of the present application, the first direction is taken as the horizontal direction and the second direction as the vertical direction by way of example, but the present application is not limited thereto.
Step S604, extracting features of the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image;
Step S606, based on the first feature sequence, acquiring a first identification result of the first image;
Step S608, based on the second feature sequence, obtains a second recognition result of the second image.
In this embodiment of the present application, performing feature extraction on the first image and the second image to obtain the first feature sequence of the first image and the second feature sequence of the second image comprises: inputting the first image and the second image into a shared feature extraction model to obtain the first feature sequence and the second feature sequence.
Optionally, the feature extraction model comprises a plurality of sequentially connected convolutional layers, excitation layers, and pooling layers, wherein the kernel of the last pooling layer is 2×1. The main structure of the feature extraction model can be obtained by trimming VGG16 (Visual Geometry Group Network); it is used to extract the feature information of a text line image for subsequent recognition.
In the embodiment of the application, acquiring the first recognition result of the first image based on the first feature sequence comprises inputting the first feature sequence into the first recognition model to obtain the first recognition result.
Optionally, the first recognition model comprises a convolutional neural network, a bidirectional long short-term memory neural network, and connectionist temporal classification; that is, its structure may consist of a CNN, a BLSTM, and CTC.
In this embodiment of the present application, obtaining the second recognition result of the second image based on the second feature sequence comprises: inputting the second feature sequence into the second recognition model to obtain the second recognition result.
Optionally, the second recognition model comprises a convolutional neural network, a bidirectional long short-term memory neural network, and connectionist temporal classification; its structure may be the same as that of the first recognition model, consisting of a CNN, a BLSTM, and CTC.
In this embodiment of the present application, before the feature extraction is performed on the first image and the second image, the method further comprises: obtaining the position of each first character in the first image and the position of each second character in the second image; generating a first mask matrix corresponding to the first image based on the position of each first character, and a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used to represent the arrangement direction of the first text data and the second mask matrix is used to represent the arrangement direction of the second text data; splicing the first image, the first mask matrix, the second image, and the second mask matrix to obtain an input matrix; and performing feature extraction on the input matrix to obtain the first feature sequence and the second feature sequence.
Optionally, the first mask matrix and the second mask matrix in the above steps may be asymmetric matrices capable of identifying direction information; the values of a mask matrix are related only to the positions and the arrangement direction of the characters in the text data. In this embodiment of the present application, for text data arranged in the horizontal direction, the values of the first mask matrix increase sequentially from left to right and from top to bottom. Therefore, the arrangement direction of the text data can be determined from the values of the mask matrix.
In the embodiment of the application, generating the second mask matrix corresponding to the second image based on the position of each second character comprises the steps of rotating the second image to obtain a rotated image, wherein the arrangement direction of second text data in the rotated image is the first direction, generating a preset matrix based on the position of each second character, and rotating the preset matrix to obtain the second mask matrix.
In this embodiment of the present application, the method further comprises: obtaining a plurality of training images, wherein the arrangement direction of the text data contained in each training image is the first direction or the second direction; and training the feature extraction model, the first recognition model, and the second recognition model by using the plurality of training images to obtain the text recognition model.
It should be noted that the preferred implementations, application scenarios, and implementation processes of this example are the same as those provided in Example 1, but are not limited to what is provided in Example 1.
Example 4
According to an embodiment of the present application, there is also provided an image recognition system including:
Processor, and
The memory is connected to the processor and is configured to provide the processor with instructions for processing the following steps: acquiring a first image and a second image, where the first image contains first text data, the second image contains second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; and processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image, where the text recognition model is configured to input the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, input the first feature sequence into a first recognition model to obtain the first recognition result, and input the second feature sequence into a second recognition model to obtain the second recognition result.
It should be noted that the preferred implementations, application scenarios, and implementation processes of this example are the same as those provided in Example 1, but are not limited to what is provided in Example 1.
Example 5
Embodiments of the present application may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.
Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.
In this embodiment, the computer terminal may execute program code for the following steps: acquiring a first image and a second image, where the first image contains first text data, the second image contains second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; and processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image, where the text recognition model is used to input the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, input the first feature sequence into the first recognition model to obtain the first recognition result, and input the second feature sequence into the second recognition model to obtain the second recognition result.
Alternatively, fig. 7 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 7, the computer terminal a may include one or more (only one is shown) processors 702, and memory 704.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the image recognition method and apparatus in the embodiments of the present application, and the processor executes the software programs and modules stored in the memory, thereby executing various functional applications and data processing, that is, implementing the image recognition method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to terminal a through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application programs stored in the memory through the transmission device to perform the following steps: acquiring a first image and a second image, where the first image contains first text data, the second image contains second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; and processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image, where the text recognition model is used to input the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, input the first feature sequence into the first recognition model to obtain the first recognition result, and input the second feature sequence into the second recognition model to obtain the second recognition result.
Optionally, the processor may further execute program code for acquiring a position of each first character in the first image and a position of each second character in the second image, generating a first mask matrix corresponding to the first image based on the position of each first character, and generating a second mask matrix corresponding to the second image based on the position of each second character, where the first mask matrix is used to represent an arrangement direction of the first text data and the second mask matrix is used to represent an arrangement direction of the second text data, splicing the first image, the first mask matrix, the second image and the second mask matrix to obtain an input matrix, and processing the input matrix by using the text recognition model to obtain a first recognition result and a second recognition result.
Optionally, the processor may further execute program code for rotating the second image to obtain a rotated image, wherein an arrangement direction of second text data in the rotated image is a first direction, generating a preset matrix based on a position of each second character, and rotating the preset matrix to obtain a second mask matrix.
Optionally, the processor may further execute program code for obtaining a plurality of training images, where an arrangement direction of text data included in each training image is a first direction or a second direction, and training the initial model using the plurality of training images to obtain the text recognition model.
By adopting the embodiment of the application, an image recognition scheme is provided. The text data in two different arrangement directions are identified by one text identification model, so that the on-line storage of the two identification models is avoided, the technical effects of saving calculation and storage resources are achieved, and the technical problem that the text identification method in the related art identifies the text data arranged in a plurality of directions and wastes calculation and storage resources is solved.
The processor can also call the information and application programs stored in the memory through the transmission device to perform the following steps: acquiring a first image and a second image, where the first image contains first text data, the second image contains second text data, the arrangement direction of the first text data is a first direction, and the arrangement direction of the second text data is a second direction; performing feature extraction on the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image; acquiring a first recognition result of the first image based on the first feature sequence; and acquiring a second recognition result of the second image based on the second feature sequence.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is only illustrative; the computer terminal may be a smartphone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, or the like. Fig. 7 does not limit the structure of the above electronic device. For example, the computer terminal A may also include more or fewer components (such as a network interface, a display device, etc.) than shown in fig. 7, or have a different configuration from that shown in fig. 7.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing a terminal device related hardware, and the program may be stored in a computer readable storage medium, where the storage medium may include a flash disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, etc.
Example 6
The embodiment of the application also provides a storage medium. Alternatively, in the present embodiment, the storage medium may be used to store the program code executed by the image recognition method provided in the above embodiment.
Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Optionally, in this embodiment, the storage medium is arranged to store program code for obtaining a first image and a second image, wherein the first image comprises first text data, the second image comprises second text data, the arrangement direction of the first text data is a first direction, the arrangement direction of the second text data is a second direction, processing the first image and the second image by using a text recognition model to obtain a first recognition result of the first image and a second recognition result of the second image, wherein the text recognition model is used for inputting the first image and the second image into a feature extraction model to obtain a first feature sequence of the first image and a second feature sequence of the second image, inputting the first feature sequence into the first recognition model to obtain a first recognition result, and inputting the second feature sequence into the second recognition model to obtain a second recognition result.
Optionally, the storage medium is further configured to store program code for obtaining a position of each first character in the first image and a position of each second character in the second image, generating a first mask matrix corresponding to the first image based on the position of each first character and a second mask matrix corresponding to the second image based on the position of each second character, wherein the first mask matrix is used for representing an arrangement direction of the first text data, the second mask matrix is used for representing an arrangement direction of the second text data, splicing the first image, the first mask matrix, the second image and the second mask matrix to obtain an input matrix, and processing the input matrix by using the text recognition model to obtain a first recognition result and a second recognition result.
Optionally, the storage medium is further configured to store program code for rotating the second image to obtain a rotated image, wherein an arrangement direction of the second text data in the rotated image is a first direction, generating a preset matrix based on a position of each second character, and rotating the preset matrix to obtain a second mask matrix.
Optionally, the storage medium is further configured to store program code for obtaining a plurality of training images, wherein an arrangement direction of text data included in each training image is a first direction or a second direction, and training the initial model by using the plurality of training images to obtain the text recognition model.
Optionally, in this embodiment the storage medium is arranged to store program code for obtaining a first image and a second image, wherein the first image comprises first text data and the second image comprises second text data, the arrangement direction of the first text data is a first direction, the arrangement direction of the second text data is a second direction, performing feature extraction on the first image and the second image to obtain a first feature sequence of the first image and a second feature sequence of the second image, obtaining a first recognition result of the first image based on the first feature sequence, and obtaining a second recognition result of the second image based on the second feature sequence.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present application. The storage medium includes a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium that can store program code.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application, which are intended to be comprehended within the scope of the present application.