WO2024244760A1 - Mathematical formula recognition method and apparatus, electronic device, and readable storage medium - Google Patents
- Publication number
- WO2024244760A1 (PCT/CN2024/088254)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- vector
- feature map
- character
- unit
- sub
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/22—Character recognition characterised by the type of writing
- G06V30/226—Character recognition characterised by the type of writing of cursive writing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Definitions
- the present disclosure relates to the field of data processing technology, and in particular to a mathematical formula recognition method, device, electronic device and readable storage medium.
- handwritten formula recognition is usually implemented using an encoding and decoding architecture.
- a handwritten formula image is input into an encoder, which extracts an image feature map; then, the image feature map is input into a decoder, which recognizes the handwritten formula character by character and finally recognizes the handwritten formula.
- the encoding-decoding architecture in the existing solution is based on latex tags for character-by-character prediction.
- when complex structures such as fractions, radicals, exponents, logarithms, etc. are nested in the formula to be recognized, it is very easy for a "{" or "}" to be lost; especially when the formula to be recognized is long, some characters are easily lost during the recognition process, resulting in recognition errors and reduced recognition accuracy.
- the present disclosure provides a mathematical formula recognition method, device, electronic device and readable storage medium to address the deficiencies of the related art.
- a mathematical formula recognition method comprising: acquiring an original image containing a mathematical formula; inputting the original image into a formula recognition model to obtain a predicted character set output by the formula recognition model; the predicted character set comprising character data and structural data; restoring the position of the character data in the mathematical formula according to a preset formula format and structural data to obtain the mathematical formula in the original image.
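The restoration step above can be sketched as follows. This is a minimal illustration only: the token names ("frac", "end") are a hypothetical stand-in for the patent's unspecified preset formula format, not the actual format used by the disclosure.

```python
# Minimal sketch of restoring a formula from predicted character data and
# structural data. The structural tokens "frac" and "end" are hypothetical
# placeholders for the patent's preset formula format.

def restore_formula(tokens):
    """Rebuild a LaTeX string from a flat (character, structure) token stream."""
    def parse(pos):
        out = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "end":                      # structural token: close current group
                return "".join(out), pos + 1
            if tok == "frac":                     # structural token: a fraction follows
                above, pos = parse(pos + 1)       # numerator runs until its "end"
                below, pos = parse(pos)           # denominator runs until its "end"
                out.append("\\frac{%s}{%s}" % (above, below))
            else:                                 # ordinary character data
                out.append(tok)
                pos += 1
        return "".join(out), pos
    return parse(0)[0]

# "1/2 + x" expressed as character data plus structural markers:
print(restore_formula(["frac", "1", "end", "2", "end", "+", "x"]))
```

Because the structural tokens mark where each group opens and closes, a brace can never be silently dropped: the group boundaries are reconstructed from the structure data rather than predicted as literal "{" and "}" characters.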
- the formula recognition model includes an encoder and a decoder; the encoder is used to obtain an image feature map corresponding to the original image; and the decoder is used to determine a predicted character set corresponding to the original image based on the image feature map.
- the encoder includes a DenseNet network or a Transformer network.
- the decoder includes a multi-scale counting module, a cyclic decoding module, a feature fusion module and a character prediction module;
- the multi-scale counting module is used to convert the image feature map into a plurality of counting vectors of preset scales;
- the cyclic decoding module is used to obtain the character latent vector according to the image feature map;
- the feature fusion module is used to fuse the counting vector, the character latent vector and the previous predicted character vector to obtain a target vector;
- the character prediction module is used to predict character data and structure data respectively according to the target vector.
- the cyclic decoding module is also used to obtain a context vector based on the image feature map; the feature fusion module is also used to fuse the count vector, the context vector, the character latent vector and the previous predicted character vector to obtain a target vector.
- the feature fusion module performs linear transformation processing on the count vector, the context vector, the character latent vector and the previous predicted character vector, respectively, to obtain a first linear vector, a second linear vector, a third linear vector and a fourth linear vector; and obtains the sum vector of the first linear vector, the second linear vector, the third linear vector and the fourth linear vector to obtain a target vector.
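The fusion step above can be sketched as follows. This is a toy illustration under assumed dimensions and weights (identity matrices stand in for the learned linear transforms); it only shows that the element-wise sum of the four linearly transformed vectors preserves the common output dimension.

```python
# Sketch of the feature fusion step: each input vector gets its own linear
# transform, and the four results are summed element-wise, so the target
# vector keeps the common output dimension. Weights here are toy values.

def linear(vec, weight):
    # weight: out_dim x in_dim matrix as plain Python lists
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def fuse(count_v, context_v, hidden_v, prev_char_v, weights):
    parts = [linear(v, w) for v, w in
             zip((count_v, context_v, hidden_v, prev_char_v), weights)]
    # element-wise sum: the value at each position is the sum of the four
    # linear vectors at that position, so the dimension is unchanged
    return [sum(vals) for vals in zip(*parts)]

identity = [[1, 0], [0, 1]]
target = fuse([1, 2], [3, 4], [5, 6], [7, 8], [identity] * 4)
print(target)  # element-wise sum of the four 2-dimensional vectors
```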
- the multi-scale counting module includes at least two sub-counting modules and a vector averaging sub-module; the at least two sub-counting modules contain convolution kernels of different sizes and are used to output sub-vectors of the same size, and each sub-vector is used to represent the number of times character data is predicted at different scales; the vector averaging sub-module is used to obtain the average vector of at least two sub-vectors to obtain a counting vector.
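The vector averaging sub-module described above reduces to an element-wise mean, since all sub-vectors share the same size. A minimal sketch with assumed toy values:

```python
# Sketch of the vector averaging sub-module: sub-vectors produced by
# convolution kernels of different sizes share the same length, so the
# counting vector is simply their element-wise mean.

def counting_vector(sub_vectors):
    n = len(sub_vectors)
    return [sum(vals) / n for vals in zip(*sub_vectors)]

# e.g. per-character count estimates at the 3x3, 5x5 and 7x7 scales
print(counting_vector([[2.0, 1.0], [2.2, 0.8], [1.8, 1.2]]))
```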
- the multi-scale counting module includes a first sub-counting module and a second sub-counting module; the first sub-counting module is used to identify feature information of a first preset scale in the image feature map to obtain a first sub-vector; the first sub-vector is used to represent the number of character data recognized under the convolution kernel of the first preset scale; the second sub-counting module is used to identify feature information of a second preset scale in the image feature map to obtain a second sub-vector; the second sub-vector is used to represent the number of character data recognized under the convolution kernel of the second preset scale.
- the first sub-counting module includes a first convolution unit, a channel unit, a conversion unit and a pooling unit;
- the first convolution unit is used to convert the image feature map into a first feature map using a convolution kernel of a first preset size;
- the channel unit is used to assign corresponding weights to different channels of the first feature map to obtain an attention feature map; and the attention feature map and the first feature map are multiplied to obtain a channel feature map;
- the conversion unit is used to scale the channel feature map to obtain a counting feature map;
- the pooling unit is used to pool the counting feature map to obtain a first sub-vector.
- the second sub-counting module includes a second convolution unit, a channel unit, a conversion unit and a pooling unit;
- the second convolution unit is used to convert the image feature map into a second feature map using a convolution kernel of a second preset size;
- the channel unit is used to assign corresponding weights to different channels of the second feature map to obtain an attention feature map; and the attention feature map and the second feature map are multiplied to obtain a channel feature map;
- the conversion unit is used to scale the channel feature map to obtain a counting feature map;
- the pooling unit is used to pool the counting feature map to obtain a second sub-vector.
- the multi-scale counting module further includes a third sub-counting module, and the third sub-counting module is used to identify feature information of a third preset scale in the image feature map to obtain a third sub-vector.
- the third sub-counting module includes a third convolution unit, a channel unit, a conversion unit and a pooling unit;
- the third convolution unit is used to convert the image feature map into a third feature map using a convolution kernel of a third preset size;
- the channel unit is used to assign corresponding weights to different channels of the third feature map to obtain an attention feature map; and the attention feature map and the third feature map are multiplied to obtain a channel feature map;
- the conversion unit is used to scale the channel feature map to obtain a counting feature map;
- the pooling unit is used to pool the counting feature map to obtain a third sub-vector.
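The channel unit shared by the three sub-counting modules above can be sketched as follows. This is a simplified stand-in under assumptions: global average pooling followed by a sigmoid is used for the per-channel weight, whereas the patent leaves the exact weight-generation functions open (it mentions GAP, Linear, ReLU and Sigmoid as options).

```python
# Sketch of the channel unit: global average pooling gives one value per
# channel, a sigmoid turns it into a weight, and each channel of the
# feature map is scaled by its weight (channel attention).
import math

def channel_attention(feature_map):
    # feature_map: list of channels, each an H x W grid of floats
    weighted = []
    for channel in feature_map:
        gap = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        w = 1.0 / (1.0 + math.exp(-gap))          # sigmoid of the pooled value
        weighted.append([[w * x for x in row] for row in channel])
    return weighted

fm = [[[1.0, 1.0], [1.0, 1.0]]]                   # one 2x2 channel, all ones
out = channel_attention(fm)
print(round(out[0][0][0], 3))                     # sigmoid(1.0), about 0.731
```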
- the cyclic decoding module includes a gated recurrent unit; the gated recurrent unit is used to generate the character latent vector according to the latent vector and a previous predicted character vector.
- the cyclic decoding module includes a first gated recurrent unit and a second gated recurrent unit; the first gated recurrent unit is used to generate a first gating vector based on a latent vector and a previous predicted character vector; the second gated recurrent unit is used to generate the character latent vector based on the first gating vector.
- the cyclic decoding module includes an attention submodule; the attention submodule is used to generate a context vector based on the image feature map, the all-zero tensor and the first gating vector; the second gated recurrent unit is used to generate the character latent vector based on the first gating vector and the context vector.
- the attention submodule includes a weight acquisition submodule and a context acquisition submodule; the weight acquisition submodule is used to convolve and linearly process the all-zero tensor to obtain a first feature vector, convolve the image feature map to obtain a second feature vector, and linearly process the first gating vector to obtain a third feature vector; and the first feature vector, the second feature vector and the third feature vector are superimposed to obtain a fourth feature vector;
- the context acquisition submodule is used to perform a first activation process, a linear conversion process and a second activation process on the fourth feature vector to obtain an attention weight; and the image feature map and the attention weight are multiplied, and then summed to obtain a context vector.
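The context acquisition step above can be sketched as follows. This is a simplified stand-in under assumptions: a softmax over per-position scores replaces the patent's "first activation, linear conversion, second activation" chain, and the feature map is flattened to a list of per-position feature vectors.

```python
# Sketch of the context acquisition submodule: attention weights over the
# spatial positions of the image feature map, then a weighted sum of the
# position features yields the context vector.
import math

def context_vector(feature_map, scores):
    # feature_map: one feature vector per spatial position; scores: one per position
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]           # attention weights, sum to 1
    dim = len(feature_map[0])
    return [sum(w * f[d] for w, f in zip(weights, feature_map))
            for d in range(dim)]

fm = [[1.0, 0.0], [0.0, 1.0]]
print(context_vector(fm, [0.0, 0.0]))             # equal scores give the mean
```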
- the character prediction module includes a character prediction submodule and a structure prediction submodule; the character prediction submodule is used to process the target vector to obtain predicted character data; the structure prediction submodule is used to process the target vector to obtain predicted structure data.
- the character prediction submodule includes a linear conversion unit and an activation unit; the linear conversion unit is used to perform linear conversion on the target vector; the activation unit is used to perform activation processing on the vector after linear conversion to obtain predicted character data.
- the structure prediction submodule includes a linear conversion unit and an activation unit; the linear conversion unit is used to perform linear conversion on the target vector; the activation unit is used to perform activation processing on the vector after linear conversion to obtain predicted structure data.
- the formula recognition model is trained through the following steps, including: obtaining training sample data; the training sample data includes a data set of multiple mathematical formulas; each data set includes character data and structural data obtained after expansion according to a preset formula format; the structural data is used to represent the relative position relationship of some characters in the mathematical formula; using the training sample data to train the formula recognition model to be trained to obtain a predicted character set output by the formula recognition model; the predicted character set includes character data and structural data; when it is determined that the character data meets the preset conditions, the training is stopped to obtain a trained formula recognition model.
- determining whether the character data meets a preset condition includes: obtaining the frequency of occurrence of each character data in the predicted character set to obtain the predicted frequency corresponding to each character data; obtaining the difference between the predicted frequency corresponding to each character data and the marked frequency in the label of the training sample data to obtain the prediction error value corresponding to each character data; obtaining the average value of the prediction error values corresponding to the character data in the predicted character set to obtain an error average value; when it is determined that the error average value is less than or equal to a preset average value threshold, determining that the character data meets the preset condition.
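The stopping criterion above can be sketched as follows; the character names and threshold are toy values for illustration only.

```python
# Sketch of the training stop condition: compare the predicted frequency of
# each character with its labeled frequency, average the absolute errors,
# and stop training once the average falls to the preset threshold.

def error_average(predicted_freq, labeled_freq):
    errors = [abs(predicted_freq[c] - labeled_freq[c]) for c in labeled_freq]
    return sum(errors) / len(errors)

pred = {"x": 2, "+": 1, "1": 3}
label = {"x": 2, "+": 1, "1": 2}
avg = error_average(pred, label)
print(avg, avg <= 0.5)                            # compare against a toy threshold
```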
- the training sample data is acquired through the following steps, including: acquiring multiple original images containing mathematical formulas; acquiring character data corresponding to each original image; creating structural data according to a preset formula format and character data and arranging the character data to obtain training sample data corresponding to each original image.
- a mathematical formula recognition device comprising: an original image acquisition module, used to acquire an original image containing a mathematical formula; a prediction set acquisition module, used to input the original image into a formula recognition model to obtain a predicted character set output by the formula recognition model; the predicted character set comprises character data and structural data; and a mathematical formula acquisition module, used to restore the position of the character data in the mathematical formula according to a preset formula format and structural data to obtain the mathematical formula in the original image.
- an electronic device comprising a processor; and a memory for storing a computer program executable by the processor; wherein the processor is configured to execute the computer program in the memory to implement the method described in the first aspect.
- a computer-readable storage medium; when an executable computer program in the storage medium is executed by a processor, the method described in the first aspect can be implemented.
- the scheme provided by the embodiments of the present disclosure can obtain an original image containing a mathematical formula; then, the original image is input into a formula recognition model to obtain a predicted character set output by the formula recognition model; the predicted character set includes character data and structural data; then, the position of the character data in the mathematical formula is restored according to the preset formula format and structural data to obtain the mathematical formula in the original image.
- the relative position relationship of the character data can be represented, and the problem of missing characters in the complex formula recognition process can be eliminated, which is conducive to improving the recognition accuracy.
- Fig. 1 is a flow chart showing a method for identifying a mathematical formula according to an exemplary embodiment.
- Fig. 2 is a block diagram of a formula recognition model according to an exemplary embodiment.
- Fig. 3 is a block diagram of a decoder according to an exemplary embodiment.
- Fig. 4 is a block diagram of a multi-scale counting module according to an exemplary embodiment.
- Fig. 5 is a block diagram of a first sub-counting module according to an exemplary embodiment.
- Fig. 6 is a block diagram showing a cyclic decoding module according to an exemplary embodiment.
- Fig. 7 is a block diagram of an attention submodule according to an exemplary embodiment.
- Fig. 8 is a block diagram of a character prediction module according to an exemplary embodiment.
- Fig. 9 is a block diagram of a formula recognition model according to an exemplary embodiment.
- Fig. 10 is a flowchart showing a training formula recognition model according to an exemplary embodiment.
- Fig. 11 is a schematic diagram showing the relationship between a mathematical formula, a preset formula format and a latex tag according to an exemplary embodiment.
- Fig. 12 is a flowchart showing a training formula recognition model according to an exemplary embodiment.
- Fig. 13 is a block diagram of a mathematical formula recognition device according to an exemplary embodiment.
- the present disclosure provides a mathematical formula recognition method, and the inventive concept thereof includes:
- structural data is added alongside the character data; this structural data can represent the relative position relationship of some characters in the mathematical formula, making the recognized mathematical formula clearer and more accurate.
- it can enable the formula recognition model to more accurately predict the position of each character data in the mathematical formula, thereby improving the accuracy of the formula recognition model.
- a multi-scale counting module is set in the decoder of the formula recognition model.
- the multi-scale counting module can obtain the frequency of occurrence of character data in the predicted character set.
- the above frequency can be used to construct a loss function to realize the training process of the supervised formula recognition model, which is beneficial to improve the accuracy of character recognition by the formula recognition model, and then help improve the accuracy of the overall recognition of mathematical formulas.
- FIG. 1 is a flowchart of a mathematical formula recognition method according to an exemplary embodiment.
- a mathematical formula recognition method includes steps 11 to 13:
- step 11 an original image containing a mathematical formula is obtained.
- the electronic device can obtain an original image containing a mathematical formula.
- the electronic device can be provided with an image acquisition module, such as a camera or an image sensor.
- the electronic device can control the image acquisition module to shoot the object, thereby obtaining the original image.
- the electronic device can be provided with a communication module, such as a WiFi module, an infrared module, a USB module, etc., and the electronic device can communicate with the peer communication module of other devices through the above communication module, thereby reading the original image containing the mathematical formula from other devices.
- the electronic device does not determine whether the image contains a mathematical formula when acquiring the image. In the subsequent embodiments, it is assumed that the image contains a mathematical formula for the convenience of description.
- the original image in the subsequent embodiments may be a color image (i.e., an RGB image) or a grayscale image.
- the original image is implemented as a grayscale image, thereby reducing the amount of data processing of the formula recognition model and improving recognition efficiency.
- step 12 the original image is input into a formula recognition model to obtain a predicted character set output by the formula recognition model; the predicted character set includes character data and structure data.
- the electronic device may store a formula recognition model.
- the formula recognition model may include an encoder and a decoder. Referring to FIG. 2 , the encoder 21 is used to obtain an image feature map corresponding to the original image; the decoder 22 is used to determine a predicted character set corresponding to the original image according to the image feature map.
- the encoder 21 may be implemented by a DenseNet network or a Transformer network.
- the input data of the DenseNet network is an original image of size H*W*1
- the output data thereof is an image feature map
- the size of the image feature map is H/16*W/16*684.
- H represents the height of the original image
- W represents the width of the original image.
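The encoder's shape arithmetic stated above can be checked directly; the factor 16 and the 684 output channels are taken from the text, and the example input size is arbitrary.

```python
# Shape check for the encoder: the DenseNet-style backbone described in the
# text downsamples the H x W x 1 input by a factor of 16 in each spatial
# dimension and produces 684 output channels.

def encoder_output_shape(h, w):
    return (h // 16, w // 16, 684)

print(encoder_output_shape(128, 256))
```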
- the decoder 22 includes a multi-scale counting module, a cyclic decoding module, a feature fusion module and a character prediction module.
- the multi-scale counting module (MSCM) 31 is used to convert the image feature map E(X) into counting vectors (CV) of multiple preset scales.
- the cyclic decoding module 32 obtains the character latent vector M and the context vector corresponding to the image feature map;
- the feature fusion module 33 is used to perform linear transformation on the count vector CV, the character latent vector M and the previous predicted character vector, and fuse them to obtain the target vector LB;
- the character prediction module 34 is used to predict character data and structure data according to the target vector LB, respectively.
- the character data and structure data constitute a predicted character set.
- the target vector LB is the sum vector of the first linear vector, the second linear vector, the third linear vector and the fourth linear vector; that is, the values at the same position of each linear vector are summed to give the value of the sum vector at that position, so that after the summation process, the dimension of the target vector is the same as that of each linear vector.
- the first linear vector, the second linear vector, the third linear vector and the fourth linear vector are all linear vectors of 1*512 dimension, then the target vector is also a linear vector of 1*512 dimension.
- the multi-scale counting module 31 includes at least two sub-counting modules and a vector averaging sub-module; the at least two sub-counting modules contain convolution kernels of different sizes and are used to output sub-vectors of the same size, and each sub-vector is used to represent the number of times character data is predicted at different scales; the vector averaging sub-module is used to obtain the average vector of at least two sub-vectors to obtain a counting vector.
- the multi-scale counting module 31 includes a first sub-counting module and a second sub-counting module; the first sub-counting module is used to identify feature information of a first preset scale in the image feature map to obtain a first sub-vector; the second sub-counting module is used to identify feature information of a second preset scale in the image feature map to obtain a second sub-vector.
- the multi-scale counting module further includes a third sub-counting module, and the third sub-counting module is used to identify feature information of a third preset scale in the image feature map to obtain a third sub-vector.
- referring to FIG. 4, the multi-scale counting module 31 includes a first sub-counting module 41, a second sub-counting module 42, a third sub-counting module 43 and a vector averaging sub-module (Element Average) 44;
- the first sub-counting module 41 is used to identify feature information of a first preset scale in the image feature map E(X) to obtain a first sub-vector.
- the second sub-counting module 42 is used to identify feature information of a second preset scale in the image feature map E(X) to obtain a second sub-vector.
- the third sub-counting module 43 is used to identify feature information of a third preset scale in the image feature map E(X) to obtain a third sub-vector.
- the vector average submodule (Element Average) 44 is used to obtain the average vector of the first subvector, the second subvector and the third subvector to obtain a counting vector CV (Counting Vector).
- the counting vector CV is used to represent the number of each character.
- the first sub-counting module 41 includes a first convolution unit 411 (Conv3*3+BN), a channel unit 412, a conversion unit 413 (Conv1*1+Sigmoid) and a pooling unit 414 (Sum Pooling).
- the first convolution unit 411 is used to convert the image feature map E(X) into a first feature map using a convolution kernel of a first preset size.
- the first convolution unit also includes a BN (Batch Normalization) layer, which is used to convert the image feature map E(X) into a feature map of a preset dimension.
- the first convolution unit includes a convolution kernel Conv; in one example, the size of the convolution kernel, that is, the first preset size, is 3*3*684; and the number of convolution kernels is 512, so the dimension of the first feature map output by the first convolution unit is H/16*W/16*512.
- the channel unit 412 is used to assign corresponding weights to different channels of the first feature map to obtain an attention feature map; and multiply the attention feature map and the first feature map to obtain a channel feature map.
- the above-mentioned attention channel may include but is not limited to the following activation functions: ReLU and Sigmoid.
- the above-mentioned attention channel may also include functions such as GAP and Linear. Since different functions have different output value ranges, corresponding weights can be assigned to different channels of the first feature map to select the attention value range under different channels. Then, the channel unit 412 can multiply the weight value of each channel with the feature map of the preset dimension to obtain a channel feature map.
- the GAP function can perform average pooling on the first feature map, that is, average each H/16*W/16 plane to obtain one value per channel; after this is done for all 512 channels and the resulting weights are applied, a channel feature map of H/16*W/16*512 dimensions is obtained.
- the conversion unit 413 is used to scale the channel feature map to obtain a counting feature map.
- the conversion unit 413 is used to convert the channel feature map of different input dimensions into a feature map of another preset dimension, which is subsequently referred to as a counting feature map (Counting Map).
- the conversion unit includes a convolution kernel and a Sigmoid activation function.
- the size of the convolution kernel is 1*1*512, and the number of convolution kernels is K, so that K characters can be recognized, that is, the counting feature map is a 1*K dimensional vector.
- the pooling unit 414 is used to perform pooling processing on the count feature map to obtain a first sub-vector.
- the first convolution unit 411 performs convolution with a 3*3*684 kernel; after 512 convolution kernels, the number of channels of the image feature map is adjusted from 684 to 512. Then, the channel unit 412 obtains the weight value of each of the 512 channels and multiplies the weight value of each channel with the first feature map output by the first convolution unit 411 to obtain a channel feature map of H/16*W/16*512 dimensions. Afterwards, the conversion unit 413 adjusts the number of channels from 512 to 208 through 1*1*208 convolution. It should be noted that the above-mentioned 208 dimensions correspond to the number of character classes of the recognition network and can be adjusted as needed.
- the dimension of the counting feature map is H*W*208.
- the pooling unit 414 may sum each H*W plane to obtain a first sub-vector of 1*208 dimension, wherein the pixel value of each of the 208 channels of the first sub-vector corresponds to the number of times the character appears in the current formula.
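The sum pooling step above can be sketched as follows; the 2x2 maps and two channels are toy values standing in for the H*W*208 counting feature map.

```python
# Sketch of the sum pooling unit: summing each spatial plane of the counting
# feature map collapses it to one value per channel, and that value is read
# as the number of times the corresponding character occurs in the formula.

def sum_pool(counting_map):
    # counting_map: list of channels, each an H x W grid of per-pixel scores
    return [sum(sum(row) for row in channel) for channel in counting_map]

# two channels: character "x" fires at two positions, "+" at one
cmap = [[[0.5, 0.5], [1.0, 0.0]],
        [[0.0, 1.0], [0.0, 0.0]]]
print(sum_pool(cmap))
```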
- the second sub-counting module 42 includes a second convolution unit 421, a channel unit 422, a conversion unit 423 and a pooling unit 424.
- the second convolution unit 421 is used to convert the image feature map E(X) into a second feature map using a convolution kernel of a second preset size
- the channel unit 422 is used to assign corresponding weights to different channels of the second feature map to obtain an attention feature map, and multiply the attention feature map and the second feature map to obtain a channel feature map
- the conversion unit 423 is used to scale the channel feature map to obtain a count feature map
- the pooling unit 424 is used to pool the count feature map to obtain a second sub-vector.
- the structure of the second sub-counting module 42 is the same as that of the first sub-counting module 41; the difference is that their convolution units are different, that is, the second sub-counting module 42 uses a convolution kernel of the second preset size instead of the first preset size.
- the second convolution unit 421 includes a convolution kernel of size 5*5*684, that is, the second preset size is 5*5*684; and the number of convolution kernels is 512, so the dimension of the second feature map output by the second convolution unit 421 is H/16*W/16*512; after pooling, the dimension of the second sub-vector is 1*208.
- the third sub-counting module 43 includes a third convolution unit 431, a channel unit 432, a conversion unit 433 and a pooling unit 434.
- the third convolution unit 431 is used to convert the image feature map E(X) into a third feature map using a convolution kernel of a third preset size;
- the channel unit 432 is used to assign corresponding weights to different channels of the third feature map to obtain an attention feature map, and to multiply the attention feature map and the third feature map to obtain a channel feature map;
- a conversion unit 433 is used to scale the channel feature map to obtain a count feature map;
- a pooling unit 434 is used to pool the count feature map to obtain a third sub-vector.
- the structure of the third sub-counting module 43 is the same as that of the first sub-counting module 41, and the difference is that the convolution units of the two are different.
- the third convolution unit 431 includes convolution kernels of size 7*7*684, that is, the third preset size is 7*7*684; and the dimension of the third sub-vector is H*W*208.
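The three branches above use 3*3, 5*5 and 7*7 kernels but all emit sub-vectors of the same size, which the averaging submodule combines into the final count vector. A sketch with hypothetical 1*4 sub-vectors (1*208 in the patent):

```python
import numpy as np

# Hypothetical sub-vectors as produced by the 3x3, 5x5 and 7x7 branches.
v1 = np.array([1.0, 2.0, 0.0, 3.0])
v2 = np.array([1.2, 1.8, 0.1, 2.9])
v3 = np.array([0.8, 2.1, 0.0, 3.2])

# The vector averaging submodule: element-wise mean across scales.
count_vector = np.mean([v1, v2, v3], axis=0)
print(count_vector)  # per-class counts averaged over the three scales
```

Averaging the scales makes the count estimate robust to character size: a small superscript may be counted best by the 3*3 branch, a wide radical by the 7*7 branch.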
- the loop decoding module includes a gated loop unit, which is used to generate the character latent vector based on the latent vector and the previous predicted character vector.
- the loop decoding module includes a first gated loop unit and a second gated loop unit; the first gated loop unit is used to generate a first gated vector based on the latent vector and the previous predicted character vector; the second gated loop unit is used to generate the character latent vector based on the first gated vector. It should be noted that the number of gated loop units can be selected according to the specific scenario. When the recognition accuracy is met, the corresponding scheme falls within the protection scope of the present disclosure.
- the loop decoding module further includes an attention submodule (attention model).
- the loop decoding module 32 includes an attention submodule 61, a first gated recurrent unit 62, and a second gated recurrent unit 63; the first gated recurrent unit 62 is used to generate the first gating vector from the hidden vector and the previous predicted character vector.
- the two inputs of the first gated recurrent unit 62, the hidden vector and the previous predicted character vector, both have dimension 256.
- the initial value of the previous predicted character vector is a 256-dimensional vector representation of the start symbol "&lt;sos&gt;", and the values after the first cycle are the results of the previous cycle.
- the initial value of the hidden vector is a linear transformation (dimension 1*256) of the average value (dimension 1*684) of the pixels of each channel of the image feature map E(X) output by the encoder, that is, the 1*256-dimensional vector corresponding to the image feature map E(X); when it is not the first cycle, the hidden vector takes the value M, the character latent vector output by the second gated recurrent unit 63.
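The two-GRU decoding step can be sketched with a minimal numpy GRU cell. Dimensions are toy stand-ins (8 instead of 256), the weights are random, and the context vector is a placeholder for the attention submodule's output; this only illustrates the data flow, not the patent's trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: input x and previous hidden h -> new hidden state."""
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(1)
d = 8                                   # 256 in the patent, 8 here
W = lambda: rng.normal(size=(d, d)) * 0.1

y_prev = rng.normal(size=d)             # previous predicted character vector
h = rng.normal(size=d)                  # hidden vector

# First GRU: the gating vector from the hidden state and previous character.
g1 = gru_cell(y_prev, h, W(), W(), W(), W(), W(), W())

# Second GRU: the character latent vector M from the context vector and g1;
# the context vector here is a random stand-in for the attention output.
ctx = rng.normal(size=d)
M = gru_cell(ctx, g1, W(), W(), W(), W(), W(), W())
print(M.shape)  # (8,)
```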
- the character data can be obtained by predicting the character from the character latent vector M; then, a table lookup operation is performed on the character data to find the vector corresponding to that character data, and the found vector is used as the value of the previous predicted character vector.
- the attention submodule 61 is used to generate the context vector C from the image feature map E(X), the all-zero tensor att(X) and the first gating vector; the input of the attention submodule 61 is E(X), whose size is H/16*W/16*684. The output of the first gated recurrent unit 62 is the first gating vector, whose dimension is 1*256. The initial dimension of the all-zero tensor att(X) is 1*1*H/16*W/16, which is used to record the historical attention area.
- the all-zero tensor att(X) has its initialization value set to all zeros to facilitate the computation.
- its initialization value can also be set to other random numbers, and the corresponding results can also be obtained.
- α is the update vector of the attention submodule 61 illustrated in Figure 7 after each cycle.
- the second gated recurrent unit 63 is used to generate the character latent vector M from the first gating vector and the context vector C.
- the output result of the second gated recurrent unit 63 is M, whose dimension is 1*256.
- the attention submodule 61 includes a weight acquisition submodule 71 and a context acquisition submodule 72 .
- the weight acquisition submodule 71 is used to perform convolution and linear processing (Conv+Linear) on the all-zero tensor att(X) to obtain a first feature vector, to perform convolution processing (Conv) on the image feature map E(X) to obtain a second feature vector, and to perform linear processing (Linear) on the first gating vector to obtain a third feature vector; and to superimpose ("+") the first feature vector, the second feature vector and the third feature vector to obtain a fourth feature vector.
- superposition processing refers to adding the H/16*W/16*512-dimensional vector obtained after processing the all-zero tensor att(X) and the H/16*W/16*512-dimensional vector obtained after processing the image feature map E(X), giving an H/16*W/16*512-dimensional sum vector; then, linear processing is performed on the first gating vector to obtain a vector of dimension 1*512; finally, the H/16*W/16*512-dimensional sum vector is summed with the 1*512-dimensional vector to obtain the fourth feature vector of dimension H/16*W/16*512.
- the context acquisition submodule 72 is used to obtain the attention weight α after performing first activation processing (activation function tanh), linear conversion processing (Linear) and second activation processing (activation function softmax) on the fourth feature vector; and to multiply ("×") the image feature map E(X) by the attention weight, and then sum (sum) to obtain the context vector C.
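The tanh, linear, softmax and weighted-sum steps above can be sketched in numpy with toy dimensions (the patent uses H/16*W/16 spatial maps, 684 feature channels and a 512-dimensional attention space); the coverage tensor is taken after its projection, and all weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, C, D = 3, 4, 6, 5

E = rng.random((H, W, C))          # image feature map E(X)
coverage = np.zeros((H, W, D))     # projected all-zero coverage tensor
gate = rng.random(D)               # linearly projected first gating vector

proj_E = rng.random((C, D))        # stand-in for the 1x1 conv on E(X)
f4 = np.tanh(coverage + E @ proj_E + gate)  # fourth feature vector, then tanh

w_out = rng.random(D)              # linear layer compressing channels to 1
scores = f4 @ w_out                # one score per spatial position, (H, W)
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over all positions

# Context vector: attention-weighted sum of E(X) over the spatial grid.
context = (E * alpha[..., None]).sum(axis=(0, 1))
print(context.shape)  # (6,)
```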
- the number of channels of the first gating vector is adjusted to 512 after linear processing.
- the image feature map E(X) is adjusted to 512 channels after 1*1 convolution.
- the first feature vector, the second feature vector and the third feature vector obtained after the above three processing paths are added, then activated by the activation function tanh, then the number of channels is compressed to 1 through linear transformation, and finally the attention weight α is obtained after activation by the activation function softmax.
- the character prediction submodule 81 includes a linear conversion unit (Linear) and an activation unit (sigmoid).
- the linear conversion unit 811 is used to perform linear conversion on the target vector;
- the activation unit 812 is used to perform activation processing on the vector after linear conversion to obtain predicted character data.
- the structure prediction submodule 82 includes a linear conversion unit (Linear) and an activation unit (softmax).
- the linear conversion unit is used to perform linear conversion on the target vector; the activation unit is used to perform activation processing on the vector after linear conversion to obtain predicted character data.
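The two prediction heads above (character: Linear + sigmoid; structure: Linear + softmax) can be sketched as follows. Dimensions and weights are illustrative stand-ins; the patent's target vector is 1*256 and its character vocabulary has 208 entries:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
target = rng.normal(size=6)               # fused target vector (toy size)

W_char = rng.normal(size=(6, 10))         # 10 stand-in character classes
W_struct = rng.normal(size=(6, 7))        # 7 relative-position structure classes

char_scores = sigmoid(target @ W_char)    # character head: Linear + sigmoid
struct_probs = softmax(target @ W_struct) # structure head: Linear + softmax
print(char_scores.shape, struct_probs.shape)
```

The softmax head yields one probability distribution over the structure classes, while the sigmoid head scores each character class independently.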
- the structure of the formula recognition model may be as shown in FIG. 9 .
- training sample data is obtained; the training sample data includes a data set of multiple mathematical formulas; each data set includes character data and structure data obtained after expansion according to a preset formula format; the structure data is used to represent the relative position relationship of some characters in the mathematical formula.
- the mathematical formula includes at least one of the following structures: fraction, subscript, superscript, radical, overline, overarrow, triangle, limit, summation symbol, etc.
- the symbols of the above structures are respectively \frac, _, ^, \sqrt, \overline, \overrightarrow, \widehat, \limits, \sum. It is understandable that other structures in the mathematical formula, such as log, ln, lg or exponents, are also applicable to the scheme of the present disclosure.
- training sample data can be formed in this embodiment.
- the first column (from top to bottom) shows the formula using latex tags with the structure token "struct" and the relative position relationships (7 types) added, and the 2nd to 7th columns show the relative position relationships after the structure is added.
- the entire table constitutes the training sample data. It can be seen that in this embodiment, by adding structural data to the training sample data, the mathematical formula can be more accurately represented, so as to facilitate the subsequent accurate restoration of the mathematical formula.
- pre_string = pre_string + current_symbol, that is, the current character is appended to pre_string. Then, it is determined whether the current character current_symbol is 'struct' or '&lt;eos&gt;'; when it is, it is further determined whether current_symbol is 'struct'.
- when current_symbol is 'struct', the structure information is predicted and the structure data is pushed onto the stack; the top element of the stack is then popped, and pre_string is adjusted in combination with the popped top element and parent_symbol.
- when current_symbol is not 'struct', it is judged whether the stack is empty. When the stack is not empty, the top element of the stack is popped, and pre_string is adjusted in combination with the popped top element and parent_symbol. When the stack is empty, it is judged whether the numbers of "{" and "}" in pre_string are equal; if they are equal, the loop is exited and the recognition result is returned.
- after adjusting pre_string, it is determined whether the number of loops exceeds 100. If so, the loop is exited and the recognition result is returned; otherwise, the next character is recognized, the current character current_symbol is updated, and the loop continues until the recognition result is returned.
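A much-simplified sketch of this stack-based restoration loop is shown below. The 'end' marker and the push-a-brace-pair convention are assumptions made for illustration; the patent's loop also predicts tokens from the model and consults parent_symbol, which is omitted here:

```python
def restore_formula(tokens, max_loops=100):
    """Rebuild a latex-like string from a flat token stream with 'struct' markers."""
    pre_string = []
    stack = []
    for i, tok in enumerate(tokens):
        if i >= max_loops:        # safety cap on the number of loop iterations
            break
        if tok == '<eos>':
            break
        if tok == 'struct':
            # A structure token opens a child position: emit "{" and remember
            # that a matching "}" must be emitted later.
            pre_string.append('{')
            stack.append('}')
        elif tok == 'end':        # hypothetical "child finished" marker
            if stack:
                pre_string.append(stack.pop())
        else:
            pre_string.append(tok)
    # Close any positions still open so the "{" and "}" counts match.
    while stack:
        pre_string.append(stack.pop())
    return ''.join(pre_string)

result = restore_formula(
    ['\\frac', 'struct', 'a', 'end', 'struct', 'b', 'end', '<eos>'])
print(result)  # \frac{a}{b}
```

Because every "{" is paired with a pushed "}", the returned string is always brace-balanced, which is exactly the property the loop's exit condition checks.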
- in step 13, the position of the character data in the mathematical formula is restored according to the preset formula format and the structure data to obtain the mathematical formula in the original image.
- an electronic device comprising: a display screen; a processor; and a memory for storing a computer program executable by the processor, wherein the processor is configured to execute the computer program in the memory to implement the above method.
- a computer-readable storage medium such as a memory including an executable computer program, and the executable computer program can be executed by a processor to implement the method of the above embodiment.
- the readable storage medium can be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
Abstract
Description
The present disclosure relates to the field of data processing technology, and in particular to a mathematical formula recognition method, apparatus, electronic device and readable storage medium.

At present, handwritten formula recognition is usually implemented with an encoding-decoding architecture. For example, a handwritten formula image is input into an encoder, which extracts an image feature map; the image feature map is then input into a decoder, which recognizes the handwritten formula character by character and finally outputs the recognized formula.

The encoding-decoding architecture in existing solutions performs character-by-character prediction based on latex tags. However, when the formula to be recognized contains mutually nested complex structures such as fractions, radicals, exponents and logarithms, it is very easy for "{" or "}" to be lost; in particular, when the formula is long, some characters are easily lost during recognition, resulting in recognition errors and reduced recognition accuracy.
Summary of the invention
The present disclosure provides a mathematical formula recognition method, apparatus, electronic device and readable storage medium to address the deficiencies of the related art.

According to a first aspect of the embodiments of the present disclosure, a mathematical formula recognition method is provided, the method comprising: acquiring an original image containing a mathematical formula; inputting the original image into a formula recognition model to obtain a predicted character set output by the formula recognition model, the predicted character set comprising character data and structure data; and restoring the position of the character data in the mathematical formula according to a preset formula format and the structure data to obtain the mathematical formula in the original image.

Optionally, the formula recognition model includes an encoder and a decoder; the encoder is used to obtain an image feature map corresponding to the original image; and the decoder is used to determine a predicted character set corresponding to the original image based on the image feature map.

Optionally, the encoder includes a DenseNet network or a Transformer network.

Optionally, the decoder includes a multi-scale counting module, a loop decoding module, a feature fusion module and a character prediction module; the multi-scale counting module is used to convert the image feature map into counting vectors of multiple preset scales; the loop decoding module is used to obtain the character latent vector according to the image feature map; the feature fusion module is used to fuse the counting vector, the character latent vector and the previous predicted character vector to obtain a target vector; and the character prediction module is used to predict character data and structure data respectively according to the target vector.

Optionally, the loop decoding module is also used to obtain a context vector based on the image feature map; the feature fusion module is also used to fuse the counting vector, the context vector, the character latent vector and the previous predicted character vector to obtain the target vector.

Optionally, the feature fusion module performs linear transformation on the counting vector, the context vector, the character latent vector and the previous predicted character vector respectively, obtaining a first linear vector, a second linear vector, a third linear vector and a fourth linear vector; and obtains the sum of the first, second, third and fourth linear vectors to obtain the target vector.
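The fusion step, one linear projection per input followed by a sum, can be sketched in numpy. Dimensions here are toy stand-ins (the patent fuses into a 1*256 target vector), and the projection matrices are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6                            # shared fused dimension (256 in the patent)

count_v = rng.normal(size=4)     # counting vector
ctx_v = rng.normal(size=5)       # context vector
latent_v = rng.normal(size=3)    # character latent vector
prev_v = rng.normal(size=3)      # previous predicted character vector

# One linear projection per input, all mapping into the shared dimension d.
W1, W2, W3, W4 = (rng.normal(size=(v.size, d))
                  for v in (count_v, ctx_v, latent_v, prev_v))

# Target vector: sum of the four projected inputs.
target = count_v @ W1 + ctx_v @ W2 + latent_v @ W3 + prev_v @ W4
print(target.shape)  # (6,)
```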
Optionally, the multi-scale counting module includes at least two sub-counting modules and a vector averaging submodule; the at least two sub-counting modules contain convolution kernels of different sizes and are used to output sub-vectors of the same size, each sub-vector representing the predicted number of occurrences of character data at a different scale; the vector averaging submodule is used to obtain the average of the at least two sub-vectors to obtain the counting vector.

Optionally, the multi-scale counting module includes a first sub-counting module and a second sub-counting module; the first sub-counting module is used to identify feature information of a first preset scale in the image feature map to obtain a first sub-vector, the first sub-vector representing the number of occurrences of character data recognized under the convolution kernel of the first preset scale; the second sub-counting module is used to identify feature information of a second preset scale in the image feature map to obtain a second sub-vector, the second sub-vector representing the number of occurrences of character data recognized under the convolution kernel of the second preset scale.

Optionally, the first sub-counting module includes a first convolution unit, a channel unit, a conversion unit and a pooling unit; the first convolution unit is used to convert the image feature map into a first feature map using a convolution kernel of a first preset size; the channel unit is used to assign corresponding weights to different channels of the first feature map to obtain an attention feature map, and to multiply the attention feature map and the first feature map to obtain a channel feature map; the conversion unit is used to scale the channel feature map to obtain a counting feature map; and the pooling unit is used to pool the counting feature map to obtain the first sub-vector.

Optionally, the second sub-counting module includes a second convolution unit, a channel unit, a conversion unit and a pooling unit; the second convolution unit is used to convert the image feature map into a second feature map using a convolution kernel of a second preset size; the channel unit is used to assign corresponding weights to different channels of the second feature map to obtain an attention feature map, and to multiply the attention feature map and the second feature map to obtain a channel feature map; the conversion unit is used to scale the channel feature map to obtain a counting feature map; and the pooling unit is used to pool the counting feature map to obtain the second sub-vector.

Optionally, the multi-scale counting module further includes a third sub-counting module, which is used to identify feature information of a third preset scale in the image feature map to obtain a third sub-vector.

Optionally, the third sub-counting module includes a third convolution unit, a channel unit, a conversion unit and a pooling unit; the third convolution unit is used to convert the image feature map into a third feature map using a convolution kernel of a third preset size; the channel unit is used to assign corresponding weights to different channels of the third feature map to obtain an attention feature map, and to multiply the attention feature map and the third feature map to obtain a channel feature map; the conversion unit is used to scale the channel feature map to obtain a counting feature map; and the pooling unit is used to pool the counting feature map to obtain the third sub-vector.
Optionally, the loop decoding module includes a gated recurrent unit, which is used to generate the character latent vector according to the hidden vector and the previous predicted character vector.

Optionally, the loop decoding module includes a first gated recurrent unit and a second gated recurrent unit; the first gated recurrent unit is used to generate a first gating vector according to the hidden vector and the previous predicted character vector; the second gated recurrent unit is used to generate the character latent vector according to the first gating vector.

Optionally, the loop decoding module includes an attention submodule; the attention submodule is used to generate a context vector according to the image feature map, the all-zero tensor and the first gating vector; the second gated recurrent unit is used to generate the character latent vector according to the first gating vector and the context vector.

Optionally, the attention submodule includes a weight acquisition submodule and a context acquisition submodule; the weight acquisition submodule is used to perform convolution and linear processing on the all-zero tensor to obtain a first feature vector, to perform convolution processing on the image feature map to obtain a second feature vector, and to perform linear processing on the first gating vector to obtain a third feature vector; and to superimpose the first feature vector, the second feature vector and the third feature vector to obtain a fourth feature vector; the context acquisition submodule is used to obtain an attention weight after performing first activation processing, linear conversion processing and second activation processing on the fourth feature vector; and to multiply the image feature map by the attention weight and then sum to obtain the context vector.

Optionally, the character prediction module includes a character prediction submodule and a structure prediction submodule; the character prediction submodule is used to process the target vector to obtain predicted character data; the structure prediction submodule is used to process the target vector to obtain predicted structure data.

Optionally, the character prediction submodule includes a linear conversion unit and an activation unit; the linear conversion unit is used to perform linear conversion on the target vector; the activation unit is used to perform activation processing on the linearly converted vector to obtain the predicted character data.

Optionally, the structure prediction submodule includes a linear conversion unit and an activation unit; the linear conversion unit is used to perform linear conversion on the target vector; the activation unit is used to perform activation processing on the linearly converted vector to obtain the predicted structure data.
Optionally, the formula recognition model is trained through the following steps: obtaining training sample data, the training sample data comprising data sets of multiple mathematical formulas, each data set comprising character data and structure data obtained after expansion according to a preset formula format, the structure data representing the relative positional relationships of some characters in the mathematical formula; training the formula recognition model to be trained with the training sample data to obtain a predicted character set output by the formula recognition model, the predicted character set comprising character data and structure data; and stopping training when it is determined that the character data meets a preset condition, obtaining a trained formula recognition model.

Optionally, determining that the character data meets the preset condition includes: obtaining the frequency of occurrence of each character datum in the predicted character set to obtain the predicted frequency corresponding to each character datum; obtaining the difference between the predicted frequency corresponding to each character datum and the labeled frequency in the label of the training sample data to obtain the prediction error value corresponding to each character datum; obtaining the average of the prediction error values corresponding to the character data in the predicted character set to obtain an error average; and when the error average is determined to be less than or equal to a preset average threshold, determining that the character data meets the preset condition.
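The stopping check, mean gap between predicted and labeled character frequencies compared against a threshold, can be sketched in plain Python. The frequencies and the 0.5 threshold below are hypothetical illustration values:

```python
# Hypothetical predicted vs labeled character frequencies for one formula.
predicted_freq = {'x': 3, '+': 1, '2': 2}
labeled_freq   = {'x': 3, '+': 1, '2': 1}

# Per-character prediction error, then the average over all characters.
errors = [abs(predicted_freq[c] - labeled_freq[c]) for c in labeled_freq]
error_avg = sum(errors) / len(errors)

threshold = 0.5                      # preset average threshold (assumed)
stop_training = error_avg <= threshold
print(error_avg, stop_training)      # 0.333..., True
```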
Optionally, the training sample data is obtained through the following steps: obtaining multiple original images containing mathematical formulas; obtaining the character data corresponding to each original image; and creating structure data according to the preset formula format and the character data and arranging the character data to obtain the training sample data corresponding to each original image.
According to a second aspect of the embodiments of the present disclosure, a mathematical formula recognition apparatus is provided, the apparatus comprising: an original image acquisition module, used to acquire an original image containing a mathematical formula; a prediction set acquisition module, used to input the original image into a formula recognition model to obtain a predicted character set output by the formula recognition model, the predicted character set comprising character data and structure data; and a mathematical formula acquisition module, used to restore the position of the character data in the mathematical formula according to a preset formula format and the structure data to obtain the mathematical formula in the original image.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, comprising a processor and a memory for storing a computer program executable by the processor, wherein the processor is configured to execute the computer program in the memory to implement the method described in the first aspect.

According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; when an executable computer program in the storage medium is executed by a processor, the method described in the first aspect can be implemented.

The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:

As can be seen from the above embodiments, the scheme provided by the embodiments of the present disclosure can acquire an original image containing a mathematical formula; then input the original image into a formula recognition model to obtain a predicted character set output by the formula recognition model, the predicted character set comprising character data and structure data; and then restore the position of the character data in the mathematical formula according to a preset formula format and the structure data to obtain the mathematical formula in the original image. In this way, by introducing structure data, this embodiment can represent the relative positional relationships of the character data, eliminating the problem of lost characters during the recognition of complex formulas and helping to improve recognition accuracy.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
图1是根据一示例性实施例示出的一种数学公式识别方法的流程图。Fig. 1 is a flow chart showing a method for identifying a mathematical formula according to an exemplary embodiment.
图2是根据一示例性实施例示出的一种公式识别模型的框图。Fig. 2 is a block diagram of a formula recognition model according to an exemplary embodiment.
图3是根据一示例性实施例示出的一种解码器的框图。Fig. 3 is a block diagram of a decoder according to an exemplary embodiment.
图4是根据一示例性实施例示出的一种多尺度计数模块的框图。Fig. 4 is a block diagram of a multi-scale counting module according to an exemplary embodiment.
图5是根据一示例性实施例示出的一种第一子计数模块的框图。Fig. 5 is a block diagram of a first sub-counting module according to an exemplary embodiment.
图6是根据一示例性实施例示出的一种循环解码模块的框图。Fig. 6 is a block diagram showing a cyclic decoding module according to an exemplary embodiment.
图7是根据一示例性实施例示出的一种注意力子模块的框图。Fig. 7 is a block diagram of an attention submodule according to an exemplary embodiment.
图8是根据一示例性实施例示出的一种字符预测模块的框图。Fig. 8 is a block diagram of a character prediction module according to an exemplary embodiment.
图9是根据一示例性实施例示出的一种公式识别模型的框图。Fig. 9 is a block diagram of a formula recognition model according to an exemplary embodiment.
图10是根据一示例性实施例示出的一种训练公式识别模型的流程图。Fig. 10 is a flowchart showing a training formula recognition model according to an exemplary embodiment.
图11是根据一示例性实施例示出的一种数学公式、预设公式格式和latex标签的关系示意图。Fig. 11 is a schematic diagram showing the relationship between a mathematical formula, a preset formula format and a latex tag according to an exemplary embodiment.
图12是根据一示例性实施例示出的一种训练公式识别模型的流程图。Fig. 12 is a flowchart of training a formula recognition model according to an exemplary embodiment.
图13是根据一示例性实施例示出的一种数学公式识别装置的框图。Fig. 13 is a block diagram of a mathematical formula recognition device according to an exemplary embodiment.
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性所描述的实施例并不代表与本公开相一致的所有实施例。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置例子。需要说明的是,在不冲突的情况下,下述的实施例及实施方式中的特征可以相互组合。Exemplary embodiments will be described in detail herein, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described exemplarily below do not represent all embodiments consistent with the present disclosure. Instead, they are merely examples of devices consistent with some aspects of the present disclosure as detailed in the attached claims. It should be noted that the features in the following embodiments and implementations may be combined with each other without conflict.
考虑到相关技术中复杂结构的数学公式在识别过程中可能出现丢失“{”或“}”的情况,尤其是待识别公式较长时识别过程中容易丢失部分字符,可能出现数学公式识别错误的问题。Considering that mathematical formulas with complex structures in the related art may lose "{" or "}" during recognition, and that some characters are easily lost when the formula to be recognized is long, mathematical formula recognition errors may occur.
为解决上述技术问题,本公开实施例提供一种数学公式识别方法,其发明构思包括:In order to solve the above technical problems, the present disclosure provides a mathematical formula recognition method, and the inventive concept thereof includes:
第一,在训练样本数据中加入结构信息,该结构信息可以表示数学公式中部分字符的相对位置关系,使数学公式更清晰准确;在公式识别模型训练时,能够使公式识别模型更准确的预测数学公式中各个字符数据的位置,提高公式识别模型的准确率。First, structural information is added to the training sample data. This structural information can represent the relative position relationship of some characters in the mathematical formula, making the mathematical formula clearer and more accurate. When training the formula recognition model, it can enable the formula recognition model to more accurately predict the position of each character data in the mathematical formula, thereby improving the accuracy of the formula recognition model.
第二,公式识别模型的解码器内设置有多尺度计数模块,该多尺度计数模块可以获取预测字符集中字符数据出现的频次,可以利用上述频次构造损失函数,以实现监督公式识别模型的训练过程,有利于提高公式识别模型识别字符的准确率,进而有利于提高数学公式整体识别的准确率。Second, a multi-scale counting module is set in the decoder of the formula recognition model. The multi-scale counting module can obtain the frequency of occurrence of character data in the predicted character set. The above frequency can be used to construct a loss function to realize the training process of the supervised formula recognition model, which is beneficial to improve the accuracy of character recognition by the formula recognition model, and then help improve the accuracy of the overall recognition of mathematical formulas.
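The paragraph above says the predicted character frequencies are used to construct a loss that supervises training, but does not specify the loss form. A minimal sketch, assuming a smooth-L1 penalty (a common choice for counting supervision; the function name and loss form are illustrative assumptions, not from the source):

```python
import numpy as np

def counting_loss(pred_counts, gt_counts):
    # Smooth L1 distance between the predicted per-character frequencies
    # and the true occurrence counts derived from the label sequence.
    diff = np.abs(np.asarray(pred_counts, float) - np.asarray(gt_counts, float))
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean())

print(counting_loss([2.0, 1.0, 0.0], [2, 1, 0]))  # 0.0
print(counting_loss([3.0], [1]))                  # 1.5
```

Minimizing such a loss pushes the counting module toward the true number of occurrences of each character, which is the supervision effect described above.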
本公开实施例提供了一种数学公式识别方法,可以适用于电子设备,该电子设备可以包括但不限于智能手机、平板电脑、智能显示屏或者电子白板等设备。图1是根据一示例性实施例示出的一种数学公式识别方法的流程图。The present disclosure provides a mathematical formula recognition method, which can be applied to electronic devices, including but not limited to smart phones, tablet computers, smart display screens, electronic whiteboards, etc. FIG1 is a flowchart of a mathematical formula recognition method according to an exemplary embodiment.
参见图1,一种数学公式识别方法,包括步骤11~步骤13:Referring to FIG. 1 , a mathematical formula recognition method includes steps 11 to 13:
在步骤11中,获取包含数学公式的原始图像。In step 11, an original image containing a mathematical formula is obtained.
本实施例中,电子设备可以获取包含数学公式的原始图像。在一示例中,电子设备可以设置有图像采集模组,如摄像头或者图像传感器等。当有采集原始图像的需求,例如图像采集模组的预览区域内存在物体时,例如上述物体可以是写有数学公式的纸张,此时电子设备可以控制图像采集模组来拍摄物体,从而得到原始图像。在另一示例中,电子设备可以设置有通信模组,如WiFi模组、红外模组、USB模组等等,电子设备可以通过上述通信模组与其他设备的对端通信模组进行通信,从而从其他设备读取包含数学公式的原始图像。In this embodiment, the electronic device can obtain an original image containing a mathematical formula. In one example, the electronic device can be provided with an image acquisition module, such as a camera or an image sensor. When there is a need to acquire the original image, for example, when there is an object in the preview area of the image acquisition module, for example, the object can be a paper with a mathematical formula written on it, the electronic device can control the image acquisition module to shoot the object, thereby obtaining the original image. In another example, the electronic device can be provided with a communication module, such as a WiFi module, an infrared module, a USB module, etc., and the electronic device can communicate with the peer communication module of other devices through the above communication module, thereby reading the original image containing the mathematical formula from other devices.
需要说明的是,电子设备在获取图像时并不确定图像中包含数学公式。后续实施例中均默认图像中包含数学公式,以方便描述。It should be noted that the electronic device does not determine whether the image contains a mathematical formula when acquiring the image. In the subsequent embodiments, it is assumed that the image contains a mathematical formula for the convenience of description.
另需要说明的是,后续各实施例中原始图像可以是彩色图像(即RGB图像)或者灰度图像。在一示例中,原始图像采用灰度图像实现,从而降低公式识别模型的数据处理量,提升识别效率。It should also be noted that the original image in the subsequent embodiments may be a color image (ie, an RGB image) or a grayscale image. In one example, the original image is implemented as a grayscale image, thereby reducing the amount of data processing of the formula recognition model and improving recognition efficiency.
在步骤12中,将所述原始图像输入到公式识别模型,获得所述公式识别模型输出预测字符集合;所述预测字符集合包括字符数据和结构数据。In step 12, the original image is input into a formula recognition model to obtain a predicted character set output by the formula recognition model; the predicted character set includes character data and structure data.
本实施例中,电子设备内可以存储公式识别模型。该公式识别模型可以包括编码器和解码器。参见图2,编码器21用于获取所述原始图像对应的图像特征图;解码器22用于根据所述图像特征图确定所述原始图像对应的预测字符集合。In this embodiment, the electronic device may store a formula recognition model. The formula recognition model may include an encoder and a decoder. Referring to FIG. 2 , the encoder 21 is used to obtain an image feature map corresponding to the original image; the decoder 22 is used to determine a predicted character set corresponding to the original image according to the image feature map.
在一示例中,编码器21可以采用DenseNet网络或者Transformer网络实现。以编码器21采用DenseNet网络实现为例,该DenseNet网络的输入数据为尺寸为H*W*1的原始图像,其输出数据为图像特征图,该图像特征图的尺寸为H/16*W/16*684。其中,H表示原始图像的高度,W表示原始图像的宽度。In one example, the encoder 21 may be implemented by a DenseNet network or a Transformer network. Taking the encoder 21 implemented by a DenseNet network as an example, the input data of the DenseNet network is an original image of size H*W*1, and its output is an image feature map of size H/16*W/16*684, where H represents the height of the original image and W represents the width of the original image.
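The encoder's input/output shapes described above can be sketched as a small helper (the function name is illustrative; only the H/16, W/16, 684-channel relationship comes from the text):

```python
def encoder_output_shape(h, w, channels=684, stride=16):
    # The DenseNet-style encoder takes an H*W*1 grayscale image and
    # downsamples each spatial dimension by a factor of 16, emitting a
    # feature map of size H/16 * W/16 * 684.
    return (h // stride, w // stride, channels)

print(encoder_output_shape(64, 256))  # (4, 16, 684)
```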
在一示例中,解码器22包括多尺度计数模块、循环解码模块、特征融合模块和字符预测模块。参见图3,多尺度计数模块(Multi-Scale Counting Module,MSCM)31用于将图像特征图E(X)转换为多个预设尺度的计数向量(Counting Vector,CV)。In one example, the decoder 22 includes a multi-scale counting module, a cyclic decoding module, a feature fusion module and a character prediction module. Referring to FIG. 3 , the multi-scale counting module (MSCM) 31 is used to convert the image feature map E(X) into counting vectors (CV) of multiple preset scales.
循环解码模块32获取所述图像特征图对应的字符隐向量M和上下文向量Ω;The loop decoding module 32 obtains the character latent vector M and the context vector Ω corresponding to the image feature map;
特征融合模块33用于对计数向量CV、字符隐向量M和前一预测字符向量进行线性转换处理,得到目标向量LB;The feature fusion module 33 is used to perform linear transformation processing on the counting vector CV, the character latent vector M and the previous predicted character vector to obtain the target vector LB;
字符预测模块34用于根据目标向量LB分别预测字符数据和结构数据,上述字符数据和结构数据构成预测字符集合。The character prediction module 34 is used to predict character data and structure data according to the target vector LB, respectively. The character data and structure data constitute a predicted character set.
需要说明的是,在一示例中,目标向量LB是第一线性向量、第三线性向量和第四线性向量的和向量,即获取各个线性向量同一位置数值的和值作为和向量在该同一位置的取值,这样在求和处理后,目标向量与各个线性向量的维度相同。例如,第一线性向量、第二线性向量、第三线性向量和第四线性向量均为1*512维度的线性向量,则目标向量也是1*512维度的线性向量。It should be noted that, in one example, the target vector LB is the sum vector of the first linear vector, the third linear vector and the fourth linear vector, that is, the sum of the values of the linear vectors at the same position is taken as the value of the sum vector at that position, so that after the summation, the target vector has the same dimension as each linear vector. For example, if the first linear vector, the second linear vector, the third linear vector and the fourth linear vector are all 1*512-dimensional linear vectors, then the target vector is also a 1*512-dimensional linear vector.
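The position-wise summation described above can be sketched directly (the function name is illustrative; the element-wise sum and the preserved 1*512 dimensionality come from the text):

```python
import numpy as np

def fuse_to_target_vector(linear_vectors):
    # The value of the target vector at each position is the sum of the
    # values of every linear vector at that same position, so the result
    # keeps the shared dimensionality of the inputs (e.g. 1*512).
    return np.stack(linear_vectors).sum(axis=0)

vecs = [np.full((1, 512), v) for v in (1.0, 2.0, 3.0)]
lb = fuse_to_target_vector(vecs)
print(lb.shape, lb[0, 0])  # (1, 512) 6.0
```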
在一示例中,多尺度计数模块31包括至少两个子计数模块和向量平均子模块;所述至少两个子计数模块包含不同尺寸的卷积核且用于输出相同尺寸的子向量,各个子向量用于表示不同尺度下预测字符数据的次数;所述向量平均子模块用于获取至少两个子向量的平均向量,得到计数向量。In one example, the multi-scale counting module 31 includes at least two sub-counting modules and a vector averaging sub-module; the at least two sub-counting modules contain convolution kernels of different sizes and are used to output sub-vectors of the same size, and each sub-vector is used to represent the number of times character data is predicted at different scales; the vector averaging sub-module is used to obtain the average vector of at least two sub-vectors to obtain a counting vector.
在一示例中,多尺度计数模块31包括第一子计数模块和第二子计数模块;第一子计数模块用于识别所述图像特征图中第一预设尺度的特征信息,得到第一子向量;第二子计数模块用于识别所述图像特征图中第二预设尺度的特征信息,得到第二子向量。In one example, the multi-scale counting module 31 includes a first sub-counting module and a second sub-counting module; the first sub-counting module is used to identify feature information of a first preset scale in the image feature map to obtain a first sub-vector; the second sub-counting module is used to identify feature information of a second preset scale in the image feature map to obtain a second sub-vector.
在另一示例中,多尺度计数模块还包括第三子计数模块,所述第三子计数模块,用于识别所述图像特征图中第三预设尺度的特征信息,得到第三子向量。本示例中,多尺度计数模块31包括第一子计数模块、第二子计数模块、第三子计数模块和向量平均子模块(Element Average),即参见图4的第一子计数模块41、第二子计数模块42、第三子计数模块43和向量平均子模块(Element Average)44。In another example, the multi-scale counting module further includes a third sub-counting module, which is used to identify feature information of a third preset scale in the image feature map to obtain a third sub-vector. In this example, the multi-scale counting module 31 includes a first sub-counting module, a second sub-counting module, a third sub-counting module and a vector average sub-module (Element Average), namely, referring to FIG. 4 , the first sub-counting module 41, the second sub-counting module 42, the third sub-counting module 43 and the vector average sub-module (Element Average) 44.
第一子计数模块41,用于识别图像特征图E(X)中第一预设尺度的特征信息,得到第一子向量。The first sub-counting module 41 is used to identify feature information of a first preset scale in the image feature map E(X) to obtain a first sub-vector.
第二子计数模块42,用于识别图像特征图E(X)中第二预设尺度的特征信息,得到第二子向量。The second sub-counting module 42 is used to identify feature information of a second preset scale in the image feature map E(X) to obtain a second sub-vector.
第三子计数模块43,用于识别图像特征图E(X)中第三预设尺度的特征信息,得到第三子向量。The third sub-counting module 43 is used to identify feature information of a third preset scale in the image feature map E(X) to obtain a third sub-vector.
向量平均子模块(Element Average)44,用于获取第一子向量、第二子向量和第三子向量的平均向量,得到计数向量CV(Counting Vector)。其中,计数向量CV用于表示各个字符的个数。The vector average submodule (Element Average) 44 is used to obtain the average vector of the first subvector, the second subvector and the third subvector to obtain a counting vector CV (Counting Vector). The counting vector CV is used to represent the number of each character.
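The Element Average step above can be sketched as an element-wise mean of the three sub-vectors (the function name is illustrative; the averaging and the "count per character" meaning come from the text):

```python
import numpy as np

def counting_vector(sub_vectors):
    # Element-wise average of the sub-vectors produced at each scale;
    # entry k of the result is the model's estimate of how many times
    # character k occurs in the formula.
    return np.mean(np.stack(sub_vectors), axis=0)

s1 = np.array([2.0, 0.0])
s2 = np.array([2.0, 1.0])
s3 = np.array([2.0, 2.0])
print(counting_vector([s1, s2, s3]))  # [2. 1.]
```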
在一实施例中,参见图5,第一子计数模块41包括第一卷积单元411(Conv3*3+BN)、通道单元412、转换单元413(Conv1*1+Sigmoid)和池化单元414(Sum Pooling)。In one embodiment, referring to FIG. 5 , the first sub-counting module 41 includes a first convolution unit 411 (Conv3*3+BN), a channel unit 412, a conversion unit 413 (Conv1*1+Sigmoid) and a pooling unit 414 (Sum Pooling).
第一卷积单元411,用于利用第一预设尺寸的卷积核将图像特征图E(X)转换成第一特征图。该第一卷积单元还包括BN(Batch Normalization,批量样本的归一化)层,用于将图像特征图E(X)转换为预设维度的特征图。并且,该第一卷积单元包括卷积核Conv,在一示例中,该卷积核的尺寸即第一预设尺寸为3*3*684;并且卷积核的数量为512个,这样第一卷积单元输出的第一特征图的维度为H/16*W/16*512。The first convolution unit 411 is used to convert the image feature map E(X) into a first feature map using a convolution kernel of a first preset size. The first convolution unit also includes a BN (Batch Normalization) layer, which is used to convert the image feature map E(X) into a feature map of a preset dimension. In addition, the first convolution unit includes a convolution kernel Conv. In one example, the size of the convolution kernel, that is, the first preset size, is 3*3*684; and the number of convolution kernels is 512, so the dimension of the first feature map output by the first convolution unit is H/16*W/16*512.
通道单元412用于对所述第一特征图不同的通道赋予对应的权重,得到注意力特征图;并将所述注意力特征图和所述第一特征图作乘法处理后,得到通道特征图。在一示例中,上述关注通道可包括但不限于以下激活函数:ReLU和Sigmoid,当然上述关注通道还可以包括GAP和Linear等函数,由于不同函数具有不同的输出值范围,从而可为第一特征图不同的通道赋予对应的权重以选择出不同通道下的关注值范围。然后,该通道单元412可将各通道的权重值与预设维度的特征图相乘,从而得到通道特征图。The channel unit 412 is used to assign corresponding weights to different channels of the first feature map to obtain an attention feature map; and multiply the attention feature map and the first feature map to obtain a channel feature map. In one example, the above-mentioned attention channel may include but is not limited to the following activation functions: ReLU and Sigmoid. Of course, the above-mentioned attention channel may also include functions such as GAP and Linear. Since different functions have different output value ranges, corresponding weights can be assigned to different channels of the first feature map to select the attention value range under different channels. Then, the channel unit 412 can multiply the weight value of each channel with the feature map of a preset dimension to obtain a channel feature map.
以关注通道包括GAP函数为例,该GAP函数可以对第一特征图进行平均池化,即对H/16*W/16的平面求平均值,得到该平面对应的数值;经过512次后,可以得到H/16*W/16*512维的通道特征图。Taking the attention channel including the GAP function as an example, the GAP function can perform average pooling on the first feature map, that is, average the plane of H/16*W/16 to obtain the value corresponding to the plane; after 512 times, a channel feature map of H/16*W/16*512 dimensions can be obtained.
转换单元413,用于将所述通道特征图进行尺度变换处理,得到计数特征图。该转换单元413用于将不同输入维度的通道特征图转换到另一个预设维度的特征图,后续称之为计数特征图(Counting Map)。在一示例中,该转换单元包括卷积核和Sigmoid激活函数。该卷积核的尺寸为1*1*512,并且卷积核的数量为K个,从而可以识别出K个字符,即计数特征图为1*K维的向量。The conversion unit 413 is used to scale the channel feature map to obtain a counting feature map. The conversion unit 413 is used to convert the channel feature map of different input dimensions into a feature map of another preset dimension, which is subsequently referred to as a counting feature map (Counting Map). In one example, the conversion unit includes a convolution kernel and a Sigmoid activation function. The size of the convolution kernel is 1*1*512, and the number of convolution kernels is K, so that K characters can be recognized, that is, the counting feature map is a 1*K dimensional vector.
池化单元414,用于对所述计数特征图进行池化处理,得到第一子向量。The pooling unit 414 is used to perform pooling processing on the count feature map to obtain a first sub-vector.
继续参见图5,第一卷积单元411通过卷积核进行3*3*684卷积,经过512个卷积核后可以将图像特征图的通道数从684维调整为512维。然后,通道单元412可以获取512个通道各自的权重值,并且将各通道的权重值与第一卷积单元411输出的第一特征图相乘进行加权,得到H/16*W/16*512维的通道特征图。之后,转换单元413通过1*1*208卷积将通道特征图的通道数从512维调整为208维。需要说明的是,上述208维度是识别网络的字符数据,可以根据模式来调整。此时计数特征图的维度为H/16*W/16*208。之后,池化单元414可以对每个H/16*W/16平面求和得到1*208维度的第一子向量,其中第一子向量的208个通道中每个通道的像素值分别对应表示字符在当前公式中出现次数。Continuing to refer to Fig. 5, the first convolution unit 411 performs a 3*3*684 convolution through its convolution kernels; after 512 convolution kernels, the number of channels of the image feature map is adjusted from 684 to 512. Then, the channel unit 412 obtains the weight value of each of the 512 channels, and multiplies each channel of the first feature map output by the first convolution unit 411 by its weight, obtaining a channel feature map of dimension H/16*W/16*512. Afterwards, the conversion unit 413 adjusts the number of channels of the channel feature map from 512 to 208 through a 1*1*208 convolution. It should be noted that the above 208 dimensions correspond to the character data of the recognition network and can be adjusted as needed. At this point, the dimension of the counting feature map is H/16*W/16*208. Then, the pooling unit 414 sums each H/16*W/16 plane to obtain a first sub-vector of dimension 1*208, where the value of each of the 208 channels of the first sub-vector represents the number of times the corresponding character appears in the current formula.
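The sub-counting pipeline above (channel attention, 1*1 convolution + Sigmoid, sum pooling) can be sketched in simplified form. This is an assumption-laden sketch: the initial 3*3 convolution and BN are omitted (the input is taken as the already-computed first feature map), and the GAP/Linear/ReLU/Sigmoid attention chain is collapsed into GAP + sigmoid; all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sub_counting_vector(feat, w_1x1):
    # feat: (H, W, C) feature map after the 3*3 convolution + BN;
    # w_1x1: (C, K) weights of the 1*1 convolution mapping C channels to
    # K recognisable characters (K = 208 in the text).
    # Channel attention: one weight per channel via global average pooling.
    chan_w = sigmoid(feat.mean(axis=(0, 1)))    # (C,)
    chan_feat = feat * chan_w                   # (H, W, C) channel feature map
    # 1*1 convolution + Sigmoid -> counting map of shape (H, W, K).
    counting_map = sigmoid(chan_feat @ w_1x1)
    # Sum pooling over each spatial plane -> (K,) sub-vector of counts.
    return counting_map.sum(axis=(0, 1))

rng = np.random.default_rng(0)
sub_vec = sub_counting_vector(rng.normal(size=(4, 16, 512)),
                              rng.normal(size=(512, 208)))
print(sub_vec.shape)  # (208,)
```

The second and third sub-counting modules follow the same sketch with different kernel sizes feeding `feat`.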
继续参见图5,第二子计数模块42包括第二卷积单元421、通道单元422、转换单元423和池化单元424。其中,第二卷积单元421用于利用第二预设尺寸的卷积核将图像特征图E(X)转换成第二特征图;通道单元422用于对所述第二特征图不同的通道赋予对应的权重,得到注意力特征图,并将所述注意力特征图和所述第二特征图作乘法处理后,得到通道特征图;转换单元423用于将所述通道特征图进行尺度变换处理,得到计数特征图;池化单元424用于对所述计数特征图进行池化处理,得到第二子向量。Continuing to refer to FIG5, the second sub-counting module 42 includes a second convolution unit 421, a channel unit 422, a conversion unit 423 and a pooling unit 424. Among them, the second convolution unit 421 is used to convert the image feature map E(X) into a second feature map using a convolution kernel of a second preset size; the channel unit 422 is used to assign corresponding weights to different channels of the second feature map to obtain an attention feature map, and multiply the attention feature map and the second feature map to obtain a channel feature map; the conversion unit 423 is used to scale the channel feature map to obtain a count feature map; the pooling unit 424 is used to pool the count feature map to obtain a second sub-vector.
需要说明的是,第二子计数模块42的结构与第一子计数模块41的结构相同,其区别在于两者的卷积单元不同,即第二子计数模块42将图像特征图转换为第二预设尺寸而非第一预设尺寸。在一示例中,第二卷积单元421包括尺寸为5*5*684的卷积核,即第二预设尺寸为5*5*684;并且卷积核的数量为512个,这样第二卷积单元421输出的第二特征图的维度为H/16*W/16*512。并且,第二子向量的维度为H*W*208。It should be noted that the structure of the second sub-counting module 42 is the same as that of the first sub-counting module 41, and the difference is that the convolution units of the two are different, that is, the second sub-counting module 42 converts the image feature map into a second preset size instead of the first preset size. In one example, the second convolution unit 421 includes a convolution kernel of size 5*5*684, that is, the second preset size is 5*5*684; and the number of convolution kernels is 512, so that the dimension of the second feature map output by the second convolution unit 421 is H/16*W/16*512. And the dimension of the second sub-vector is H*W*208.
继续参见图5,第三子计数模块43包括第三卷积单元431、通道单元432、转换单元433和池化单元434。其中,第三卷积单元431,用于利用第三预设尺寸的卷积核将图像特征图E(X)转换成第三特征图;通道单元432,用于对第三特征图不同的通道 赋予对应的权重,得到注意力特征图;并将注意力特征图和第三特征图作乘法处理后,得到通道特征图;转换单元433,用于将通道特征图进行尺度变换处理,得到计数特征图;池化单元434,用于对所述计数特征图进行池化处理,得到第三子向量。5, the third sub-counting module 43 includes a third convolution unit 431, a channel unit 432, a conversion unit 433 and a pooling unit 434. The third convolution unit 431 is used to convert the image feature map E(X) into a third feature map using a convolution kernel of a third preset size; the channel unit 432 is used to convert different channels of the third feature map Assign corresponding weights to obtain an attention feature map; and multiply the attention feature map and the third feature map to obtain a channel feature map; a conversion unit 433 is used to scale the channel feature map to obtain a count feature map; a pooling unit 434 is used to pool the count feature map to obtain a third sub-vector.
需要说明的是,第三子计数模块43的结构与第一子计数模块41的结构相同,其区别在于两者的卷积单元不同。在一示例中,第三卷积单元431包括尺寸为7*7*684的卷积核,即第三预设尺寸为7*7*684。并且,第三子向量的维度为1*208。It should be noted that the structure of the third sub-counting module 43 is the same as that of the first sub-counting module 41, and the difference is that the convolution units of the two are different. In one example, the third convolution unit 431 includes a convolution kernel of size 7*7*684, that is, the third preset size is 7*7*684. And the dimension of the third sub-vector is 1*208.
在一实施例中,循环解码模块包括门控循环单元,用于根据隐向量和前一预测字符向量生成所述字符隐向量。在一示例中,循环解码模块包括第一门控循环单元和第二门控循环单元;所述第一门控循环单元,用于根据隐向量和前一预测字符向量生成第一门控向量;所述第二门控循环单元,用于根据所述第一门控向量生成所述字符隐向量。需要说明的是,门控循环单元的数量可以根据具体场景进行选择,在满足识别准确度的情况下,相应方案落入本公开的保护范围。In one embodiment, the loop decoding module includes a gated loop unit, which is used to generate the character latent vector based on the latent vector and the previous predicted character vector. In one example, the loop decoding module includes a first gated loop unit and a second gated loop unit; the first gated loop unit is used to generate a first gated vector based on the latent vector and the previous predicted character vector; the second gated loop unit is used to generate the character latent vector based on the first gated vector. It should be noted that the number of gated loop units can be selected according to the specific scenario. When the recognition accuracy is met, the corresponding scheme falls within the protection scope of the present disclosure.
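The gated recurrent units referenced above follow the standard GRU update. A minimal sketch of one GRU step, assuming the standard gate equations (the parameter layout and names are illustrative; the 256-d sizes come from the text below):

```python
import numpy as np

def gru_step(x, h, W, U, b):
    # One gated-recurrent-unit update: z (update gate) decides how much of
    # the state to renew, r (reset gate) how much history feeds the
    # candidate state n.
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sig(x @ W["z"] + h @ U["z"] + b["z"])
    r = sig(x @ W["r"] + h @ U["r"] + b["r"])
    n = np.tanh(x @ W["n"] + (r * h) @ U["n"] + b["n"])
    return (1 - z) * h + z * n

rng = np.random.default_rng(1)
dim = 256
W = {k: rng.normal(scale=0.1, size=(dim, dim)) for k in "zrn"}
U = {k: rng.normal(scale=0.1, size=(dim, dim)) for k in "zrn"}
b = {k: np.zeros(dim) for k in "zrn"}
h1 = gru_step(rng.normal(size=dim), np.zeros(dim), W, U, b)
print(h1.shape)  # (256,)
```

In the two-unit configuration of Fig. 6, the output of the first such step becomes an input of the second, which emits the character latent vector M.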
在另一示例中,循环解码模块还包括注意力子模块(attention model),参见图6,循环解码模块32包括注意力子模块61、第一门控循环单元62和第二门控循环单元63。第一门控循环单元62,用于根据隐向量和前一预测字符向量生成第一门控向量;该第一门控循环单元62的两个输入,即隐向量和前一预测字符向量,均为256维度。前一预测字符向量的初始值为起始符号“<sos>”的256维向量表示,第一次之后的取值是前一次训练过程中的结果。隐向量的初始值为编码器输出的图像特征图E(X)的每个通道像素的平均值(维度1*684)的线性变换(维度1*256),即图像特征图E(X)对应的1*256维度的向量;并且非首次循环时,该隐向量是第二门控循环单元63输出的字符隐向量M。In another example, the loop decoding module further includes an attention submodule (attention model). Referring to FIG. 6 , the loop decoding module 32 includes an attention submodule 61, a first gated recurrent unit 62 and a second gated recurrent unit 63. The first gated recurrent unit 62 is used to generate the first gating vector from the hidden vector and the previous predicted character vector; the two inputs of the first gated recurrent unit 62, i.e. the hidden vector and the previous predicted character vector, are both 256-dimensional. The initial value of the previous predicted character vector is the 256-dimensional vector representation of the start symbol "<sos>", and its value after the first step is the result of the previous step in the training process. The initial value of the hidden vector is a linear transformation (dimension 1*256) of the per-channel pixel average (dimension 1*684) of the image feature map E(X) output by the encoder, i.e. a 1*256-dimensional vector corresponding to the image feature map E(X); and in non-first loops, this hidden vector is the character latent vector M output by the second gated recurrent unit 63.
需要说明的是,前一预测字符向量是基于字符隐向量M进行字符预测后可得到字符数据;然后,根据该字符数据进行查表操作,查询到上述字符数据对应的向量,并将查询到的向量作为前一预测字符向量的取值,达到更新前一预测字符向量的效果。It should be noted that character data can be obtained by performing character prediction based on the character latent vector M; then, a table lookup is performed on this character data to find the vector corresponding to it, and the found vector is taken as the value of the previous predicted character vector, thereby updating the previous predicted character vector.
注意力子模块61,用于根据图像特征图E(X)、全零张量attα(X)和第一门控向量生成上下文向量Ω。注意力子模块61的输入为E(X),其尺寸为H/16*W/16*684;第一门控循环单元62的输出结果即第一门控向量,其维度为1*256;全零张量attα(X)的初始化维度为1*1*H/16*W/16,用于记录历史的关注区域。The attention submodule 61 is used to generate the context vector Ω from the image feature map E(X), the all-zero tensor attα(X) and the first gating vector. The input of the attention submodule 61 is E(X), whose size is H/16*W/16*684; the output of the first gated recurrent unit 62, i.e. the first gating vector, has dimension 1*256; the all-zero tensor attα(X) is initialized with dimension 1*1*H/16*W/16 and is used to record the historically attended regions.
需要说明的是,本示例中全零张量attα(X)是将向量的初始化值设置为全零,从而方便运算。当然也可以将其初始化值设置为其他随机数,同样可以得到相应的结果。该全零张量attα(X)在非首次循环的取值为attα(X)=attα(X)+α。其中α是由图7所示例的注意力子模块61在每个循环后的更新向量。 It should be noted that in this example, the all-zero tensor att α (X) sets the initialization value of the vector to all zeros, so as to facilitate the operation. Of course, its initialization value can also be set to other random numbers, and the corresponding results can also be obtained. The value of the all-zero tensor att α (X) in the non-first cycle is att α (X) = att α (X) + α. Where α is the update vector of the attention submodule 61 illustrated in Figure 7 after each cycle.
第二门控循环单元63,用于根据第一门控向量和上下文向量Ω生成字符隐向量M。第二门控循环单元63的输出结果为M,其维度为1*256。The second gated recurrent unit 63 is used to generate the character latent vector M from the first gating vector and the context vector Ω. The output of the second gated recurrent unit 63 is M, whose dimension is 1*256.
参见图7,注意力子模块61包括权重获取子模块71和上下文获取子模块72。Referring to FIG. 7 , the attention submodule 61 includes a weight acquisition submodule 71 and a context acquisition submodule 72 .
权重获取子模块71,用于将全零张量attα(X)进行卷积和线性处理(Conv+linear),得到第一特征向量;以及将图像特征图E(X)进行卷积处理(Conv),得到第二特征向量;以及将第一门控向量进行线性处理(linear),得到第三特征向量;以及对所述第一特征向量、所述第二特征向量和所述第三特征向量叠加处理(“+”),得到第四特征向量。叠加处理是指将全零张量attα(X)处理后的H/16*W/16*512维度的向量和图像特征图E(X)经过处理后的H/16*W/16*512维度的向量进行相加处理,得到一个H/16*W/16*512维度的和向量;然后,第一门控向量进行线性处理(linear)得到1*512维度的向量;之后,将H/16*W/16*512维度的和向量与1*512维度的向量继续求和,得到H/16*W/16*512维度的第四特征向量。The weight acquisition submodule 71 is used to perform convolution and linear processing (Conv+linear) on the all-zero tensor attα(X) to obtain a first feature vector; to perform convolution processing (Conv) on the image feature map E(X) to obtain a second feature vector; to perform linear processing (linear) on the first gating vector to obtain a third feature vector; and to superpose ("+") the first feature vector, the second feature vector and the third feature vector to obtain a fourth feature vector. The superposition refers to adding the H/16*W/16*512-dimensional vector obtained from the all-zero tensor attα(X) and the H/16*W/16*512-dimensional vector obtained from the image feature map E(X) to get an H/16*W/16*512-dimensional sum vector; then, the first gating vector is linearly processed (linear) into a 1*512-dimensional vector; after that, the H/16*W/16*512-dimensional sum vector and the 1*512-dimensional vector are further summed to obtain the H/16*W/16*512-dimensional fourth feature vector.

上下文获取子模块72,用于分别对第四特征向量进行第一次激活处理(激活函数为tanh)、线性转换处理(linear)和第二次激活处理(激活函数为softmax)后,得到注意力权重α;以及将图像特征图E(X)和注意力权重进行相乘处理(“X”),再进行求和处理(sum),得到上下文向量Ω。The context acquisition submodule 72 is used to perform a first activation (activation function tanh), a linear transformation (linear) and a second activation (activation function softmax) on the fourth feature vector to obtain the attention weight α; and to multiply ("X") the image feature map E(X) by the attention weight and then sum (sum) the result to obtain the context vector Ω.
本实施例中,注意力子模块61采用Coverage机制以避免重复识别,上下文向量Ω=ΣhΣwE(X)×α,下一次循环时输入attα(X)=attα(X)+α。第一门控向量经过线性处理后将通道数量调整为512维度。attα(X)经过卷积层(卷积核尺寸为11*11,padding=5,步长为1),线性变换linear处理后将通道调整为512维度,图像特征图E(X)经过1*1卷积处理后将通道数调整为512。然后上述三路处理后的第一特征向量、第二特征向量和第三特征向量相加,再由激活函数tanh激活处理,再经过线性变换将通道数压缩为1,最后经过激活函数softmax激活处理后得到注意力权重α。In this embodiment, the attention submodule 61 adopts the Coverage mechanism to avoid repeated recognition, the context vector Ω = ΣhΣwE(X)×α, and the input in the next loop is attα(X) = attα(X) + α. The first gating vector is linearly processed to adjust its number of channels to 512. attα(X) passes through a convolution layer (kernel size 11*11, padding=5, stride 1) and a linear transformation to adjust its channels to 512, and the image feature map E(X) is adjusted to 512 channels after a 1*1 convolution. Then the first, second and third feature vectors obtained from these three branches are added, activated by the tanh activation function, compressed to 1 channel through a linear transformation, and finally activated by the softmax activation function to obtain the attention weight α.
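The coverage attention above can be sketched in simplified form. This is an assumption-laden sketch: the 11*11 and 1*1 convolutions are replaced by simple per-channel projections, and all weight names are illustrative; only the tanh → linear → softmax chain, the context sum Ω = ΣhΣw E(X)×α and the accumulator update attα(X) += α come from the text:

```python
import numpy as np

def coverage_attention(E, att_acc, gate_vec, We, Wc, Wg, v):
    # E: (H, W, C) image feature map; att_acc: (H, W) accumulated past
    # attention weights (all zeros on the first step); gate_vec: the first
    # gating vector. We, Wc, Wg project everything into a shared 512-d
    # space (stand-ins for the Conv/linear layers); v compresses each
    # position to a scalar score.
    fused = np.tanh(E @ We + att_acc[..., None] * Wc + gate_vec @ Wg)
    scores = fused @ v                                   # (H, W)
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                                  # softmax over H*W
    context = (E * alpha[..., None]).sum(axis=(0, 1))    # (C,) context Ω
    return context, alpha, att_acc + alpha               # coverage update

rng = np.random.default_rng(2)
H, W_, C = 4, 16, 684
ctx, alpha, att = coverage_attention(
    rng.normal(size=(H, W_, C)),
    np.zeros((H, W_)),
    rng.normal(size=256),
    rng.normal(scale=0.05, size=(C, 512)),
    rng.normal(scale=0.05, size=512),
    rng.normal(scale=0.05, size=(256, 512)),
    rng.normal(scale=0.05, size=512),
)
print(ctx.shape, round(float(alpha.sum()), 6))  # (684,) 1.0
```

Returning the updated accumulator makes the "input attα(X) = attα(X) + α for the next loop" step explicit: regions already attended to receive a growing coverage signal, discouraging repeated recognition.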
参见图8,字符预测模块包括字符预测子模块81和结构预测子模块82。字符预测子模块81用于对目标向量进行处理,得到预测出的字符数据;结构预测子模块82用于对目标向量进行处理,得到预测出的结构数据。字符数据和结构数据构成预测字符集合。Referring to FIG8 , the character prediction module includes a character prediction submodule 81 and a structure prediction submodule 82. The character prediction submodule 81 is used to process the target vector to obtain predicted character data; the structure prediction submodule 82 is used to process the target vector to obtain predicted structure data. The character data and the structure data constitute a predicted character set.
在一示例中,字符预测子模块81包括线性转换单元(Linear)和激活单元(sigmoid)。线性转换单元811,用于对所述目标向量进行线性转换;激活单元812,用于对线性转换后的向量进行激活处理,得到预测出的字符数据。In one example, the character prediction submodule 81 includes a linear conversion unit (Linear) and an activation unit (sigmoid). The linear conversion unit 811 is used to perform linear conversion on the target vector; the activation unit 812 is used to perform activation processing on the vector after linear conversion to obtain predicted character data.
在一示例中,结构预测子模块82包括线性转换单元(Linear)和激活单元(softmax)。线性转换单元,用于对所述目标向量进行线性转换;激活单元,用于对线性转换后的向量进行激活处理,得到预测出的结构数据。In one example, the structure prediction submodule 82 includes a linear conversion unit (Linear) and an activation unit (softmax). The linear conversion unit is used to perform linear conversion on the target vector; the activation unit is used to perform activation processing on the linearly converted vector to obtain the predicted structure data.
结合图2~图8,在一可能实施例中,公式识别模型的结构可以如图9所示。In conjunction with FIG. 2 to FIG. 8 , in a possible embodiment, the structure of the formula recognition model may be as shown in FIG. 9 .
参见图9,公式识别模型中通过设置多尺度计数模块31和循环解码模块32,可以识别出图像特征图中不同尺寸区域的特征,有利于提高字符的检测准确率。 Referring to FIG. 9 , by setting a multi-scale counting module 31 and a cyclic decoding module 32 in the formula recognition model, features of regions of different sizes in the image feature map can be identified, which is beneficial to improving the detection accuracy of characters.
参见图10,该公式识别模型的训练过程包括步骤101~步骤103。Referring to FIG. 10 , the training process of the formula recognition model includes steps 101 to 103 .
在步骤101中,获取训练样本数据;所述训练样本数据包括多条数学公式的数据集合;每条数据集合包括按照预设公式格式展开后得到的字符数据和结构数据;所述结构数据用于表示数学公式中部分字符的相对位置关系。In step 101, training sample data is obtained; the training sample data includes a data set of multiple mathematical formulas; each data set includes character data and structure data obtained after expansion according to a preset formula format; the structure data is used to represent the relative position relationship of some characters in the mathematical formula.
本步骤中,电子设备可以获取训练样本数据。可理解的是,上述训练样本数据中包括多条数学公式的数据集合,每条数据集合包括按照预设公式格式展开后得到的字符数据和结构数据;所述结构数据用于表示数学公式中部分字符的相对位置关系。In this step, the electronic device can obtain training sample data. It is understandable that the training sample data includes a data set of multiple mathematical formulas, each of which includes character data and structure data obtained after expansion according to a preset formula format; the structure data is used to represent the relative position relationship of some characters in the mathematical formula.
本步骤中,数学公式包括以下至少一种结构:分数、下标、上标、根式、上方划线、上方箭头、三角标、极限、求和符等等。上述各结构的符号分别为\frac、_、^、\sqrt、\overline、\overrightarrow、\widehat、\limits、\sum。可理解的是,对于数学公式中以下结构如log、ln、lg或指数等也适用于本公开的方案。In this step, the mathematical formula includes at least one of the following structures: fraction, subscript, superscript, radical, overline, overarrow, triangle, limit, summation symbol, etc. The symbols of the above structures are respectively \frac, _, ^, \sqrt, \overline, \overrightarrow, \widehat, \limits, \sum. It is understandable that the following structures in the mathematical formula, such as log, ln, lg or exponent, are also applicable to the scheme of the present disclosure.
Table 1: Mathematical formulas of different structures and their LaTeX representations
As shown in Table 1, there are seven positional relationships in mathematical formulas, namely inside (inside), right (right), above (above), below (below), left superscript (left_sup, as in the index of a radical), subscript (sub), and superscript (sup).
Taking a mathematical formula that includes a fraction as an example, FIG. 11 illustrates the relationship between character positions in the formula and the LaTeX tags. Referring to FIG. 11, the character a is above the fraction line, the character b is below it, and the character c is to its right. Given the preset formula format \frac{above}{below}right, the formula can be written as \frac{a}{b}c.
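As a minimal illustration of this template fill, the expansion of the positional slots into a LaTeX string can be sketched as follows (the helper name and signature are illustrative, not part of the disclosed method):

```python
# Hedged sketch: fill the preset template \frac{above}{below}right with
# the characters found in the three positional slots of FIG. 11.

def expand_fraction(above: str, below: str, right: str = "") -> str:
    """Fill the preset template \\frac{above}{below}right."""
    return "\\frac{" + above + "}{" + below + "}" + right

print(expand_fraction("a", "b", "c"))  # \frac{a}{b}c
```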
By combining the positional relationships of the characters in a mathematical formula with the preset structure shown in FIG. 11, the training sample data can be formed in this embodiment.
1. For the first example mathematical formula, the corresponding training sample data is shown in Table 2.
Table 2: Training sample data
2. For the second example mathematical formula, the corresponding training sample data is shown in Table 3.
Table 3: Training sample data
It should be noted that "<eos>" in Tables 2 and 3 is an end marker: the final <eos> indicates the end of recognition of the entire formula, while intermediate <eos> markers indicate the end of each structure.
It should also be noted that in Tables 2 and 3, "struct" indicates that the position contains characters in one of the seven relative positional relationships; if a positional relationship exists it is filled in, and otherwise "None" is filled in.
Combining Tables 2 and 3, the first column (from top to bottom) is the formula expressed with LaTeX tags plus the added "struct" marker and the relative positional relationships (seven kinds), and columns 2 to 7 give the relative positional relationships after the structure is added; the whole table constitutes the training sample data. It can be seen that in this embodiment, by adding structure data to the training sample data, mathematical formulas can be represented more accurately, which facilitates their subsequent accurate restoration.
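The label layout of Tables 2 and 3 can be pictured as flattening a nested formula tree into a token sequence that interleaves the "struct" marker, the seven positional relations in a fixed order, and "<eos>" terminators. The tree encoding and function below are hypothetical simplifications for illustration:

```python
# Illustrative sketch (the tree encoding is hypothetical): flatten a
# nested formula into a label sequence mirroring Tables 2 and 3.

RELATIONS = ("inside", "right", "above", "below", "left_sup", "sub", "sup")

def flatten(node):
    """node = (symbol, {relation: child_node}); returns a flat token list."""
    symbol, children = node
    tokens = [symbol]
    if children:
        tokens.append("struct")          # marks that positional slots follow
        for rel in RELATIONS:            # slots emitted in a fixed order
            if rel in children:
                tokens += flatten(children[rel]) + ["<eos>"]  # close each slot
    return tokens

frac = ("\\frac", {"above": ("a", {}), "below": ("b", {})})
label = flatten(frac) + flatten(("c", {})) + ["<eos>"]  # final <eos> ends the formula
print(label)
```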
Based on the above principle, data sets of a number of mathematical formulas can be collected to obtain the training sample data.
In one embodiment, mathematical formulas may be handwritten by several users and photographed, yielding a number of original images containing mathematical formulas. The electronic device stores a preset image recognition model, such as a convolutional neural network, and uses this model to recognize each original image to obtain the corresponding character data. Alternatively, the character data corresponding to each original image is annotated manually, and the electronic device obtains the input character data. Finally, the electronic device may create structure data and arrange the character data according to the preset formula format, thereby obtaining the training sample data corresponding to each original image. This embodiment not only improves the diversity of the mathematical formulas but also improves the efficiency of obtaining training sample data.
In step 102, the training sample data is used to train the formula recognition model to be trained, and a predicted character set output by the formula recognition model is obtained; the predicted character set includes character data and structure data.
In this step, the electronic device may input each piece of training sample data in turn into the formula recognition model to be trained, and the model processes the training sample data to obtain the predicted character set. It should be understood that the predicted character set includes character data and structure data.
In this step, the inference process of the formula recognition model is shown in FIG. 12. Referring to FIG. 12, the process is as follows. First, the structure information pushed onto the stack takes the form [position information, parent_symbol], and the prediction result pre_string, which stores the character data recognized so far and corresponds to the LaTeX-tag representation of the mathematical formula in FIG. 11, is initialized to the empty string ''. Recognition proceeds character by character: the current character current_symbol is obtained, and parent_symbol = current_symbol, i.e., the value of current_symbol is assigned to parent_symbol. It is then checked whether current_symbol equals 'struct' or the end marker '<eos>'. When current_symbol is neither 'struct' nor '<eos>', pre_string += current_symbol, i.e., the current character is appended to pre_string. When current_symbol is 'struct' or '<eos>', it is further determined whether current_symbol is 'struct'.
As for determining whether current_symbol is 'struct': when current_symbol is 'struct', the structure information is predicted and the structure data is pushed onto the stack; the top element is then popped, and pre_string is adjusted based on the popped element and parent_symbol. When current_symbol is not 'struct' (i.e., it is '<eos>'), it is checked whether the stack is empty. When the stack is not empty, the top element is popped and pre_string is adjusted based on the popped element and parent_symbol. When the stack is empty, it is checked whether the numbers of '{' and '}' in pre_string are equal; if they are equal, the loop exits and the recognition result is returned. If they are not equal, a '}' is appended, i.e., pre_string += '}'; once the numbers of '{' and '}' in pre_string are equal, the loop exits and the recognition result is returned.
After pre_string is adjusted, it is checked whether the number of loop iterations exceeds 100; if so, the loop exits and the recognition result is returned. Otherwise, the next character is recognized, current_symbol is updated, and the loop repeats until the recognition result is returned.
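The control flow of FIG. 12 can be sketched in simplified form as follows, with the recognition model replaced by a scripted token stream. The token encoding, in which a "struct" token carries the relations the structure head would predict at that point, is an assumption made for this sketch; the real model predicts every symbol and structure from image features:

```python
# Simplified sketch of the decoding loop of FIG. 12: stack handling,
# the '{'/'}' balance top-up, and the cap of 100 iterations.

def decode(tokens, max_loops=100):
    stream = iter(tokens)
    stack, pre_string, parent_symbol = [], "", ""
    for _ in range(max_loops):          # cap of 100 loops, as in FIG. 12
        tok = next(stream, "<eos>")
        if isinstance(tok, tuple) and tok[0] == "struct":
            # push one stack entry per predicted relation, first relation on top
            for rel in reversed(tok[1]):
                stack.append((rel, parent_symbol))
            pre_string += "{"           # open the first structure slot
        elif tok != "<eos>":
            pre_string += tok           # ordinary character: append it
            parent_symbol = tok
        elif stack:
            stack.pop()                 # '<eos>' closes the current slot
            pre_string += "}" + ("{" if stack else "")
        else:
            # final '<eos>': top up any missing closing braces, then stop
            pre_string += "}" * (pre_string.count("{") - pre_string.count("}"))
            break
    return pre_string

tokens = ["\\frac", ("struct", ["above", "below"]),
          "a", "<eos>", "b", "<eos>", "c", "<eos>"]
print(decode(tokens))  # \frac{a}{b}c
```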
In step 103, when it is determined that the character data meets the preset condition, the training is stopped, and the trained formula recognition model is obtained.
In this step, to improve the recognition accuracy of the formula recognition model, a loss function is set in this embodiment. The loss function is constructed based on the frequency of occurrence of each character in the label data of the training sample data and the frequency of occurrence of each character in the predicted data set. In one example, the loss function includes three sub-loss functions: the first sub-loss function compares the counting feature map with the number of each character in the training sample data and uses the SmoothL1Loss function; the second sub-loss function evaluates the structure-information prediction result and uses the cross-entropy function; the third sub-loss function evaluates the character-information prediction result and also uses the cross-entropy function.
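The three sub-losses can be sketched in plain Python as follows; a practical implementation would use a deep-learning framework's built-in SmoothL1Loss and cross-entropy, and the relative weighting of the three terms is not specified in the text:

```python
import math

# Plain-Python sketch of the three sub-losses described above.

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 between predicted and labelled character counts."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def cross_entropy(probs, true_idx):
    """Cross entropy for one structure or character prediction."""
    return -math.log(probs[true_idx])

count_loss = smooth_l1([2.2, 0.9], [2, 1])       # counting sub-loss
struct_loss = cross_entropy([0.1, 0.8, 0.1], 1)  # structure sub-loss
char_loss = cross_entropy([0.7, 0.2, 0.1], 0)    # character sub-loss
total_loss = count_loss + struct_loss + char_loss
```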
In this step, the electronic device may determine whether the character data meets the preset condition. For example, the electronic device may obtain the frequency of occurrence of each character in the predicted character set to obtain the predicted frequency corresponding to each character; then obtain the difference between the predicted frequency of each character and the labeled frequency in the label of the training sample data to obtain the prediction error value of each character; and then average the prediction error values of the characters in the predicted character set to obtain a mean error. When the mean error is less than or equal to a preset mean threshold (for example, 1 to 3 occurrences), the electronic device determines that the character data meets the preset condition and stops training the formula recognition model, thereby obtaining the trained formula recognition model. When the mean error is greater than the preset mean threshold, the electronic device determines that the character data does not meet the preset condition, and the process returns to step 102 to continue training until the preset condition is met.
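The stopping check described in this step can be sketched as follows; the function name and threshold value are illustrative:

```python
from collections import Counter

# Sketch of the stopping condition: compare each character's predicted
# frequency with its labelled frequency and stop when the mean absolute
# error is at or below a threshold.

def should_stop(predicted, label, threshold=1.0):
    pred_freq, label_freq = Counter(predicted), Counter(label)
    chars = set(pred_freq) | set(label_freq)   # missing keys count as 0
    errors = [abs(pred_freq[c] - label_freq[c]) for c in chars]
    return sum(errors) / len(errors) <= threshold

print(should_stop(list("aab{}"), list("ab{}")))  # mean error 0.25 -> True
```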
In step 13, the position of the character data in the mathematical formula is restored according to the preset formula format and the structure data, so as to obtain the mathematical formula in the original image.
In this step, considering that the predicted character set obtained in step 12 is expressed in the structure shown in Tables 2 and 3, the electronic device can restore the position of the character data in the mathematical formula according to the preset formula format and the structure data, and thereby obtain the mathematical formula in the original image.
In this step, a comparison of the recognition results of mathematical formulas obtained with and without the above formula recognition model components is shown in Table 4.
Table 4: Recognition accuracy before and after using the formula recognition model
In Table 4, a check mark "√" indicates that the corresponding function is used. As shown in Table 4, when only the counting module is used, the accuracy of the mathematical formula recognition results is improved by 45.49%; when only the structure information is used, the accuracy is improved by 50.25%; and when both the counting module and the structure information are used, the accuracy is improved by 51.95%.
To sum up, in the scheme provided by the embodiments of the present disclosure, an original image containing a mathematical formula can be obtained; the original image is then input into the formula recognition model to obtain a predicted character set output by the model, the predicted character set including character data and structure data; and the position of the character data in the mathematical formula is then restored according to the preset formula format and the structure data, so as to obtain the mathematical formula in the original image. In this way, by providing structure data, this embodiment can characterize the relative positional relationships of the character data and eliminate the problem of lost characters during recognition of complex formulas, which helps improve the recognition accuracy. In other words, this embodiment adopts a dual prediction scheme of structure information and character counting to improve the model's ability to recognize complex formulas with multi-level nested structures, reducing formula misrecognition caused by the loss of "{" or "}" in multi-level nesting, and splits complex formulas into components by structure for recognition, thereby improving accuracy. In addition, a multi-scale counting module is added at the decoding stage, and a counting loss function is constructed based on the number of each character in the label to weakly supervise the training process, which increases the detection rate of the characters in each component and improves the recognition accuracy.
On the basis of the mathematical formula recognition method provided by the embodiments of the present disclosure, this embodiment further provides a mathematical formula recognition apparatus. Referring to FIG. 13, the apparatus includes: an original image acquisition module 131, configured to acquire an original image containing a mathematical formula; a prediction set acquisition module 132, configured to input the original image into a formula recognition model and obtain a predicted character set output by the formula recognition model, the predicted character set including character data and structure data; and a mathematical formula acquisition module 133, configured to restore the position of the character data in the mathematical formula according to a preset formula format and the structure data, so as to obtain the mathematical formula in the original image.
It should be noted that the apparatus embodiment shown here corresponds to the content of the above method embodiments; reference may be made to the method embodiments described above, and details are not repeated here.
In an exemplary embodiment, an electronic device is further provided, including: a display screen; a processor; and a memory for storing a computer program executable by the processor, wherein the processor is configured to execute the computer program in the memory to implement the method described above.
In an exemplary embodiment, a computer-readable storage medium is further provided, for example, a memory including an executable computer program, where the executable computer program can be executed by a processor to implement the method of the above embodiments. The readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Those skilled in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing the disclosure herein. The present disclosure is intended to cover any variations, uses, or adaptations that follow the general principles of the present disclosure and include common knowledge or customary technical means in the art not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
Claims (25)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310638406.1A CN116597457A (en) | 2023-05-31 | 2023-05-31 | Mathematical formula identification method, device, electronic equipment and readable storage medium |
| CN202310638406.1 | 2023-05-31 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024244760A1 true WO2024244760A1 (en) | 2024-12-05 |
Family ID: 87595461
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/088254 Pending WO2024244760A1 (en) | 2023-05-31 | 2024-04-17 | Mathematical formula recognition method and apparatus, electronic device, and readable storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116597457A (en) |
| WO (1) | WO2024244760A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120071364A (en) * | 2025-04-22 | 2025-05-30 | 广州炫视智能科技有限公司 | Real-time recognition method and system for whiteboard handwriting formula based on artificial intelligence |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116597457A (en) * | 2023-05-31 | 2023-08-15 | 京东方科技集团股份有限公司 | Mathematical formula identification method, device, electronic equipment and readable storage medium |
| CN120071362A (en) * | 2023-11-30 | 2025-05-30 | 京东方科技集团股份有限公司 | Mathematical formula identification method, device, electronic equipment and readable storage medium |
| CN120524991A (en) * | 2025-07-24 | 2025-08-22 | 云南师范大学 | A handwritten mathematical expression recognition method based on skeleton shaping and character counting modules |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2003256770A (en) * | 2002-03-06 | 2003-09-12 | Toshiba Corp | Equation recognition apparatus and equation recognition method |
| CN114898376A (en) * | 2022-05-31 | 2022-08-12 | 深圳市星桐科技有限公司 | Formula identification method, device, equipment and medium |
| CN115100662A (en) * | 2022-06-13 | 2022-09-23 | 深圳市星桐科技有限公司 | Formula identification method, device, equipment and medium |
| CN115376140A (en) * | 2022-08-26 | 2022-11-22 | 深圳市星桐科技有限公司 | Image processing method, device, equipment and medium |
| CN116110059A (en) * | 2023-01-06 | 2023-05-12 | 武汉天喻信息产业股份有限公司 | Offline handwriting mathematical formula identification method based on deep learning |
| CN116597457A (en) * | 2023-05-31 | 2023-08-15 | 京东方科技集团股份有限公司 | Mathematical formula identification method, device, electronic equipment and readable storage medium |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110148100A (en) * | 2019-05-10 | 2019-08-20 | 腾讯科技(深圳)有限公司 | A kind of image conversion method, device, storage medium and computer equipment |
| CN113095314B (en) * | 2021-04-07 | 2024-07-09 | 科大讯飞股份有限公司 | Formula identification method, device, storage medium and equipment |
| CN115601765A (en) * | 2022-10-31 | 2023-01-13 | 京东方科技集团股份有限公司(Cn) | Handwritten formula recognition method, training method and device for handwritten formula recognition model |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116597457A (en) | 2023-08-15 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24813966; Country of ref document: EP; Kind code of ref document: A1 |