Disclosure of Invention
One or more embodiments of the present specification describe neural network systems and methods for picture matching and localization, so that pictures can be matched and located quickly, efficiently, and in a single integrated process.
According to a first aspect, there is provided a computer-implemented neural network system for picture matching localization, comprising:
a first convolution network comprising a first convolution layer and a pooling layer, wherein the first convolution layer performs convolution processing on a first picture to obtain a first convolution feature map corresponding to the first picture, and the pooling layer performs a pooling operation on the first convolution feature map to generate a first feature vector having a first number of dimensions, wherein the first picture is a picture to be matched;
a second convolution network configured to perform convolution processing on a second picture to obtain N feature vectors respectively corresponding to N regions contained in the second picture, the dimensionality of each of the N feature vectors being the first number, wherein the second picture is a picture to be searched;
a combination layer configured to perform a combination operation on the first feature vector with each of the N feature vectors, respectively, to obtain N combined vectors;
a bounding box regression layer that outputs information of a predicted bounding box in the second picture based at least on the N combined vectors, the predicted bounding box indicating the region of the second picture that contains the picture content of the first picture.
In one embodiment, the second convolution network includes a second convolution layer and a feature extraction layer, wherein the second convolution layer performs convolution processing on the second picture to obtain a second convolution feature map corresponding to the second picture; and the feature extraction layer extracts N feature vectors corresponding to the N regions respectively based on the second convolution feature map.
Further, according to one design, the second convolutional layer and the first convolutional layer are a shared (common) convolutional layer.
According to one embodiment, the N regions are segmented according to a predetermined segmentation rule.
According to another embodiment, the N regions are generated by a selective search algorithm, or by a region generation network.
According to one embodiment, the combining operation performed by the combining layer includes a vector dot product operation.
According to one possible design, the bounding box regression layer comprises a first hidden layer, a second hidden layer, and an output layer;
the first hidden layer determines the region probability that the first picture appears in each of the N regions;
the second hidden layer generates candidate bounding boxes in at least one region and obtains a confidence for each candidate bounding box;
and the output layer outputs the information of the predicted bounding box according to the region probability of each region and the confidence of each candidate bounding box, wherein the information of the predicted bounding box comprises the coordinates of the predicted bounding box, and the region probability and the confidence corresponding to the predicted bounding box.
Further, in one design, the second hidden layer generates candidate bounding boxes in regions whose region probability is greater than a preset threshold.
In one embodiment, the output layer takes the candidate bounding box with the largest product of the corresponding region probability and the confidence as the predicted bounding box.
According to one embodiment, the neural network system is obtained by end-to-end training on training samples; the training samples comprise a plurality of picture pairs, each picture pair comprising a first training picture and a second training picture, the second training picture being labeled with a target box showing the region of the second training picture that contains the picture content of the first training picture.
Further, in one embodiment, the bounding box regression layer includes a first hidden layer and a second hidden layer; in such a case, the end-to-end training includes:
determining, according to the position of the target box, the specific region among the N regions of the second training picture in which the target box is located, and determining a region label according to that specific region;
predicting, through the first hidden layer, the predicted region probability that the first training picture appears in each region;
generating candidate bounding boxes in each region through the second hidden layer;
determining the intersection over union (IoU) of each candidate bounding box with the target box as the confidence of that candidate bounding box;
and adjusting the network layer parameters of the first hidden layer and the second hidden layer based at least on the region label, the predicted region probabilities, and the confidences of the candidate bounding boxes, thereby training the neural network system.
According to a second aspect, there is provided a computer-implemented method for picture matching localization, comprising:
performing a first convolution process on a first picture to obtain a first convolution feature map corresponding to the first picture, the first picture being a picture to be matched;
performing a pooling operation on the first convolution feature map to generate a first feature vector having a first number of dimensions;
performing convolution processing on a second picture to obtain N feature vectors respectively corresponding to N regions contained in the second picture, the dimensionality of each of the N feature vectors being the first number, the second picture being a picture to be searched;
performing a combination operation on the first feature vector with each of the N feature vectors, respectively, to obtain N combined vectors;
and outputting, based at least on the N combined vectors and using a bounding box regression algorithm, information of a predicted bounding box in the second picture, the predicted bounding box indicating the portion of the second picture that contains the picture content of the first picture.
According to a third aspect, there is provided an apparatus for picture matching positioning, comprising:
a first convolution unit configured to perform a first convolution process on a first picture to obtain a first convolution feature map corresponding to the first picture, the first picture being a picture to be matched;
a pooling unit configured to perform a pooling operation on the first convolution feature map to generate a first feature vector having a first number of dimensions;
a second convolution unit configured to perform convolution processing on a second picture to obtain N feature vectors respectively corresponding to N regions included in the second picture, the dimensionality of each of the N feature vectors being the first number, the second picture being a picture to be searched;
a combination unit configured to perform a combination operation on the first feature vector with each of the N feature vectors, respectively, to obtain N combined vectors;
a prediction unit configured to output information of a predicted bounding box in the second picture using a bounding box regression algorithm based on at least the N combined vectors, the predicted bounding box indicating a portion of the second picture containing picture content of the first picture.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the neural network system of the first aspect.
According to the solutions provided by the embodiments of this specification, fast matching and localization of pictures is achieved through a two-branch neural network system: a bounding box is used to select, in the picture to be searched, the region that contains the picture to be matched. Because matching and localization are performed simultaneously in a single pass, processing efficiency and performance are improved.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification.
According to embodiments of this specification, labeled picture pairs are used as training samples to train a neural network model. Once trained, the neural network can be used to match and locate pictures. Specifically, the neural network has a two-branch structure: the picture to be matched is input into the first branch, the picture to be searched is input into the second branch, and the trained network outputs the matching and localization prediction. In this way, the neural network performs matching and localization at the same time and directly outputs the result for the input pictures.
To perform matching and localization, the neural network extracts features from the two pictures (the picture to be matched and the picture to be searched) separately, combines the two sets of features, and predicts a bounding box based on the combined features. More specifically, the first branch processes the picture to be matched into a first feature vector, and the second branch processes the picture to be searched into N feature vectors corresponding to N regions. The first feature vector is then combined with each of the N region feature vectors to obtain N combined vectors, and a bounding box regression algorithm from the field of target detection is applied to the N combined vectors to regress the predicted bounding box. The neural network thus directly outputs the matching and localization result for the two input pictures. The specific structure and implementation of this neural network are described below.
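Before turning to the detailed structure, the dataflow just described can be summarized in a minimal sketch. PyTorch is assumed here purely for illustration; the channel sizes, the 4 × 4 region grid, and all module names are hypothetical choices, not taken from the embodiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchLocateNet(nn.Module):
    def __init__(self, dim=256, grid=4):
        super().__init__()
        def conv_branch():
            return nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, dim, 3, padding=1), nn.ReLU())
        self.first_conv_net = conv_branch()    # branch 1: picture to be matched
        self.second_conv_net = conv_branch()   # branch 2: picture to be searched
        self.grid = grid                        # N = grid * grid regions
        self.region_head = nn.Linear(dim, 1)    # per-region score ("first hidden layer")
        self.box_head = nn.Linear(dim, 5)       # per-region (x, y, w, h, confidence)

    def forward(self, query, search):
        # branch 1: convolution + pooling -> first feature vector Fs of dimension D
        fs = F.adaptive_avg_pool2d(self.first_conv_net(query), 1).flatten(1)      # (B, D)
        # branch 2: convolution -> one D-dim feature vector per region
        fmap = self.second_conv_net(search)                                        # (B, D, H, W)
        f_i = F.adaptive_avg_pool2d(fmap, self.grid).flatten(2).transpose(1, 2)    # (B, N, D)
        # combination layer: element-wise ("dot product") combination
        combined = fs.unsqueeze(1) * f_i                                           # (B, N, D)
        # bounding box regression layer: region probabilities and per-region boxes
        region_probs = torch.softmax(self.region_head(combined).squeeze(-1), -1)   # (B, N)
        boxes = self.box_head(combined)                                            # (B, N, 5)
        return region_probs, boxes

probs, boxes = MatchLocateNet()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 256, 256))
```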
Fig. 2 shows a schematic structural diagram of a neural network system for performing matching positioning of pictures according to an embodiment. It is understood that the neural network system may be implemented by any device, apparatus, platform, cluster of devices having computing and processing capabilities, such as the computing platform shown in fig. 1. As shown in fig. 2, the neural network system includes at least a first convolutional network 21, a second convolutional network 22, a combining layer 23, and a bounding box regression layer 24. The following describes the implementation of the above network layers.
The first convolution network 21 is configured to perform feature processing on the picture to be matched, hereinafter referred to as the first picture, to generate a corresponding feature vector. Generally, the first picture is a close-up or detail view.
Specifically, the first convolution network 21 includes a first convolution layer 211 and a pooling layer 212, where the first convolution layer performs convolution processing on the first picture to obtain a first convolution feature map corresponding to the first picture; the pooling layer 212 performs a pooling operation on the first convolution feature map to generate a first feature vector corresponding to the first picture.
The convolutional layer is the most basic and important network layer in a convolutional neural network (CNN) and is used to perform convolution processing on an image. Convolution is an operation commonly employed to analyze images. Specifically, a convolution kernel (operator), which is a small weight matrix, typically a square grid such as a 3 x 3 matrix in which each cell holds a weight, is applied to every position of the image. The kernel slides over the pixel matrix of the picture; at each step, every element of the kernel is multiplied by the pixel value it covers and the products are summed, and the resulting matrix of new feature values forms a convolution feature map. Depending on the design of the kernel, the convolution operation can extract abstract features from the pixel matrix of the original picture, for example more global features such as line shapes and color distributions of regions in the original picture.
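As an illustration of the multiply-and-sum just described, the following sketch slides a single 3 x 3 kernel over a small image with NumPy; the image, kernel values, and stride of 1 are made-up choices for demonstration only.

```python
import numpy as np

# A toy 5 x 5 grayscale image and a 3 x 3 kernel (illustrative values only).
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[ 0., -1.,  0.],
                   [-1.,  4., -1.],
                   [ 0., -1.,  0.]])

kh, kw = kernel.shape
out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):                               # slide the kernel one step at a time
    for j in range(out_w):
        window = image[i:i + kh, j:j + kw]           # pixels covered by the kernel
        feature_map[i, j] = np.sum(window * kernel)  # multiply element-wise and sum
print(feature_map)                                   # 3 x 3 convolution feature map
```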
In one embodiment, the first convolution layer 211 may include one or more convolutional layers, each performing a convolution process on the image. After processing by these convolutional layers, a convolution feature map corresponding to the first picture is obtained.
In one embodiment, the first convolutional layer 211 may include a plurality of convolutional layers, with at least one ReLU (Rectified Linear Unit) activation layer inserted between them or after some of them to apply a nonlinear mapping to the convolutional output.
In one embodiment, the first convolutional layer 211 includes a plurality of convolutional layers, with at least one pooling layer inserted between them to pool the convolutional output; the pooled result may then be fed into the next convolutional layer for further convolution.
One skilled in the art will appreciate that the first convolutional layer 211 may be designed to include one or more convolutional layers, optionally with ReLU activation layers and/or pooling layers added between the convolutional layers as desired.
The pooling layer 212 is used to perform an additional pooling operation on the first convolution feature map corresponding to the first picture. The pooling operation may include max pooling, average pooling, and the like.
Generally, in picture matching and localization, the picture to be matched is a close-up or detail view while the picture to be searched is a distant or global view, so the features of the picture to be matched need to be "reduced" before they can be compared with the features of each region of the picture to be searched. Here, the first picture is the picture to be matched; therefore, in the first convolution network 21, the additional pooling operation that the pooling layer 212 performs on the first convolution feature map reduces the dimensionality of the first picture's features, which facilitates the subsequent combination with the region features of the picture to be searched and the subsequent network processing. Through pooling, the pooling layer 212 obtains a feature vector corresponding to the first picture, referred to as the first feature vector and denoted Fs. Assume the dimensionality of this feature vector is D.
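For instance, the reduction of the first convolution feature map to the D-dimensional vector Fs could be realized as a global pooling over the spatial dimensions, as in the following sketch; the feature-map size and the specific pooling choices are assumptions for illustration only.

```python
import numpy as np

D = 8                                           # illustrative channel count
conv_feature_map = np.random.rand(D, 16, 16)    # first convolution feature map (D x H x W)

# Global max pooling: one value per channel, giving the first feature vector Fs.
Fs_max = conv_feature_map.max(axis=(1, 2))      # shape (D,)
# Average pooling is an equally valid choice.
Fs_avg = conv_feature_map.mean(axis=(1, 2))     # shape (D,)
print(Fs_max.shape, Fs_avg.shape)
```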
On the other hand, the second convolution network 22 is configured to perform convolution processing on the second picture, that is, the picture to be searched, to obtain N feature vectors respectively corresponding to the N regions included in the second picture; each of these feature vectors has the same dimensionality as the first feature vector Fs, namely D.
FIG. 3 illustrates a schematic diagram of a second convolutional network according to one embodiment. As shown in fig. 3, the second convolutional network 22 includes a second convolutional layer 221 and a feature extraction layer 222.
The second convolutional layer 221 may be designed to include one or more convolutional layers, optionally with ReLU activation layers and/or pooling layers added between them as needed. After the second convolutional layer 221 performs convolution processing on the second picture, a second convolution feature map corresponding to the second picture is obtained.
In one embodiment, the structure and convolution processing operations of the second convolution layer 221 are identical to those of the first convolution layer 211. In this case, the second convolutional layer 221 and the first convolutional layer 211 may reuse the same convolutional layers and share weight parameters; in other words, they may be a common (shared) convolutional layer, as shown by the dashed line in fig. 3.
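In that shared-weight case, both pictures would simply pass through one and the same convolution module, as in this short sketch (PyTorch assumed; sizes are illustrative).

```python
import torch
import torch.nn as nn

shared_conv = nn.Sequential(                    # one set of weights...
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())

first_map = shared_conv(torch.rand(1, 3, 64, 64))      # ...applied to the picture to be matched
second_map = shared_conv(torch.rand(1, 3, 256, 256))   # ...and to the picture to be searched
print(first_map.shape, second_map.shape)
```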
The second convolution feature map obtained by the second convolution layer 221 is input to the feature extraction layer 222, and the feature extraction layer 222 extracts N feature vectors corresponding to the N regions included in the second picture, respectively, based on the second convolution feature map.
In one embodiment, the N regions are obtained by dividing according to a predetermined division rule. For example, the predetermined division rule may divide the picture to be searched into 4 segments along each of its length and width, yielding 4 x 4 = 16 regions.
It can be understood that a certain mapping relationship exists between a convolution feature map and the original picture. Therefore, when the second picture is divided according to the above division rule, the second convolution feature map can be correspondingly divided into N sub-feature maps, each of which corresponds to one region of the original picture. In one embodiment, the second picture may first be divided into N regions, which are then input into the second convolution layer 221 separately, so that the second convolution layer 221 performs convolution processing on each of the N regions and obtains N sub-feature maps; together, these N sub-feature maps constitute the second convolution feature map corresponding to the second picture. In another embodiment, the entire second picture may be input into the second convolution layer 221 directly, so that the second convolution layer 221 performs convolution processing on the whole picture to obtain the second convolution feature map, which is then divided into N sub-feature maps according to the division rule. Next, the feature extraction layer 222 performs feature extraction based on the second convolution feature map, more specifically based on each sub-feature map obtained under the division rule, thereby obtaining N feature vectors corresponding respectively to the N regions of the second picture.
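A sketch of the convolve-then-split variant under a 4 x 4 division rule is given below (NumPy, illustrative sizes); reducing each sub-feature map to one D-dimensional region vector by average pooling is one possible choice of feature extraction, assumed here for illustration.

```python
import numpy as np

D, H, W, grid = 8, 32, 32, 4                    # illustrative sizes; N = grid * grid = 16
second_feature_map = np.random.rand(D, H, W)    # second convolution feature map

region_vectors = []
h_step, w_step = H // grid, W // grid
for r in range(grid):
    for c in range(grid):
        # sub-feature map for region (r, c), mirroring the division of the original picture
        sub_map = second_feature_map[:, r * h_step:(r + 1) * h_step,
                                        c * w_step:(c + 1) * w_step]
        region_vectors.append(sub_map.mean(axis=(1, 2)))   # one D-dim vector per region
F = np.stack(region_vectors)                                # shape (N, D) = (16, 8)
print(F.shape)
```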
In another embodiment, the N regions of the second picture are generated in a neural network according to a predetermined algorithm.
Fig. 4 shows a schematic structural diagram of a second convolutional network according to another embodiment. In the second convolutional network shown in fig. 4, a region generating module is further included for generating N regions in the second picture according to a predetermined algorithm.
In one example, the overall neural network system draws on the R-CNN (Region CNN) network model or the Fast R-CNN network model used for target detection. In both models, selective search is used to generate candidate regions, also called regions of interest (RoI), and the generated candidate regions may serve as the regions here. More specifically, in R-CNN the candidate regions are generated from the original picture (shown by dotted lines), while in Fast R-CNN they are generated from the extracted convolution feature map. When R-CNN or Fast R-CNN is taken as the reference, the functions of the region generation module described above can also be implemented jointly by the second convolutional layer and the feature extraction layer, rather than as a separate module.
In another example, the overall neural network system draws on the further developed Faster R-CNN network model, which introduces a region proposal network (RPN) dedicated to generating, or proposing, candidate regions. In that case, the region generation module in fig. 4 corresponds to the RPN and generates the N regions based on the convolution feature map obtained from the convolution processing.
In yet another example, the overall neural network system is based on the YOLO network model, in which the second picture is divided into a x b regions, where N = a x b. Accordingly, the region generation module may employ the YOLO partitioning scheme to generate the regions.
Although figs. 3 and 4 show examples in which the second convolutional network is divided into a second convolutional layer and a feature extraction layer, the specific implementation of the second convolutional network is not limited thereto. In one example, the second convolution network performs the convolution processing and the region feature extraction jointly, directly outputting the feature vector of each region.
Next, the N feature vectors corresponding to the N regions output by the second convolutional network 22, together with the first feature vector Fs of the first picture output by the first convolutional network 21, are input to the combining layer 23, where a combining operation is performed. As described above, because the first convolution network 21 processes the first picture into the first feature vector Fs and the second convolution network 22 processes the second picture into N feature vectors corresponding to the N regions, all with the same dimensionality D, the combination operation between the vectors is straightforward.
Specifically, the combination layer 23 performs a combination operation on the first feature vector Fs and N feature vectors corresponding to N regions of the second picture, respectively, so as to obtain N combination vectors.
In one embodiment, the combining operation comprises taking the difference, or the average, of the corresponding elements of the two vectors.
More preferably, in one embodiment, the combining operation comprises a dot product between vectors, i.e. a multiplication between corresponding elements.
Specifically, assume that the first feature vector Fs can be expressed as:
Fs = (a1, a2, ..., aD)
and that the N feature vectors corresponding to the N regions of the second picture are F1, F2, ..., FN, where the i-th feature vector Fi can be expressed as:
Fi = (b1, b2, ..., bD)
Then the dot product of the first feature vector Fs and the feature vector Fi of the i-th region yields the combined vector Vi, where:
Vi = (a1*b1, a2*b2, ..., aD*bD)
In this way, a combined vector of the first feature vector Fs with the feature vector of each region is obtained, giving N combined vectors V1, V2, ..., VN.
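With concrete (made-up) numbers, the combination operation looks like this:

```python
import numpy as np

Fs = np.array([0.2, 0.5, 1.0, 0.0])          # first feature vector, D = 4 (illustrative)
F = np.array([[0.1, 0.4, 0.9, 0.3],          # F1 ... FN, one row per region (N = 3 here)
              [0.0, 0.2, 0.1, 0.8],
              [0.6, 0.5, 1.0, 0.0]])

V = Fs * F     # element-wise ("dot product") combination, broadcast over the N rows
print(V)       # V[i] = (a1*b1, a2*b2, ..., aD*bD) for region i
```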
These combined vectors are then input to the bounding box regression layer 24, which outputs information of a predicted bounding box in the second picture based at least on the N combined vectors; the predicted bounding box indicates the region of the second picture that contains the picture content of the first picture.
It will be appreciated that the aforementioned R-CNN, Fast R-CNN, and YOLO network models, as well as some other network models, may be used for target detection. In conventional target detection, the picture to be detected is likewise divided into regions to obtain feature vectors corresponding to the regions, and these feature vectors are then input into the classification and regression layers of the network to perform target detection. The task of target detection can be divided into target classification and bounding box regression. Target classification predicts the category of the target object, and bounding box regression determines a minimum rectangular box (bounding box) containing the target object.
Drawing on such target detection algorithms, the bounding box regression layer 24 in fig. 2 may use a bounding box regression algorithm from target detection to produce the predicted bounding box from the N regions of the second picture based on the N combined vectors.
As described above, each of the N combined vectors is obtained by combining the feature vector of the first picture (the picture to be matched) with one of the N feature vectors corresponding to the N regions of the second picture (the picture to be searched). The N combined vectors therefore reflect the similarity between the first picture and the regions of the second picture; equivalently, they can be regarded as the feature vectors of N overlay images in which the first picture is superimposed on the respective regions of the second picture. Performing bounding box regression on these N combined vectors is then equivalent to performing bounding box regression in target detection on the feature vectors of those N overlay images, and the resulting bounding box can serve as the region of the second picture that contains the content of the first picture.
FIG. 5 shows a schematic structural diagram of a bounding box regression layer, according to an embodiment. As shown in fig. 5, the bounding box regression layer 24 may include a first hidden layer 241, a second hidden layer 242, and an output layer 243.
The first hidden layer 241 is used to determine the region probabilities P(R1), P(R2), ..., P(RN) that the first picture appears in each of the N regions of the second picture.
In one embodiment, the region probabilities are normalized by softmax, so the region probabilities of all regions sum to 1.
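For example, softmax normalization of raw per-region scores could be done as follows (the scores are made-up values):

```python
import numpy as np

scores = np.array([0.3, 2.1, -0.5, 1.0])                # raw per-region scores (illustrative)
region_probs = np.exp(scores) / np.exp(scores).sum()    # softmax normalization
print(region_probs, region_probs.sum())                 # probabilities summing to 1.0
```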
Next, the second hidden layer 242 applies a bounding box regression algorithm to generate candidate bounding boxes in at least one region and obtains a confidence for each candidate bounding box. It is understood that during training of a bounding box regression algorithm, the intersection over union (IoU) between the predicted box and the labeled box is computed alongside the predicted box, and this IoU can serve as a measure of confidence; accordingly, in the prediction phase, the bounding box regression algorithm likewise generates a candidate bounding box together with its predicted IoU as its confidence.
In one embodiment, the second hidden layer 242 selects a region with the highest region probability from the regions, and performs a bounding box regression algorithm on the region to generate a candidate bounding box.
In another embodiment, the region probabilities are first filtered against a preset probability threshold and the regions whose probability falls below the threshold are removed; the second hidden layer 242 then applies the bounding box regression algorithm only to each of the remaining regions to generate candidate bounding boxes.
In yet another embodiment, the second hidden layer 242 performs a bounding box regression algorithm for each region to generate a candidate bounding box.
In one embodiment, for each region it processes, the second hidden layer 242 generates a plurality of candidate bounding boxes by executing the bounding box regression algorithm and computes a confidence for each candidate bounding box.
In another embodiment, for each region it processes, the second hidden layer 242 generates a plurality of preliminary bounding boxes by executing the bounding box regression algorithm and then selects the one with the highest confidence as that region's candidate bounding box.
Through the above various manners, the first hidden layer 241 determines the region probability of each region, and the second hidden layer 242 generates candidate bounding boxes for at least some of the regions and obtains the confidence of each candidate bounding box. Next, the output layer 243 outputs the information of the predicted bounding box according to the region probability of each region and the confidence of each candidate bounding box.
Specifically, depending on how the second hidden layer 242 is executed, it may output multiple candidate bounding boxes, which may lie in one region or in several regions. In general, candidate boxes generated from regions with higher region probability also have higher confidence, but exceptions can occur. The output layer 243 therefore jointly considers the region probability of the region in which each candidate box lies and the confidence of the candidate box itself, and selects the most likely box as the prediction result.
In one embodiment, for the candidate bounding boxes obtained by the second hidden layer 242, the output layer 243 computes, for each candidate box, the product of the region probability of the region in which it lies and its confidence, and selects the candidate box with the largest product as the predicted bounding box.
In another embodiment, the output layer 243 computes, for each candidate box, the sum of the region probability of its region and its confidence, and selects the candidate box with the largest sum as the predicted bounding box.
In yet another embodiment, the output layer 243 first selects the region with the highest region probability and then, within that region, selects the candidate bounding box with the highest confidence as the predicted bounding box.
Thus, by jointly considering region probability and confidence, the output layer 243 outputs the information of an optimal predicted bounding box. Generally, this information includes at least the position coordinates of the predicted box, usually expressed in the form (x, y, w, h), where x and y give the position of the box center, w is the box width, and h is the box height.
In one embodiment, the output layer 243 also outputs the region probability and/or confidence of the predicted bounding box as supplemental information.
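A sketch of the product-based selection performed by the output layer, including the supplemental outputs just mentioned, is given below; the candidate boxes and all numeric values are made-up for illustration.

```python
# Each candidate: (region_index, (x, y, w, h), confidence); values are illustrative.
candidates = [
    (2, (0.31, 0.40, 0.20, 0.25), 0.82),
    (2, (0.35, 0.42, 0.22, 0.28), 0.76),
    (7, (0.70, 0.15, 0.10, 0.12), 0.90),
]
region_probs = {2: 0.55, 7: 0.30}   # region probabilities of the regions involved

# Pick the candidate whose (region probability x confidence) product is largest.
best = max(candidates, key=lambda c: region_probs[c[0]] * c[2])
region_idx, (x, y, w, h), conf = best
print("predicted box:", (x, y, w, h),
      "region prob:", region_probs[region_idx], "confidence:", conf)
```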
Specific example implementations of the bounding box regression layer are described above, but its implementation is not limited thereto. For example, in one implementation, the bounding box regression layer may include several convolutional layers before the network layer that performs the bounding box regression algorithm, so that each combined vector undergoes further convolution processing before bounding box regression. In another implementation, the bounding box regression layer may directly apply the bounding box regression algorithm to each region to generate candidate bounding boxes, without determining the region probability of each region. In yet another implementation, the bounding box regression layer estimates the region probabilities and generates the candidate bounding boxes within them through a single integrated network layer. Accordingly, the bounding box regression layer may have other network structures.
As described above, the bounding box regression layer 24 outputs the information of the predicted bounding box as the prediction result based on the combined vectors corresponding to the regions.
Although in the above example both the combination layer 23 and the bounding box regression layer 24 are each shown as a single network layer, the implementation is not limited thereto. For example, YOLOv3 proposes a multi-scale prediction method; correspondingly, a neural network system based on YOLOv3 may contain several "combination layer + bounding box regression layer" combinations. In that case, the convolution feature maps of one or more convolutional layers may be tapped from the convolutional layers of the first convolution network and the second convolution network and fed into the corresponding "combination layer + bounding box regression layer" for processing.
FIG. 6 illustrates a prediction result diagram according to one embodiment. In fig. 6, the left image is a first image, i.e., an image to be matched, and the right image is a second image, i.e., an image to be searched. After the first picture and the second picture are input into the neural network system shown in fig. 2, a prediction bounding box may be output in the second picture, the prediction bounding box showing an area of the second picture containing picture content of the first picture. As shown in fig. 6, two numbers are also placed above the prediction bounding box, the first number representing the region probability of the region in which the prediction bounding box is located, and the second number representing the confidence level of the prediction bounding box (or prediction IoU).
Thus, the two-branch neural network system shown in fig. 2 achieves fast matching and localization of pictures, using a bounding box to select, in the picture to be searched, the region that contains the picture to be matched. Because matching and localization are performed simultaneously, processing efficiency and performance are improved.
In one embodiment, the above neural network system is trained end-to-end in advance on training samples. To train such a neural network system, the training samples need to include a plurality of picture pairs, each picture pair including a first training picture and a second training picture, the second training picture being labeled with a target box showing the region of the second training picture that contains the picture content of the first training picture. The target box labeled in this way serves as the reference data (ground truth) for training the neural network system.
Specifically, the training process may include inputting the first training picture and the second training picture into the first convolution network and the second convolution network of the neural network system, respectively, with the bounding box regression layer outputting a predicted bounding box. The predicted box is compared with the labeled target box, the comparison result is taken as the prediction error, the error is back-propagated, and the parameters of each network layer in the neural network system are adjusted, for example by gradient descent, thereby training the neural network system.
In one embodiment, the bounding box regression layer 24 takes the structure shown in FIG. 5, including a first hidden layer, a second hidden layer, and an output layer. In such a case, the process of training the neural network system specifically includes the following steps.
As described above, the first training picture and the second training picture are input into the first convolution network and the second convolution network, respectively, so as to obtain the feature vector of the first training picture and the feature vectors of the N regions of the second training picture. These feature vectors are combined to obtain N combined vectors corresponding to the N regions.
It is understood that the second training picture is labeled with a target box; therefore, the region of the N regions in which the target box is located can be determined from the position of the target box, and the region label of the target box is determined from that region.
Based on the N combined vectors, the first hidden layer predicts the region probability that the first training picture appears in each region.
Then, candidate bounding boxes are generated in each region through the second hidden layer, and the intersection over union (IoU) of each candidate bounding box with the target box is determined as the confidence of that candidate box.
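The intersection over union used here as the confidence target could be computed as in the following sketch, with boxes given in the (x, y, w, h) center form described above; the example boxes are made-up values.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # width of the overlap
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))   # height of the overlap
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

print(iou((0.5, 0.5, 0.4, 0.4), (0.55, 0.5, 0.4, 0.4)))  # candidate box vs. target box
```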
Then, the parameters of the first hidden layer, the second hidden layer, and the output layer are adjusted based at least on the region label, the predicted region probabilities, and the confidences of the candidate bounding boxes, so as to train the neural network. The region label serves as the reference data for the region probability, so an error related to the region probability can be determined by comparing the predicted region probabilities with the region label. In addition, the IoU of a candidate box with the target box reflects the error in the position and size of that candidate box; these two parts of the error can therefore be obtained from the region label, the predicted region probabilities, and the confidences of the candidate boxes. The prediction error may also include errors between the candidate box and the target box in size and position, such as errors in the (x, y, w, h) values. The total error is then back-propagated through the neural network system to adjust the parameters, thereby training the neural network system.
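One way to assemble those error terms into a training loss is sketched below; PyTorch is assumed, and the specific loss functions and their equal weighting are illustrative assumptions, not prescribed by the embodiments.

```python
import torch
import torch.nn.functional as F

def matching_loss(region_logits, pred_boxes, pred_conf,
                  region_label, target_box, target_iou):
    """region_logits: (N,) per-region scores; pred_boxes: (N, 4) in (x, y, w, h);
    pred_conf: (N,) predicted confidences; region_label: index of the region that
    contains the target box; target_box: (4,); target_iou: (N,) IoU of each
    candidate box with the target box."""
    # Error of the predicted region probabilities against the region label.
    region_loss = F.cross_entropy(region_logits.unsqueeze(0),
                                  torch.tensor([region_label]))
    # Error of the predicted confidences against the IoU of each candidate box.
    conf_loss = F.mse_loss(pred_conf, target_iou)
    # Error of the candidate box coordinates, taken in the labeled region.
    box_loss = F.smooth_l1_loss(pred_boxes[region_label], target_box)
    return region_loss + conf_loss + box_loss   # equal weights, an arbitrary choice

loss = matching_loss(torch.rand(16), torch.rand(16, 4), torch.rand(16),
                     5, torch.rand(4), torch.rand(16))
# In a training loop, loss.backward() would back-propagate this error to adjust the parameters.
```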
Through the training process, the two-branch neural network system shown in fig. 2 can be obtained for performing fast matching and positioning of the picture.
According to an embodiment of another aspect, a method for picture matching positioning is also provided. Fig. 7 illustrates a method for picture matching positioning according to one embodiment. The method may be performed by a computer. As shown in fig. 7, the method includes at least the following steps.
In step 71, a first convolution process is performed on the first picture to obtain a first convolution feature map corresponding to the first picture, where the first picture is a picture to be matched.
In step 72, a pooling operation is performed on the first convolution feature map to generate a first feature vector having a first number of dimensions.
In step 73, convolution processing is performed on the second picture to obtain N feature vectors corresponding to the N regions included in the second picture, where the dimensionality of each of the N feature vectors is the first number; the second picture is a picture to be searched.
In step 74, a combination operation is performed on the first feature vector with each of the N feature vectors, respectively, to obtain N combined vectors.
In step 75, based at least on the N combined vectors, a bounding box regression algorithm is employed to output information of a predicted bounding box in the second picture, the predicted bounding box indicating the portion of the second picture that contains the picture content of the first picture.
According to one embodiment, step 73 further comprises: performing a second convolution process on the second picture to obtain a second convolution feature map corresponding to the second picture; and then extracting the N feature vectors respectively corresponding to the N regions based on the second convolution feature map.
In one embodiment, the second convolution process is the same as the first convolution process in step 71.
According to one possible design, the N regions are obtained by dividing according to a predetermined division rule.
According to another design, the N regions are generated by a selective search algorithm, or by a region generation network.
In one embodiment, the combining operation in step 74 comprises a vector dot product operation.
FIG. 8 shows a flowchart of the sub-steps of step 75 for determining the predicted bounding box, according to one embodiment. As shown in fig. 8, step 75 further comprises:
a step 751 of determining a region probability of the first picture appearing in each of the N regions;
a step 752 of generating candidate bounding boxes in at least one region and obtaining the confidence of each candidate bounding box;
and a step 753 of outputting information of the predicted bounding box according to the region probability of each region and the confidence of each candidate bounding box, wherein the information of the predicted bounding box comprises the coordinates of the predicted bounding box, and the region probability and the confidence corresponding to the predicted bounding box.
According to one embodiment, step 752 includes generating candidate bounding boxes in regions whose region probability is greater than a preset threshold.
According to one embodiment, step 753 further includes determining the candidate bounding box that has the largest product of the corresponding region probability and confidence as the predicted bounding box.
According to one embodiment, the method is implemented by a neural network system obtained by end-to-end training on training samples; the training samples include a plurality of picture pairs, each picture pair including a first training picture and a second training picture, the second training picture being labeled with a target box showing the region of the second training picture that contains the picture content of the first training picture.
Further, in one possible design, the end-to-end training includes:
determining, according to the position of the target box, the specific region among the N regions of the second training picture in which the target box is located, and determining a region label according to that specific region;
predicting the predicted region probability that the first training picture appears in each region;
generating candidate bounding boxes in each region;
determining the intersection over union (IoU) of each candidate bounding box with the target box as the confidence of that candidate bounding box;
and adjusting the network layer parameters of the first hidden layer and the second hidden layer based at least on the region label, the predicted region probabilities, and the confidences of the candidate bounding boxes, thereby training the neural network system.
According to an embodiment of another aspect, an apparatus for picture matching positioning is also provided. Fig. 9 shows a schematic block diagram of a picture matching positioning apparatus according to an embodiment. It is to be appreciated that the apparatus can be implemented by any device, apparatus, platform, cluster of devices having computing, processing capabilities.
As shown in fig. 9, the apparatus 900 includes:
a first convolution unit 91 configured to perform a first convolution process on the first picture to obtain a first convolution feature map corresponding to the first picture; the first picture is a picture to be matched;
a pooling unit 92 configured to perform a pooling operation on the first convolution feature map to generate a first feature vector with a first number of dimensions;
a second convolution unit 93, configured to perform convolution processing on the second picture to obtain N feature vectors corresponding to N regions included in the second picture, respectively, where dimensions of the N feature vectors are the first number; the second picture is a picture to be searched;
a combining unit 94 configured to perform a combining operation on the first feature vector and the N feature vectors respectively to obtain N combined vectors;
a prediction unit 95 configured to output information of a predicted bounding box in the second picture by using a bounding box regression algorithm based on at least the N combined vectors, the predicted bounding box indicating a portion of the second picture containing picture content of the first picture.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the neural network system described in connection with fig. 2, or the method described in connection with fig. 7.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor implementing the neural network system described in conjunction with fig. 2, or the method described in conjunction with fig. 7, when executing the executable code.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.