
CN117115483A - Template matching method, medium and equipment based on heterogeneous image feature fusion - Google Patents

Template matching method, medium and equipment based on heterogeneous image feature fusion

Info

Publication number
CN117115483A
CN117115483A
Authority
CN
China
Prior art keywords
image
matching
template
feature
matching model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310905813.4A
Other languages
Chinese (zh)
Inventor
陈初杰
张子恒
瞿崇晓
李彤
陈碧乾
尉婉丽
李俊薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN202310905813.4A priority Critical patent/CN117115483A/en
Publication of CN117115483A publication Critical patent/CN117115483A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 - Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/0895 - Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of image processing, and in particular to a template matching method based on heterogeneous image feature fusion, which comprises the following steps: A, acquiring an image set and preprocessing it to obtain a background image and a template image; B, establishing a matching model comprising an encoder, a feature mapper and a decoder connected in sequence, and feeding the background image and the template image in the acquired image set into the matching model for matching; C, training the established matching model based on the preprocessed image set and calculating a loss function until the loss function converges to a preset value, to obtain a matching model comprising the trained encoder, feature mapper and decoder; and D, matching the test images by using the trained matching model. Template matching between heterogeneous images can be realized end to end, labels are generated automatically during training to realize self-supervised learning without a large amount of manual annotation, and the method is more efficient and accurate than existing methods.

Description

A template matching method, medium and equipment based on heterogeneous image feature fusion

Technical Field

The invention relates to the technical field of image processing, and in particular to a template matching method based on heterogeneous image feature fusion.

Background

Template matching refers to determining the spatial mapping relationship between two images, a template image and a background image, so as to locate a specific position in the background image. In general, the main difficulty of image template matching is that the template and the background image are often heterogeneous images, i.e., images formed under different light sources, whose imaging characteristics differ greatly.

Existing template matching methods usually rely on hand-crafted image feature points: image corner points are extracted as feature points, for example by SIFT, ORB and similar methods, a feature-point matching algorithm then determines the correspondence between the feature points, and the homography matrix describing the image mapping relationship is further computed, thereby determining the matching relationship between the two images.

Such methods have achieved good results in some everyday scenarios, but they fail for template matching between heterogeneous images whose imaging characteristics differ greatly. In addition, the feature points extracted by traditional methods often contain abnormal points, called outliers, which must be removed iteratively by a complex random sample consensus procedure; this is time-consuming and inefficient. Moreover, the classic decoder in the prior art consists of six decoding modules with identical structure, which has a large number of parameters and low inference efficiency.

Traditional methods must first extract feature points, then determine the matches between the feature points through feature-point descriptors, and further remove abnormal points with the RANSAC method; template matching therefore takes two separate steps, consuming substantial computing resources and yielding low efficiency.

Summary of the Invention

In view of the above problems in the prior art, the present invention provides a template matching method based on heterogeneous image feature fusion, which realizes template matching between heterogeneous images end to end and automatically generates labels during training, enabling self-supervised learning without a large amount of manual annotation; it is more efficient and accurate than existing methods.

In a first aspect, embodiments of the present application provide a template matching method based on heterogeneous image feature fusion, comprising the steps of:

A, acquiring an image set and preprocessing it to obtain a background image and a template image, wherein the image set comprises a training image set and test images;

B, establishing a matching model comprising an encoder, a feature mapper and a decoder connected in sequence, and feeding the background image and the template image in the acquired image set into the matching model for matching;

C, training the established matching model based on the preprocessed image set and calculating a loss function until the loss function converges to a preset value, to obtain a matching model comprising the trained encoder, feature mapper and decoder;

D, matching the test images using the trained matching model.

In an optional solution of the first aspect, in step A the preprocessing specifically includes:

A1, randomly selecting two images from the image set, denoted the first image and the second image, cropping a patch of random size from the first image, denoted the first crop, and recording the coordinates of the key points of the first crop on the first image as a training positive sample;

A2, cropping a patch from the second image at the same position as on the first image, denoted the second crop, as a training negative sample.

In another optional solution of the first aspect, in step A the preprocessing further includes applying data augmentation to the first image and the second image. Data augmentation refers to image transformations, including scaling, translation, rotation, perspective transformation, random noise addition, random contrast changes, random brightness changes, and random occlusion of a selected region.

In another optional solution of the first aspect, in step B, before the background image and the template image in the acquired image set are fed into the matching model, they are further divided into grids and a preset proportion of the grid cells is randomly masked, in order to reduce redundant image information and improve the robustness of the matching model.

In another optional solution of the first aspect, in step B, the feature mapper in the matching model comprises a convolutional layer and a linear mapping layer with a preset output dimension connected in sequence; the decoder comprises a self-attention structure, a residual structure and a cross self-attention structure connected in sequence;

Cross self-attention structure: the feature vectors of the background image are dot-multiplied with the feature vectors of the template image, and the feature vectors of the template image are dot-multiplied with the feature vectors of the background image, to obtain the weights.

In another optional solution of the first aspect, in step B, before the images are input to the decoder, the position information is also encoded, as follows:

where PE denotes the position encoding vector, pos denotes the grid index, d_index denotes the element position in the encoding feature vector, i denotes the element index in the encoding feature vector, and d denotes the dimension of the encoding feature vector; the value of the dimension d is n, where n is an integer.

In another optional solution of the first aspect, in step C, the loss function, denoted L, is computed as:

L = λ1·Lconfidence + λ2·Lloc

where Lconfidence denotes the loss value of whether a key point exists at a position, λ1 denotes the weight of that loss value, Lloc denotes the loss value of the key-point coordinates, λ2 denotes the weight of that loss value, ci denotes the true value of whether the i-th key point is moved (1 if moved, 0 if not moved), pi denotes the predicted value of whether the i-th key point exists, xi and yi denote the true abscissa and ordinate of the i-th key point, and x̂i and ŷi denote the predicted abscissa and ordinate of the i-th key point.

In another optional solution of the first aspect, step D specifically includes:

D1, obtaining preprocessed test images, including a preprocessed background image and a preprocessed template image;

D2, inputting the preprocessed background image and template image into the trained matching model to obtain a matching result;

D3, analyzing the matching result with a preset confidence threshold for the key points: if the number of key points whose confidence exceeds the preset confidence value is greater than or equal to 3, the matching succeeds and the matching position is determined by the three key points with the highest confidence; if that number is less than 3, the background image and the template image do not match.

In a second aspect, embodiments of the present application provide a computer storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the template matching method based on heterogeneous image feature fusion provided by the first aspect or any implementation of the first aspect of the embodiments of the present application.

In a third aspect, embodiments of the present application provide an electronic device comprising a memory and a processor communicatively connected to each other, the memory storing computer instructions which the processor executes to perform the template matching method based on heterogeneous image feature fusion provided by the first aspect or any implementation of the first aspect of the embodiments of the present application.

The beneficial technical effects of the present invention include:

1. High-dimensional features of heterogeneous images are fused through an improved multi-head self-attention structure to discover the spatial matching relationship between the template image and the background image, so template matching of heterogeneous images can be completed effectively. Unlike existing image matching methods, which manually extract image corners as feature points and then match them, this approach overcomes their inapplicability to template matching scenarios, is more adaptable, and completes template matching in a single end-to-end step.

2. Labels are generated automatically during training, realizing self-supervised learning without a large amount of manual annotation; compared with existing methods, image matching is faster, more accurate and more efficient.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

Figure 1 is a flow chart of a template matching method based on heterogeneous image feature fusion according to the present invention;

Figure 2 is a block diagram of a template matching model based on heterogeneous image feature fusion according to the present invention;

Figure 3 is the background image of the verification experiment in an embodiment of the present invention;

Figure 4 is the template image of the verification experiment in an embodiment of the present invention;

Figure 5 is a schematic diagram of the matching result of the verification experiment in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application.

In the following description, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance. The following description provides multiple embodiments of the present application; different embodiments may be substituted or combined, so the present application should also be regarded as containing all possible combinations of the same and/or different embodiments described. Thus, if one embodiment contains features A, B and C and another embodiment contains features B and D, the present application should also be regarded as including an embodiment containing one or more of all other possible combinations of A, B, C and D, even though that embodiment may not be explicitly described in the following text.

The following description provides examples and does not limit the scope, applicability or examples set forth in the claims. Changes may be made in the function and arrangement of the described elements without departing from the scope of the disclosure. Various procedures or components may be omitted, substituted or added as appropriate in each example. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted or combined. In addition, features described with respect to some examples may be combined into other examples.

Embodiment 1:

Referring to Figure 1, a template matching method based on heterogeneous image feature fusion includes the steps of:

Step A: acquire an image set and preprocess it. The image set includes a training image set and test images. The training image set includes an open-source image data set, which may be COCO, ImageNet, etc., and a continuous image data set, i.e., consecutive image frames extracted from videos. The preprocessing includes label-file generation and data augmentation.

Step A specifically includes:

Step A1: randomly select two different images from the training image set, denoted the first image and the second image.

Step A2: randomly crop a patch of random size from the first image (the first crop) and record the coordinates of nine points of the first crop on the original image: the upper-left corner (x1, y1), the upper-right corner (x2, y2), the lower-left corner (x3, y3), the lower-right corner (x4, y4), the centre of the upper-left region (x5, y5), the centre of the upper-right region (x6, y6), the centre of the lower-left region (x7, y7), the centre of the lower-right region (x8, y8) and the centre of the crop (x9, y9). Normalize the nine point coordinates by the image width and height and store them in a label file as a training positive sample.
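
A minimal sketch of this positive-sample generation step is given below, assuming images are NumPy arrays of shape (H, W, 3); the function name, the crop-size bounds and the returned format are illustrative assumptions rather than details from the patent text.

```python
import numpy as np

def make_positive_sample(first_image, rng=None):
    rng = rng or np.random.default_rng()
    H, W = first_image.shape[:2]
    # random crop size and position (bounds are an assumption)
    cw = int(rng.integers(W // 8, W // 2))
    ch = int(rng.integers(H // 8, H // 2))
    x0 = int(rng.integers(0, W - cw))
    y0 = int(rng.integers(0, H - ch))
    crop = first_image[y0:y0 + ch, x0:x0 + cw]

    # the nine key points described above: four corners, four quadrant centres, crop centre
    pts = np.array([
        [x0,              y0              ],  # upper-left corner
        [x0 + cw,         y0              ],  # upper-right corner
        [x0,              y0 + ch         ],  # lower-left corner
        [x0 + cw,         y0 + ch         ],  # lower-right corner
        [x0 + cw * 0.25,  y0 + ch * 0.25  ],  # centre of upper-left region
        [x0 + cw * 0.75,  y0 + ch * 0.25  ],  # centre of upper-right region
        [x0 + cw * 0.25,  y0 + ch * 0.75  ],  # centre of lower-left region
        [x0 + cw * 0.75,  y0 + ch * 0.75  ],  # centre of lower-right region
        [x0 + cw * 0.5,   y0 + ch * 0.5   ],  # centre of the crop
    ], dtype=np.float32)

    # normalise by the width and height of the original (background) image
    pts[:, 0] /= W
    pts[:, 1] /= H
    return crop, pts  # crop = template image, pts = label of the positive sample
```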

Step A3: crop the second image at the same position as on the first image to obtain the second crop, and generate an empty label file as a training negative sample.

Step A4: apply data augmentation to the first crop and the second crop to obtain the first transformed crop and the second transformed crop.

Data augmentation refers to image transformations applied to the background image, including scaling, translation, rotation, perspective transformation, random noise addition, random contrast changes, random brightness changes, random occlusion of a selected region and similar operations, which increase the diversity of the data and improve the adaptability of the algorithm to multiple scenarios. For each training sample, at least two augmentation methods are randomly selected to process the image. Note that when an augmentation method affects the geometric structure of the image, such as scaling, translation, rotation or perspective transformation, the coordinate points in the corresponding label file must be transformed accordingly.
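
As a hedged sketch of how a geometric augmentation can keep the label consistent, the snippet below applies one random affine transform to both the image and its key points, and a photometric change that leaves the key points untouched; the parameter ranges are illustrative assumptions, not values from the patent.

```python
import cv2
import numpy as np

def augment_with_keypoints(image, keypoints_xy, rng=None):
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    angle = float(rng.uniform(-15, 15))      # random rotation in degrees
    scale = float(rng.uniform(0.8, 1.2))     # random scaling
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)  # 2x3 affine matrix
    M[:, 2] += rng.uniform(-0.05, 0.05, size=2) * (w, h)       # random translation

    aug_image = cv2.warpAffine(image, M, (w, h))

    # photometric change (contrast/brightness) does not affect the key points
    alpha, beta = float(rng.uniform(0.8, 1.2)), float(rng.uniform(-20, 20))
    aug_image = cv2.convertScaleAbs(aug_image, alpha=alpha, beta=beta)

    # apply the same affine transform to every key point in the label file
    pts = np.hstack([keypoints_xy, np.ones((len(keypoints_xy), 1))])  # homogeneous coords
    aug_pts = pts @ M.T
    return aug_image, aug_pts
```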

Referring to Figure 2, Step B: establish a matching model comprising an encoder, a feature mapper and a decoder connected in sequence, and feed the background images and the template images of the training image set into the matching model for matching.

The present invention fuses heterogeneous image features based on a multi-head self-attention mechanism to realize template matching. The network structure is based on an encoder and a decoder and is redesigned according to the characteristics of the task addressed by the invention, so as to better capture the matching relationship between the background image and the template image. The inputs of the algorithm are the background image and the template image; the two images are represented by a series of feature vectors through the encoder and the feature mapper, and the improved decoder then fuses the two sets of image features to produce the encoded feature output.

Before being input to the encoder, the background image and the template image are preprocessed by a scaling operation to obtain a preprocessed background image and a preprocessed template image. The width of the preprocessed background image is w1*16 and its height is h1*16, where w1 and h1 are integers between 14 and 40; specifically, w1 and h1 are both set to 20, giving a preprocessed background image resolution of 320*320. The width of the preprocessed template image is w2*16 and its height is h2*16, where w2 and h2 are integers between 3 and 7; specifically, w2 and h2 are both set to 5, giving a preprocessed template image resolution of 80*80.

The preprocessed background image and the preprocessed template image are divided into grids, with each grid cell being 16×16 pixels; the preprocessed background image is divided into w1*h1 grid cells and the preprocessed template image into w2*h2 grid cells. Further, the preprocessed background image is divided into 400 grid cells and the preprocessed template image into 25 grid cells.

Note that in the training stage, proportions r1 and r2 of the grid cells are selected for random masking to reduce redundant image information and improve the robustness of the algorithm. After the divided image grid passes through the corresponding image feature mapping structure, w1*h1 background-image feature vectors of dimension n and w2*h2 template feature vectors of dimension n are obtained. Specifically, with r1 = 0.7 and r2 = 0.4, 400 background-image feature vectors of dimension 768 and 25 template feature vectors of dimension 768 are obtained.
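
The grid division and random masking can be sketched as below, assuming the image is already scaled so that its height and width are multiples of 16; masking is implemented here by zeroing the selected 16×16 cells, which is one possible reading of "random covering".

```python
import torch

def random_grid_mask(img, patch=16, mask_ratio=0.7, generator=None):
    # img: (C, H, W) tensor with H % patch == 0 and W % patch == 0
    c, h, w = img.shape
    gh, gw = h // patch, w // patch                # grid of gh * gw cells
    n_cells = gh * gw
    n_masked = int(round(mask_ratio * n_cells))
    order = torch.randperm(n_cells, generator=generator)
    masked = order[:n_masked]                      # cells selected for masking

    out = img.clone()
    for idx in masked.tolist():
        r, col = divmod(idx, gw)                   # row-major cell index
        out[:, r * patch:(r + 1) * patch, col * patch:(col + 1) * patch] = 0.0
    return out, masked

# With the values given in the text: a 320x320 background image gives 400 cells
# masked with ratio r1 = 0.7, and an 80x80 template image gives 25 cells with r2 = 0.4.
```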

The encoder comprises a self-attention structure and a residual structure connected in sequence. The background-image feature vectors and the template feature vectors are each transformed by three matrices to obtain the K, Q and V vectors; the Q of each vector is dot-multiplied with the K of all vectors to obtain weights, the weights are multiplied with V, and the weighted sum of the V values is taken as the output vector of the encoder. Specifically, the Q of each background-image feature vector is dot-multiplied with the K of the template feature vectors to obtain weights, and at the same time the Q of each template feature vector is dot-multiplied with the K of the background-image feature vectors to obtain weights; the weights are then multiplied with V, and the weighted sum of the V values is taken as the output vector of the encoder.

The output vectors of the encoder are feature-mapped by the feature mapper and then input into the improved decoder structure. The feature mapper comprises a convolutional layer and a linear mapping layer, i.e., a fully connected layer, connected in sequence. The convolutional layer uses a 16×16 convolution kernel with a stride of 16×16, followed by an activation function and batch normalization; specifically, the activation function is ReLU, and the convolutional layer is followed by a fully connected layer with output dimension n, where n = 768.

The feature mapper is a fully connected layer with an output dimension of n×n; specifically, the feature mapping structure is a fully connected layer with an output dimension of 768×768.
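
The two descriptions above leave the exact form of the feature mapper open; as one possible reading (a 16×16-kernel, stride-16 convolution with batch normalization and ReLU, followed by a linear layer of output dimension n = 768), a hedged sketch is given below. The class name, the BN/ReLU ordering and the use of the module on a 2-D map of 16×16 cells are assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn

class FeatureMapper(nn.Module):
    def __init__(self, in_channels=3, n=768):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, n, kernel_size=16, stride=16)
        self.bn = nn.BatchNorm2d(n)
        self.act = nn.ReLU(inplace=True)
        self.fc = nn.Linear(n, n)  # linear mapping layer with output dimension n

    def forward(self, x):                         # x: (B, C, H, W), H and W multiples of 16
        x = self.act(self.bn(self.conv(x)))       # (B, n, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)          # (B, num_cells, n): one vector per grid cell
        return self.fc(x)
```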

Since each grid cell occupies a specific position in its image, the position information must also be added to the background-image feature vectors and the template feature vectors obtained above. Before the vectors are input to the decoder, the position information is encoded to obtain position encoding vectors of the same dimension, as follows:

where PE denotes the position encoding vector; pos denotes the current grid index, following row-major order; d_index denotes the element position in the encoding feature vector; i denotes the element index in the encoding feature vector, equal to d_index divided by 2 and rounded down; and d denotes the dimension of the encoding feature vector, whose value is n, an integer; further, the value of the dimension d is 768.
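
The encoding formula itself appears as a figure in the original and is not reproduced above; the sketch below assumes the standard sinusoidal position encoding that is consistent with the variable definitions just given (sine for even element positions, cosine for odd ones, with i = d_index // 2), and should be read as an assumption rather than the patent's exact expression.

```python
import numpy as np

def position_encoding(num_cells, d=768):
    pe = np.zeros((num_cells, d), dtype=np.float32)
    pos = np.arange(num_cells)[:, None]             # grid index pos, row-major order
    i = np.arange(d)[None, :] // 2                  # i = d_index // 2
    angle = pos / np.power(10000.0, 2.0 * i / d)
    pe[:, 0::2] = np.sin(angle[:, 0::2])            # even d_index -> sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])            # odd d_index  -> cosine
    return pe                                       # added to the n-dimensional cell features
```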

After the position encoding vectors are obtained, each position encoding vector is added to the corresponding background-image feature vector or template feature vector to obtain the background-image input features and the template input features. The background-image input features and the template input features are then input into the improved decoder to obtain the output feature vectors.

The decoder comprises a self-attention structure, a residual structure and a cross self-attention structure connected in sequence. The decoder input is nine query key-value vectors; after the self-attention structure and the residual structure they give V, which together with the feature-encoded outputs K and Q of the encoder serves as the input of the cross self-attention structure, and the output feature vectors are obtained after vector addition and normalization.

In the cross self-attention structure, the first w1*h1 Qs, i.e., the feature vectors related to the background image, are dot-multiplied only with the last w2*h2 Ks to obtain weights, and the last w2*h2 Qs, i.e., the feature vectors related to the template image, are dot-multiplied only with the first w1*h1 Ks to obtain weights; the nine query key-values are trainable parameters.

Specifically, the first 400 Qs of the cross self-attention structure, i.e., the feature vectors related to the background image, are dot-multiplied only with the last 25 Ks to obtain weights, and the last 25 Qs, i.e., the feature vectors related to the template image, are dot-multiplied only with the first 400 Ks to obtain weights; the nine query key-values are trainable parameters.

The inputs of the decoder are the position-encoded background-image input features, the template input features and the query vectors, where the nine query vectors are obtained through training of the algorithm. The output of the decoder is nine decoded output vectors, which are passed through a feed-forward network (FFN) for further feature extraction, yielding nine 3-dimensional vectors; each vector is expressed as (p, x, y), where p denotes the confidence that the key point exists, x denotes the abscissa of the key point and y denotes its ordinate.
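
A hedged sketch of the two elements just described follows: the cross-attention restriction (the first 400 background queries attend only to the 25 template keys and the 25 template queries attend only to the 400 background keys) and a small feed-forward head that maps each of the nine decoded query vectors to (p, x, y). Layer sizes, the sigmoid on the outputs and the module layout are assumptions consistent with the text, not a verbatim reproduction of the patent's network.

```python
import torch
import torch.nn as nn

def restricted_cross_attention(q, k, v, n_bg=400, n_tp=25):
    # q, k, v: (B, n_bg + n_tp, d); build a mask so that background queries only
    # see template keys and template queries only see background keys.
    n = n_bg + n_tp
    mask = torch.full((n, n), float("-inf"))
    mask[:n_bg, n_bg:] = 0.0        # background Q -> template K
    mask[n_bg:, :n_bg] = 0.0        # template  Q -> background K
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + mask, dim=-1)
    return attn @ v

class KeypointHead(nn.Module):
    def __init__(self, d=768, hidden=256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, decoded):                   # decoded: (B, 9, d), one vector per query
        out = self.ffn(decoded)                   # (B, 9, 3)
        p = torch.sigmoid(out[..., 0])            # confidence that the key point exists
        xy = torch.sigmoid(out[..., 1:])          # normalised (x, y) of the key point
        return p, xy
```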

The advantage of the constructed decoder is that it has fewer parameters, making the model lighter and improving inference efficiency. Cross-correlating the background-image features with the template-image features makes it easier for the matching model to learn the spatial relationship between the images, achieving a better template matching effect.

Step C: train the established matching model based on the preprocessed image set and compute the loss function until the loss function converges to a preset value, to obtain a matching model comprising the trained encoder, feature mapper and decoder.

Step C specifically includes:

Step C1: perform pre-training. The pre-training network structure consists of the encoder and the decoder connected in sequence; the feature mapping layer is removed first, i.e., the output of the encoder is used directly as the input of the decoder.

Step C2: randomly select two images from the open-source image data set. In a training positive sample, the first image is the background image and the first crop is the template image. Scale the background image and the template image: the width and height of the background image are scaled to w1*16 and h1*16 respectively, where w1 and h1 are integers between 14 and 40; the width and height of the template image are scaled to w2*16 and h2*16 respectively, where w2 and h2 are integers between 3 and 7. Specifically, the values of w1, h1, w2 and h2 selected for each training iteration are random integers satisfying these constraints, so as to improve the adaptability of the trained model to multi-scale images.

Step C3: divide the scaled background image and the scaled template image into grids of 16×16-pixel cells, and randomly mask proportions r1 and r2 of the cells of the scaled background image and the scaled template image respectively, where 0.5 ≤ r1 ≤ 0.8 and 0.3 ≤ r2 ≤ 0.5; specifically, r1 = 0.7 and r2 = 0.4.

Step C4: input the images into the pre-training network structure and perform forward inference to obtain the matching result; read the label file corresponding to the input images and compute the loss value L until the loss function converges to a preset value, to obtain the pre-trained matching model.

The loss function, denoted L, is computed as:

L = λ1·Lconfidence + λ2·Lloc

where Lconfidence denotes the loss value of whether a key point exists at a position; λ1 denotes the weight of that loss value and satisfies 5 < λ1 < 10; Lloc denotes the loss value of the key-point coordinates; λ2 denotes the weight of that loss value and satisfies 0.5 < λ2 < 2.0; ci denotes the true value of whether the i-th key point is moved (1 if moved, 0 if not moved); pi denotes the predicted value of whether the i-th key point exists and satisfies 0 ≤ pi ≤ 1.0; xi and yi denote the true abscissa and ordinate of the i-th key point; and x̂i and ŷi denote the predicted abscissa and ordinate of the i-th key point. Here λ1 = 8.0 and λ2 = 1.5.
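
A sketch of this composite loss, consistent with the definitions above, is given below. The exact functional forms of Lconfidence and Lloc are not spelled out in the text, so binary cross-entropy for the existence term and an L1 coordinate error computed only on key points that actually exist are assumed here.

```python
import torch
import torch.nn.functional as F

def matching_loss(p_pred, xy_pred, c_true, xy_true, lam1=8.0, lam2=1.5):
    # p_pred: (B, 9) predicted existence confidence, 0 <= p <= 1
    # xy_pred, xy_true: (B, 9, 2) predicted / ground-truth normalised coordinates
    # c_true: (B, 9) ground truth, 1 if the key point exists (has moved), else 0
    l_conf = F.binary_cross_entropy(p_pred, c_true.float())
    mask = c_true.float().unsqueeze(-1)                      # only penalise real key points
    l_loc = (mask * (xy_pred - xy_true).abs()).sum() / mask.sum().clamp(min=1.0)
    return lam1 * l_conf + lam2 * l_loc                      # L = λ1·Lconfidence + λ2·Lloc
```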

Further, determine whether the preset number of training iterations has been reached. If it has not been reached, determine whether the loss has converged to the preset value: if it has converged, stop training to obtain the pre-trained matching model; if it has not converged, continue training. If the preset number of training iterations has been reached, stop training to obtain the pre-trained matching model.

In one embodiment, most of the model parameters are frozen at the following stage so that they are not updated during training; since the number of trainable parameters is small, training converges quickly. The stage includes the following steps:

Step C5: perform correction training. After the pre-trained matching model is obtained, the network structure becomes the encoder, the feature mapper and the decoder connected in sequence, i.e., the output of the encoder is feature-mapped before being input to the decoder.

Step C6: initialize the network parameter values of the encoder and the decoder with the parameters of the pre-trained matching model, randomly initialize the feature mapper parameters, set the learning rate to 0.1 times that of the pre-training stage, and freeze the parameter values of the encoder and the decoder, i.e., do not update these parameters during the correction training stage.

Step C7: select two adjacent images from the continuous image data set and obtain hard training sample pairs through data preprocessing; repeat steps C2 to C4 to train the network parameters, so as to obtain the trained matching model.
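
The correction-training setup of steps C5 to C7 can be sketched as follows: initialise from the pre-trained weights, freeze the encoder and decoder, train only the randomly initialised feature mapper, and reduce the learning rate to 0.1×. The module and attribute names (model.encoder, model.decoder, model.feature_mapper) and the choice of Adam are assumptions about how such a model might be organised.

```python
import torch

def build_finetune_optimizer(model, pretrained_state, base_lr=1e-4):
    model.load_state_dict(pretrained_state, strict=False)    # encoder/decoder weights from C4
    for module in (model.encoder, model.decoder):
        for param in module.parameters():
            param.requires_grad = False                       # frozen during correction training
    trainable = [p for p in model.feature_mapper.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=0.1 * base_lr)      # 0.1x the pre-training rate
```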

Step D: match the test images using the trained matching model.

Step D specifically includes:

Step D1: scale the background image to W1*H1, where W1 and H1 are both integer multiples of 16, and scale the template image to W2*H2, where W2 and H2 are both integer multiples of 16 and satisfy W2 < W1 and H2 < H1; specifically, the background image is scaled to 320*320 and the template image to 80*80.

Step D2: input the preprocessed background image and template image into the trained matching model to obtain the matching result.

Step D3: analyse the matching result by checking in turn whether the confidence in each of the nine output vectors exceeds the preset confidence value. If fewer than three key points have a confidence greater than the preset value, the matching position of the template image cannot be found in the background image; if at least three key points have a confidence greater than the threshold T, the template matching is successful and the matching position is determined by the three key points with the highest confidence.
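
This decision rule can be written compactly as below: keep the key points whose confidence exceeds the threshold T, require at least three of them for a successful match, and take the match position from the three most confident ones. The return format is an illustrative assumption.

```python
import numpy as np

def decide_match(p, xy, threshold=0.5):
    # p: (9,) confidences from the model, xy: (9, 2) predicted normalised coordinates
    keep = np.where(p > threshold)[0]
    if keep.size < 3:
        return None                                   # template not found in the background
    best3 = keep[np.argsort(p[keep])[::-1][:3]]       # three highest-confidence key points
    return xy[best3]                                  # points that determine the match position
```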

To verify the effect of the method of the present invention, a test data set was established and verification experiments were conducted.

Referring to Figures 3 to 5, the background image to be matched is an RGB three-channel colour image with a resolution of 1048*777 (Figure 3), and the template image is a near-infrared image with a resolution of 200*200 (Figure 4). The two images are fed into the matching model for template matching, and the template image is superimposed on the RGB colour test image according to the matching relationship; the matching result is shown in Figure 5.

The rectangular box in the figure marks the matching position of the template image on the background image, and the size of each circle indicates the confidence of the corresponding matching point, with larger circles denoting higher confidence. The matching results show that two images of different sizes and different imaging characteristics are matched correctly, demonstrating the effectiveness of the method of the present invention.

Embodiment 2:

A computer-readable storage medium stores computer instructions for causing a computer to execute the template matching method based on heterogeneous image feature fusion proposed in Embodiment 1. The artificial-intelligence acceleration hardware may be a Cambricon MLU270 accelerator together with an Intel CPU; a model conversion tool chain converts the network model provided in Embodiment 1 into a file in .om format, so that the model parameters can be loaded correctly from the computer-readable storage medium for inference computation.

Embodiment 3:

An electronic device, which may be a single server or an embedded computing platform, includes a memory and a processor communicatively connected to each other; the memory stores computer instructions, and the processor executes the computer instructions to perform the template matching method based on heterogeneous image feature fusion proposed in Embodiment 1.

The memory is a machine-readable storage medium. The electronic device further includes a deep-learning parallel-computing acceleration chip and a network interface; the acceleration chip, the machine-readable storage medium, the network interface and the processor are connected through a PCIe bus system. The deep-learning parallel-computing acceleration chip accelerates the forward-inference computation of the deep-learning model, the machine-readable storage medium stores programs, instructions or code, and the processor controls the transceiving of the network interface so that data can be sent and received over the network; the acceleration chip processes the forward-inference computation of the deep-learning network in parallel, increasing the computation speed.

The embodiments described above merely describe preferred implementations of the present invention and do not limit its scope; all modifications and improvements made to the technical solutions of the present invention by those of ordinary skill in the art without departing from the spirit of its design shall fall within the protection scope of the present invention.

Claims (10)

1. A template matching method based on heterogeneous image feature fusion, characterized by comprising the following steps:
A, acquiring an image set and preprocessing it to obtain a background image and a template image, wherein the image set comprises a training image set and test images;
B, establishing a matching model, wherein the matching model comprises an encoder, a feature mapper and a decoder which are sequentially connected, and sending the background image and the template image in the acquired image set into the matching model for matching;
C, training the established matching model based on the preprocessed image set, and calculating a loss function until the loss function converges to a preset value, to obtain a matching model comprising the trained encoder, feature mapper and decoder;
D, matching the test images by using the trained matching model.
2. The template matching method based on heterogeneous image feature fusion according to claim 1, wherein in step A, the preprocessing specifically includes:
A1, randomly selecting two images in the image set, marking them as a first image and a second image, cropping an image of random size from the first image, marking it as a first crop, and recording the coordinates of the key points of the first crop on the first image as a training positive sample;
A2, cropping an image from the second image at the same position as on the first image, and marking it as a second crop, as a training negative sample.
3. The method of claim 2, wherein in step A, the preprocessing further includes performing data enhancement processing on the first image and the second image, where the data enhancement processing refers to image transformation and includes scaling, translation, rotation, perspective, random noise addition, random contrast transformation, random brightness change, and random selection of a region for occlusion.
4. The template matching method based on heterogeneous image feature fusion according to claim 1, wherein in the step B, before the background image and the template image in the acquired image set are sent to the matching model, mesh division is further performed, and meshes with preset proportions are selected for random covering.
5. The template matching method based on heterogeneous image feature fusion according to claim 1, wherein in step B, the feature mapper in the matching model comprises a convolution layer and a linear mapping layer with an output of a preset dimension, which are sequentially connected; the decoder comprises a self-attention structure, a residual structure and a cross self-attention structure which are connected in sequence;
cross self-attention structure: the feature vectors of the background image are dot-multiplied with the feature vectors of the template image, and the feature vectors of the template image are dot-multiplied with the feature vectors of the background image, to obtain the weights.
6. The method of claim 1, wherein in step B, before the image is input to the decoder, the position information is further encoded, where the encoding formula is as follows:
wherein PE represents a position coding vector, pos represents a grid number, d_index represents a coding feature vector element position, i represents an element number in the coding feature vector, d represents a dimension of the coding feature vector, the value of the dimension d is n, and n is an integer.
7. The template matching method based on heterogeneous image feature fusion according to claim 1, wherein in step C, the loss function is denoted as L and is calculated as follows:
L = λ1·Lconfidence + λ2·Lloc
wherein Lconfidence denotes the loss value of whether a key point exists at a position, λ1 denotes the weight of the loss value of whether a key point exists at a position, Lloc denotes the loss value of the key-point coordinates, λ2 denotes the weight of the loss value of the key-point coordinates, ci denotes the true value of whether the i-th key point is moved (1 if moved, 0 if not moved), pi denotes the predicted value of whether the i-th key point exists, xi denotes the true value of the abscissa of the i-th key point, yi denotes the true value of the ordinate of the i-th key point, x̂i denotes the predicted value of the abscissa of the i-th key point, and ŷi denotes the predicted value of the ordinate of the i-th key point.
8. The template matching method based on heterogeneous image feature fusion according to claim 1, wherein step D specifically includes:
D1, acquiring a preprocessed test image, wherein the preprocessed test image comprises a preprocessed background image and a preprocessed template image;
D2, inputting the preprocessed background image and template image into the trained matching model to obtain a matching result;
D3, analyzing the matching result with a preset confidence threshold for the key points: if the number of key points whose confidence is greater than the preset confidence value is greater than or equal to 3, the matching is successful and the matching position is determined by the three key points with the highest confidence; if the number of key points whose confidence is greater than the preset confidence value is less than 3, the background image and the template image do not match.
9. A computer-readable storage medium storing computer instructions for causing the computer to perform a heterogeneous image feature fusion-based template matching method according to any one of claims 1 to 8.
10. An electronic device comprising a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory storing computer instructions, the processor executing the computer instructions to perform a heterogeneous image feature fusion-based template matching method as claimed in any one of claims 1-8.
CN202310905813.4A 2023-07-21 2023-07-21 Template matching method, medium and equipment based on heterogeneous image feature fusion Pending CN117115483A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310905813.4A CN117115483A (en) 2023-07-21 2023-07-21 Template matching method, medium and equipment based on heterogeneous image feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310905813.4A CN117115483A (en) 2023-07-21 2023-07-21 Template matching method, medium and equipment based on heterogeneous image feature fusion

Publications (1)

Publication Number Publication Date
CN117115483A true CN117115483A (en) 2023-11-24

Family

ID=88801105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310905813.4A Pending CN117115483A (en) 2023-07-21 2023-07-21 Template matching method, medium and equipment based on heterogeneous image feature fusion

Country Status (1)

Country Link
CN (1) CN117115483A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160104042A1 (en) * 2014-07-09 2016-04-14 Ditto Labs, Inc. Systems, methods, and devices for image matching and object recognition in images using feature point optimization
CN112149459A (en) * 2019-06-27 2020-12-29 哈尔滨工业大学(深圳) Video salient object detection model and system based on cross attention mechanism
CN116416263A (en) * 2021-12-31 2023-07-11 第四范式(北京)技术有限公司 Image matching method, device, electronic equipment and storage medium
CN115131816A (en) * 2022-02-25 2022-09-30 浙江工业大学 Deep learning model depolarization method and device based on mask masking
CN114663686A (en) * 2022-03-07 2022-06-24 腾讯科技(深圳)有限公司 Object feature point matching method and device, training method and device
CN115861595A (en) * 2022-11-18 2023-03-28 华中科技大学 A Multi-Scale Domain Adaptive Heterogeneous Image Matching Method Based on Deep Learning
CN115761444A (en) * 2022-11-24 2023-03-07 张栩铭 Training method of incomplete information target recognition model and target recognition method
CN115965803A (en) * 2022-12-20 2023-04-14 美的集团股份有限公司 User interface acceptance method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张维明: "第三届大数据体系高峰论坛论文集 下", 30 June 2023, 湖南大学出版社, pages: 768 - 772 *
方志军等: "TensorFlow应用案例教程", 30 September 2020, 中国铁道出版社, pages: 183 - 185 *
毛远宏;贺占庄;马钟;毕瑞星;王竹平;: "采用类内迁移学习的红外/可见光异源图像匹配", 西安交通大学学报, no. 01 *
毛远宏;贺占庄;马钟;毕瑞星;王竹平;: "采用类内迁移学习的红外/可见光异源图像匹配", 西安交通大学学报, vol. 54, no. 01, 31 January 2020 (2020-01-31), pages 49 - 55 *

Similar Documents

Publication Publication Date Title
CN112651438A (en) Multi-class image classification method and device, terminal equipment and storage medium
CN111444365B (en) Image classification method, device, electronic equipment and storage medium
CN118447322A (en) Wire surface defect detection method based on semi-supervised learning
CN113822144B (en) Target detection method, device, computer equipment and storage medium
CN118865477B (en) Myopic maculopathy classification method based on knowledge distillation and self-supervised learning
CN109583367A (en) Image text row detection method and device, storage medium and electronic equipment
CN114863539A (en) Portrait key point detection method and system based on feature fusion
CN118053150B (en) Supervision method based on text detail graph as end-to-end text detection and recognition
CN115861651A (en) Target detection method based on NAM and YOLOv3
CN111597847A (en) Two-dimensional code identification method, device and equipment and readable storage medium
CN115797392A (en) Non-modal image segmentation and content completion method based on self-supervised learning
CN118072027B (en) Gland segmentation method and device and electronic equipment
CN118865453A (en) Multi-person head posture estimation method, device and medium based on geodesic loss
CN119399610A (en) Power image quality assessment method and system based on adaptive super network architecture
CN117315689B (en) Formula identification method, device, equipment, storage medium and program product
CN110768864B (en) Method and device for generating images in batches through network traffic
CN116740399A (en) Training methods, matching methods and media for heterogeneous image matching models
CN118570837A (en) Diseased fish detection method, device, equipment, storage medium and product
CN114596209A (en) Fingerprint image restoration method, system, equipment and storage medium
CN118298420A (en) A fruit detection method and device
CN111726621A (en) A video conversion method and device
CN117115483A (en) Template matching method, medium and equipment based on heterogeneous image feature fusion
CN115222940B (en) Method, system, device and storage medium for semantic segmentation
CN113962332B (en) Salient target identification method based on self-optimizing fusion feedback
CN117173054A (en) Ultra-light low-resolution dim light face enhancement method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination