
CN115775387A - Target detection method, device, equipment and storage medium - Google Patents

Target detection method, device, equipment and storage medium

Info

Publication number
CN115775387A
CN115775387A (application CN202211569525.8A)
Authority
CN
China
Prior art keywords
depth
target
target detection
loss function
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211569525.8A
Other languages
Chinese (zh)
Inventor
王若霄
冯子健
云一柯
金伟
彭城
黎嘉信
秦宝星
程昊天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Gaussian Automation Technology Development Co Ltd
Original Assignee
Shanghai Gaussian Automation Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Gaussian Automation Technology Development Co Ltd filed Critical Shanghai Gaussian Automation Technology Development Co Ltd
Priority to CN202211569525.8A priority Critical patent/CN115775387A/en
Publication of CN115775387A publication Critical patent/CN115775387A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a training method of a target detection model and a target detection method. The training method of the target detection model is realized by acquiring a training sample image; inputting the training sample image into a 2.5D initial target detection model to obtain prediction result data of a sample target, wherein the 2.5D initial target detection model adds a depth estimation layer to the head network of a 2D target detection model, and the prediction result data includes a prediction box and a prediction depth; and calculating a target loss function value according to the prediction result data, the label data and a target loss function, and adjusting network parameters in the 2.5D initial target detection model based on the target loss function value. A 2.5D target detection method combining 2D target detection and depth estimation is thereby realized, which can solve the problem of the high use cost of 3D target detection methods and reduce the target detection cost.

Description

Target detection method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a target detection method, a target detection device, target detection equipment and a storage medium.
Background
In the automatic navigation process of the robot, the perception of the surrounding environment is an important basis for realizing safe navigation, and effective and accurate target detection is the key of environment perception.
At present, efficient and accurate target detection remains a challenging problem for robots, because it requires a full understanding of the 3D environment around the robot. Since a two-dimensional target detection algorithm cannot provide depth information, it cannot be directly applied to such target detection, so a 3D target detection method is generally adopted at present to obtain a good target detection result.
However, the 3D detection model of the 3D target detection method relies on high-quality training data or expensive sensors (such as a multi-line laser radar) are used for collecting the training data, so that the training cost is high, the training process is very complex, the requirement on hardware resources is extremely high, and the use scene of the 3D target detection model is limited.
Disclosure of Invention
The invention provides a target detection method, a target detection device, target detection equipment and a storage medium, and realizes a 2.5D target detection method combining 2D target detection and depth estimation, so as to solve the problem of the high use cost of 3D target detection methods and reduce the target detection cost.
According to an aspect of the present invention, there is provided a method for training an object detection model, including:
acquiring a training sample image; the training sample image is labeled with label data of a sample target, the label data including: a calibration frame and a calibration depth;
inputting the training sample image into a 2.5D initial target detection model to obtain prediction result data of the sample target; wherein the 2.5D initial target detection model adds a depth estimation layer in a head network of the 2D target detection model; the prediction result data includes: a prediction box and a prediction depth;
and calculating a target loss function value according to the prediction result data and the label data, and adjusting network parameters in the 2.5D initial target detection model based on the target loss function value.
Further, the 2.5D initial target detection model includes:
for a YOLOv5 model based on an anchor frame, reserving a Backbone network Backbone, a Neck network Neck and a Head network Head of the YOLOv5 model; adding a depth estimation layer in the head network to obtain a 2.5D initial target detection model;
accordingly, the object loss function of the 2.5D initial object detection model includes:
adding a depth regression loss function on the basis of the original loss function of the YOLOv5 model;
carrying out weighted summation on the original loss function and the depth regression loss function of the YOLOv5 model to obtain a target loss function;
wherein, the primitive loss function of the YOLOv5 model includes: a classification loss function, a bounding box loss function, and a confidence loss function; the depth regression loss function includes: an absolute error between the predicted depth and the calibrated depth, and a relative error between the predicted depth and the calibrated depth.
According to the technical scheme, on the basis of the YOLOv5 model based on the anchor frame, the depth estimation layer is added in the head network, the depth regression loss function is added in the target loss function, the 2D target detection model of the YOLOv5 is improved to the 2.5D target detection model, and the accuracy of target detection is improved by combining 2D target detection and depth estimation.
Further, the 2.5D initial object detection model includes:
for a CenterNet model for detecting key points, reserving the Backbone network Backbone, the Neck network Neck and the Head network Head of the CenterNet model; adding a depth estimation layer in the head network to obtain a 2.5D initial target detection model;
correspondingly, the object loss function of the 2.5D initial object detection model includes:
adding a depth regression loss function on the basis of the original loss function of the CenterNet model;
carrying out weighted summation on the original loss function and the depth regression loss function of the CenterNet model to obtain a target loss function;
wherein the original loss function of the CenterNet model comprises: a central point prediction loss function, a target size loss function and a target central point bias loss function; the depth regression loss function includes: an absolute error between the predicted depth and the nominal depth, and a relative error between the predicted depth and the nominal depth.
According to the technical scheme, on the basis of the CenterNet model based on key point detection, the depth estimation layer is added in the head network, the depth regression loss function is added to the target loss function, the 2D CenterNet target detection model is upgraded to a 2.5D target detection model, and the accuracy of target detection is improved by combining 2D target detection and depth estimation.
Further, the acquiring the training sample image includes:
acquiring an RGB sample image and corresponding laser point cloud sample data;
determining a calibration frame of a sample target in the RGB sample image, and projecting the laser point cloud sample data onto the RGB sample image;
visualizing the laser point cloud sample data in the calibration frame into a bird's-eye view corresponding to the RGB sample image to obtain a point cloud bird's-eye view, and splicing the point cloud bird's-eye view and the RGB sample image to obtain a spliced image;
determining a boundary frame of the sample target according to the calibration frame of the RGB sample image and the visualized laser point cloud sample data in the point cloud aerial view;
determining the calibration depth of the corresponding calibration frame according to the average depth of the visualized laser point cloud sample data in the boundary frame;
and marking the calibration frame and the calibration depth on the spliced image to obtain a training sample image.
According to the technical scheme, the labeling mode of the training sample image suitable for the 2.5D target detection model is provided.
According to another aspect of the present invention, there is provided an object detection method including:
acquiring an RGB image to be detected and a sparse depth image of a scene to be detected;
performing convolution processing on the RGB image to be detected and the sparse depth image respectively, and then performing feature fusion to obtain a fusion feature image;
inputting the fusion characteristic image into a 2.5D complete target detection model to obtain a 2.5D target detection frame of a target object in the scene to be detected; the 2.5D complete target detection model is obtained by training based on the training method of the target detection model provided by the embodiment of the invention.
According to the technical scheme, the 2.5D complete target detection model obtained by training based on the training method of the target detection model provided by the invention can be used for carrying out 2.5D target detection on the scene to be detected, so that the precision of the target detection result is improved, the data acquisition requirement of the scene to be detected is reduced, and the detection cost is further reduced.
Further, after the fused feature image is input into a 2.5D complete target detection model to obtain a 2.5D target detection frame of the target object in the scene to be detected, the method further includes:
acquiring an observation track of the target object; the observation trajectory includes: a multi-frame calibration frame and a corresponding calibration depth;
predicting a next frame of 2.5D target detection frame according to the target object in the 2.5D target detection frame of the current frame based on a Kalman filter, and forming a prediction track based on the multi-frame 2.5D target detection frame and the corresponding prediction depth;
and calculating the matching distance between the observed track and the predicted track by using a Hungarian algorithm, and matching the predicted track and the observed track based on the matching distance to obtain a target tracking result of the target object.
The above technical solution can realize target tracking of the 2.5D target detection frames, thereby realizing complete perception of the target.
Optionally, the matching distance between the observed trajectory and the predicted trajectory is defined as d(O_t^i, T_t^j), which combines an intersection-over-union (cross-over ratio) association metric I and a distance association metric D, wherein O_t^i represents the observation trajectory, T_t^j represents the predicted trajectory, I denotes the intersection-over-union association metric, D denotes the distance association metric, and IOU denotes the intersection-over-union calculation; boxO_t^j and dO_t^j respectively represent the calibration frame and the corresponding calibration depth of the j-th frame constituting the observation trajectory; boxT_t^j and dT_t^j respectively represent the 2.5D target detection box and the corresponding predicted depth of the j-th frame constituting the predicted trajectory.
According to the technical scheme, the target tracking of the target detection box is converted into the bipartite graph matching problem, and the calculation is performed by using the Hungarian algorithm, so that the calculation efficiency is improved.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a method of training an object detection model or a method of object detection according to any embodiment of the invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a training method or an object detection method of an object detection model according to any one of the embodiments of the present invention when the computer instructions are executed.
The embodiment of the invention provides a training method of a target detection model, which comprises the steps of obtaining a training sample image; inputting the training sample image into a 2.5D initial target detection model to obtain prediction result data of a sample target; wherein the 2.5D initial target detection model adds a depth estimation layer in a head network of the 2D target detection model; the prediction result data includes: a prediction box and a prediction depth; and calculating a target loss function value according to the prediction result data, the label data and the target loss function, and adjusting network parameters in the 2.5D initial target detection model based on the target loss function value to realize a 2.5D target detection method combining 2D target detection and depth estimation, so that the problem of high use cost of a 3D target detection method can be solved, and the target detection cost is reduced.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1A is a flowchart of a method for training a target detection model according to an embodiment of the present invention;
FIG. 1B is a schematic diagram of a training sample image according to an embodiment of the present invention;
FIG. 2A is a flowchart of a target detection method according to an embodiment of the present invention;
FIG. 2B is a schematic diagram of a fused feature image generation process;
FIG. 3 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device implementing the target detection model training method or the target detection method according to the embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "original", "target", and the like in the description and claims of the present invention and the drawings described above are used for distinguishing similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1A is a flowchart of a method for training a target detection model according to an embodiment of the present invention, where the method is applicable to a case where a target detection model is trained based on a training sample image, and the method may be performed by a training apparatus of the target detection model, and the training apparatus of the target detection model may be implemented in a form of hardware and/or software, and the training apparatus of the target detection model may be configured in an electronic device. As shown in fig. 1A, the method includes:
s110, obtaining a training sample image; the training sample image is labeled with label data of a sample object, the label data including: calibration frame and calibration depth.
The training sample image may be understood as a sample image used for training the target detection model. In the present invention, the initial detection model to be trained is a 2.5D object detection model capable of providing depth features, and therefore, compared to the training sample image used by the 2D object detection model, the labeled tag data needs to include a calibration depth (i.e. a true depth value of the sample object) in addition to a calibration box (i.e. a true bounding box of the sample object) in the training sample image used by the present invention.
For example, the manner of acquiring the training sample image may be to acquire a sample image containing a sample target and a laser point cloud corresponding to the sample target; and determining the calibration depth of the sample target according to the laser point cloud, and marking the calibration frame and the calibration depth of each sample target on the sample image to obtain a training sample image.
S120, inputting the training sample image into a 2.5D initial target detection model to obtain prediction result data of the sample target; wherein the 2.5D initial target detection model adds a depth estimation layer in a head network of the 2D target detection model; the prediction result data includes: prediction box and prediction depth.
The 2.5D initial target detection model is an untrained or not yet fully trained 2.5D target detection model. The 2.5D target detection model is a target detection model capable of combining two-dimensional target detection and target depth estimation. The 2.5D initial target detection model constructed by the embodiment of the invention is obtained by adding a depth estimation layer to the head network of an existing 2D target detection model.
Specifically, on the basis of the existing 2D target detection model, the original model architecture of the 2D target detection model is retained, and a depth estimation layer for estimating the predicted depth of the sample target is added to the head network (or the output network) to obtain a 2.5D initial target detection model. The training sample image is input into a 2.5D initial target detection model, so that a prediction frame and a prediction depth corresponding to the sample target can be obtained.
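For illustration only, a minimal Python (PyTorch) sketch of this construction follows. It is not the patent's concrete architecture; the wrapper class, layer sizes and the toy 2D head are assumptions.

```python
# Sketch: wrap an existing 2D detection head and add a depth estimation layer in parallel.
import torch
import torch.nn as nn

class Head2p5D(nn.Module):
    """Keeps the original 2D head and adds a depth estimation branch (assumed design)."""
    def __init__(self, head_2d: nn.Module, in_channels: int):
        super().__init__()
        self.head_2d = head_2d                                        # original 2D head (box + class + confidence)
        self.depth_layer = nn.Conv2d(in_channels, 1, kernel_size=1)   # added depth estimation layer

    def forward(self, feat: torch.Tensor):
        det_out = self.head_2d(feat)        # 2D prediction output (prediction boxes etc.)
        depth_out = self.depth_layer(feat)  # per-location predicted depth
        return det_out, depth_out

# Usage with a toy 2D head on a 64-channel feature map.
toy_head_2d = nn.Conv2d(64, 6, kernel_size=1)   # e.g. 4 box coords + 1 confidence + 1 class (assumed)
head = Head2p5D(toy_head_2d, in_channels=64)
det, depth = head(torch.randn(1, 64, 80, 80))
print(det.shape, depth.shape)                    # (1, 6, 80, 80) and (1, 1, 80, 80)
```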
Compared with existing 2D target detection models, the 2.5D initial target detection model constructed by the embodiment of the invention gives more accurate target detection results. Compared with existing 3D target detection models, which need RGB images, RGB-D depth images, laser point clouds and the like as input and output the object type together with its length, width, height, rotation angle and other information in three-dimensional space, the present model has a simple structure and low requirements on the training sample images, which reduces the cost of model training and target detection and broadens the application scenarios.
S130, calculating a target loss function value according to the prediction result data and the label data, and adjusting network parameters in the 2.5D initial target detection model based on the target loss function value.
The target loss function value is used for measuring the difference between a target prediction result obtained by performing target detection on the training sample image through the 2.5D initial target detection model and label data labeled in the training sample image.
Specifically, the prediction result data (including the prediction frame and the prediction depth) output by the 2.5D initial target detection model and the label data (the calibration frame and the calibration depth) labeled in the training sample image are input into the target loss function, and the target loss function value is calculated. The network parameters in the 2.5D initial target detection model are then iteratively adjusted according to the target loss function value until the target loss function value reaches its minimum.
For example, the network parameters in the 2.5D initial target detection model may be untrained initial network parameters, or pre-trained but not yet fully trained network parameters.
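A minimal sketch of one such training iteration is given below (assumptions throughout: the toy model, the placeholder MSE losses and the label tensors merely stand in for the real network and the target loss function described above).

```python
# Sketch: forward pass -> target loss value -> backward pass -> parameter adjustment.
import torch
import torch.nn as nn

class Toy2p5DModel(nn.Module):
    """Stand-in for the 2.5D initial target detection model (assumed, not the real network)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, 3, padding=1)
        self.box_head = nn.Conv2d(8, 4, 1)    # prediction box branch
        self.depth_head = nn.Conv2d(8, 1, 1)  # added depth estimation layer

    def forward(self, x):
        f = torch.relu(self.backbone(x))
        return self.box_head(f), self.depth_head(f)

model = Toy2p5DModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

# Dummy training sample: image plus label data (calibration box map and calibration depth map).
img = torch.randn(1, 3, 64, 64)
calib_box, calib_depth = torch.randn(1, 4, 64, 64), torch.rand(1, 1, 64, 64)

opt.zero_grad()
pred_box, pred_depth = model(img)                        # prediction result data
loss = nn.functional.mse_loss(pred_box, calib_box) \
     + nn.functional.mse_loss(pred_depth, calib_depth)   # placeholder for the target loss function
loss.backward()                                           # adjust network parameters
opt.step()
print(float(loss))
```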
According to the technical scheme of the embodiment of the invention, a training sample image is obtained; inputting the training sample image into a 2.5D initial target detection model to obtain prediction result data of a sample target; wherein the 2.5D initial target detection model adds a depth estimation layer in a head network of the 2D target detection model; the prediction result data includes: a prediction box and a prediction depth; and calculating a target loss function value according to the prediction result data, the label data and a target loss function, and adjusting network parameters in the 2.5D initial target detection model based on the target loss function value to provide a 2.5D target detection method, which can solve the problem of high use cost of a 3D target detection method and reduce the target detection cost.
The embodiment of the invention can easily upgrade an existing 2D target detection model to a 2.5D initial target detection model, improves the accuracy of target detection, and is suitable for most conventional 2D target detection models, such as the common anchor-based YOLOv5 model and the anchor-free, key-point-based CenterNet model.
In an alternative embodiment, the 2.5D initial object detection model includes:
for a YOLOv5 model based on an anchor frame, reserving the Backbone network Backbone, the Neck network Neck and the Head network Head of the YOLOv5 model; adding a depth estimation layer in the head network to obtain a 2.5D initial target detection model;
accordingly, the object loss function of the 2.5D initial object detection model includes:
adding a depth regression loss function on the basis of the original loss function of the YOLOv5 model;
carrying out weighted summation on the original loss function and the depth regression loss function of the YOLOv5 model to obtain a target loss function;
wherein the primitive loss function of the YOLOv5 model comprises: a classification loss function, a bounding box loss function, and a confidence loss function; the depth regression loss function includes: an absolute error between the predicted depth and the nominal depth, and a relative error between the predicted depth and the nominal depth.
Specifically, the YOLOv5 model is a typical anchor-box-based target detection model. Its network architecture can be divided into three parts: the first part is the Backbone network Backbone, which is responsible for feature extraction of the target and consists of a Focus module, a BottleneckCSP module and an SPP module; the second part is the Neck network Neck, which is mainly used to enhance the features extracted by the backbone network and adopts the PANet path aggregation structure; the third part is the Head network Head, which adopts the same detection head as the YOLOv3 network model: three detection heads respectively downsample the original image by 8, 16 and 32 times and generate three feature vectors of different sizes for detecting targets of different sizes.
According to the embodiment of the invention, on the basis of reserving the Backbone network Backbone, the Neck network Neck and the Head network Head of the YOLOv5 model, a depth estimation layer is added in the Head network for estimating the depth information of the target, thereby determining the network architecture of the 2.5D initial target detection model.
Because a depth estimation layer is added to the head network in the network architecture, correspondingly, a depth regression loss function is added to the target loss function of the 2.5D initial target detection model on the basis of the original loss function of the YOLOv5 model, and is used for measuring the difference between the predicted depth output by the depth estimation layer of the 2.5D initial target detection model and the calibrated depth labeled in the training sample image.
Specifically, a target loss function is obtained by performing weighted summation on an original loss function and a depth regression loss function of the YOLOv5 model; the primitive loss functions of the YOLOv5 model include: a classification loss function, a bounding box loss function, and a confidence loss function; the depth regression loss function includes: the absolute error between the predicted depth and the nominal depth, and the relative error between the predicted depth and the nominal depth.
Illustratively, the target loss function is:
L = ω_b · L_box + ω_o · L_obj + ω_c · L_cls + ω_d · L_depth
L_cls represents the classification loss function, for example:
L_cls = -α · (1 - p)^γ · log(p)
where p is the class probability of the prediction box; α and γ are parameters, which can be set to α = 0.25 and γ = 2; ω_b, ω_o, ω_c and ω_d are all weight parameters.
The bounding box loss function can adopt the GIoU loss. Using GIoU as the bounding box regression loss makes full use of the advantages of IoU while overcoming its shortcomings. Let A be the prediction box, B the calibration box, and C the smallest convex closed box containing A and B. GIoU is calculated as:
GIoU = IoU - |C \ (A ∪ B)| / |C|
and the GIoU loss used as the bounding box loss function is:
L_box = 1 - GIoU
L_obj is the confidence loss function, for which a cross-entropy loss can be used, for example:
L_obj = -(1 - y) · log(1 - x) - y · log(x)
where y represents the calibration value (y = 0 means not a target, y = 1 means a target) and x represents the predicted probability value.
The depth regression loss function is:
L_depth = L_abs + L_rel
where L_abs represents the absolute error between the predicted depth and the calibrated depth, L_rel represents the relative error between the predicted depth and the calibrated depth, and the depth residual is r = l_d - f_d, with l_d the calibrated depth and f_d the predicted depth.
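A hedged sketch of these terms in Python follows. The exact forms of L_abs and L_rel are not reproduced in this text, so the absolute-error and relative-error expressions below are assumptions consistent with the definition r = l_d - f_d; the loss weights are hyperparameters.

```python
# Sketch: depth regression loss L_depth = L_abs + L_rel and the weighted target loss.
import torch

def depth_regression_loss(pred_depth: torch.Tensor, calib_depth: torch.Tensor) -> torch.Tensor:
    r = calib_depth - pred_depth                              # depth residual r = l_d - f_d
    l_abs = r.abs().mean()                                    # absolute error term (assumed form)
    l_rel = (r.abs() / calib_depth.clamp(min=1e-6)).mean()    # relative error term (assumed form)
    return l_abs + l_rel                                      # L_depth = L_abs + L_rel

def target_loss(l_box, l_obj, l_cls, l_depth, w_b=1.0, w_o=1.0, w_c=1.0, w_d=1.0):
    # Weighted sum L = w_b*L_box + w_o*L_obj + w_c*L_cls + w_d*L_depth.
    return w_b * l_box + w_o * l_obj + w_c * l_cls + w_d * l_depth

print(float(depth_regression_loss(torch.tensor([2.2, 4.5]), torch.tensor([2.0, 5.0]))))
```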
In another alternative embodiment, the 2.5D initial object detection model includes:
for a CenterNet model for key point detection, reserving the Backbone network Backbone, the Neck network Neck and the Head network Head of the CenterNet model; adding a depth estimation layer in the head network to obtain a 2.5D initial target detection model;
accordingly, the object loss function of the 2.5D initial object detection model includes:
adding a depth regression loss function on the basis of the original loss function of the CenterNet model;
weighting and summing the original loss function and the depth regression loss function of the CenterNet model to obtain a target loss function;
wherein the original loss function of the CenterNet model comprises: a central point prediction loss function, a target size loss function and a target central point bias loss function; the depth regression loss function includes: an absolute error between the predicted depth and the nominal depth, and a relative error between the predicted depth and the nominal depth.
Specifically, the CenterNet model is a target detection model based on key point detection. The network architecture of the CenterNet model is similar to that of the YOLOv5 model and can be divided into three parts: the Backbone network Backbone, the Neck network Neck and the Head network Head. According to the embodiment of the invention, on the basis of reserving the Backbone network Backbone, the Neck network Neck and the Head network Head of the CenterNet model, a depth estimation layer is added in the Head network for estimating the depth information of the target, thereby determining the network architecture of the 2.5D initial target detection model.
Because the depth estimation layer is added to the head network in the network architecture, correspondingly, a depth regression loss function is added to the target loss function of the 2.5D initial target detection model on the basis of the original loss function of the CenterNet model, and is used for measuring the difference between the predicted depth output by the depth estimation layer of the 2.5D initial target detection model and the calibrated depth labeled in the training sample image.
Specifically, the target loss function is obtained by weighted summation of the original loss function and the depth regression loss function of the CenterNet model; the original loss function of the CenterNet model includes: a central point prediction loss function, a target size loss function and a target central point bias loss function; the depth regression loss function includes: the absolute error between the predicted depth and the calibrated depth, and the relative error between the predicted depth and the calibrated depth.
Illustratively, the form of the target loss function is the same as that used for the YOLOv5-based model, and is not described again in this embodiment.
It should be noted that the training method for the 2.5D target detection model provided in the embodiment of the present invention is not limited to the YOLOv5 model and the CenterNet model described above. For most conventional 2D target detection models, adding a depth estimation layer while keeping the original network architecture is sufficient to upgrade the 2D target detection model to a 2.5D target detection model; the upgrading method is simple and widely applicable.
Optionally, for evaluating the model, the common 2D target detection metrics, such as accuracy, recall and mAP, may be used. For the accuracy of the estimated depth distance, an average depth error may be used, defined as the mean of the relative depth errors of the correctly detected target samples:
E_d = (1 / |TP|) · Σ_{D ∈ TP} |D_pred - D_gt| / D_gt
wherein E_d represents the average depth error, D represents one of all correctly detected target samples TP, D_gt represents the nominal depth of the target sample, and D_pred represents the predicted depth of the target sample.
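A short sketch of this evaluation metric (the exact form is assumed from the description above; the input list of true positives is a hypothetical data layout):

```python
# Sketch: average depth error over correctly detected target samples (TP),
# computed as the mean relative depth error.
def mean_depth_error(true_positives):
    """true_positives: list of (d_gt, d_pred) pairs for correctly detected targets."""
    if not true_positives:
        return 0.0
    errors = [abs(d_pred - d_gt) / d_gt for d_gt, d_pred in true_positives]
    return sum(errors) / len(errors)

print(mean_depth_error([(2.0, 2.2), (5.0, 4.5)]))  # -> 0.1
```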
Optionally, the acquiring a training sample image includes:
acquiring an RGB sample image and corresponding laser point cloud sample data;
determining a calibration frame of a sample target in the RGB sample image, and projecting the laser point cloud sample data onto the RGB sample image;
visualizing the laser point cloud sample data in the calibration frame into a bird's-eye view corresponding to the RGB sample image to obtain a point cloud bird's-eye view, and splicing the point cloud bird's-eye view and the RGB sample image to obtain a spliced image;
determining a boundary frame of the sample target according to the calibration frame of the RGB sample image and the visualized laser point cloud sample data in the point cloud aerial view;
determining the calibration depth of the corresponding calibration frame according to the average depth of the visualized laser point cloud sample data in the boundary frame;
and marking the calibration frame and the calibration depth on the spliced image to obtain a training sample image.
The point cloud aerial view can be understood as an image formed by point cloud data under the aerial view.
Specifically, an RGB sample image is obtained through an image sensor, laser point cloud sample data of the same scene as the RGB sample image is obtained through a laser radar sensor, and the RGB sample image and the laser point cloud sample data are time-synchronized. The sample target in the RGB sample image is marked with a calibration frame Box_RGB, which may be a conventional 2D bounding box, and the laser point cloud sample data is projected onto the RGB sample image. To simplify the subsequent labeling process, only the laser point cloud sample data inside the calibration frame needs to be considered for depth calibration, and the laser point cloud sample data outside the calibration frame can be ignored.
The laser point cloud sample data inside the calibration frame is visualized in the bird's-eye view corresponding to the RGB sample image to obtain a point cloud bird's-eye view, and the point cloud bird's-eye view and the RGB sample image are spliced to obtain a spliced image. In this way, in the point cloud bird's-eye view, the pixel points belonging to a sample target are determined according to the calibration frame of the RGB sample image and the visualized laser point cloud sample data, and the bounding box Box_PC of the sample target is drawn. In addition, the same ID can be assigned to each pair [Box_RGB, Box_PC], associating a calibration frame in the RGB image with a bounding box in the point cloud bird's-eye view. For each pair [Box_RGB, Box_PC], the calibration depth of the corresponding calibration frame Box_RGB is determined according to the average depth of the visualized laser point cloud sample data in the bounding box. The calibration frame and the calibration depth can then be marked on the spliced image to obtain a training sample image. Illustratively, a schematic of a training sample image is shown in FIG. 1B.
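For illustration, a simplified sketch of the depth-labeling step follows. It assumes the laser points have already been projected into the image as (u, v, depth) triples; the function name and data layout are assumptions, not the patent's tooling.

```python
# Sketch: the calibration depth of a calibration box is the average depth of the
# projected laser points that fall inside the box.
import numpy as np

def calibration_depth(projected_points, box_xyxy):
    """projected_points: (N, 3) array of (u, v, depth); box_xyxy: (x1, y1, x2, y2)."""
    u, v, d = projected_points[:, 0], projected_points[:, 1], projected_points[:, 2]
    x1, y1, x2, y2 = box_xyxy
    inside = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    if not inside.any():
        return None                      # no laser point falls inside this calibration box
    return float(d[inside].mean())       # average depth -> calibration depth of the box

pts = np.array([[12.0, 40.0, 3.1], [15.0, 42.0, 2.9], [200.0, 90.0, 7.5]])
print(calibration_depth(pts, (10, 35, 20, 50)))  # -> 3.0
```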
Fig. 2A is a flowchart of a target detection method according to an embodiment of the present invention, where the present embodiment is applicable to a case where target detection is performed based on a 2.5D complete target detection model obtained by training the target detection model provided by the above-mentioned embodiment, the method may be executed by a target detection device, the target detection device may be implemented in a form of hardware and/or software, and the target detection device may be configured in an electronic device. As shown in fig. 2A, the method includes:
s210, acquiring an RGB image to be detected and a sparse depth image of a scene to be detected.
The scene to be detected is a scene needing target detection. The RGB image to be detected is an RGB image obtained by shooting a scene to be detected, and the RGB image to be detected comprises at least one target object. The sparse depth image is an image formed by depth information of a target object in a scene to be detected.
For example, the to-be-detected RGB image of the to-be-detected scene may be obtained by shooting an image of the to-be-detected scene by an image sensor.
Optionally, obtaining a sparse depth image of a scene to be detected includes:
acquiring sparse point cloud data of a scene to be detected through a single line laser radar;
and projecting the sparse point cloud data to an image plane to obtain a sparse depth image.
Specifically, sparse point cloud data of the scene to be detected is collected by a single-line laser radar and projected onto the image plane, converting the sparse point cloud data into an image and obtaining the sparse depth image. In order for the sparse depth image to reflect the depth information of the target object more accurately, pixel points that carry no depth information can be removed.
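A sketch of this projection step under assumed conventions (pinhole camera intrinsics K, lidar-to-camera extrinsics T, empty pixels left at zero); it is not the patent's concrete implementation.

```python
# Sketch: project sparse single-line lidar points into the image plane and write each
# point's depth into the corresponding pixel of a sparse depth image.
import numpy as np

def sparse_depth_image(points_lidar, K, T_cam_lidar, height, width):
    """points_lidar: (N, 3); K: (3, 3) intrinsics; T_cam_lidar: (4, 4) extrinsics."""
    depth = np.zeros((height, width), dtype=np.float32)
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]             # points in the camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                    # keep points in front of the camera
    uv = (K @ pts_cam.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth[v[valid], u[valid]] = pts_cam[valid, 2]           # sparse depth values, zero elsewhere
    return depth

K = np.array([[500.0, 0, 64], [0, 500.0, 48], [0, 0, 1]])
img = sparse_depth_image(np.array([[0.1, 0.0, 5.0]]), K, np.eye(4), 96, 128)
print(img.max())  # 5.0
```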
S220, performing convolution processing on the RGB image to be detected and the sparse depth image respectively, and performing feature fusion to obtain a fusion feature image.
The fusion feature image is an image obtained by fusing the features of the RGB image to be detected and of the sparse depth image.
Specifically, as shown in fig. 2B, the two-dimensional convolution processing is performed on the acquired RGB image to be detected of the scene to be detected to obtain an RGB feature image, the two-dimensional convolution processing is performed on the acquired sparse depth image to obtain a depth feature image, and the RGB feature image and the depth feature image are subjected to feature fusion to obtain a fusion feature image.
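A minimal sketch of this fusion step follows (assumed layer sizes and concatenation-plus-1x1-convolution fusion; Fig. 2B does not fix these details).

```python
# Sketch: separate convolutions on the RGB image and the sparse depth image, then feature fusion.
import torch
import torch.nn as nn

class RGBDepthFusion(nn.Module):
    def __init__(self, out_channels: int = 32):
        super().__init__()
        self.rgb_conv = nn.Conv2d(3, out_channels, 3, padding=1)    # convolution on the RGB image
        self.depth_conv = nn.Conv2d(1, out_channels, 3, padding=1)  # convolution on the sparse depth image
        self.fuse = nn.Conv2d(2 * out_channels, out_channels, 1)    # fusion: concat + 1x1 convolution

    def forward(self, rgb, depth):
        f_rgb = torch.relu(self.rgb_conv(rgb))
        f_depth = torch.relu(self.depth_conv(depth))
        return self.fuse(torch.cat([f_rgb, f_depth], dim=1))        # fused feature image

fusion = RGBDepthFusion()
fused = fusion(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
print(fused.shape)  # torch.Size([1, 32, 128, 128])
```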
S230, inputting the fusion characteristic image into a 2.5D complete target detection model to obtain a 2.5D target detection frame of a target object in the scene to be detected; and the 2.5D complete target detection model is obtained by training based on a training method of the target detection model.
Wherein the 2.5D target detection frame is a detection frame determined by combining the 2D target detection frame and the predicted depth.
Specifically, the fusion feature image is input into the 2.5D complete target detection model obtained by training with the target detection model training method provided by the invention, so as to obtain a 2.5D target detection frame output by the 2.5D complete target detection model. Since the 2.5D complete target detection model includes the depth estimation module for predicting the depth information of the target object, the 2.5D target detection frame output by the 2.5D complete target detection model has higher precision than the 2D target detection frame.
According to the technical scheme of the embodiment of the invention, the RGB image to be detected and the sparse depth image of the scene to be detected are obtained; performing convolution processing on the RGB image to be detected and the sparse depth image respectively, and then performing feature fusion to obtain a fusion feature image; inputting the fusion characteristic image into a 2.5D complete target detection model to obtain a 2.5D target detection frame of a target object in a scene to be detected; the 2.5D complete target detection model is obtained by training based on a training method of the target detection model, compared with the existing 3D target detection method, the requirement on the image to be detected is lower, the calculated amount of the model is greatly reduced, and the target detection cost is reduced; compared with the existing 2D target detection method, the accuracy of the detection result is effectively improved.
Optionally, after the fused feature image is input into a 2.5D complete target detection model to obtain a 2.5D target detection frame of the target object in the scene to be detected, the method further includes:
acquiring an observation track of the target object; the observation trajectory includes: a multi-frame calibration frame and a corresponding calibration depth;
predicting a next frame of 2.5D target detection frame according to a target object in a 2.5D target detection frame of a current frame based on a Kalman filter, and forming a prediction track based on the multi-frame 2.5D target detection frame and a corresponding prediction depth;
and calculating the matching distance between the observed track and the predicted track by using a Hungarian algorithm, and matching the predicted track and the observed track based on the matching distance to obtain a target tracking result of the target object.
The observation trajectory is the actual motion trajectory of the target object. In the embodiment of the present invention, the observation trajectory for the 2.5D target detection frame may be composed of multi-frame calibration frames and the corresponding calibration depths. The predicted trajectory is the motion trajectory of the target object predicted based on the target detection frames output by the 2.5D complete target detection model, and may be composed of multi-frame 2.5D target detection frames and the corresponding predicted depths.
Because a plurality of target objects may be contained in the scene to be detected and the target objects may move in the multi-frame image, after the 2.5D target detection frame of the target objects in the scene to be detected is determined, the target tracking of the target detection frame is also required to be correspondingly performed.
Specifically, the observation trajectory of the target object is obtained by collecting multiple frames of images of the target object and marking the calibration frame and the calibration depth of each frame. Based on a Kalman filter, the 2.5D target detection frame and predicted depth of the next frame are predicted from the 2.5D target detection frame of the target object output by the 2.5D complete target detection model, thereby obtaining the predicted trajectory formed by the multi-frame 2.5D target detection frames and the corresponding predicted depths.
The motion trajectory of the target object can be described by a motion state space. The center position of the motion state space is determined by the intrinsic and extrinsic matrices of the sensor, and the position coordinates of this center in the world coordinate system are determined according to the sensor position. The motion state space of the target object's trajectory can be expressed as a 12-dimensional vector {bx, by, bw, bh, x, y, vbx, vby, vbw, vbh, vx, vy}, where bx, by, bw, bh denote the detection frame in the image coordinate system, x, y denote the center position of the target object in the world coordinate system, vbx, vby, vbw, vbh denote the first derivatives of the detection frame in the image coordinate system, and vx, vy denote the velocity of the target object in the world coordinate system (assuming z = 0).
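A sketch of a Kalman filter over this 12-dimensional state is given below. The constant-velocity transition model, frame interval and noise covariances are assumptions; only the state layout follows the description above.

```python
# Sketch: constant-velocity Kalman filter over {bx, by, bw, bh, x, y, vbx, vby, vbw, vbh, vx, vy}.
import numpy as np

dt = 0.1                                         # frame interval (assumed)
F = np.eye(12)
F[:6, 6:] = dt * np.eye(6)                       # positions integrate their velocities
H = np.zeros((6, 12)); H[:6, :6] = np.eye(6)     # we observe (bx, by, bw, bh, x, y)

def kf_predict(state, P, Q=None):
    Q = np.eye(12) * 1e-2 if Q is None else Q
    return F @ state, F @ P @ F.T + Q            # predicted state and covariance

def kf_update(state, P, z, R=None):
    R = np.eye(6) * 1e-1 if R is None else R
    y = z - H @ state                            # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    return state + K @ y, (np.eye(12) - K @ H) @ P

state, P = np.zeros(12), np.eye(12)
state, P = kf_predict(state, P)
state, P = kf_update(state, P, np.array([10, 20, 30, 40, 1.0, 2.0]))
print(state[:6])
```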
Data association is constructed for matching the predicted trajectory Tt with the observed trajectory Ot; this is converted into a bipartite graph matching problem and solved with the Hungarian algorithm, matching the target detection frame and the calibration frame of each image frame, thereby realizing target tracking of the target detection frame corresponding to each target object. The data association of the predicted trajectory Tt with the observed trajectory Ot may be expressed in terms of a matching distance.
Optionally, the matching distance between the observed trajectory and the predicted trajectory is defined as d(O_t^i, T_t^j), which combines an intersection-over-union (cross-over ratio) association metric I and a distance association metric D, wherein O_t^i represents the observation trajectory, T_t^j represents the predicted trajectory, I denotes the intersection-over-union association metric, D denotes the distance association metric, and IOU denotes the intersection-over-union calculation; boxO_t^j and dO_t^j respectively represent the calibration frame and the corresponding calibration depth of the j-th frame constituting the observation trajectory; boxT_t^j and dT_t^j respectively represent the 2.5D target detection box and the corresponding predicted depth of the j-th frame constituting the predicted trajectory.
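A hedged sketch of the association step follows. The exact combination of the IoU term and the depth term is not reproduced in this text, so the cost below (a weighted sum of (1 - IoU) and the absolute depth difference) is an assumption; the Hungarian algorithm itself is taken from SciPy.

```python
# Sketch: build a cost matrix from boxes + depths, then solve bipartite matching (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a; bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match(observations, predictions, w_iou=0.5, w_depth=0.5):
    """observations/predictions: lists of (box_xyxy, depth). Returns matched index pairs."""
    cost = np.zeros((len(observations), len(predictions)))
    for i, (box_o, d_o) in enumerate(observations):
        for j, (box_t, d_t) in enumerate(predictions):
            I = iou(box_o, box_t)                    # intersection-over-union association metric
            D = abs(d_o - d_t)                       # depth distance association metric
            cost[i, j] = w_iou * (1.0 - I) + w_depth * D   # assumed combination
    rows, cols = linear_sum_assignment(cost)         # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))

print(match([((0, 0, 10, 10), 2.0)], [((1, 1, 11, 11), 2.1), ((50, 50, 60, 60), 9.0)]))  # [(0, 0)]
```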
Fig. 3 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes:
a training sample obtaining module 310, configured to obtain a training sample image; the training sample image is labeled with label data of a sample object, the label data including: calibrating a frame and calibrating a depth;
a prediction result obtaining module 320, configured to input the training sample image into a 2.5D initial target detection model to obtain prediction result data of the sample target; wherein the 2.5D initial target detection model adds a depth estimation layer in a head network of the 2D target detection model; the prediction result data includes: a prediction box and a prediction depth;
a parameter adjusting module 330, configured to calculate a target loss function value according to the prediction result data, the tag data, and a target loss function, and adjust a network parameter in the 2.5D initial target detection model based on the target loss function value.
Optionally, the 2.5D initial target detection model includes:
for a YOLOv5 model based on an anchor frame, reserving a Backbone network Backbone, a Neck network Neck and a Head network Head of the YOLOv5 model; adding a depth estimation layer in the head network to obtain a 2.5D initial target detection model;
accordingly, the object loss function of the 2.5D initial object detection model includes:
adding a depth regression loss function on the basis of the original loss function of the YOLOv5 model;
carrying out weighted summation on the original loss function and the depth regression loss function of the YOLOv5 model to obtain a target loss function;
wherein the primitive loss function of the YOLOv5 model comprises: a classification loss function, a bounding box loss function, and a confidence loss function; the depth regression loss function includes: an absolute error between the predicted depth and the calibrated depth, and a relative error between the predicted depth and the calibrated depth.
Optionally, the 2.5D initial target detection model includes:
for a CenterNet model for key point detection, reserving the Backbone network Backbone, the Neck network Neck and the Head network Head of the CenterNet model; adding a depth estimation layer in the head network to obtain a 2.5D initial target detection model;
accordingly, the object loss function of the 2.5D initial object detection model includes:
adding a depth regression loss function on the basis of the original loss function of the CenterNet model;
weighting and summing the original loss function and the depth regression loss function of the CenterNet model to obtain a target loss function;
wherein the original loss function of the CenterNet model comprises: a central point prediction loss function, a target size loss function and a target central point bias loss function; the depth regression loss function includes: an absolute error between the predicted depth and the nominal depth, and a relative error between the predicted depth and the nominal depth.
Optionally, the training sample obtaining module is specifically configured to:
acquiring an RGB sample image and corresponding laser point cloud sample data;
determining a calibration frame of a sample target in the RGB sample image, and projecting the laser point cloud sample data onto the RGB sample image;
visualizing the laser point cloud sample data in the calibration frame into a bird's-eye view corresponding to the RGB sample image to obtain a point cloud bird's-eye view, and splicing the point cloud bird's-eye view and the RGB sample image to obtain a spliced image;
determining a boundary frame of the sample target according to the calibration frame of the RGB sample image and the visualized laser point cloud sample data in the point cloud aerial view;
determining the calibration depth of the corresponding calibration frame according to the average depth of the visualized laser point cloud sample data in the boundary frame;
and marking the calibration frame and the calibration depth on the spliced image to obtain a training sample image.
The training device for the target detection model provided by the embodiment of the invention can execute the training method for the target detection model provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Fig. 4 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes:
the image acquisition module 410 is used for acquiring an RGB image to be detected and a sparse depth image of a scene to be detected;
the feature fusion module 420 is configured to perform convolution processing on the RGB image to be detected and the sparse depth image, and perform feature fusion to obtain a fusion feature image;
the target detection module 430 is configured to input the fusion feature image into a 2.5D complete target detection model to obtain a 2.5D target detection frame of the target object in the scene to be detected; and the 2.5D complete target detection model is obtained by training based on a training method of the target detection model.
Optionally, the method further includes:
an observation track obtaining module, configured to obtain an observation track of the target object after the fusion feature image is input into a 2.5D complete target detection model to obtain a 2.5D target detection frame of the target object in the scene to be detected; the observation trajectory includes: multi-frame calibration frames and corresponding calibration depths;
the prediction track generation module is used for predicting a next frame of 2.5D target detection frame according to the 2.5D target detection frame of the target object in the current frame based on the Kalman filter, and forming a prediction track based on the multi-frame 2.5D target detection frame and the corresponding prediction depth;
and the target tracking module is used for calculating the matching distance between the observed track and the predicted track by using a Hungarian algorithm, and matching the predicted track and the observed track based on the matching distance to obtain a target tracking result of the target object.
Optionally, the matching distance between the observed trajectory and the predicted trajectory is defined as d(O_t^i, T_t^j), which combines an intersection-over-union (cross-over ratio) association metric I and a distance association metric D, wherein O_t^i represents the observation trajectory, T_t^j represents the predicted trajectory, I denotes the intersection-over-union association metric, D denotes the distance association metric, and IOU denotes the intersection-over-union calculation; boxO_t^j and dO_t^j respectively represent the calibration frame and the corresponding calibration depth of the j-th frame constituting the observation trajectory; boxT_t^j and dT_t^j respectively represent the 2.5D target detection box and the corresponding predicted depth of the j-th frame constituting the predicted trajectory.
The target detection device provided by the embodiment of the invention can execute the target detection method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
FIG. 5 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 may also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as a training method of an object detection model or an object detection method.
In some embodiments, the training method of the object detection model or the object detection method may be implemented as a computer program that is tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the training method of the object detection model or the object detection method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured in any other suitable way (e.g. by means of firmware) to perform a training method of the object detection model or an object detection method.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and VPS (virtual private server) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired result of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method for training a target detection model, comprising:
acquiring a training sample image; the training sample image is labeled with label data of a sample target, the label data comprising: a calibration frame and a calibration depth;
inputting the training sample image into a 2.5D initial target detection model to obtain prediction result data of the sample target; wherein the 2.5D initial target detection model adds a depth estimation layer in a head network of the 2D target detection model; the prediction result data includes: a prediction box and a prediction depth;
calculating a target loss function value according to the prediction result data, the label data and a target loss function, and adjusting network parameters in the 2.5D initial target detection model based on the target loss function value.
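As an illustration of the training step in claim 1, the sketch below assumes a PyTorch-style 2.5D detector whose head network already carries the extra depth-estimation layer; the model, data loader and loss callable are hypothetical placeholders, not components defined by the patent.

```python
def train_one_epoch(model, loader, optimizer, target_loss_fn):
    """Sketch of claim 1: forward pass of a 2.5D detector, target loss, parameter update.

    `model` is assumed to output prediction result data (predicted boxes and
    predicted depths); `loader` yields training sample images with their label
    data (calibration frames and calibration depths).
    """
    model.train()
    for images, labels in loader:
        preds = model(images)                  # prediction result data: boxes + depths
        loss = target_loss_fn(preds, labels)   # target loss function value
        optimizer.zero_grad()
        loss.backward()                        # adjust network parameters of the 2.5D model
        optimizer.step()
```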
2. The method of claim 1, wherein the 2.5D initial target detection model comprises:
for an anchor-based YOLOv5 model, retaining the backbone network (Backbone), the neck network (Neck) and the head network (Head) of the YOLOv5 model, and adding a depth estimation layer in the head network to obtain the 2.5D initial target detection model;
correspondingly, the target loss function of the 2.5D initial target detection model comprises:
adding a depth regression loss function on the basis of the original loss function of the YOLOv5 model;
performing a weighted summation of the original loss function of the YOLOv5 model and the depth regression loss function to obtain the target loss function;
wherein the original loss function of the YOLOv5 model comprises: a classification loss function, a bounding box loss function, and a confidence loss function; and the depth regression loss function comprises: an absolute error between the predicted depth and the calibration depth, and a relative error between the predicted depth and the calibration depth.
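A minimal sketch of the target loss in claim 2, assuming PyTorch tensors: the YOLOv5-style classification, bounding-box and confidence terms are kept unchanged, and a depth regression term built from an absolute and a relative error is added by weighted summation. The weights w_orig and w_depth and the epsilon guard are illustrative choices, not values from the patent.

```python
import torch


def depth_regression_loss(pred_depth: torch.Tensor, calib_depth: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """Absolute error plus relative error between predicted and calibration depth."""
    abs_err = torch.abs(pred_depth - calib_depth)
    rel_err = abs_err / (calib_depth + eps)    # relative error w.r.t. the calibration depth
    return (abs_err + rel_err).mean()


def target_loss(cls_loss, box_loss, conf_loss, pred_depth, calib_depth,
                w_orig: float = 1.0, w_depth: float = 1.0) -> torch.Tensor:
    """Weighted sum of the original YOLOv5 loss terms and the depth regression term."""
    original = cls_loss + box_loss + conf_loss
    return w_orig * original + w_depth * depth_regression_loss(pred_depth, calib_depth)
```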
3. The method of claim 1, wherein the 2.5D initial target detection model comprises:
for a CenterNet model based on key point detection, retaining the backbone network (Backbone), the neck network (Neck) and the head network (Head) of the CenterNet model, and adding a depth estimation layer in the head network to obtain the 2.5D initial target detection model;
correspondingly, the target loss function of the 2.5D initial target detection model comprises:
adding a depth regression loss function on the basis of the original loss function of the CenterNet model;
performing a weighted summation of the original loss function of the CenterNet model and the depth regression loss function to obtain the target loss function;
wherein the original loss function of the CenterNet model comprises: a center point prediction loss function, a target size loss function and a target center point offset loss function; and the depth regression loss function comprises: an absolute error between the predicted depth and the calibration depth, and a relative error between the predicted depth and the calibration depth.
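For the CenterNet variant in claim 3, one plausible way to realize the extra depth estimation layer is an additional regression head next to the existing heatmap, size and offset heads; the channel counts and layer shapes below are assumptions for illustration only, and the resulting target loss would combine CenterNet's original terms with the same depth regression term sketched above.

```python
import torch.nn as nn


class DepthHead(nn.Module):
    """Illustrative depth-estimation head added to a CenterNet-style head network.

    Regresses one depth value per output location, alongside the existing
    center-point heatmap, target-size and center-point-offset heads.
    """

    def __init__(self, in_channels: int = 64, mid_channels: int = 64):
        super().__init__()
        self.depth = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, kernel_size=1),  # one channel: depth per location
        )

    def forward(self, neck_features):
        return self.depth(neck_features)
```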
4. The method of claim 1, wherein the acquiring a training sample image comprises:
acquiring an RGB sample image and corresponding laser point cloud sample data;
determining a calibration frame of a sample target in the RGB sample image, and projecting the laser point cloud sample data onto the RGB sample image;
visualizing the laser point cloud sample data in the calibration frame into a bird's-eye view corresponding to the RGB sample image to obtain a point cloud bird's-eye view, and splicing the point cloud bird's-eye view and the RGB sample image to obtain a spliced image;
determining a boundary frame of the sample target according to the calibration frame of the RGB sample image and the visualized laser point cloud sample data in the point cloud bird's-eye view;
determining the calibration depth of the corresponding calibration frame according to the average depth of the visualized laser point cloud sample data in the boundary frame;
and marking the calibration frame and the calibration depth on the spliced image to obtain a training sample image.
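A sketch of the labeling idea in claim 4: lidar points are projected onto the RGB sample image with assumed camera intrinsics K and an extrinsic transform T_cam_lidar, and the calibration depth of a box is taken as the average depth of the projected points falling inside it. The matrices, the (x1, y1, x2, y2) box format and the helper names are hypothetical; the bird's-eye-view visualization and image splicing steps are omitted here.

```python
import numpy as np


def project_points(points_lidar: np.ndarray, T_cam_lidar: np.ndarray, K: np.ndarray):
    """Project Nx3 lidar points into the image; returns pixel coords (u, v) and depth per point."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])  # homogeneous coordinates
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                          # lidar frame -> camera frame
    depth = pts_cam[:, 2]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / depth[:, None]                                     # perspective division
    return uv, depth


def calibration_depth(uv: np.ndarray, depth: np.ndarray, box):
    """Average depth of the projected points that fall inside the given box."""
    x1, y1, x2, y2 = box
    inside = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & \
             (uv[:, 1] >= y1) & (uv[:, 1] <= y2) & (depth > 0)
    return float(depth[inside].mean()) if inside.any() else None
```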
5. A method of object detection, comprising:
acquiring an RGB image to be detected and a sparse depth image of a scene to be detected;
performing convolution processing on the RGB image to be detected and the sparse depth image respectively, and performing feature fusion to obtain a fusion feature image;
inputting the fusion feature image into a 2.5D complete target detection model to obtain a 2.5D target detection frame of a target object in the scene to be detected; wherein the 2.5D complete target detection model is obtained by training based on the training method of the target detection model according to any one of claims 1 to 4.
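A minimal sketch of the fusion step in claim 5, assuming PyTorch: the RGB image and the sparse depth image each pass through their own convolution, and the resulting feature maps are fused here by channel concatenation, one common choice that the claim itself does not fix. Channel counts are illustrative.

```python
import torch
import torch.nn as nn


class RGBDepthFusion(nn.Module):
    """Convolve the RGB image and the sparse depth image separately, then fuse the features."""

    def __init__(self, rgb_channels: int = 3, depth_channels: int = 1, out_channels: int = 32):
        super().__init__()
        self.rgb_conv = nn.Conv2d(rgb_channels, out_channels, kernel_size=3, padding=1)
        self.depth_conv = nn.Conv2d(depth_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, rgb: torch.Tensor, sparse_depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.rgb_conv(rgb)
        f_depth = self.depth_conv(sparse_depth)
        return torch.cat([f_rgb, f_depth], dim=1)  # fusion feature image


# fused = RGBDepthFusion()(rgb_tensor, sparse_depth_tensor)
# detections = trained_2_5d_model(fused)  # 2.5D detection frames (box + depth); model name is a placeholder
```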
6. The method according to claim 5, wherein after inputting the fusion feature image into the 2.5D complete target detection model to obtain the 2.5D target detection frame of the target object in the scene to be detected, the method further comprises:
acquiring an observation track of the target object; the observation trajectory includes: a multi-frame calibration frame and a corresponding calibration depth;
predicting, based on a Kalman filter, the 2.5D target detection frame of the next frame from the 2.5D target detection frame of the target object in the current frame, and forming a prediction trajectory based on the multi-frame 2.5D target detection frames and the corresponding predicted depths;
calculating the matching distance between the observation trajectory and the prediction trajectory, and matching the prediction trajectory with the observation trajectory by using a Hungarian algorithm based on the matching distance, to obtain a target tracking result of the target object.
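A hedged sketch of the association step in claim 6, using scipy's linear_sum_assignment as the Hungarian solver. The cost used below, one minus box IoU plus a normalized depth difference, is only an illustrative stand-in: the patent's actual matching distance is defined in claim 7 by formulas that appear as image equations in the source.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def match_trajectories(observed, predicted):
    """Match observed (box, depth) pairs to Kalman-predicted (box, depth) pairs.

    The cost combines an IoU term and a depth term; it is an assumed metric,
    not the patent's exact matching distance.
    """
    cost = np.zeros((len(observed), len(predicted)))
    for i, (box_o, d_o) in enumerate(observed):
        for j, (box_p, d_p) in enumerate(predicted):
            cost[i, j] = (1.0 - iou(box_o, box_p)) + abs(d_o - d_p) / max(d_o, d_p, 1e-9)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm on the cost matrix
    return list(zip(rows.tolist(), cols.tolist()))
```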
7. The method of claim 6, wherein the matching distance between the observation trajectory and the prediction trajectory comprises:
[the three defining formulas are rendered as image equations FDA0003987425260000031 to FDA0003987425260000033 in the original claims and are not reproduced here]
wherein the image-rendered symbols denote the matching distance between the observation trajectory and the prediction trajectory; I represents the intersection-over-union correlation metric, Ad represents the distance correlation metric, and IOU represents the intersection-over-union calculation; the image-rendered symbols FDA0003987425260000037 and FDA0003987425260000038 respectively represent the calibration frame and the corresponding calibration depth of the jth frame constituting the observation trajectory; and boxT_t^j and dT_t^j respectively represent the 2.5D target detection frame and the corresponding predicted depth of the jth frame constituting the prediction trajectory.
8. An apparatus for training an object detection model, comprising:
the training sample acquisition module is used for acquiring a training sample image; the training sample image is labeled with label data of a sample target, the label data comprising: a calibration frame and a calibration depth;
the prediction result acquisition module is used for inputting the training sample image into a 2.5D initial target detection model to obtain prediction result data of the sample target; wherein the 2.5D initial target detection model adds a depth estimation layer in a head network of the 2D target detection model; the prediction result data includes: a prediction box and a prediction depth;
and the parameter adjusting module is used for calculating a target loss function value according to the prediction result data, the label data and a target loss function, and adjusting network parameters in the 2.5D initial target detection model based on the target loss function value.
9. An object detection device, comprising:
the image acquisition module is used for acquiring an RGB image to be detected and a sparse depth image of a scene to be detected;
the feature fusion module is used for performing convolution processing on the RGB image to be detected and the sparse depth image respectively and then performing feature fusion to obtain a fusion feature image;
the target detection module is used for inputting the fusion feature image into a 2.5D complete target detection model to obtain a 2.5D target detection frame of a target object in the scene to be detected; wherein the 2.5D complete target detection model is obtained by training based on the training method of the target detection model according to any one of claims 1 to 4.
10. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the training method of the target detection model according to any one of claims 1-4 or the target detection method according to any one of claims 5-7.
11. A computer-readable storage medium storing computer instructions which, when executed by a processor, cause the processor to implement the training method of the target detection model according to any one of claims 1-4 or the target detection method according to any one of claims 5-7.
CN202211569525.8A 2022-12-08 2022-12-08 Target detection method, device, equipment and storage medium Pending CN115775387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211569525.8A CN115775387A (en) 2022-12-08 2022-12-08 Target detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211569525.8A CN115775387A (en) 2022-12-08 2022-12-08 Target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115775387A true CN115775387A (en) 2023-03-10

Family

ID=85391923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211569525.8A Pending CN115775387A (en) 2022-12-08 2022-12-08 Target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115775387A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758510A (en) * 2023-05-12 2023-09-15 嬴彻星创智能科技(上海)有限公司 Pedestrian detection model training method, pedestrian detection method, training device and equipment
CN117746418A (en) * 2023-12-20 2024-03-22 北京百度网讯科技有限公司 Target detection model construction method, target detection method and related device

Similar Documents

Publication Publication Date Title
CN110322500B (en) Optimization method and device, medium and electronic equipment for real-time positioning and map construction
CN116188893B (en) Image detection model training and target detection method and device based on BEV
CN113674421B (en) 3D target detection method, model training method, related device and electronic equipment
CN108961327A (en) A kind of monocular depth estimation method and its device, equipment and storage medium
CN113177968A (en) Target tracking method and device, electronic equipment and storage medium
CN114494629B (en) Method, device, equipment and storage medium for constructing three-dimensional map
CN110349212B (en) Optimization method and device, medium and electronic equipment for real-time positioning and map construction
US11074707B2 (en) Method and system of antenna measurement for mobile communication base station
US12125238B2 (en) Information processing device, information processing method, and computer program product
CN115775387A (en) Target detection method, device, equipment and storage medium
CN115685249A (en) Obstacle detection method and device, electronic equipment and storage medium
CN117132649A (en) Artificial intelligence integrated Beidou satellite navigation ship video positioning method and device
WO2024083006A1 (en) Three-dimensional imaging method and apparatus, device, and storage medium
CN115205806A (en) Method and device for generating target detection model and automatic driving vehicle
CN116129422A (en) Monocular 3D target detection method, monocular 3D target detection device, electronic equipment and storage medium
WO2025130249A1 (en) Object tracking method and apparatus, device, and storage medium
CN113984072B (en) Vehicle positioning method, device, equipment, storage medium and automatic driving vehicle
CN119716892A (en) A dynamic object recognition and tracking method based on laser radar
CN113378694B (en) Method and device for generating target detection and positioning system and target detection and positioning
CN117523428B (en) Ground target detection method and device based on aircraft platform
Nguyen et al. An algorithm using YOLOv4 and DeepSORT for tracking vehicle speed on highway
CN117372928A (en) Video target detection method and device and related equipment
CN116778447A (en) Training method, target detection method, device and equipment for target detection model
CN115952248A (en) Pose processing method, device, equipment, medium and product of terminal equipment
CN116433722A (en) Target tracking method, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination