Disclosure of Invention
The invention aims to provide a small target detection method based on an SCGD-YOLO network under the view angle of an unmanned aerial vehicle, which solves the problem of reduced detection accuracy caused by the low resolution of small targets, complex environmental interference and other conditions when an unmanned aerial vehicle shoots from high altitude. The following technical scheme is specifically provided to solve this technical problem:
A small target detection method based on an SCGD-YOLO network under the unmanned aerial vehicle view angle comprises the following steps:
Step 1, acquiring an open source data set of small targets under the view angle of an unmanned aerial vehicle, and dividing the data set into a training set, a verification set and a test set, wherein the data set comprises ten categories, most of which take small targets as the main body;
Preferably, the VisDrone2019 data set is adopted as the experimental data set for training, verification and testing. It collects objects photographed by unmanned aerial vehicles at different positions, angles and environments, contains 10 categories, most of which take small targets as the main body, and is a data set specially used for small target detection.
Step 2, configuring a network environment required by a network model;
Preferably, the configured network environment is the Ubuntu 16.04 LTS operating system; experiments are run on an NVIDIA GTX 3090 GPU with 16 GB of video memory, using Python 3.8.16, PyTorch 1.13.1 and torchvision 0.14.1.
Step 3, constructing an SCGD-YOLO network model, wherein the SCGD-YOLO network model comprises a backbone network, a neck network and a detection head;
In the SCGD-YOLO network model, an input image first undergoes feature extraction in the backbone network, where convolution layers, batch normalization and activation functions gradually extract the low-level and high-level features of the image; multi-layer feature extraction captures the spatial information and semantic information in the image. The extracted feature information is then input into the neck network for feature fusion, where a feature pyramid fuses features of different layers. Finally, the fused feature information is sent to the detection head, which outputs the target information predicted by each branch and integrates the class labels, bounding-box coordinates and confidence of the output image.
Preferably, the SCGD-YOLO network model is improved by taking YOLOv8 as the baseline model: first, the improved C2f-CAG and C2f-CFG modules are introduced into the backbone network and neck network to replace the original C2f module; second, the feature pyramid of the original neck network is replaced with a brand-new feature pyramid structure SCOK; finally, the decoupled head structure of the head network is replaced with a lightweight detection head LSDC comprising shared convolution.
Preferably, in the improved C2f-CAG and C2f-CFG modules, the C2f-CAG structure introduces the CAFormer module from the Transformer family, whose token mixer is a self-attention layer; the Convolutional GLU gating mechanism from TransNext replaces the MLP layer in the CAFormer module, the GLU adaptively controls the transmission of the information flow, and a convolution operation replaces the traditional fully connected layer. The C2f-CFG structure introduces the ConvFormer module from the Transformer family, whose token mixer is a separable convolution composed of a depth-wise convolution and a point-wise convolution: the depth-wise convolution extracts spatial features and the point-wise convolution extracts channel features. The Convolutional GLU gating mechanism likewise replaces the MLP layer in the ConvFormer module, and the C2f-CAG and C2f-CFG modules are used in collocation with each other.
Preferably, the Convolutional GLU gating mechanism is a nonlinear activation function based on a gating mechanism, expressed as GLU(x)=(W1x)⊙σ(W2x), wherein W1x is a linear transformation of the input, W2x is another linear transformation serving as the feature extraction part, σ(·) is the sigmoid activation function whose output controls the gating between 0 and 1, and ⊙ denotes element-wise multiplication.
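For illustration, a minimal PyTorch sketch of this gating formula follows; nn.Linear stands in for the transformations W1 and W2, whereas the actual Convolutional GLU replaces them with convolutions whose exact layout is not specified here.

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Gated Linear Unit: GLU(x) = (W1 x) * sigmoid(W2 x).

    A minimal sketch of the gating formula above; nn.Linear is an
    illustrative stand-in for the convolutional transformations."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.w1 = nn.Linear(dim_in, dim_out)  # value branch W1
        self.w2 = nn.Linear(dim_in, dim_out)  # gating branch W2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # sigmoid keeps the gate in (0, 1); * is element-wise multiplication
        return self.w1(x) * torch.sigmoid(self.w2(x))

y = GLU(64, 64)(torch.randn(8, 64))  # gate a batch of 8 feature vectors
```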
Preferably, in the brand-new feature pyramid SCOK, SPD-Conv is introduced to extract small target information. SPD-Conv is composed of a space-to-depth layer and a non-strided convolution layer; it downsamples the feature map while retaining all information in the channel dimension. After the small target information in SPD-Conv is spliced with the small target information in the feature layer, it is transmitted to the SPlit-Omni-Kernel module for feature fusion and output to the detection head for small target detection and positioning.
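For illustration, a minimal PyTorch sketch of SPD-Conv follows under assumed channel counts; the space-to-depth step rearranges each 2×2 spatial block into the channel dimension, so the downsampling discards no pixel information.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Sketch of SPD-Conv: a space-to-depth layer followed by a
    non-strided convolution. Channel counts are illustrative."""
    def __init__(self, c_in: int, c_out: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # non-strided (stride=1) conv fuses the stacked channels
        self.conv = nn.Conv2d(c_in * scale * scale, c_out, 3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        n, c, h, w = x.shape
        # space-to-depth: (N, C, H, W) -> (N, C*s*s, H/s, W/s)
        x = x.view(n, c, h // s, s, w // s, s)
        x = x.permute(0, 1, 3, 5, 2, 4).reshape(n, c * s * s, h // s, w // s)
        return self.conv(x)

# halve 64x64 spatial resolution without discarding any pixels
out = SPDConv(32, 64)(torch.randn(1, 32, 64, 64))  # (1, 64, 32, 32)
```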
Preferably, the SPlit-Omni-Kernel module divides the input features into two branches according to the CSP residual concept: one branch is processed by the Omni-Kernel module, the other branch is kept unchanged, and finally the reconstruction of multi-scale information is realized through feature cascading. The Omni-Kernel module comprises a large branch, a global branch and a local branch.
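A minimal sketch of this CSP-style split follows; a plain 3×3 convolution stands in for the Omni-Kernel module (an assumption for the example), and an even channel count is assumed.

```python
import torch
import torch.nn as nn

class SplitBlock(nn.Module):
    """Sketch of the CSP-style split in SPlit-Omni-Kernel: half of the
    channels go through a processing module, the other half pass
    through unchanged, and the two are concatenated (feature cascade)."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2  # assumes an even channel count
        # stand-in for the Omni-Kernel module (illustrative assumption)
        self.process = nn.Conv2d(half, half, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)  # split channels into two branches
        return torch.cat((self.process(a), b), dim=1)

out = SplitBlock(64)(torch.randn(1, 64, 40, 40))  # shape preserved
```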
Preferably, in the lightweight detection head LSDC with shared convolution, a 1×1 convolution is first used to adjust the number of channels, and then two 3×3 convolutions with shared weights replace the original twelve 3×3 convolutions for feature extraction. Small target information is captured in the feature extraction stage by introducing detail enhancement convolution, and normalization is introduced in the 1×1 convolution and in the convolutions of the feature extractor. The flow of the normalization is derived as follows:
Wherein N×C×H×W is the size of the input feature map x; the normalization first divides the channels into groups, and assuming the channels are divided into G groups, each group contains C' = C/G channels. For each of the G groups, the mean μg and variance σg² of the group are calculated, each channel in each group is normalized element by element, and a trainable scaling factor γ and offset β are introduced.
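For illustration, a minimal PyTorch sketch of the shared-convolution feature extractor with group normalization follows; the channel count, group count and number of outputs are assumptions for the example, and the detail enhancement convolution described next is omitted for brevity.

```python
import torch
import torch.nn as nn

class SharedConvHead(nn.Module):
    """Sketch of the weight-sharing idea in the LSDC head: one pair of
    3x3 conv + GroupNorm feature extractors is reused across the
    P3/P4/P5 maps instead of twelve per-scale 3x3 convolutions."""
    def __init__(self, c: int, num_outputs: int, groups: int = 16):
        super().__init__()
        self.extract = nn.Sequential(  # shared across all scales
            nn.Conv2d(c, c, 3, padding=1), nn.GroupNorm(groups, c), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.GroupNorm(groups, c), nn.SiLU(),
        )
        self.pred = nn.Conv2d(c, num_outputs, 1)  # 1x1 prediction conv

    def forward(self, feats):
        # the same two 3x3 convolutions process every pyramid level
        return [self.pred(self.extract(f)) for f in feats]

feats = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]  # P3, P4, P5
outs = SharedConvHead(64, 64 + 10)(feats)  # e.g. box + 10 class outputs
```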
Preferably, the detail enhancement convolution comprises five convolution layers deployed in parallel: a common convolution, an angle difference convolution, a center difference convolution, a horizontal difference convolution and a vertical difference convolution. It is used to recover the spatial resolution of the image and enhance the detail parts.
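As a sketch of the difference-convolution idea, two of the five parallel branches can be written as follows; the central difference formulation shown here is an assumption, and the angle, horizontal and vertical variants rearrange the kernel analogously.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDiffConv2d(nn.Module):
    """Central difference convolution: responds to local intensity
    differences rather than absolute intensities, sharpening detail."""
    def __init__(self, channels: int, theta: float = 1.0):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        # subtracting the summed kernel applied as a 1x1 conv is
        # equivalent to centering every 3x3 kernel on the middle pixel
        k = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        return out - self.theta * F.conv2d(x, k)

class DetailEnhanceConv(nn.Module):
    """Sketch of the parallel-branch structure: a plain convolution plus
    a difference convolution, with their outputs summed. Only two of
    the five branches are spelled out here."""
    def __init__(self, channels: int):
        super().__init__()
        self.plain = nn.Conv2d(channels, channels, 3, padding=1)
        self.cdc = CentralDiffConv2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.plain(x) + self.cdc(x)

out = DetailEnhanceConv(32)(torch.randn(1, 32, 40, 40))
```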
Step 4, sending the pictures and labels of the training data set into the constructed SCGD-YOLO network model for training, and adjusting the corresponding hyperparameters according to the results on the verification set to obtain the optimal training result;
Preferably, no pre-trained model is used in the training process, the input image size is set to 640×640 pixels, 200 epochs are trained, the batch size is set to 16, the initial learning rate is 0.01, stochastic gradient descent (SGD) is used for parameter optimization, and the final weight file is saved after training is completed.
Step 5, sending the pictures to be detected in the test set into the trained SCGD-YOLO network model to detect small targets, and outputting the detection results.
Compared with the prior art, the invention has the following beneficial effects:
(1) Aiming at the problems that the Bottleneck in the C2f module of the backbone network and neck network has weak detection capability for small targets in complex environments and that its MLP (multi-layer perceptron) contains a large number of parameters, the invention designs a brand-new lightweight Bottleneck module and improves the network, reducing the parameter quantity and calculation quantity of the model and minimizing the delay overhead.
(2) The invention designs a brand-new feature pyramid module that extracts the small target information of the feature layer and fuses it with the P2 feature layer, improving the fusion of features from different parts and the ability to capture small targets in unmanned aerial vehicle images; good detection precision can be maintained in environments such as densely stacked vehicles and strong sunlight, which improves the adaptability of the unmanned aerial vehicle to different environments.
(3) Aiming at the fact that the decoupled head occupies a large parameter quantity in the network model, a brand-new detection head is designed: on the premise of retaining the decoupled head, a shared convolution module is introduced to reduce the parameter quantity, and detail enhancement convolution is introduced to improve the ability to capture small target information, so that the detection head is lightweight while accuracy is guaranteed.
(4) The SCGD-YOLO algorithm provided by the invention has the characteristics of high precision, few parameters and easiness in deployment, and has strong practicability and great application prospect.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a small target detection method based on an SCGD-YOLO network under the view angle of an unmanned aerial vehicle, and specifically provides a technical scheme with reference to FIGS. 1 to 11, wherein the method comprises the following steps:
Step 1, acquiring an open source data set of small targets under the unmanned aerial vehicle view angle from the network, wherein the data set comprises ten categories, most of which take small targets as the main body;
In this embodiment, the VisDrone2019 dataset is adopted as the experimental dataset for training, verification and testing. The dataset was collected by the AISKYEYE team of the Machine Learning and Data Mining Laboratory of Tianjin University and gathers objects photographed by unmanned aerial vehicles at different positions, angles and environments. It comprises 10 categories, with 6471 pictures in the training set, 548 in the verification set and 1610 in the test set. The ten categories of objects contained in the images are pedestrians, people, bicycles, cars, vans, trucks, tricycles, awning tricycles, buses and motorcycles; most of them take small targets as the main body, making it a dataset specially suited for small target detection.
Step 2, configuring a network environment required by a network model;
In this embodiment, the configured network environment is the Ubuntu 16.04 LTS operating system; experiments are run on an NVIDIA GTX 3090 GPU with 16 GB of video memory, using Python 3.8.16, PyTorch 1.13.1 and torchvision 0.14.1.
Step 3, constructing the SCGD-YOLO network model. The SCGD-YOLO network model is improved by taking YOLOv8 as the baseline model and comprises a Backbone network (Backbone), a neck network (Neck) and a detection Head (Head). First, the improved C2f-CAG and C2f-CFG modules are introduced into the backbone network and neck network to replace the original C2f module; second, a brand-new feature pyramid structure SCOK is designed to replace the feature pyramid of the original neck network; finally, the decoupled head structure of the head network is replaced with a lightweight detection head LSDC comprising shared convolution.
In this embodiment, in combination with fig. 2, the SCGD-YOLO network model diagram shows that an input image first undergoes feature extraction through the backbone network, where convolution layers, batch normalization and activation functions gradually extract the low-level and high-level features of the image; multi-layer feature extraction captures the spatial information and semantic information in the image. The extracted feature information is then input into the neck network for feature fusion, where the feature pyramid fuses features of different layers. Finally, the fused feature information is sent to the detection head, the model outputs the target information predicted by each branch, and the class labels, frame coordinates and confidence of the output image are integrated.
Further, the invention replaces the C2f modules in the backbone network and the neck network with the C2f-CAG and C2f-CFG modules; the structures of the two modules are shown in figure 3. The C2f-CAG structure introduces the CAFormer module from the Transformer family, whose token mixer is a self-attention layer that can more accurately distinguish which channels are more important during feature extraction, dynamically adjusting the weight of each channel and improving the feature selection ability of the model. At the same time, the Convolutional GLU gating mechanism from TransNext replaces the MLP layer in the CAFormer module: the GLU adaptively controls the transmission of the information flow, enhancing the nonlinear expression ability of the model, and the convolution operation replaces the traditional fully connected layer, significantly reducing the parameter quantity of the model. Likewise, the C2f-CFG structure introduces the ConvFormer module, whose token mixer is a separable convolution composed of a depth-wise convolution and a point-wise convolution: the depth-wise convolution extracts spatial features and the point-wise convolution extracts channel features, and the Convolutional GLU gating mechanism also replaces its MLP layer. The C2f-CAG module with self-attention is used in the backbone network, while the lighter C2f-CFG module is used in the neck network, so the two modules are used in collocation; a sketch of the separable-convolution token mixer is given below.
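For illustration, a minimal PyTorch sketch of the ConvFormer-style separable-convolution token mixer follows; the kernel size is an assumption for the example.

```python
import torch
import torch.nn as nn

class SepConvMixer(nn.Module):
    """Sketch of a separable-convolution token mixer: a depth-wise
    convolution extracts spatial features per channel, then a
    point-wise (1x1) convolution mixes information across channels."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

out = SepConvMixer(64)(torch.randn(1, 64, 40, 40))  # shape preserved
```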
Illustratively, the GLU acts as a non-linear activation function based on gating mechanisms, expressed as follows:
GLU(x)=(W1x)⊙σ(W2x) (1)
Wherein W1x is a linear transformation of the input, W2x is another linear transformation serving as the feature extraction part, σ(·) is the sigmoid activation function whose output controls the gating between 0 and 1, and ⊙ indicates element-wise multiplication.
Further, aiming at the poor extraction and fusion of small target information, a brand-new feature pyramid structure SCOK is designed in the neck network. In this embodiment, with reference to fig. 4, fig. 4 is a schematic structural diagram of the brand-new feature pyramid SCOK in the embodiment of the present invention; SPD-Conv is composed of a space-to-depth layer and a non-strided convolution layer. After integrating the small target information in SPD-Conv with the small target information of the P3 feature layer, considering that the feature information fusion is insufficient and that fully processing the small target information would greatly increase the parameter quantity, a SPlit-Omni-Kernel module is designed. Specifically, with reference to FIG. 5, FIG. 5 is a schematic diagram of the SPlit-Omni-Kernel module in FIG. 4: the input features are divided into two branches according to the CSP residual concept, one branch is processed by the Omni-Kernel module, the other branch is kept unchanged, and finally the reconstruction of multi-scale information is realized through feature cascading.
In the Omni-Kernel module, as shown in fig. 6, the input feature map is first processed by a 1×1 convolution and then passed through three branches. The large branch comprises a 15×1 depth-wise convolution, a 15×15 depth-wise convolution and a 1×15 depth-wise convolution, which capture small target information in different directions. The global branch is composed of a dual-domain channel attention module (DCAM) and a frequency-based spatial attention module (FSAM); this dual-domain processing compensates for the global regions that the large branch cannot cover. The local branch adopts a simple 1×1 depth-wise convolution layer, improving the utilization of feature information without increasing model complexity. Finally, the results of the three branches and the feature map output by the 1×1 convolution on the input side are spliced and then processed by another 1×1 convolution. The kernel size K of the depth-wise convolutions in the large branch affects the parameter quantity and accuracy of the model; testing shows that setting K to 15 balances the two.
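For illustration, a minimal PyTorch sketch of the large branch with K = 15 follows; summing the three depth-wise responses is an assumption for the example, and the global (DCAM/FSAM) and local branches are omitted.

```python
import torch
import torch.nn as nn

class LargeBranch(nn.Module):
    """Sketch of the Omni-Kernel large branch: three depth-wise
    convolutions (Kx1 strip, KxK square, 1xK strip) applied in parallel
    to capture cues along different orientations."""
    def __init__(self, c: int, k: int = 15):
        super().__init__()
        dw = dict(groups=c, bias=False)  # depth-wise: one filter per channel
        self.strip_v = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), **dw)
        self.square = nn.Conv2d(c, c, (k, k), padding=(k // 2, k // 2), **dw)
        self.strip_h = nn.Conv2d(c, c, (1, k), padding=(0, k // 2), **dw)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # responses summed here as an illustrative fusion choice
        return self.strip_v(x) + self.square(x) + self.strip_h(x)

out = LargeBranch(64)(torch.randn(1, 64, 40, 40))  # shape preserved
```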
In this embodiment, in conjunction with fig. 7, the decoupled head structure of the head network is replaced with a lightweight detection head LSDC comprising shared convolution, as shown in fig. 7. The detection head adopted by the original model is a decoupled head, which divides the target classification task and the bounding-box regression task into two independent processes. This can improve the feature extraction ability of the network to a certain extent, but each regression task needs two 3×3 convolutions to extract and process features and one 1×1 convolution to adjust the bounding box and output its prediction information; since the network model must complete detection on the feature layers of the three scales P3, P4 and P5, twelve 3×3 convolutions and six 1×1 convolutions are needed, which greatly increases the parameter quantity and calculation quantity of the model. To address this problem, two convolutions sharing weights replace the twelve 3×3 convolutions for image feature extraction. Considering that weight sharing reduces the number of parameters and the calculation amount but loses part of the small target information and thus reduces precision, a detail enhancement convolution is introduced in the feature extraction stage to capture small target information and keep the precision from dropping. Extensive research has shown that the Group Normalization (GN) method improves the classification and localization accuracy of detection heads, so GN is introduced on the 1×1 convolution and on the feature extractor convolutions to compensate for the accuracy loss. The flow of GN is derived as follows:
In formulas (2) and (3), assuming the size of the input feature map x is N×C×H×W, GN first divides the channels into groups: assuming the channels are divided into G groups, each group contains C' = C/G channels, and for each group g the mean μg and variance σg² are calculated:

μg = (1/(C'HW)) Σ_{i∈g} xi (2)

σg² = (1/(C'HW)) Σ_{i∈g} (xi − μg)² (3)

In formula (4), each channel in each group is normalized, where ε is a small constant for numerical stability:

x̂i = (xi − μg) / √(σg² + ε) (4)

In formula (5), each element is transformed by a trainable scaling factor γ and offset β, introduced to restore the expressive power of the model:

yi = γ·x̂i + β (5)
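For illustration, formulas (2) to (5) can be transcribed directly into a few lines of PyTorch and checked against the built-in implementation:

```python
import torch

def group_norm(x, G, gamma, beta, eps=1e-5):
    """Direct transcription of formulas (2)-(5): split C channels into
    G groups, normalize each group by its own mean and variance, then
    rescale with trainable gamma and shift with beta."""
    N, C, H, W = x.shape
    xg = x.view(N, G, C // G, H, W)
    mu = xg.mean(dim=(2, 3, 4), keepdim=True)                   # (2)
    var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)   # (3)
    xhat = (xg - mu) / torch.sqrt(var + eps)                    # (4)
    y = xhat.view(N, C, H, W)
    return gamma.view(1, C, 1, 1) * y + beta.view(1, C, 1, 1)   # (5)

# matches PyTorch's built-in GroupNorm up to numerical tolerance
x = torch.randn(2, 8, 4, 4)
g, b = torch.ones(8), torch.zeros(8)
ref = torch.nn.functional.group_norm(x, 4, g, b)
assert torch.allclose(group_norm(x, 4, g, b), ref, atol=1e-5)
```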
Step 4, sending the pictures and labels in the training data set into the network model for training, and adjusting the corresponding hyperparameters according to the results on the verification set to obtain the optimal training result.
In this embodiment, no pre-trained model is used in the training process; the input image size is set to 640×640 pixels, 200 epochs are trained, the batch size is set to 16, the initial learning rate is 0.01, stochastic gradient descent (SGD) is used for parameter optimization, and the final weight file is saved after training is completed.
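For illustration, these settings map onto an Ultralytics-style training call as sketched below; the model configuration file scgd-yolo.yaml and the dataset file VisDrone.yaml are hypothetical names for this example.

```python
# A hedged sketch of the training settings above, assuming the
# Ultralytics YOLO training API and hypothetical config file names.
from ultralytics import YOLO

model = YOLO("scgd-yolo.yaml")  # build from config, no pretrained weights
model.train(
    data="VisDrone.yaml",
    epochs=200,       # 200 training epochs
    imgsz=640,        # 640x640 input
    batch=16,         # batch size 16
    lr0=0.01,         # initial learning rate
    optimizer="SGD",  # stochastic gradient descent
    pretrained=False,
)
```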
Step 5, sending the pictures to be detected in the test set into the trained SCGD-YOLO model to detect small targets.
In this embodiment, the results of the ablation experiments are analyzed below to verify the role of each module in the model; the results are shown in Table 1 below.
TABLE 1 Ablation experiment (√ indicates that the module is added to the model)
As shown by the ablation experiment results in Table 1, after the brand-new feature pyramid network is introduced, the mAP50 and mAP50-95 for small target detection are improved by 2.9% and 2% respectively compared with the reference model YOLOv8s, while the parameter quantity and model size increase to a certain extent. The feature pyramid designed for small targets can better capture small target features and improve precision, but the introduction of more small target feature information increases the parameter quantity and model volume.
Further, regarding the improvement of the C2f module in the YOLOv8 model, separable convolution is introduced into the C2f module of the neck network for lightweighting and a self-attention mechanism is used in the C2f module of the backbone network; the experimental results show that after the C2f-CAG and C2f-CFG modules are introduced, the parameter quantity is reduced by 10.6% and the model volume is reduced by 2.2 MB with only a small loss of average detection precision. The LSDC detection head, which improves on the decoupled head of the reference model, reduces the parameter quantity of the model by 15.3% and the model volume by 1.7 MB while mAP50 decreases by only 0.1%, achieving a lightweight detection head. When the modules provided by the invention are combined, the mAP50 and mAP50-95 of the model improve and the parameter quantity is reduced compared with the reference model, although the model size increases due to the introduction of small target feature information. Finally, with all modules added, the mAP50 and mAP50-95 of the network model rise by 2.4% and 1.7% respectively compared with the original reference model, the parameter quantity is reduced by 27.5% and the model volume by 4.3 MB, which facilitates deployment on unmanned aerial vehicles and meets the requirements of the small target detection task from the unmanned aerial vehicle view angle.
Preferably, to demonstrate the applicability of the algorithm in different scenarios, the TinyPerson dataset is additionally selected for generalization experiments. The TinyPerson dataset takes tiny target detection at long distance and against large backgrounds as its design reference, and its images are collected from the Internet. A key characteristic of the dataset is that persons are divided into two types, offshore people and land people: offshore people include people on ships, people lying in the water and the like, while land people include all other persons. Most targets in the dataset are small targets, and the dataset is used for small target detection. The results of the generalization experiment are shown in Table 2 below.
TABLE 2 generalization experiment results
| Model | P(%) | R(%) | mAP50(%) | mAP50-95(%) | Parameters(M) | Model size/MB |
| YOLOv8s | 44.5 | 29.6 | 28.3 | 9.14 | 11.13 | 21.5 |
| Ours | 47.2 | 33.7 | 31.4 | 9.83 | 8.06 | 17.1 |
Exemplarily, as shown by the generalization experiment results in Table 2, compared with the reference model, the precision (P), recall (R) and average precision (mAP50) of the improved model on the TinyPerson dataset are improved by 2.7%, 4.1% and 3.1% respectively, the parameter quantity is reduced by 27.6% and the model volume by 4.4 MB, which shows that the improved algorithm has good generality and universality.
Preferably, the present application performs two sets of comparison experiments: the first set compares the SCGD-YOLO algorithm herein with other algorithms of the YOLO series, and the second set compares it with other mainstream algorithms in the field of target detection in recent years. The comparison test results are shown in Tables 3 and 4 below.
TABLE 3 comparative test with the YOLO series Algorithm
| Model | P(%) | R(%) | mAP50(%) | Parameters(M) |
| YOLOv5s | 42.5 | 31.9 | 31.0 | 7.04 |
| YOLOv7-tiny | 46.4 | 36.4 | 34.1 | 6.03 |
| YOLOv7 | 51.3 | 42.0 | 39.6 | 36.53 |
| YOLOv8s | 50.3 | 38.3 | 39.1 | 11.13 |
| YOLOv10s | 50.2 | 38.7 | 39.4 | 7.22 |
| YOLOv11s | 49.7 | 37.9 | 38.6 | 9.42 |
| Ours | 53.7 | 39.5 | 41.5 | 8.06 |
By way of example, as can be seen from the comparison test in Table 3, the SCGD-YOLO algorithm proposed by the present invention is significantly ahead of the other YOLO series algorithms in precision and average precision, and its parameter quantity is also among the lightest of the YOLO series algorithms.
TABLE 4 comparative test with other mainstream algorithms
By way of example, as can be seen from the comparison experiment in Table 4, the SCGD-YOLO algorithm provided by the present invention is far ahead of other mainstream algorithms in precision and average precision, and its parameter quantity is far lower than that of the other algorithms, so the SCGD-YOLO algorithm is more advantageous for model deployment.
In this embodiment, in order to demonstrate the effects achieved by the present invention, a description is given with reference to figs. 8, 9, 10 and 11, comparing the detection results of the improved model with those of the original model. As shown in fig. 8, under dense vehicle conditions the original model can detect most vehicles and pedestrians, but the motorcycles and pedestrians that are only half visible in the lower right corner of the picture are not recognized; as shown in fig. 9, the improved model can identify the missed pedestrians and motorcycles in the lower right corner even under dense vehicle conditions. As shown in fig. 10, under sunlight irradiation part of the information is covered and difficult to extract: the original model can detect most vehicles and pedestrians, but the pedestrians in the upper left corner and on the right side of the picture cannot be identified due to sunlight irradiation and vehicle occlusion; as shown in fig. 11, the improved model can identify these pedestrians even under sunlight irradiation and vehicle occlusion. The comparison shows that the improved model maintains better detection precision than the original model in different environments.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the above-mentioned embodiments are merely preferred embodiments of the present invention, and the present invention is not limited thereto, but may be modified or substituted for some of the technical features thereof by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.