Disclosure of Invention
The invention aims to provide a small target detection method based on an SCGD-YOLO network under the view angle of an unmanned aerial vehicle, which solves the problem of reduced detection accuracy caused by the low resolution of small targets, complex environmental interference and other conditions when an unmanned aerial vehicle shoots from high altitude. The following technical scheme is specifically provided to solve this technical problem:
A small target detection method based on an SCGD-YOLO network under the unmanned aerial vehicle view angle comprises the following steps:
Step 1, acquiring an open source data set of small targets under the view angle of an unmanned aerial vehicle, and dividing the data set into a training set, a verification set and a test set, wherein the data set comprises ten categories, most of which take small targets as the main body;
Preferably, the VisDrone2019 data set is adopted as the experimental data set for training, verification and testing. It collects objects photographed by unmanned aerial vehicles at different positions, angles and environments, contains 10 categories, most of which take small targets as the main body, and is a data set specially used for small target detection.
Step 2, configuring a network environment required by a network model;
Preferably, the configured network environment is the Ubuntu 16.04 LTS operating system; experiments are run on an NVIDIA GTX 3090 GPU with 16 GB of video memory, using Python 3.8.16, PyTorch 1.13.1 and torchvision 0.14.1.
Step 3, constructing an SCGD-YOLO network model, wherein the SCGD-YOLO network model comprises a backbone network, a neck network and a detection head;
In the SCGD-YOLO network model, an input image first undergoes feature extraction in the backbone network, where convolution layers, batch normalization and activation functions gradually extract the low-level and high-level features of the image; multi-layer feature extraction captures the spatial information and semantic information in the image. The extracted feature information is then input into the neck network for feature fusion, where a feature pyramid fuses features of different layers. Finally, the fused feature information is sent to the detection head, which outputs the target information predicted by each branch and integrates the class labels, bounding-box coordinates and confidence of the output image.
Preferably, the SCGD-YOLO network model is improved by taking YOLOv8 as the baseline model: first, the improved C2f-CAG and C2f-CFG modules are introduced into the backbone network and neck network to replace the original C2f module; second, the feature pyramid of the original neck network is replaced with a brand-new feature pyramid structure SCOK; finally, the decoupled head structure of the head network is replaced with a lightweight detection head LSDC comprising shared convolution.
Preferably, in the improved C2f-CAG and C2f-CFG modules, the C2f-CAG structure introduces the CAFormer module from the Transformer family, whose token mixer is a self-attention layer; the Convolutional GLU gating mechanism from TransNext replaces the MLP layer in the CAFormer module, the GLU adaptively controls the transmission of the information flow, and a convolution operation replaces the traditional fully connected layer. The C2f-CFG structure introduces the ConvFormer module from the Transformer family, whose token mixer is a separable convolution composed of a depth-wise convolution and a point-wise convolution: the depth-wise convolution extracts spatial features and the point-wise convolution extracts channel features. The Convolutional GLU gating mechanism likewise replaces the MLP layer in the ConvFormer module, and the C2f-CAG and C2f-CFG modules are used in collocation with each other.
Preferably, the Convolutional GLU gating mechanism is a nonlinear activation function based on a gating mechanism, expressed as GLU(x)=(W1x)⊙σ(W2x), wherein W1x is a linear transformation of the input, W2x is another linear transformation serving as the feature extraction part, σ(·) is the sigmoid activation function whose output controls the gating between 0 and 1, and ⊙ denotes element-wise multiplication.
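For illustration, a minimal PyTorch sketch of this gating formula follows; nn.Linear stands in for the transformations W1 and W2, whereas the actual Convolutional GLU replaces them with convolutions whose exact layout is not specified here.

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Gated Linear Unit: GLU(x) = (W1 x) * sigmoid(W2 x).

    A minimal sketch of the gating formula above; nn.Linear is an
    illustrative stand-in for the convolutional transformations."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.w1 = nn.Linear(dim_in, dim_out)  # value branch W1
        self.w2 = nn.Linear(dim_in, dim_out)  # gating branch W2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # sigmoid keeps the gate in (0, 1); * is element-wise multiplication
        return self.w1(x) * torch.sigmoid(self.w2(x))

y = GLU(64, 64)(torch.randn(8, 64))  # gate a batch of 8 feature vectors
```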
Preferably, in the brand-new feature pyramid SCOK, SPD-Conv is introduced to extract small target information. SPD-Conv is composed of a space-to-depth layer and a non-strided convolution layer; it downsamples the feature map while retaining all information in the channel dimension. After the small target information in SPD-Conv is spliced with the small target information in the feature layer, it is transmitted to the SPlit-Omni-Kernel module for feature fusion and output to the detection head for small target detection and positioning.
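For illustration, a minimal PyTorch sketch of SPD-Conv follows under assumed channel counts; the space-to-depth step rearranges each 2×2 spatial block into the channel dimension, so the downsampling discards no pixel information.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Sketch of SPD-Conv: a space-to-depth layer followed by a
    non-strided convolution. Channel counts are illustrative."""
    def __init__(self, c_in: int, c_out: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # non-strided (stride=1) conv fuses the stacked channels
        self.conv = nn.Conv2d(c_in * scale * scale, c_out, 3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        n, c, h, w = x.shape
        # space-to-depth: (N, C, H, W) -> (N, C*s*s, H/s, W/s)
        x = x.view(n, c, h // s, s, w // s, s)
        x = x.permute(0, 1, 3, 5, 2, 4).reshape(n, c * s * s, h // s, w // s)
        return self.conv(x)

# halve 64x64 spatial resolution without discarding any pixels
out = SPDConv(32, 64)(torch.randn(1, 32, 64, 64))  # (1, 64, 32, 32)
```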
Preferably, the SPlit-Omni-Kernel module divides the input features into two branches according to the CSP residual concept: one branch is processed by the Omni-Kernel module, the other branch is kept unchanged, and finally the reconstruction of multi-scale information is realized through feature cascading. The Omni-Kernel module comprises a large branch, a global branch and a local branch.
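A minimal sketch of this CSP-style split follows; a plain 3×3 convolution stands in for the Omni-Kernel module (an assumption for the example), and an even channel count is assumed.

```python
import torch
import torch.nn as nn

class SplitBlock(nn.Module):
    """Sketch of the CSP-style split in SPlit-Omni-Kernel: half of the
    channels go through a processing module, the other half pass
    through unchanged, and the two are concatenated (feature cascade)."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2  # assumes an even channel count
        # stand-in for the Omni-Kernel module (illustrative assumption)
        self.process = nn.Conv2d(half, half, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)  # split channels into two branches
        return torch.cat((self.process(a), b), dim=1)

out = SplitBlock(64)(torch.randn(1, 64, 40, 40))  # shape preserved
```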
Preferably, in the lightweight detection head LSDC with shared convolution, a 1×1 convolution is first used to adjust the number of channels, and then two 3×3 convolutions with shared weights replace the original twelve 3×3 convolutions for feature extraction. Small target information is captured in the feature extraction stage by introducing detail enhancement convolution, and normalization is introduced in the 1×1 convolution and in the convolutions of the feature extractor. The flow of the normalization is derived as follows:
Wherein N×C×H×W is the size of the input feature map x; the normalization first divides the channels into groups, and assuming the channels are divided into G groups, each group contains C' = C/G channels. For each of the G groups, the mean μg and variance σg² of the group are calculated, each channel in each group is normalized element by element, and a trainable scaling factor γ and offset β are introduced.
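For illustration, a minimal PyTorch sketch of the shared-convolution feature extractor with group normalization follows; the channel count, group count and number of outputs are assumptions for the example, and the detail enhancement convolution described next is omitted for brevity.

```python
import torch
import torch.nn as nn

class SharedConvHead(nn.Module):
    """Sketch of the weight-sharing idea in the LSDC head: one pair of
    3x3 conv + GroupNorm feature extractors is reused across the
    P3/P4/P5 maps instead of twelve per-scale 3x3 convolutions."""
    def __init__(self, c: int, num_outputs: int, groups: int = 16):
        super().__init__()
        self.extract = nn.Sequential(  # shared across all scales
            nn.Conv2d(c, c, 3, padding=1), nn.GroupNorm(groups, c), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.GroupNorm(groups, c), nn.SiLU(),
        )
        self.pred = nn.Conv2d(c, num_outputs, 1)  # 1x1 prediction conv

    def forward(self, feats):
        # the same two 3x3 convolutions process every pyramid level
        return [self.pred(self.extract(f)) for f in feats]

feats = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]  # P3, P4, P5
outs = SharedConvHead(64, 64 + 10)(feats)  # e.g. box + 10 class outputs
```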
Preferably, the detail enhancement convolution comprises five convolution layers deployed in parallel: a common convolution, an angle difference convolution, a center difference convolution, a horizontal difference convolution and a vertical difference convolution. It is used to recover the spatial resolution of the image and enhance the detail parts.
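As a sketch of the difference-convolution idea, two of the five parallel branches can be written as follows; the central difference formulation shown here is an assumption, and the angle, horizontal and vertical variants rearrange the kernel analogously.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDiffConv2d(nn.Module):
    """Central difference convolution: responds to local intensity
    differences rather than absolute intensities, sharpening detail."""
    def __init__(self, channels: int, theta: float = 1.0):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        # subtracting the summed kernel applied as a 1x1 conv is
        # equivalent to centering every 3x3 kernel on the middle pixel
        k = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        return out - self.theta * F.conv2d(x, k)

class DetailEnhanceConv(nn.Module):
    """Sketch of the parallel-branch structure: a plain convolution plus
    a difference convolution, with their outputs summed. Only two of
    the five branches are spelled out here."""
    def __init__(self, channels: int):
        super().__init__()
        self.plain = nn.Conv2d(channels, channels, 3, padding=1)
        self.cdc = CentralDiffConv2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.plain(x) + self.cdc(x)

out = DetailEnhanceConv(32)(torch.randn(1, 32, 40, 40))
```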
Step 4, sending the pictures and labels of the training data set into the constructed SCGD-YOLO network model for training, and adjusting the corresponding hyperparameters according to the results on the verification set to obtain the optimal training result;
Preferably, no pre-trained model is used in the training process, the input image size is set to 640×640 pixels, 200 epochs are trained, the batch size is set to 16, the initial learning rate is 0.01, stochastic gradient descent (SGD) is used for parameter optimization, and the final weight file is saved after training is completed.
Step 5, sending the pictures to be detected in the test set into the trained SCGD-YOLO network model to detect small targets, and outputting the detection results.
Compared with the prior art, the invention has the following beneficial effects:
(1) Aiming at the problems that the Bottleneck in the C2f module of the backbone network and neck network has weak detection capability for small targets in complex environments and that its MLP (multi-layer perceptron) contains a large number of parameters, the invention designs a brand-new lightweight Bottleneck module and improves the network, reducing the parameter quantity and calculation quantity of the model and minimizing the delay overhead.
(2) The invention designs a brand-new feature pyramid module that extracts the small target information of the feature layer and fuses it with the P2 feature layer, improving the fusion of features from different parts and the ability to capture small targets in unmanned aerial vehicle images; good detection precision can be maintained in environments such as densely stacked vehicles and strong sunlight, which improves the adaptability of the unmanned aerial vehicle to different environments.
(3) Aiming at the fact that the decoupled head occupies a large parameter quantity in the network model, a brand-new detection head is designed: on the premise of retaining the decoupled head, a shared convolution module is introduced to reduce the parameter quantity, and detail enhancement convolution is introduced to improve the ability to capture small target information, so that the detection head is lightweight while accuracy is guaranteed.
(4) The SCGD-YOLO algorithm provided by the invention has the characteristics of high precision, few parameters and easiness in deployment, and has strong practicability and great application prospect.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a small target detection method based on an SCGD-YOLO network under the view angle of an unmanned aerial vehicle, and specifically provides a technical scheme with reference to FIGS. 1 to 11, wherein the method comprises the following steps:
Step 1, acquiring an open source data set of small targets under the unmanned aerial vehicle view angle from the network, wherein the data set comprises ten categories, most of which take small targets as the main body;
In this embodiment, the VisDrone2019 dataset is adopted as the experimental dataset for training, verification and testing. The dataset was collected by the AISKYEYE team of the Machine Learning and Data Mining Laboratory of Tianjin University and gathers objects photographed by unmanned aerial vehicles at different positions, angles and environments. It comprises 10 categories, with 6471 pictures in the training set, 548 in the verification set and 1610 in the test set. The ten categories of objects contained in the images are pedestrians, people, bicycles, cars, vans, trucks, tricycles, awning tricycles, buses and motorcycles; most of them take small targets as the main body, making it a dataset specially suited for small target detection.
Step 2, configuring a network environment required by a network model;
In this embodiment, the configured network environment is the Ubuntu 16.04 LTS operating system; experiments are run on an NVIDIA GTX 3090 GPU with 16 GB of video memory, using Python 3.8.16, PyTorch 1.13.1 and torchvision 0.14.1.
Step 3, constructing the SCGD-YOLO network model. The SCGD-YOLO network model is improved by taking YOLOv8 as the baseline model and comprises a Backbone network (Backbone), a neck network (Neck) and a detection Head (Head). First, the improved C2f-CAG and C2f-CFG modules are introduced into the backbone network and neck network to replace the original C2f module; second, a brand-new feature pyramid structure SCOK is designed to replace the feature pyramid of the original neck network; finally, the decoupled head structure of the head network is replaced with a lightweight detection head LSDC comprising shared convolution.
In this embodiment, in combination with fig. 2, the SCGD-YOLO network model diagram shows that an input image first undergoes feature extraction through the backbone network, where convolution layers, batch normalization and activation functions gradually extract the low-level and high-level features of the image; multi-layer feature extraction captures the spatial information and semantic information in the image. The extracted feature information is then input into the neck network for feature fusion, where the feature pyramid fuses features of different layers. Finally, the fused feature information is sent to the detection head, the model outputs the target information predicted by each branch, and the class labels, frame coordinates and confidence of the output image are integrated.
Further, the invention replaces the C2f modules in the backbone network and the neck network with the C2f-CAG and C2f-CFG modules; the structures of the two modules are shown in figure 3. The C2f-CAG structure introduces the CAFormer module from the Transformer family, whose token mixer is a self-attention layer that can more accurately distinguish which channels are more important during feature extraction, dynamically adjusting the weight of each channel and improving the feature selection ability of the model. At the same time, the Convolutional GLU gating mechanism from TransNext replaces the MLP layer in the CAFormer module: the GLU adaptively controls the transmission of the information flow, enhancing the nonlinear expression ability of the model, and the convolution operation replaces the traditional fully connected layer, significantly reducing the parameter quantity of the model. Likewise, the C2f-CFG structure introduces the ConvFormer module, whose token mixer is a separable convolution composed of a depth-wise convolution and a point-wise convolution: the depth-wise convolution extracts spatial features and the point-wise convolution extracts channel features, and the Convolutional GLU gating mechanism also replaces its MLP layer. The C2f-CAG module with self-attention is used in the backbone network, while the lighter C2f-CFG module is used in the neck network, so the two modules are used in collocation; a sketch of the separable-convolution token mixer is given below.
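For illustration, a minimal PyTorch sketch of the ConvFormer-style separable-convolution token mixer follows; the kernel size is an assumption for the example.

```python
import torch
import torch.nn as nn

class SepConvMixer(nn.Module):
    """Sketch of a separable-convolution token mixer: a depth-wise
    convolution extracts spatial features per channel, then a
    point-wise (1x1) convolution mixes information across channels."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

out = SepConvMixer(64)(torch.randn(1, 64, 40, 40))  # shape preserved
```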
Illustratively, the GLU acts as a non-linear activation function based on gating mechanisms, expressed as follows:
GLU(x)=(W1x)⊙σ(W2x) (1)
Wherein W1x is a linear transformation of the input, W2x is another linear transformation serving as the feature extraction part, σ(·) is the sigmoid activation function whose output controls the gating between 0 and 1, and ⊙ indicates element-wise multiplication.
Further, aiming at the poor extraction and fusion of small target information, a brand-new feature pyramid structure SCOK is designed in the neck network. In this embodiment, with reference to fig. 4, fig. 4 is a schematic structural diagram of the brand-new feature pyramid SCOK in the embodiment of the present invention; SPD-Conv is composed of a space-to-depth layer and a non-strided convolution layer. After integrating the small target information in SPD-Conv with the small target information of the P3 feature layer, considering that the feature information fusion is insufficient and that fully processing the small target information would greatly increase the parameter quantity, a SPlit-Omni-Kernel module is designed. Specifically, with reference to FIG. 5, FIG. 5 is a schematic diagram of the SPlit-Omni-Kernel module in FIG. 4: the input features are divided into two branches according to the CSP residual concept, one branch is processed by the Omni-Kernel module, the other branch is kept unchanged, and finally the reconstruction of multi-scale information is realized through feature cascading.
In the Omni-Kernel module, as shown in fig. 6, the input feature map is first processed by a 1×1 convolution and then passed through three branches. The large branch comprises a 15×1 depth-wise convolution, a 15×15 depth-wise convolution and a 1×15 depth-wise convolution, which capture small target information in different directions. The global branch is composed of a dual-domain channel attention module (DCAM) and a frequency-based spatial attention module (FSAM); this dual-domain processing compensates for the global regions that the large branch cannot cover. The local branch adopts a simple 1×1 depth-wise convolution layer, improving the utilization of feature information without increasing model complexity. Finally, the results of the three branches and the feature map output by the 1×1 convolution on the input side are spliced and then processed by another 1×1 convolution. The kernel size K of the depth-wise convolutions in the large branch affects the parameter quantity and accuracy of the model; testing shows that setting K to 15 balances the two.
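For illustration, a minimal PyTorch sketch of the large branch with K = 15 follows; summing the three depth-wise responses is an assumption for the example, and the global (DCAM/FSAM) and local branches are omitted.

```python
import torch
import torch.nn as nn

class LargeBranch(nn.Module):
    """Sketch of the Omni-Kernel large branch: three depth-wise
    convolutions (Kx1 strip, KxK square, 1xK strip) applied in parallel
    to capture cues along different orientations."""
    def __init__(self, c: int, k: int = 15):
        super().__init__()
        dw = dict(groups=c, bias=False)  # depth-wise: one filter per channel
        self.strip_v = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), **dw)
        self.square = nn.Conv2d(c, c, (k, k), padding=(k // 2, k // 2), **dw)
        self.strip_h = nn.Conv2d(c, c, (1, k), padding=(0, k // 2), **dw)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # responses summed here as an illustrative fusion choice
        return self.strip_v(x) + self.square(x) + self.strip_h(x)

out = LargeBranch(64)(torch.randn(1, 64, 40, 40))  # shape preserved
```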
In this embodiment, in conjunction with fig. 7, the decoupled head structure of the head network is replaced with a lightweight detection head LSDC comprising shared convolution, as shown in fig. 7. The detection head adopted by the original model is a decoupled head, which divides the target classification task and the bounding-box regression task into two independent processes. This can improve the feature extraction ability of the network to a certain extent, but each regression task needs two 3×3 convolutions to extract and process features and one 1×1 convolution to adjust the bounding box and output its prediction information; since the network model must complete detection on the feature layers of the three scales P3, P4 and P5, twelve 3×3 convolutions and six 1×1 convolutions are needed, which greatly increases the parameter quantity and calculation quantity of the model. To address this problem, two convolutions sharing weights replace the twelve 3×3 convolutions for image feature extraction. Considering that weight sharing reduces the number of parameters and the calculation amount but loses part of the small target information and thus reduces precision, a detail enhancement convolution is introduced in the feature extraction stage to capture small target information and keep the precision from dropping. Extensive research has shown that the Group Normalization (GN) method improves the classification and localization accuracy of detection heads, so GN is introduced on the 1×1 convolution and on the feature extractor convolutions to compensate for the accuracy loss. The flow of GN is derived as follows:
In formulas (2) and (3), assuming the size of the input feature map x is N×C×H×W, GN first divides the channels into groups: assuming the channels are divided into G groups, each group contains C' = C/G channels, and for each group g the mean μg and variance σg² are calculated:

μg = (1/(C'HW)) Σ_{i∈g} xi (2)

σg² = (1/(C'HW)) Σ_{i∈g} (xi − μg)² (3)

In formula (4), each channel in each group is normalized, where ε is a small constant for numerical stability:

x̂i = (xi − μg) / √(σg² + ε) (4)

In formula (5), each element is transformed by a trainable scaling factor γ and offset β, introduced to restore the expressive power of the model:

yi = γ·x̂i + β (5)
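For illustration, formulas (2) to (5) can be transcribed directly into a few lines of PyTorch and checked against the built-in implementation:

```python
import torch

def group_norm(x, G, gamma, beta, eps=1e-5):
    """Direct transcription of formulas (2)-(5): split C channels into
    G groups, normalize each group by its own mean and variance, then
    rescale with trainable gamma and shift with beta."""
    N, C, H, W = x.shape
    xg = x.view(N, G, C // G, H, W)
    mu = xg.mean(dim=(2, 3, 4), keepdim=True)                   # (2)
    var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)   # (3)
    xhat = (xg - mu) / torch.sqrt(var + eps)                    # (4)
    y = xhat.view(N, C, H, W)
    return gamma.view(1, C, 1, 1) * y + beta.view(1, C, 1, 1)   # (5)

# matches PyTorch's built-in GroupNorm up to numerical tolerance
x = torch.randn(2, 8, 4, 4)
g, b = torch.ones(8), torch.zeros(8)
ref = torch.nn.functional.group_norm(x, 4, g, b)
assert torch.allclose(group_norm(x, 4, g, b), ref, atol=1e-5)
```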
Step 4, sending the pictures and labels in the training data set into the network model for training, and adjusting the corresponding hyperparameters according to the results on the verification set to obtain the optimal training result.
In this embodiment, no pre-trained model is used in the training process; the input image size is set to 640×640 pixels, 200 epochs are trained, the batch size is set to 16, the initial learning rate is 0.01, stochastic gradient descent (SGD) is used for parameter optimization, and the final weight file is saved after training is completed.
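For illustration, these settings map onto an Ultralytics-style training call as sketched below; the model configuration file scgd-yolo.yaml and the dataset file VisDrone.yaml are hypothetical names for this example.

```python
# A hedged sketch of the training settings above, assuming the
# Ultralytics YOLO training API and hypothetical config file names.
from ultralytics import YOLO

model = YOLO("scgd-yolo.yaml")  # build from config, no pretrained weights
model.train(
    data="VisDrone.yaml",
    epochs=200,       # 200 training epochs
    imgsz=640,        # 640x640 input
    batch=16,         # batch size 16
    lr0=0.01,         # initial learning rate
    optimizer="SGD",  # stochastic gradient descent
    pretrained=False,
)
```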
Step 5, sending the pictures to be detected in the test set into the trained SCGD-YOLO model to detect small targets.
In this embodiment, the results of the ablation experiments are analyzed below to verify the role of each module in the model; the results are shown in Table 1 below.
TABLE 1 Ablation experiment (√ indicates that the module is added to the model)
As shown by the ablation experiment results in Table 1, after the brand-new feature pyramid network is introduced, the mAP50 and mAP50-95 for small target detection are improved by 2.9% and 2% respectively compared with the reference model YOLOv8s, while the parameter quantity and model size increase to a certain extent. The feature pyramid designed for small targets can better capture small target features and improve precision, but the introduction of more small target feature information increases the parameter quantity and model volume.
Further, regarding the improvement of the C2f module in the YOLOv8 model, separable convolution is introduced into the C2f module of the neck network for lightweighting and a self-attention mechanism is used in the C2f module of the backbone network; the experimental results show that after the C2f-CAG and C2f-CFG modules are introduced, the parameter quantity is reduced by 10.6% and the model volume is reduced by 2.2 MB with only a small loss of average detection precision. The LSDC detection head, which improves on the decoupled head of the reference model, reduces the parameter quantity of the model by 15.3% and the model volume by 1.7 MB while mAP50 decreases by only 0.1%, achieving a lightweight detection head. When the modules provided by the invention are combined, the mAP50 and mAP50-95 of the model improve and the parameter quantity is reduced compared with the reference model, although the model size increases due to the introduction of small target feature information. Finally, with all modules added, the mAP50 and mAP50-95 of the network model rise by 2.4% and 1.7% respectively compared with the original reference model, the parameter quantity is reduced by 27.5% and the model volume by 4.3 MB, which facilitates deployment on unmanned aerial vehicles and meets the requirements of the small target detection task from the unmanned aerial vehicle view angle.
Preferably, to demonstrate the applicability of the algorithm in different scenarios, the TinyPerson dataset is additionally selected for generalization experiments. The TinyPerson dataset takes tiny target detection at long distance and against large backgrounds as its design reference, and its images are collected from the Internet. A key characteristic of the dataset is that persons are divided into two types, offshore people and land people: offshore people include people on ships, people lying in the water and the like, while land people include all other persons. Most targets in the dataset are small targets, and the dataset is used for small target detection. The results of the generalization experiment are shown in Table 2 below.
TABLE 2 generalization experiment results
| Model | P(%) | R(%) | mAP50(%) | mAP50-95(%) | Parameters(M) | Model size/MB |
| YOLOv8s | 44.5 | 29.6 | 28.3 | 9.14 | 11.13 | 21.5 |
| Ours | 47.2 | 33.7 | 31.4 | 9.83 | 8.06 | 17.1 |
Exemplarily, as shown by the generalization experiment results in Table 2, compared with the reference model, the precision (P), recall (R) and average precision (mAP50) of the improved model on the TinyPerson dataset are improved by 2.7%, 4.1% and 3.1% respectively, the parameter quantity is reduced by 27.6% and the model volume by 4.4 MB, which shows that the improved algorithm has good generality and universality.
Preferably, the present application performs two sets of comparison experiments: the first set compares the SCGD-YOLO algorithm herein with other algorithms of the YOLO series, and the second set compares it with other mainstream algorithms in the field of target detection in recent years. The comparison test results are shown in Tables 3 and 4 below.
TABLE 3 comparative test with the YOLO series Algorithm
| Model | P(%) | R(%) | mAP50(%) | Parameters(M) |
| YOLOv5s | 42.5 | 31.9 | 31.0 | 7.04 |
| YOLOv7-tiny | 46.4 | 36.4 | 34.1 | 6.03 |
| YOLOv7 | 51.3 | 42.0 | 39.6 | 36.53 |
| YOLOv8s | 50.3 | 38.3 | 39.1 | 11.13 |
| YOLOv10s | 50.2 | 38.7 | 39.4 | 7.22 |
| YOLOv11s | 49.7 | 37.9 | 38.6 | 9.42 |
| Ours | 53.7 | 39.5 | 41.5 | 8.06 |
By way of example, as can be seen from the comparison test in Table 3, the SCGD-YOLO algorithm proposed by the present invention is significantly ahead of the other YOLO series algorithms in precision and average precision, and its parameter quantity is also among the lightest of the YOLO series algorithms.
TABLE 4 comparative test with other mainstream algorithms
By way of example, as can be seen from the comparison experiment in Table 4, the SCGD-YOLO algorithm provided by the present invention is far ahead of other mainstream algorithms in precision and average precision, and its parameter quantity is far lower than that of the other algorithms, so the SCGD-YOLO algorithm is more advantageous for model deployment.
In this embodiment, in order to demonstrate the effects achieved by the present invention, a description is given with reference to figs. 8, 9, 10 and 11, comparing the detection results of the improved model with those of the original model. As shown in fig. 8, under dense vehicle conditions the original model can detect most vehicles and pedestrians, but the motorcycles and pedestrians that are only half visible in the lower right corner of the picture are not recognized; as shown in fig. 9, the improved model can identify the missed pedestrians and motorcycles in the lower right corner even under dense vehicle conditions. As shown in fig. 10, under sunlight irradiation part of the information is covered and difficult to extract: the original model can detect most vehicles and pedestrians, but the pedestrians in the upper left corner and on the right side of the picture cannot be identified due to sunlight irradiation and vehicle occlusion; as shown in fig. 11, the improved model can identify these pedestrians even under sunlight irradiation and vehicle occlusion. The comparison shows that the improved model maintains better detection precision than the original model in different environments.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the above-mentioned embodiments are merely preferred embodiments of the present invention, and the present invention is not limited thereto, but may be modified or substituted for some of the technical features thereof by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.