Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a hyperspectral target detection and identification method and system based on semantic and spatial spectrum feature fusion.
The hyperspectral target detection and identification method based on semantic and spatial spectrum feature fusion provided by the invention comprises the following steps:
Step 1, extracting an anomalous target in the hyperspectral image based on an unsupervised anomaly detection algorithm, performing rapid coarse detection of the target through a constrained energy minimization operator, and eliminating false alarms based on inter-frame motion characteristic comparison;
Step 2, performing fine detection on the coarse detection result by adopting a dual-stream convolutional neural network, acquiring the spatial-spectral feature information of the target, and predicting the confidence interval range of the target in combination with a cubic long short-term memory network so as to realize dynamic tracking;
Step 3, classifying the detected targets by using a support vector machine trained on historical spectra, and determining the category to which each detected target belongs.
Preferably, the unsupervised anomaly detection algorithm in step 1 adopts an RX detection operator, whose expression is:
RX(x) = (x − μ)^T Σ^{-1} (x − μ)
wherein x is any pixel vector in the image, μ is the sample mean vector, and Σ is the sample covariance matrix of the image;
and false alarms are eliminated by calculating the energy distribution of the target along the eigenvector directions corresponding to the small eigenvalues of the covariance matrix, in combination with a threshold on the target motion distance between adjacent frames.
Preferably, the dual-stream convolutional neural network in step 2 includes an upper branch and a lower branch, each branch has one input, 9 convolutional layers are used in each branch to extract the rich spectral information of the input pixels, and the convolution operations are realized with one-dimensional convolutional layers;
convolutional layers with a kernel stride of 2 replace the pooling layers in the network so as to preserve the spectral features to the greatest extent; the features extracted by the stride-2 convolutional layers are passed through different average pooling layers and added to the features extracted by the last layer, and the final feature of each branch is then obtained through an AVG pooling layer and a fully connected layer;
In the dual-stream convolutional neural network, d is a target prior pixel, t is a target pixel and b is a background pixel; according to the training sample construction, the input of the upper branch is always d; when the input of the lower branch is t, the label of the training sample is 1, and when the input of the lower branch is b, the label is 0; the final features of the two branches, obtained through multiple convolution operations, pooling operations and one fully connected operation, are recorded as F1 and F2, and the two features are then combined;
finally, the output of the dual-stream convolutional neural network is obtained through a last fully connected layer and a Sigmoid function.
Preferably, the cubic long short-term memory (CubicLSTM) network in step 2 consists of a spatial branch, a temporal branch and an output branch; its input comprises the longitude and latitude, speed, acceleration and historical track of the target, and its output is the predicted target position at the next moment; the motion distance of the target between adjacent frames is calculated from the target speed V, the orbital tilt θ, the spatial resolution r of the video hyperspectral camera imaging and the video frame rate f, and the confidence interval is defined as the 10-pixel neighborhood of the target position in the previous frame.
Preferably, the support vector machine in step 3 adopts a soft-margin optimization model and trains a classifier on historical spectral data to distinguish target categories such as aircraft and ships; the loss function is:
min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{N} max(0, 1 − y_i(w^T x_i + b))
wherein w is the weight vector, b is the bias term, C is the regularization parameter, N is the number of samples, and x_i and y_i are the feature vector of the i-th sample and the corresponding label, respectively.
The hyperspectral target detection and identification system based on semantic and spatial spectrum feature fusion provided by the invention comprises the following components:
The coarse detection module is used for extracting an abnormal target in the hyperspectral image based on an unsupervised abnormal detection algorithm, carrying out rapid coarse detection on the target through a constraint energy minimization operator, and eliminating false alarms based on inter-frame motion characteristic comparison;
The fine detection module is used for performing fine detection on the coarse detection result by adopting a dual-stream convolutional neural network, acquiring the spatial-spectral feature information of the target, and predicting the confidence interval range of the target in combination with a cubic long short-term memory network so as to realize dynamic tracking;
and the classification module is used for classifying the detection targets by using a support vector machine based on historical spectrum training and judging the category to which the detection targets belong.
Preferably, the unsupervised anomaly detection algorithm in the coarse detection module adopts an RX detection operator, whose expression is:
RX(x) = (x − μ)^T Σ^{-1} (x − μ)
wherein x is any pixel vector in the image, μ is the sample mean vector, and Σ is the sample covariance matrix of the image;
and false alarms are eliminated by calculating the energy distribution of the target along the eigenvector directions corresponding to the small eigenvalues of the covariance matrix, in combination with a threshold on the target motion distance between adjacent frames.
Preferably, the dual-stream convolutional neural network in the fine detection module comprises an upper branch and a lower branch, each branch has one input, 9 convolutional layers are used in each branch to extract the rich spectral information of the input pixels, and the convolution operations are realized with one-dimensional convolutional layers;
convolutional layers with a kernel stride of 2 replace the pooling layers in the network so as to preserve the spectral features to the greatest extent; the features extracted by the stride-2 convolutional layers are passed through different average pooling layers and added to the features extracted by the last layer, and the final feature of each branch is then obtained through an AVG pooling layer and a fully connected layer;
In the dual-stream convolutional neural network, d is a target prior pixel, t is a target pixel and b is a background pixel; according to the training sample construction, the input of the upper branch is always d; when the input of the lower branch is t, the label of the training sample is 1, and when the input of the lower branch is b, the label is 0; the final features of the two branches, obtained through multiple convolution operations, pooling operations and one fully connected operation, are recorded as F1 and F2, and the two features are then combined;
finally, the output of the dual-stream convolutional neural network is obtained through a last fully connected layer and a Sigmoid function.
Preferably, the cubic long short-term memory (CubicLSTM) network in the fine detection module consists of a spatial branch, a temporal branch and an output branch; its input comprises the longitude and latitude, speed, acceleration and historical track of the target, and its output is the predicted target position at the next moment; the motion distance of the target between adjacent frames is calculated from the target speed V, the orbital tilt θ, the spatial resolution r of the video hyperspectral camera imaging and the video frame rate f, and the confidence interval is defined as the 10-pixel neighborhood of the target position in the previous frame.
Preferably, the support vector machine in the classification module adopts a soft-margin optimization model and trains the classifier on historical spectral data to distinguish target categories such as aircraft and ships; the loss function is:
min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{N} max(0, 1 − y_i(w^T x_i + b))
wherein w is the weight vector, b is the bias term, C is the regularization parameter, N is the number of samples, and x_i and y_i are the feature vector of the i-th sample and the corresponding label, respectively.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a hyperspectral target detection and recognition method based on semantic and spatial spectrum feature fusion. A semantic segmentation network model is constructed in the spatial domain of the hyperspectral image, an adaptive spatial-spectral joint optimization model is designed, and a spatial-spectral joint cascade detector is constructed, thereby realizing pixel-level spatial-domain target detection in hyperspectral images, improving target detection precision, reducing the false alarm rate, and solving the problem that the track of an airborne time-sensitive target is difficult to capture.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Examples
Referring to fig. 1, the present embodiment provides a hyperspectral target detection and identification method based on semantic and spatial spectrum feature fusion, which mainly includes:
Step 1, extracting anomalous targets in the hyperspectral image based on an unsupervised anomaly detection method, performing rapid coarse detection of the targets through the CEM operator, and comparing inter-frame information to eliminate false alarms;
Step 2, performing fine detection through a dual-stream convolutional neural network to obtain the target-related information, and determining with a cubic long short-term memory (CubicLSTM) network the confidence interval range within which the detected target is tracked;
Step 3, classifying the detected targets with an SVM trained on historical spectra, and determining the category to which each detected target belongs.
Further, referring to fig. 2, the present embodiment provides a flowchart based on unsupervised method anomaly detection.
The anomaly detection method based on the unsupervised approach adopts an RX detection operator of the form:
RX(x) = (x − μ)^T Σ^{-1} (x − μ)
wherein x is any pixel vector in the image, μ is the sample mean vector, and Σ is the sample covariance matrix of the image. The quadratic form above has the same form as the Mahalanobis distance, and the RX algorithm can essentially be regarded as the inverse of principal component analysis. Principal component analysis compresses most of the meaningful image information from the original image feature space into a space whose basis is a small number of uncorrelated principal components. Objects that occur with very low probability in the image (small objects, outlier objects) are not represented in these principal components; instead, they are more likely to appear along the eigenvector directions corresponding to the small eigenvalues of the covariance matrix Σ. The RX algorithm computes RX(x) for every pixel: if a pixel is an outlier, its energy is concentrated along eigenvectors of Σ with small eigenvalues, and the smaller the eigenvalue, the larger the RX value, so anomalous objects in the image can be detected effectively.
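By way of illustration only, the RX score of each pixel can be computed as in the following Python sketch (a minimal sketch assuming NumPy and an image cube of shape height × width × bands; the function and variable names are illustrative and not part of the invention):

```python
import numpy as np

def rx_detector(cube):
    """Global RX anomaly detector (illustrative sketch).

    cube: hyperspectral image of shape (H, W, B).
    Returns an (H, W) map of RX scores; large values indicate anomalies.
    """
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(np.float64)      # one pixel spectrum per row
    mu = X.mean(axis=0)                             # sample mean vector
    Xc = X - mu
    sigma = (Xc.T @ Xc) / (X.shape[0] - 1)          # sample covariance matrix
    sigma_inv = np.linalg.pinv(sigma)               # pseudo-inverse for numerical stability
    # Mahalanobis-type quadratic form (x - mu)^T Sigma^{-1} (x - mu) per pixel
    scores = np.einsum('ij,jk,ik->i', Xc, sigma_inv, Xc)
    return scores.reshape(H, W)
```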
The RX detection operator can detect targets such as airplanes and ships present in the hyperspectral image as anomalous targets, but other anomalous points in the image often appear in the output as noise. Anomaly detection therefore realizes coarse detection of targets when the target spectrum is unknown, and further fine detection is then needed to eliminate the noise and reduce the false alarm rate.
Further, referring to FIG. 3a, the present embodiment provides a flow chart for improved constrained energy minimized target detection.
Let S = {x_1, x_2, ..., x_N} be the set of all observed samples, where x_i is any sample pixel vector, N is the number of pixels, L is the number of bands of the image, and d is the spectrum of the target of interest. The purpose of CEM is to design an FIR linear filter w = (w_1, w_2, ..., w_L)^T that minimizes the filtered output energy subject to the constraint w^T d = 1:
min_w w^T R w   subject to   w^T d = 1
where R = (1/N) Σ_{i=1}^{N} x_i x_i^T is the sample autocorrelation matrix of the image.
The solution of the above problem is the CEM operator w*:
w* = R^{-1} d / (d^T R^{-1} d)
Applying the CEM operator to each pixel x of the image yields the output y = (w*)^T x, which reflects the distribution of the target d in the image, and the target is thereby detected.
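The CEM coarse detection can be illustrated with the following sketch (an illustration under stated assumptions: it presumes the target spectrum d is known and uses the sample autocorrelation matrix defined above; names are not prescribed by the method):

```python
import numpy as np

def cem_detector(cube, d):
    """Constrained energy minimization (CEM) detector (illustrative sketch).

    cube: hyperspectral image of shape (H, W, B).
    d:    target spectrum of shape (B,).
    Returns an (H, W) response map; values near 1 indicate pixels matching d.
    """
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(np.float64)
    R = (X.T @ X) / X.shape[0]             # sample autocorrelation matrix
    R_inv = np.linalg.pinv(R)
    w = (R_inv @ d) / (d @ R_inv @ d)      # CEM filter w* = R^{-1} d / (d^T R^{-1} d)
    return (X @ w).reshape(H, W)
```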
Let V be the target moving speed (m/s), θ the flight tilt angle, r the spatial resolution (m) of the video hyperspectral camera imaging, f the video frame rate (fps), and W the video swath width (m), as shown in Table 1.
From the above information, the motion speed v (pixel/s) of the target on the image plane is obtained by projecting the speed V through the tilt angle θ and dividing by the spatial resolution r;
the motion distance d (pixels) of the target between adjacent frames is d = v / f;
the time t (s) for which the target stays in the field of view is obtained by dividing the swath width W by the corresponding ground speed;
and the number of frames for which the target stays in the field of view is n = t · f.
TABLE 1 motion characteristics of aircraft and watercraft reflected on images
It can be calculated that the target remains in the field of view for at least 12 s, and since the hyperspectral video satellite images at 5 frames per second, the target appears continuously over at least 60 frames. A conservative estimate of the distance a moving target travels between two adjacent frames is that it is less than 20 pixels. Thus, the target will typically appear in the vicinity of its position in the previous frame (within a 20-pixel neighborhood).
Therefore, in the anomaly detection of two consecutive frames, any target detected in the previous frame that no longer appears within 20 pixels of its position in the next frame can be regarded as a false alarm.
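A minimal sketch of this inter-frame false-alarm rejection, assuming detections are given as (row, column) coordinates (the 20-pixel threshold follows the analysis above; everything else is illustrative), could be:

```python
import numpy as np

def remove_false_alarms(prev_targets, curr_targets, max_disp=20):
    """Keep only previous-frame detections confirmed by the next frame.

    prev_targets, curr_targets: arrays of (row, col) detections in two
    consecutive frames. A previous-frame detection is kept only if some
    next-frame detection lies within max_disp pixels of it; otherwise it
    is treated as a false alarm.
    """
    prev_targets = np.asarray(prev_targets, dtype=float)
    curr_targets = np.asarray(curr_targets, dtype=float)
    confirmed = []
    for p in prev_targets:
        if curr_targets.size and np.min(np.linalg.norm(curr_targets - p, axis=1)) <= max_disp:
            confirmed.append(p)        # motion-consistent detection
    return np.array(confirmed)
```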
Further, referring to FIG. 3b, the present embodiment provides a result graph of improved constrained energy minimized target detection.
Further, referring to fig. 4, the present embodiment provides a structure diagram of the dual-stream convolutional neural network.
Before training, a mixed pixel selection strategy based on sparse representation and classification is proposed: typical background samples are selected from the hyperspectral image, and sufficient target samples are then generated from these typical background samples and the target prior. During training, the training samples (positive samples with label 1, constructed from the target prior and a target sample, and negative samples with label 0, constructed from the target prior and a background sample) are input into the designed dual-stream convolutional network, which learns the discrimination capability. During testing, the test samples (consisting of the target prior and the pixel under detection) are classified by the trained dual-stream convolutional network, and the outputs of the network constitute the final detection result.
The dual-stream convolutional neural network comprises two branches, an upper branch and a lower branch. Each branch has one input. In each branch, 9 convolutional layers are used to extract the rich spectral information of the input pixels. The convolution operations are implemented with one-dimensional convolutional layers, each followed by a ReLU layer. Since pooling layers may cause loss of spectral information when the spectral dimension needs to be reduced, convolutional layers with a kernel stride of 2 are used instead of pooling layers in the network. To preserve the spectral features to the greatest extent, the features extracted by the stride-2 convolutional layers are passed through different average pooling layers and added to the features extracted by the last layer. The final feature of each branch is then obtained through an average (AVG) pooling layer and a fully connected layer.
In the network of the present invention, d denotes a target prior pixel, t denotes a target pixel and b denotes a background pixel. From the training sample construction described above, the input of the upper branch is always d; when the input of the lower branch is t, the label of the training sample is 1, and when the input of the lower branch is b, the label is 0. After several convolution operations, several pooling operations and one fully connected operation, the final features of the two branches are obtained and recorded as F1 and F2, and the two features are then combined into a single feature vector.
Finally, the output of the network is obtained through a last fully connected layer and a Sigmoid function.
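For illustration, the branch structure can be sketched in PyTorch as below. This is a simplified stand-in, not the exact network: it uses fewer than the nine convolutional layers described above, omits the average-pooling skip additions, and all class names, channel counts and feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SpectralBranch(nn.Module):
    """One branch: 1-D convolutions over the spectral axis; stride-2
    convolutions stand in for pooling, as described above."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)       # AVG pooling layer
        self.fc = nn.Linear(64, feat_dim)         # fully connected layer

    def forward(self, x):                         # x: (batch, bands)
        h = self.convs(x.unsqueeze(1))            # (batch, 64, reduced_bands)
        return self.fc(self.pool(h).squeeze(-1))  # branch feature

class DualStreamDetector(nn.Module):
    """Upper branch receives the target prior d, lower branch the pixel under test."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.upper = SpectralBranch(feat_dim)
        self.lower = SpectralBranch(feat_dim)
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 1), nn.Sigmoid())

    def forward(self, d, x):
        f1, f2 = self.upper(d), self.lower(x)             # features F1, F2
        pred = self.head(torch.cat([f1, f2], dim=1))      # combine, FC, Sigmoid
        return pred.squeeze(-1), f1, f2
```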
There are two loss functions in the proposed network. The first is the binary cross-entropy (BCE) loss:
L_BCE = −(1/M) Σ_{i=1}^{M} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]
wherein M is the batch size, y_i is the label of the training sample, and p_i is the output of the Sigmoid function. The other is the ICS loss, which is proposed to improve the separability of target and background: if the inputs of the two branches belong to the same class (label 1), the distance between their features should be minimized; otherwise they belong to different classes and the distance should be maximized. The ICS loss is therefore defined on the distance between the two branch features, where F1 is the feature extracted by the upper branch and F2 is the feature extracted by the lower branch.
The final loss function L is the sum of the ICS loss and the BCE loss:
L = L_BCE + L_ICS
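As a sketch, the combined loss could be assembled as follows. The BCE term follows the formula above; since only the minimize/maximize behaviour of the ICS term is specified here, it is written as a generic contrastive distance loss, and the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred, label, f1, f2, margin=1.0):
    """BCE on the network output plus an ICS-style term on the branch
    features F1, F2 (illustrative form only).

    pred:   Sigmoid outputs, shape (batch,)
    label:  0/1 labels, shape (batch,)
    f1, f2: branch feature vectors, shape (batch, feat_dim)
    """
    label = label.float()
    bce = F.binary_cross_entropy(pred, label)
    dist = torch.norm(f1 - f2, dim=1)
    # label 1 (same class): pull the features together;
    # label 0 (different class): push them at least `margin` apart
    ics = label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)
    return bce + ics.mean()
```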
A cubic long short-term memory (CubicLSTM) network is a structure developed from the LSTM, consisting of three branches: a spatial branch for capturing moving objects, a temporal branch for processing the motion, and an output branch that combines the first two branches to generate the predicted frame.
The spatial branch flows along the z-axis (spatial axis), where convolutions are responsible for capturing and analyzing moving objects. The spatial state generated by this branch carries spatial layout information about the moving objects.
The temporal branch flows along the x-axis (temporal axis), where convolutions aim to obtain and process the motion. The temporal state generated by this branch contains motion information.
The output branch generates intermediate or final predicted frames along the y-axis (output axis) based on the predicted motion provided by the temporal branch and the moving-object information provided by the spatial branch.
Processing temporal and spatial information separately can yield better predictions, and this separation reduces the prediction burden of the network. Stacking several CubicLSTM units along the spatial and output branches forms a two-dimensional network, which can further be extended along the time axis into a three-dimensional network (CubicRNN); with three layers stacked spatially, the information of the tracked target becomes more prominent and better spatial information is obtained.
The motion characteristics of highly maneuvering aerial targets are divided into two types: short-term motion characteristics and long-term motion characteristics. Short-term motion characteristics include the longitude and latitude, speed and acceleration of the target, and these characteristics can change greatly within a short period. Long-term motion characteristics comprise the historical motion track of the target and, compared with the short-term characteristics, remain relatively stable during the motion of the target. To make target tracking more accurate, both the short-term and the long-term motion characteristics of the target must be used. Therefore, a CubicLSTM network is adopted, and the short-term and long-term motion characteristics of the target are used together as the input of the network, so that the state of the target at the next moment is predicted and the confidence range of the target is calculated. The input and output variables of the CubicLSTM network employed are defined as follows:
TABLE 2 input output relationship
After the predicted target position at the next moment is obtained through the CubicLSTM network, the confidence interval of the target must further be determined according to the motion characteristics of the target; the target is then detected within this range to obtain its true position, thereby realizing tracking. The confidence interval range is analyzed in detail below based on the characteristics of the target motion.
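The role of the predictor can be illustrated with the following simplified sketch, which uses a standard LSTM as a stand-in for the CubicLSTM structure described above; the feature layout (longitude, latitude, speed and acceleration over a window of past frames) and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """Predict the target position at the next moment from its recent track.

    Input:  per-frame motion features, shape (batch, T, 4),
            e.g. longitude, latitude, speed, acceleration.
    Output: predicted (row, col) image position for the next frame.
    """
    def __init__(self, in_dim=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)

    def forward(self, track):
        out, _ = self.lstm(track)        # out: (batch, T, hidden)
        return self.fc(out[:, -1])       # state at the last time step
```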
Let V be the target speed (m/s), θ the orbital tilt, r the spatial resolution (m) of the video hyperspectral camera imaging, f the video frame rate (fps), and W the video swath width (m).
From the above information, the motion speed v (pixel/s) of the target on the image plane is obtained by projecting the speed V through the orbital tilt θ and dividing by the spatial resolution r;
the motion distance d (pixels) of the target between adjacent frames is d = v / f;
the time t (s) for which the target stays in the field of view is obtained by dividing the swath width W by the corresponding ground speed;
and the number of frames for which the target stays in the field of view is n = t · f.
It can be calculated that the distance a moving target travels between two adjacent frames is smaller than 2.5 pixels, and a conservative estimate is that it is smaller than 10 pixels. Thus, the target will typically appear within a 10-pixel neighborhood of its position in the previous frame, and the target is repeatedly detected within this neighborhood to capture its position in the next frame.
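A minimal sketch of this confidence-interval gating, assuming a per-frame detection score map (for example a CEM response) and the position predicted by the trajectory network, could be:

```python
import numpy as np

def redetect_in_neighborhood(score_map, predicted_pos, radius=10):
    """Search for the target within `radius` pixels of the predicted position.

    score_map:     2-D detection scores for the current frame.
    predicted_pos: (row, col) predicted by the trajectory network.
    Returns the (row, col) of the strongest response inside the neighborhood.
    """
    r0, c0 = (int(round(v)) for v in predicted_pos)
    H, W = score_map.shape
    r1, r2 = max(0, r0 - radius), min(H, r0 + radius + 1)
    c1, c2 = max(0, c0 - radius), min(W, c0 + radius + 1)
    window = score_map[r1:r2, c1:c2]
    dr, dc = np.unravel_index(np.argmax(window), window.shape)
    return r1 + dr, c1 + dc
```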
Further, referring to fig. 5a, the present embodiment provides a target classification recognition flowchart of a support vector machine.
A hard-margin SVM applied to a linearly inseparable problem produces classification errors, so a new optimization problem can be constructed by introducing a loss function on the basis of margin maximization. Given input data x_i and learning targets y_i ∈ {−1, +1}, the optimization problem of the soft-margin SVM with the hinge loss function is expressed as:
min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{N} max(0, 1 − y_i(w^T x_i + b))
wherein w is the weight vector, b is the bias term, C is the regularization parameter, N is the number of samples, and x_i and y_i are the feature vector of the i-th sample and the corresponding label, respectively;
the above formula shows that the soft-margin SVM is an L2-regularized classifier, in which max(0, 1 − y_i(w^T x_i + b)) represents the hinge loss function.
After training on the historical spectra, the SVM classifier can classify the targets in the fine detection result and distinguish categories such as aircraft and ships.
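By way of illustration, such a classifier could be trained with a standard library implementation; the sketch below assumes scikit-learn, spectra stored as row vectors, integer class labels and hypothetical file names, none of which is prescribed by the method.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical inputs: historical target spectra and their class labels
X_hist = np.load("historical_spectra.npy")   # shape (n_samples, n_bands)
y_hist = np.load("historical_labels.npy")    # e.g. 0 = aircraft, 1 = ship, 2 = other

clf = SVC(kernel="linear", C=1.0)            # soft-margin SVM; C is the regularization parameter
clf.fit(X_hist, y_hist)

# Classify the spectra of the finely detected targets
detected_spectra = np.load("detected_spectra.npy")
predicted_classes = clf.predict(detected_spectra)
```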
Further, referring to fig. 5b, the present embodiment provides a target classification recognition result diagram of the support vector machine.
Further, referring to fig. 6, the present embodiment provides a hyperspectral target detection and identification system based on semantic and spatial spectrum feature fusion, which mainly includes:
The coarse detection module is used for inputting the hyperspectral image to be detected, detecting an abnormal target in the hyperspectral image based on an abnormal detection RX operator, performing rapid coarse detection on the target through a CEM operator, and eliminating false alarms by comparing the inter-frame information.
The fine detection module is used for performing fine detection through the dual-stream convolutional neural network, acquiring the target-related information, and determining with the cubic long short-term memory network the confidence interval range within which the detected target is tracked.
The classification module classifies the detected targets with an SVM trained on historical spectra and determines the category to which each detected target belongs.
Further, referring to FIG. 7, the present embodiment provides an electronic device, mainly comprising at least one memory and at least one processor, wherein the at least one memory stores instructions that, when executed by the at least one processor, perform the hyperspectral target detection and identification method based on semantic and spatial spectrum feature fusion according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the instructions described above. Here, the electronic device is not necessarily a single electronic device, but may be any device or an aggregate of circuits capable of executing the above-described instructions (or instruction set) singly or in combination. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
In an electronic device, a processor may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor may execute instructions or code stored in the memory, wherein the memory may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory may be integrated with the processor, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory may include a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., such that the processor is able to read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the device and the respective modules thereof provided by the invention can be regarded as a hardware component, and the modules for realizing various programs included therein can be regarded as a structure in the hardware component, and the modules for realizing various functions can be regarded as a structure in the hardware component as well as a software program for realizing the method.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.