Disclosure of Invention
In view of this, embodiments of the present invention provide a feature selection method and apparatus, which can at least solve the problem that, in the prior art, the redundancy of global features cannot be minimized and features with strong discriminative power cannot be selected.
To achieve the above object, according to an aspect of the embodiments of the present invention, there is provided a feature selection method, including:
determining the features in each sample, and constructing a sample vector based on the values of all the features in each sample, wherein the samples are at least one of image, text and voice data;
processing the sample vector by using a sparse feature selection algorithm with global redundancy minimization to obtain an importance score vector, wherein the importance score vector comprises the importance scores of all features;
and extracting a predetermined number of features with the largest importance scores from the importance scores of the features as the selected target features.
Optionally, the processing the sample vector by using a sparse feature selection algorithm with global redundancy minimization includes:
Determining the category of each sample, inputting the sample vector, the category, a first importance score vector, a second importance score vector and a redundancy matrix into a sparse feature selection algorithm with minimized global redundancy, wherein the first importance score vector does not consider redundancy, the second importance score vector considers redundancy, and
constraining the elements in the second importance score vector, wherein the constraint condition is that the elements are non-zero and sum to 1; and
obtaining the second importance score vector after the minimization and the constraint on the elements.
Optionally, the processing the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector includes:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion, so as to carry out redundancy elimination correction on the different features and obtain a second importance score vector.
Optionally, the method further comprises:
normalizing the values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct a redundancy matrix between every two features.
Optionally, the normalizing the values of a feature in different samples to construct a feature vector of the feature includes:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralizing matrix to obtain a centralized feature vector corresponding to the feature.
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a feature selection apparatus including:
a feature determining module, configured to determine the features in each sample and construct a sample vector based on the values of all the features in each sample, wherein the samples are at least one of image, text and voice data;
the vector processing module is used for processing the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector, wherein the importance score vector comprises importance scores of all features;
And the feature selection module is used for extracting a preset number of features with the maximum importance scores from the importance scores of the features to serve as selected target features.
Optionally, the vector processing module is configured to:
Determining the category of each sample, inputting the sample vector, the category, a first importance score vector, a second importance score vector and a redundancy matrix into a sparse feature selection algorithm with minimized global redundancy, wherein the first importance score vector does not consider redundancy, the second importance score vector considers redundancy, and
constraining the elements in the second importance score vector, wherein the constraint condition is that the elements are non-zero and sum to 1; and
obtaining the second importance score vector after the minimization and the constraint on the elements.
Optionally, the vector processing module is configured to:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion, so as to carry out redundancy elimination correction on the different features and obtain a second importance score vector.
Optionally, the method further comprises a redundancy matrix construction module for:
normalizing the values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct a redundancy matrix between every two features.
Optionally, the redundancy matrix construction module is configured to:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralizing matrix to obtain a centralized feature vector corresponding to the feature.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a feature selection electronic device.
The electronic device comprises one or more processors and a storage device, wherein the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize any of the feature selection methods.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described feature selection methods.
One embodiment of the scheme provided by the invention has the following advantages or beneficial effects: the redundancy between features is characterized by a redundancy matrix, norm regularization is used to select the data features, and feature selection and global redundancy minimization are unified into a single convex objective optimization function, so that feature selection is realized effectively.
Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Feature selection has been widely used in many different fields, such as the selection of causative genes in medical research and image segmentation in computer vision. Using a small number of features with very rich representation capability not only accelerates the model learning process, but also enhances the generalization ability of the model on unknown data.
Existing feature selection methods mainly fall into three categories, and their advantages and disadvantages are analyzed as follows:
1) The filtering feature selection method (shown in fig. 1 (a)) relies on the inherent properties of the data to rank the features, thereby selecting important features according to the rank. Notably, the filtered approach is classifier independent and therefore is computationally inexpensive, but generally performs poorly.
2) The wrapped feature selection method (shown in fig. 1 (b)) is associated with a specific classifier, so that it can generally achieve better performance, but is not suitable for large-scale data due to high computational cost.
3) The learning process of the embedded feature selection method (shown in fig. 1 (c)) is already embedded in the classification model training, and the feature (i.e. the selected feature) to be retained is finally obtained by solving the optimization problem of the classification algorithm.
Thus, the embedded feature selection method can achieve both lower computational cost and satisfactory feature selection performance compared to the filtered feature selection method and the wrapped feature selection method.
Referring to FIG. 2, a schematic diagram of mRMR feature selection is shown. There are four different features on the left, with feature importance scores of 0.95, 0.91, 0.72 and 0.33, respectively. The importance score of feature 1 is the highest, and feature 1 is correlated with features 2 and 3. Feature 2 and feature 3 may each represent the data from a different angle, so feature 2 and feature 3 are uncorrelated with each other. When the mRMR method selects among the above 4 features, since feature 2 and feature 3 are both correlated with feature 1, feature 1 and feature 4 are eventually selected and feature 2 and feature 3 are ignored. In fact, the combination of feature 2 and feature 3 has stronger discriminative power than feature 1 and feature 4. The mRMR method can neither minimize the redundancy of global features nor select the features with strong discriminative power (e.g., feature 2 and feature 3).
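For illustration, the following minimal sketch (in Python, with hypothetical importance scores and pairwise correlations chosen to mirror the situation in FIG. 2) runs a greedy mRMR-style selection and indeed ends up keeping features 1 and 4 while discarding features 2 and 3:

```python
import numpy as np

# Hypothetical relevance scores and pairwise correlations mirroring FIG. 2:
# feature 1 is strongly correlated with features 2 and 3; feature 4 is uncorrelated with all.
relevance = np.array([0.95, 0.91, 0.72, 0.33])
corr = np.array([[1.0, 0.9, 0.9, 0.0],
                 [0.9, 1.0, 0.0, 0.0],
                 [0.9, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0]])

def greedy_mrmr(relevance, corr, k):
    """Greedy max-relevance min-redundancy: at each step pick the feature
    maximizing (relevance - mean correlation with the already-selected features)."""
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(len(relevance)):
            if j in selected:
                continue
            score = relevance[j] - corr[j, selected].mean()
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

print(greedy_mrmr(relevance, corr, k=2))  # [0, 3] -> features 1 and 4 are kept
```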
Referring to fig. 3, a main flowchart of a feature selection method provided by an embodiment of the present invention is shown, including the following steps:
S301, determining characteristics in each sample, and constructing a sample vector based on values of all the characteristics in each sample, wherein the samples are at least one of image, text and voice data;
S302, processing the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector, wherein the importance score vector comprises importance scores of all features;
and S303, extracting a preset number of features with the maximum importance scores from the importance scores of the features to serve as selected target features.
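As a minimal sketch of steps S302 and S303 (the scoring function score_fn is a placeholder standing in for the global-redundancy-minimized scoring detailed below; names and shapes are illustrative assumptions):

```python
import numpy as np

def select_top_features(X, y, k, score_fn):
    """S302/S303: score the d features of the n x d sample matrix X and keep
    the k features with the largest importance scores."""
    s = score_fn(X, y)              # importance score vector of shape (d,)
    return np.argsort(s)[::-1][:k]  # indices of the k highest-scoring features
```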
In the above embodiment, regarding steps S301 to S303: since low-dimensional data contain relatively few features, the present scheme is aimed at feature selection for high-dimensional data (such as voice/image/text), where the number of features is generally much larger than the number of samples.
Redundancy means that two or more features provide duplicate information. Redundancy information minimization can thus be defined: if feature i and feature j are correlated with each other, it is preferable to discard one of them while selecting the other, thereby achieving the goal of minimizing the redundancy information of the finally selected features.
In general, by reducing redundant features and using more informative features, better performance results can be achieved in subsequent tasks (e.g., clustering, classification). For example, in order to distinguish a particular person in a face recognition task, it is desirable to select some non-redundant features that are discriminant and representative, such as eyes, nose, mouth, etc., rather than non-representative redundant features, such as hair, etc. The motivation of the feature selection method with minimized global redundancy is to select representative and non-redundant features in the feature set of the sample.
Given n samples $\{x_1, x_2, \ldots, x_n\}$, a sample matrix $X \in \mathbb{R}^{n \times d}$ is used to represent the sample vectors, where the k-th row represents the k-th sample $x_k \in \mathbb{R}^{d}$ and d is the number of features. Taking a face picture as an example, the features may be wavelet features, histogram features, gray-scale features and the like extracted from the face picture.
The scheme provides a global redundancy minimized sparse feature selection algorithm (GRMS, Global Redundancy Minimized feature Selection).
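Written out explicitly, and assuming a squared regression loss, a sparsity-inducing norm on w (shown here as an $\ell_1$-norm) and the constraint on s described below, with the placement of the regularization parameters inferred from that description, the objective can be sketched as:

$$\min_{w,\,s}\;\; \|X w - y\|_2^2 \;+\; \gamma\,\|w\|_1 \;+\; \beta\, s^{T} A\, s \;-\; \lambda\, w^{T} s \qquad \text{s.t.}\;\; s^{T}\mathbf{1} = 1 .$$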
Both w and s are calculated, and the scheme mainly uses s to select features. Here, the regression coefficient w is the first importance score vector, which does not consider feature redundancy and is also the feature importance weight produced by the sparse feature selection algorithm; $\mathbf{1} \in \mathbb{R}^{d}$ denotes the all-ones vector of size d; and γ, β and λ are regularization parameters used to balance the weights of the terms in the formula. Notably, to avoid the optimization falling into a trivial solution, a constraint on the elements of s is introduced, $s^{T}\mathbf{1} = 1$ (the elements are non-zero and sum to 1).
The first term in GRMS regresses the training samples onto the labels (the vector y of sample categories) through the regression coefficients, measuring the gap between the regressed values and the true values with a squared loss function. The second term is a sparsity-inducing norm of the regression coefficient w, which produces a sparse effect: many entries (rows) of the finally solved w are 0, and the non-zero entries correspond to the retained features, thereby realizing the feature selection function. The last two terms introduce global feature redundancy information into the feature selection and minimize it.
In the above formula, $A \in \mathbb{R}^{d \times d}$ is the redundancy matrix, introduced to describe the redundancy between every pair of features. Given the sample matrix X, the data vectors formed by the values of the i-th feature and the j-th feature across the samples are $X^{(i)}$ and $X^{(j)}$ ($i, j = 1, 2, \ldots, d$), respectively. The corresponding centered features can be obtained as
$$f_i = H X^{(i)}, \qquad f_j = H X^{(j)},$$
where $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{T}$ is the centering matrix, and $f_i$ and $f_j$ denote the centered feature vectors of the i-th feature and the j-th feature. In actual operation, the values of a feature in the different samples can also be directly normalized without using the centering matrix.
The redundancy matrix A may be calculated as follows:
$$A_{ij} = \frac{(f_i^{T} f_j)^2}{\|f_i\|_2^{2}\,\|f_j\|_2^{2}}, \qquad i, j = 1, 2, \ldots, d.$$
It can be seen that the redundancy matrix satisfies $A = B \circ B$, where $B_{ij} = \frac{f_i^{T} f_j}{\|f_i\|_2\,\|f_j\|_2}$ and $\circ$ denotes the Hadamard product. The above formula can be written in matrix form as $B = D F^{T} F D = (F D)^{T} (F D)$, where $F = [f_1, f_2, \ldots, f_d]$ and D is a diagonal matrix whose diagonal elements are $1/\|f_i\|_2$, $i = 1, 2, \ldots, d$. From this factorization it can be determined that B is a positive semi-definite matrix.
When there is a high correlation between the i-th feature and the j-th feature, the absolute value of their correlation coefficient is relatively large, whether the correlation is positive or negative. To penalize high correlation, the squared cosine similarity is therefore used to measure the correlation between two features. Since B is positive semi-definite, the redundancy matrix A is non-negative and, as the Hadamard product of positive semi-definite matrices, also positive semi-definite.
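A minimal sketch of constructing the redundancy matrix from the centered features, assuming the squared-cosine-similarity definition above (function and variable names are illustrative; the centering step may be replaced by simple normalization as noted earlier):

```python
import numpy as np

def redundancy_matrix(X):
    """Build the d x d redundancy matrix A of squared cosine similarities
    between the centered feature columns of the n x d data matrix X."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    F = H @ X                             # centered feature vectors f_1, ..., f_d as columns
    norms = np.linalg.norm(F, axis=0)
    norms[norms == 0] = 1.0               # guard against constant (zero-variance) features
    B = (F / norms).T @ (F / norms)       # B = (F D)^T (F D), pairwise cosine similarities
    return B * B                          # A = B ∘ B, squared cosine similarity (non-negative, PSD)
```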
Through the sparse feature selection algorithm, the weight consistency of the features before and after correction can be ensured while the global feature redundancy is minimized. Specifically, if feature i is correlated with all other features, the redundancy term $s^{T}As$ in the algorithm will cause the corrected weight score $s_i$ of feature i to become smaller, thereby reducing the importance of feature i. On the other hand, if feature i is uncorrelated with all other features, then the corrected weight score $s_i$ of feature i will remain consistent with the initial feature weight score $w_i$, i.e., the weight score of feature i remains essentially unchanged.
Taking a face recognition application as an example, assume that the features extracted from the original face data include gray-scale features, histogram features, Fourier features, gradient features and the like; a matrix X is formed from the features of the different pictures, and y is the class vector corresponding to the different persons. The redundancy matrix A is calculated from the data matrix X, and the matrix X, the class vector y and the redundancy matrix A are input into the GRMS formula to finally obtain the corrected weight score vector s of the different features; the higher a value in s, the higher the weight of the feature corresponding to that element.
Referring to the redundancy matrix shown in fig. 4, a darker color reflects a stronger correlation (i.e., redundancy) between features. Notably, feature selection and feature redundancy minimization are optimized simultaneously.
The scheme can also be considered from another angle. First, on the basis of the sample vector, a first importance score of each feature in the feature set when redundancy is not considered is calculated through an evaluation criterion such as an extremely randomized trees algorithm or a random forest, so as to construct a first importance score vector w. The first importance score vector is then input into a redundancy information minimization criterion to carry out redundancy elimination correction on the different features, obtaining a second importance score vector s, which comprises the second importance score of each feature. The redundancy information minimization criterion is:
$$\min_{s}\; s^{T} A\, s \;-\; \lambda\, w^{T} s \qquad \text{s.t.}\;\; s^{T}\mathbf{1} = 1,$$
where the elements of s are non-zero, λ is a regularization parameter used to balance the weights of the first term and the second term, and the redundancy matrix A characterizes the correlation, i.e., redundancy, between features.
Specifically, the first term $s^{T}As$ characterizes the redundancy of the global features, so the smaller the first term, the better; the second term $w^{T}s$ characterizes the consistency between the original first importance score w of a feature (redundancy not considered) and the corrected second importance score s (redundancy considered), so the larger the second term, the better, since it is desirable that the feature importance after correction remains as consistent as possible with the feature importance before correction.
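A minimal sketch of the redundancy-elimination correction, assuming the criterion above is solved by projected gradient descent onto the simplex (the solver, step size and iteration count are illustrative assumptions, not the optimization procedure prescribed by the scheme):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def correct_scores(w, A, lam=1.0, lr=0.01, iters=500):
    """Minimize s^T A s - lam * w^T s over the simplex: w is the first importance
    score vector (redundancy ignored), A the redundancy matrix; returns the
    corrected second importance score vector s."""
    s = np.full(len(w), 1.0 / len(w))      # start from the uniform score vector
    for _ in range(iters):
        grad = 2.0 * A @ s - lam * w       # gradient of the objective
        s = project_simplex(s - lr * grad)
    return s
```

The predetermined number of features with the largest entries of s are then kept as the selected target features.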
According to the method provided by this embodiment, the proposed global redundancy minimized sparse feature selection algorithm GRMS ranks the features by importance after calculating the second importance score vector, while the redundancy matrix is used to characterize feature redundancy so as to correct the sparsely selected features.
Aiming at the problem that existing methods cannot select a feature subset with globally minimized redundancy information, the method provided by this scheme proposes a sparse feature selection algorithm with global redundancy minimization that performs redundancy correction on the features based on the redundancy matrix, and optimizes and solves feature selection and feature redundancy minimization simultaneously.
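Tying the sketches above together on synthetic data (the data, the simple correlation-based first importance scores and the parameter values are illustrative assumptions only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 5
X = rng.normal(size=(n, d))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # make feature 2 nearly redundant with feature 1
y = (X[:, 0] > 0).astype(float)

# First importance scores (redundancy ignored): absolute correlation with the label.
w = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)]))
A = redundancy_matrix(X)                         # pairwise feature redundancy
s = correct_scores(w, A, lam=1.0)                # corrected scores, global redundancy minimized
print(np.argsort(s)[::-1][:k])                   # indices of the k selected features
```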
Referring to fig. 5, a schematic diagram of main modules of a feature selection apparatus 500 according to an embodiment of the present invention is shown, including:
A feature determining module 501, configured to determine features in each sample, and construct a sample vector based on values of all the features in each sample, where the samples are at least one of image, text, and voice data;
the vector processing module 502 is configured to process the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector, where the importance score vector includes importance scores of features;
A feature selection module 503, configured to extract, from the importance scores of the features, a predetermined number of features with the largest importance scores as the selected target features.
In the embodiment of the present invention, the vector processing module 502 is configured to:
Determining the category of each sample, inputting the sample vector, the category, a first importance score vector, a second importance score vector and a redundancy matrix into a sparse feature selection algorithm with minimized global redundancy, wherein the first importance score vector does not consider redundancy, the second importance score vector considers redundancy, and
constraining the elements in the second importance score vector, wherein the constraint condition is that the elements are non-zero and sum to 1; and
obtaining the second importance score vector after the minimization and the constraint on the elements.
In the embodiment of the present invention, the vector processing module 502 is configured to:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion, so as to carry out redundancy elimination correction on the different features and obtain a second importance score vector.
The apparatus in the embodiment of the present invention further comprises a redundancy matrix construction module configured to:
normalizing the values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct a redundancy matrix between every two features.
In the embodiment of the present invention, the redundancy matrix construction module is configured to:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralizing matrix to obtain a centralized feature vector corresponding to the feature.
In addition, the implementation of the apparatus in the embodiments of the present invention has been described in detail in the above method, so that the description is not repeated here.
Fig. 6 illustrates an exemplary system architecture 600 in which embodiments of the present invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605 (by way of example only). The network 604 is used as a medium to provide communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 605 via the network 604 using the terminal devices 601, 602, 603 to receive or send messages, etc. Various communication client applications can be installed on the terminal devices 601, 602, 603.
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, and the server 605 may be a server providing various services.
It should be noted that, the method provided by the embodiment of the present invention is generally performed by the server 605, and accordingly, the apparatus is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, a schematic diagram of a computer system 700 suitable for implementing an embodiment of the present invention is illustrated. The computer system shown in fig. 7 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the system 700 are also stored. The CPU 701, ROM 702, and RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, and the like. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 710 as necessary, so that a computer program read therefrom is installed into the storage section 708 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 709, and/or installed from the removable medium 711. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 701.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, a processor may be described as including a feature determination module, a vector processing module, and a feature selection module. The names of these modules do not in any way constitute a limitation on the modules themselves; for example, the vector processing module may also be described as "a module that processes a sample vector to obtain an importance score vector".
As a further aspect, the invention also provides a computer readable medium, which may be included in the device described in the above embodiments or may exist separately without being assembled into the device. The computer readable medium carries one or more programs which, when executed by the device, cause the device to:
determine the features in each sample, and construct a sample vector based on the values of all the features in each sample, wherein the samples are at least one of image, text and voice data;
process the sample vector by using a sparse feature selection algorithm with global redundancy minimization to obtain an importance score vector, wherein the importance score vector comprises the importance scores of all features;
and extract a predetermined number of features with the largest importance scores from the importance scores of the features as the selected target features.
According to the technical scheme of the embodiment of the present invention, a sparse feature selection algorithm with global redundancy minimization is provided to carry out redundancy correction on the features based on the redundancy matrix, and feature selection and feature redundancy minimization are optimized and solved simultaneously.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.