
CN113761307B - A feature selection method and device

Info

Publication number
CN113761307B
Authority
CN
China
Prior art keywords
vector
feature
features
redundancy
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011351263.9A
Other languages
Chinese (zh)
Other versions
CN113761307A (en)
Inventor
祖辰
杨立军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202011351263.9A
Publication of CN113761307A
Application granted
Publication of CN113761307B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention discloses a feature selection method and device, and relates to the field of computer technology. A specific implementation of the method includes: determining the features in each sample and constructing a sample vector based on the values of all features in each sample; processing the sample vector with a sparse feature selection algorithm with global redundancy minimization to obtain an importance score vector; and extracting, from the importance scores of the features, a predetermined number of features with the largest importance scores as the selected target features. The sparse feature selection algorithm with global redundancy minimization, GRMS, proposed in this implementation minimizes the redundancy of the global features and selects strongly discriminative features, thereby correcting the sparsely selected features.

Description

Feature selection method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a feature selection method and apparatus.
Background
With the development of information technology, the volume of global data has exploded, and more and more data needs to be stored and transmitted. To process massive amounts of high-dimensional data, it is necessary to extract the most powerful and valuable information from it. Because high-dimensional data contains a large number of features, the features inevitably contain noise. In this case, feature selection becomes an indispensable data mining technology, and the performance of subsequent classification, clustering and similar tasks can be improved through feature dimensionality reduction.
At present, feature selection methods fall into three types: filter methods, wrapper methods and embedded methods. In implementing the present invention, the inventors found that these methods generally do not take redundancy between the selected features into account, so the selected features tend to be highly correlated, which is detrimental to subsequent clustering or classification tasks. Although the mutual-information-based mRMR (Max-Relevance and Min-Redundancy) feature selection method has been proposed to minimize redundancy between features, the mRMR method adopts a greedy strategy to find features with minimum redundancy, so the selected features do not achieve global minimization of redundancy information.
Disclosure of Invention
In view of this, embodiments of the present invention provide a feature selection method and apparatus, which can at least solve the problem that, in the prior art, the redundancy of global features cannot be minimized and features with strong discriminability cannot be selected.
To achieve the above object, according to an aspect of the embodiments of the present invention, there is provided a feature selection method, including:
Determining the characteristics in each sample, and constructing a sample vector based on the values of all the characteristics in each sample, wherein the samples are at least one of image, text and voice data;
Processing the sample vector by using a sparse feature selection algorithm with the minimum global redundancy to obtain an importance score vector, wherein the importance score vector comprises importance scores of all features;
and extracting a preset number of features with the maximum importance scores from the importance scores of the features as selected target features.
Optionally, the processing the sample vector by using a sparse feature selection algorithm with global redundancy minimization includes:
Determining the category to which each sample belongs, and inputting the sample vector, the category, a first importance score vector, a second importance score vector and a redundancy matrix together into the sparse feature selection algorithm with global redundancy minimization for minimization, wherein the first importance score vector does not consider redundancy and the second importance score vector considers redundancy; and
constraining the elements in the second importance score vector, wherein the constraint condition is that the elements are non-zero and their sum is 1;
obtaining the second importance score vector after minimization and element constraint.
Optionally, the processing the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector includes:
processing the sample vector through an evaluation criterion to calculate a first importance score of each feature when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform de-redundancy correction on different features and obtain a second importance vector.
Optionally, the method further comprises:
normalizing the values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct a redundancy matrix between every two features.
Optionally, the normalizing the values of a feature in different samples to construct a feature vector of the feature includes:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix with a centering matrix to obtain a centered feature vector corresponding to the feature.
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a feature selection apparatus including:
a feature determination module, configured to determine the features in each sample and construct a sample vector based on the values of all the features in each sample, wherein the samples are at least one of image, text and voice data;
the vector processing module is used for processing the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector, wherein the importance score vector comprises importance scores of all features;
And the feature selection module is used for extracting a preset number of features with the maximum importance scores from the importance scores of the features to serve as selected target features.
Optionally, the vector processing module is configured to:
determining the category to which each sample belongs, and inputting the sample vector, the category, a first importance score vector, a second importance score vector and a redundancy matrix together into the sparse feature selection algorithm with global redundancy minimization for minimization, wherein the first importance score vector does not consider redundancy and the second importance score vector considers redundancy; and
constraining the elements in the second importance score vector, wherein the constraint condition is that the elements are non-zero and their sum is 1;
obtaining the second importance score vector after minimization and element constraint.
Optionally, the vector processing module is configured to:
processing the sample vector through an evaluation criterion to calculate a first importance score of each feature when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform de-redundancy correction on different features and obtain a second importance vector.
Optionally, the apparatus further comprises a redundancy matrix construction module configured to:
normalizing the values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct a redundancy matrix between every two features.
Optionally, the redundancy matrix construction module is configured to:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix with a centering matrix to obtain a centered feature vector corresponding to the feature.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a feature selection electronic device.
The electronic device comprises one or more processors and a storage device, wherein the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize any of the feature selection methods.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described feature selection methods.
According to the scheme provided by the present invention, one embodiment has the advantage or beneficial effect that the redundancy matrix characterizes which features are non-redundant, norm regularization is used to select data features, and feature selection and global redundancy information minimization are unified into a single convex objective optimization function, so that feature selection is carried out effectively.
Further effects of the above optional implementations are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIGS. 1(a)-(c) are schematic diagrams of the three types of feature selection methods, namely filter, wrapper and embedded methods, respectively;
FIG. 2 is a schematic diagram of mRMR feature selection;
FIG. 3 is a flow chart of a feature selection method according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a sparse-based global redundancy information minimization method;
FIG. 5 is a schematic diagram of the main modules of a feature selection device according to an embodiment of the invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
fig. 7 is a schematic diagram of a computer system suitable for use in implementing a mobile device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Feature selection has been widely used in many different fields, such as the selection of disease-causing genes in medical research and image segmentation in computer vision. Using a small number of features with very rich representation capability not only accelerates the model learning process but also enhances the generalization ability of the model on unknown data.
Existing feature selection methods mainly take three forms, whose advantages and disadvantages are analyzed as follows:
1) The filtering feature selection method (shown in fig. 1 (a)) relies on the inherent properties of the data to rank the features, thereby selecting important features according to the rank. Notably, the filtered approach is classifier independent and therefore is computationally inexpensive, but generally performs poorly.
2) The wrapped feature selection method (shown in fig. 1 (b)) is associated with a specific classifier, so that it can generally achieve better performance, but is not suitable for large-scale data due to high computational cost.
3) The learning process of the embedded feature selection method (shown in fig. 1 (c)) is already embedded in the classification model training, and the feature (i.e. the selected feature) to be retained is finally obtained by solving the optimization problem of the classification algorithm.
Thus, the embedded feature selection method can achieve both lower computational cost and satisfactory feature selection performance compared to the filtered feature selection method and the wrapped feature selection method.
Referring to FIG. 2, a schematic diagram of mRMR feature selection is shown. There are four different features on the left, with feature importance scores of 0.95, 0.91, 0.72 and 0.33, respectively. Feature 1 has the highest importance score and is correlated with features 2 and 3. Feature 2 and feature 3 each represent the data from a different angle, so feature 2 and feature 3 are uncorrelated with each other. When the mRMR method selects among the above 4 features, since feature 2 and feature 3 are both correlated with feature 1, feature 1 and feature 4 are eventually selected and feature 2 and feature 3 are discarded. In fact, the combination of feature 2 and feature 3 has a stronger discrimination capability than feature 1 and feature 4. The mRMR method can neither minimize the redundancy of the global features nor select the strongly discriminative features (e.g., feature 2 and feature 3).
Referring to fig. 3, a main flowchart of a feature selection method provided by an embodiment of the present invention is shown, including the following steps:
S301, determining characteristics in each sample, and constructing a sample vector based on values of all the characteristics in each sample, wherein the samples are at least one of image, text and voice data;
S302, processing the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector, wherein the importance score vector comprises importance scores of all features;
and S303, extracting a preset number of features with the maximum importance scores from the importance scores of the features to serve as selected target features.
In the above embodiment, for steps S301 to S303: since low-dimensional data contains few features, the present scheme targets feature selection for high-dimensional data (such as voice/image/text), in which the number of features is generally much larger than the number of samples.
Redundancy means that two or more features provide duplicate information. Redundancy information minimization can therefore be defined as follows: if feature i and feature j are correlated with each other, it is preferable to select one of them and leave the other out, thereby minimizing the redundancy information of the finally selected features.
In general, by reducing redundant features and using more informative features, better performance results can be achieved in subsequent tasks (e.g., clustering, classification). For example, in order to distinguish a particular person in a face recognition task, it is desirable to select some non-redundant features that are discriminant and representative, such as eyes, nose, mouth, etc., rather than non-representative redundant features, such as hair, etc. The motivation of the feature selection method with minimized global redundancy is to select representative and non-redundant features in the feature set of the sample.
Given n samples $\{x_1, x_2, \ldots, x_n\}$, a sample matrix $X \in \mathbb{R}^{n \times d}$ represents the sample vectors, where the $k$-th row is the $k$-th sample $x_k \in \mathbb{R}^{d}$ and $d$ is the number of features. Taking a face picture as an example, the features may be wavelet features, histogram features, gray-scale features and the like extracted from the face picture.
The sparse feature selection algorithm with global redundancy minimization (GRMS, Global Redundancy Minimized Sparse feature selection) provided by this scheme is:

$$\min_{\mathbf{w},\,\mathbf{s}} \; \|X\mathbf{w}-\mathbf{y}\|_2^2 + \gamma\|\mathbf{w}\|_1 + \beta\,\mathbf{s}^{\top}A\,\mathbf{s} - \lambda\,\mathbf{w}^{\top}\mathbf{s} \qquad \text{s.t.} \;\; \mathbf{s}^{\top}\mathbf{1}=1,\ \mathbf{s}\ge 0$$

Both $\mathbf{w}$ and $\mathbf{s}$ are computed, and the scheme mainly uses $\mathbf{s}$ to select features. Here $\mathbf{w}\in\mathbb{R}^{d}$ is the regression coefficient vector, i.e. the first importance score vector that does not consider feature redundancy, and is also the feature importance weight produced by the sparse feature selection part; $\mathbf{s}\in\mathbb{R}^{d}$ is the second importance score vector after redundancy correction; $\mathbf{1}$ denotes the all-ones vector of size $d$; and $\gamma$, $\beta$ and $\lambda$ are regularization parameters used to balance the weights of the terms in the formula. Notably, to prevent the optimization from falling into a trivial solution, a sum-to-one constraint on the elements of $\mathbf{s}$ is introduced, $\mathbf{s}^{\top}\mathbf{1}=1$ (the elements are non-zero and sum to 1).
The first term in GRMS regresses the training samples onto the labels (corresponding to $\mathbf{y}$, the category of each sample) through the regression coefficients, measuring the gap between the regression value and the true value with a squared loss function. The second term is the sparsity-inducing norm of the regression coefficient $\mathbf{w}$: many entries of the finally solved $\mathbf{w}$ are exactly 0, and the non-zero entries correspond to the retained features, which realizes the feature selection function. The last two terms incorporate global feature redundancy information into the feature selection and minimize it.
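To make the optimization concrete, the following is a minimal NumPy sketch that alternates a proximal-gradient (soft-thresholding) update of w with a projected-gradient update of s onto the simplex. The solver, the step-size choices and the function names are illustrative assumptions, not an optimizer prescribed by this scheme, and the objective it minimizes is the reconstruction given above.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def project_simplex(v):
    """Euclidean projection of v onto the simplex {s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def grms(X, y, A, gamma=0.1, beta=1.0, lam=0.1, n_iter=200):
    """Alternating minimization of the reconstructed GRMS objective:
    min_{w,s} ||Xw - y||^2 + gamma*||w||_1 + beta*s'As - lam*w's,
    s.t. s >= 0, s'1 = 1."""
    n, d = X.shape
    w = np.zeros(d)
    s = np.full(d, 1.0 / d)
    lr_w = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 + 1e-12)   # 1/Lipschitz of the w-gradient
    lr_s = 1.0 / (2.0 * beta * np.linalg.norm(A, 2) + 1e-12)  # 1/Lipschitz of the s-gradient
    for _ in range(n_iter):
        # w-step: gradient on the smooth part, then soft-threshold for the l1 term
        grad_w = 2.0 * X.T @ (X @ w - y) - lam * s
        w = soft_threshold(w - lr_w * grad_w, lr_w * gamma)
        # s-step: projected gradient onto the probability simplex
        grad_s = 2.0 * beta * (A @ s) - lam * w
        s = project_simplex(s - lr_s * grad_s)
    return w, s
```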
In the above formula, $A \in \mathbb{R}^{d \times d}$ is a redundancy matrix, introduced to describe the redundancy between every pair of features. Given the sample matrix $X \in \mathbb{R}^{n \times d}$, the data vectors formed by the $i$-th feature and the $j$-th feature are $X^{(i)}, X^{(j)} \in \mathbb{R}^{n}$ $(i, j = 1, 2, \ldots, d)$. The corresponding centered features are obtained as:

$$\mathbf{f}_i = H X^{(i)}, \qquad \mathbf{f}_j = H X^{(j)}$$

where $H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix, and $\mathbf{f}_i$ and $\mathbf{f}_j$ are the centered feature vectors of the $i$-th and $j$-th features. In actual operation, the values of a feature across different samples can also be directly normalized without using the centering matrix.
The redundancy matrix $A$ may be calculated as follows:

$$A_{ij} = \frac{\big(\mathbf{f}_i^{\top}\mathbf{f}_j\big)^2}{\|\mathbf{f}_i\|_2^2\,\|\mathbf{f}_j\|_2^2}, \qquad i, j = 1, 2, \ldots, d$$

It can be seen that the redundancy matrix is $A = B \odot B$, where $\odot$ denotes the Hadamard product and $B_{ij} = \mathbf{f}_i^{\top}\mathbf{f}_j / (\|\mathbf{f}_i\|_2\|\mathbf{f}_j\|_2)$ is the cosine similarity between features $i$ and $j$. In matrix form, $B = D F^{\top} F D = (FD)^{\top}(FD)$, where $F = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_d]$ and $D$ is a diagonal matrix whose diagonal elements are $1/\|\mathbf{f}_i\|_2$, $i = 1, 2, \ldots, d$. Since $B = (FD)^{\top}(FD)$, $B$ is a positive semi-definite matrix.
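A short NumPy sketch of this construction follows; the helper name redundancy_matrix and the small epsilon guard against constant (zero-variance) features are illustrative additions.

```python
import numpy as np

def redundancy_matrix(X):
    """Pairwise feature redundancy A = B ⊙ B, where B holds the cosine
    similarities of the centered feature vectors (squared cosine similarity)."""
    F = X - X.mean(axis=0, keepdims=True)       # center each column: f_i = H X^(i)
    norms = np.linalg.norm(F, axis=0) + 1e-12   # ||f_i||, guarded against zero norms
    FD = F / norms                              # columns scaled by D = diag(1/||f_i||)
    B = FD.T @ FD                               # B = (FD)^T (FD), positive semi-definite
    return B * B                                # Hadamard product: A_ij = cos^2(f_i, f_j)
```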
If the $i$-th feature and the $j$-th feature are highly correlated, the absolute value of their correlation coefficient is relatively large, whether the correlation is positive or negative. To penalize high correlation, the squared cosine similarity is used to measure the correlation between two features. Since $B$ is positive semi-definite, the redundancy matrix $A = B \odot B$ is non-negative and, by the Schur product theorem, also positive semi-definite.
Through the sparse feature selection algorithm, the consistency of the feature weights before and after correction can be ensured while the global feature redundancy is minimized. In particular, if feature $i$ is correlated with all other features, the redundancy term of the algorithm will cause the corrected weight score $s_i$ of feature $i$ to become smaller, thereby reducing the importance of feature $i$. On the other hand, if feature $i$ is uncorrelated with all other features, the corrected weight score $s_i$ of feature $i$ will remain consistent with the initial feature weight score $w_i$, i.e. the weight score of feature $i$ remains essentially unchanged.
Taking a face recognition application as an example, assume that the features extracted from the original face data include gray-scale features, histogram features, Fourier features, gradient features and the like; the features of different pictures form a matrix $X$, and $\mathbf{y}$ is the class vector corresponding to the different persons. The redundancy matrix $A$ is calculated from the data matrix $X$, and the matrix $X$, the class vector $\mathbf{y}$ and the redundancy matrix $A$ are input into the GRMS formula to finally obtain the corrected weight score vector $\mathbf{s}$ of the different features; the larger a value in $\mathbf{s}$, the higher the weight of the corresponding feature.
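For concreteness, a hypothetical end-to-end use of the two sketches above on such face-feature data might look as follows; the array shapes, parameter values and the synthetic stand-in data are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic stand-in for a face-feature matrix: 200 pictures, 500 extracted features
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))
y = rng.standard_normal(200)                 # encoded class vector for the different persons

A = redundancy_matrix(X)                     # pairwise squared-cosine redundancy
w, s = grms(X, y, A, gamma=0.1, beta=1.0, lam=0.1)

top_k = np.argsort(s)[::-1][:50]             # indices of the 50 features with the largest scores
print("selected feature indices:", top_k[:10])
```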
Referring to the redundancy matrix shown in FIG. 4, the darker the color, the stronger the correlation (i.e., redundancy) between features. Notably, feature selection and feature redundancy minimization are optimized simultaneously.
The scheme can also be considered from another angle. First, on the basis of the sample vectors, a first importance score of each feature in the feature set when redundancy is not considered is calculated through an evaluation criterion such as an extremely randomized trees algorithm or a random forest, yielding a first importance score vector $\mathbf{w}$. The first importance score vector is then input into a redundancy information minimization criterion to perform de-redundancy correction on the different features, obtaining a second importance vector $\mathbf{s}$ that contains the second importance score of each feature. The redundancy information minimization criterion is:

$$\min_{\mathbf{s}} \; \mathbf{s}^{\top}A\,\mathbf{s} - \lambda\,\mathbf{w}^{\top}\mathbf{s} \qquad \text{s.t.} \;\; \mathbf{s}^{\top}\mathbf{1}=1,\ \mathbf{s}\ge 0$$

where $\lambda$ is a regularization parameter used to balance the weights of the first term and the second term, and the redundancy matrix $A$ characterizes the correlation between features, i.e. the redundancy.
Specifically, the first term $\mathbf{s}^{\top}A\,\mathbf{s}$ characterizes the redundancy of the global features, so the smaller it is, the better. The second term $\mathbf{w}^{\top}\mathbf{s}$ characterizes the consistency between the original first importance scores $\mathbf{w}$ (redundancy not considered) and the corrected second importance scores $\mathbf{s}$ (redundancy considered), so the larger it is, the better, since the corrected feature importance should stay as consistent as possible with the importance before correction.
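Viewed this way, the correction step can be run on its own once a first importance score vector is available. Below is a sketch under the same assumptions as before, reusing project_simplex, X, y and A from the earlier snippets and using scikit-learn's RandomForestRegressor as one possible stand-in for the "evaluation criterion" mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def grm_correct(w, A, lam=0.5, n_iter=300):
    """Projected-gradient solver for the reconstructed redundancy-minimization
    criterion: min_s s^T A s - lam * w^T s, s.t. s >= 0, s^T 1 = 1."""
    d = len(w)
    s = np.full(d, 1.0 / d)
    lr = 1.0 / (2.0 * np.linalg.norm(A, 2) + 1e-12)
    for _ in range(n_iter):
        s = project_simplex(s - lr * (2.0 * A @ s - lam * w))
    return s

# First importance scores from an off-the-shelf criterion (random forest),
# then the de-redundancy correction:
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
w0 = rf.feature_importances_
s_corrected = grm_correct(w0, A, lam=0.5)
```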
According to the method provided by this embodiment, the proposed sparse feature selection algorithm with global redundancy minimization, GRMS, ranks the features by importance after calculating the second importance score vector, while the redundancy matrix is used to characterize non-redundant features so as to correct the sparsely selected features.
Aiming at the problem that existing methods cannot select a feature subset with globally minimized redundancy information, the scheme provides a sparse feature selection algorithm with global redundancy minimization that performs redundancy correction on features based on the redundancy matrix, and feature selection and feature redundancy minimization are optimized and solved simultaneously.
Referring to fig. 5, a schematic diagram of main modules of a feature selection apparatus 500 according to an embodiment of the present invention is shown, including:
A feature determining module 501, configured to determine features in each sample, and construct a sample vector based on values of all the features in each sample, where the samples are at least one of image, text, and voice data;
the vector processing module 502 is configured to process the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector, where the importance score vector includes importance scores of features;
A feature selection module 503, configured to extract, from the importance scores of the features, a predetermined number of features with the largest importance scores as the selected target features.
In the embodiment of the present invention, the vector processing module 502 is configured to:
determining the category to which each sample belongs, and inputting the sample vector, the category, a first importance score vector, a second importance score vector and a redundancy matrix together into the sparse feature selection algorithm with global redundancy minimization for minimization, wherein the first importance score vector does not consider redundancy and the second importance score vector considers redundancy; and
constraining the elements in the second importance score vector, wherein the constraint condition is that the elements are non-zero and their sum is 1;
obtaining the second importance score vector after minimization and element constraint.
In the embodiment of the present invention, the vector processing module 502 is configured to:
processing the sample vector through an evaluation criterion to calculate a first importance score of each feature when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform de-redundancy correction on different features and obtain a second importance vector.
The implementation device of the invention also comprises a redundancy matrix construction module which is used for:
normalizing the values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct a redundancy matrix between every two features.
In the implementation device of the present invention, the redundancy matrix construction module is configured to:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix with a centering matrix to obtain a centered feature vector corresponding to the feature.
In addition, the implementation of the apparatus in the embodiments of the present invention has been described in detail in the above method, so that the description is not repeated here.
Fig. 6 illustrates an exemplary system architecture 600 in which embodiments of the present invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605 (by way of example only). The network 604 is used as a medium to provide communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 605 via the network 604 using the terminal devices 601, 602, 603 to receive or send messages, etc. Various communication client applications can be installed on the terminal devices 601, 602, 603.
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, and the server 605 may be a server providing various services.
It should be noted that, the method provided by the embodiment of the present invention is generally performed by the server 605, and accordingly, the apparatus is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, there is illustrated a schematic diagram of a computer system 700 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the system 700 are also stored. The CPU 701, ROM 702, and RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Connected to the I/O interface 705 are an input section 706 including a keyboard, a mouse, and the like, an output section 707 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like, a storage section 708 including a hard disk, and the like, and a communication section 709 including a network interface card such as a LAN card, a modem, and the like. The communication section 709 performs communication processing via a network such as the internet. The drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read therefrom is mounted into the storage section 708 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 709, and/or installed from the removable medium 711. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 701.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, a processor may be described as including a feature determination module, a vector processing module, and a feature selection module. The names of these modules do not, in some cases, constitute a limitation of the modules themselves; for example, the vector processing module may also be described as "a module that processes a sample vector to obtain an importance score vector".
As a further aspect, the invention also provides a computer readable medium which may be comprised in the device described in the above embodiments or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to include:
Determining the characteristics in each sample, and constructing a sample vector based on the values of all the characteristics in each sample, wherein the samples are at least one of image, text and voice data;
Processing the sample vector by using a sparse feature selection algorithm with the minimum global redundancy to obtain an importance score vector, wherein the importance score vector comprises importance scores of all features;
and extracting a preset number of features with the maximum importance scores from the importance scores of the features as selected target features.
According to the technical scheme of the embodiments of the present invention, a sparse feature selection algorithm with global redundancy minimization is provided to perform redundancy correction on features based on the characterization provided by the redundancy matrix, and feature selection and feature redundancy minimization are optimized and solved simultaneously.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (8)

1. A feature selection method, comprising:
determining the features in each sample, and constructing a sample vector based on the values of all the features in each sample, wherein the sample is at least one of image, text and voice data;
processing the sample vector using a sparse feature selection algorithm with global redundancy minimization to obtain an importance score vector, wherein the importance score vector comprises an importance score of each feature; wherein processing the sample vector using the sparse feature selection algorithm with global redundancy minimization comprises: determining the category to which each sample belongs, and inputting the sample vector, the category, a first importance score vector, a second importance score vector and a redundancy matrix together into the sparse feature selection algorithm with global redundancy minimization for minimization, the first importance score vector not considering redundancy and the second importance score vector considering redundancy; constraining the elements in the second importance score vector, wherein the constraint condition is that the elements are non-zero and their sum is 1; and obtaining the second importance score vector after minimization and element constraint; and
extracting, from the importance scores of the features, a predetermined number of features with the largest importance scores as the selected target features.

2. The method according to claim 1, wherein processing the sample vector using the sparse feature selection algorithm with global redundancy minimization to obtain the importance score vector comprises:
processing the sample vector through an evaluation criterion to calculate a first importance score of each feature when redundancy is not considered, and generating the first importance score vector; and
introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform de-redundancy correction on different features and obtain a second importance vector.

3. The method according to claim 1 or 2, further comprising:
normalizing the values of a feature in different samples to construct a feature vector of the feature; and
calculating the inner product between the feature vectors of every two features to construct a redundancy matrix between the two features.

4. The method according to claim 3, wherein normalizing the values of a feature in different samples to construct the feature vector of the feature comprises:
constructing a data matrix according to the values of the feature in different samples; and
processing the data matrix with a centering matrix to obtain a centered feature vector corresponding to the feature.

5. A feature selection apparatus, comprising:
a feature determination module, configured to determine the features in each sample and construct a sample vector based on the values of all the features in each sample, wherein the sample is at least one of image, text and voice data;
a vector processing module, configured to process the sample vector using a sparse feature selection algorithm with global redundancy minimization to obtain an importance score vector, wherein the importance score vector comprises an importance score of each feature; wherein processing the sample vector using the sparse feature selection algorithm with global redundancy minimization comprises: determining the category to which each sample belongs, and inputting the sample vector, the category, a first importance score vector, a second importance score vector and a redundancy matrix together into the sparse feature selection algorithm with global redundancy minimization for minimization, the first importance score vector not considering redundancy and the second importance score vector considering redundancy; constraining the elements in the second importance score vector, wherein the constraint condition is that the elements are non-zero and their sum is 1; and obtaining the second importance score vector after minimization and element constraint; and
a feature selection module, configured to extract, from the importance scores of the features, a predetermined number of features with the largest importance scores as the selected target features.

6. The apparatus according to claim 5, further comprising a redundancy matrix construction module configured to:
normalize the values of a feature in different samples to construct a feature vector of the feature; and
calculate the inner product between the feature vectors of every two features to construct a redundancy matrix between the two features.

7. An electronic device, comprising:
one or more processors; and
a storage device configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-4.

8. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-4.
CN202011351263.9A 2020-11-25 2020-11-25 A feature selection method and device Active CN113761307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011351263.9A CN113761307B (en) 2020-11-25 2020-11-25 A feature selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351263.9A CN113761307B (en) 2020-11-25 2020-11-25 A feature selection method and device

Publications (2)

Publication Number Publication Date
CN113761307A CN113761307A (en) 2021-12-07
CN113761307B true CN113761307B (en) 2025-02-21

Family

ID=78786083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351263.9A Active CN113761307B (en) 2020-11-25 2020-11-25 A feature selection method and device

Country Status (1)

Country Link
CN (1) CN113761307B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444720A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Data processing method, data processing device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5592943A (en) * 1993-04-07 1997-01-14 Osteo Sciences Corporation Apparatus and method for acoustic analysis of bone using optimized functions of spectral and temporal signal components
US9471881B2 (en) * 2013-01-21 2016-10-18 International Business Machines Corporation Transductive feature selection with maximum-relevancy and minimum-redundancy criteria
US20140207799A1 (en) * 2013-01-21 2014-07-24 International Business Machines Corporation Hill-climbing feature selection with max-relevancy and minimum redundancy criteria
CN104933733A (en) * 2015-06-12 2015-09-23 西北工业大学 Target tracking method based on sparse feature selection
CN111338950A (en) * 2020-02-25 2020-06-26 北京高质系统科技有限公司 Software defect feature selection method based on spectral clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A General Framework for Auto-Weighted Feature Selection via Global Redundancy Minimization; Feiping Nie et al.; IEEE Transactions on Image Processing; 31 May 2019; vol. 28, no. 5; p. 2429 left column, p. 2429 right column to p. 2430 left column *

Also Published As

Publication number Publication date
CN113761307A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN116580257B (en) Feature fusion model training and sample retrieval method, device and computer equipment
US11487995B2 (en) Method and apparatus for determining image quality
CN114596497B (en) Training method of target detection model, target detection method, device and equipment
US10719693B2 (en) Method and apparatus for outputting information of object relationship
CN108197532A (en) The method, apparatus and computer installation of recognition of face
CN111582348A (en) Method, device, equipment and storage medium for training condition generating type countermeasure network
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
CN114399808B (en) A method, system, electronic device and storage medium for estimating face age
CN108197652A (en) For generating the method and apparatus of information
CN111542841A (en) System and method for content identification
WO2020024484A1 (en) Method and device for outputting data
US20200082213A1 (en) Sample processing method and device
CN115861462B (en) Training method, device, electronic equipment and storage medium for image generation model
CN112801219A (en) Multi-mode emotion classification method, device and equipment
WO2021135449A1 (en) Deep reinforcement learning-based data classification method, apparatus, device, and medium
CN114817612A (en) Method and related device for calculating multi-modal data matching degree and training calculation model
CN113704528B (en) Cluster center determining method, device and equipment and computer storage medium
CN112668482A (en) Face recognition training method and device, computer equipment and storage medium
CN114494784A (en) Training methods, image processing methods and object recognition methods of deep learning models
CN113722438A (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN113688955B (en) Text recognition method, device, equipment and medium
WO2021012691A1 (en) Method and device for image retrieval
CN113821687A (en) A content retrieval method, apparatus, and computer-readable storage medium
CN112131199B (en) A log processing method, device, equipment and medium
CN117421641A (en) Text classification method, device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant