
CN113033641B - Semi-supervised classification method for high-dimensional data - Google Patents

Semi-supervised classification method for high-dimensional data

Info

Publication number
CN113033641B
CN113033641B (granted publication of application CN202110285595.XA)
Authority
CN
China
Prior art keywords
matrix
subspace
sample
learning
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110285595.XA
Other languages
Chinese (zh)
Other versions
CN113033641A (en)
Inventor
叶枫旭 (Ye Fengxu)
余志文 (Yu Zhiwen)
陈俊龙 (Chen Junlong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110285595.XA priority Critical patent/CN113033641B/en
Publication of CN113033641A publication Critical patent/CN113033641A/en
Application granted granted Critical
Publication of CN113033641B publication Critical patent/CN113033641B/en

Classifications

    • G06F 18/2155 — Generating training patterns; bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/2132 — Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods, based on discrimination criteria, e.g. discriminant analysis
    • G06F 18/21322 — Rendering the within-class scatter matrix non-singular
    • G06F 18/21328 — Rendering the within-class scatter matrix non-singular involving subspace restrictions, e.g. nullspace techniques
    • G06F 18/24143 — Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G06F 18/29 — Graphical models, e.g. Bayesian networks
    (all under G — Physics; G06 — Computing or calculating; counting; G06F — Electric digital data processing; G06F 18/00 — Pattern recognition)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semi-supervised classification method for high-dimensional data, relating to the field of semi-supervised learning in artificial intelligence. The method mainly overcomes the influence that data noise and redundant features in high-dimensional manufacturing data exert on a model, and integrates subspace learning, graph construction and classifier training into a unified framework, achieving a better classification effect. The method comprises the following steps: 1) input a training data set; 2) normalize the data; 3) initialize parameters and variables; 4) subspace learning; 5) graph construction; 6) classifier training; 7) repeat steps 4)-6) until the algorithm converges; 8) classify the test samples; 9) obtain the classification accuracy. The invention constructs the graph from two low-dimensional spaces, the label space and the subspace, effectively relieving the interference of noisy data and redundant features with the algorithm model, ensuring the quality of the graph and improving the classification effect.

Description

A Semi-Supervised Classification Method for High-Dimensional Data

Technical Field

The invention relates to the technical field of semi-supervised learning in artificial intelligence, and in particular to a semi-supervised classification method for high-dimensional data.

Background

With the advent of the intelligent era, parts of traditional manufacturing are gradually moving toward intelligent manufacturing. Applying intelligent decision-making methods to the large volumes of data generated by manufacturing, in order to optimize production, sales, service and other processes, is one of the main problems intelligent manufacturing must face. Manufacturing accumulates large amounts of data in the course of its development; in general, however, this data is not fully labeled. Faced with abundant data but few labels, a fully supervised classification algorithm that models and analyzes the data to learn its patterns usually cannot achieve satisfactory results. How, then, can the inherent patterns of the data be learned from a large amount of data and a small number of labels? One solution is to label the massive training data, but this is expensive and consumes considerable manpower and material resources. A better solution is to start from the algorithm and the model: design an algorithm that can learn a classifier with good performance and strong generalization ability from data carrying only a few labels. Semi-supervised classification algorithms are exactly such models. They use a small number of labeled samples together with a large number of unlabeled samples to learn to classify the data, saving the cost of manually labeling training samples. Semi-supervised classification therefore has substantial research significance, has attracted broad research attention in recent years, and has good application prospects in industry.

Graph-based semi-supervised classification is one of the more active research directions in the semi-supervised field in recent years, because it often performs better. Such algorithms rest on the assumption that the data lie on a manifold and that the distribution of samples is sufficiently smooth. Smoothness here means that the closer two samples are, i.e., the higher their similarity, the more their labels should agree. These algorithms typically construct a graph to represent the similarity between samples, derive a smoothness term from it, and combine the loss function, the regularization term and the smoothness term into the model's overall objective function. Optimizing this objective yields the classifier parameters, so that the trained classifier not only has a small classification loss on the labeled samples but also produces sufficiently smooth predictions on all samples, labeled and unlabeled alike.

However, some current graph-based semi-supervised classification algorithms are still poorly suited to high-dimensional data scenarios in manufacturing. For example, manufacturing data often contain missing values and noise, which interfere with graph construction and degrade model performance. Another problem is that, when handling high-dimensional manufacturing data, graph-based semi-supervised classification algorithms often perform badly under the influence of data noise and redundant features.

Summary of the Invention

The purpose of the invention is to overcome the shortcomings and deficiencies of the prior art by proposing a semi-supervised classification method for high-dimensional data that effectively alleviates the influence of data noise and redundant features in high-dimensional data on the model, and integrates the graph-construction process and the classifier-training process into a unified framework, significantly improving classification in semi-supervised scenarios.

To achieve the above purpose, the technical solution provided by the invention is a semi-supervised classification method for high-dimensional data, comprising the following steps:

1) Input the training data set, which is a high-dimensional data set;

2) Normalize the data to eliminate the influence of differing feature scales and to speed up the subsequent optimization;

3) Initialize the regression matrix $W \in \mathbb{R}^{d \times c}$ and the subspace projection matrix $A \in \mathbb{R}^{d \times c}$, where d is the number of features of a sample, c is the number of sample classes, and $\mathbb{R}^{d \times c}$ denotes a real matrix of d rows and c columns; initialize the low-rank decomposition matrix $B \in \mathbb{R}^{c \times c}$ of W, where $\mathbb{R}^{c \times c}$ denotes a real matrix of c rows and c columns; initialize the similarity matrix $S \in \mathbb{R}^{n \times n}$ and the parameter matrix $C \in \mathbb{R}^{n \times n}$, where n is the number of samples and $\mathbb{R}^{n \times n}$ denotes a real matrix of n rows and n columns; initialize the bias vector $b \in \mathbb{R}^{c \times 1}$, where $\mathbb{R}^{c \times 1}$ denotes a real matrix of c rows and one column;

4) Subspace learning: derive the optimal solutions of the low-rank decomposition matrix B, the parameter matrix C and the subspace projection matrix A from the proposed subspace-learning objective function; because the objective involves several optimization variables, B, C and A are updated iteratively by alternating optimization, gradually improving the quality of the subspace and thereby learning the subspace that best expresses the essential characteristics of the samples;

5) Learn the sample similarity matrix jointly from the sample subspace and the sample label space; the samples are defined as the nodes of a graph and the similarities between samples as its edges, so the learning process of the similarity matrix is exactly the graph-construction process;

6) On the basis of the subspace learning of step 4) and the similarity-matrix learning of step 5), learn the semi-supervised linear regression classifier, i.e., learn the regression matrix W and the bias vector b;

7) Repeat steps 4) to 6) in a loop, iteratively learning each variable until convergence; at convergence, the three processes of subspace learning, graph construction and classifier learning have reached a joint optimal solution;

8) Classify the test samples: assuming the input test sample is x and the number of sample classes is c, the predicted label predict(x) is

$$\mathrm{predict}(x) = \arg\max_{1 \le i \le c}\,(W^{T}x + b)_i,$$

where $(W^{T}x + b)_i$ denotes the i-th element of the vector $W^{T}x + b$ (a code sketch of this rule follows the step list);

9) Compute the classification accuracy: the labels of the test samples are compared with the predictions to obtain the final classification accuracy; since the test samples are high-dimensional data without class imbalance, classification accuracy alone is used to judge the effect.
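As referenced in step 8), the following is a minimal NumPy sketch of the prediction rule. It is an illustration only; the function and variable names are ours, not the patent's.

```python
import numpy as np

def predict(x, W, b):
    """Predicted label of a test sample x (step 8): the index i that
    maximizes the i-th entry of the score vector W^T x + b."""
    scores = W.T @ x.reshape(-1, 1) + b   # c x 1 vector of class scores
    return int(np.argmax(scores)) + 1     # classes numbered 1..c as in the text
```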

In step 2), the data normalization step is: obtain the maximum value $X(r)_{\max}$ and the minimum value $X(r)_{\min}$ of the r-th row of data, and convert each row of data according to the following formula:

$$X_i^{(r)} \leftarrow \frac{X_i^{(r)} - X(r)_{\min}}{X(r)_{\max} - X(r)_{\min}},$$

where $X_i^{(r)}$ is the i-th datum of the r-th row (the left-hand side being the updated datum), n is the number of samples in the data set, d is the number of features of a sample, i ∈ {1, 2, ..., n}, and r ∈ {1, 2, ..., d}.
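As an illustration, a minimal NumPy sketch of this row-wise min-max normalization (names are ours; the guard for constant rows is our addition, not the patent's):

```python
import numpy as np

def normalize_rows(X):
    """Row-wise min-max normalization of a d x n sample matrix X.

    Each row (feature) is rescaled to [0, 1], removing the influence
    of differing feature scales before optimization."""
    X = X.astype(float)
    row_min = X.min(axis=1, keepdims=True)   # X(r)_min for every row r
    row_max = X.max(axis=1, keepdims=True)   # X(r)_max for every row r
    span = row_max - row_min
    span[span == 0] = 1.0                    # guard against constant features
    return (X - row_min) / span
```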

In step 3), the initialization method is: initialize the regression matrix $W \in \mathbb{R}^{d \times c}$ as an all-zero matrix; initialize the low-rank decomposition matrix $B \in \mathbb{R}^{c \times c}$ of W as an all-zero matrix; initialize the similarity matrix $S \in \mathbb{R}^{n \times n}$ and the parameter matrix $C \in \mathbb{R}^{n \times n}$ as all-zero matrices; initialize the bias vector $b \in \mathbb{R}^{c \times 1}$ as an all-zero vector; initialize the subspace projection matrix A = qf(R) as an orthogonal matrix, where $R \in \mathbb{R}^{d \times c}$ is a random matrix whose entries lie in the interval [0, 1] and qf(·) denotes the QR decomposition.
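A sketch of this initialization under the stated conventions (shapes follow the definitions above; the helper name `qf` mirrors the patent's notation, the rest of the names are ours):

```python
import numpy as np

def qf(R):
    """Orthogonal factor Q of the thin QR decomposition, as in A = qf(R)."""
    Q, _ = np.linalg.qr(R)
    return Q

def initialize(d, c, n, seed=0):
    """All-zero start for W, B, S, C, b; random orthogonal start for A."""
    rng = np.random.default_rng(seed)
    W = np.zeros((d, c))                   # regression matrix
    B = np.zeros((c, c))                   # low-rank decomposition matrix of W
    S = np.zeros((n, n))                   # similarity matrix (graph edges)
    C = np.zeros((n, n))                   # parameter matrix
    b = np.zeros((c, 1))                   # bias vector
    A = qf(rng.uniform(0.0, 1.0, (d, c)))  # orthogonal subspace projection
    return W, B, S, C, b, A
```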

In step 4), the subspace learning process is as follows:

The objective function of subspace learning is defined as

$$\min_{A,B,C}\ \alpha\,\mathrm{tr}(A^{T}XLX^{T}A) + \theta\big(\lVert A^{T}X - A^{T}XC\rVert_F^{2} + \lVert C\rVert_F^{2}\big) + \beta\,\lVert W - AB\rVert_F^{2}, \qquad \text{s.t. } A^{T}A = I,$$

where tr(·) is the trace of a matrix, $\lVert\cdot\rVert_F$ denotes the F-norm, $X \in \mathbb{R}^{d \times n}$ is the sample matrix, $\mathbb{R}^{d \times n}$ denotes a real matrix of d rows and n columns, $W \in \mathbb{R}^{d \times c}$ is the regression matrix, $A \in \mathbb{R}^{d \times c}$ is the subspace projection matrix, $B \in \mathbb{R}^{c \times c}$ is the low-rank decomposition matrix of W, $C \in \mathbb{R}^{n \times n}$ is the parameter matrix, L is the graph Laplacian of the similarity matrix S defined in step 5), and α, θ, β are adjustable parameters.

Taking partial derivatives of the objective with respect to B, C and A yields the update formula of each variable; the variables are then updated as follows:

a. update B according to the formula $B = A^{T}W$;

b. update C according to the formula $C = (X^{T}AA^{T}X + I)^{-1}X^{T}AA^{T}X$;

c. update the subspace projection matrix A cyclically by $A_{t+1} = \mathrm{qf}(A_t + G)$ until convergence,

where I is the identity matrix, t indexes the iteration, $A_t$ is the value of A at iteration t, $A_{t+1}$ its value at iteration t+1, G is the descent direction derived from the gradient of the objective, $G = -2\big(X(\alpha L + \theta(I - C)(I - C)^{T})X^{T}A - \beta WB^{T}\big)$, and qf(·) denotes the QR decomposition.
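A minimal sketch of these three updates, assuming the objective reconstructed above (`qf` and the Laplacian `L` are as defined elsewhere in this description; the inner-loop budget and tolerance are placeholders of ours, and in practice a step size may be folded into G):

```python
import numpy as np

def update_subspace(X, W, A, L, alpha, theta, beta, n_inner=20):
    """One round of alternating updates for B, C, A (step 4)."""
    n = X.shape[1]
    I_n = np.eye(n)

    B = A.T @ W                               # a. B = A^T W
    M = X.T @ A @ A.T @ X
    C = np.linalg.solve(M + I_n, M)           # b. C = (X^T A A^T X + I)^{-1} X^T A A^T X

    K = X @ (alpha * L + theta * (I_n - C) @ (I_n - C).T) @ X.T
    for _ in range(n_inner):                  # c. A_{t+1} = qf(A_t + G)
        G = -2.0 * (K @ A - beta * W @ B.T)
        A_next, _ = np.linalg.qr(A + G)
        done = np.linalg.norm(A_next - A) < 1e-6
        A = A_next
        if done:
            break
    return B, C, A
```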

In step 5), the graph is constructed as follows: the similarity matrix is learned jointly from the sample label space and the sample subspace, and the objective function of similarity-matrix learning is defined as

$$\min_{S}\ \sum_{i,j}\big(\lVert A^{T}x_i - A^{T}x_j\rVert_2^{2} + \lVert W^{T}x_i - W^{T}x_j\rVert_2^{2}\big)S_{ij} + \lambda\lVert S\rVert_F^{2}, \qquad \text{s.t. } S\mathbf{1} = \mathbf{1},\ S_{ij} \ge 0,$$

where tr(·) is the trace of a matrix (the first term can equivalently be written $2\,\mathrm{tr}(A^{T}XLX^{T}A) + 2\,\mathrm{tr}(W^{T}XLX^{T}W)$), $\lVert\cdot\rVert_F$ denotes the F-norm, $W \in \mathbb{R}^{d \times c}$ is the regression matrix, $A \in \mathbb{R}^{d \times c}$ is the subspace projection matrix, $X \in \mathbb{R}^{d \times n}$ is the sample matrix, $\mathbb{R}^{d \times n}$ denotes a real matrix of d rows and n columns, $S \in \mathbb{R}^{n \times n}$ is the similarity matrix, $L \in \mathbb{R}^{n \times n}$ is the Laplacian matrix with L = D − S, D being the diagonal matrix with $D_{ii} = \sum_j S_{ij}$, where $D_{ii}$ is the element of D in row i, column i and $S_{ij}$ the element of S in row i, column j; the parameter λ is the weight of the regularization term.

Let the number of neighbors of each sample be k, i.e., each sample has non-zero similarity only to its k nearest neighbors, all other similarities being 0. Let $x_i, x_j$ denote the i-th and j-th samples, and define $e_{ij}$ as the sum of the squared Euclidean distances between $x_i$ and $x_j$ in the subspace and in the label space:

$$e_{ij} = \lVert A^{T}x_i - A^{T}x_j\rVert_2^{2} + \lVert W^{T}x_i - W^{T}x_j\rVert_2^{2}.$$

Solving the objective then yields the update formula of the similarity matrix S: with $e_{i1} \le e_{i2} \le \cdots \le e_{in}$ denoting the values $e_{ij}$ of sample i sorted in ascending order,

$$S_{ij} = \left(\frac{e_{i,k+1} - e_{ij}}{k\,e_{i,k+1} - \sum_{h=1}^{k} e_{ih}}\right)_{+},$$

where the intermediate variable (the per-sample regularization weight) is $\lambda_i = \frac{k}{2}e_{i,k+1} - \frac{1}{2}\sum_{h=1}^{k} e_{ih}$ and $(\cdot)_+ = \max(\cdot, 0)$.
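A sketch of this adaptive-neighbor graph update, assuming the closed form reconstructed above from the k-nearest-neighbor description (names are ours; the final symmetrization is a common convention for building the Laplacian, assumed rather than stated by the patent):

```python
import numpy as np

def update_similarity(X, W, A, k):
    """Rebuild the similarity matrix S from subspace and label-space
    distances (step 5). Each row of S keeps k non-zero entries."""
    Z = A.T @ X                      # samples in the learned subspace, c x n
    F = W.T @ X                      # samples in the label space, c x n

    def sq_dists(M):
        g = np.sum(M * M, axis=0)
        return g[:, None] + g[None, :] - 2.0 * M.T @ M

    E = sq_dists(Z) + sq_dists(F)    # e_ij for all pairs
    n = X.shape[1]
    S = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(E[i])     # order[0] is i itself (distance ~0)
        nbr = order[1:k + 1]         # k nearest neighbors of sample i
        e_kp1 = E[i, order[k + 1]]   # (k+1)-th smallest distance
        denom = k * e_kp1 - E[i, nbr].sum()
        if denom > 0:
            S[i, nbr] = np.maximum((e_kp1 - E[i, nbr]) / denom, 0.0)
    return (S + S.T) / 2.0           # symmetrize before forming L = D - S
```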

In step 6), the learning process of the semi-supervised linear regression classifier is as follows:

The basic objective function of the semi-supervised linear regression classifier is defined as

$$\min_{W,b}\ \mathrm{tr}\big((W^{T}X + b\mathbf{1}^{T} - Y)\,U\,(W^{T}X + b\mathbf{1}^{T} - Y)^{T}\big) + \gamma\lVert W\rVert_F^{2},$$

where tr(·) is the trace of a matrix, $\lVert\cdot\rVert_F$ denotes the F-norm, $W \in \mathbb{R}^{d \times c}$ is the regression matrix, $X \in \mathbb{R}^{d \times n}$ is the sample matrix, $\mathbb{R}^{d \times n}$ denotes a real matrix of d rows and n columns, $b \in \mathbb{R}^{c \times 1}$ is the bias vector, $Y \in \mathbb{R}^{c \times n}$ is the label matrix of the samples, the parameter γ is the regularization weight, and $U \in \mathbb{R}^{n \times n}$ is the diagonal matrix with $U_{ii} = 1$ if sample $x_i$ is a labeled sample and $U_{ii} = 0$ otherwise, $U_{ii}$ being the element of U in row i, column i.

Combining this objective with the subspace-learning objective of step 4) and the similarity-matrix-learning objective of step 5) gives the final objective function:

$$\min_{W,b,A,B,C,S}\ \mathrm{Loss} + \alpha\big(\mathrm{tr}(W^{T}XLX^{T}W) + \mathrm{tr}(A^{T}XLX^{T}A)\big) + \theta\big(\lVert A^{T}X - A^{T}XC\rVert_F^{2} + \lVert C\rVert_F^{2}\big) + \beta\lVert W - AB\rVert_F^{2} + \gamma\lVert W\rVert_F^{2} + \lambda\lVert S\rVert_F^{2},$$

subject to $A^{T}A = I$, $S\mathbf{1} = \mathbf{1}$ and $S_{ij} \ge 0$, where $\mathrm{Loss} = \mathrm{tr}\big((W^{T}X + b\mathbf{1}^{T} - Y)U(W^{T}X + b\mathbf{1}^{T} - Y)^{T}\big)$, $C \in \mathbb{R}^{n \times n}$ is the parameter matrix, $A \in \mathbb{R}^{d \times c}$ is the subspace projection matrix, $S \in \mathbb{R}^{n \times n}$ is the similarity matrix, $B \in \mathbb{R}^{c \times c}$ is the low-rank decomposition matrix of W, $L \in \mathbb{R}^{n \times n}$ is the Laplacian matrix with L = D − S, D being the diagonal matrix with $D_{ii} = \sum_j S_{ij}$, $D_{ii}$ denoting the element of D in row i, column i and $S_{ij}$ the element of S in row i, column j; the parameters α, θ, β are weights that adjust the importance of the respective terms.

Taking partial derivatives of this final objective with respect to W and b gives their update formulas:

$$W = \big[XU_cX^{T} + \alpha XLX^{T} + \beta(I - AA^{T}) + \gamma I\big]^{-1} XU_cY^{T},$$

$$b = \frac{1}{\mathbf{1}^{T}U\mathbf{1}}\,(Y - W^{T}X)\,U\mathbf{1},$$

where the intermediate variable is $U_c = U - \dfrac{U\mathbf{1}\mathbf{1}^{T}U}{\mathbf{1}^{T}U\mathbf{1}}$.

Then, updating W and b by the above formulas completes the learning process of the semi-supervised linear regression classifier.
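A sketch of the W/b update and of the overall alternation of steps 4)-6), reusing the helper sketches given earlier (`initialize`, `update_subspace`, `update_similarity`). This is a minimal reading of the scheme above; the fixed iteration budget is a placeholder of ours, where the patent iterates until convergence:

```python
import numpy as np

def update_classifier(X, Y, U, L, A, alpha, beta, gamma):
    """Closed-form update of W and b (step 6)."""
    d, n = X.shape
    one = np.ones((n, 1))
    s = float(one.T @ U @ one)               # 1^T U 1, number of labeled samples
    Uc = U - (U @ one @ one.T @ U) / s       # centered label-indicator matrix
    M = X @ Uc @ X.T + alpha * X @ L @ X.T \
        + beta * (np.eye(d) - A @ A.T) + gamma * np.eye(d)
    W = np.linalg.solve(M, X @ Uc @ Y.T)     # W = M^{-1} X Uc Y^T
    b = (Y - W.T @ X) @ U @ one / s          # b = (Y - W^T X) U 1 / (1^T U 1)
    return W, b

def laplacian(S):
    return np.diag(S.sum(axis=1)) - S        # L = D - S

def fit(X, Y, U, k, alpha, theta, beta, gamma, n_iter=30):
    """Alternate subspace learning, graph construction and classifier
    training (steps 4-7)."""
    d, n = X.shape
    c = Y.shape[0]
    W, B, S, C, b, A = initialize(d, c, n)
    L = laplacian(S)
    for _ in range(n_iter):
        B, C, A = update_subspace(X, W, A, L, alpha, theta, beta)  # step 4
        S = update_similarity(X, W, A, k)                          # step 5
        L = laplacian(S)
        W, b = update_classifier(X, Y, U, L, A, alpha, beta, gamma)  # step 6
    return W, b, A, S
```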

Compared with the prior art, the present invention has the following advantages and beneficial effects:

The invention has a solid mathematical foundation and large advantages in accuracy, stability and robustness. First, constructing the graph jointly from the two low-dimensional spaces, the label space and the subspace, overcomes the influence of the redundant features in high-dimensional manufacturing data; a graph built from two spaces is also more robust and better adapted to unstable data distributions. Second, the subspace learning exploits the low-rank property of the regression matrix, making it easier for the subspace to separate samples of different classes. Third, the three processes of subspace learning, graph construction and classifier training are integrated into one unified framework; through cyclic alternating optimization the three processes reinforce one another and reach a joint optimal solution, significantly improving the learning ability of the overall algorithm framework.

Description of the Drawings

Fig. 1 is a schematic diagram of the logic flow of the present invention.

Fig. 2 is a table comparing the accuracy of the present invention with traditional semi-supervised classification algorithms and graph-based semi-supervised classification algorithms; SSCNGC is the abbreviation of the proposed method, bold numbers mark the best results, and the data format is "accuracy ± standard deviation".

Detailed Description

The present invention is described in further detail below in conjunction with the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.

As shown in Fig. 1, the semi-supervised classification method for high-dimensional data provided by this embodiment comprises the following steps:

1) Input the training data set, which is a high-dimensional data set.

2) Normalize the data to eliminate the influence of differing feature scales and to speed up the subsequent optimization. The normalization step is: obtain the maximum value $X(r)_{\max}$ and the minimum value $X(r)_{\min}$ of the r-th row of data, and convert each row of data according to the following formula:

$$X_i^{(r)} \leftarrow \frac{X_i^{(r)} - X(r)_{\min}}{X(r)_{\max} - X(r)_{\min}},$$

where $X_i^{(r)}$ is the i-th datum of the r-th row (the left-hand side being the updated datum), n is the number of samples in the data set, d is the number of features of a sample, i ∈ {1, 2, ..., n}, and r ∈ {1, 2, ..., d}.

3) Initialize the regression matrix $W \in \mathbb{R}^{d \times c}$ and the subspace projection matrix $A \in \mathbb{R}^{d \times c}$, where d is the number of features of a sample and c is the number of sample classes; initialize the low-rank decomposition matrix $B \in \mathbb{R}^{c \times c}$ of W; initialize the similarity matrix $S \in \mathbb{R}^{n \times n}$ and the parameter matrix $C \in \mathbb{R}^{n \times n}$, where n is the number of samples; initialize the bias vector $b \in \mathbb{R}^{c \times 1}$.

The initialization method is: initialize W, B, S and C as all-zero matrices and b as an all-zero vector; initialize the subspace projection matrix A = qf(R) as an orthogonal matrix, where $R \in \mathbb{R}^{d \times c}$ is a random matrix whose entries lie in the interval [0, 1] and qf(·) denotes the QR decomposition.

4) Subspace learning: derive the optimal solutions of the low-rank decomposition matrix B, the parameter matrix C and the subspace projection matrix A from the proposed subspace-learning objective function; because the objective involves several optimization variables, B, C and A are updated iteratively by alternating optimization, gradually improving the quality of the subspace and thereby learning the subspace that best expresses the essential characteristics of the samples.

The subspace learning process is as follows. The objective function of subspace learning is defined as

$$\min_{A,B,C}\ \alpha\,\mathrm{tr}(A^{T}XLX^{T}A) + \theta\big(\lVert A^{T}X - A^{T}XC\rVert_F^{2} + \lVert C\rVert_F^{2}\big) + \beta\lVert W - AB\rVert_F^{2}, \qquad \text{s.t. } A^{T}A = I,$$

where tr(·) is the trace of a matrix, $\lVert\cdot\rVert_F$ denotes the F-norm, $X \in \mathbb{R}^{d \times n}$ is the sample matrix, and α, θ, β are adjustable parameters.

Taking partial derivatives of the objective with respect to B, C and A yields the update formula of each variable; the variables are then updated as follows:

a. update B according to the formula $B = A^{T}W$;

b. update C according to the formula $C = (X^{T}AA^{T}X + I)^{-1}X^{T}AA^{T}X$;

c. update the subspace projection matrix A cyclically by $A_{t+1} = \mathrm{qf}(A_t + G)$ until convergence,

where I is the identity matrix, t indexes the iteration, $A_t$ and $A_{t+1}$ are the values of A at iterations t and t+1, G is the descent direction derived from the gradient of the objective, $G = -2\big(X(\alpha L + \theta(I - C)(I - C)^{T})X^{T}A - \beta WB^{T}\big)$, and qf(·) denotes the QR decomposition.

5) Learn the sample similarity matrix jointly from the sample subspace and the sample label space; the samples are defined as the nodes of a graph and the similarities between samples as its edges, so the learning process of the similarity matrix is exactly the graph-construction process.

The graph is constructed as follows: the similarity matrix is learned jointly from the sample label space and the sample subspace, and the objective function of similarity-matrix learning is defined as

$$\min_{S}\ \sum_{i,j}\big(\lVert A^{T}x_i - A^{T}x_j\rVert_2^{2} + \lVert W^{T}x_i - W^{T}x_j\rVert_2^{2}\big)S_{ij} + \lambda\lVert S\rVert_F^{2}, \qquad \text{s.t. } S\mathbf{1} = \mathbf{1},\ S_{ij} \ge 0,$$

where the parameter λ is the weight of the regularization term.

Let the number of neighbors of each sample be k, i.e., each sample has non-zero similarity only to its k nearest neighbors, all other similarities being 0. Let $x_i, x_j$ denote the i-th and j-th samples, and define $e_{ij}$ as the sum of the squared Euclidean distances between $x_i$ and $x_j$ in the subspace and in the label space:

$$e_{ij} = \lVert A^{T}x_i - A^{T}x_j\rVert_2^{2} + \lVert W^{T}x_i - W^{T}x_j\rVert_2^{2}.$$

Solving the objective then yields the update formula of the similarity matrix S: with $e_{i1} \le e_{i2} \le \cdots \le e_{in}$ denoting the values $e_{ij}$ of sample i sorted in ascending order,

$$S_{ij} = \left(\frac{e_{i,k+1} - e_{ij}}{k\,e_{i,k+1} - \sum_{h=1}^{k} e_{ih}}\right)_{+},$$

where the intermediate variable $\lambda_i = \frac{k}{2}e_{i,k+1} - \frac{1}{2}\sum_{h=1}^{k} e_{ih}$ and $(\cdot)_+ = \max(\cdot, 0)$.

6) On the basis of the subspace learning of step 4) and the similarity-matrix learning of step 5), learn the semi-supervised linear regression classifier, i.e., learn the regression matrix W and the bias vector b.

The learning process of the semi-supervised linear regression classifier is as follows. The basic objective function of the classifier is defined as

$$\min_{W,b}\ \mathrm{tr}\big((W^{T}X + b\mathbf{1}^{T} - Y)\,U\,(W^{T}X + b\mathbf{1}^{T} - Y)^{T}\big) + \gamma\lVert W\rVert_F^{2},$$

where $U \in \mathbb{R}^{n \times n}$ is the diagonal matrix with $U_{ii} = 1$ if sample $x_i$ is a labeled sample and $U_{ii} = 0$ otherwise, $U_{ii}$ being the element of U in row i, column i.

Combining this objective with the subspace-learning objective of step 4) and the similarity-matrix-learning objective of step 5) gives the final objective function:

$$\min_{W,b,A,B,C,S}\ \mathrm{Loss} + \alpha\big(\mathrm{tr}(W^{T}XLX^{T}W) + \mathrm{tr}(A^{T}XLX^{T}A)\big) + \theta\big(\lVert A^{T}X - A^{T}XC\rVert_F^{2} + \lVert C\rVert_F^{2}\big) + \beta\lVert W - AB\rVert_F^{2} + \gamma\lVert W\rVert_F^{2} + \lambda\lVert S\rVert_F^{2},$$

where $\mathrm{Loss} = \mathrm{tr}\big((W^{T}X + b\mathbf{1}^{T} - Y)U(W^{T}X + b\mathbf{1}^{T} - Y)^{T}\big)$ and the parameters α, θ, β are weights that adjust the importance of the respective terms.

Taking partial derivatives of the final objective with respect to W and b gives their update formulas:

$$W = \big[XU_cX^{T} + \alpha XLX^{T} + \beta(I - AA^{T}) + \gamma I\big]^{-1} XU_cY^{T},$$

$$b = \frac{1}{\mathbf{1}^{T}U\mathbf{1}}\,(Y - W^{T}X)\,U\mathbf{1},$$

where the intermediate variable is $U_c = U - \dfrac{U\mathbf{1}\mathbf{1}^{T}U}{\mathbf{1}^{T}U\mathbf{1}}$.

Then, updating W and b by the above formulas completes the learning process of the semi-supervised linear regression classifier.

7) Repeat steps 4) to 6) in a loop, iteratively learning each variable until convergence; at convergence, the three processes of subspace learning, graph construction and classifier learning have reached a joint optimal solution.

8) Classify the test samples: assuming the input test sample is x, the predicted label predict(x) is

$$\mathrm{predict}(x) = \arg\max_{1 \le i \le c}\,(W^{T}x + b)_i,$$

where $(W^{T}x + b)_i$ denotes the i-th element of the vector $W^{T}x + b$.

9) Compute the classification accuracy: the labels of the test samples are compared with the predictions to obtain the final classification accuracy; since the test samples are high-dimensional data without class imbalance, classification accuracy alone is used to judge the effect.

Fig. 2 is a table comparing the accuracy of the present invention with traditional semi-supervised classification algorithms and graph-based semi-supervised classification algorithms; SSCNGC is the abbreviation of the proposed method, bold numbers mark the best results, and the data format is "accuracy ± standard deviation". As the figure shows, in experiments on 16 high-dimensional data sets the invention achieved the highest accuracy on 15 of them and an improvement of more than 5% on 9 of them, demonstrating its superiority over traditional semi-supervised algorithms.

The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not limited to it; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (5)

1. A semi-supervised classification method for high-dimensional data, characterized by comprising the following steps:
1) inputting a training data set, which is a high-dimensional data set;
2) normalizing the data, eliminating the influence of differing feature scales and improving the speed of the subsequent optimization;
3) initializing a regression matrix $W \in \mathbb{R}^{d \times c}$ and a subspace projection matrix $A \in \mathbb{R}^{d \times c}$, where d is the number of features of a sample, c is the number of sample classes, and $\mathbb{R}^{d \times c}$ denotes a real matrix of d rows and c columns; initializing a low-rank decomposition matrix $B \in \mathbb{R}^{c \times c}$ of W, where $\mathbb{R}^{c \times c}$ denotes a real matrix of c rows and c columns; initializing a similarity matrix $S \in \mathbb{R}^{n \times n}$ and a parameter matrix $C \in \mathbb{R}^{n \times n}$, where n is the number of samples and $\mathbb{R}^{n \times n}$ denotes a real matrix of n rows and n columns; initializing a bias vector $b \in \mathbb{R}^{c \times 1}$, where $\mathbb{R}^{c \times 1}$ denotes a real matrix of c rows and one column;
4) subspace learning: deriving the optimal solutions of the low-rank decomposition matrix B, the parameter matrix C and the subspace projection matrix A from the proposed subspace-learning objective function; because the objective involves several optimization variables, B, C and A are updated iteratively by alternating optimization, gradually improving the quality of the subspace and thereby learning the optimal subspace expressing the essential characteristics of the samples; the subspace learning process is as follows:
the objective function of subspace learning is defined as
$$\min_{A,B,C}\ \alpha\,\mathrm{tr}(A^{T}XLX^{T}A) + \theta\big(\lVert A^{T}X - A^{T}XC\rVert_F^{2} + \lVert C\rVert_F^{2}\big) + \beta\lVert W - AB\rVert_F^{2}, \qquad \text{s.t. } A^{T}A = I,$$
wherein tr(·) is the trace of a matrix, $\lVert\cdot\rVert_F$ denotes the F-norm, $X \in \mathbb{R}^{d \times n}$ is the sample matrix, $\mathbb{R}^{d \times n}$ denotes a real matrix of d rows and n columns, W is the regression matrix, A is the subspace projection matrix, B is the low-rank decomposition matrix of W, C is the parameter matrix, and α, θ, β are adjustable parameters;
taking partial derivatives of the objective with respect to B, C and A gives the update formula of each variable; each variable is then updated as follows:
a. updating B according to the formula $B = A^{T}W$;
b. updating C according to the formula $C = (X^{T}AA^{T}X + I)^{-1}X^{T}AA^{T}X$;
c. cyclically updating the subspace projection matrix A by $A_{t+1} = \mathrm{qf}(A_t + G)$ until convergence,
wherein I is an identity matrix, t denotes the t-th iteration, $A_t$ is the value of A at iteration t, $A_{t+1}$ the value at iteration t+1, G is the gradient-based direction $G = -2\big(X(\alpha L + \theta(I - C)(I - C)^{T})X^{T}A - \beta WB^{T}\big)$, and qf(·) denotes the QR decomposition;
5) learning the sample similarity matrix jointly from the sample subspace and the sample label space; the samples are defined as the nodes of a graph and the similarities between samples as its edges, the learning process of the sample similarity matrix being the construction process of the graph;
6) learning a semi-supervised linear regression classifier, i.e., learning a regression matrix W and a bias vector b, on the basis of the subspace learning of step 4) and the similarity-matrix learning of step 5);
7) performing steps 4) to 6) in a loop, iteratively learning each variable until convergence; at convergence, joint optimal solutions are obtained for the three processes of subspace learning, graph construction and classifier learning;
8) classifying the test samples: assuming the input test sample is x and the number of sample classes is c, the predicted label predict(x) is
$$\mathrm{predict}(x) = \arg\max_{1 \le i \le c}\,(W^{T}x + b)_i,$$
wherein $(W^{T}x + b)_i$ denotes the i-th element of the vector $W^{T}x + b$;
9) calculating the classification accuracy: inputting the labels of the test samples, comparing them with the predictions, and computing the final classification accuracy.
2. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein in step 2) the data normalization step is: obtaining the maximum value $X(r)_{\max}$ and the minimum value $X(r)_{\min}$ corresponding to the r-th row of data, and converting each row of data according to the following formula:
$$X_i^{(r)} \leftarrow \frac{X_i^{(r)} - X(r)_{\min}}{X(r)_{\max} - X(r)_{\min}},$$
wherein $X_i^{(r)}$ is the i-th datum of the r-th row (the left-hand side being the updated datum), n is the number of samples in the data set, d is the number of features of the samples, i ∈ {1, 2, ..., n}, and r ∈ {1, 2, ..., d}.
3. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein in step 3) the initialization method is as follows: initializing the regression matrix $W \in \mathbb{R}^{d \times c}$ as an all-zero matrix; initializing the low-rank decomposition matrix $B \in \mathbb{R}^{c \times c}$ of W as an all-zero matrix; initializing the similarity matrix $S \in \mathbb{R}^{n \times n}$ and the parameter matrix $C \in \mathbb{R}^{n \times n}$ as all-zero matrices; initializing the bias vector $b \in \mathbb{R}^{c \times 1}$ as an all-zero vector; and initializing the subspace projection matrix A = qf(R) as an orthogonal matrix, wherein $R \in \mathbb{R}^{d \times c}$ is a random matrix with each element in the interval [0, 1] and qf(·) denotes the QR decomposition.
4. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein in step 5) the construction process of the graph is as follows: the similarity matrix is jointly learned from the sample label space and the sample subspace, and the objective function of similarity-matrix learning is defined as
$$\min_{S}\ \sum_{i,j}\big(\lVert A^{T}x_i - A^{T}x_j\rVert_2^{2} + \lVert W^{T}x_i - W^{T}x_j\rVert_2^{2}\big)S_{ij} + \lambda\lVert S\rVert_F^{2}, \qquad \text{s.t. } S\mathbf{1} = \mathbf{1},\ S_{ij} \ge 0,$$
wherein tr(·) is the trace of a matrix (the first term can equivalently be written $2\,\mathrm{tr}(A^{T}XLX^{T}A) + 2\,\mathrm{tr}(W^{T}XLX^{T}W)$), $\lVert\cdot\rVert_F$ denotes the F-norm, W is the regression matrix, A is the subspace projection matrix, X is the sample matrix, $\mathbb{R}^{d \times n}$ denotes a real matrix of d rows and n columns, S is the similarity matrix, L is the Laplacian matrix with L = D − S, D is the diagonal matrix with $D_{ii} = \sum_j S_{ij}$, $D_{ii}$ denoting the element of matrix D in row i, column i and $S_{ij}$ the element of similarity matrix S in row i, column j; the parameter λ is the weight of the regularization term;
setting the number of neighbors of each sample as k, i.e., the similarity of each sample is non-zero only for its k neighbor samples and 0 otherwise; letting $x_i, x_j$ respectively denote the i-th and j-th samples, and defining $e_{ij}$ as the sum of the squared Euclidean distances between $x_i$ and $x_j$ in the subspace and in the label space, $e_{ij}$ is calculated as follows:
$$e_{ij} = \lVert A^{T}x_i - A^{T}x_j\rVert_2^{2} + \lVert W^{T}x_i - W^{T}x_j\rVert_2^{2};$$
then, according to the solution of the objective function, the update formula of the similarity matrix S can be obtained: with $e_{i1} \le e_{i2} \le \cdots \le e_{in}$ denoting the values $e_{ij}$ of sample i sorted in ascending order,
$$S_{ij} = \left(\frac{e_{i,k+1} - e_{ij}}{k\,e_{i,k+1} - \sum_{h=1}^{k} e_{ih}}\right)_{+},$$
wherein the intermediate variable $\lambda_i = \frac{k}{2}e_{i,k+1} - \frac{1}{2}\sum_{h=1}^{k} e_{ih}$ and $(\cdot)_+ = \max(\cdot, 0)$.
5. The semi-supervised classification method for high-dimensional data as recited in claim 1, wherein in step 6) the learning process of the semi-supervised linear regression classifier is as follows:
the basic objective function of the semi-supervised linear regression classifier is defined as
$$\min_{W,b}\ \mathrm{tr}\big((W^{T}X + b\mathbf{1}^{T} - Y)\,U\,(W^{T}X + b\mathbf{1}^{T} - Y)^{T}\big) + \gamma\lVert W\rVert_F^{2},$$
wherein tr(·) is the trace of a matrix, $\lVert\cdot\rVert_F$ denotes the F-norm, W is the regression matrix, X is the sample matrix, $\mathbb{R}^{d \times n}$ denotes a real matrix of d rows and n columns, b is the bias vector, $Y \in \mathbb{R}^{c \times n}$ is the label matrix of the samples, the parameter γ is the regularization weight, and $U \in \mathbb{R}^{n \times n}$ is the diagonal matrix with $U_{ii} = 1$ if sample $x_i$ is a labeled sample and $U_{ii} = 0$ otherwise, $U_{ii}$ denoting the element of U in row i, column i;
combining this objective function with the objective function learned by the subspace in step 4) and the objective function learned by the similarity matrix in step 5) gives the final objective function:
$$\min_{W,b,A,B,C,S}\ \mathrm{Loss} + \alpha\big(\mathrm{tr}(W^{T}XLX^{T}W) + \mathrm{tr}(A^{T}XLX^{T}A)\big) + \theta\big(\lVert A^{T}X - A^{T}XC\rVert_F^{2} + \lVert C\rVert_F^{2}\big) + \beta\lVert W - AB\rVert_F^{2} + \gamma\lVert W\rVert_F^{2} + \lambda\lVert S\rVert_F^{2},$$
subject to $A^{T}A = I$, $S\mathbf{1} = \mathbf{1}$, $S_{ij} \ge 0$, wherein $\mathrm{Loss} = \mathrm{tr}\big((W^{T}X + b\mathbf{1}^{T} - Y)U(W^{T}X + b\mathbf{1}^{T} - Y)^{T}\big)$, C is the parameter matrix, A is the subspace projection matrix, S is the similarity matrix, B is the low-rank decomposition matrix of W, L is the Laplacian matrix with L = D − S, D is the diagonal matrix with $D_{ii} = \sum_j S_{ij}$, $D_{ii}$ denoting the element of matrix D in row i, column i and $S_{ij}$ the element of similarity matrix S in row i, column j; the parameters α, θ, β are weights for adjusting the importance of each item;
respectively solving the partial derivatives of the final objective function with respect to W and b gives the update formulas of W and b as follows:
$$W = \big[XU_cX^{T} + \alpha XLX^{T} + \beta(I - AA^{T}) + \gamma I\big]^{-1} XU_cY^{T},$$
$$b = \frac{1}{\mathbf{1}^{T}U\mathbf{1}}\,(Y - W^{T}X)\,U\mathbf{1},$$
wherein the intermediate variable $U_c = U - \dfrac{U\mathbf{1}\mathbf{1}^{T}U}{\mathbf{1}^{T}U\mathbf{1}}$;
then, updating W and b according to the above update formulas completes the learning process of the semi-supervised linear regression classifier.
CN202110285595.XA 2021-03-17 2021-03-17 Semi-supervised classification method for high-dimensional data Active CN113033641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110285595.XA 2021-03-17 2021-03-17 Semi-supervised classification method for high-dimensional data (granted as CN113033641B)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110285595.XA 2021-03-17 2021-03-17 Semi-supervised classification method for high-dimensional data (granted as CN113033641B)

Publications (2)

Publication Number Publication Date
CN113033641A (en) 2021-06-25
CN113033641B (en) 2022-12-16

Family

ID=76471055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110285595.XA Semi-supervised classification method for high-dimensional data 2021-03-17 2021-03-17 (Active, granted as CN113033641B)

Country Status (1)

Country Link
CN (1) CN113033641B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841214B * 2022-05-18 2023-06-02 Hangzhou Dianzi University Pulse data classification method and device based on semi-supervised discriminant projection
CN118506902B * 2024-07-19 2024-10-25 China University of Petroleum (East China) Molecular property prediction method based on a metric-based few-shot learning method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015167526A1 (en) * 2014-04-30 2015-11-05 Hewlett-Packard Development Company, L.P Facilitating interpretation of high-dimensional data clusters
CN106778832A (en) * 2016-11-28 2017-05-31 华南理工大学 The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
CN111027582A (en) * 2019-09-20 2020-04-17 哈尔滨理工大学 Semi-supervised feature subspace learning method and device based on low-rank graph learning
CN112232438A (en) * 2020-11-05 2021-01-15 华东理工大学 High-dimensional image representation-oriented multi-kernel subspace learning framework

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968639A * 2012-09-28 2013-03-13 Wuhan University of Science and Technology Semi-supervised image clustering subspace learning algorithm based on local linear regression
CN110717354B * 2018-07-11 2023-05-12 Harbin Institute of Technology Superpixel classification method based on semi-supervised K-SVD and multi-scale sparse representation
CN109784392B * 2019-01-07 2020-12-22 South China University of Technology A Synthetic Confidence-Based Semi-Supervised Classification Method for Hyperspectral Images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015167526A1 (en) * 2014-04-30 2015-11-05 Hewlett-Packard Development Company, L.P Facilitating interpretation of high-dimensional data clusters
CN106778832A (en) * 2016-11-28 2017-05-31 华南理工大学 The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
CN111027582A (en) * 2019-09-20 2020-04-17 哈尔滨理工大学 Semi-supervised feature subspace learning method and device based on low-rank graph learning
CN112232438A (en) * 2020-11-05 2021-01-15 华东理工大学 High-dimensional image representation-oriented multi-kernel subspace learning framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Yidong, "Research on adaptive semi-supervised ensemble classification algorithms on high-dimensional data," China Master's Theses Full-text Database, Information Science and Technology, 2020-01-31, full text. *

Also Published As

Publication number Publication date
CN113033641A (en) 2021-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant