
CN115905860A - Method and system for deep learning test sample selection based on feature distribution analysis - Google Patents

Method and system for deep learning test sample selection based on feature distribution analysis

Info

Publication number
CN115905860A
Authority
CN
China
Prior art keywords
test
sample
training
feature
samples
Prior art date
Legal status
Granted
Application number
CN202211344257.XA
Other languages
Chinese (zh)
Other versions
CN115905860B (en)
Inventor
陶传奇
李丽
郭虹静
黄志球
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202211344257.XA priority Critical patent/CN115905860B/en
Publication of CN115905860A publication Critical patent/CN115905860A/en
Application granted granted Critical
Publication of CN115905860B publication Critical patent/CN115905860B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning test sample selection method and system based on feature distribution analysis. The method comprises the following steps: partitioning the feature distribution of the training samples with a clustering method to obtain feature distribution clusters of the training samples; calculating, from these clusters, the feature difference value of each training/test sample on each feature distribution cluster; calculating the feature dispersion of each training sample from its output vector; constructing a ranking model based on a learning-to-rank algorithm to predict and rank test samples; and generating a synthetic test set, ranking it with the ranking model, setting a sampling ratio, and selecting the top-ranked test samples to form a test subset. Under limited resources, the method screens out test inputs that can quickly and adequately detect DNN model faults, thereby alleviating the impact of data distribution changes on DNN model accuracy, reducing test labeling costs, and improving DNN model testing efficiency.

Description

Deep learning test sample selection method and system based on feature distribution analysis

Technical Field

The present invention belongs to the technical field of deep learning testing, and specifically relates to a deep learning test sample selection method and system based on feature distribution analysis.

Background Art

With the rapid development of deep learning (DL) technology, deep neural network (DNN) models trained on large amounts of data are widely used in fields such as autonomous driving, face recognition, speech recognition, medical diagnosis, aircraft collision avoidance systems, and software engineering. Although DNNs have achieved great results, their quality problems have also led to many safety accidents. In practice, a DNN is mostly tested on a test set drawn from the same dataset as the training set, so its data distribution matches the training data. However, when a DNN is deployed in real scenarios, new test data is generated continuously over time and its distribution grows increasingly diverse; as the data distribution shifts, the model's effectiveness in the test environment steadily declines.

Therefore, to ensure that the model can adapt to differently distributed data in the new environment, in DNN testing, testers collect new unlabeled test data to retrain the original model, thereby effectively detecting DNN model faults while updating the model's weight parameters, further improving model quality to fit the new data distribution.

However, large amounts of test data must be correctly labeled before they can be used for testing, and manual labeling is currently the main method. To ensure labeling correctness, multiple users usually collaborate on the labeling task, and some test data from specialized domains (such as images and texts from the medical field) must be labeled by domain experts. This incurs heavy labeling costs and seriously hurts the efficiency of DNN model testing. Maintaining a small test set with good testing capability has therefore become an important research problem in deep learning testing. The goal of deep learning test selection methods is to pick representative test inputs from the original unlabeled test set to form a test subset, and then label only that subset, reducing the overall cost of labeling and test execution.

To improve DNN model testing efficiency, many test selection methods have been proposed. Shen et al. proposed MCP, a test selection method that clusters test samples into multiple boundary regions based on the highest and second-highest prediction confidence of the DNN model under test on each test input, and selects test samples uniformly from all boundary regions by priority to form a test subset that guides retraining of the DNN model and improves its quality.

However, the prior art has the following problems:

1) Existing methods do not consider changes in data distribution and cannot effectively select test samples whose distribution differs from that of the training samples to retrain the model, and thus cannot improve the model's ability to adapt to new data distributions.

2) The test subsets screened by existing methods do not guarantee the diversity of their test samples and lack test adequacy.

Summary of the Invention

In view of the above deficiencies of the prior art, the purpose of the present invention is to provide a deep learning test sample selection method and system based on feature distribution analysis, so as to solve the problems of insufficient testing and unknown data distribution changes in test subsets selected by the prior art. Under limited resources, the method screens out test inputs that can quickly and adequately detect DNN model faults, alleviating the impact of data distribution changes on DNN model accuracy while reducing test labeling costs and improving DNN testing efficiency.

To achieve the above object, the technical solution adopted by the present invention is as follows:

The deep learning test sample selection method based on feature distribution analysis of the present invention comprises the following steps:

1) Partition the feature distribution of the training samples with a clustering method to obtain feature distribution clusters of the training samples;

2) According to the feature distribution clusters obtained in step 1), calculate the feature difference value of each training/test sample on each feature distribution cluster;

3) Calculate the feature dispersion of each training sample from its output vector;

4) Build a ranking model based on a learning-to-rank algorithm, and use it to predict and rank test samples;

5) Generate a synthetic test set with adversarial attack methods, rank the synthetic test set with the ranking model, set a sampling ratio, and select the top-ranked test samples to form the test subset.

Further, step 1) specifically includes: inputting all training samples into a DNN model for prediction to obtain their output vectors, i.e., the feature vectors of all training samples; and clustering the training samples by their feature vectors with the K-means algorithm to partition the feature distributions of the different categories of training samples on the corresponding DNN model, yielding multiple feature distribution clusters. Each cluster represents one type of feature distribution, and training samples in the same cluster share that type of feature distribution.
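A minimal sketch of this step, assuming a Keras-style classifier `model` and a training-input array `x_train`; the cluster count and all names are illustrative, not taken from the patent:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_feature_distributions(model, x_train, n_clusters=10, seed=0):
    # The DNN's output vectors serve as the feature vectors of the samples.
    feats = model.predict(x_train)            # shape: (N, n_classes)
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(feats)
    # km.labels_[i] is the feature distribution cluster of training sample i.
    return feats, km.labels_, km.cluster_centers_
```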

Further, step 2) specifically includes:

Feature difference value (FDF) calculation: given a sample set S of size m (comprising the training subset R and the test set T), the model under test D, and w feature distribution clusters of the training samples on D obtained by the K-means clustering algorithm, let the set of feature distribution clusters be C = {c1, c2, ..., cw}. Let sh be the h-th sample in S and cj the j-th cluster; select k prototypes from cj with the MMD-critic algorithm to form the set Xj = {xj,1, xj,2, ..., xj,k}, where xj,p is the p-th prototype in Xj. Using the feature vector (i.e., the output vector) of sample sh, compute the distance between sh and every prototype in Xj, and take the average of these distances as the feature difference value FDFh,j of sample sh on cluster cj:

$$\mathrm{FDF}_{h,j} = \frac{1}{k}\sum_{p=1}^{k} \operatorname{dist}\left(s_h,\; x_{j,p}\right)$$

The set of feature difference values of sample sh over the cluster set C = {c1, c2, ..., cw} is then FDFh = {FDFh,1, FDFh,2, ..., FDFh,w}.
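As a minimal sketch of this computation, assuming each cluster's prototypes are stored as an array of DNN output vectors and taking dist to be the Euclidean distance used in the embodiment below (the function names are ours):

```python
import numpy as np

def fdf(sample_vec, prototypes):
    # Average Euclidean distance from one sample's feature vector to the
    # k prototypes of a single cluster: FDF_{h,j}.
    return np.linalg.norm(prototypes - sample_vec, axis=1).mean()

def fdf_set(sample_vec, prototype_sets):
    # FDF of the sample on every cluster: {FDF_{h,1}, ..., FDF_{h,w}}.
    return np.array([fdf(sample_vec, X_j) for X_j in prototype_sets])
```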

Further, step 3) specifically includes:

Feature dispersion (FDS) calculation: given the training set R and the model under test D, where R has n categories, suppose the predicted probability vector of a training sample r (r ∈ R) on D is P(r) = <pr,1, pr,2, ..., pr,n>, where pr,n is the probability that r is predicted as the n-th category. Suppose the true category of r is the t-th category; from the predicted probability pr,t of the t-th category, construct a reference vector of length n, P′(r) = <pr,t, pr,t, ..., pr,t>, where pr,t is the probability that r is predicted as the t-th category. The feature dispersion FDS of training sample r is calculated as:

$$\mathrm{FDS}(r) = \left\lVert P(r) - P'(r) \right\rVert_2 = \sqrt{\sum_{i=1}^{n} \left(p_{r,i} - p_{r,t}\right)^2}$$

The higher the FDS value of a training sample, the more dispersed its feature distribution.
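A matching sketch of the FDS computation; the norm is assumed to be Euclidean, consistent with the distance reading of the formula above:

```python
import numpy as np

def fds(prob_vec, true_class):
    # Reference vector repeats the predicted probability of the true class.
    ref = np.full_like(prob_vec, prob_vec[true_class])
    # Feature dispersion: distance between prediction vector and reference.
    return np.linalg.norm(prob_vec - ref)
```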

Further, step 4) specifically includes:

41) Construction of the new training set: denote the new training set FP = {fp1,2, fp2,3, ..., fpn-1,n}, where fpn-1,n is a feature pair formed from the feature difference values of the original training samples rn-1 and rn. Given a feature pair fpi,j formed from the feature difference values of original training samples ri and rj, written fpi,j = <FDFi, FDFj>, its label labeli,j takes the value -1, 0, or 1 and is determined by the relative magnitude of the feature dispersions of ri and rj; the label reflects the relative model-training capability of the two original samples in the pair. When FDSi > FDSj, labeli,j = 1, indicating that the features of ri are more dispersed than those of rj and ri trains the model more strongly; when FDSi = FDSj, labeli,j = 0, indicating that the feature distributions of ri and rj are equally dispersed and the two samples train the model equally; when FDSi < FDSj, labeli,j = -1, indicating that the features of rj are more dispersed than those of ri and rj trains the model more strongly. The new training set FP is fed into the xgboost ranking algorithm, which uses a tree ensemble with the number of trees set to m: new trees are added by repeatedly splitting on the basic features of the training samples so as to learn more complex features from the training set, while the prediction residuals are fitted continually to minimize the training error. After the set number of m trees has been generated, construction of the ranking model is complete (see the code sketch following sub-step 43) below).

42) Construction of the new test set: new test samples are built by extracting the feature difference value set of each test sample. Supposing the feature difference value set of a test sample ti is FDFi, the new test sample built from ti is a feature pair containing only this one set of feature difference values, i.e., <FDFi>.

43) Prediction of test samples: each new test sample is fed into the constructed ranking model, and the score values of the leaf nodes reached by each feature of the new test sample in every tree of the ranking model are summed to obtain its predicted score. This predicted score reflects how strongly the test sample correlates with revealing model errors: the larger the score, the more readily the test sample reveals a model error. Test samples with larger predicted scores are placed first, which ranks the test samples.
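The following is one hedged realization of sub-steps 41) to 43): the explicit pair construction follows the description above, while the ranking model itself is approximated with xgboost's rank:pairwise objective, which derives the same ±1-labelled pairs internally from per-sample relevance labels. The FDS-rank relevance encoding, tree count, and all names are our assumptions, not the patent's reference implementation:

```python
import numpy as np
import xgboost as xgb
from itertools import combinations

def build_pairwise_training_set(train_fdf, train_fds):
    # 41) Each new training instance pairs the FDF sets of two original
    # samples; its label is the sign of the difference of their FDS values.
    X_pairs, labels = [], []
    for i, j in combinations(range(len(train_fdf)), 2):
        X_pairs.append(np.concatenate([train_fdf[i], train_fdf[j]]))
        labels.append(int(np.sign(train_fds[i] - train_fds[j])))  # -1, 0 or 1
    return np.asarray(X_pairs), np.asarray(labels)

def train_ranking_model(train_fdf, train_fds, n_trees=100):
    # Shortcut: rank:pairwise forms the +/-1-labelled pairs internally from
    # per-sample relevance labels, here the ranks of the FDS values.
    relevance = np.argsort(np.argsort(train_fds))
    ranker = xgb.XGBRanker(objective="rank:pairwise", n_estimators=n_trees)
    ranker.fit(train_fdf, relevance, group=[len(train_fdf)])  # one query group
    return ranker

def rank_test_samples(ranker, test_fdf):
    # 42)-43) A new test sample is just its FDF set <FDF_i>; a higher
    # predicted score means the sample more readily reveals model errors.
    scores = ranker.predict(test_fdf)
    return np.argsort(-scores), scores
```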

Further, step 5) specifically includes: perturbing every test sample in the test set with the five adversarial attack methods FGSM, BIM-A, BIM-B, CW, and JSMA to generate an adversarial sample set; randomly taking 20% adversarial samples from the adversarial sample set and 80% test samples from the test set to form synthetic test sets with different data distributions; ranking each synthetic test set with the ranking model; and, given a sampling ratio, selecting the top-ranked test samples from the ranked synthetic test set in that proportion to form the test subset, completing test sample selection.
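As one example of this synthesis step, an FGSM-style perturbation and the 20/80 mix might be sketched as follows (TensorFlow is assumed for the attack; BIM-A/B, CW, and JSMA are omitted, and the epsilon, clipping range, and names are illustrative):

```python
import numpy as np
import tensorflow as tf

def fgsm(model, x, y, eps=0.01):
    # One-step fast gradient sign perturbation of inputs scaled to [0, 1].
    x = tf.convert_to_tensor(x, tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y, model(x))
    adv = x + eps * tf.sign(tape.gradient(loss, x))
    return tf.clip_by_value(adv, 0.0, 1.0).numpy()

def synthesize_test_set(x_test, x_adv, adv_ratio=0.2, seed=0):
    # Mix 20% adversarial samples with 80% original test samples.
    rng = np.random.default_rng(seed)
    n_adv = int(adv_ratio * len(x_test))
    adv_part = x_adv[rng.choice(len(x_adv), n_adv, replace=False)]
    org_part = x_test[rng.choice(len(x_test), len(x_test) - n_adv, replace=False)]
    return np.concatenate([adv_part, org_part])
```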

The present invention also provides a deep learning test sample selection system based on feature distribution analysis, comprising:

a feature distribution partitioning module, for partitioning the feature distribution of training samples with a clustering method to obtain feature distribution clusters of the training samples;

a feature difference value calculation module, for calculating, from the feature distribution clusters of the training samples, the feature difference value of each training/test sample on each feature distribution cluster;

a feature dispersion calculation module, for calculating the feature dispersion of each training sample from its output vector;

a ranking model construction module, for building a ranking model based on a learning-to-rank algorithm and using it to predict and rank test samples;

a generation module, for generating a synthetic test set with adversarial attack methods; and

a test sample selection module, for ranking the synthetic test set with the ranking model, setting a sampling ratio, and selecting the top-ranked test samples to form the test subset.

The present invention also provides a selection terminal, comprising:

one or more processors; and

a memory for storing one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the deep learning test sample selection method based on feature distribution analysis.

The present invention also provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the deep learning test sample selection method based on feature distribution analysis.

Beneficial effects of the present invention:

The present invention has effective error detection capability: based on an analysis of sample feature distributions, it uses learning to rank to build a ranking model that learns the relationship between the feature distribution information of training samples and their model-training capability, so as to predict the correlation between the feature distribution information of test samples and their ability to reveal model errors, yielding a set of correlation scores. The magnitude of a score reflects the test sample's ability to reveal errors. Ranking and selecting test samples by these scores quickly finds effective test cases to label and detects faults in the DNN model, greatly reducing labeling cost and improving DNN testing efficiency.

The present invention can mitigate data distribution shift: it measures the distinctiveness of a test sample's feature distribution by extracting the difference between that distribution and the feature distributions of the training samples in each category. Based on the feature difference values, test samples with larger distribution differences are selected to retrain the model, so that the retrained model adapts effectively to the new data distribution, mitigating the impact of distribution changes on model accuracy.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the method of the present invention.

FIG. 2 is a schematic diagram of clustering training samples in the present invention.

FIG. 3 is a schematic diagram of extracting feature difference values in the present invention.

FIG. 4 is a schematic diagram of calculating feature dispersion in the present invention.

Detailed Description

To facilitate understanding by those skilled in the art, the present invention is further described below in conjunction with embodiments and drawings; the content mentioned in the embodiments is not a limitation of the present invention.

Referring to FIG. 1, the deep learning test sample selection method based on feature distribution analysis of the present invention comprises the following steps:

1) Partition the feature distribution of the training samples with a clustering method to obtain feature distribution clusters of the training samples. Referring to FIG. 2, this specifically includes: inputting all training samples (without regard to their labels) into a DNN model for prediction to obtain their output vectors, i.e., the feature vectors of all training samples; and clustering the training samples by their feature vectors with the K-means algorithm to partition the feature distributions of the different categories of training samples on the corresponding DNN model, yielding multiple feature distribution clusters, where each cluster represents one type of feature distribution and training samples in the same cluster share that type of feature distribution.

2) According to the feature distribution clusters obtained in step 1), calculate the feature difference value (FDF) of each training/test sample on each feature distribution cluster. Referring to FIG. 3, this specifically includes:

Feature difference value (FDF) calculation: given a sample set S of size m (comprising the training subset R and the test set T), the model under test D, and w feature distribution clusters of the training samples on D obtained by the K-means clustering algorithm, let the set of feature distribution clusters be C = {c1, c2, ..., cw}. Let sh be the h-th sample in S and cj the j-th cluster; select k prototypes from cj with the MMD-critic algorithm to form the set Xj = {xj,1, xj,2, ..., xj,k}, where xj,p is the p-th prototype in Xj. Using the feature vector (i.e., the output vector) of sample sh, compute the distance between sh and every prototype in Xj, and take the average of these distances as the feature difference value FDFh,j of sample sh on cluster cj:

$$\mathrm{FDF}_{h,j} = \frac{1}{k}\sum_{p=1}^{k} \operatorname{dist}\left(s_h,\; x_{j,p}\right)$$

The set of feature difference values of sample sh over the cluster set C = {c1, c2, ..., cw} is then FDFh = {FDFh,1, FDFh,2, ..., FDFh,w}.

3) Calculate the feature dispersion (FDS) of each training sample from its output vector. Referring to FIG. 4, this specifically includes:

Feature dispersion (FDS) calculation: given the training set R and the model under test D, where R has n categories, suppose the predicted probability vector of a training sample r (r ∈ R) on D is P(r) = <pr,1, pr,2, ..., pr,n>, where pr,n is the probability that r is predicted as the n-th category. Suppose the true category of r is the t-th category; from the predicted probability pr,t of the t-th category, construct a reference vector of length n, P′(r) = <pr,t, pr,t, ..., pr,t>, where pr,t is the probability that r is predicted as the t-th category. The feature dispersion FDS of training sample r is calculated as:

$$\mathrm{FDS}(r) = \left\lVert P(r) - P'(r) \right\rVert_2 = \sqrt{\sum_{i=1}^{n} \left(p_{r,i} - p_{r,t}\right)^2}$$

The higher the FDS value of a training sample, the more dispersed its feature distribution.

4) Build a ranking model based on a learning-to-rank algorithm, and use it to predict and rank test samples; this specifically includes:

41) Construction of the new training set: denote the new training set FP = {fp1,2, fp2,3, ..., fpn-1,n}, where fpn-1,n is a feature pair formed from the feature difference values of the original training samples rn-1 and rn. Given a feature pair fpi,j formed from the feature difference values of original training samples ri and rj, written fpi,j = <FDFi, FDFj>, its label labeli,j takes the value -1, 0, or 1 and is determined by the relative magnitude of the feature dispersions of ri and rj; the label reflects the relative model-training capability of the two original samples in the pair. When FDSi > FDSj, labeli,j = 1, indicating that the features of ri are more dispersed than those of rj and ri trains the model more strongly; when FDSi = FDSj, labeli,j = 0, indicating that the feature distributions of ri and rj are equally dispersed and the two samples train the model equally; when FDSi < FDSj, labeli,j = -1, indicating that the features of rj are more dispersed than those of ri and rj trains the model more strongly. The new training set FP is fed into the xgboost ranking algorithm, which uses a tree ensemble with the number of trees set to m: new trees are added by repeatedly splitting on the basic features of the training samples so as to learn more complex features from the training set, while the prediction residuals are fitted continually to minimize the training error. After the set number of m trees has been generated, construction of the ranking model is complete.

42) Construction of the new test set: new test samples are built by extracting the feature difference value set of each test sample. Supposing the feature difference value set of a test sample ti is FDFi, the new test sample built from ti is a feature pair containing only this one set of feature difference values, i.e., <FDFi>.

43) Prediction of test samples: each new test sample is fed into the constructed ranking model, and the score values of the leaf nodes reached by each feature of the new test sample in every tree of the ranking model are summed to obtain its predicted score, which reflects how strongly the test sample correlates with revealing model errors; the larger the predicted score, the more readily the test sample reveals a model error. Test samples with larger predicted scores are placed first, which ranks the test samples.

5) Generate a synthetic test set with adversarial attack methods, rank it with the ranking model, set a sampling ratio, and select the top-ranked test samples to form the test subset; this specifically includes: perturbing every test sample in the test set with the five adversarial attack methods FGSM, BIM-A, BIM-B, CW, and JSMA to generate an adversarial sample set; randomly taking 20% adversarial samples from the adversarial sample set and 80% test samples from the test set to form synthetic test sets with different data distributions; ranking each synthetic test set with the ranking model; and, given a sampling ratio, selecting the top-ranked test samples from the ranked synthetic test set in that proportion to form the test subset, completing test sample selection.

In the example:

The relatively mature image dataset CIFAR10 and the DNN network model VGG-16 are selected. ResNet-20 is trained on the CIFAR10 training set, giving a trained model (denoted L for short). The CIFAR10 test set is slightly perturbed with the five adversarial attack methods FGSM, BIM-A, BIM-B, CW, and JSMA, and 20% perturbed images are combined with 80% original images into a new test set. The method of the present invention samples 1% of the synthetic test set (10,000 test images), i.e., 1,000 effective test samples are selected from the test set to form a small test subset, so as to detect the faults of model L fully and effectively, improve model accuracy, and effectively mitigate the impact of data distribution shift on model quality. The specific implementation process is as follows:

1. Set the number of clusters k = 10 and cluster the CIFAR10 training samples with the K-means algorithm, obtaining 10 distinct feature distribution clusters C = {c1, c2, ..., c10}.

2. Select 1% of the training samples from the CIFAR10 training set to form the training subset R, input the selected training samples into the model under test to obtain their output vectors, i.e., the feature vectors of the training samples, and construct each sample's reference vector from its predicted value for its true category. Compute the feature dispersion (FDS) of each training sample from its feature vector and reference vector.

3. Using the MMD-critic algorithm, select 5 prototypes from each feature distribution cluster to represent the whole cluster approximately (a prototype-selection sketch is given after step 5 below). Given the training subset R and the synthetic test set T, obtain the sample set S (S = R ∪ T). Given the h-th sample sh (sh ∈ S) and a cluster cj, compute the Euclidean distances between sh and the 5 prototypes of cj from the output vector of sh on the model under test L, and take the average of these Euclidean distances as the feature difference value FDFh,j of sample sh on cluster cj. By analogy, the set of feature difference values of sh over the 10 feature clusters is FDFh = {FDFh,1, FDFh,2, ..., FDFh,10}.

4. From the feature difference value sets of any two training samples in the training subset R, construct feature pairs, and use the set of feature pairs as a new training set to train the learning-to-rank-based ranking model. Then take the feature difference value set of each test sample in the synthetic test set T as a new test sample for the ranking model. Input the new test samples into the ranking model, which automatically scores each test sample according to its feature difference values. The magnitude of the score reflects how strongly the test sample correlates with revealing errors of model L; the test samples are ranked by this correlation, and the higher a test sample ranks, the better it reveals errors of model L.

5. Select the top 1% of test samples from the ranked test set, label them, feed them into the original model under test L for retraining, run the test again, and report the improvement in model accuracy after retraining, thereby demonstrating the effectiveness of the present invention.
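The following is a simplified sketch of the prototype selection referenced in step 3: it greedily adds the point that most reduces the squared maximum mean discrepancy (MMD) between a cluster and its prototype set under an RBF kernel. This is our re-implementation of the idea, not the reference MMD-critic code, and the kernel width is an assumption:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def select_prototypes(cluster_feats, k=5, gamma=1.0):
    # Greedily pick k points whose inclusion maximizes the prototype-dependent
    # part of -MMD^2 between the cluster and the prototype set.
    K = rbf_kernel(cluster_feats, cluster_feats, gamma)
    n = len(cluster_feats)
    chosen = []
    for _ in range(k):
        best_obj, best_i = -np.inf, -1
        for i in range(n):
            if i in chosen:
                continue
            cand = chosen + [i]
            m = len(cand)
            obj = 2.0 / (n * m) * K[:, cand].sum() - K[np.ix_(cand, cand)].sum() / m**2
            if obj > best_obj:
                best_obj, best_i = obj, i
        chosen.append(best_i)
    return cluster_feats[chosen]
```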

The present invention also provides a deep learning test sample selection system based on feature distribution analysis, comprising:

a feature distribution partitioning module, for partitioning the feature distribution of training samples with a clustering method to obtain feature distribution clusters of the training samples;

a feature difference value calculation module, for calculating, from the feature distribution clusters of the training samples, the feature difference value of each training/test sample on each feature distribution cluster;

a feature dispersion calculation module, for calculating the feature dispersion of each training sample from its output vector;

a ranking model construction module, for building a ranking model based on a learning-to-rank algorithm and using it to predict and rank test samples;

a generation module, for generating a synthetic test set with adversarial attack methods; and

a test sample selection module, for ranking the synthetic test set with the ranking model, setting a sampling ratio, and selecting the top-ranked test samples to form the test subset.

The present invention has many specific application paths. The above is only a preferred embodiment of the present invention; it should be pointed out that those of ordinary skill in the art can make several improvements without departing from the principle of the present invention, and these improvements should also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A deep learning test sample selection method based on feature distribution analysis, characterized by comprising the following steps:
1) partitioning the feature distribution of the training samples with a clustering method to obtain feature distribution clusters of the training samples;
2) calculating the feature difference value of each training/test sample on each feature distribution cluster according to the feature distribution clusters of the training samples obtained in step 1);
3) calculating the feature dispersion of each training sample according to its output vector;
4) constructing a ranking model based on a learning-to-rank algorithm, and using it to predict and rank test samples; and
5) generating a synthetic test set with adversarial attack methods, ranking the synthetic test set with the ranking model, setting a sampling ratio, and selecting the top-ranked test samples to form a test subset.
2. The method for selecting deep learning test samples based on feature distribution analysis according to claim 1, wherein step 1) specifically comprises: inputting all training samples into a DNN model for prediction to obtain their output vectors, namely the feature vectors of all training samples; and clustering the training samples by their feature vectors with the K-means clustering algorithm to partition the feature distributions of the different categories of training samples on the corresponding DNN model, obtaining a plurality of feature distribution clusters, wherein each feature distribution cluster represents one type of feature distribution, and training samples under the same feature distribution cluster have the same type of feature distribution.
3. The method for selecting deep learning test samples based on feature distribution analysis according to claim 1, wherein step 2) specifically comprises:
calculating the feature difference value: given a sample set S of size m, a model under test D, and w feature distribution clusters of the training samples on D obtained by the K-means clustering algorithm, the set of feature distribution clusters being C = {c1, c2, ..., cw}; letting sh be the h-th sample in the sample set S and cj the j-th cluster, selecting k prototypes from cj according to the MMD-critic algorithm to form the set Xj = {xj,1, xj,2, ..., xj,k}, where xj,p is the p-th prototype in Xj; and computing, from the feature vector of sample sh, the distances between sh and all prototypes in Xj, and taking the average of all the distances as the feature difference value FDFh,j of sample sh on cluster cj, calculated as:

$$\mathrm{FDF}_{h,j} = \frac{1}{k}\sum_{p=1}^{k} \operatorname{dist}\left(s_h,\; x_{j,p}\right)$$

the set of feature difference values of sample sh over the cluster set C = {c1, c2, ..., cw} then being FDFh = {FDFh,1, FDFh,2, ..., FDFh,w}.
4. The method for selecting deep learning test samples based on feature distribution analysis according to claim 1, wherein step 3) specifically comprises:
calculating the feature dispersion: given a training set R and a model under test D, R having n categories, supposing the predicted probability vector of a training sample r (r ∈ R) on D is P(r) = <pr,1, pr,2, ..., pr,n>, where pr,n denotes the probability that training sample r is predicted as the n-th category; supposing the true category of training sample r is the t-th category, constructing, from the predicted probability pr,t of the t-th category, a reference vector of length n, P′(r) = <pr,t, pr,t, ..., pr,t>, where pr,t denotes the probability that training sample r is predicted as the t-th category; the feature dispersion (FDS) of training sample r being calculated as:

$$\mathrm{FDS}(r) = \left\lVert P(r) - P'(r) \right\rVert_2 = \sqrt{\sum_{i=1}^{n} \left(p_{r,i} - p_{r,t}\right)^2}$$

wherein the higher the FDS value of a training sample, the more dispersed its feature distribution.
5. The method for selecting deep learning test samples based on feature distribution analysis according to claim 1, wherein step 4) specifically comprises:
41) construction of a new training set: the new training set being denoted FP = {fp1,2, fp2,3, ..., fpn-1,n}, where fpn-1,n denotes a feature pair formed from the feature difference values of original training samples rn-1 and rn; given a feature pair fpi,j formed from the feature difference values of original training samples ri and rj, written fpi,j = <FDFi, FDFj>, the label labeli,j of the feature pair taking the value -1, 0, or 1 and being determined by the relative magnitude of the feature dispersions of ri and rj, the label reflecting the relative model-training capability of the original training samples in the pair: when FDSi > FDSj, labeli,j = 1, indicating that the features of ri are more dispersed than those of rj and ri trains the model more strongly; when FDSi = FDSj, labeli,j = 0, indicating that the feature distributions of ri and rj are equally dispersed and the two samples train the model equally; and when FDSi < FDSj, labeli,j = -1, indicating that the features of rj are more dispersed than those of ri and rj trains the model more strongly; and inputting the obtained new training set FP into the xgboost ranking algorithm to complete construction of the ranking model;
42) construction of a new test set: constructing new test samples by extracting the feature difference value set of each test sample; supposing the feature difference value set of a test sample ti is FDFi, the new test sample constructed from ti being a feature pair containing only this one feature difference value set, namely <FDFi>;
43) prediction of the test samples: inputting the new test samples into the constructed ranking model, and summing the score values of the leaf nodes reached by each feature of a new test sample in every tree of the ranking model to obtain the predicted score of each test sample; and ranking the test samples by placing those with larger predicted scores first.
6. The method for selecting deep learning test samples based on feature distribution analysis according to claim 1, wherein step 5) specifically comprises: perturbing each test sample in the test set with the five adversarial attack methods FGSM, BIM-A, BIM-B, CW, and JSMA to generate an adversarial sample set; randomly selecting 20% adversarial samples from the adversarial sample set and 80% test samples from the test set to form synthetic test sets with different data distributions; and ranking each synthetic test set with the ranking model, setting a sampling ratio, and selecting, in that proportion, the top-ranked test samples from the ranked synthetic test set to form a test subset, completing the selection of test samples.
7. A deep learning test sample selection system based on feature distribution analysis, characterized by comprising:
a feature distribution partitioning module, for partitioning the feature distribution of training samples with a clustering method to obtain feature distribution clusters of the training samples;
a feature difference value calculation module, for calculating the feature difference value of each training/test sample on each feature distribution cluster according to the feature distribution clusters of the training samples;
a feature dispersion calculation module, for calculating the feature dispersion of each training sample according to its output vector;
a ranking model construction module, for constructing a ranking model based on a learning-to-rank algorithm and using it to predict and rank test samples;
a generation module, for generating a synthetic test set with adversarial attack methods; and
a test sample selection module, for ranking the synthetic test set with the ranking model, setting a sampling ratio, and selecting the top-ranked test samples to form a test subset.
8. A selection terminal, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the selection method according to any one of claims 1 to 6.
CN202211344257.XA 2022-10-31 2022-10-31 Deep learning test sample selection method and system based on feature distribution analysis Active CN115905860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211344257.XA CN115905860B (en) 2022-10-31 2022-10-31 Deep learning test sample selection method and system based on feature distribution analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211344257.XA CN115905860B (en) 2022-10-31 2022-10-31 Deep learning test sample selection method and system based on feature distribution analysis

Publications (2)

Publication Number Publication Date
CN115905860A (en) 2023-04-04
CN115905860B (en) 2025-09-19

Family

ID=86490356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211344257.XA Active CN115905860B (en) 2022-10-31 2022-10-31 Deep learning test sample selection method and system based on feature distribution analysis

Country Status (1)

Country Link
CN (1) CN115905860B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6513025B1 (en) * 1999-12-09 2003-01-28 Teradyne, Inc. Multistage machine learning process
CN110334742A (en) * 2019-06-10 2019-10-15 浙江大学 A Graph Adversarial Example Generation Method by Adding False Nodes Based on Reinforcement Learning
CN114997266A (en) * 2022-04-22 2022-09-02 苏州大学 Feature migration learning method and system for speech recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6513025B1 (en) * 1999-12-09 2003-01-28 Teradyne, Inc. Multistage machine learning process
CN110334742A (en) * 2019-06-10 2019-10-15 浙江大学 A Graph Adversarial Example Generation Method by Adding False Nodes Based on Reinforcement Learning
CN114997266A (en) * 2022-04-22 2022-09-02 苏州大学 Feature migration learning method and system for speech recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李念 (Li Nian); 廖闻剑 (Liao Wenjian); 彭艳兵 (Peng Yanbing): "Sequence selection incremental learning method with dynamic confidence" (动态置信度的序列选择增量学习方法), 计算机系统应用 (Computer Systems & Applications), no. 02, 15 February 2016 (2016-02-15) *

Also Published As

Publication number Publication date
CN115905860B (en) 2025-09-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant