CN107256245A - Offline model improvement and selection method for spam message classification - Google Patents
- Publication number
- CN107256245A (application CN201710409006.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- classification
- word
- training
- spam messages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Description
Technical Field
The present invention relates to an offline text-classification algorithm, and in particular to an offline model improvement and selection method for spam SMS classification. It belongs to the technical field of content-based spam SMS identification.
Background Art
The most important task in text classification is selecting and training the classification model; classification performance depends on the model to a large degree. Recently, researchers have proposed a wide variety of text-classification models based on machine learning combined with theory from statistics, informatics, and other disciplines.
The naive Bayes classification algorithm is a statistics-based machine-learning method widely applied to text classification. It rests on the feature-independence assumption; although features in real problems are often correlated, the assumption simplifies the computation of the naive Bayes model. On content-based spam SMS classification, the naive Bayes algorithm has achieved very good predictive performance.
Decision-tree classification is another common algorithm for text classification. It learns a decision tree from the training data set: each node in the tree corresponds to a feature, each branch of a node corresponds to a split on that node's feature, and the leaf nodes correspond to class labels. Many tree-construction methods exist, such as the ID3 algorithm based on information gain, the C4.5 algorithm based on the information-gain ratio, and the CART algorithm based on the Gini index. Applied to text classification, the decision tree yields a set of rules; judging the corresponding features of a test text along these rules ultimately determines the test text's class.
The perceptron was first applied to text classification by Schutze et al. Since then, perceptron algorithms for text classification have been extensively improved and optimized, for example POSITIVE WINNOW, BALANCED WINNOW, and WIDROW-HOFF. The perceptron is in fact the simplest neural network; the difference between the two is that the perceptron learns a linear classification model while a neural network learns a nonlinear one, yet the perceptron can achieve classification performance close to a neural network's at a lower training time complexity.
The KNN algorithm selects the k training samples nearest to the test sample under a distance metric and decides the test sample's class by majority vote. It requires no training, but its classification error is relatively large: if k is chosen too small, it is easily affected by noisy data; if k is chosen too large, training samples far from (dissimilar to) the test sample also influence the prediction and produce wrong results. In text classification, models combining KNN with other classification algorithms are used more often, such as nearest neighbor with clustering, or nearest neighbor with maximum a posteriori estimation.
The support vector machine (SVM) classification algorithm is widely used in text classification, and extensive experiments show that the SVM is a highly accurate classification model.
Recently, ensemble classifiers have attracted increasing attention. Their basic idea, "three cobblers together outmatch Zhuge Liang," is that the joint prediction of multiple classifiers is more trustworthy than that of a single one: several weak classifiers are learned, and the classification results of each weak classifier are combined into the final prediction. The main classifier-combination rules include Majority Voting, dynamic classifier selection, Weighted Linear Combination, and Adaptive Combination. AdaBoost is one algorithm for building ensemble classifiers: it learns multiple weak classifiers by dynamically changing the sample weight distribution, combines them by the weighted linear combination rule, and computes each weight from the weak classifier's classification error rate.
For static SMS data in which the characteristic information of spam messages is relatively stable, the present invention improves and selects among offline classification algorithms, namely LR, AdaBoost decision trees, SVM, and GBDT; it proposes evaluation indicators oriented to spam SMS classification and performs offline model selection based on these indicators.
Summary of the Invention
Purpose of the invention: based on a static SMS data set in which the characteristic information of spam messages is relatively stable, the present invention proposes the improvement and selection of offline classification algorithms, aiming to obtain the optimal spam classification model. The offline classification algorithms and improvements include: LR based on a specific data format, which lowers the time and space resource consumption of the LR model and raises training and testing efficiency; and an AdaBoost decision tree with differentiated loss, which reflects the fact that in spam identification misjudging a normal message costs more than misjudging a spam message. Because AdaBoost dynamically updates the sample weight distribution while iteratively generating weak classifiers, the weights of misclassified samples can be adjusted differentially during the iterations, so that samples that were normal messages misjudged as spam in the previous round receive greater attention in the current round than misjudged spam messages. Also included are SVM, which has certain advantages in handling nonlinearly separable problems and in classification performance, and GBDT, which performs feature selection automatically, uses only part of the features in each iteration, and performs very well; their parameter-tuning methods are also introduced. Secondly, evaluation indicators oriented to spam classification are proposed, the offline classification algorithms undergo parameter-tuning training and testing, and the offline model is selected based on the evaluation indicators.
Technical solution: an offline model improvement and selection method for spam SMS classification, comprising the following four aspects:
(1) SMS text preprocessing. The main steps are word segmentation, unified conversion of the SMS text to simplified Chinese, conversion of desensitized strings such as phone numbers to a single character, and stop-word removal;
(2) Feature selection and extension. Based on the preprocessing result of step (1), select features with the feature-selection methods, construct the feature word vector, and represent the original SMS text with the feature word-vector model;
(3) Tuning, training, and testing of the offline classification algorithms and their improvements. Improve the offline classification algorithms for spam classification, prepare the training and test sets obtained in step (2) according to each algorithm and improvement, and use the training set to tune, train, and test each offline algorithm and improvement;
(4) Offline classification algorithm selection based on evaluation indicators. Propose evaluation indicators oriented to spam classification, use them to analyze the test results obtained in step (3), and select the optimal offline classification algorithm.
Content (1), SMS text preprocessing, comprises word segmentation, unified conversion of the SMS text to simplified Chinese, conversion of desensitized strings such as numbers to a single character, and stop-word removal, specifically:
(1.1) Segment the SMS text with Ansj, keeping the part-of-speech tags;
(1.2) Convert the SMS text uniformly to simplified Chinese and convert desensitized strings such as numbers to a single character;
(1.3) Remove stop words according to the stop-word list.
Content (2), feature selection and extension: based on the preprocessing result of step (1), select features with the feature-selection methods, construct the feature word vector, and represent the original SMS text with the feature word-vector model, specifically:
(2.1) Frequent-word feature selection based on a statistical threshold and average information gain. The threshold is a tunable parameter; frequent words are selected as the feature word set according to the threshold, and whether to keep adjusting the threshold is decided from the change in the feature word set's average information gain;
(2.2) Two-character-word and compound-word feature selection based on the N-Gram algorithm. Generate text-fragment sequences with the N-Gram algorithm, filter out infrequent sequences with the optimal statistical threshold obtained in step (2.1), and build an association matrix from the remaining sequences, in which each element is the occurrence frequency in the spam texts of the sequence formed by combining the row and column entries; then screen the combined text sequences according to given criteria;
(2.3) Combine non-modifying content words into tuple features: traverse all spam texts looking for noun + verb/adjective combinations, and screen the resulting tuple features according to given criteria;
(2.4) Feature selection based on cumulative information gain. From the merged word and compound-word features obtained in the steps above, select the feature words whose cumulative information gain reaches 95% of the total information gain of the original feature words, and construct the feature word vector from them.
Content (3), tuning, training, and testing of the offline classification algorithms and their improvements: improve the offline classification algorithms for spam classification, prepare the training and test sets obtained in step (2) according to each algorithm and improvement, and use the training set to tune, train, and test each offline algorithm and improvement, specifically:
(3.1) Improve the offline classification algorithms for spam classification. The first improvement is LR based on a specific data format, label index1:value1 index2:value2 ..., where label is the instance class label, usually an integer such as 0 or 1, index is the ordered index of a non-zero feature, and value is the corresponding feature value. With this format, LR computes the inner product of the coefficient vector and an instance as w·xi = Σj w(indexj)·valuej = Σj w(indexj), where w is the coefficient vector (matrix), xi is the i-th instance vector (matrix), and indexj ranges over the subscripts of the non-zero elements of xi; since the invention uses a 0/1 dictionary model, every non-zero element equals 1. The second improvement is the AdaBoost decision tree with differentiated loss: in spam classification, misjudging a normal message costs more than misjudging a spam message, so a differentiated loss is introduced. When the training-sample weights are updated in each iteration, a sample correctly classified in the previous iteration is updated as wm+1,i = wm,i·exp(-αm)/Zm, and a sample misclassified in the previous iteration is updated as wm+1,i = wm,i·exp(αm)/Zm, with the weights of normal messages misjudged as spam enlarged further.
Here wm,i is the weight of the i-th instance in the m-th iteration, Zm is the normalization factor, and αm is the weight of the m-th base classifier.
(3.2) Prepare the training and test sets obtained in step (2) according to each offline classification algorithm and improvement;
(3.3) Use the training set to tune, train, and test each offline algorithm and improvement. The SVM model parameters are tuned by cross-validation, and the optimal GBDT parameters are found by grid search. The basic idea is to tune in order of parameter importance. If only one parameter is tuned, construct a parameter vector from its value range, traverse all values in the vector, and pick the best according to the prediction results. If two parameters are tuned at the same time, construct a two-dimensional parameter matrix from their value ranges, shaped like a grid in which each cell corresponds to one combination of the two values; traverse all cells and select the optimal combination based on the prediction results. For LR and AdaBoost, the optimal model is obtained by adjusting the number of iterations; finally, each optimal model is used to predict the test set.
Content (4), offline classification algorithm selection based on evaluation indicators: propose evaluation indicators oriented to spam classification, use them to analyze the test results obtained in step (3), and select the optimal offline classification algorithm, specifically:
(4.1) Propose evaluation indicators oriented to spam classification: accuracy; precision = TP/(TP + FP); recall = TP/(TP + FN); and F2-Measure = 5 × precision × recall / (4 × precision + recall). Here TP is the number of samples whose true class is 1 (spam) and that are predicted 1, FP is the number of samples whose true class is 0 but predicted 1, and FN is the number of samples whose true class is 1 but predicted 0;
(4.2) Use the evaluation indicators proposed in step (4.1) to analyze the test results obtained in step (3) and select the optimal offline classification algorithm.
By adopting the above technical scheme, the present invention has the following beneficial effects:
1. Text preprocessing improves the accuracy and effectiveness of feature selection on the one hand, and avoids the loss of important information on the other;
2. The features selected by the method based on a statistical threshold and average information gain are more discriminative than those from a method based on a statistical threshold alone. Using N-Gram-based selection of two-character words and compound words, and combining non-modifying content words into tuple features, effectively avoids the information loss of selecting features from the segmentation results alone, and the selected compound-word features describe the information peculiar to spam messages more accurately. Since the average information gain cannot represent every feature word, a further selection over the merged feature words based on cumulative information gain is proposed, which overcomes the difficulty of setting a threshold directly on information gain;
3. The improvement and selection of the offline classification algorithms effectively raise spam-identification performance, and the optimal offline model attains very high identification accuracy and efficiency.
Brief Description of the Drawings
Figure 1 shows line charts of the LR classification results against the number of iterations: (a) each indicator against the number of iterations; (b) F2-Measure against the number of iterations;
Figure 2 shows the training results when the LIBSVM parameter C is set to 8;
Figure 3 shows the result of optimizing the LIBSVM parameters C and g;
Figure 4 shows training the support vector machine model on the training set with the optimal parameters C and g;
Figure 5 shows the ROC curve of the AdaBoost classification results after 10 iterations;
Figure 6 shows the ROC curve of the AdaBoost classification results after 20 iterations.
Detailed Description
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present invention, modifications of its various equivalent forms by those skilled in the art all fall within the scope defined by the claims appended to this application.
The offline model improvement and selection method for spam SMS classification comprises the following four aspects:
(1) SMS text preprocessing. The main steps are word segmentation, unified conversion of the SMS text to simplified Chinese, conversion of desensitized strings such as phone numbers to a single character, and stop-word removal (a preprocessing sketch follows step (1.3) below);
(1.1) Segment the SMS text with Ansj, keeping the part-of-speech tags;
(1.2) Convert the SMS text uniformly to simplified Chinese and convert desensitized strings such as numbers to a single character;
(1.3) Remove stop words according to the stop-word list.
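A minimal sketch of steps (1.1)-(1.3) follows. Ansj is a Java segmenter, so the sketch assumes the tokens are already produced elsewhere; the OpenCC bindings for simplified-Chinese conversion, the stop-word file name, and the regular expression for desensitized strings are illustrative assumptions, not taken from the patent.

```python
import re
from opencc import OpenCC  # traditional -> simplified conversion (assumed choice)

cc = OpenCC("t2s")
# Stop-word list loaded from a file; the file name is an assumption.
STOPWORDS = set(open("stopwords.txt", encoding="utf-8").read().split())
# Illustrative pattern for desensitized strings: masked runs or long digit runs.
MASKED = re.compile(r"[x*]{3,}|\d{5,}")

def preprocess(tokens):
    """tokens: (word, pos) pairs from a segmenter such as Ansj (step 1.1)."""
    out = []
    for word, pos in tokens:
        word = cc.convert(word)        # unify to simplified Chinese (step 1.2)
        word = MASKED.sub("#", word)   # desensitized string -> one char (step 1.2)
        if word in STOPWORDS:          # stop-word removal (step 1.3)
            continue
        out.append((word, pos))
    return out
```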
(2) Feature selection and extension. Based on the preprocessing result of step (1), select features with the feature-selection methods, construct the feature word vector, and represent the original SMS text with the feature word-vector model;
(2.1) Frequent-word feature selection based on a statistical threshold and average information gain. The threshold is a tunable parameter; frequent words are selected as the feature word set according to the threshold, and whether to keep adjusting the threshold is decided from the change in the feature word set's average information gain;
(2.2) Two-character-word and compound-word feature selection based on the N-Gram algorithm. Generate text-fragment sequences with the N-Gram algorithm, filter out infrequent sequences with the optimal statistical threshold obtained in step (2.1), and build an association matrix from the remaining sequences, in which each element is the occurrence frequency in the spam texts of the sequence formed by combining the row and column entries; then screen the combined text sequences according to given criteria;
(2.3) Combine non-modifying content words into tuple features: traverse all spam texts looking for noun + verb/adjective combinations, and screen the resulting tuple features according to given criteria;
(2.4) Feature selection based on cumulative information gain. From the merged word and compound-word features obtained in the steps above, select the feature words whose cumulative information gain reaches 95% of the total information gain of the original feature words (as sketched below), and construct the feature word vector from them.
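A sketch of the 95% cumulative-information-gain cutoff of step (2.4), assuming per-word binary document counts for the two classes are already available; the information gain of a word w is computed as H(C) - H(C|w) in the usual way.

```python
import math

def entropy(pos, neg):
    """Entropy of a two-class count pair."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(n11, n10, n01, n00):
    """n11: spam docs containing w; n10: normal docs containing w;
    n01: spam docs without w; n00: normal docs without w."""
    n = n11 + n10 + n01 + n00
    h_c = entropy(n11 + n01, n10 + n00)                 # H(C)
    h_c_w = ((n11 + n10) / n) * entropy(n11, n10) \
          + ((n01 + n00) / n) * entropy(n01, n00)       # H(C|w)
    return h_c - h_c_w

def select_by_cumulative_ig(word_counts, ratio=0.95):
    """word_counts: {word: (n11, n10, n01, n00)}. Keep the top words whose
    cumulative IG reaches `ratio` of the total IG (step 2.4)."""
    gains = sorted(((information_gain(*c), w) for w, c in word_counts.items()),
                   reverse=True)
    total = sum(g for g, _ in gains)
    selected, acc = [], 0.0
    for g, w in gains:
        if acc >= ratio * total:
            break
        selected.append(w)
        acc += g
    return selected
```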
(3) Tuning, training, and testing of the offline classification algorithms and their improvements. Improve the offline classification algorithms for spam classification, prepare the training and test sets obtained in step (2) according to each algorithm and improvement, and use the training set to tune, train, and test each offline algorithm and improvement;
(3.1) Improve the offline classification algorithms for spam classification. The first improvement is LR based on the specific data format label index1:value1 index2:value2 ..., where label is the instance class label, usually an integer such as 0 or 1, index is the ordered index of a non-zero feature, and value is the corresponding feature value. With this format, LR computes the inner product of the coefficient vector and an instance as w·xi = Σj w(indexj)·valuej, which reduces to Σj w(indexj) because the invention uses a 0/1 dictionary model. The second improvement is the AdaBoost decision tree with differentiated loss: in spam classification, misjudging a normal message costs more than misjudging a spam message, so a differentiated loss is introduced. When the training-sample weights are updated in each iteration, a sample correctly classified in the previous iteration is updated as wm+1,i = wm,i·exp(-αm)/Zm, and a sample misclassified in the previous iteration is updated as wm+1,i = wm,i·exp(αm)/Zm, with the weights of normal messages misjudged as spam enlarged further.
(3.2) Prepare the training and test sets obtained in step (2) according to each offline classification algorithm and improvement;
(3.3) Use the training set to tune, train, and test each offline algorithm and improvement. The SVM model parameters are tuned by cross-validation, and the optimal GBDT parameters are found by grid search. The basic idea is to tune in order of parameter importance. If only one parameter is tuned, construct a parameter vector from its value range, traverse all values in the vector, and pick the best according to the prediction results. If two parameters are tuned at the same time, construct a two-dimensional parameter matrix from their value ranges, shaped like a grid in which each cell corresponds to one combination of the two values; traverse all cells and select the optimal combination based on the prediction results. For LR and AdaBoost, the optimal model is obtained by adjusting the number of iterations; finally, each optimal model is used to predict the test set.
(4) Offline classification algorithm selection based on evaluation indicators. Propose evaluation indicators oriented to spam classification, use them to analyze the test results obtained in step (3), and select the optimal offline classification algorithm.
(4.1) Propose evaluation indicators oriented to spam classification: accuracy; precision = TP/(TP + FP); recall = TP/(TP + FN); and F2-Measure = 5 × precision × recall / (4 × precision + recall). Here TP is the number of samples whose true class is 1 (spam) and that are predicted 1, FP is the number of samples whose true class is 0 but predicted 1, and FN is the number of samples whose true class is 1 but predicted 0 (a code sketch of these indicators follows step (4.2) below);
(4.2) Use the evaluation indicators proposed in step (4.1) to analyze the test results obtained in step (3) and select the optimal offline classification algorithm.
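A minimal sketch of the indicators of step (4.1) in plain Python, with class 1 for spam and class 0 for normal messages; beta=2 gives the F2-Measure used throughout the experiments.

```python
def spam_metrics(y_true, y_pred, beta=2.0):
    """Evaluation indicators of step (4.1); class 1 = spam, class 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = len(y_true) - tp - fp - fn
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f_beta = ((1 + b2) * precision * recall / (b2 * precision + recall)
              if precision and recall else 0.0)
    return accuracy, precision, recall, f_beta

print(spam_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
# -> (0.6, 0.6667, 0.6667, 0.6667)
```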
Step (3), tuning, training, and testing of the offline classification algorithms and their improvements, is elaborated below:
1) Offline classification algorithms and improvements
1.1) LR based on a specific data format
For the offline logistic regression algorithm, if the data set has a large sample size and high feature dimensionality, memory consumption becomes very large because all the data must be loaded into memory. At the same time, SMS texts are usually short, so representing them with a word vector-space model typically yields an extremely sparse matrix. If the original matrix can be stored in compressed form, memory consumption drops and the training and testing efficiency of the LR model improves. This work borrows the LIBSVM data format to convert the original data set:
Table 1. LIBSVM data format example
label is the instance class label, usually an integer such as 0 or 1; index is the ordered index of a non-zero feature; value is the corresponding feature value. After conversion to this format, the data set shrinks from 2.07 GB to 52 MB, which shows how sparse the SMS matrix is. A converted example is shown in Table 2:
Table 2. Example of SMS data converted to the specific data format
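A small sketch of the conversion and of the sparse inner product of section (3.1): with a 0/1 dictionary vector, only the non-zero indices need to be stored, and w·x reduces to summing the coefficients at those indices.

```python
def to_libsvm_line(label, x):
    """x: dense 0/1 feature vector; emit 'label index:value' with 1-based
    indices for the non-zero entries only (the Table 1 format)."""
    parts = [str(label)]
    parts += [f"{j + 1}:1" for j, v in enumerate(x) if v]
    return " ".join(parts)

def sparse_dot(w, indices):
    """Inner product w·x for a 0/1 dictionary vector x: since every non-zero
    value is 1, it reduces to summing the coefficients at the non-zero
    indices (section 3.1)."""
    return sum(w[j] for j in indices)

print(to_libsvm_line(1, [0, 1, 0, 0, 1]))  # -> "1 2:1 5:1"
```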
1.2) SVM tuned by cross-validation
The original data set is usually not linearly separable; the cases include approximately linearly separable and linearly inseparable data, which the support vector machine handles by introducing slack variables and the kernel trick, respectively. Moreover, the optimization problem the SVM ultimately solves maximizes the geometric margin from the support vectors to the separating hyperplane, so the SVM seeks not merely a correct classification but a correct classification with the greatest confidence, which is why it can reach very high accuracy. In summary, the SVM fits more classification scenarios than logistic regression and is more accurate, but its computation is more complex and its training process more time-consuming. The present invention uses the LIBSVM package to train and test the SVM model and tunes the parameters C and g with S-fold cross-validation. The LIBSVM training command is used as follows:
svmtrain [options] training_set_file [model_file]
Here options sets the parameters, training_set_file is the training data file, and model_file stores the training result, i.e. the model file. The RBF kernel is used, and the parameters -g (gamma) and -c (cost) are tuned by setting the cross-validation option -v n: -g sets the kernel function's gamma (default 1/k, with k the feature dimensionality), -c sets the penalty coefficient (default 1), and -v n selects n-fold cross-validation. Finally, the SVM classification model is trained with the optimal parameters C and g and used to predict the test set.
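The same procedure can be sketched with LIBSVM's Python bindings instead of the command-line tools, assuming the pip `libsvm` package layout; the file names and the small C/g grid below are illustrative. With -v set, svm_train returns the cross-validation accuracy instead of a model.

```python
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

y, x = svm_read_problem("trainData")  # LIBSVM-format training file (assumed name)

# 5-fold cross-validation over a small grid of C and g (RBF kernel, -t 2).
best_acc, best_c, best_g = -1.0, None, None
for c in (1, 2, 4, 8, 16):
    for g in (0.125, 0.0625, 0.03125):
        acc = svm_train(y, x, f"-t 2 -c {c} -g {g} -v 5 -q")
        if acc > best_acc:
            best_acc, best_c, best_g = acc, c, g

# Train the final model with the optimal C and g, then predict the test set.
model = svm_train(y, x, f"-t 2 -c {best_c} -g {best_g} -q")
yt, xt = svm_read_problem("testData")
p_label, p_acc, p_val = svm_predict(yt, xt, model)
```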
1.3) AdaBoost decision tree with differentiated loss
The basic idea of the AdaBoost algorithm is to train multiple classifiers and decide the final classification by weighted voting. While iteratively producing each sub-classifier, it dynamically adjusts the sample weights and computes each sub-classifier's weight from its accuracy, so samples misclassified in the previous round receive greater attention in the current round; model training becomes more targeted and predictive performance higher. In spam classification, misjudging a normal message costs more than misjudging a spam message, so the sample weights must be modified dynamically during training: on top of increasing the weights of samples misjudged in the previous round, the weights of normal messages misjudged as spam are enlarged further, which AdaBoost makes easy to do. Decision trees are used as the weak classifiers because their training time complexity is low and pruning effectively prevents overfitting. The AdaBoost decision-tree classification algorithm with differentiated loss is:
Initialize M decision trees; initialize each tree's weight to 1/M and each training sample's weight to 1/N, where N is the number of training samples.
Loop M times, producing one decision tree per round:
Train the m-th decision tree Gm(x) on the training set with the decision-tree algorithm.
Compute the classification error rate of the m-th tree under the sample weight distribution: em = Σi wm,i·I(Gm(xi) ≠ yi).
Compute the weight of the m-th tree: αm = (1/2)·ln((1 - em)/em).
Update the training-sample weights:
if a sample is correctly classified by this round's classifier, wm+1,i = wm,i·exp(-αm)/Zm;
otherwise wm+1,i = wm,i·exp(αm)/Zm, enlarged further when a normal message was misjudged as spam.
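A runnable sketch of the algorithm above using scikit-learn decision trees as the weak classifiers. The extra penalty factor `fp_boost` for normal messages misjudged as spam is an assumption: the patent states the idea but does not give a numeric factor.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_diff_loss(X, y, M=20, fp_boost=1.5, max_depth=3):
    """AdaBoost with differentiated loss: misclassified samples get the usual
    exp(alpha) increase; normal messages (y=0) misjudged as spam (pred=1)
    are boosted further by fp_boost (an assumed factor)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    trees, alphas = [], []
    for m in range(M):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X, y, sample_weight=w)
        pred = tree.predict(X)
        err = np.sum(w[pred != y])
        if err <= 0 or err >= 0.5:   # stop on perfect or too-weak learner
            break
        alpha = 0.5 * np.log((1 - err) / err)
        # exp(-alpha) if correct, exp(alpha) if wrong, extra boost for
        # normal messages predicted as spam
        scale = np.where(pred == y, np.exp(-alpha), np.exp(alpha))
        scale = np.where((y == 0) & (pred == 1), scale * fp_boost, scale)
        w = w * scale
        w /= w.sum()                 # normalization factor Z_m
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas

def predict(trees, alphas, X):
    # weighted vote with labels mapped to {-1, +1}
    score = sum(a * (2 * t.predict(X) - 1) for t, a in zip(trees, alphas))
    return (score > 0).astype(int)
```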
1.4) GBDT with parameter grid search
In scikit-learn, GradientBoostingClassifier implements GBDT classification. As with AdaBoost, the important parameters fall into two groups: the first are the parameters of the boosting framework, and the second are those of the weak learner, i.e. the CART regression tree.
a) Boosting framework parameters of the GBDT class
n_estimators: the maximum number of iterations, i.e. the maximum number of weak learners. If n_estimators is too small, underfitting is likely; if too large, overfitting is likely, so a moderate value must be found by tuning. The default is 100. In practice it is usually tuned together with learning_rate.
learning_rate: the weight-shrinkage coefficient of each weak classifier, also called the step size. To avoid falling into a local optimum, each iteration trusts only part of the weak classifier and compensates by learning more classifiers. The parameter is also a form of regularization. After introducing shrinkage, the iterative formula of the strong classifier is:
fk(x) = fk-1(x) + ν·hk(x)
Here 0 < ν ≤ 1, fk(x) is the ensemble-classifier decision function obtained at the k-th iteration, ν is the shrinkage factor, and hk(x) is the weak classifier obtained at the k-th iteration. For the same training set, reaching the same fit with a smaller ν requires learning more weak classifiers, i.e. more iterations, so n_estimators and learning_rate are usually tuned together to optimize the algorithm's classification performance. ν usually starts from a small value; the default is 1.
subsample: the sub-sampling ratio, with value range (0, 1]. A value of 1 means all samples are used and no sub-sampling is done; a value between 0 and 1 means only a fraction of the samples is used to train GBDT's decision regression trees. Choosing a ratio below 1 prevents overfitting, reduces variance, and improves the model's generalization, i.e. its prediction accuracy on the test set. In return, it also increases the bias on the training samples, making the model fit the training set less well, so the value must not be too small; values in [0.5, 0.8] usually work well. The default is 1.0, i.e. no sub-sampling.
b) Weak-classifier parameters of the GBDT class
Since the weak classifiers GBDT uses are decision regression trees, the weak-classifier parameters come from these trees.
max_features, the maximum number of features considered at a split: it accepts values of several types. The default "None" means all features are considered at a split; "log2" means at most log2(N) features are considered; "sqrt" or "auto" means at most √N features, where N is the total number of sample features, i.e. the dimensionality; an integer gives the absolute number of features considered; a float gives the percentage of features considered, i.e. the rounded count at that percentage. If the number of features is small, say below 50, use the default; if it is very large, use one of the other values to limit the number of features considered per split and thereby control the time to build the decision regression trees.
max_depth, the maximum tree depth: if this parameter is unset, the algorithm does not limit tree depth when building the decision regression trees. When the data are small in scale, i.e. neither the sample count nor the feature dimensionality is large, the parameter can be left unset; but with a large sample size and high dimensionality, limiting the maximum depth controls the model's training time, prevents overfitting, and improves generalization. The specific value depends on the data distribution; common values range from 10 to 100.
min_samples_split, the minimum number of samples required to split an internal node: it limits the condition under which a subtree keeps splitting; if a node has fewer samples than min_samples_split, no further attempt is made to choose an optimal feature to split that node's data. The default is 2. With a small data set the parameter can be left unset; with a very large one, choosing a larger value improves the model's predictive performance.
min_samples_leaf, the minimum number of samples at a leaf: if a leaf node has fewer samples than this minimum, it is pruned together with its sibling nodes. The default is 1. It can be set as an absolute minimum sample count or as the minimum fraction of the total sample count. With a small data set it need not be set; with a large one, the value can be increased.
The present invention tunes the GBDT parameters with grid search. The basic idea is to tune in order of parameter importance: if only one parameter is tuned, construct a parameter vector from its value range, traverse all values in the vector, and pick the best by the prediction results; if two parameters are tuned at once, construct a two-dimensional parameter matrix from their value ranges, shaped like a grid in which each cell corresponds to one combination of the two values, traverse all cells, and select the optimal combination based on the prediction results.
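A sketch of this procedure with scikit-learn's GridSearchCV, following the order used in the experiments below: first the iteration count at a fixed learning_rate of 0.1, then a two-parameter grid over max_depth and min_samples_split. X_train and y_train are assumed to hold the prepared training set; the exact search ranges are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Step 1: fix learning_rate = 0.1 and grid-search the number of iterations.
gs1 = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1),
    param_grid={"n_estimators": range(20, 120, 10)},
    scoring="roc_auc", cv=5)
gs1.fit(X_train, y_train)
n_est = gs1.best_params_["n_estimators"]

# Step 2: a two-parameter grid, i.e. a 2-D "grid" of value combinations.
gs2 = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, n_estimators=n_est),
    param_grid={"max_depth": range(3, 14, 2),
                "min_samples_split": range(100, 801, 200)},
    scoring="roc_auc", cv=5)
gs2.fit(X_train, y_train)
print(gs2.best_params_, gs2.best_score_)
```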
2) Tuning, training, and testing of the offline classification algorithms
2.1) LR: search for the optimal regression coefficients by increasing the number of iterations, in order to obtain the best classification performance. Compute the value of each indicator, evaluate the LR classifier's performance with F2-Measure, and choose a reasonable iteration count in view of the overall time complexity.
Table 3. LR classification results
Analysis of Table 3 shows that the indicators peak at 200 iterations, decline between 200 and 400 iterations, and recover somewhat beyond 500 iterations while remaining below the optimum. This indicates that as the iterations increase, the logistic-regression weight coefficients gradually approach the global optimum; continuing to iterate past the optimum instead produces large fluctuations, caused by the linear inseparability of the original problem or by certain misclassified samples, although the phenomenon gradually stabilizes. This analysis is clearly visible in Figure 1.
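A minimal sketch of the iteration-count selection, using scikit-learn's LogisticRegression as a stand-in (the patent does not name its LR implementation) with F2 as the selection criterion; X_train, y_train, X_test, y_test are assumed to exist, and the iteration counts follow the range analyzed above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score

best_f2, best_iters = 0.0, None
for iters in (100, 200, 300, 400, 500):
    lr = LogisticRegression(max_iter=iters)
    lr.fit(X_train, y_train)
    f2 = fbeta_score(y_test, lr.predict(X_test), beta=2)  # F2-Measure
    if f2 > best_f2:
        best_f2, best_iters = f2, iters
print(best_iters, best_f2)
```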
2.2) SVM: in this experiment, the parameters -g (gamma) and -c (cost) are tuned by setting the cross-validation option -v n, the SVM classification model is then trained with the optimal parameters C and g, and finally the model predicts the test set. The training results with C set to 8 are shown in Figure 2.
In Figure 2, #iter is the number of iterations, rho is the constant term b of the decision function, nSV is the number of support vectors, and nBSV is the number of support vectors on the boundary. The trained model file contains the specific support-vector information, including the geometric margin from each support vector to the separating hyperplane. Table 4 gives partial results:
Table 4. Support vectors and their geometric margins
Because the penalty parameter and the kernel function matter most to the support vector machine, the parameters C and g are tuned below by cross-validation. Here cross validation (-v) is set to 5, i.e. 5-fold cross-validation is used to tune C and g. Specifically, grid.py in libsvm's python subdirectory must be modified; at the same time the gnuplot tool must be unpacked, and the gnuplot path inside the grid.py file under /libsvm/tools changed to the actual path. Then copy grid.py and the python.exe from the Python installation directory into libsvm/windows and run the command python grid.py trainData; once it finishes, the optimal parameters C and g are obtained. Running this experiment on the spam training data set yields the optimal C and g shown in Figure 3.
The experimental results in Figure 3 show that the optimal penalty parameter C is 8 and the optimal g is 0.03125; for the selected training samples, the SVM classification model trained with these parameters reaches an accuracy of 99.4848%. As C and g are tuned step by step during training, the accuracy rises and the model gradually converges. Training the SVM model with the optimal values of C and g is shown in Figure 4.
The trained SVM model is then tested by predicting 140,000 messages to be identified; the computed indicator values are shown in Table 5:
Table 5. LIBSVM test results
Table 5 shows that the overall classification performance of the SVM model is excellent; with the penalty parameter set to 8, the classification accuracy and F value are high. Relative to recall, however, we can accept spam being identified as normal messages, whereas the loss is greater when normal messages are misjudged and intercepted. We therefore pay more attention to precision, the proportion of true spam among the messages identified as spam: the larger this proportion the better, since a larger value means fewer normal messages among the identified spam. The SVM's training time suggests that, facing massive volumes of SMS text, the SVM classification model may not meet real-time requirements; if a lower-complexity model can achieve similar prediction accuracy and performance on every indicator, it should be preferred.
2.3) AdaBoost: iterate round by round, repeating the training and sample-weight adjustment until the training error rate reaches 0 or the number of iterations, i.e. of base classifiers, reaches a given count.
Table 6. Improved AdaBoost classification results
Table 6 shows that the improved AdaBoost converges quickly: after 20 iterations the indicator values no longer change. By the indicator data, AdaBoost outperforms LR and trails SVM. Judging by precision, the improved AdaBoost almost never identifies a normal message as spam. The experiments show that the improved AdaBoost achieves the expected effect, and its training time complexity is clearly lower than the support vector machine's.
Another tool for measuring class imbalance in classification is the ROC curve, for receiver operating characteristic. The ROC curve plots the false-positive rate and true-positive rate as the threshold varies: the upper-right corner corresponds to judging all samples as spam, and the lower-left corner to judging all samples as normal. The dashed line is the result of random guessing. A good classifier's ROC curve should lie as close to the upper-left corner as possible, i.e. a high true-positive rate at a very low false-positive rate; in spam identification, this means the trained classification model can identify almost all spam while rarely misjudging normal messages as spam.
In this experiment, the ROC curves after 10 and 20 iterations are shown in Figures 5 and 6. Each figure contains two lines, one dashed and one solid. The horizontal axis is the false-positive rate FP/(FP + TN), i.e. the proportion of originally normal messages identified as spam, where TN is the number of samples whose true class is 0 and that are predicted 0; the vertical axis is the true-positive rate TP/(TP + FN), i.e. the recall.
One summary statistic of the ROC curve is the area under the curve, AUC. The areas under both curves in the figures exceed 0.98, and it is evident that after 20 iterations the curve moves closer to the upper-left corner and the area grows. In terms of performance, the classifier is optimal at 20 iterations. The closer the AUC is to 1, the better the classifier; random guessing has an AUC of 0.5.
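A short sketch of the ROC/AUC computation with scikit-learn; y_test and the continuous decision values (e.g. the weighted vote `score` from the AdaBoost sketch above, before thresholding) are assumed to exist.

```python
from sklearn.metrics import roc_curve, auc

# y_test: true labels; scores: continuous decision values of the boosted
# classifier for the test set
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC =", auc(fpr, tpr))  # closer to 1 is better; 0.5 = random guessing
```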
2.4) GBDT: the step-by-step tuning in this experiment proceeds as follows. With every parameter left at its default, training on the training data gives Accuracy = 99.82% and AUC Score (Train) = 0.999899, so the model fits the training data well. Tuning is now used to improve the model's generalization. Start with the step size (learning_rate) and the number of iterations (n_estimators); in general, one first picks a small step size and grid-searches for the best iteration count. Here the initial step size is set to 0.1. Grid-searching the iteration count gives the results in Table 7:
Table 7. Results of the GBDT grid search for the optimal n_estimators
Table 7 shows that the optimal iteration count is 90, where the mean is largest at 0.9961 and the std smallest at only 0.00367. With the optimal iteration count found, the decision-tree parameters are tuned next, starting with a grid search over the maximum depth max_depth and the minimum samples for an internal split min_samples_split. The experimental results are shown in Table 8:
Table 8. Results of the GBDT grid search for the optimal max_depth and min_samples_split
Table 8 shows that the optimal maximum tree depth is 6 and the best minimum sample count for an internal split is 100. Since a tree depth of 6 is a fairly reasonable value, the optimal value of max_depth can be fixed at 6; min_samples_split cannot be settled yet, because this parameter interacts with other decision-tree parameters. Next, min_samples_split is tuned together with the minimum leaf sample count min_samples_leaf. The output is shown in Table 9:
表9 GBDT网格搜索最优min_samples_leaf和min_samples_split结果表Table 9 GBDT grid search optimal min_samples_leaf and min_samples_split result table
由表9可知,最优min_samples_leaf值为6,min_samples_split值为100,均 处在搜索边界值,还有进一步调试小于边界的必要。进一步将min_samples_leaf 的下限设为4,min_samples_split的下限设为60,并且上限与之前有交叉,便于 发现最优结果,调试结果如表10所示:It can be seen from Table 9 that the optimal min_samples_leaf value is 6, and the min_samples_split value is 100, both of which are at the search boundary value, and further debugging is necessary to be smaller than the boundary value. Further set the lower limit of min_samples_leaf to 4, and the lower limit of min_samples_split to 60, and the upper limit is intersected with the previous one, so as to find the optimal result. The debugging results are shown in Table 10:
表10 表9续进一步网格搜索调试Table 10 Table 9 Continued Further Grid Search Debugging
Table 10 shows that the final optimal values of the probed parameters min_samples_leaf and min_samples_split are unchanged, at 6 and 100 respectively. Training the GBDT model with the newly tuned parameters and predicting the test data yields Accuracy = 99.83% and AUC Score = 0.999989; compared with the initial run without any tuning, the accuracy has improved. A grid search is then performed over the subsampling ratio, with the output shown in Table 11:
Table 11 GBDT grid-search results for the optimal subsample
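A sketch of this re-training step under the same assumptions as the earlier snippets (scikit-learn estimator, placeholder train/test splits); the parameter values are the tuned ones reported above, and the subsample search behind Table 11 follows the same GridSearchCV pattern already shown.

```python
# Sketch: re-train with the parameters tuned so far and score the test split.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

gbm = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=90,
    max_depth=6, min_samples_split=100, min_samples_leaf=6,
    subsample=0.6,              # optimal ratio per Table 11
    random_state=10)
gbm.fit(X_train, y_train)       # placeholder training split
y_pred = gbm.predict(X_test)
y_prob = gbm.predict_proba(X_test)[:, 1]
print("Accuracy: %.4f" % accuracy_score(y_test, y_pred))
print("AUC Score: %.6f" % roc_auc_score(y_test, y_prob))
```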
The optimal subsampling ratio is 0.6. With essentially all parameters now tuned, learning_rate and n_estimators can be adjusted together to strengthen the model's generalization ability. The test results are shown in Table 12:
Table 12 GBDT results for the optimized number of iterations and step size
Table 13 GBDT optimal parameters
Table 13 shows that doubling the number of iterations while halving the learning rate actually lowers the AUC score slightly. This is deliberate: to strengthen generalization and prevent overfitting, the subsampling ratio subsample was reduced at the same time, which lowers the fit to the training set. It can also be seen that decreasing the step size while increasing the number of iterations reduces some bias, i.e. improves the fit to the training data, while preserving the model's generalization ability. An excessively small step size, however, makes the fit worse, so the step size should not normally be set too small. Finally, the model is trained with all of the tuned optimal parameters (see Table 13) and applied to the test data; the resulting metric values are shown in Table 14:
Table 14 GBDT classification results
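A sketch of the trade-off just described: shrink learning_rate while growing n_estimators in the same proportion, holding the other tuned values fixed. The (rate, trees) pairs below are illustrative, not the exact schedule behind Tables 12 and 13.

```python
# Sketch: trade step size against iteration count, other parameters fixed.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

for lr, n_trees in [(0.1, 90), (0.05, 180), (0.025, 360)]:
    gbm = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=n_trees,
        max_depth=6, min_samples_split=100, min_samples_leaf=6,
        subsample=0.6, random_state=10)
    gbm.fit(X_train, y_train)   # placeholder training split
    auc_test = roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1])
    print(f"learning_rate={lr}, n_estimators={n_trees}: AUC={auc_test:.6f}")
```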
3) Offline classification algorithm selection based on evaluation metrics
Analysis of Table 14 shows that GBDT achieves the best classification accuracy and F2-Measure of all the offline models, and that its training time is shorter than the support vector machine's. When handling static SMS data sets whose spam feature information is relatively stable, GBDT therefore delivers the best classification performance.
After experimenting with all of the offline models, the predictions that each tuned model makes on the test SMS data are listed in Table 15:
Table 15 Optimal classification results of each offline classification algorithm
Analysis of Table 15 shows that logistic regression is the fastest classification algorithm but also the least accurate; SVM improves the classification accuracy but has the highest time complexity; and the improved AdaBoost, because it adds a penalty for misclassifying normal messages, reaches a precision of 100%, meaning that every message it labels as spam really is spam and no normal message is misjudged. GBDT achieves the same 100% precision while delivering better accuracy, and the last column of the table shows that its time complexity is better than the support vector machine's.
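For illustration, a sketch of how such a side-by-side comparison can be produced with scikit-learn. The stock AdaBoostClassifier stands in for the patent's improved cost-sensitive AdaBoost, and all model settings and data splits are assumptions, not the experiment's code.

```python
# Sketch: fit each candidate model, timing training and collecting
# accuracy/precision on the test split (cf. Table 15).
import time
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(),  # stand-in for the improved variant
    "GBDT": GradientBoostingClassifier(
        learning_rate=0.05, n_estimators=180, max_depth=6,
        min_samples_split=100, min_samples_leaf=6, subsample=0.6),
}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)     # placeholder training split
    elapsed = time.time() - start
    y_pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.4f}, "
          f"precision={precision_score(y_test, y_pred):.4f}, "
          f"train_time={elapsed:.1f}s")
```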
Claims (4)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710409006.8A CN107256245B (en) | 2017-06-02 | 2017-06-02 | Offline model improvement and selection method for spam SMS classification |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN107256245A true CN107256245A (en) | 2017-10-17 |
| CN107256245B CN107256245B (en) | 2020-05-05 |
Family
ID=60023486
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710409006.8A Expired - Fee Related CN107256245B (en) | 2017-06-02 | 2017-06-02 | Offline model improvement and selection method for spam SMS classification |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN107256245B (en) |
2017-06-02: Application CN201710409006.8A filed; granted as patent CN107256245B, now not active (Expired - Fee Related)
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7024033B2 (en) * | 2001-12-08 | 2006-04-04 | Microsoft Corp. | Method for boosting the performance of machine-learning classifiers |
| US20130173254A1 (en) * | 2011-12-31 | 2013-07-04 | Farrokh Alemi | Sentiment Analyzer |
| CN104008187A (en) * | 2014-06-11 | 2014-08-27 | 北京邮电大学 | Semi-structured text matching method based on the minimum edit distance |
Non-Patent Citations (2)
| Title |
|---|
| Yu Yang et al., "Analysis and Application of Social Network Behavior Patterns in Offline Spam SMS Filtering", Journal of Chinese Computer Systems (小型微型计算机系统) |
| Huang Wenliang, "A Large-Scale Real-Time Spam SMS Filtering System", Journal of Beijing University of Posts and Telecommunications (北京邮电大学学报) |
Cited By (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107742063A (en) * | 2017-10-20 | 2018-02-27 | 桂林电子科技大学 | A Prediction Method of Prokaryotic σ54 Promoter |
| CN107871144A (en) * | 2017-11-24 | 2018-04-03 | 税友软件集团股份有限公司 | Invoice trade name sorting technique, system, equipment and computer-readable recording medium |
| CN110019763B (en) * | 2017-12-27 | 2022-04-12 | 北京京东尚科信息技术有限公司 | Text filtering method, system, equipment and computer readable storage medium |
| CN110019763A (en) * | 2017-12-27 | 2019-07-16 | 北京京东尚科信息技术有限公司 | Text filtering method, system, equipment and computer readable storage medium |
| CN108230039A (en) * | 2018-01-17 | 2018-06-29 | 平安好房(上海)电子商务有限公司 | Potential conclusion of the business source of houses screening technique, device, equipment and readable storage medium storing program for executing |
| CN108549636A (en) * | 2018-04-09 | 2018-09-18 | 北京信息科技大学 | A kind of race written broadcasting live critical sentence abstracting method |
| CN108962382A (en) * | 2018-05-31 | 2018-12-07 | 电子科技大学 | A kind of layering important feature selection method based on breast cancer clinic high dimensional data |
| CN108763574A (en) * | 2018-06-06 | 2018-11-06 | 电子科技大学 | A kind of microblogging rumour detection algorithm based on gradient boosted tree detects characteristic set with rumour |
| CN108696543A (en) * | 2018-08-24 | 2018-10-23 | 海南大学 | Distributed reflection Denial of Service attack detection based on depth forest, defence method |
| CN108696543B (en) * | 2018-08-24 | 2021-01-05 | 海南大学 | Distributed reflection denial of service attack detection and defense method based on deep forest |
| CN110913354A (en) * | 2018-09-17 | 2020-03-24 | 阿里巴巴集团控股有限公司 | Short message classification method and device and electronic equipment |
| CN109800790A (en) * | 2018-12-24 | 2019-05-24 | 厦门大学 | A kind of feature selection approach towards high dimensional data |
| CN109816034A (en) * | 2019-01-31 | 2019-05-28 | 清华大学 | Signal feature combination selection method, device, computer equipment and storage medium |
| CN109816034B (en) * | 2019-01-31 | 2021-08-27 | 清华大学 | Signal characteristic combination selection method and device, computer equipment and storage medium |
| CN111755070A (en) * | 2019-03-29 | 2020-10-09 | 中山大学 | A CircRNA Function Prediction Method Based on Cascaded Decision System |
| CN110134952B (en) * | 2019-04-29 | 2020-03-31 | 华南师范大学 | A kind of error text rejection method, device and storage medium |
| CN110134952A (en) * | 2019-04-29 | 2019-08-16 | 华南师范大学 | A method, device and storage medium for rejecting wrong text |
| CN110413991A (en) * | 2019-06-20 | 2019-11-05 | 华中师范大学 | An automatic evaluation method for primary school Chinese composition based on rhetorical use |
| CN111783841A (en) * | 2020-06-09 | 2020-10-16 | 中科院成都信息技术股份有限公司 | Garbage classification method, system and medium based on transfer learning and model fusion |
| CN111783841B (en) * | 2020-06-09 | 2023-08-04 | 中科院成都信息技术股份有限公司 | Garbage classification method, system and medium based on migration learning and model fusion |
| CN112185484A (en) * | 2020-10-13 | 2021-01-05 | 华北科技学院 | AdaBoost model-based water quality characteristic mineral water classification method |
| CN112382382A (en) * | 2020-10-23 | 2021-02-19 | 北京科技大学 | Cost-sensitive ensemble learning classification method and system |
| CN112382382B (en) * | 2020-10-23 | 2024-04-12 | 北京科技大学 | Cost-sensitive integrated learning classification method and system |
| CN113705900A (en) * | 2021-08-30 | 2021-11-26 | 西安理工大学 | Method for predicting deflection of face plate dam |
| CN118503419A (en) * | 2024-05-09 | 2024-08-16 | 北京九栖科技有限责任公司 | Spam SMS classification method and device based on pseudo-label fusion clustering |
| CN118503419B (en) * | 2024-05-09 | 2024-11-22 | 北京九栖科技有限责任公司 | Spam SMS classification method and device based on pseudo-label fusion clustering |
Also Published As
| Publication number | Publication date |
|---|---|
| CN107256245B (en) | 2020-05-05 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107256245A (en) | Improved and system of selection towards the off-line model that refuse messages are classified | |
| CN113884290B (en) | Voltage regulator fault diagnosis method based on self-training semi-supervised generative adversarial network | |
| CN110309302B (en) | Unbalanced text classification method and system combining SVM and semi-supervised clustering | |
| JP7694254B2 (en) | How to train a domain-adaptive neural network | |
| Wankhade et al. | A clustering and ensemble based classifier for data stream classification | |
| CN112069310A (en) | Text classification method and system based on active learning strategy | |
| CN107292350A (en) | The method for detecting abnormality of large-scale data | |
| Wang et al. | A factor graph model for unsupervised feature selection | |
| CN104991974A (en) | Particle swarm algorithm-based multi-label classification method | |
| Villa-Blanco et al. | Feature subset selection for data and feature streams: a review | |
| CN116910013B (en) | System log anomaly detection method based on semantic flowsheet mining | |
| Ouyang et al. | Record linkage based on a three-way decision with the use of granular descriptors | |
| CN115048988A (en) | Unbalanced data set classification fusion method based on Gaussian mixture model | |
| CN109670182A (en) | A kind of extremely short file classification method of magnanimity indicated based on text Hash vectorization | |
| CN114093445B (en) | Patient screening marking method based on partial multi-marking learning | |
| CN115112372A (en) | Bearing fault diagnosis method, device, electronic equipment and storage medium | |
| CN110177112A (en) | The network inbreak detection method deviated based on dibaryon spatial sampling and confidence | |
| CN120123518B (en) | Multi-granularity financial text noise open classification method based on large model | |
| CN110020435A (en) | A method of using parallel binary bat algorithm optimization text feature selection | |
| Behnke et al. | Competitive neural trees for pattern classification | |
| CN114443840A (en) | Text classification method, device and equipment | |
| Jeyakarthic et al. | Optimal bidirectional long short term memory based sentiment analysis with sarcasm detection and classification on twitter data | |
| Zhang et al. | A hybrid ensemble and evolutionary algorithm for imbalanced classification and its application on bioinformatics | |
| US11749260B1 (en) | Method for speech recognition with grapheme information | |
| CN117523278A (en) | Semantic attention meta-learning method based on Bayesian estimation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200505 |