CN118656273A - Hard disk failure prediction and data migration method for low-quality data sets - Google Patents

Info

Publication number
CN118656273A
Authority
CN
China
Prior art keywords
hard disk
data
information
smart
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411153731.XA
Other languages
Chinese (zh)
Other versions
CN118656273B (en)
Inventor
杨洪章
屠趁锋
高军
王平
马萌
卢晓雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Tianjin University of Technology
Original Assignee
Peking University
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and Tianjin University of Technology
Priority to CN202411153731.XA
Publication of CN118656273A
Application granted
Publication of CN118656273B
Legal status: Active (granted)

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/30 - Monitoring
    • G06F11/3003 - Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 - Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 - Design, administration or maintenance of databases
    • G06F16/215 - Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/10 - Pre-processing; Data cleansing
    • G06F18/15 - Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 - Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647 - Migration mechanisms
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a hard disk failure prediction and data migration method for low-quality data sets, comprising: obtaining the SMART information of hard disks to form an information set, reconstructing positive and negative samples of the information set, and taking the information set with missing data as the original data set; cleaning useless data from the original data set and undersampling it; filling missing values in the original data set, and converting the filled data set into time series data; constructing and training a prediction model, and inputting the time series data and the corresponding ASFD features into the prediction model to obtain hard disk failure predictions; and identifying the disks on the verge of failure from the predictions and repairing the data of those disks based on a bipartite graph maximum matching strategy and a repair scheduling strategy. Beneficial effects of the present invention: even with low-quality hard disk SMART data, a high failure prediction accuracy is achieved, and the data of disks on the verge of failure can be proactively migrated and repaired in advance.

Description

Hard disk failure prediction and data migration method for low-quality data sets

Technical Field

The present invention belongs to the field of data storage, and in particular relates to a hard disk failure prediction and data migration method for low-quality data sets.

Background Art

Today, hard disks are the primary storage devices in computers, and many data centers rely on large numbers of hard disks to store important information. In scenarios that use large-scale storage systems, such as high-performance computing and Internet services, hard disk failures occur frequently. A survey of the relevant literature shows that 78% of hardware replacements are caused by hard disk failures. The consequences of a hard disk failure are catastrophic, permanent, and difficult to recover from, which reduces the reliability of the data center. Therefore, predicting disk failures as early as possible not only reduces the risk of data loss but also lowers the cost of data recovery.

SMART (Self-Monitoring, Analysis and Reporting Technology) was proposed in the 1990s. It monitors various operating statistics inside the hard disk, such as the number of reads and writes, the number of head load/unload cycles, and the seek error rate. Traditional threshold-based failure detection compares a disk's current SMART attributes with preset thresholds; if a SMART attribute exceeds the corresponding threshold, the disk sends an alarm to the operating system. However, this method achieves only a 3% to 10% TPR (True Positive Rate) at an FPR (False Positive Rate) of 0.1%. Over the past few decades, many researchers have applied machine learning and deep learning to improve failure prediction accuracy, building hard disk failure prediction models with algorithms such as Long Short-Term Memory (LSTM) networks, Generative Adversarial Networks (GANs), Regularized Greedy Forests (RGFs), Temporal Convolutional Networks (TCNs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs). Unfortunately, these works generally train their models on high-quality datasets, such as the Backblaze open-source dataset, so the resulting prediction models transfer poorly to low-quality dataset scenarios.

In real industrial deployments, parts of the hard disk SMART data may be lost during collection, transmission, or storage for objective or subjective reasons. For example, various errors can occur during data collection, such as sensor errors, data transmission errors, and equipment failures, all of which leave inaccurate or incomplete data in the dataset. As another example, a large-scale data center may have to turn off SMART collection during holidays. In addition, operation and maintenance engineers, seeking to avoid being replaced by intelligent O&M software, may corrupt sample data to degrade the accuracy of hard disk failure prediction. Achieving good prediction results on low-quality datasets has therefore become an important problem for the practical engineering application of hard disk failure prediction.

When faced with low-quality datasets, existing techniques usually fill missing values with the mode. Although simple to implement, this yields insufficient prediction accuracy. At the same time, existing techniques ignore the temporal nature of hard disk data: a hard disk does not fail suddenly at a single instant, but degrades through a gradual process.

Summary of the Invention

In view of this, the present invention aims to propose a hard disk failure prediction and data migration method for low-quality data sets, in order to solve at least one of the above technical problems.

To achieve the above object, the technical solution of the present invention is realized as follows:

A hard disk failure prediction and data migration method for low-quality data sets includes the following steps:

obtaining the SMART information of the hard disks to form an information set, labeling the information set with positive and negative samples, and taking the information set with missing data as the original data set;

cleaning useless data from the original data set and undersampling it;

filling missing values in the original data set, converting the filled data set into time series data, and introducing ASFD features;

constructing and training a prediction model, and inputting the time series data and the corresponding ASFD features into the prediction model to obtain hard disk failure predictions;

identifying the disks on the verge of failure from the predictions, and completing the migration and repair of the failing disks based on a bipartite graph maximum matching strategy and a repair scheduling strategy.

Further, the process of labeling the information set with positive and negative samples includes:

obtaining the failure record of each hard disk, labeling the information sets from the period immediately before a failure as positive samples, and labeling the information sets from all other times as negative samples.

Further, the process of cleaning useless data from the original data set and undersampling it includes:

identifying data in the hard disk records that never changes, treating it as useless data, and removing it from the original data set; then, according to the number of information sets labeled as positive samples, selecting a matching number of negative-sample information sets and discarding the remaining negative-sample information sets.

Further, the process of filling missing values in the original data set includes:

obtaining the model of the current hard disk and collecting the SMART information of all hard disks of the same model as the filling set;

obtaining the SMART items corresponding to the lost data of the current hard disk as the missing set;

extracting the mode of each SMART item in the intersection of the filling set and the missing set, and filling the missing data with that mode.

Further, the process of introducing ASFD features includes:

traversing each element in the time series and computing the difference between the current element and the previous element, where the elements are the hard disk SMART information converted into a time series;

taking the absolute value of each difference and summing the absolute values over the elements to obtain the ASFD value, then adding the ASFD value to the original data set, where each ASFD value corresponds to one SMART item.

Further, the process of constructing and training the prediction model includes:

training the prediction model with the LGB algorithm and obtaining the optimal hyperparameters of the prediction model with the OPTUNA framework.

Further, the process of completing the migration and repair of the failing hard disks based on the bipartite graph maximum matching strategy and the repair scheduling strategy includes:

allocating the data blocks of the hard disk to multiple reconstruction sets according to the bipartite graph maximum matching strategy, where the reconstruction sets can be rebuilt in parallel, and maximizing the number of data blocks handled in each migration-and-rebuild round according to the repair scheduling strategy.

Compared with the prior art, the hard disk failure prediction and data migration method for low-quality data sets described in the present invention has the following beneficial effects:

even with low-quality hard disk SMART data, a high failure prediction accuracy is achieved, and the data of disks on the verge of failure can be proactively migrated and repaired in advance.

Brief Description of the Drawings

The accompanying drawings, which form a part of the present invention, are provided for a further understanding of the invention. The exemplary embodiments of the present invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

FIG. 1 is a schematic diagram of the workflow of the present invention;

FIG. 2 is a schematic diagram of the distribution of three stripes under RS(4,2) coding in the present invention;

FIG. 3 is a schematic diagram of identifying the blocks to be transferred in the present invention;

FIG. 4 is a schematic diagram of determining how to store the repaired blocks in the present invention;

FIG. 5 is a schematic diagram of executing the repair in the present invention;

FIG. 6 is a schematic diagram of executing migration and reconstruction within repair rounds in the present invention.

Detailed Description

It should be noted that, in the absence of conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.

In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the accompanying drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as limiting the present invention. In addition, the terms "first", "second", and the like are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. Thus, a feature defined by "first", "second", and the like may explicitly or implicitly include one or more such features. In the description of the present invention, unless otherwise specified, "multiple" means two or more.

In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "installed", "connected", and "coupled" should be understood broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium; or an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.

The present invention will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

A hard disk failure prediction and data migration method for low-quality data sets includes the following steps:

S1. Obtain the SMART information of the hard disks to form an information set, label the information set with positive and negative samples, and take the information set with missing data as the original data set.

The process of labeling the information set with positive and negative samples in step S1 includes:

obtaining the failure record of each hard disk, labeling the information sets from the period immediately before a failure as positive samples, and labeling the information sets from all other times as negative samples.

Preferably, one specific execution flow of step S1, with explanation, is as follows:

SMART information is collected from the hard disks of a large data center at regular intervals, for example once per disk per day. The data of a failed disk from the 30 days before its failure is labeled 1 and called a positive sample; the data from all other periods is labeled 0 and called a negative sample, the negative samples being the data of healthy disks;

parts of the hard disk SMART data may be lost during collection, transmission, or storage for objective or subjective reasons. Common loss patterns include: the SMART data of one disk missing for several consecutive days; the SMART data of several disks missing on the same day; and some SMART fields of one disk missing on a single day. Therefore, a dataset that was actually collected and has partial data loss is chosen as the original data set.
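
As a minimal sketch of this labeling step (illustrative, not patent text), the rule can be written in Python with pandas, assuming the records sit in a DataFrame with hypothetical columns serial_number, date, and failure_date (NaT for healthy disks):

    import pandas as pd

    def label_samples(df: pd.DataFrame, window_days: int = 30) -> pd.DataFrame:
        # Records within window_days before a failure become positive (1);
        # everything else, including all healthy-disk records, stays negative (0).
        df = df.copy()
        days_to_failure = (df["failure_date"] - df["date"]).dt.days
        # NaT failure dates (healthy disks) compare as False, so they remain 0.
        df["label"] = ((days_to_failure >= 0)
                       & (days_to_failure < window_days)).astype(int)
        return df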

S2. Clean useless data from the original data set and undersample it.

The process in step S2 of cleaning useless data from the original data set and undersampling it includes:

identifying data in the hard disk records that never changes, treating it as useless data, and removing it from the original data set; then, according to the number of information sets labeled as positive samples, selecting a matching number of negative-sample information sets and discarding the remaining negative-sample information sets.

Filtering useless data out of the SMART dataset reduces the workload of subsequent analysis. At the same time, because the numbers of positive and negative samples in hard disk datasets differ enormously, undersampling is used to deliberately rebalance the positive-to-negative sample ratio.
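
A minimal sketch of this undersampling, continuing the illustrative DataFrame above; the exact 1:1 ratio and the fixed seed are assumptions, not values fixed by the patent:

    import pandas as pd

    def undersample(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
        # Keep every positive sample and draw an equal number of random negatives.
        pos = df[df["label"] == 1]
        neg = df[df["label"] == 0].sample(n=len(pos), random_state=seed)
        return pd.concat([pos, neg]).sort_values(["serial_number", "date"])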

S3. Fill missing values in the original data set, convert the filled data set into time series data, and introduce ASFD features.

Constructing time series effectively compensates for the low quality of the SMART dataset. To better compute over the original data, time windows are created and used to convert the original data set into time series data; a time window reveals the relationships between data points more representatively and simplifies subsequent computation. A small sliding window is then moved across the constructed time window to better capture how the data changes within it and to enrich the computation of higher-order values. The features of multiple time series are merged, and the features that perform well after interaction are selected to strengthen the prediction model.

S31. The process of filling missing values in the original data set in step S3 includes:

obtaining the model of the current hard disk and collecting the SMART information of all hard disks of the same model as the filling set;

obtaining the SMART items corresponding to the lost data of the current hard disk as the missing set;

extracting the mode of each SMART item in the intersection of the filling set and the missing set, and filling the missing data with that mode.

Preferably, one specific execution flow of step S31, with explanation, is as follows:

after the original data set is converted into time series data, even if a large amount of the original data is lost, the trend of the original data can still be recovered from the higher-order values, which makes the prediction model more accurate and able to cope with low-quality datasets;

to obtain better prediction results, healthy disks and failed disks are separated and processed independently: for a healthy disk, the missing value of disk i in the j-th SMART item is filled with the mode of the j-th SMART item over all disks of the same model as that healthy disk; for a failed disk, the missing value of disk i in the j-th SMART item is filled with the mode of the j-th SMART item over all disks of the same model as that failed disk.
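
A minimal sketch of this per-model mode filling, under the same illustrative column names plus a model column; grouping by the label column is what keeps healthy and failed disks separate:

    import pandas as pd

    def fill_with_mode(df: pd.DataFrame, smart_cols: list) -> pd.DataFrame:
        # For each SMART column, fill gaps with the mode of the (model, label) group.
        df = df.copy()
        for col in smart_cols:
            df[col] = df.groupby(["model", "label"])[col].transform(
                lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s
            )
        return df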

S32. The process of introducing ASFD features in step S3 includes:

traversing each element in the time series and computing the difference between the current element and the previous element, where the elements are the hard disk SMART information converted into a time series;

taking the absolute value of each difference and summing the absolute values over the elements to obtain the ASFD value, then adding the ASFD value to the original data set, where each ASFD value corresponds to one SMART item.

Preferably, one specific execution flow of step S32, with explanation, is as follows:

ASFD (Absolute Sum of First Differences) is a measure of the variation between successive values of a time series. It is computed from the difference between each element of the series and its predecessor; the absolute sum is the sum of the absolute values of these differences. The resulting ASFD value is non-negative: a value of zero means all values of the series are identical, and a larger value indicates greater variation;

the ASFD is calculated as y = Σ_{i=1}^{N-1} |x_{i+1} - x_i|,

where y is the ASFD value being computed; x_i is the SMART attribute value in row i of the SMART attribute column in question; x_{i+1} is the SMART attribute value in row i+1; and N is the number of rows in that column.

For each SMART item of each hard disk, the ASFD value is computed and added to the original data set as a new attribute. The number of entries in the processed data set is unchanged, but the number of SMART items per entry doubles. For example, for the hard disk numbered 4214, the attribute value of SMART item ID 1 on June 1, 2024 is 15 and the ASFD value of SMART item ID 1 is 0.9; that is, every SMART data value gains a corresponding ASFD value.
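
A minimal sketch of the ASFD computation, applying the formula above per disk over a trailing window; the 7-day window and the column naming are assumptions:

    import pandas as pd

    def add_asfd(df: pd.DataFrame, smart_cols: list, window: int = 7) -> pd.DataFrame:
        # Append a <col>_asfd column holding the windowed sum of |x_{i+1} - x_i|.
        df = df.sort_values(["serial_number", "date"]).copy()
        for col in smart_cols:
            diff = df.groupby("serial_number")[col].diff().abs()
            df[col + "_asfd"] = (
                diff.groupby(df["serial_number"])
                    .rolling(window, min_periods=1).sum()
                    .reset_index(level=0, drop=True)
            )
        return df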

S4. Construct and train a prediction model, and input the time series data and the corresponding ASFD features into the prediction model to obtain hard disk failure predictions.

S41. The process of constructing and training the prediction model in step S4 includes:

training the prediction model with the LGB algorithm and obtaining the optimal hyperparameters of the prediction model with the OPTUNA framework.

Preferably, the terms used in step S41 are explained, and the reasons for selecting this technical solution are given, as follows:

LGB (LightGBM) is common knowledge in this field. The Gradient Boosting Decision Tree (GBDT) is a widely used machine learning model whose main idea is to train weak classifiers iteratively to obtain an optimal model; it offers high accuracy and resists overfitting. GBDT is widely applied in industry, typically for classification, regression, and ranking tasks. LGB is an open-source GBDT framework implemented by Microsoft's DMTK team. As an implementation of GBDT, the LGB algorithm adds many optimizations to the traditional algorithm and offers efficient parallel training, lower memory consumption, and higher accuracy.

Compared with the traditional GBDT algorithm, the LGB algorithm has the following advantages:

Histogram optimization. Histograms simplify the representation of the data and reduce memory use, and to some extent they help avoid overfitting. The histogram of a leaf node can be obtained by subtracting the histogram of its sibling from the histogram of its parent; accordingly, the histogram of the leaf with less data can be computed first and the histogram of the leaf with more data obtained by subtraction, which accelerates training.

Leaf-wise growth to improve model accuracy. Compared with the level-wise method, the leaf-wise method is more efficient: for the same number of leaf nodes it achieves a smaller training error and higher accuracy. Growing purely leaf-wise can overfit on small datasets, so LGB adds a limit on tree depth to its leaf-wise growth.

Native support for categorical features. Traditional machine learning algorithms generally cannot take categories as direct input; categorical features must be discretized into multi-dimensional 0/1 features, which is inefficient in both space and time. By changing the decision rules of the decision tree algorithm, LGB supports categorical features directly, without extra discretization.

Support for parallelization. Feature parallelism suits scenarios with little data but many features; data parallelism suits large data volumes with few features; voting parallelism suits large data volumes with many features. Feature parallelism splits the data vertically so that every machine holds all sample points but different features; each machine finds the optimal split points for its own features, and the machines then synchronize to obtain the global optimal split point. Data parallelism splits the data horizontally so that each machine holds part of the data but all features; each machine builds histograms, the histograms are merged, and the optimal split point is found on the merged global histogram. Voting parallelism is an optimization built on data parallelism, whose bottleneck is histogram merging: by voting, only the histograms of some feature values are merged, which reduces communication. Each machine first selects its local top-K best features from its own data, voting then yields the features likely to contain the global optimal split point, and only these selected features are merged, lowering the communication cost of histogram merging.

OPTUNA is common knowledge in this field. For model hyperparameter tuning, many researchers use the brute-force GridSearch method; although simple to understand, its drawback is obvious: it is far too slow to run. By contrast, among tuning methods based on the Bayesian framework, such as HyperOPT and OPTUNA, the latter is lightweight, more powerful, and very fast, and is the tuning method adopted by the present invention. In general, the hyperparameters of tree-based models fall into four categories: 1) parameters affecting the structure and learning of the decision tree; 2) parameters affecting training speed; 3) parameters improving accuracy; and 4) parameters preventing overfitting.

In most cases these categories overlap considerably, and improving the efficiency of one category may degrade another. Relying entirely on manual tuning is time-consuming and labor-intensive, and the optimal balance is hard to find. Given a suitable parameter grid, however, OPTUNA can automatically search out the most balanced parameter combination across these categories and thereby optimize overall performance.

The combination of LGB and OPTUNA above is the optimum arrived at after long-term comparison across a large number of model algorithms and tuning experiments. Other model algorithms, such as LSTM, random forest, and SVM, cannot match the effect of the LGB method, and other tuning methods, such as GridSearch and HyperOPT, likewise cannot match the effect of the OPTUNA method.
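
A minimal sketch of this LGB-plus-OPTUNA combination, assuming preprocessed features X and labels y are already in memory; the search space, trial count, and AUC objective are illustrative choices, not values fixed by the patent:

    import lightgbm as lgb
    import optuna
    from sklearn.model_selection import cross_val_score

    def objective(trial):
        params = {
            "num_leaves": trial.suggest_int("num_leaves", 16, 256),
            # depth cap on leaf-wise growth, as discussed above
            "max_depth": trial.suggest_int("max_depth", 3, 12),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        }
        clf = lgb.LGBMClassifier(n_estimators=200, **params)
        return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    best_params = study.best_params  # feed back into the final LGBMClassifier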

S42. The process in step S4 of inputting the time series data and the corresponding ASFD features into the prediction model to obtain hard disk failure predictions includes:

checking the collected SMART data of a hard disk for partial data loss. If data is missing, i.e., the data is of low quality, the missing values are filled with the mode of that SMART item for that disk over the most recent N days, where N is typically 7 or 10. The ASFD value of each SMART item of the disk is then computed to obtain the disk's latest sample, which is input into the model. The output is 0 or 1, where 0 means the disk is predicted to be healthy and 1 means the disk is on the verge of failure.
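
A minimal sketch of this online scoring step for a single disk, assuming a trained model and the illustrative column conventions above; the feature layout must of course match whatever the model was trained on:

    import pandas as pd

    def predict_disk(history: pd.DataFrame, model, smart_cols: list, n_days: int = 7):
        # Fill gaps with the disk's own recent mode, rebuild ASFD, then score.
        recent = history.sort_values("date").tail(n_days).copy()
        for col in smart_cols:
            mode = recent[col].mode()
            if not mode.empty:
                recent[col] = recent[col].fillna(mode.iloc[0])
        features = {}
        for col in smart_cols:
            features[col] = recent[col].iloc[-1]
            features[col + "_asfd"] = recent[col].diff().abs().sum()
        # 0 = predicted healthy, 1 = on the verge of failure
        return int(model.predict(pd.DataFrame([features]))[0])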

S5. Identify the disks on the verge of failure from the predictions, and complete the migration and repair of the failing disks based on the bipartite graph maximum matching strategy and the repair scheduling strategy.

The process in step S5 of completing the repair of the failing disk's data based on the bipartite graph maximum matching strategy and the repair scheduling strategy includes:

allocating the data blocks of the hard disk to multiple reconstruction sets according to the bipartite graph maximum matching strategy, where the reconstruction sets can be rebuilt in parallel, and maximizing the number of data blocks handled in each migration-and-rebuild round according to the repair scheduling strategy.

Preferably, the specific execution flows of the bipartite graph maximum matching strategy and the repair scheduling strategy in step S5 are as follows:

The bipartite graph maximum matching strategy is:

construct a set C and an initial reconstruction set R, where C is the set of all blocks of the failing disk that need to be repaired;

for each data block in C, determine in turn whether, if it were added to the reconstruction set R, the newly formed block set in R could still be rebuilt in parallel;

if parallel rebuilding remains possible, add the block to R and delete it from C, until all blocks of C have been organized into reconstruction sets.

In the bipartite graph maximum matching strategy, the specific process of testing blocks in turn and adding them to the reconstruction set includes:

for each data block Ci, use the MATCH function to construct a bipartite graph over N-1 nodes, one side of which consists of blocks on healthy hard disks and the other side of check blocks;

identify a maximum matching in the bipartite graph. If the maximum matching contains k(1+|R|) edges, the current data block Ci can be rebuilt in parallel through k(1+|R|) distinct healthy disks and check blocks; the MATCH function then returns true, the current data block Ci is added to the reconstruction set R, and Ci is deleted from the set C.

Multiple repair rounds are set up, and in each round the above per-block operation is repeated until every data block in C has been added to a reconstruction set, yielding the collection of all reconstruction sets D = {R1, R2, ..., Rd}, where d is the total number of reconstruction sets.

To increase the number of blocks per reconstruction set, and thus the degree of parallel rebuilding, one can also check whether R can be extended by swapping a data block in R with another data block not in R.
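
A minimal sketch of the MATCH test, using the maximum bipartite matching from networkx; the graph encoding (healthy-disk block vertices on one side, check-block vertices on the other, with an edge for each pairing usable in one rebuild) is an illustrative reading of the description above:

    import networkx as nx

    def match(healthy_blocks, check_blocks, edges, k, r_size):
        # True if a matching of k(1+|R|) edges exists, i.e. block Ci may join R.
        g = nx.Graph()
        g.add_nodes_from(healthy_blocks, bipartite=0)
        g.add_nodes_from(check_blocks, bipartite=1)
        g.add_edges_from(edges)
        m = nx.bipartite.maximum_matching(g, top_nodes=set(healthy_blocks))
        # the returned dict lists every matched edge in both directions
        return len(m) // 2 >= k * (1 + r_size)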

The repair scheduling strategy is:

sort all the reconstruction sets R in the collection D in monotonically decreasing order of the number of data blocks they contain, which ensures that reconstruction sets containing more blocks are considered first during repair;

maintain two indices l and u, where l is the current smallest reconstruction-set index and u the current largest, i.e., the first reconstruction set is Rl and the last is Ru;

build the set Cm of blocks repaired by migration and the set Cr of blocks repaired by cooperative reconstruction using healthy disks;

compute Cm and determine whether |R(l+1) ∪ ... ∪ Ru| ≤ Cm currently holds;

if it holds, the reconstruction sets R(l+1) through Ru can be repaired by migration and Rl by reconstruction, with the migration and reconstruction executed in parallel; the repair rounds then terminate and the repair completes;

otherwise, find the reconstruction set with the fewest data blocks in the current collection D such that |Rx ∪ R(x+1) ∪ ... ∪ Ru| > Cm, where x is the index of that reconstruction set, and take a subset Rx' of Rx such that |Rx' ∪ R(x+1) ∪ ... ∪ Ru| = Cm;

let Rx = Rx - Rx' and Ml = Rx' ∪ R(x+1) ∪ ... ∪ Ru, and update the indices by setting l = l + 1 and u = x so that the new reconstruction sets are considered in the next repair round; iterate this process until all reconstruction sets have been processed.
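
A minimal sketch of this scheduling loop, under the simplifying assumptions that the reconstruction sets are disjoint Python sets already produced by the matching step and that Cm acts as a fixed per-round migration capacity; both are readings of the description above, not patent text:

    def schedule_repairs(recon_sets, cm):
        # Returns, per round, the set repaired by rebuild and the blocks migrated.
        sets_ = sorted((set(s) for s in recon_sets), key=len, reverse=True)
        rounds, l, u = [], 0, len(sets_) - 1
        while l <= u:
            tail = set().union(*sets_[l + 1:u + 1]) if l < u else set()
            if len(tail) <= cm:
                # Everything after Rl fits the budget: migrate it while Rl rebuilds.
                rounds.append({"rebuild": sets_[l], "migrate": tail})
                break
            x = u  # largest x whose suffix Rx..Ru still exceeds the budget
            while sum(len(sets_[i]) for i in range(x, u + 1)) <= cm:
                x -= 1
            suffix = set().union(*sets_[x + 1:u + 1]) if x < u else set()
            peeled = set(sorted(sets_[x])[:cm - len(suffix)])  # the subset Rx'
            sets_[x] -= peeled
            rounds.append({"rebuild": sets_[l], "migrate": peeled | suffix})
            l, u = l + 1, x
        return rounds

Run on reconstruction sets shaped like the FIG. 6 example below, the first round pairs the rebuild of R1 with the migration of one block peeled from R5 plus all of R6 and R7.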

Working process:

The present invention discloses a hard disk failure prediction and data migration method for low-quality data sets. To better illustrate its working process, one specific implementation flow is given below:

1) Create time windows: on the low-quality dataset, a standard window size of 10 days is used to create time series features, which facilitates the later addition of time series features.

2) Set sliding windows: two sliding window sizes, 7 days and 1 day, are preset to facilitate the later addition of time feature sequences; the sliding windows slide over the above time windows to perform the calculations;

ASFD features are added: using the sliding-window method, the time series feature Absolute Sum of First Differences (ASFD) is added to the dataset, reflecting the absolute fluctuation between adjacent observations of the hard disk SMART data. Given a time series x = [1, 1, 0, 1, 0], the absolute first differences (with a leading zero) are [0, 0, 1, 1, 1], and the running ASFD is y = [0, 0, 1, 2, 3];

3) CID features may also be added as a supplement: using the sliding-window method, the time series feature Complexity Invariant Distance (CID) is added to reflect the time series complexity of the hard disk SMART data. For time series Q and C, CID is computed as CID(Q, C) = ED(Q, C) × CF(Q, C), where ED(Q, C) is the Euclidean distance, CF(Q, C) = max(CE(Q), CE(C)) / min(CE(Q), CE(C)), and the complexity estimate is CE(Q) = sqrt(Σ_{i=1}^{n-1} (q_{i+1} - q_i)^2). Repeated experiments show that adding CID features improves prediction accuracy in most cases but increases processing time, so engineers should decide whether to add CID according to actual needs.
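
A minimal sketch of the CID computation as defined above; how a series Q is paired with a reference series C is an implementation choice the patent leaves open:

    import numpy as np

    def complexity_estimate(q):
        # CE(Q) = sqrt(sum of squared first differences)
        return float(np.sqrt(np.sum(np.diff(np.asarray(q, dtype=float)) ** 2)))

    def cid(q, c):
        # CID(Q, C) = ED(Q, C) * max(CE(Q), CE(C)) / min(CE(Q), CE(C))
        q, c = np.asarray(q, dtype=float), np.asarray(c, dtype=float)
        ed = float(np.linalg.norm(q - c))  # Euclidean distance
        ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
        return ed * (max(ce_q, ce_c) / max(min(ce_q, ce_c), 1e-12))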

4) Overview of the training dataset: after preprocessing, ASFD and CID features are added for each SMART attribute of the original dataset. The training dataset has 121 feature dimensions in total.

5) Model training: the hard disk failure prediction model is trained with LGB (LightGBM), developed by Microsoft, on the preprocessed training set.

6) Model hyperparameter tuning: the OPTUNA method is used to find the optimal hyperparameters of the LGB algorithm and thereby tune the model.

7) Hard disk failure prediction: after the failure prediction model with optimal hyperparameters is obtained, the test set data is input into the failure prediction model to obtain disk-level predictions, i.e., whether a given disk will fail.

8) Data migration and reconstruction:

As shown in FIGS. 2, 3, 4, and 5: a bipartite graph with left and right vertex sets is constructed; the node vertices {N1, N2, N3, N4, N5} and {N4, N5, N6, N7, N8} are connected to the k = 5 block vertices of stripes S1 and S2, respectively. Finding the maximum matching shows that blocks can be retrieved from N1, N2, N3, and N4 to rebuild the blocks of S1. Another bipartite graph is constructed in which the node vertices {N6, N7, N8}, {N1, N2, N3}, and {N3, N4, N5} are connected to the stripe vertices of stripes S1, S2, and S3, respectively; its maximum matching shows that the repaired block of S1 can be stored on N6.

As shown in FIG. 6, another embodiment of data migration and reconstruction is given, with d = 7 reconstruction sets and Cm = 4. In the first repair round, the largest x is found to be 5, such that |R5| + |R6| + |R7| > Cm = 4; a subset R5' ⊂ R5 containing one block of R5 is then found such that M1 = R5' ∪ R6 ∪ R7 is repaired by migration, in parallel with the reconstruction of R1, in the first repair round. The number of data blocks remaining in R5 drops to 2, and all blocks are finally repaired by the third repair round.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some or all of the technical features therein, and that such modifications or replacements do not remove the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present invention; they should all be covered by the scope of the claims and specification of the present invention.

The above is only a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (7)

1. A hard disk failure prediction and data migration method for low-quality data sets, characterized by comprising the following steps:
obtaining SMART information of a hard disk to obtain an information set, reconstructing positive and negative samples of the information set, and taking the information set of lost data as an original data set;
carrying out useless data cleaning operation on the original data set and undersampling;
filling missing values in the original data set, converting the original data set filled with the missing values into time sequence data, and introducing ASFD features;
constructing and training a prediction model, and inputting time sequence data and corresponding ASFD features into the prediction model to obtain a prediction result of the hard disk fault;
And identifying the hard disk which is in imminent failure according to the prediction result, and completing migration repair of the failure hard disk based on the maximum matching strategy and the repair scheduling strategy of the bipartite graph.
2. The method for predicting hard disk failure and data migration for low quality data sets according to claim 1, wherein the process of reconstructing positive and negative samples of said information sets comprises:
Acquiring an information set of a fault hard disk in a period of time before a fault, and marking the information set as a positive sample; according to the length of the time period, acquiring an information set of the healthy hard disk in the same time period, and marking the information set as a negative sample;
wherein, the SMART information of the hard disk is recorded with the fault state of the hard disk.
3. The method for predicting hard disk failure and data migration for low-quality data sets according to claim 1, wherein said process of performing useless data cleaning operation on the original data set and performing undersampling process comprises:
The method comprises the steps of obtaining fixed and unchanged data in a hard disk, and regarding the fixed and unchanged data as useless data to be cleared from an original data set; and selecting a corresponding number of negative sample information sets according to the number of the information sets marked as positive samples in the original data set, and clearing the rest negative sample information sets.
4. The method for predicting hard disk failure and data migration for low-quality data sets according to claim 1, wherein said process of missing value filling of said original data set comprises:
acquiring the hard disk model of the current hard disk, and acquiring SMART information of all the hard disks with the same hard disk model as a filling set;
Acquiring a SMART item corresponding to lost data in a current hard disk, and taking the SMART item as a deletion set;
And extracting the mode of each SMART item in the intersection of the filling set and the missing set, and filling the missing data with the mode.
5. The method for hard disk failure prediction and data migration for low quality data sets according to claim 1, wherein the process of introducing ASFD features comprises:
Traversing each element in the time sequence, and calculating a difference value between a current element and a previous element, wherein the element is hard disk SMART information converted into the time sequence;
And taking absolute values of the differences, adding the absolute values corresponding to each element to obtain ASFD values, and adding the ASFD values to an original data set, wherein each SMART item corresponds to one ASFD value respectively.
6. The method for predicting hard disk failure and data migration in accordance with claim 1, wherein said process of constructing and training a predictive model comprises:
and training the prediction model by using an LGB algorithm, and acquiring the optimal super parameters of the prediction model by using a OPTUNA framework.
7. The method for predicting hard disk failure and data migration for low-quality data sets according to claim 1, wherein the process of completing migration repair of the failed hard disk based on the bipartite graph maximum matching policy and the repair scheduling policy comprises:
And distributing each data block in the hard disk to a plurality of reconstruction sets according to the maximum matching strategy of the bipartite graph, wherein the reconstruction sets have parallel reconstruction capability, and the number of the data blocks in each migration reconstruction is maximized according to the repair scheduling strategy.
CN202411153731.XA 2024-08-21 2024-08-21 Hard disk failure prediction and data migration method for low-quality data sets Active CN118656273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411153731.XA CN118656273B (en) 2024-08-21 2024-08-21 Hard disk failure prediction and data migration method for low-quality data sets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411153731.XA CN118656273B (en) 2024-08-21 2024-08-21 Hard disk failure prediction and data migration method for low-quality data sets

Publications (2)

Publication Number Publication Date
CN118656273A true CN118656273A (en) 2024-09-17
CN118656273B CN118656273B (en) 2025-02-11

Family

ID=92708168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411153731.XA Active CN118656273B (en) 2024-08-21 2024-08-21 Hard disk failure prediction and data migration method for low-quality data sets

Country Status (1)

Country Link
CN (1) CN118656273B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119007793A (en) * 2024-10-23 2024-11-22 天津理工大学 Cloud edge cooperative data fault tolerance method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170344618A1 (en) * 2010-12-23 2017-11-30 Eliot Horowitz Systems and methods for managing distributed database deployments
CN109828869A (en) * 2018-12-05 2019-05-31 中兴通讯股份有限公司 Predict the method, apparatus and storage medium of hard disk failure time of origin
CN111858108A (en) * 2020-06-23 2020-10-30 新华三技术有限公司 Hard disk fault prediction method and device, electronic equipment and storage medium
CN115689071A (en) * 2023-01-03 2023-02-03 南京工大金泓能源科技有限公司 Equipment fault fusion prediction method and system based on correlation parameter mining
CN117218128A (en) * 2023-11-09 2023-12-12 成都格理特电子技术有限公司 Method and system for detecting running and leaking targets by integrating time sequence information

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170344618A1 (en) * 2010-12-23 2017-11-30 Eliot Horowitz Systems and methods for managing distributed database deployments
CN109828869A (en) * 2018-12-05 2019-05-31 中兴通讯股份有限公司 Predict the method, apparatus and storage medium of hard disk failure time of origin
CN111858108A (en) * 2020-06-23 2020-10-30 新华三技术有限公司 Hard disk fault prediction method and device, electronic equipment and storage medium
CN115689071A (en) * 2023-01-03 2023-02-03 南京工大金泓能源科技有限公司 Equipment fault fusion prediction method and system based on correlation parameter mining
CN117218128A (en) * 2023-11-09 2023-12-12 成都格理特电子技术有限公司 Method and system for detecting running and leaking targets by integrating time sequence information

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119007793A (en) * 2024-10-23 2024-11-22 天津理工大学 Cloud edge cooperative data fault tolerance method and system
CN119007793B (en) * 2024-10-23 2025-01-14 天津理工大学 Cloud edge cooperative data fault tolerance method and system

Also Published As

Publication number Publication date
CN118656273B (en) 2025-02-11

Similar Documents

Publication Publication Date Title
CN111367961B (en) Time sequence data event prediction method and system based on graph convolution neural network and application thereof
CN111612041B (en) Abnormal user identification method and device, storage medium and electronic equipment
US9589045B2 (en) Distributed clustering with outlier detection
US9460236B2 (en) Adaptive variable selection for data clustering
CN106452825B (en) An alarm correlation analysis method for power distribution communication network based on improved decision tree
CN103336791B (en) Hadoop-based fast rough set attribute reduction method
CN118093895A (en) A knowledge graph visualization development system
CN104778622A (en) Method and system for predicting TPS transaction event threshold value
CN117828539A (en) Intelligent data fusion analysis system and method
CN118656273B (en) Hard disk failure prediction and data migration method for low-quality data sets
CN119938335A (en) Service proxy method and system based on Doris front-end node
CN118152378B (en) Construction method and system of intelligent data center
CN118520415A (en) Carbon emission monitoring method and system for motor vehicle
CN115345458A (en) Business process compliance checking method, computer equipment and readable storage medium
CN116795977A (en) Data processing methods, devices, equipment and computer-readable storage media
CN115238583B (en) Business process remaining time prediction method and system supporting incremental log
CN107133335A (en) A kind of repetition record detection method based on participle and index technology
Yang et al. Zte-predictor: Disk failure prediction system based on lstm
CN115035966A (en) Superconductor screening method, device and equipment based on active learning and symbolic regression
CN119337984A (en) A railway fault tracing method and system based on weighted path sorting algorithm
CN118568790A (en) Big data platform data processing method, device and system
CN117687815A (en) Hard disk fault prediction method and system
CN117893326A (en) Post-investment management information system and post-investment management method
CN117291575A (en) Equipment maintenance method, equipment maintenance device, computer equipment and storage medium
CN116149895A (en) Big data cluster performance prediction method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant