
CN111459692A - Method, apparatus and computer program product for predicting drive failure - Google Patents

Method, apparatus and computer program product for predicting drive failure

Info

Publication number
CN111459692A
Authority
CN
China
Prior art keywords
operational data
drive
data
attributes
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910048750.9A
Other languages
Chinese (zh)
Other versions
CN111459692B (en
Inventor
刘冰
刘星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
EMC Corp
Original Assignee
EMC IP Holding Co LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by EMC IP Holding Co LLC
Priority to CN201910048750.9A (CN111459692B)
Priority to US16/414,196 (US10996861B2)
Publication of CN111459692A
Application granted
Publication of CN111459692B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0616Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0619Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/008Reliability or availability analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0653Monitoring storage devices or systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659Command handling arrangements, e.g. command buffers, queues, command scheduling
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0674Disk device
    • G06F3/0676Magnetic disk device
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

Embodiments of the present disclosure relate to methods, apparatuses, and computer program products for predicting drive failure. The method comprises: obtaining operational data for the drive, each data item in the operational data indicating values of one or more attributes of the drive at a respective time; identifying null values of the one or more attributes in the data items of the operational data; adjusting the operational data based at least on the identification of the null values; and processing the adjusted operational data using a machine learning model to obtain a failure prediction of whether the drive will fail within a predetermined time period after the respective time.

Description

Method, apparatus and computer program product for predicting drive failure
Technical Field
Embodiments of the present disclosure relate to the field of computers, and more particularly, to methods, apparatuses, and computer program products for predicting drive failures.
Background
With the continuous development of computer technology, people rely more and more on the storage capacity of the server side. Once any drive in the server fails, it may cause data loss or service interruption, with a huge impact on the user. It is highly desirable to know in advance which drives in the server are likely to fail, so that they can be replaced in advance to avoid unacceptable losses due to drive failure. Therefore, how to effectively predict possible drive failures has become a focus of current interest.
Disclosure of Invention
Embodiments of the present disclosure provide a scheme for predicting drive failure.
According to a first aspect of the present disclosure, a method for predicting a drive failure is presented. The method comprises: obtaining operational data for the drive, each data item in the operational data indicating values of one or more attributes of the drive at a respective time; identifying null values of the one or more attributes in the data items of the operational data; adjusting the operational data based at least on the identification of the null values; and processing the adjusted operational data using a machine learning model to obtain a failure prediction of whether the drive will fail within a predetermined time period after the respective time.
According to a second aspect of the present disclosure, an apparatus for predicting a failure of a drive is presented. The apparatus comprises: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the apparatus to perform acts comprising: obtaining operational data for the drive, each data item in the operational data indicating values of one or more attributes of the drive at a respective time; identifying null values of the one or more attributes in the data items of the operational data; adjusting the operational data based at least on the identification of the null values; and processing the adjusted operational data using a machine learning model to obtain a failure prediction of whether the drive will fail within a predetermined time period after the respective time.
In a third aspect of the disclosure, a computer program product is provided. The computer program product is stored in a non-transitory computer storage medium and comprises machine executable instructions which, when run in a device, cause the device to perform any of the steps of the method described according to the first aspect of the disclosure.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 illustrates a schematic diagram of an environment in which embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a schematic diagram of example operational data in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of a method for predicting drive failure in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of test drive failure prediction according to an embodiment of the present disclosure; and
FIG. 5 illustrates a schematic block diagram of an example device that can be used to implement embodiments of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, with the continuous development of computer technology, people rely more and more on the storage capability of the server side. Once any drive in the server fails, it may cause data loss or service interruption, with a huge impact on the user. Therefore, there is increasing interest in how to warn the manager early of a possible drive failure.
Typically, drives are equipped with Self-Monitoring, Analysis, and Reporting Technology (SMART) to detect and report various drive reliability indicators. Some existing drive failure early warning models can predict impending failures from SMART data, and some drive manufacturers have built failure prediction models based on SMART data into their drives. However, such built-in models make predictions based only on threshold conditions and therefore have difficulty providing accurate estimates. In addition, in practical applications, the raw operational data of the drive collected by the drive manager tends to be noisy. In some cases, the drive manager may not be able to collect values for certain attributes of the drive, and these values are recorded as null values in the raw operational data. Existing failure prediction schemes have difficulty making effective drive failure predictions based on such raw operational data.
According to an embodiment of the present disclosure, a scheme for predicting drive failure is provided. In this scheme, raw operational data of the drive is first obtained, each data item in the operational data indicating values of one or more attributes of the drive at a respective time. Subsequently, null values of the one or more attributes are identified in the data items of the operational data, and the operational data is adjusted based at least on the identification of the null values. The adjusted operational data may then be input to a machine learning model to obtain a failure prediction of whether the drive will fail within a predetermined period of time after the respective time. In this way, the raw operational data can be automatically preprocessed to account for the possible presence of a large number of null values in the raw data. Furthermore, by processing the adjusted operational data using a machine learning model, the accuracy of drive failure prediction can be improved.
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings. Fig. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. In this example environment 100, a computing device 120 may receive operational data 110 for a drive. In some embodiments, the drive manager may collect operational data 110 associated with attributes of one or more drives over a particular period of time and store it in memory over time. These attributes may indicate the operational status of the drive during a particular period of time, such as the number of boots, power-on time, and number of bad sectors. In some embodiments, the drive manager may collect the drive's operational data 110 on a daily basis and add a timestamp to each data item to indicate the value of one or more attributes of the drive over a particular period of time.
FIG. 2 illustrates example operational data 110. As shown in FIG. 2, example operational data 110 may include data items 210-1, 210-2, and 210-M associated with one or more attributes of one or more drives at different times. For ease of description, one or more of the data items 210-1, 210-2, and 210-M are collectively or individually referred to as data item 210.
In some embodiments, as shown in FIG. 2, the example operational data 110 may be maintained in tabular form, where the example operational data 110 includes a plurality of fields, e.g., an identifier 202, a timestamp 204, an attribute 1 206-1, an attribute 2 206-2, …, and an attribute N 206-N. Specifically, the identifier 202 may indicate an identifier of the drive associated with the data item, such as ID-1 and ID-M; the timestamp 204 may indicate a time corresponding to the data item. Furthermore, as described above, in practical application scenarios, it is often difficult for the drive manager to obtain the values of all the attributes of the drive, so some of the attribute values in the table are null. For example, the value 212 of attribute 1 for drive ID-1 at timestamp T2, the value 214 of attribute 2 for drive ID-M at timestamp T3, and the value 216 of attribute N are all null. It is difficult for conventional prediction methods to process such noisy data containing a large number of null values.
In some embodiments, the drive may be of any suitable type, including but not limited to: Serial Attached Small Computer System Interface (SAS) drives, Serial Advanced Technology Attachment (SATA) drives, Solid State Drives (SSDs), and Hard Disk Drives (HDDs), among others.
With continued reference to fig. 1, in some embodiments, the operational data 110 may be transmitted to the computing device 120 by way of wired or wireless communication. In some embodiments, the computing device 120 may also directly read the operational data 110 stored in a storage device coupled to the computing device 120. The computing device 120 may determine a failure prediction 130 for the drive based on the received operational data 110. For example, the computing device 120 may give a failure prediction of "a drive is likely to fail within 7 days".
The process of predicting a drive failure will be described in more detail below with reference to FIG. 3. FIG. 3 illustrates a flow diagram of a process 300 for predicting drive failure, according to some embodiments of the present disclosure. Process 300 may be implemented by computing device 120 of fig. 1. For ease of discussion, the process 300 will be described in conjunction with fig. 1 and 2.
At block 302, the computing device 120 obtains operational data 110 for the drive, each data item in the operational data 110 indicating a value of one or more attributes of the drive at a respective time. In some embodiments, as described above, the computing device 120 may receive the operational data 110 by way of wired or wireless communication. The computing device 120 may also read the operational data 110 stored in a storage device coupled to the computing device 120.
In some embodiments, the one or more attributes of the drive include one or more attributes of the SMART category, which may include, for example, the number of boots of the drive, the power-on time, the number of bad sectors, and the like. SMART attributes make it possible to monitor the operation of the head, disk, motor, and circuitry of the drive, and are therefore commonly used to predict drive failure.
In some embodiments, when the drive is a SATA drive, the one or more attributes of the drive may also include one or more attributes of a Background Media Scanning (BMS) class. These attributes may include, for example, whether a media error occurred, whether a recovery error occurred, etc. By adding input of BMS attributes, computing device 120 can more accurately predict possible failures of SATA type drives.
In some embodiments, the received operational data 110 may be training data used to train a machine learning model. Since the operational data 110 may contain data items with different timestamps for multiple different drives, the computing device 120 may group the operational data 110 by the identifier 202 field of the drive so that data for the same drive is aggregated into the same group. Further, for a given group, if any data item in the group is determined to correspond to a failure state, every data item in the group may be tagged with a label, for example "-1", to distinguish it from the label "1" given to groups in which no failure occurred.
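The grouping and labeling described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the dictionary keys `id`, `timestamp`, and `failed` and the function name `label_groups` are assumptions introduced for the example.

```python
from collections import defaultdict

def label_groups(items):
    """Group data items by drive identifier and label each whole group.

    Each item is a dict with the hypothetical keys "id", "timestamp" and
    "failed". A group is labeled -1 if any of its items corresponds to a
    failure state, otherwise 1.
    """
    groups = defaultdict(list)
    for item in items:
        groups[item["id"]].append(item)

    labeled = []
    for group in groups.values():
        label = -1 if any(it["failed"] for it in group) else 1
        # Keep the items of one drive in time order.
        for it in sorted(group, key=lambda it: it["timestamp"]):
            labeled.append({**it, "label": label})
    return labeled

rows = [
    {"id": "ID-1", "timestamp": 1, "failed": False},
    {"id": "ID-2", "timestamp": 1, "failed": False},
    {"id": "ID-1", "timestamp": 2, "failed": True},
]
labeled = label_groups(rows)
```

Here both items of drive ID-1 receive the label -1 because one of them is in a failure state, while the item of drive ID-2 receives the label 1.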
In some embodiments, for training data, the computing device 120 may adjust the ratio of data items corresponding to non-failed drives to data items corresponding to failed drives to improve the effectiveness of model training.
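The text does not say how the ratio is adjusted. One common option, shown here purely as an assumption, is to randomly downsample the data items of non-failed drives. The function name `downsample_healthy`, the `ratio` parameter, and the `label` key are hypothetical.

```python
import random

def downsample_healthy(items, ratio=1.0, seed=0):
    """Keep all failed items and at most ratio * len(failed) healthy items.

    Items carry a hypothetical "label" key: -1 for failed, 1 for healthy.
    Random downsampling is an assumed strategy, not one named in the text.
    """
    failed = [it for it in items if it["label"] == -1]
    healthy = [it for it in items if it["label"] == 1]
    keep = min(len(healthy), int(len(failed) * ratio))
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return failed + rng.sample(healthy, keep)

data = [{"label": -1}, {"label": 1}, {"label": 1}, {"label": 1}]
balanced = downsample_healthy(data, ratio=1.0)
```

With one failed item and a ratio of 1.0, a single healthy item is retained, so the two classes end up the same size.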
At block 304, computing device 120 identifies null values of one or more attributes in the data items of operational data 110. Taking the example operational data 110 of FIG. 2 as an example, the computing device 120 may identify null values of one or more attributes in the plurality of data items 210. For example, computing device 120 may identify that the value 212 of attribute 1 for drive ID-1 at timestamp T2, the value 214 of attribute 2 for drive ID-M at timestamp T3, and the value 216 of attribute N are all null values.
At block 306, based at least on the identification of the null values, the computing device 120 adjusts the operational data 110. In some embodiments, when the received operational data 110 is training data, the computing device 120 may determine the proportion of null values identified in a data item (referred to as a first data item for ease of description) in the operational data 110. As shown in FIG. 2, the computing device 120 may determine the proportion of null-valued attributes among the attributes of the data item other than the basic drive information. For example, assuming that N is 3 in FIG. 2, the proportion of null values in the data item 210-2 may be determined as 1/3, and the proportion of null values in the data item 210-M as 2/3.
Computing device 120 can compare this proportion of null values to a predetermined proportion threshold and remove the first data item from operational data 110 in response to the proportion exceeding the threshold. For example, the predetermined threshold may be set to 1/2, i.e., when more than half of the attributes in a data item are null, the computing device 120 may remove the data item from the operational data. Specifically, in the example of FIG. 2, because the proportion 2/3 of null values in the data item 210-M is greater than the predetermined threshold 1/2, the computing device 120 may remove the data item 210-M from the operational data 110 so that it is not used as part of the training data.
In some embodiments, the computing device 120 may set a null value identified in a data item of the operational data to a predetermined value. In some embodiments, the computing device 120 may also set the null value in each data item to the average of the values of the attribute in the group. In some embodiments, computing device 120 may set the null value in each data item to 0 or to a predetermined value, such as -999, that differs significantly from normal values. Considering that, in most cases, a null value is itself an indication that the drive may malfunction, the computing device 120 can predict drive failure more accurately by setting a predetermined value that is easy to distinguish from normal values. In some embodiments, the computing device 120 may assign values to the null values after removing the data items whose proportion of null values exceeds the threshold.
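The removal rule can be sketched as below, using the N = 3 case from FIG. 2. The attribute names `attr1` to `attr3`, the use of `None` to represent a null value, and the function names are assumptions made for illustration.

```python
def null_fraction(item, attribute_names):
    """Fraction of the listed attributes whose value is null (None)."""
    nulls = sum(1 for name in attribute_names if item.get(name) is None)
    return nulls / len(attribute_names)

def drop_sparse_items(items, attribute_names, threshold=0.5):
    """Remove data items whose null fraction exceeds the threshold."""
    return [it for it in items
            if null_fraction(it, attribute_names) <= threshold]

ATTRS = ["attr1", "attr2", "attr3"]  # N = 3, as assumed in the text
items = [
    {"id": "ID-1", "ts": "T1", "attr1": 4, "attr2": 7, "attr3": 1},
    {"id": "ID-1", "ts": "T2", "attr1": None, "attr2": 7, "attr3": 1},    # 1/3 null
    {"id": "ID-M", "ts": "T3", "attr1": 2, "attr2": None, "attr3": None}, # 2/3 null
]
kept = drop_sparse_items(items, ATTRS, threshold=0.5)
```

Only the item with two of three attributes missing is removed. The identifier and timestamp fields are excluded from the count because they are basic drive information rather than attributes.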
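Two of the filling strategies mentioned above, a sentinel value and the per-group average, can be sketched as follows. The function names, the attribute names, and the fallback to the sentinel when a group has no known value are assumptions, not details taken from the disclosure.

```python
from statistics import mean

SENTINEL = -999  # example value from the text, far from normal readings

def fill_with_sentinel(items, attribute_names, sentinel=SENTINEL):
    """Replace every null attribute value with a fixed sentinel."""
    return [
        {**it, **{a: sentinel for a in attribute_names if it.get(a) is None}}
        for it in items
    ]

def fill_with_group_mean(group, attribute_names, fallback=SENTINEL):
    """Replace nulls with the mean of the known values within one drive group.

    If the group has no known value for an attribute, the fallback is used
    (an assumption; the text does not cover this case).
    """
    filled = [dict(it) for it in group]
    for a in attribute_names:
        known = [it[a] for it in group if it.get(a) is not None]
        value = mean(known) if known else fallback
        for it in filled:
            if it.get(a) is None:
                it[a] = value
    return filled

group = [
    {"attr1": 10, "attr2": None},
    {"attr1": None, "attr2": None},
    {"attr1": 20, "attr2": None},
]
by_mean = fill_with_group_mean(group, ["attr1", "attr2"])
by_sentinel = fill_with_sentinel(group, ["attr1", "attr2"])
```

A sentinel keeps the information that a value was missing, which the text suggests can itself hint at a malfunction, whereas the mean hides it.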
In some embodiments, the computing device 120 may also determine, based on the operational data, an attribute change value for a particular attribute of the drive, the attribute change value indicating a degree of change in the value of the particular attribute of the drive at a first time.
In some embodiments, the computing device 120 may calculate a difference between the value of a particular attribute of the drive at a first time and the value of the attribute at a previous time. For example, the difference may be expressed as $|A_i - A_{i-n}|$, where $A_i$ represents the value of the attribute of the drive at the first time (e.g., day i) and $A_{i-n}$ represents the value of the attribute of the drive at a previous time (e.g., n days before day i).
In some embodiments, the computing device 120 may calculate the variance of a particular attribute of the drive over a past predetermined time period relative to the first time. For example, the variance may be expressed as:
$$\sigma^2 = \sum_{i} p_i \, (A_i - \mu)^2$$
where the sum runs over the n days of the period, $p_i$ indicates the probability of day i, which in this example may be 1/n, $A_i$ represents the value of the attribute of the drive on day i, and $\mu$ represents the average of the values of the attribute of the drive over the n days.
In some embodiments, computing device 120 may add the attribute change value to a second data item in the operational data that corresponds to the first time. For example, in the example of FIG. 2, the computing device 120 may add, for each data item, features indicating the degree of change in the value of each attribute, such as the difference and variance described above. Adding features that indicate the degree of change in the attribute values allows the data to support more accurate drive failure prediction.
At block 308, the computing device 120 processes the adjusted operational data using a machine learning model to obtain a failure prediction of whether the drive will fail within a predetermined period of time after the respective time.
In some embodiments, in the training phase of the machine learning model, computing device 120 may input the adjusted operational data into the machine learning model to obtain a corresponding failure prediction. In addition, a failure label corresponding to each data item in the training data may also be input into the machine learning model, where the failure label indicates whether the drive failed within the predetermined period of time after the time corresponding to the data item. For example, for a data item associated with day 1 of a drive, the label may indicate whether the drive failed within 2 weeks after day 1. The computing device 120 may train the machine learning model by reducing the distance between the model's failure predictions and the actual labels, thereby obtaining a trained drive failure prediction model.
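The two change features can be computed as in the sketch below. It assumes one value per day in a list indexed by day, and it assumes that the variance window is the n days ending at day i inclusive, which the text does not state. The function names are hypothetical.

```python
def abs_difference(values, i, n):
    """|A_i - A_{i-n}|: change relative to the value n days earlier."""
    if i - n < 0:
        raise ValueError("not enough history for the requested lag")
    return abs(values[i] - values[i - n])

def window_variance(values, i, n):
    """Variance over the n days ending at day i, with p_i = 1/n.

    Treating day i as part of the window is an assumption of this sketch.
    """
    if i - n + 1 < 0:
        raise ValueError("not enough history for the requested window")
    window = values[i - n + 1 : i + 1]
    mu = sum(window) / n
    return sum((a - mu) ** 2 for a in window) / n

bad_sectors = [5, 5, 6, 9]  # hypothetical daily readings of one attribute
item = {"day": 3, "bad_sectors": bad_sectors[3]}
item["bad_sectors_diff"] = abs_difference(bad_sectors, i=3, n=2)
item["bad_sectors_var"] = window_variance(bad_sectors, i=3, n=2)
```

For these readings the difference over two days is |9 - 5| = 4, and the variance of the window [6, 9] around its mean 7.5 is 2.25.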
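Deriving the failure label for a data item can be sketched as follows. Days are represented as integers, and the choice to count only failures strictly after the item's day, up to and including the last day of the horizon, is an assumption. The function name and parameters are hypothetical.

```python
def fault_label(item_day, failure_day, horizon_days=14):
    """True if the drive failed within horizon_days after item_day.

    failure_day is None for a drive that never failed in the recorded data.
    """
    if failure_day is None:
        return False
    return 0 < failure_day - item_day <= horizon_days

# A drive observed on day 1 that fails on day 10 gets a positive label
# for a 2-week horizon; a failure on day 16 falls outside the horizon.
label_near = fault_label(item_day=1, failure_day=10)
label_far = fault_label(item_day=1, failure_day=16)
```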
In some embodiments, the machine learning model may be a decision tree model, preferably the machine learning model may be a random forest model.
In some embodiments, after completing the training of the model, the computing device 120 may also process the test run data using the trained machine learning model to obtain a test failure prediction of whether the drive failed within a predetermined time period after the time corresponding to the test run data. In some embodiments, both the test data and the training data may be historical operating data of the drive, for example, the historical operating data may be divided by time, with a portion of the historical operating data being selected as training data and the remaining portion being selected as test data.
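A time-based split of historical operating data can be sketched as below. The `timestamp` key, the `cutoff` parameter, and the choice to put earlier items in the training set are assumptions for illustration.

```python
def split_by_time(items, cutoff):
    """Split historical data items into training and test sets by time.

    Items with a timestamp before the cutoff go to training, the rest to
    testing, so the model is never tested on data older than its training data.
    """
    train = [it for it in items if it["timestamp"] < cutoff]
    test = [it for it in items if it["timestamp"] >= cutoff]
    return train, test

history = [{"id": "ID-1", "timestamp": day} for day in range(1, 11)]
train, test = split_by_time(history, cutoff=8)
```

Splitting by time rather than at random mirrors how the model is used in practice, where predictions are always made for data newer than anything seen in training.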
In some embodiments, the computing device 120 may also aggregate the predicted outcomes for the plurality of data items over a particular time period as a final predicted outcome. For example, fig. 4 shows a schematic diagram 400 of test failure prediction according to an embodiment of the disclosure. In this example, computing device 120 has selected data items 410-1, 410-2, 410-3, 410-4, 410-5, 410-6, and 410-7 that correspond to drive ID-1 operating states that differ by 7 days. Computing device 120 may use the trained machine learning model to predict failure labels 420-1, 420-2, 420-3, 420-4, 420-5, and 420-6 corresponding to each data item in turn. Specifically, in the prediction process, the computing device 120 may determine, when a failure is predicted, whether or not a duty ratio at which the failure is predicted exceeds a predetermined threshold value within a predetermined time period (for convenience of description, referred to as a verification time period) before a time corresponding to the failure, and in a case where the duty ratio exceeds the threshold value, the computing device 120 may output the failure prediction as a one-time formal prediction. For example, the computing device 120 may set the predetermined time to 3 days and may set the duty ratio to 1/2. For the example of FIG. 4, when the computing device 120 predicts the failure tag 420-4 on day 4, it only fails on day 4 for three days, so the failure tag 420-4 on day 4 may be temporarily not output and the failure is again predicted on day 6, and when the percentage of failures is predicted to exceed the threshold (1/2) within 3 days, the computing device 120 may output the prediction as a formal failure prediction. In this manner, the computing device 120 may avoid mispredictions due to abrupt changes in data under certain conditions, thereby improving the accuracy of drive failure prediction.
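The verification rule can be sketched as follows, with a 3-day verification time period and a threshold of 1/2 as in the example. The sequence of daily predictions is hypothetical and only loosely follows the FIG. 4 discussion. Counting the current day inside the window and dividing by the full window length are assumptions of this sketch.

```python
def formal_predictions(daily_failure_flags, window=3, threshold=0.5):
    """Turn raw per-day failure predictions into formal ones.

    A raw failure prediction on a given day is output as formal only if the
    share of failure predictions in the last `window` days (including that
    day) exceeds `threshold`. Days before the start of the data count as
    non-failures because the divisor is always `window`.
    """
    formal = []
    for day, flagged in enumerate(daily_failure_flags):
        if not flagged:
            formal.append(False)
            continue
        start = max(0, day - window + 1)
        recent = daily_failure_flags[start : day + 1]
        formal.append(sum(recent) / window > threshold)
    return formal

# Hypothetical raw predictions for days 1..7: failure on day 4 and day 6.
raw = [False, False, False, True, False, True, False]
formal = formal_predictions(raw, window=3, threshold=0.5)
```

On day 4 only one of the three days in the window is a failure prediction (1/3), so nothing is output. On day 6 the window covers days 4 to 6 with two failure predictions (2/3), which exceeds 1/2, so the prediction becomes formal.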
In some embodiments, based on the test failure prediction, the computing device 120 may adjust one or more hyper-parameters of the machine learning model. The computing device 120 may adjust one or more hyper-parameters of the machine learning model based on a difference between the results of the test failure prediction and the actual results to further optimize the model to improve the accuracy of the drive failure prediction. In some embodiments, the one or more hyper-parameters may include the following two categories: model configuration parameters, such as the number of nodes, the number of layers, etc. in the decision tree model; data processing parameters such as the length of the predetermined time period for failure prediction, the length of the check time period used during use, the size of the time window for each data item entered, the number of samples, etc. By adjusting the hyper-parameters of the machine learning model during the testing phase, the computing device 120 may obtain a more accurate machine learning model for fault prediction.
In some embodiments, after completing the model training and testing described above, the computing device 120 may receive the operational data processed as described above using the final machine learning model to obtain a failure prediction of whether the drive failed within a predetermined period of time after a corresponding time. In some embodiments, the computing device 120 may, upon predicting that the drive will fail within a predetermined time, issue an alert to a drive manager to remind the drive manager to perform feature operations such as replacing the drive.
In some embodiments, in the example where the machine learning model is a random forest model, the computing device 120 may also present, in use, a priority of the one or more attributes input to the random forest model, based on the attribute importance ranking of the random forest model. This shows the user more intuitively which attributes are more likely to determine whether the drive will fail, which helps management personnel monitor the state of the drive and replace a potentially failing drive in time.
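Presenting attribute priorities from importance scores could look like the sketch below. The attribute names and scores are hypothetical and the function `rank_attributes` is illustrative. The disclosure does not name a library; in scikit-learn, for instance, a fitted `RandomForestClassifier` exposes such scores through its `feature_importances_` attribute.

```python
from typing import Dict, List, Tuple


def rank_attributes(importances: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Sort attributes by descending importance and attach a 1-based priority."""
    ordered = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ordered, start=1)]


# Hypothetical importance scores; names and numbers are illustrative only.
scores = {
    "bms_total_errors": 0.42,
    "bms_error_delta": 0.31,
    "temperature": 0.09,
    "power_on_hours": 0.18,
}
for rank, name, score in rank_attributes(scores):
    print(f"{rank}. {name}: {score:.2f}")
```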
In the manner described above, the scheme of the present disclosure can be applied directly to the uncleaned raw operational data obtained by a drive manager. Data preprocessing improves the quality of the operational data, and a machine learning model then automatically predicts, based on the processed operational data, whether a drive will fail, so that the accuracy of drive failure prediction is greatly improved.
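The preprocessing summarized above (removing data items with too many null values, setting remaining null values to a predetermined value, and adding an attribute change value, as recited in claims 2 to 4) can be sketched as follows. This is an illustration under stated assumptions: data items are modeled as dictionaries with `None` for null values, the change value is taken as the difference from the previous retained data item, and the names `preprocess`, `null_ratio_threshold`, `fill_value`, and the `_delta` suffix are hypothetical.

```python
from typing import Dict, List, Optional

DataItem = Dict[str, Optional[float]]


def preprocess(items: List[DataItem],
               attributes: List[str],
               null_ratio_threshold: float = 0.5,
               fill_value: float = 0.0,
               change_attribute: Optional[str] = None) -> List[Dict[str, float]]:
    """Clean time-ordered data items for one drive.

    1. Drop items whose share of null attributes exceeds the threshold.
    2. Replace remaining nulls with a predetermined value.
    3. Optionally add a change value for one attribute (which must be
       listed in `attributes`), computed against the previous kept item.
    """
    cleaned: List[Dict[str, float]] = []
    for item in items:
        nulls = sum(item.get(a) is None for a in attributes)
        if nulls / len(attributes) > null_ratio_threshold:
            continue  # too incomplete to be useful
        row = {a: fill_value if item.get(a) is None else float(item[a])
               for a in attributes}
        if change_attribute is not None:
            # The first kept item has no predecessor, so its change is 0.
            previous = cleaned[-1][change_attribute] if cleaned else row[change_attribute]
            row[change_attribute + "_delta"] = row[change_attribute] - previous
        cleaned.append(row)
    return cleaned


# Hypothetical operational data for one drive on three consecutive days.
items = [
    {"bms_errors": 2.0, "temperature": 35.0},
    {"bms_errors": None, "temperature": None},   # all null: removed
    {"bms_errors": 5.0, "temperature": None},    # half null: kept, filled
]
result = preprocess(items, ["bms_errors", "temperature"], change_attribute="bms_errors")
print(result)
```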
Fig. 5 illustrates a schematic block diagram of an example device 500 that may be used to implement embodiments of the present disclosure. For example, computing device 120 as shown in Fig. 1 may be implemented by device 500. As shown, device 500 includes a Central Processing Unit (CPU) 501 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 502 or loaded from a storage unit 508 into a Random Access Memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the device 500. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The various procedures and processes described above, such as method 300, may be performed by the CPU 501. For example, in some embodiments, the method 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by CPU 501, one or more of the acts of method 300 described above may be performed.
The present disclosure may be methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (17)

1. A method for predicting drive failure, comprising:
obtaining operational data for a drive, each data item in the operational data indicating values for one or more attributes of the drive at a respective time;
identifying null values for the one or more attributes from data items of the operational data;
adjusting the operational data based at least on the identification of the null value; and
processing the adjusted operational data with a machine learning model to obtain a failure prediction of whether the drive will fail within a predetermined time period after the respective time.
2. The method of claim 1, wherein adjusting the operational data comprises:
determining a proportion of null values identified in a first data item of the operational data; and
removing the first data item from the operational data in response to the proportion exceeding a proportion threshold.
3. The method of claim 1 or 2, wherein adjusting the operational data comprises:
setting a null value identified in a data item of the operational data to a predetermined value.
4. The method of claim 1 or 2, wherein adjusting the operational data further comprises:
determining, based on the operational data, an attribute change value for a particular attribute of the drive, the attribute change value indicating a degree of change in a value of the particular attribute of the drive at a first time; and
adding the attribute change value to a second data item in the operational data corresponding to the first time.
5. The method of claim 1, wherein the drive is a Serial Attached Small Computer System Interface (SAS) drive, the one or more attributes including a background media scan attribute of the drive.
6. The method of claim 1, wherein the machine learning model is a random forest model.
7. The method of claim 1, further comprising:
processing test operational data using the machine learning model to obtain a test failure prediction of whether the drive will fail within a predetermined time period after a time corresponding to the test operational data; and
adjusting one or more hyper-parameters of the machine learning model based on the test failure prediction.
8. The method of claim 7, wherein the one or more hyper-parameters comprise a length of the predetermined time period.
9. An apparatus for predicting drive failure, comprising:
at least one processing unit;
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, which when executed by the at least one processing unit, cause the apparatus to perform acts comprising:
obtaining operational data for a drive, each data item in the operational data indicating values for one or more attributes of the drive at a respective time;
identifying null values for the one or more attributes from data items of the operational data;
adjusting the operational data based at least on the identification of the null value; and
processing the adjusted operational data with a machine learning model to obtain a failure prediction of whether the drive will fail within a predetermined time period after the respective time.
10. The apparatus of claim 9, wherein adjusting the operational data comprises:
determining a proportion of null values identified in a first data item of the operational data; and
removing the first data item from the operational data in response to the proportion exceeding a proportion threshold.
11. The apparatus of claim 9 or 10, wherein adjusting the operational data comprises:
setting a null value identified in a data item of the operational data to a predetermined value.
12. The apparatus of claim 9 or 10, wherein adjusting the operational data further comprises:
determining, based on the operational data, an attribute change value for a particular attribute of the drive, the attribute change value indicating a degree of change in a value of the particular attribute of the drive at a first time; and
adding the attribute change value to a second data item in the operational data corresponding to the first time.
13. The apparatus of claim 9, wherein the drive is a Serial Attached Small Computer System Interface (SAS) drive, the one or more attributes comprising a background media scan attribute of the drive.
14. The apparatus of claim 9, wherein the machine learning model is a random forest model.
15. The apparatus of claim 9, the acts further comprising:
processing test operational data using the machine learning model to obtain a test failure prediction of whether the drive will fail within a predetermined time period after a time corresponding to the test operational data; and
adjusting one or more hyper-parameters of the machine learning model based on the test failure prediction.
16. The apparatus of claim 15, wherein the one or more hyper-parameters comprise a length of the predetermined time period.
17. A computer program product stored in a non-transitory computer storage medium and comprising machine executable instructions that when run in a device cause the device to perform the actions of any of claims 1 to 8.
CN201910048750.9A 2019-01-18 2019-01-18 Method, apparatus and computer program product for predicting drive failure Active CN111459692B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910048750.9A CN111459692B (en) 2019-01-18 2019-01-18 Method, apparatus and computer program product for predicting drive failure
US16/414,196 US10996861B2 (en) 2019-01-18 2019-05-16 Method, device and computer product for predicting disk failure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910048750.9A CN111459692B (en) 2019-01-18 2019-01-18 Method, apparatus and computer program product for predicting drive failure

Publications (2)

Publication Number Publication Date
CN111459692A true CN111459692A (en) 2020-07-28
CN111459692B CN111459692B (en) 2023-08-18

Family

ID=71608867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910048750.9A Active CN111459692B (en) 2019-01-18 2019-01-18 Method, apparatus and computer program product for predicting drive failure

Country Status (2)

Country Link
US (1) US10996861B2 (en)
CN (1) CN111459692B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637560A (en) * 2020-12-16 2022-06-17 伊姆西Ip控股有限责任公司 Device management method, electronic device, and computer program product
CN116244113A (en) * 2023-02-22 2023-06-09 安芯网盾(北京)科技有限公司 System downtime obstacle avoidance and restoration method and device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220080915A (en) 2020-12-08 2022-06-15 삼성전자주식회사 Method for operating storage device and host device, and storage device
CN114647524A (en) * 2020-12-17 2022-06-21 伊姆西Ip控股有限责任公司 Method, apparatus and computer program product for managing a storage system
CN114282342A (en) 2021-11-09 2022-04-05 三星(中国)半导体有限公司 Fault prediction method and device for storage device
CN117271229A (en) * 2022-06-09 2023-12-22 中兴智能科技南京有限公司 Hard disk fault prediction method and device, storage medium and electronic device
CN115729761B (en) * 2022-11-23 2023-10-20 中国人民解放军陆军装甲兵学院 Hard disk fault prediction method, system, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030149613A1 (en) * 2002-01-31 2003-08-07 Marc-David Cohen Computer-implemented system and method for performance assessment
US20040260967A1 (en) * 2003-06-05 2004-12-23 Copan Systems, Inc. Method and apparatus for efficient fault-tolerant disk drive replacement in raid storage systems
US20170364821A1 (en) * 2016-06-21 2017-12-21 Tata Consultancy Services Limited Method and system for analyzing driver behaviour based on telematics data
CN108304941A (en) * 2017-12-18 2018-07-20 中国软件与技术服务股份有限公司 A kind of failure prediction method based on machine learning
CN108446734A (en) * 2018-03-20 2018-08-24 中科边缘智慧信息科技(苏州)有限公司 Disk failure automatic prediction method based on artificial intelligence

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10643138B2 (en) * 2015-01-30 2020-05-05 Micro Focus Llc Performance testing based on variable length segmentation and clustering of time series data
US11579951B2 (en) * 2018-09-27 2023-02-14 Oracle International Corporation Disk drive failure prediction with neural networks

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637560A (en) * 2020-12-16 2022-06-17 伊姆西Ip控股有限责任公司 Device management method, electronic device, and computer program product
US11995219B2 (en) 2020-12-16 2024-05-28 EMC IP Holding Company LLC Method, electronic equipment, and computer program product for device management
CN116244113A (en) * 2023-02-22 2023-06-09 安芯网盾(北京)科技有限公司 System downtime obstacle avoidance and restoration method and device
CN116244113B (en) * 2023-02-22 2023-12-19 安芯网盾(北京)科技有限公司 System downtime obstacle avoidance and restoration method and device

Also Published As

Publication number Publication date
US20200233587A1 (en) 2020-07-23
US10996861B2 (en) 2021-05-04
CN111459692B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN111459692A (en) Method, apparatus and computer program product for predicting drive failure
US11023325B2 (en) Resolving and preventing computer system failures caused by changes to the installed software
CN112436968B (en) Network traffic monitoring method, device, equipment and storage medium
US10962968B2 (en) Predicting failures in electrical submersible pumps using pattern recognition
US10579453B2 (en) Stream-processing data
EP3407200B1 (en) Method and device for updating online self-learning event detection model
KR101713985B1 (en) Method and apparatus for prediction maintenance
CN108460397B (en) Method and device for analyzing equipment fault type, storage medium and electronic equipment
CN110275814A (en) A monitoring method and device for a business system
US10831711B2 (en) Prioritizing log tags and alerts
US11886276B2 (en) Automatically correlating phenomena detected in machine generated data to a tracked information technology change
US20200103886A1 (en) Computer System and Method for Evaluating an Event Prediction Model
CN109960635B (en) Monitoring and alarming method, system, equipment and storage medium of real-time computing platform
WO2014145977A1 (en) System and methods for automated plant asset failure detection
CN115280337B (en) Machine learning based data monitoring
AU2019275633B2 (en) System and method of automated fault correction in a network environment
JP2017097712A (en) Instrument diagnosis device and system as well as method
CN117251114A (en) Model training method, disk life prediction method, related device and equipment
US20210027254A1 (en) Maintenance management apparatus, system, method, and non-transitory computer readable medium
CN109992477A (en) Information processing method, system and electronic equipment for electronic equipment
CN117668737B (en) Pipeline detection data fault early warning checking method and related device
CN117457059A (en) Fault detection method and device for SSD and electronic equipment
US20220083320A1 (en) Maintenance of computing devices
CN114297034B (en) Cloud platform monitoring method and cloud platform
US11461007B2 (en) Method, device, and computer program product for determining failures and causes of storage devices in a storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant