WO2018126367A1 - Data cleaning method and apparatus - Google Patents

Data cleaning method and apparatus

Info

Publication number
WO2018126367A1
WO2018126367A1 PCT/CN2017/070190 CN2017070190W WO2018126367A1 WO 2018126367 A1 WO2018126367 A1 WO 2018126367A1 CN 2017070190 W CN2017070190 W CN 2017070190W WO 2018126367 A1 WO2018126367 A1 WO 2018126367A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
segment
time
abnormal
data segment
Prior art date
Application number
PCT/CN2017/070190
Other languages
English (en)
French (fr)
Inventor
黄建华
康宏
Original Assignee
上海温尔信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海温尔信息科技有限公司 filed Critical 上海温尔信息科技有限公司
Priority to PCT/CN2017/070190 priority Critical patent/WO2018126367A1/zh
Publication of WO2018126367A1 publication Critical patent/WO2018126367A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present application relates to the field of data processing technologies, and in particular, to a data cleaning method and apparatus.
  • a wearable device is a portable device that can be worn directly on the user or integrated into the user's clothing or accessories.
  • the wearable device can continuously collect relevant data of the user and upload the collected data to the server to implement data interaction.
  • the user's health index, behavioral habits, life preferences, and the like can be analyzed.
  • the data can be denoised to remove noise in the data, improve the reliability of the data, and improve the accuracy of the analysis results.
  • much of the data that a wearable device collects from the human body is affected by the complexity of the usage scenario, by problems with the device itself, or by problems with its connection to the mobile device. Erroneous data may therefore occur, such as data loss caused by the device falling off or by a communication failure. Erroneous data needs to be eliminated in time so that it does not affect subsequent analysis and calculation. The inventors therefore concluded that data collected by a wearable device needs to be thoroughly cleaned before use, so as to obtain clean and reliable data and improve the accuracy of analysis based on that data.
  • the embodiment of the present application provides a data cleaning method, including:
  • the step of removing the abnormal data includes: setting a time window based on a time domain characteristic of the physical quantity; dividing the data sequence into at least one data segment by using the time window; and removing the abnormal data in each of the at least one data segment.
  • the step of removing the abnormal data in a first data segment of the at least one data segment includes: removing data exceeding a data range in the first data segment; and/or removing data in the first data segment whose volatility is greater than a fluctuation threshold.
  • the step of removing the data exceeding the data range includes: calculating a mean and a variance of the data in the first data segment; setting an upper boundary and a lower boundary of the data range according to the mean and the variance; and removing data in the first data segment that is larger than the upper boundary or smaller than the lower boundary.
  • the step of removing the data whose volatility is greater than the fluctuation threshold includes: calculating a differential of the data in the first data segment; and removing data in the first data segment for which the absolute value of the differential is greater than a differential threshold.
  • the step of uniformly processing the continuous data segment includes: identifying the continuous data segment, according to a set time interval threshold, from the data sequence after the abnormal data is removed; and uniformly processing the continuous data segment in time by data interpolation.
  • the step of identifying the continuous data segment includes: determining, according to the time interval threshold, a data interruption point in the data sequence after the abnormal data is removed; and identifying, as the continuous data segment, a data segment delimited by the data interruption point in the data sequence after the abnormal data is removed.
  • the step of uniformly processing the continuous data segment includes: determining at least one uniformization time point corresponding to the continuous data segment; and interpolating, from the data in the continuous data segment, the data corresponding to the at least one uniformization time point.
  • the embodiment of the present application further provides a data cleaning device, including:
  • An acquiring unit configured to acquire a data sequence of a physical quantity time domain sample
  • a removing unit configured to remove abnormal data in the data sequence based on a time domain characteristic of the physical quantity
  • a uniform processing unit for uniformly processing contiguous data segments in the data sequence after the abnormal data is removed in time.
  • the removing unit includes: a setting subunit, configured to set a time window based on a time domain characteristic of the physical quantity; a dividing subunit, configured to divide the data sequence into at least one data segment by using the time window; and a removing subunit, configured to remove the abnormal data in each of the at least one data segment.
  • the removing subunit is specifically configured to: remove data exceeding the data range in the first data segment; and/or remove data in the first data segment whose volatility is greater than a fluctuation threshold.
  • the removing subunit is specifically configured to: calculate a mean and a variance of the data in the first data segment; set upper and lower boundaries of the data range according to the mean and the variance; and remove data in the first data segment that is larger than the upper boundary or smaller than the lower boundary.
  • the removing subunit is specifically configured to: calculate a differential of the data in the first data segment; and remove data in the first data segment for which the absolute value of the differential is greater than a differential threshold.
  • the uniform processing unit includes: an identifying subunit, configured to identify the continuous data segment, according to a set time interval threshold, from the data sequence after the abnormal data is removed; and an interpolation subunit, configured to uniformly process the continuous data segment in time by data interpolation.
  • the identifying subunit is specifically configured to: determine, according to the time interval threshold, a data interruption point in the data sequence after the abnormal data is removed; and identify, as the continuous data segment, a data segment delimited by the data interruption point in the data sequence after the abnormal data is removed.
  • the interpolation subunit is specifically configured to: determine at least one homogenization time point corresponding to the continuous data segment; and interpolate, from the data in the continuous data segment, the data corresponding to the at least one homogenization time point.
  • the embodiment of the present application further provides a computer storage medium, which stores the following program instructions:
  • a second program instruction configured to remove abnormal data in the data sequence based on a time domain characteristic of the physical quantity
  • the third program instruction is configured to uniformly process the continuous data segments in the data sequence after the abnormal data is removed in time.
  • the embodiment of the present application further provides an electronic device, including:
  • a memory configured to store a computer program
  • a communication interface configured to implement communication between the electronic device and other devices
  • a processor coupled to the memory and the communication interface is configured to execute the computer program for:
  • when removing the abnormal data, the processor is specifically configured to: set a time window based on a time domain characteristic of the physical quantity; divide the data sequence into at least one data segment by using the time window; and remove the abnormal data in each of the at least one data segment.
  • when removing the abnormal data in the first data segment, the processor is specifically configured to: remove data exceeding the data range in the first data segment; and/or remove data in the first data segment whose volatility is greater than the fluctuation threshold.
  • when removing the data exceeding the data range, the processor is specifically configured to: calculate a mean and a variance of the data in the first data segment; set upper and lower boundaries of the data range according to the mean and the variance; and remove data in the first data segment that is larger than the upper boundary or smaller than the lower boundary.
  • when removing the data whose volatility is greater than the fluctuation threshold, the processor is specifically configured to: calculate a differential of the data in the first data segment; and remove data in the first data segment for which the absolute value of the differential is greater than a differential threshold.
  • when uniformly processing the continuous data segment, the processor is specifically configured to: identify the continuous data segment, according to a set time interval threshold, from the data sequence after the abnormal data is removed; and uniformly process the continuous data segment in time by data interpolation.
  • when identifying the continuous data segment, the processor is specifically configured to: determine, according to the time interval threshold, a data interruption point in the data sequence after the abnormal data is removed; and identify, as the continuous data segment, a data segment delimited by the data interruption point in the data sequence after the abnormal data is removed.
  • when uniformly processing the continuous data segment, the processor is specifically configured to: determine at least one homogenization time point corresponding to the continuous data segment; and interpolate, from the data in the continuous data segment, the data corresponding to the at least one homogenization time point.
  • the abnormal data is removed based on the time domain characteristics of the physical quantity, and uniformly processed in time.
  • FIG. 1 is a schematic structural diagram of a data cleaning system according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a data cleaning method according to another embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data cleaning method according to another embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a data cleaning method according to another embodiment of the present application.
  • FIG. 5 is a schematic flowchart diagram of a data cleaning method according to another embodiment of the present application.
  • FIG. 6 is a schematic flowchart diagram of a data cleaning method according to another embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a data cleaning apparatus according to another embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of an electronic device according to another embodiment of the present disclosure.
  • the data cleaning method provided by the embodiment of the present application can be implemented based on the data cleaning system shown in FIG. 1, but is not limited thereto.
  • the data cleaning system includes: a data collection device 10 and a data cleaning device 20; and the data collection device 10 is communicatively coupled to the data cleaning device 20.
  • the data collection device 10 is configured to perform time domain sampling on a physical quantity to obtain sampling data of the physical quantity.
  • the data collection device 10 can report the sampled data of the physical quantity directly to the data cleaning device 20, so that the data cleaning device 20 obtains the data sequence of the time-domain samples of the physical quantity. Alternatively,
  • the data collection device 10 may store the sampled data of the physical quantity in a database, from which the data cleaning device 20 acquires the data sequence of the time-domain samples of the physical quantity.
  • the data cleaning device 20 is mainly used to acquire the data sequence of the time-domain samples of the physical quantity and to clean the data sequence, so as to obtain reliable and accurate data that serves as basic data for subsequent application or analysis.
  • the data collection device 10 and the data cleaning device 20 may be connected by wireless or wired network.
  • the network standard of the mobile network may be any of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, etc., where GSM is the Global System for Mobile Communications, GPRS is the General Packet Radio Service, WCDMA is Wideband Code Division Multiple Access, TD-SCDMA is Time Division-Synchronous Code Division Multiple Access, CDMA2000 is Code Division Multiple Access 2000, UMTS is the Universal Mobile Telecommunications System, and LTE is Long Term Evolution.
  • the physical quantity in this embodiment may be any physical quantity that supports time domain acquisition, and may be, for example, temperature or humidity.
  • the data collection device 10 in this embodiment may be any device capable of performing time domain acquisition on physical quantities, for example, various sensors. Taking the physical quantity as the temperature, especially the body temperature of the human body as an example, the data acquisition device may be a wearable device with a temperature sensor.
  • the data cleaning device 20 in this embodiment may be any device having data storage and data processing functions such as a server, a computer, a tablet computer, a smart terminal, and the like.
  • FIG. 2 is a schematic flowchart diagram of a data cleaning method according to another embodiment of the present application. As shown in FIG. 2, the method includes:
  • the data cleaning device 20 acquires a data sequence of time-domain samples of a physical quantity.
  • the data cleaning device 20 can acquire a data sequence formed by the data acquisition device 10 performing time domain sampling on the physical quantity.
  • the data sequence includes sampled data of the physical quantities at different points in time and corresponding timestamps.
  • the data collection device 10 performs time domain sampling on the physical quantity and automatically adds a time stamp to each sampled data. Optionally, the data collection device 10 reports the time-stamped sample data to the data cleaning device 20, so that the data cleaning device 20 obtains the time-domain sampled data sequence of the physical quantity. Alternatively, the data collection device 10 stores the time-stamped sample data in a database, from which the data cleaning device 20 obtains the time-domain sampled data sequence of the physical quantity.
  • the data collection device 10 starts at a specified point in time and samples the physical quantity at uniform time intervals, without time-stamping the sampled data.
  • the data collection device 10 reports the sample data without time stamps to the data cleaning device 20, and the data cleaning device 20 adds a time stamp to each sample data to obtain the data sequence of the time-domain samples of the physical quantity.
  • the data collection device 10 stores the sample data without time stamps in a database. During storage, the sample data is time-stamped, so that the data cleaning device 20 can acquire the data sequence of the time-domain samples of the physical quantity from the database.
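  • as a minimal sketch of how time stamps could be added afterwards when sampling starts at a known time and proceeds at uniform intervals: the function name `add_timestamps` and its parameters are hypothetical and not taken from the application.

```python
from datetime import datetime, timedelta


def add_timestamps(samples, start, interval_s):
    """Pair each untimed sample with start + i * interval (hypothetical helper)."""
    step = timedelta(seconds=interval_s)
    return [(start + i * step, value) for i, value in enumerate(samples)]


# Three body temperature samples taken every 10 seconds from 12:00:00.
sequence = add_timestamps([36.5, 36.6, 36.6], datetime(2017, 1, 4, 12, 0, 0), 10)
```

  • this only works when no sample was dropped before stamping; a lost sample would shift every later time stamp.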
  • owing to its own physical and mathematical properties, a physical quantity generally has certain time domain characteristics. For example, some physical quantities are continuous over a certain time range, and their changes are gentle, without sudden jumps or rapid changes.
  • the body temperature of the human body is continuous and does not jump suddenly; if a sudden jump appears in the collected body temperature data, it should be attributed to an abnormal situation in the collection process, not to an actual jump in the body temperature of the sampled subject.
  • the body temperature changes relatively gently. Generally, body temperature does not change by more than 0.05 degrees per second. If the collected body temperature data changes faster than this, it should be attributed to an abnormal situation in the collection process, not to the body temperature of the measured subject actually changing at that rate.
  • the data cleaning device 20 removes the abnormal data in the data sequence acquired in step 201 based on the time domain characteristics of the physical quantity.
  • the data sequence after the abnormal data is removed includes data conforming to the time domain characteristics of the physical quantity, and the data is reliable and accurate.
  • in step 203, because step 202 removes the abnormal data from the data sequence, the data sequence may no longer be continuous or uniform in time, which makes it inconvenient to use as a whole; however, the continuous data segments in the data sequence still have value.
  • the data cleaning device 20 uniformly processes the continuous data segments in the data sequence after the removal of the abnormal data in time to provide a reliable, temporally continuous and uniform continuous data segment for subsequent use.
  • a continuous data segment is a data segment, in the data sequence after the abnormal data is removed, in which the time interval between all adjacent data is less than a preset time interval threshold.
  • the value of the time interval threshold may be different according to the application scenario and the physical quantity. This embodiment does not limit the value of the time interval threshold, and can be adaptively set.
  • in this embodiment, for a data sequence of time-domain samples of a physical quantity, the abnormal data is removed based on the time domain characteristics of the physical quantity, and the continuous data segments in the data sequence after the abnormal data is removed are uniformly processed in time. This cleans the time-domain sampled data sequence and finally yields reliable and accurate sampled data, thereby improving the accuracy of analysis based on the sampled data.
  • removing the abnormal data in the data sequence based on the time domain characteristics of the physical quantity may include the following steps:
  • in step 2021, a time window is set that reflects the time domain characteristics of the physical quantity, that is, how the physical quantity changes over time.
  • for example, when human body temperature is collected continuously, the body temperature generally does not change by more than 0.5 degrees within 3 minutes, so the time window can be set to 3 minutes. This means that, in the body temperature data within 3 minutes, body temperature data that changes by more than 0.5 degrees is abnormal data.
  • likewise, when the human heart rate is collected continuously, the heart rate generally does not change by more than 15 beats within 10 seconds, so the time window can be set to 10 seconds. This means that, in the heart rate data within 10 seconds, heart rate data that changes by more than 15 beats is abnormal data.
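  • the two examples above can be written down as a small configuration table. The following is only an illustrative sketch: the class `TimeDomainProfile`, the `PROFILES` table, and the check `exceeds_expected_change` are hypothetical names, and comparing the maximum and minimum within one window is an assumed reading of "changes by more than".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeDomainProfile:
    window_s: float    # length of the time window, in seconds
    max_change: float  # largest plausible change within one window


# Values taken from the body temperature and heart rate examples in the text.
PROFILES = {
    "body_temperature": TimeDomainProfile(window_s=180, max_change=0.5),
    "heart_rate": TimeDomainProfile(window_s=10, max_change=15),
}


def exceeds_expected_change(values, profile):
    """True if the spread of values in one window is larger than expected."""
    return bool(values) and max(values) - min(values) > profile.max_change
```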
  • in step 2022, based on the time window set in step 2021, the data sequence can be divided into at least one data segment, each data segment having the time length of the time window.
  • the data sequence is divided by the time window into at least one data segment, and the data segments do not overlap. Further, if the time length of the last data segment in the data sequence is less than the length of the time window, but the ratio of that time length to the time window is greater than or equal to a specified ratio, for example 1/3, the last data segment is kept as a separate data segment. Conversely, if the ratio is less than the specified ratio, for example less than 1/3, meaning that the last data segment covers less than 1/3 of the time window, the last data segment is merged into the immediately preceding data segment.
  • for example, for a data sequence spanning 12:00:00-13:35:00 and a 30-minute time window, the data in 12:00:00-12:30:00 can be divided into one data segment, the data in 12:30:00-13:00:00 into another, and the data in 13:00:00-13:30:00 into a third; the data of the last 5 minutes is merged into the 13:00:00-13:30:00 data segment.
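  • the division into non-overlapping windows, including the handling of a short final segment, might be sketched as follows. This is an illustration under stated assumptions: time stamps are plain seconds, the sequence is sorted by time, and the names `split_by_window` and `min_tail_ratio` are hypothetical.

```python
def split_by_window(seq, window_s, min_tail_ratio=1 / 3):
    """Split (t_seconds, value) pairs into consecutive, non-overlapping windows.

    A final segment shorter than min_tail_ratio * window_s is merged into the
    segment before it, as described in the text.
    """
    if not seq:
        return []
    t0 = seq[0][0]
    buckets = {}
    for t, v in seq:
        buckets.setdefault(int((t - t0) // window_s), []).append((t, v))
    keys = sorted(buckets)
    segments = [buckets[k] for k in keys]
    # Time covered by the last segment, measured from the start of its window.
    tail_len = seq[-1][0] - (t0 + keys[-1] * window_s)
    if len(segments) > 1 and tail_len < min_tail_ratio * window_s:
        tail = segments.pop()
        segments[-1].extend(tail)
    return segments


# One sample per minute for 95 minutes, split with a 30-minute window.
seq = [(t, 36.5) for t in range(0, 5701, 60)]
segments = split_by_window(seq, 1800)
```

  • with these inputs the last 5 minutes fall below one third of the window, so they are merged into the third segment rather than kept as a fourth.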
  • in step 2023, the abnormal data is removed from the data segments divided in step 2022.
  • each time a data segment is divided in step 2022, the process proceeds to step 2023 to remove the abnormal data in that data segment, and then returns to step 2022. Alternatively,
  • after all data segments have been divided in step 2022, the process proceeds to step 2023 to remove the abnormal data in each data segment one by one.
  • the abnormal data may be removed in the following manner:
  • the step of removing the data exceeding the data range in the first data segment may be: calculating the mean and the variance of the data in the first data segment, denoted μ and σ respectively; setting the upper and lower boundaries of the data range according to the mean and the variance, denoted μ + kσ and μ - kσ respectively; and removing the data in the first data segment that is larger than the upper boundary μ + kσ or smaller than the lower boundary μ - kσ, that is, retaining only the data in the first data segment that lies within the data range. Here k is a coefficient, which can be determined according to the application scenario and the physical quantity.
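  • a range filter of this kind could look like the sketch below. Two points are assumptions rather than statements of the application: the spread term is taken to be the population standard deviation (the text speaks of the variance), and the coefficient is exposed as a hypothetical parameter `k` with an arbitrary default.

```python
from statistics import mean, pstdev


def remove_out_of_range(segment, k=2.0):
    """Keep only (t, value) pairs whose value lies within mean +/- k * spread."""
    if not segment:
        return []
    values = [v for _, v in segment]
    mu = mean(values)
    sigma = pstdev(values)  # assumption: standard deviation as the spread term
    if sigma == 0:
        return list(segment)  # constant segment: nothing is out of range
    lower, upper = mu - k * sigma, mu + k * sigma
    return [(t, v) for t, v in segment if lower <= v <= upper]


# Nine plausible readings and one reading taken after the sensor fell off.
segment = [(i, 36.5) for i in range(9)] + [(9, 30.0)]
cleaned = remove_out_of_range(segment, k=2.0)
```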
  • the step of removing the data in the first data segment whose volatility is greater than the fluctuation threshold may be: calculating a differential of the data in the first data segment; and removing the data in the first data segment for which the absolute value of the differential is greater than the differential threshold.
  • the volatility of the data is represented by the differential, and correspondingly the fluctuation threshold is embodied by a differential threshold. The absolute value of the differential of every data point in the data sequence can be compared with the differential threshold; differentials that exceed the differential threshold generally appear in clusters.
  • data whose absolute differential is greater than the differential threshold changes abnormally; for example, it may come from the initial stage of acquisition, from the end of acquisition, or from a moment when the acquisition object is lost for some reason (for example, the temperature measuring device falls off). Such data is generally abnormal data.
  • the differential may be calculated as a backward differential, dT(n) = (T(n) - T(n-1)) / (t(n) - t(n-1)), where n is a positive integer; dT(n) represents the differential at the nth time point; T(n) and T(n-1) represent the data at the nth time point and the (n-1)th time point, respectively; and t(n) and t(n-1) represent the nth time point and the (n-1)th time point, respectively.
  • alternatively, the differential may be calculated as a forward differential, dT(n) = (T(n+1) - T(n)) / (t(n+1) - t(n)), where n is a non-negative integer; dT(n) represents the differential at the nth time point; T(n) and T(n+1) represent the data at the nth time point and the (n+1)th time point, respectively; t(n) and t(n+1) represent the nth time point and the (n+1)th time point, respectively; and dT(end) and dT(end-1) represent the differential at the last time point and the differential at the penultimate time point, respectively.
  • alternatively, the differential may be calculated as a central differential, dT(n) = (T(n+1) - T(n-1)) / (t(n+1) - t(n-1)), where n is a positive integer; dT(n) represents the differential at the nth time point; T(n-1) and T(n+1) represent the data at the (n-1)th time point and the (n+1)th time point, respectively; t(n-1) and t(n+1) represent the (n-1)th time point and the (n+1)th time point, respectively; and dT(end) and dT(end-1) represent the differential at the last time point and the differential at the penultimate time point, respectively.
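  • a rate-of-change filter along these lines is sketched below using a backward difference quotient. The function names are hypothetical, time stamps are assumed to be strictly increasing, and giving the first point the same differential as the second is an assumption made here so that every point has a value.

```python
def backward_differential(seq):
    """Difference quotient of each (t, value) pair against its predecessor."""
    d = [
        (seq[i][1] - seq[i - 1][1]) / (seq[i][0] - seq[i - 1][0])
        for i in range(1, len(seq))
    ]
    return d[:1] + d  # assumption: the first point reuses the next differential


def remove_fast_changes(seq, max_rate):
    """Drop points whose absolute differential exceeds max_rate."""
    if len(seq) < 2:
        return list(seq)
    rates = backward_differential(seq)
    return [p for p, r in zip(seq, rates) if abs(r) <= max_rate]


# One sample per second; the fourth reading is a spike.
seq = [(0, 36.50), (1, 36.52), (2, 36.51), (3, 34.00), (4, 36.50)]
cleaned = remove_fast_changes(seq, max_rate=0.05)
```

  • note that the reading after the spike is dropped as well, because its backward differential is also large; this matches the remark that differentials over the threshold tend to appear in clusters.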
  • abnormal data in the second and third data segments in the at least one data segment may be removed in the same manner as the first data segment, but is not limited thereto.
  • the step of uniformly processing the continuous data segments in time may be: identifying the continuous data segments, according to the set time interval threshold, from the data sequence after the abnormal data is removed; and uniformly processing the continuous data segments in time by data interpolation.
  • the step of identifying the continuous data segment may be: determining, according to the time interval threshold, the data interruption points in the data sequence after the abnormal data is removed; and identifying the data segments delimited by the data interruption points in the data sequence after the abnormal data is removed as continuous data segments. Specifically, the difference between the timestamps of adjacent data in the data sequence after the abnormal data is removed is compared with the time interval threshold, and adjacent data whose timestamp difference is greater than the time interval threshold is taken as a data interruption point.
  • the data interruption points are used as segmentation points to divide the data sequence into at least one continuous data segment. In each continuous data segment, the difference between the timestamps of adjacent data is less than or equal to the time interval threshold.
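  • splitting a cleaned sequence at gaps larger than the time interval threshold can be sketched as follows; the function name `split_continuous` is hypothetical and time stamps are assumed to be sorted numbers.

```python
def split_continuous(seq, gap_threshold):
    """Split (t, value) pairs wherever adjacent timestamps differ by more
    than gap_threshold; each returned list is one continuous data segment."""
    segments = []
    for point in seq:
        if segments and point[0] - segments[-1][-1][0] <= gap_threshold:
            segments[-1].append(point)
        else:
            segments.append([point])
    return segments


# Gaps of 8 and 19 seconds break the sequence into three segments.
seq = [(0, 36.5), (1, 36.5), (2, 36.6), (10, 36.6), (11, 36.7), (30, 36.7)]
parts = split_continuous(seq, gap_threshold=2)
```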
  • the step of uniformly processing the continuous data segment may be: determining at least one uniformization time point corresponding to the continuous data segment; and using data in the continuous data segment, interpolating data corresponding to the at least one uniformization time point.
  • if a homogenization time point is the same as the timestamp of some data in the continuous data segment, that data may be directly used as the data corresponding to the homogenization time point; if the homogenization time point differs from the timestamps of all data in the continuous data segment, the data corresponding to the homogenization time point can be obtained by interpolation.
  • the interpolation method may be linear interpolation, spline interpolation, or the like.
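  • a minimal sketch of the uniformization step using linear interpolation, one of the methods named above. The function name `resample_uniform`, the choice of the segment's first timestamp as the first uniformization time point, and the fixed `step` are assumptions for illustration.

```python
def resample_uniform(segment, step):
    """Resample a continuous segment of (t, value) pairs onto a uniform grid.

    Grid points that coincide with an existing timestamp reuse that data
    directly; all other grid points are linearly interpolated.
    """
    if len(segment) < 2:
        return list(segment)
    t0, t_end = segment[0][0], segment[-1][0]
    out, i, n = [], 0, 0
    while t0 + n * step <= t_end:
        t = t0 + n * step
        # Advance to the pair of neighbours that brackets t.
        while segment[i + 1][0] < t:
            i += 1
        (ta, va), (tb, vb) = segment[i], segment[i + 1]
        if t == ta:
            out.append((t, va))
        elif t == tb:
            out.append((t, vb))
        else:
            out.append((t, va + (vb - va) * (t - ta) / (tb - ta)))
        n += 1
    return out


# Unevenly spaced readings resampled every 2 seconds.
uniform = resample_uniform([(0, 10.0), (4, 14.0), (5, 15.0), (10, 20.0)], step=2)
```

  • spline interpolation could be substituted for the linear formula without changing how the grid is built.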
  • in the application scenario of collecting human body temperature, a data cleaning method is shown in FIG. 4, including:
  • the wearable device continuously collects body temperature and stores the collected body temperature data into a database.
  • the wearable device can add a time stamp to the collected body temperature data. Alternatively, the time stamps are added while the data is stored in the database.
  • the data cleaning device acquires a data sequence corresponding to the body temperature of the human body from the database, and the data sequence includes a series of body temperature data.
  • the data cleaning device sets a time window based on a time domain characteristic of the physical quantity.
  • the data cleaning device divides the data sequence into at least one data segment by using a time window.
  • the data cleaning device calculates a mean and a variance of body temperature data in each of the at least one data segment.
  • the data cleaning device sets the upper and lower boundaries of the data range corresponding to each data segment according to the mean and variance of the body temperature data in that data segment.
  • the data cleaning device separately removes body temperature data larger than the upper boundary and body temperature data smaller than the lower boundary in each data segment.
  • the data cleaning device identifies the continuous data segment from the data sequence after the abnormal data is removed according to the set time interval threshold.
  • the data cleaning device adopts a data interpolation manner to uniformly process the continuous data segment in time.
  • in the application scenario of collecting human body temperature, another data cleaning method is shown in FIG. 5, including:
  • the wearable device continuously collects human body temperature, and stores the collected body temperature data into a database.
  • the wearable device can add a time stamp to the collected body temperature data. Alternatively, the time stamps are added while the data is stored in the database.
  • the data cleaning device acquires a data sequence corresponding to the body temperature of the human body from the database, and the data sequence includes a series of body temperature data.
  • the data cleaning device sets a time window based on a time domain characteristic of the physical quantity.
  • the data cleaning device divides the data sequence into at least one data segment by using a time window.
  • the data cleaning device calculates a differential of body temperature data in each of the at least one data segment.
  • the data cleaning device removes, in each data segment, the body temperature data for which the absolute value of the differential is greater than a differential threshold.
  • the data cleaning device identifies the continuous data segment from the data sequence after the abnormal data is removed according to the set time interval threshold.
  • the data cleaning device adopts a data interpolation manner to uniformly process the continuous data segment in time.
  • in the application scenario of collecting human body temperature, another data cleaning method is shown in FIG. 6, including:
  • the wearable device continuously collects body temperature and stores the collected body temperature data in a database.
  • the wearable device can add a time stamp to the collected body temperature data. Alternatively, the time stamps are added while the data is stored in the database.
  • the data cleaning device acquires a data sequence corresponding to the body temperature of the human body from the database, and the data sequence includes a series of body temperature data.
  • the data cleaning device sets a time window based on a time domain characteristic of the physical quantity.
  • the data cleaning device divides the data sequence into at least one data segment by using a time window.
  • the data cleaning device calculates a mean and a variance of body temperature data in each of the at least one data segment.
  • the data cleaning device sets upper and lower boundaries of the data range corresponding to each data segment according to the mean and variance of the body temperature data in each data segment.
  • the data cleaning device separately removes body temperature data larger than the upper boundary and body temperature data smaller than the lower boundary in each data segment.
  • the data cleaning device calculates a differential of body temperature data in each of the at least one data segment.
  • the data cleaning device removes, in each data segment, the body temperature data for which the absolute value of the differential is greater than a differential threshold.
  • the data cleaning device identifies the continuous data segment from the data sequence after the abnormal data is removed according to the set time interval threshold.
  • the data cleaning device adopts a data interpolation manner to uniformly process the continuous data segment in time.
  • the execution order of steps 605-607 and steps 608-609 is not limited to the order described in this embodiment: the operations of steps 608-609 may be performed first, followed by the operations of steps 605-607. Performing steps 605-607 first and then steps 608-609 is the preferred embodiment.
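The FIG. 6 flow can be sketched end to end as follows. This is only a minimal illustration under stated assumptions, not the applicant's implementation: samples are `(time_in_seconds, value)` pairs sorted by time, σ is taken to be the population standard deviation, the differential is the backward difference, and interpolation is linear. All function names and parameter values (`clean`, `rho`, `max_rate`, `max_gap_s`, `step_s`) are hypothetical.

```python
from statistics import fmean, pstdev

def windows(samples, window_s):
    """Steps 603-604: group samples into non-overlapping time windows."""
    t0, out = samples[0][0], {}
    for t, x in samples:
        out.setdefault(int((t - t0) // window_s), []).append((t, x))
    return [out[k] for k in sorted(out)]

def drop_out_of_range(seg, rho):
    """Steps 605-607: keep values inside [mu - rho*sigma, mu + rho*sigma]."""
    xs = [x for _, x in seg]
    mu, sigma = fmean(xs), pstdev(xs)
    return [(t, x) for t, x in seg if mu - rho * sigma <= x <= mu + rho * sigma]

def drop_fast_changes(seg, max_rate):
    """Steps 608-609: drop points whose backward difference is too large."""
    if len(seg) < 2:
        return list(seg)
    rates = [(seg[i][1] - seg[i - 1][1]) / (seg[i][0] - seg[i - 1][0])
             for i in range(1, len(seg))]
    rates.insert(0, rates[0])  # first point reuses the second point's value
    return [s for s, r in zip(seg, rates) if abs(r) <= max_rate]

def split_at_gaps(samples, max_gap_s):
    """Step 610: cut the sequence wherever adjacent timestamps are too far apart."""
    segments, current = [], []
    for s in samples:
        if current and s[0] - current[-1][0] > max_gap_s:
            segments.append(current)
            current = []
        current.append(s)
    if current:
        segments.append(current)
    return segments

def resample(seg, step_s):
    """Step 611: linear interpolation onto a uniform grid starting at seg[0]."""
    out, j, k = [], 0, 0
    while seg[0][0] + k * step_s <= seg[-1][0]:
        t = seg[0][0] + k * step_s
        while seg[j][0] < t:  # advance to the first sample at or after t
            j += 1
        if seg[j][0] == t:
            out.append((t, seg[j][1]))
        else:
            (t0, x0), (t1, x1) = seg[j - 1], seg[j]
            out.append((t, x0 + (x1 - x0) * (t - t0) / (t1 - t0)))
        k += 1
    return out

def clean(samples, window_s, rho, max_rate, max_gap_s, step_s):
    if not samples:
        return []
    kept = []
    for seg in windows(samples, window_s):
        seg = drop_out_of_range(seg, rho)       # preferred order: range first
        seg = drop_fast_changes(seg, max_rate)  # then rate of change
        kept.extend(seg)
    return [resample(s, step_s) for s in split_at_gaps(kept, max_gap_s)]

# One reading every 10 s for 3 minutes, with a single dropout at t = 80 s.
raw = [(t, 30.0 if t == 80 else 36.5) for t in range(0, 180, 10)]
result = clean(raw, window_s=180, rho=2.0, max_rate=0.05, max_gap_s=30, step_s=10)
```

In this example the dropout is removed by the range check, the resulting 20-second hole is below the assumed 30-second gap threshold, and interpolation refills the point at t = 80 s.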
  • in the above embodiments, taking human body temperature as a specific physical quantity and considering the physical and mathematical characteristics of the body temperature data itself, the abnormal data is first removed from the data sequence corresponding to the human body temperature based on the time domain characteristics of human body temperature; the continuous data segments in the data sequence after the abnormal data is removed are then identified and homogenized in time to obtain reliable and accurate body temperature data. This provides a good basis for subsequent analysis based on the body temperature data and helps improve the accuracy of subsequent analysis results.
  • the steps of the method provided by the foregoing embodiments may all be executed by the same device, or the method may be executed by different devices.
  • for example, steps 201 to 203 may all be executed by device A; as another example, steps 201 and 202 may be executed by device A and step 203 by device B; and so on.
  • FIG. 7 is a schematic structural diagram of a data cleaning apparatus according to another embodiment of the present application. As shown in FIG. 7, the apparatus includes an acquisition unit 71, a removal unit 72, and a uniform processing unit 73.
  • the acquisition unit 71 is configured to acquire a time-domain-sampled data sequence of a physical quantity.
  • the removing unit 72 is configured to remove the abnormal data in the data sequence based on the time domain characteristic of the physical quantity.
  • the uniform processing unit 73 is configured to uniformly process, in time, the continuous data segments in the data sequence after the abnormal data is removed.
  • an implementation structure of the removing unit 72 includes:
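  • a setting subunit, configured to set a time window based on the time domain characteristic of the physical quantity;
  • a dividing subunit, configured to divide the data sequence into at least one data segment by using the time window;
  • a removing subunit, configured to respectively remove abnormal data in the at least one data segment.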
  • the removing subunit is specifically configured to: remove data exceeding the data range in the first data segment; and/or remove data in the first data segment whose volatility is greater than a fluctuation threshold.
  • when removing data exceeding the data range in the first data segment, the removing subunit is specifically configured to: calculate a mean and a variance of the data in the first data segment; set an upper boundary and a lower boundary of the data range according to the mean and the variance; and remove data larger than the upper boundary and data smaller than the lower boundary in the first data segment.
  • when removing data whose volatility is greater than the fluctuation threshold in the first data segment, the removing subunit is specifically configured to: calculate a differential of the data in the first data segment; and remove the data in the first data segment whose differential has an absolute value greater than the differential threshold.
  • an implementation structure of the uniform processing unit includes:
  • an identifying subunit, configured to identify the continuous data segment from the data sequence after the abnormal data is removed according to the set time interval threshold;
  • an interpolation subunit, configured to uniformly process the continuous data segment in time by using a data interpolation method.
  • the identifying subunit is specifically configured to: determine, according to the time interval threshold, the data interruption points in the data sequence after the abnormal data is removed; and identify, in the data sequence after the abnormal data is removed, the data segments separated by the data interruption points as the continuous data segments.
  • the interpolation subunit is specifically configured to: determine at least one homogenization time point corresponding to the continuous data segment; and interpolate the data corresponding to the at least one homogenization time point by using the data in the continuous data segment.
  • the data cleaning device provided in this embodiment may be used to perform the process provided by the foregoing method embodiments, and details are not described herein again.
  • the data cleaning device provided in this embodiment takes the data sampling scenario into account and considers the physical and mathematical characteristics of the data itself: for a time-domain-sampled data sequence of a physical quantity, it removes the abnormal data based on the time domain characteristics of the physical quantity and uniformly processes, in time, the continuous data segments in the data sequence after the abnormal data is removed. This cleans the time-domain-sampled data sequence and finally yields reliable and accurate sampling data, which improves the accuracy of related analysis based on the sampled data.
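  • the internal functions and structure of the data cleaning device are described above. As shown in FIG. 8, in practice the data cleaning device can be implemented as an electronic device, including: a memory 81, a processor 82, and a communication interface 83.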
  • the memory 81 is configured to store a computer program.
  • the memory 81 can also be configured to store other various data to support operation on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 81 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable programmable read only memory (EPROM), programmable read only memory (PROM), read only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • the communication interface 83 is configured to implement communication between the electronic device and other devices, such as wired or wireless communication.
  • the electronic device can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof.
  • the communication interface 83 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel.
  • communication interface 83 also includes a near field communication (NFC) module to facilitate short range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • a processor 82 coupled to the memory 81 and the communication interface 83, is configured to execute a computer program in the memory 81 for:
  • the processor 82 is configured to: when the abnormal data is removed, to set a time window based on the time domain characteristic of the physical quantity; and use the time window to divide the data sequence into at least one data a segment; respectively removing abnormal data in the at least one data segment.
  • the processor 82 when the processor 82 removes the abnormal data in the first data segment of the at least one data segment, the processor 82 is specifically configured to: remove data in the first data segment that is out of the data range; and/or And removing data in the first data segment whose volatility is greater than a fluctuation threshold.
  • the processor 82 when the processor 82 removes the data beyond the data range, the processor 82 is specifically configured to: calculate a mean and a variance of the data in the first data segment; and set the data according to the average and the variance. An upper boundary and a lower boundary of the range; data larger than the upper boundary and smaller than the lower boundary in the first data segment are removed.
  • the method is: calculating a differential of the data in the first data segment; and removing data in which the absolute value of the differential in the first data segment is greater than a differential threshold.
  • the method when the processor 82 uniformly processes the consecutive data segments, the method is specifically configured to: identify, according to the set time interval threshold, the continuous data from the data sequence after the abnormal data is removed. Segment; using data interpolation, uniformly processing the continuous data segments in time.
  • the method when the processor 82 identifies the continuous data segment, the method is specifically configured to: determine, according to the time interval threshold, a data interruption point in the data sequence after the abnormal data is removed; In the data sequence after the abnormal data is removed, the data segment broken by the data interruption point is used as the continuous data segment.
  • the method when the processor 82 uniformly processes the consecutive data segments, the method is specifically configured to: determine at least one homogenization time point corresponding to the continuous data segment; and utilize data in the continuous data segment, Interpolating data corresponding to the at least one homogenization time point.
  • further, as shown in FIG. 8, the electronic device also includes other components such as a display 84, a power supply component 85, and an audio component 86. FIG. 8 shows only some of the components schematically, which does not mean that the electronic device includes only the components shown in FIG. 8.
  • the display 84 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may sense not only the boundary of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
  • the power supply component 85 provides power to the various components of the electronic device.
  • the power supply component 85 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device.
  • the audio component 86 is configured to output and/or input an audio signal.
  • the audio component 86 includes a microphone (MIC) that is configured to receive an external audio signal when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode.
  • the received audio signal may be further stored in the memory 81 or transmitted via the communication interface 83.
  • in some embodiments, the audio component 86 also includes a speaker for outputting an audio signal.
  • the embodiment of the present application further provides a computer storage medium suitable for a computer program, where the computer storage medium stores the following program instructions:
  • a first program instruction, configured to acquire a time-domain-sampled data sequence of a physical quantity;
  • a second program instruction, configured to remove abnormal data in the data sequence based on a time domain characteristic of the physical quantity;
  • a third program instruction, configured to uniformly process, in time, the continuous data segments in the data sequence after the abnormal data is removed.
  • when the above program instructions are executed, the process provided by the foregoing method embodiments can be implemented: the time-domain-sampled data sequence is cleaned and reliable, accurate data is obtained, which provides a good basis for subsequent analysis based on the sampled data and improves the accuracy of subsequent analysis results.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • these computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including an instruction device, which implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory in a computer readable medium, in forms such as random access memory (RAM) and/or non-volatile memory, for example read only memory (ROM) or flash memory (flash RAM).
  • Memory is an example of a computer readable medium.
  • Computer readable media include permanent and non-permanent, removable and non-removable media, which can implement information storage by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • as defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
  • embodiments of the present application can be provided as a method, system, or computer program product.
  • the present application can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment in combination of software and hardware.
  • the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

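A data cleaning method and apparatus. The data cleaning method includes: acquiring a time-domain-sampled data sequence of a physical quantity (201); removing abnormal data from the data sequence based on the time domain characteristics of the physical quantity (202); and uniformly processing, in time, the continuous data segments in the data sequence after the abnormal data is removed (203). The method yields reliable and accurate time-domain sampling data and helps improve the accuracy of related analysis based on the obtained sampling data.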

Description

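Data Cleaning Method and Apparatus
Technical Field
The present application relates to the field of data processing technologies, and in particular to a data cleaning method and apparatus.
Background
A wearable device is a portable device that can be worn directly on the user or integrated into the user's clothing or accessories. A wearable device can continuously collect data about the user and upload the collected data to a server to implement data interaction.
Based on the large amount of data collected by wearable devices, the user's health index, behavioral habits, life preferences, and the like can be analyzed. Before the data collected by a wearable device is used, the data can be denoised to remove noise, improve the reliability of the data, and in turn improve the accuracy of the analysis results.
Summary
Much of the data that a wearable device collects from the human body is affected by the complex scenarios of the human body, by problems of the device itself, or by connection errors with the mobile device, so some erroneous data appears, for example data errors caused by the device falling off or data loss caused by device communication. Erroneous data needs to be removed in time so that it does not affect later analysis and calculation. The inventors realized that, before the data collected by a wearable device is used, the data needs to be cleaned thoroughly to obtain clean and reliable data and to improve the accuracy of related analysis based on that data.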
本申请实施例提供一种数据清洗方法,包括:
获取一物理量的时域采样的数据序列;
基于所述物理量的时域特性,去除所述数据序列中的异常数据;
在时间上,均匀处理去除异常数据后的数据序列中的连续数据段。
在一可选实施方式中,所述异常数据的去除步骤,包括:基于所述物理量的时域特性,设置时间窗口;利用所述时间窗口,将所述数据序列划分为至少一个数据片段;分别去除所述至少一个数据片段中的异常数据。
在一可选实施方式中,对所述至少一个数据片段中的第一数据片段,所述异常数据的去除步骤,包括:去除所述第一数据片段中超出数据范围的数据;和/或去除所述第一数据片段中波动率大于波动阈值的数据。
在一可选实施方式中,所述超出数据范围的数据的去除步骤,包括:计算所述第一数据片段中的数据的均值和方差;根据所述均值和方差,设置所述数据范围的上边界和下边界;去除所述第一数据片段中大于所述上边界的数据以及小于所述下边界的数据。
在一可选实施方式中,所述波动率大于波动阈值的数据的去除步骤,包括:计算所述第一数据片段中的数据的微分;去除所述第一数据片段中微分的绝对值大于微分阈值的数据。
在一可选实施方式中,所述连续数据段的均匀处理步骤,包括:根据设定的时间间隔阈值,从所述去除异常数据后的数据序列中,识别所述连续数据段;采用数据插值方式,在时间上,均匀处理所述连续数据段。
在一可选实施方式中,所述连续数据段的识别步骤,包括:根据所述时间间隔阈值,确定所述去除异常数据后的数据序列中的数据中断点;识别所述去除异常数据后的数据序列中,被所述数据中断点断开的数据片段,作为所述连续数据段。
在一可选实施方式中,所述连续数据段的均匀处理步骤,包括:确定所述连续数据段对应的至少一个均匀化时间点;利用所述连续数据段中的数据,插值出所述至少一个均匀化时间点对应的数据。
本申请实施例还提供一种数据清洗装置,包括:
获取单元,用于获取一物理量的时域采样的数据序列;
去除单元,用于基于所述物理量的时域特性,去除所述数据序列中的异常数据;
均匀处理单元,用于在时间上,均匀处理去除异常数据后的数据序列中的连续数据段。
在一可选实施方式中,所述去除单元包括:设置子单元,用于基于所述物理量的时域特性,设置时间窗口;划分子单元,用于利用所述时间窗口,将所述数据序列划分为至少一个数据片段;去除子单元,用于分别去除所述至少一个数据片段中的异常数据。
在一可选实施方式中,所述去除子单元具体用于:去除所述第一数据片段中超出数据范围的数据;和/或去除所述第一数据片段中波动率大于波动阈值的数据。
在一可选实施方式中,所述去除子单元具体用于:计算所述第一数据片段中的数据的均值和方差;根据所述均值和方差,设置所述数据范围的上边界和下边界;去除所述第一数据片段中大于所述上边界的数据以及小于所述下边界的数据。
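In an optional embodiment, the removing subunit is specifically configured to: calculate a differential of the data in the first data segment; and remove the data in the first data segment whose differential has an absolute value greater than a differential threshold.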
在一可选实施方式中,所述均匀处理单元包括:识别子单元,用于根据设定的时间间隔阈值,从所述去除异常数据后的数据序列中,识别所述连续数据段;插值子单元,用于采用数据插值方式,在时间上,均匀处理所述连续数据段。
在一可选实施方式中,所述识别子单元具体用于:根据所述时间间隔阈值,确定所述去除异常数据后的数据序列中的数据中断点;识别所述去除异常数据后的数据序列中,被所述数据中断点断开的数据片段,作为所述连续数据段。
在一可选实施方式中,所述插值子单元具体用于:确定所述连续数据段对应的至少一个均匀化时间点;利用所述连续数据段中的数据,插值出所述至少一个均匀化时间点对应的数据。
本申请实施例还提供一种计算机存储介质,存储有以下程序指令:
第一程序指令,用于获取一物理量的时域采样的数据序列;
第二程序指令,用于基于所述物理量的时域特性,去除所述数据序列中的异常数据;
第三程序指令,用于在时间上,均匀处理去除异常数据后的数据序列中的连续数据段。
本申请实施例还提供一种电子设备,包括:
存储器,被配置为存储计算机程序;
通信接口,被配置为实现所述电子设备与其它设备之间的通信;
处理器,耦合至所述存储器和所述通信接口,被配置为执行所述计算机程序,以用于:
通过所述通信接口,获取一物理量的时域采样的数据序列;
基于所述物理量的时域特性,去除所述数据序列中的异常数据;
在时间上,均匀处理去除异常数据后的数据序列中的连续数据段。
在一可选实施方式中,所述处理器在去除所述异常数据时,具体用于:基于所述物理量的时域特性,设置时间窗口;利用所述时间窗口,将所述数据序列划分为至少一个数据片段;分别去除所述至少一个数据片段中的异常数据。
在一可选实施方式中,所述处理器在去除所述异常数据时,具体用于:去除所述第一数据片段中超出数据范围的数据;和/或去除所述第一数据片段中波动率大于波动阈值的数据。
在一可选实施方式中,所述处理器在去除所述超出数据范围的数据时,具体用于:计算所述第一数据片段中的数据的均值和方差;
根据所述均值和方差,设置所述数据范围的上边界和下边界;
去除所述第一数据片段中大于所述上边界的数据以及小于所述下边界的数据。
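In an optional embodiment, when removing the data whose fluctuation rate is greater than the fluctuation threshold, the processor is specifically configured to: calculate a differential of the data in the first data segment; and remove the data in the first data segment whose differential has an absolute value greater than a differential threshold.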
在一可选实施方式中,所述处理器在均匀处理所述连续数据段时,具体用于:根据设定的时间间隔阈值,从所述去除异常数据后的数据序列中,识别所述连续数据段;采用数据插值方式,在时间上,均匀处理所述连续数据段。
在一可选实施方式中,所述处理器在识别所述连续数据段时,具体用于:根据所述时间间隔阈值,确定所述去除异常数据后的数据序列中的数据中断点;识别所述去除异常数据后的数据序列中,被所述数据中断点断开的数据片段,作为所述连续数据段。
在一可选实施方式中,所述处理器在均匀处理所述连续数据段时,具体用于:确定所述连续数据段对应的至少一个均匀化时间点;利用所述连续数据段中的数据,插值出所述至少一个均匀化时间点对应的数据。
在本申请实施例中,结合数据采样场景,从数据本身具备的物理和数学特性考虑,对物理量的时域采样的数据序列,基于物理量的时域特性去除异常数据,并在时间上,均匀处理去除异常数据后的数据序列中的连续数据段,实现对时域采样的数据序列的清洗,最终获得可靠、准确的采样数据,进而提高基于采样数据进行相关分析的准确性。
附图说明
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:
图1为本申请一实施例提供的数据清洗系统的结构示意图;
图2为本申请另一实施例提供的数据清洗方法的流程示意图;
图3为本申请又一实施例提供的数据清洗方法的流程示意图;
图4为本申请又一实施例提供的数据清洗方法的流程示意图;
图5为本申请又一实施例提供的数据清洗方法的流程示意图;
图6为本申请又一实施例提供的数据清洗方法的流程示意图;
图7为本申请又一实施例提供的数据清洗装置的结构示意图;
图8为本申请又一实施例提供的电子设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合本申请具体实施例及相应的附图对本申请技术方案进行清楚、完整地描述。显然,所描述的实施例仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供的数据清洗方法可基于图1所示的数据清洗系统实现,但不限于此。如图1所示,所述数据清洗系统包括:数据采集设备10和数据清洗设备20;且数据采集设备10与数据清洗设备20通信连接。
数据采集设备10,用于对一物理量进行时域采样,以获得所述物理量的采样数据。
可选的,数据采集设备10可以将物理量的采样数据直接上报给数据清洗设备20,以供数据清洗设备20获得所述物理量的时域采样的数据序列。或者,
可选的,数据采集设备10可以将物理量的采样数据存储至一数据库中,以供数据清洗设备20从该数据库中获取所述物理量的时域采样的数据序列。
对数据清洗设备20来说,主要用于获取物理量的时域采样的数据序列,对所述数据序列进行清洗,以获得可靠、准确的数据,为后续应用或分析提供基础数据。
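The data acquisition device 10 and the data cleaning device 20 may be connected by a wireless or wired network. In this embodiment, if the data acquisition device 10 is communicatively connected to the data cleaning device 20 through a mobile network, the network standard of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, and the like. In addition, the data acquisition device 10 may also be connected to the data cleaning device 20 by wireless communication such as Bluetooth, Wi-Fi, or infrared.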
本实施例中的物理量可以是任意支持时域采集的物理量,例如可以是温度或湿度。
与上述物理量相适应,本实施例中的数据采集设备10可以是任何能够对物理量进行时域采集的设备,例如可以是各种传感器。以物理量为温度,尤其是人体体温为例,所述数据采集设备可以是带有温度传感器的可穿戴设备。
本实施例中的数据清洗设备20可以是服务器、计算机、平板电脑、智能终端等任何具有数据存储和数据处理功能的设备。
结合图1所示的数据清洗系统,以下实施例从数据清洗设备20的角度,详细说明本申请实施例提供的数据清洗方法的流程。
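FIG. 2 is a schematic flowchart of a data cleaning method provided by another embodiment of the present application. As shown in FIG. 2, the method includes:
201. Acquire a time-domain-sampled data sequence of a physical quantity.
202. Remove abnormal data from the data sequence based on the time domain characteristics of the physical quantity.
203. Uniformly process, in time, the continuous data segments in the data sequence after the abnormal data is removed.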
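In step 201, the data cleaning device 20 acquires a time-domain-sampled data sequence of a physical quantity. For example, the data cleaning device 20 can acquire the data sequence formed by the data acquisition device 10 sampling the physical quantity in the time domain. The data sequence includes the sampled data of the physical quantity at different time points and the corresponding timestamps.
In one application scenario, the data acquisition device 10 samples the physical quantity in the time domain and automatically adds a timestamp to each sample. Optionally, the data acquisition device 10 reports the timestamped samples to the data cleaning device 20, so that the data cleaning device 20 obtains the time-domain-sampled data sequence of the physical quantity. Alternatively, the data acquisition device 10 stores the timestamped samples in a database, from which the data cleaning device 20 acquires the data sequence.
In another application scenario, the data acquisition device 10 starts at a specified time point and samples the physical quantity at uniform time intervals, and the samples carry no timestamps. Optionally, the data acquisition device 10 reports the samples without timestamps to the data cleaning device 20, and the data cleaning device 20 adds the timestamps to obtain the data sequence. Alternatively, the data acquisition device 10 stores the samples without timestamps in a database, the timestamps are added during storage, and the data cleaning device 20 acquires the data sequence from the database.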
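For the second scenario, adding timestamps after the fact only needs the specified start time and the uniform sampling interval. A minimal sketch, where the function name `attach_timestamps` and the example values are hypothetical:

```python
from datetime import datetime, timedelta

def attach_timestamps(values, start, interval_s):
    """Pair the i-th untimestamped sample with start + i * interval."""
    step = timedelta(seconds=interval_s)
    return [(start + i * step, v) for i, v in enumerate(values)]

# Three body temperature readings taken every 10 s from 12:00:00.
seq = attach_timestamps([36.5, 36.6, 36.6], datetime(2017, 1, 4, 12, 0, 0), 10)
```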
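In step 202, considering the physical and mathematical characteristics of the physical quantity itself, the inventors found that a physical quantity generally has certain time domain characteristics. For example, some physical quantities are continuous within a certain time range and change gently, without sudden jumps or rapid changes. Take human body temperature as an example: under normal circumstances, body temperature is continuous and does not jump suddenly; if suddenly jumping body temperature data appears during actual collection, it should be attributed to an abnormal condition in the collection process rather than to a real jump in the subject's body temperature. Likewise, body temperature changes fairly gently, generally by no more than 0.05 degrees per second; if the collected body temperature data changes faster than this, it should be attributed to an abnormal condition in the collection process rather than to the subject's temperature really changing at that rate.
In the cases listed above the sampled data itself is unreasonable, and this cannot be identified with conventional noise filtering methods. Based on this finding, the data cleaning device 20 removes the abnormal data from the data sequence acquired in step 201 based on the time domain characteristics of the physical quantity. The data sequence after the abnormal data is removed contains the data that conforms to the time domain characteristics of the physical quantity, and this data is reliable and accurate.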
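In step 203, removing the abnormal data in step 202 may leave the data sequence no longer continuous or uniform in time, which makes it inconvenient to use as a whole, but the continuous data segments in the data sequence still have value. The data cleaning device 20 therefore uniformly processes, in time, the continuous data segments in the data sequence after the abnormal data is removed, so as to provide reliable data segments that are continuous and uniform in time for subsequent use. A continuous data segment is a data segment, in the data sequence after the abnormal data is removed, in which the time interval between every pair of adjacent data points is less than a preset time interval threshold.
The value of the time interval threshold differs with the application scenario and the physical quantity. This embodiment does not limit its value, which can be set adaptively.
In this embodiment, the data sampling scenario is taken into account and the physical and mathematical characteristics of the data itself are considered: for a time-domain-sampled data sequence of a physical quantity, the abnormal data is removed based on the time domain characteristics of the physical quantity, and the continuous data segments in the data sequence after the abnormal data is removed are uniformly processed in time. This cleans the time-domain-sampled data sequence and finally yields reliable and accurate sampling data, which improves the accuracy of related analysis based on the sampled data.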
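In the above or following embodiments, as shown in FIG. 3, removing the abnormal data from the data sequence based on the time domain characteristics of the physical quantity may include the following steps:
2021. Set a time window based on the time domain characteristics of the physical quantity.
2022. Divide the data sequence into at least one data segment by using the time window.
2023. Remove the abnormal data from each of the at least one data segment.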
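In step 2021, a time window is set that reflects the time domain characteristics of the physical quantity, that is, how the physical quantity changes over time.
For example, when human body temperature is collected continuously, body temperature generally does not change by more than 0.5 degrees within 3 minutes, so the time window can be set to 3 minutes. This means that, among the body temperature data within 3 minutes, data that changes by more than 0.5 degrees is abnormal data.
As another example, when human heart rate is collected continuously, heart rate generally does not change by more than 15 beats within 10 seconds, so the time window can be set to 10 seconds. This means that, among the heart rate data within 10 seconds, data that changes by more than 15 beats is abnormal data.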
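In step 2022, based on the time window set in step 2021, the data sequence can be divided into at least one data segment, and the time length of each data segment is the length of the time window.
Optionally, the time window divides the data sequence into at least one data segment with no overlap between the segments. Further, if the time length of the last data segment in the data sequence is shorter than the time window but its ratio to the time window is greater than or equal to a specified proportion, for example 1/3, the last data segment is kept as a separate segment. Conversely, if the time length of the last data segment is shorter than the time window and its ratio to the time window is less than the specified proportion, for example less than 1/3, the last data segment is merged into the nearest preceding segment. For example, if the data sequence covers 12:00:00-13:35:00 and the time window is 30 minutes, the data within 12:00:00-12:30:00 forms one data segment, the data within 12:30:00-13:00:00 forms one data segment, the data within 13:00:00-13:30:00 forms one data segment, and the data of the last 5 minutes is merged into the 13:00:00-13:30:00 segment.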
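The windowing rule, including the handling of a short last segment, might be sketched as follows. This is an illustration under assumptions rather than the patented implementation: samples are `(time_in_seconds, value)` pairs sorted by time, windows are aligned to the first timestamp, and the "time length" of the last segment is taken as the span from its nominal window start to the last timestamp. The name `split_into_windows` and the parameter `min_tail_ratio` are hypothetical.

```python
def split_into_windows(samples, window_s, min_tail_ratio=1 / 3):
    """Split (t, value) samples into non-overlapping windows of window_s seconds.

    A trailing window shorter than min_tail_ratio * window_s is merged into
    the preceding segment; otherwise it is kept as its own segment.
    """
    if not samples:
        return []
    t0 = samples[0][0]
    by_index = {}
    for t, v in samples:
        by_index.setdefault(int((t - t0) // window_s), []).append((t, v))
    segments = [by_index[k] for k in sorted(by_index)]
    if len(segments) >= 2:
        tail_start = t0 + max(by_index) * window_s
        tail_length = samples[-1][0] - tail_start
        if tail_length < min_tail_ratio * window_s:
            segments[-2].extend(segments.pop())
    return segments

# One sample per minute for 95 minutes (12:00:00-13:35:00), 30-minute window.
minute_samples = [(60 * m, 36.5) for m in range(96)]
segments = split_into_windows(minute_samples, window_s=1800)
sizes = [len(s) for s in segments]
```

With the 95-minute example from the text, the 5-minute tail is one sixth of the window, so it is merged and three segments result.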
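In step 2023, the abnormal data is removed from the data segments divided in step 2022.
Optionally, each time step 2022 divides out one data segment, step 2023 is entered to remove the abnormal data from that segment, and the process then returns to step 2022. Alternatively,
optionally, after step 2022 has divided out all data segments, step 2023 is entered and the abnormal data is removed from each segment one by one.
For a first data segment of the at least one data segment, the abnormal data can be removed in the following ways:
removing the data in the first data segment that exceeds a data range; and/or
removing the data in the first data segment whose fluctuation rate is greater than a fluctuation threshold.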
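Optionally, the step of removing the data in the first data segment that exceeds the data range may be: calculate the mean and the variance of the data in the first data segment, denoted μ and σ respectively; set the upper and lower boundaries of the data range according to the mean and the variance, denoted μ+ρσ and μ-ρσ respectively; remove the data in the first data segment that is larger than the upper boundary μ+ρσ or smaller than the lower boundary μ-ρσ, that is, keep only the data in the first data segment lying between the lower boundary μ-ρσ and the upper boundary μ+ρσ. Here ρ is a coefficient that can be chosen according to the application scenario and the physical quantity.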
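A minimal sketch of the range check follows. It assumes that σ is the population standard deviation, so that it has the same unit as μ; the text itself only speaks of "mean and variance", so this reading is an assumption. The function name `remove_out_of_range` and the value of `rho` are hypothetical.

```python
from statistics import fmean, pstdev

def remove_out_of_range(segment, rho):
    """Keep only (t, value) samples with mu - rho*sigma <= value <= mu + rho*sigma."""
    if not segment:
        return []
    values = [v for _, v in segment]
    mu = fmean(values)
    sigma = pstdev(values)  # assumption: sigma is the standard deviation
    lower, upper = mu - rho * sigma, mu + rho * sigma
    return [(t, v) for t, v in segment if lower <= v <= upper]

# Ten readings, the last one a dropout (for example, the sensor fell off).
readings = [36.5, 36.6, 36.5, 36.6, 36.5, 36.6, 36.5, 36.6, 36.5, 30.0]
segment = list(enumerate(readings))
cleaned = remove_out_of_range(segment, rho=2.0)
```

One design consequence worth noting: because the outlier itself inflates σ, a single outlier among n samples can never lie more than about √n standard deviations from the mean, so ρ has to be chosen with the segment size in mind (in this 10-sample example, ρ = 3 would keep the dropout).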
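Optionally, the step of removing the data in the first data segment whose fluctuation rate is greater than the fluctuation threshold may be: calculate the differential of the data in the first data segment; remove the data in the first data segment whose differential has an absolute value greater than a differential threshold. In this optional embodiment the fluctuation rate of the data is expressed by the differential, and the fluctuation threshold is correspondingly expressed by the differential threshold. The absolute value of the differential of every data point in the data sequence can be compared with the differential threshold; differentials that exceed the threshold generally appear in runs. Data in such runs changes abnormally, for example because collection has just started, collection is about to end, or the collection object has been lost for some reason (for example, the body temperature measuring device has fallen off), and such data is generally abnormal data.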
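The differential can be calculated in several ways, for example:
One differential formula is: dT(n)=(T(n)-T(n-1))/(t(n)-t(n-1)), dT(1)=dT(2). In this formula, n is a positive integer; dT(n) is the differential at the n-th time point; T(n) and T(n-1) are the data at the n-th and (n-1)-th time points; t(n) and t(n-1) are the n-th and (n-1)-th time points.
Another differential formula is: dT(n)=(T(n+1)-T(n))/(t(n+1)-t(n)), dT(end)=dT(end-1). In this formula, n is a non-negative integer; dT(n) is the differential at the n-th time point; T(n) and T(n+1) are the data at the n-th and (n+1)-th time points; t(n) and t(n+1) are the n-th and (n+1)-th time points; dT(end) and dT(end-1) are the differentials at the last and second-to-last time points.
Yet another differential formula is: dT(n)=(T(n+1)-T(n-1))/(t(n+1)-t(n-1)), dT(1)=dT(2), dT(end)=dT(end-1). This formula is a central difference. In it, n is a positive integer; dT(n) is the differential at the n-th time point; T(n-1) and T(n+1) are the data at the (n-1)-th and (n+1)-th time points; t(n-1) and t(n+1) are the (n-1)-th and (n+1)-th time points; dT(end) and dT(end-1) are the differentials at the last and second-to-last time points.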
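The three difference schemes and the threshold test can be sketched together. This is a hedged illustration: the function names `differentials` and `remove_fast_changes` are hypothetical, timestamps are assumed to be strictly increasing numbers in seconds, and the fallback for very short inputs is an added assumption that the text does not cover.

```python
def differentials(ts, xs, scheme="backward"):
    """Finite-difference estimate of dT(n) for each sample.

    Boundary points copy their neighbour, as in dT(1)=dT(2) and
    dT(end)=dT(end-1).
    """
    n = len(xs)
    if n < 2:
        return [0.0] * n
    if scheme == "central" and n < 3:
        scheme = "backward"  # assumption: fall back when no interior point exists
    d = [0.0] * n
    if scheme == "backward":
        for i in range(1, n):
            d[i] = (xs[i] - xs[i - 1]) / (ts[i] - ts[i - 1])
        d[0] = d[1]
    elif scheme == "forward":
        for i in range(n - 1):
            d[i] = (xs[i + 1] - xs[i]) / (ts[i + 1] - ts[i])
        d[-1] = d[-2]
    elif scheme == "central":
        for i in range(1, n - 1):
            d[i] = (xs[i + 1] - xs[i - 1]) / (ts[i + 1] - ts[i - 1])
        d[0] = d[1]
        d[-1] = d[-2]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return d

def remove_fast_changes(ts, xs, threshold, scheme="backward"):
    """Drop samples whose differential has an absolute value above threshold."""
    d = differentials(ts, xs, scheme)
    return [(t, x) for t, x, dx in zip(ts, xs, d) if abs(dx) <= threshold]

# A reading every 10 s; the value at t = 30 s is a dropout.
ts = [0, 10, 20, 30, 40]
xs = [36.5, 36.5, 36.6, 30.0, 36.6]
kept = remove_fast_changes(ts, xs, threshold=0.05)
```

With the backward difference, both the dropout and the sample that follows it exceed the threshold, which matches the remark that out-of-threshold differentials tend to appear in runs.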
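It is worth noting that the abnormal data in the second, third, and further data segments of the at least one data segment can be removed with the same method as for the first data segment, although this is not a limitation.
In the above or following embodiments, the step of uniformly processing the continuous data segments in time may be: identify the continuous data segments in the data sequence after the abnormal data is removed, according to the set time interval threshold; then uniformly process the continuous data segments in time by data interpolation.
Optionally, the step of identifying the continuous data segments may be: determine, according to the time interval threshold, the data interruption points in the data sequence after the abnormal data is removed; identify the data segments separated by the data interruption points as the continuous data segments. Specifically, the difference between the timestamps of adjacent data points in the data sequence after the abnormal data is removed can be compared with the time interval threshold. Wherever this difference is greater than the time interval threshold, the position between the two adjacent data points is taken as a data interruption point, and the interruption points are used as cut points that divide the data sequence into at least one continuous data segment. Within each continuous data segment, the difference between the timestamps of adjacent data points is less than or equal to the time interval threshold.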
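Identifying the continuous data segments amounts to a single pass over adjacent timestamps. A minimal sketch, assuming `(time_in_seconds, value)` samples sorted by time; the name `split_continuous` and the example threshold are hypothetical:

```python
def split_continuous(samples, max_gap_s):
    """Cut the sequence wherever adjacent timestamps differ by more than max_gap_s."""
    segments, current = [], []
    for sample in samples:
        if current and sample[0] - current[-1][0] > max_gap_s:
            segments.append(current)  # an interruption point lies before this sample
            current = []
        current.append(sample)
    if current:
        segments.append(current)
    return segments

# Gaps of 40 s and 130 s exceed the 15 s threshold and become cut points.
cleaned = [(0, 36.5), (10, 36.5), (20, 36.6), (60, 36.6), (70, 36.7), (200, 36.6)]
parts = split_continuous(cleaned, max_gap_s=15)
times = [[t for t, _ in p] for p in parts]
```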
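Optionally, the step of uniformly processing a continuous data segment may be: determine at least one homogenization time point corresponding to the continuous data segment; use the data in the continuous data segment to interpolate the data corresponding to the at least one homogenization time point.
In a specific implementation, if a homogenization time point coincides with the timestamp of some data point in the continuous data segment, that data point can be used directly as the data for the homogenization time point; if the homogenization time point does not coincide with the timestamp of any data point in the continuous data segment, the data for the homogenization time point can be interpolated from the data points located before and after it. The interpolation method may be linear interpolation, spline interpolation, or the like.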
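Linear interpolation onto a uniform grid can be sketched as below. How the homogenization time points are chosen is not specified in the text; this sketch assumes a grid that starts at the first timestamp of the segment with a fixed step. The name `resample_uniform` is hypothetical, and timestamps are assumed to be strictly increasing.

```python
from bisect import bisect_left

def resample_uniform(segment, step_s):
    """Return (t, value) pairs on a uniform grid covering one continuous segment."""
    if not segment:
        return []
    ts = [t for t, _ in segment]
    xs = [x for _, x in segment]
    out, k = [], 0
    while ts[0] + k * step_s <= ts[-1]:
        t = ts[0] + k * step_s
        i = bisect_left(ts, t)  # first sample at or after t
        if ts[i] == t:
            out.append((t, xs[i]))  # grid point coincides with a sample
        else:
            w = (t - ts[i - 1]) / (ts[i] - ts[i - 1])
            out.append((t, xs[i - 1] + w * (xs[i] - xs[i - 1])))
        k += 1
    return out

# Irregular samples at 0 s, 4 s and 10 s resampled every 5 s.
uniform = resample_uniform([(0, 36.0), (4, 36.4), (10, 37.0)], step_s=5)
```

Spline interpolation, which the text also allows, would replace only the `else` branch; the selection of grid points stays the same.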
在采集人体体温的应用场景中,一种数据清洗方法如图4所示,包括:
401、可穿戴设备连续采集人体体温,将采集到的体温数据存储至数据库。
可选的,可穿戴设备可以为采集到的体温数据添加时间戳。或者,在存储至数据库的过程中,为体温数据补上时间戳。
402、数据清洗设备从数据库中获取人体体温对应的数据序列,该数据序列包括一系列体温数据。
403、数据清洗设备基于物理量的时域特性,设置一时间窗口。
404、数据清洗设备利用时间窗口,将数据序列划分为至少一个数据片段。
405、数据清洗设备计算至少一个数据片段中每个数据片段中的体温数据的均值和方差。
406、数据清洗设备根据每个数据片段中的体温数据的均值和方差,分别 设置每个数据片段对应的数据范围的上边界和下边界。
407、数据清洗设备分别去除每个数据片段中大于上边界的体温数据以及小于下边界的体温数据。
408、数据清洗设备根据设定的时间间隔阈值,从去除异常数据后的数据序列中,识别连续数据段。
409、数据清洗设备采用数据插值方式,在时间上,均匀处理所述连续数据段。
在采集人体体温的应用场景中,另一种数据清洗方法如图5所示,包括:
501、可穿戴设备连续采集人体体温,将采集到的体温数据存储至数据库。
可选的,可穿戴设备可以为采集到的体温数据添加时间戳。或者,在存储至数据库的过程中,为体温数据补上时间戳。
502、数据清洗设备从数据库中获取人体体温对应的数据序列,该数据序列包括一系列体温数据。
503、数据清洗设备基于物理量的时域特性,设置一时间窗口。
504、数据清洗设备利用时间窗口,将数据序列划分为至少一个数据片段。
505、数据清洗设备计算至少一个数据片段中每个数据片段中的体温数据的微分。
506、数据清洗设备分别去除每个数据片段中微分的绝对值大于微分阈值的体温数据。
507、数据清洗设备根据设定的时间间隔阈值,从去除异常数据后的数据序列中,识别连续数据段。
508、数据清洗设备采用数据插值方式,在时间上,均匀处理所述连续数据段。
在采集人体体温的应用场景中,又一种数据清洗方法如图6所示,包括:
601、可穿戴设备连续采集人体体温,将采集到的体温数据存储至数据库。
可选的,可穿戴设备可以为采集到的体温数据添加时间戳。或者,在存储至数据库的过程中,为体温数据补上时间戳。
602、数据清洗设备从数据库中获取人体体温对应的数据序列,该数据序列包括一系列体温数据。
603、数据清洗设备基于物理量的时域特性,设置一时间窗口。
604、数据清洗设备利用时间窗口,将数据序列划分为至少一个数据片段。
605、数据清洗设备计算至少一个数据片段中每个数据片段中的体温数据的均值和方差。
606、数据清洗设备根据每个数据片段中的体温数据的均值和方差,分别设置每个数据片段对应的数据范围的上边界和下边界。
607、数据清洗设备分别去除每个数据片段中大于上边界的体温数据以及小于下边界的体温数据。
608、数据清洗设备计算至少一个数据片段中每个数据片段中的体温数据的微分。
609、数据清洗设备分别去除每个数据片段中微分的绝对值大于微分阈值的体温数据。
610、数据清洗设备根据设定的时间间隔阈值,从去除异常数据后的数据序列中,识别连续数据段。
611、数据清洗设备采用数据插值方式,在时间上,均匀处理所述连续数据段。
在此说明,上述步骤605-607与步骤608-609的执行顺序并不限于该实施例中描述的顺序,也可以先执行步骤608-609描述的操作,再执行步骤605-607描述的操作。其中,先执行步骤605-607描述的操作,再执行步骤608-609描述的操作是一种优选实施方式。
在上述实施例中,结合人体体温这一个特定物理量,从体温数据本身具备的物理和数学特性考虑,对人体体温对应的数据序列,首先基于人体体温的时域特性去除异常数据,识别去除异常数据后的数据序列中的连续数据段, 在时间上对连续数据段进行均匀化处理,以获得可靠、准确的体温数据,为后续基于体温数据进行各种分析提供良好的基础条件,利于提高后续分析结果的准确性。
需要说明的是,上述实施例所提供方法的各步骤的执行主体均可以是同一设备,或者,该方法也由不同设备作为执行主体。比如,步骤201至步骤203的执行主体可以为设备A;又比如,步骤201和202的执行主体可以为设备A,步骤203的执行主体可以为设备B;等等。
图7为本申请又一实施例提供的数据清洗装置的结构示意图。如图7所示,该装置包括:获取单元71、去除单元72和均匀处理单元73。
获取单元71,用于获取一物理量的时域采样的数据序列。
去除单元72,用于基于所述物理量的时域特性,去除所述数据序列中的异常数据。
均匀处理单元73,用于在时间上,均匀处理去除异常数据后的数据序列中的连续数据段。
在一可选实施方式中,去除单元72的一种实现结构包括:
设置子单元,用于基于所述物理量的时域特性,设置时间窗口;
划分子单元,用于利用所述时间窗口,将所述数据序列划分为至少一个数据片段;
去除子单元,用于分别去除所述至少一个数据片段中的异常数据。
进一步可选的,去除子单元具体用于:去除所述第一数据片段中超出数据范围的数据;和/或,去除所述第一数据片段中波动率大于波动阈值的数据。
进一步可选的,去除子单元在去除所述第一数据片段中超出数据范围的数据时,具体用于:计算所述第一数据片段中的数据的均值和方差;根据所述均值和方差,设置所述数据范围的上边界和下边界;去除所述第一数据片段中大于所述上边界的数据以及小于所述下边界的数据。
进一步可选的,去除子单元在去除所述第一数据片段中波动率大于波动阈值的数据时,具体用于:计算所述第一数据片段中的数据的微分;去除所 述第一数据片段中微分的绝对值大于微分阈值的数据。
在一可选实施方式中,均匀处理单元的一种实现结构包括:
识别子单元,用于根据设定的时间间隔阈值,从所述去除异常数据后的数据序列中,识别所述连续数据段;
插值子单元,用于采用数据插值方式,在时间上,均匀处理所述连续数据段。
进一步可选的,识别子单元具体用于:根据所述时间间隔阈值,确定所述去除异常数据后的数据序列中的数据中断点;识别所述去除异常数据后的数据序列中,被所述数据中断点断开的数据片段,作为所述连续数据段。
进一步可选的,插值子单元具体用于:确定所述连续数据段对应的至少一个均匀化时间点;利用所述连续数据段中的数据,插值出所述至少一个均匀化时间点对应的数据。
本实施例提供的数据清洗装置,可用于执行上述方法实施例提供的流程,详细描述在此不再赘述。
本实施例提供的数据清洗装置,结合数据采样场景,从数据本身具备的物理和数学特性考虑,对物理量的时域采样的数据序列,基于物理量的时域特性去除异常数据,并在时间上,均匀处理去除异常数据后的数据序列中的连续数据段,实现对时域采样的数据序列的清洗,最终获得可靠、准确的采样数据,进而提高基于采样数据进行相关分析的准确性。
以上描述了数据清洗装置的内部功能和结构,如图8所示,实际中,该数据清洗装置可实现为一电子设备,包括:存储器81、处理器82和通信接口83。
存储器81,被配置为存储计算机程序。
另外,存储器81还可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。
存储器81可以由任何类型的易失性或非易失性存储设备或者它们的组合 实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
通信接口83,被配置为实现电子设备与其它设备之间的通信,例如可以是有线或无线通信方式。
电子设备可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一个示例性实施例中,通信接口83经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,通信接口83还包括近场通信(NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。
处理器82,耦合至存储器81和通信接口83,被配置为执行存储器81中的计算机程序,以用于:
通过通信接口83获取一物理量的时域采样的数据序列;
基于所述物理量的时域特性,去除所述数据序列中的异常数据;
在时间上,均匀处理去除异常数据后的数据序列中的连续数据段。
在一可选实施方式中,处理器82在去除异常数据时,具体用于:基于所述物理量的时域特性,设置时间窗口;利用所述时间窗口,将所述数据序列划分为至少一个数据片段;分别去除所述至少一个数据片段中的异常数据。
在一可选实施方式中,处理器82在去除至少一个数据片段中的第一数据片段中的异常数据时,具体用于:去除所述第一数据片段中超出数据范围的数据;和/或,去除所述第一数据片段中波动率大于波动阈值的数据。
在一可选实施方式中,处理器82在去除超出数据范围的数据时,具体用于:计算所述第一数据片段中的数据的均值和方差;根据所述均值和方差,设置所述数据范围的上边界和下边界;去除所述第一数据片段中大于所述上边界的数据以及小于所述下边界的数据。
在一可选实施方式中,处理器82在去除波动率大于波动阈值的数据时, 具体用于:计算所述第一数据片段中的数据的微分;去除所述第一数据片段中微分的绝对值大于微分阈值的数据。
在一可选实施方式中,处理器82在均匀处理所述连续数据段时,具体用于:根据设定的时间间隔阈值,从所述去除异常数据后的数据序列中,识别所述连续数据段;采用数据插值方式,在时间上,均匀处理所述连续数据段。
在一可选实施方式中,处理器82在识别所述连续数据段时,具体用于:根据所述时间间隔阈值,确定所述去除异常数据后的数据序列中的数据中断点;识别所述去除异常数据后的数据序列中,被所述数据中断点断开的数据片段,作为所述连续数据段。
在一可选实施方式中,处理器82在均匀处理所述连续数据段时,具体用于:确定所述连续数据段对应的至少一个均匀化时间点;利用所述连续数据段中的数据,插值出所述至少一个均匀化时间点对应的数据。
进一步,如图8所示,电子设备还包括:显示器84、电源组件85、音频组件86等其它组件。图8中仅示意性给出部分组件,并不意味着客户端设备只包括图8所示组件。
显示器84包括屏幕,其屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。
电源组件85,为电子设备的各种组件提供电力。电源组件85可以包括电源管理系统,一个或多个电源,及其他与为客户端设备生成、管理和分配电力相关联的组件。
音频组件86被配置为输出和/或输入音频信号。例如,音频组件86包括一个麦克风(MIC),当电子设备处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器81或经由通信接口83发送。在一些实施例中,音频 组件86还包括一个扬声器,用于输出音频信号。
本申请实施例还提供一种适用于计算机程序的计算机存储介质,计算机存储介质存储有以下程序指令:
第一程序指令,用于获取一物理量的时域采样的数据序列;
第二程序指令,用于基于所述物理量的时域特性,去除所述数据序列中的异常数据;
第三程序指令,用于在时间上,均匀处理去除异常数据后的数据序列中的连续数据段。
当上述程序指令被执行时,可实现上述方法实施例提供的流程,实现对时域采样的数据序列的清洗,获得可靠、准确的数据,为后续基于采样数据进行各种分析提供良好的基础条件,提高后续分析结果的准确性。
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器 中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flashRAM)。内存是计算机可读介质的示例。
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。
还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、 商品或者设备中还存在另外的相同要素。
本领域技术人员应明白,本申请的实施例可提供为方法、系统或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
以上所述仅为本申请的实施例而已,并不用于限制本申请。对于本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。

Claims (20)

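  1. A data cleaning method, characterized by comprising:
    acquiring a data sequence of time-domain samples of a physical quantity;
    removing abnormal data from the data sequence based on the time-domain characteristics of the physical quantity;
    uniformly processing, in time, the continuous data segments in the data sequence from which the abnormal data has been removed.
  2. The method according to claim 1, characterized in that the step of removing the abnormal data comprises:
    setting a time window based on the time-domain characteristics of the physical quantity;
    dividing the data sequence into at least one data fragment using the time window;
    removing the abnormal data from each of the at least one data fragment.
  3. The method according to claim 2, characterized in that, for a first data fragment of the at least one data fragment, the step of removing the abnormal data comprises:
    removing data in the first data fragment that falls outside a data range; and/or
    removing data in the first data fragment whose fluctuation rate is greater than a fluctuation threshold.
  4. The method according to claim 3, characterized in that the step of removing the data that falls outside the data range comprises:
    calculating the mean and variance of the data in the first data fragment;
    setting an upper boundary and a lower boundary of the data range according to the mean and variance;
    removing data in the first data fragment that is greater than the upper boundary and data that is less than the lower boundary.
  5. The method according to claim 3, characterized in that the step of removing the data whose fluctuation rate is greater than the fluctuation threshold comprises:
    calculating the differential of the data in the first data fragment;
    removing data in the first data fragment whose differential has an absolute value greater than a differential threshold.
  6. The method according to any one of claims 1-5, characterized in that the step of uniformly processing the continuous data segments comprises:
    identifying the continuous data segments from the data sequence from which the abnormal data has been removed, according to a set time interval threshold;
    uniformly processing the continuous data segments in time by means of data interpolation.
  7. The method according to claim 6, characterized in that the step of identifying the continuous data segments comprises:
    determining, according to the time interval threshold, data interruption points in the data sequence from which the abnormal data has been removed;
    identifying the data fragments separated by the data interruption points in the data sequence from which the abnormal data has been removed as the continuous data segments.
  8. The method according to claim 6, characterized in that the step of uniformly processing the continuous data segments comprises:
    determining at least one uniformization time point corresponding to the continuous data segment;
    interpolating, from the data in the continuous data segment, the data corresponding to the at least one uniformization time point.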
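  9. A data cleaning apparatus, characterized by comprising:
    an acquisition unit for acquiring a data sequence of time-domain samples of a physical quantity;
    a removal unit for removing abnormal data from the data sequence based on the time-domain characteristics of the physical quantity;
    a uniform processing unit for uniformly processing, in time, the continuous data segments in the data sequence from which the abnormal data has been removed.
  10. The apparatus according to claim 9, characterized in that the removal unit comprises:
    a setting subunit for setting a time window based on the time-domain characteristics of the physical quantity;
    a dividing subunit for dividing the data sequence into at least one data fragment using the time window;
    a removal subunit for removing the abnormal data from each of the at least one data fragment.
  11. The apparatus according to claim 10, characterized in that the removal subunit is specifically configured to:
    remove data in the first data fragment that falls outside a data range; and/or
    remove data in the first data fragment whose fluctuation rate is greater than a fluctuation threshold.
  12. The apparatus according to any one of claims 9-11, characterized in that the uniform processing unit comprises:
    an identification subunit for identifying the continuous data segments from the data sequence from which the abnormal data has been removed, according to a set time interval threshold;
    an interpolation subunit for uniformly processing the continuous data segments in time by means of data interpolation.
  13. The apparatus according to claim 12, characterized in that the identification subunit is specifically configured to:
    determine, according to the time interval threshold, data interruption points in the data sequence from which the abnormal data has been removed;
    identify the data fragments separated by the data interruption points in the data sequence from which the abnormal data has been removed as the continuous data segments.
  14. The apparatus according to claim 12, characterized in that the interpolation subunit is specifically configured to:
    determine at least one uniformization time point corresponding to the continuous data segment;
    interpolate, from the data in the continuous data segment, the data corresponding to the at least one uniformization time point.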
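  15. A computer storage medium, characterized in that the computer storage medium stores the following program instructions:
    a first program instruction for acquiring a data sequence of time-domain samples of a physical quantity;
    a second program instruction for removing abnormal data from the data sequence based on the time-domain characteristics of the physical quantity;
    a third program instruction for uniformly processing, in time, the continuous data segments in the data sequence from which the abnormal data has been removed.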
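  16. An electronic device, characterized by comprising:
    a memory configured to store a computer program;
    a communication interface configured to implement communication between the electronic device and other devices;
    a processor, coupled to the memory and the communication interface, configured to execute the computer program so as to:
    acquire, through the communication interface, a data sequence of time-domain samples of a physical quantity;
    remove abnormal data from the data sequence based on the time-domain characteristics of the physical quantity;
    uniformly process, in time, the continuous data segments in the data sequence from which the abnormal data has been removed.
  17. The electronic device according to claim 16, characterized in that, when removing the abnormal data, the processor is specifically configured to:
    set a time window based on the time-domain characteristics of the physical quantity;
    divide the data sequence into at least one data fragment using the time window;
    remove the abnormal data from each of the at least one data fragment.
  18. The electronic device according to claim 16 or 17, characterized in that, when uniformly processing the continuous data segments, the processor is specifically configured to:
    identify the continuous data segments from the data sequence from which the abnormal data has been removed, according to a set time interval threshold;
    uniformly process the continuous data segments in time by means of data interpolation.
  19. The electronic device according to claim 18, characterized in that, when identifying the continuous data segments, the processor is specifically configured to:
    determine, according to the time interval threshold, data interruption points in the data sequence from which the abnormal data has been removed;
    identify the data fragments separated by the data interruption points in the data sequence from which the abnormal data has been removed as the continuous data segments.
  20. The electronic device according to claim 18, characterized in that, when uniformly processing the continuous data segments, the processor is specifically configured to:
    determine at least one uniformization time point corresponding to the continuous data segment;
    interpolate, from the data in the continuous data segment, the data corresponding to the at least one uniformization time point.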
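
The identification of continuous data segments by a time interval threshold and their uniform processing by interpolation, as recited in the claims above, can be illustrated with the following sketch. It is a hedged example under stated assumptions rather than the claimed implementation: samples are assumed to be `(timestamp, value)` pairs with strictly increasing timestamps, the uniformization time points are assumed to start at the first timestamp of each segment and advance by a fixed `step`, and linear interpolation is assumed because the text only calls for "data interpolation". The names `split_continuous_segments`, `resample_uniform`, `max_gap`, and `step` are hypothetical.

```python
def split_continuous_segments(samples, max_gap):
    """Split time-sorted samples wherever the gap between neighbours exceeds max_gap.

    Each such gap is treated as a data interruption point; the pieces between
    interruption points are returned as continuous data segments.
    """
    segments = []
    current = []
    for sample in samples:
        if current and sample[0] - current[-1][0] > max_gap:
            segments.append(current)
            current = []
        current.append(sample)
    if current:
        segments.append(current)
    return segments


def resample_uniform(segment, step):
    """Linearly interpolate a continuous segment at uniformly spaced time points."""
    if len(segment) < 2:
        return list(segment)  # nothing to interpolate between
    t0, t_end = segment[0][0], segment[-1][0]
    result = []
    j = 0  # index of the left neighbour used for interpolation
    n = 0  # index of the uniformization time point
    while True:
        t = t0 + n * step
        if t > t_end:
            break
        # Advance until segment[j] and segment[j + 1] bracket the time point t.
        while segment[j + 1][0] < t:
            j += 1
        (ta, va), (tb, vb) = segment[j], segment[j + 1]
        result.append((t, va + (vb - va) * (t - ta) / (tb - ta)))
        n += 1
    return result


def uniformize(samples, max_gap, step):
    """Resample every continuous segment independently; gaps are never bridged."""
    return [resample_uniform(seg, step)
            for seg in split_continuous_segments(samples, max_gap)]
```

Resampling each segment separately means no values are invented across an interruption, for example a period during which a device was not being worn.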
PCT/CN2017/070190 2017-01-04 2017-01-04 数据清洗方法及装置 WO2018126367A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/070190 WO2018126367A1 (zh) 2017-01-04 2017-01-04 数据清洗方法及装置

Publications (1)

Publication Number Publication Date
WO2018126367A1 true WO2018126367A1 (zh) 2018-07-12

Family

ID=62788934

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/070190 WO2018126367A1 (zh) 2017-01-04 2017-01-04 数据清洗方法及装置

Country Status (1)

Country Link
WO (1) WO2018126367A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120084323A1 (en) * 2010-10-02 2012-04-05 Microsoft Corporation Geographic text search using image-mined data
CN102332042A (zh) * 2011-09-13 2012-01-25 东南大学 一种石英挠性加速度计启动模型的建模方法
CN102609501A (zh) * 2012-02-02 2012-07-25 北京华电天仁电力控制技术有限公司 一种基于实时历史数据库的数据清洗方法
CN105719019A (zh) * 2016-01-21 2016-06-29 华南理工大学 一种考虑用户预约数据的公共自行车高峰期需求预测方法
CN105740627A (zh) * 2016-01-29 2016-07-06 深圳市奋达科技股份有限公司 一种心率计算方法及装置

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210091866A1 (en) * 2015-07-17 2021-03-25 Feng Zhang Method, apparatus, and system for accurate wireless monitoring
US11770197B2 (en) * 2015-07-17 2023-09-26 Origin Wireless, Inc. Method, apparatus, and system for accurate wireless monitoring
CN111625413A (zh) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 指标异常分析方法、装置及存储介质
CN111522806A (zh) * 2020-04-26 2020-08-11 陈文海 大数据清洗处理方法、装置、服务器及可读存储介质
CN111522806B (zh) * 2020-04-26 2023-07-07 上海聚均科技有限公司 大数据清洗处理方法、装置、服务器及可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17890037

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC , EPO FORM 1205A DATED 28.10.19.

122 Ep: pct application non-entry in european phase

Ref document number: 17890037

Country of ref document: EP

Kind code of ref document: A1