
WO2018126367A1 - Data cleaning method and device - Google Patents

Data cleaning method and device

Info

Publication number
WO2018126367A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
segment
time
abnormal
data segment
Prior art date
Application number
PCT/CN2017/070190
Other languages
English (en)
Chinese (zh)
Inventor
黄建华
康宏
Original Assignee
上海温尔信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海温尔信息科技有限公司 filed Critical 上海温尔信息科技有限公司
Priority to PCT/CN2017/070190 priority Critical patent/WO2018126367A1/fr
Publication of WO2018126367A1 publication Critical patent/WO2018126367A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present application relates to the field of data processing technologies, and in particular, to a data cleaning method and apparatus.
  • a wearable device is a portable device that can be worn directly on the user or integrated into the user's clothing or accessories.
  • the wearable device can continuously collect relevant data of the user and upload the collected data to the server to implement data interaction.
  • the user's health index, behavioral habits, life preferences, and the like can be analyzed.
  • the data can be denoised to remove noise in the data, improve the reliability of the data, and improve the accuracy of the analysis results.
  • much of the data that a wearable device collects from the human body is affected by the complexity of real-world use, by problems with the device itself, or by problems with its connection to a mobile device. Erroneous data may result, such as data loss caused by the device falling off or by a communication failure. Erroneous data needs to be eliminated promptly so that it does not distort subsequent analysis and calculation. The inventors realized that, before the data collected by the wearable device is used, the data needs to be thoroughly cleaned to obtain clean and reliable data and thereby improve the accuracy of analysis based on that data.
  • the embodiment of the present application provides a data cleaning method, including:
  • the step of removing the abnormal data includes: setting a time window based on a time domain characteristic of the physical quantity; dividing the data sequence into at least one data segment by using the time window; and removing the abnormal data in each of the at least one data segment.
  • the step of removing the abnormal data in the first data segment of the at least one data segment includes: removing data exceeding the data range in the first data segment; and/or removing the data in the first data segment whose volatility is greater than the fluctuation threshold.
  • the step of removing the data beyond the data range includes: calculating a mean and a variance of the data in the first data segment; setting an upper boundary and a lower boundary of the data range according to the mean and the variance; and removing data in the first data segment that is larger than the upper boundary and data that is smaller than the lower boundary.
  • the step of removing the data whose volatility is greater than the fluctuation threshold includes: calculating a differential of the data in the first data segment; and removing the data in the first data segment whose differential has an absolute value greater than a differential threshold.
  • the step of uniformly processing the continuous data segment includes: identifying, according to a set time interval threshold, the continuous data segment from the data sequence after the abnormal data is removed; and uniformly processing the continuous data segment in time by means of data interpolation.
  • the step of identifying the continuous data segment includes: determining, according to the time interval threshold, a data interruption point in the data sequence after the abnormal data is removed; and identifying, as the continuous data segment, a data segment that is delimited by the data interruption point in the data sequence after the abnormal data is removed.
  • the step of uniformly processing the continuous data segment includes: determining at least one uniformization time point corresponding to the continuous data segment; and interpolating, from the data in the continuous data segment, the data corresponding to the at least one uniformization time point.
  • the embodiment of the present application further provides a data cleaning device, including:
  • An acquiring unit configured to acquire a data sequence of a physical quantity time domain sample
  • a removing unit configured to remove abnormal data in the data sequence based on a time domain characteristic of the physical quantity
  • a uniform processing unit for uniformly processing contiguous data segments in the data sequence after the abnormal data is removed in time.
  • the removing unit includes: a setting subunit, configured to set a time window based on a time domain characteristic of the physical quantity; a dividing subunit, configured to divide the data sequence into at least one data segment by using the time window; and a removing subunit, configured to remove the abnormal data in each of the at least one data segment.
  • the removing subunit is specifically configured to: remove data exceeding the data range in the first data segment; and/or remove data in the first data segment whose volatility is greater than a fluctuation threshold.
  • the removing subunit is specifically configured to: calculate a mean and a variance of the data in the first data segment; set upper and lower boundaries of the data range according to the mean and variance; and remove data larger than the upper boundary and data smaller than the lower boundary in the first data segment.
  • the removing subunit is specifically configured to: calculate a differential of the data in the first data segment; and remove the data in the first data segment whose differential has an absolute value greater than a differential threshold.
  • the uniform processing unit includes: an identifying subunit, configured to identify the continuous data segment from the data sequence after the abnormal data is removed according to a set time interval threshold; and an interpolation subunit, configured to uniformly process the continuous data segment in time by means of data interpolation.
  • the identifying subunit is specifically configured to: determine, according to the time interval threshold, a data interruption point in the data sequence after the abnormal data is removed; and identify, as the continuous data segment, a data segment that is delimited by the data interruption point in the data sequence after the abnormal data is removed.
  • the interpolation subunit is specifically configured to: determine at least one uniformization time point corresponding to the continuous data segment; and interpolate, from the data in the continuous data segment, the data corresponding to the at least one uniformization time point.
  • the embodiment of the present application further provides a computer storage medium, which stores the following program instructions:
  • a second program instruction configured to remove abnormal data in the data sequence based on a time domain characteristic of the physical quantity
  • the third program instruction is configured to uniformly process the continuous data segments in the data sequence after the abnormal data is removed in time.
  • the embodiment of the present application further provides an electronic device, including:
  • a memory configured to store a computer program
  • a communication interface configured to implement communication between the electronic device and other devices
  • a processor coupled to the memory and the communication interface is configured to execute the computer program for:
  • when the processor removes the abnormal data, it is specifically configured to: set a time window based on a time domain characteristic of the physical quantity; divide the data sequence into at least one data segment by using the time window; and remove the abnormal data in each of the at least one data segment.
  • when the processor removes the abnormal data, it is specifically configured to: remove data exceeding the data range in the first data segment; and/or remove data in the first data segment whose volatility is greater than the fluctuation threshold.
  • when the processor removes the data that is out of the data range, the processor is specifically configured to: calculate a mean and a variance of the data in the first data segment;
  • when the processor removes the data whose volatility is greater than a fluctuation threshold, it is specifically configured to: calculate a differential of the data in the first data segment; and remove the data in the first data segment whose differential has an absolute value greater than a differential threshold.
  • when the processor uniformly processes the continuous data segment, it is specifically configured to: identify the continuous data segment from the data sequence after the abnormal data is removed according to a set time interval threshold; and uniformly process the continuous data segment in time by means of data interpolation.
  • when the processor identifies the continuous data segment, it is specifically configured to: determine, according to the time interval threshold, a data interruption point in the data sequence after the abnormal data is removed; and take, as the continuous data segment, a data segment that is delimited by the data interruption point in the data sequence after the abnormal data is removed.
  • when the processor uniformly processes the continuous data segment, it is specifically configured to: determine at least one uniformization time point corresponding to the continuous data segment; and interpolate, from the data in the continuous data segment, the data corresponding to the at least one uniformization time point.
  • the abnormal data is removed based on the time domain characteristics of the physical quantity, and uniformly processed in time.
  • FIG. 1 is a schematic structural diagram of a data cleaning system according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a data cleaning method according to another embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data cleaning method according to another embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a data cleaning method according to another embodiment of the present application.
  • FIG. 5 is a schematic flowchart diagram of a data cleaning method according to another embodiment of the present application.
  • FIG. 6 is a schematic flowchart diagram of a data cleaning method according to another embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a data cleaning apparatus according to another embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of an electronic device according to another embodiment of the present disclosure.
  • the data cleaning method provided by the embodiment of the present application can be implemented based on the data cleaning system shown in FIG. 1, but is not limited thereto.
  • the data cleaning system includes: a data collection device 10 and a data cleaning device 20; and the data collection device 10 is communicatively coupled to the data cleaning device 20.
  • the data collection device 10 is configured to perform time domain sampling on a physical quantity to obtain sampling data of the physical quantity.
  • the data collection device 10 can directly report the sampled data of the physical quantity to the data cleaning device 20, so that the data cleaning device 20 obtains the data sequence of time-domain samples of the physical quantity; or,
  • the data collection device 10 may store the sampled data of the physical quantity into a database for the data cleaning device 20 to acquire the data sequence of the time-domain samples of the physical quantity from the database.
  • the data cleaning device 20 is mainly used to acquire the data sequence of time-domain samples of the physical quantity and to clean the data sequence to obtain reliable and accurate data, providing basic data for subsequent application or analysis.
  • the data collection device 10 and the data cleaning device 20 may be connected by wireless or wired network.
  • the network standard of the mobile network may be any of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, etc.
  • GSM: Global System for Mobile Communications (2G)
  • GPRS: General Packet Radio Service (2.5G)
  • WCDMA: Wideband Code Division Multiple Access (3G)
  • TD-SCDMA: Time Division-Synchronous Code Division Multiple Access (3G)
  • CDMA2000: Code Division Multiple Access 2000 (3G)
  • UMTS: Universal Mobile Telecommunications System (3G)
  • LTE: Long Term Evolution (4G)
  • LTE+: Long Term Evolution+ (4G+)
  • the physical quantity in this embodiment may be any physical quantity that supports time domain acquisition, and may be, for example, temperature or humidity.
  • the data collection device 10 in this embodiment may be any device capable of performing time domain acquisition on physical quantities, for example, various sensors. Taking the physical quantity as the temperature, especially the body temperature of the human body as an example, the data acquisition device may be a wearable device with a temperature sensor.
  • the data cleaning device 20 in this embodiment may be any device having data storage and data processing functions such as a server, a computer, a tablet computer, a smart terminal, and the like.
  • FIG. 2 is a schematic flowchart diagram of a data cleaning method according to another embodiment of the present application. As shown in FIG. 2, the method includes:
  • in step 201, the data cleaning device 20 acquires a data sequence of time-domain samples of a physical quantity.
  • the data cleaning device 20 can acquire a data sequence formed by the data acquisition device 10 performing time domain sampling on the physical quantity.
  • the data sequence includes sampled data of the physical quantities at different points in time and corresponding timestamps.
  • in one implementation, the data collection device 10 performs time domain sampling on the physical quantity and automatically adds a timestamp to each piece of sampled data. Optionally, the data collection device 10 reports the timestamped sampled data to the data cleaning device 20, so that the data cleaning device 20 obtains the data sequence of time-domain samples of the physical quantity. Alternatively, the data collection device 10 stores the timestamped sampled data in a database, from which the data cleaning device 20 obtains the data sequence of time-domain samples of the physical quantity.
  • in another implementation, the data collection device 10 starts at a specified point in time and samples the physical quantity at uniform time intervals, without adding timestamps to the sampled data.
  • optionally, the data collection device 10 reports the sampled data without timestamps to the data cleaning device 20, and the data cleaning device 20 adds timestamps to the sampled data to obtain the data sequence of time-domain samples of the physical quantity.
  • alternatively, the data collection device 10 stores the sampled data without timestamps in a database; timestamps are added to the sampled data during storage, so that the data cleaning device 20 can acquire the data sequence of time-domain samples of the physical quantity from the database.
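When samples arrive without timestamps but are known to start at a specified time and to be taken at a uniform interval, the timestamps can be reconstructed from the sample index. The following is a minimal sketch of that idea; the function name `add_timestamps`, the 10-second interval, and the sample values are illustrative assumptions, not part of the application.

```python
from datetime import datetime, timedelta


def add_timestamps(samples, start, interval_s):
    """Pair each sample with start + i * interval (uniform sampling is assumed)."""
    step = timedelta(seconds=interval_s)
    return [(start + i * step, value) for i, value in enumerate(samples)]


# Hypothetical body temperature samples taken every 10 seconds from 12:00:00.
sequence = add_timestamps([36.5, 36.6, 36.6], datetime(2017, 1, 4, 12, 0, 0), 10)
for stamp, value in sequence:
    print(stamp.isoformat(), value)
```

This only works if no sample was dropped before the timestamps were assigned; a lost sample would shift every later timestamp.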
  • the physical quantity generally has a certain time domain characteristic from the physical and mathematical characteristics of the physical quantity itself. For example, some physical quantities are continuous over a certain time frame, and the changes are gentle and do not suddenly jump or change rapidly.
  • for example, the body temperature of the human body is continuous and does not jump suddenly; if the collected body temperature data shows a sudden jump, the jump should be attributed to an abnormal situation in the collection process rather than to an actual jump in the body temperature of the sampled subject.
  • for another example, body temperature changes relatively gently, generally by no more than 0.05 degrees per second. If the collected body temperature data changes faster than this, the change should be attributed to an abnormal situation in the collection process rather than to the body temperature of the measured subject actually changing at that rate.
  • the data cleaning device 20 removes the abnormal data in the data sequence acquired in step 201 based on the time domain characteristics of the physical quantity.
  • the data sequence after the abnormal data is removed includes data conforming to the time domain characteristics of the physical quantity, and the data is reliable and accurate.
  • in step 203, it is considered that, after step 202 removes the abnormal data, the data sequence may no longer be continuous or uniform in time and, as a whole, is inconvenient to use, but the continuous data segments in the data sequence still have a certain use value.
  • the data cleaning device 20 uniformly processes the continuous data segments in the data sequence after the removal of the abnormal data in time to provide a reliable, temporally continuous and uniform continuous data segment for subsequent use.
  • the continuous data segment refers to a data segment, in the data sequence after the abnormal data is removed, in which the time interval between every two adjacent data is less than a preset time interval threshold.
  • the value of the time interval threshold may be different according to the application scenario and the physical quantity. This embodiment does not limit the value of the time interval threshold, and can be adaptively set.
  • in this embodiment, for a data sequence of time-domain samples of a physical quantity, the abnormal data is removed based on the time domain characteristics of the physical quantity, and the continuous data segments in the data sequence after the abnormal data is removed are uniformly processed in time. This cleans the time-domain sampled data sequence and finally yields reliable and accurate sampling data, thereby improving the accuracy of analysis based on the sampled data.
  • removing the abnormal data in the data sequence based on the time domain characteristics of the physical quantity may include the following steps:
  • in step 2021, a time window is set that reflects the time domain characteristics of the physical quantity, that is, how the physical quantity changes with time.
  • for example, when human body temperature is collected continuously, the body temperature generally does not change by more than 0.5 degrees within 3 minutes; according to this characteristic, the time window can be set to 3 minutes. This means that, in the body temperature data within 3 minutes, body temperature data that changes by more than 0.5 degrees is abnormal data.
  • for another example, when human heart rate is collected continuously, the heart rate generally does not change by more than 15 beats within 10 seconds; according to this characteristic, the time window can be set to 10 seconds. This means that, in the heart rate data within 10 seconds, heart rate data that changes by more than 15 beats is abnormal data.
  • in step 2022, based on the time window set in step 2021, the data sequence can be divided into at least one data segment, the time length of each data segment being the length of the time window.
  • the data sequence is divided into at least one data segment by using the time window, and the data segments do not overlap. Further, if the time length of the last data segment in the data sequence is less than the length of the time window, but the ratio of that time length to the time window is greater than or equal to a specified ratio, for example 1/3, the last data segment is retained as a separate data segment. Conversely, if the time length of the last data segment is less than the length of the time window and the ratio of that time length to the time window is less than the specified ratio, for example less than 1/3, the last data segment is merged into the immediately preceding data segment.
  • for example, with a 30-minute time window and a data sequence covering 12:00:00-13:35:00, the data in 12:00:00-12:30:00 can be divided into one data segment, the data in 12:30:00-13:00:00 into another data segment, and the data in 13:00:00-13:30:00 into a third data segment; the data of the last 5 minutes is merged into the 13:00:00-13:30:00 data segment.
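The windowing rule, including the handling of a short final segment, can be sketched as follows. This is a minimal illustration under stated assumptions: timestamps are plain seconds, the sequence is sorted by time, and the names `split_into_windows` and `min_tail_ratio` are hypothetical. The example reproduces the 12:00:00-13:35:00 case with one sample per minute.

```python
def split_into_windows(sequence, window_s, min_tail_ratio=1 / 3):
    """Split (time_s, value) pairs, sorted by time, into non-overlapping windows.

    A final segment shorter than the window is kept only if its time span is
    at least min_tail_ratio of the window; otherwise it joins the previous one.
    """
    if not sequence:
        return []
    start = sequence[0][0]
    buckets = {}
    for t, v in sequence:
        buckets.setdefault(int((t - start) // window_s), []).append((t, v))
    keys = sorted(buckets)
    segments = [buckets[k] for k in keys]
    # Time covered by the last segment, measured from the start of its window.
    tail_span = sequence[-1][0] - (start + keys[-1] * window_s)
    if len(segments) > 1 and tail_span / window_s < min_tail_ratio:
        tail = segments.pop()
        segments[-1].extend(tail)
    return segments


# One sample per minute over 95 minutes (12:00:00-13:35:00), 30-minute window.
samples = [(t, 36.5) for t in range(0, 5701, 60)]
segments = split_into_windows(samples, window_s=1800)
print([len(s) for s in segments])  # [30, 30, 36]
```

The last 5 minutes amount to 1/6 of the window, below the 1/3 ratio, so those samples are appended to the third segment.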
  • in step 2023, the abnormal data is removed from the data segments obtained in step 2022.
  • optionally, each time a data segment is divided in step 2022, the process proceeds to step 2023 to remove the abnormal data in that data segment, and then returns to step 2022 to divide the next data segment; or,
  • after all the data segments have been divided in step 2022, the process proceeds to step 2023 to remove the abnormal data in each data segment one by one.
  • the abnormal data may be removed in the following manner:
  • the step of removing the data beyond the data range in the first data segment may be: calculating a mean and a variance of the data in the first data segment, denoted $\mu$ and $\sigma$, respectively; setting the upper and lower boundaries of the data range according to the mean and variance, denoted $\mu + \alpha\sigma$ and $\mu - \alpha\sigma$, respectively; and removing the data in the first data segment that is larger than the upper boundary $\mu + \alpha\sigma$ and the data that is smaller than the lower boundary $\mu - \alpha\sigma$, that is, retaining only the data in the first data segment that lies within $[\mu - \alpha\sigma, \mu + \alpha\sigma]$.
  • $\alpha$ is a coefficient, which can be determined according to the application scenario and the physical quantity.
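A range filter of this kind can be sketched as below. The sketch makes two assumptions that go beyond the text: the spread around the mean is measured with the population standard deviation (`statistics.pstdev`), and the coefficient defaults to 2. The function name `remove_out_of_range` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def remove_out_of_range(segment, coeff=2.0):
    """Keep only (time, value) pairs within mean +/- coeff * spread of the segment."""
    values = [v for _, v in segment]
    mu = mean(values)
    sigma = pstdev(values)  # assumption: standard deviation used as the spread
    lower, upper = mu - coeff * sigma, mu + coeff * sigma
    return [(t, v) for t, v in segment if lower <= v <= upper]


# Hypothetical segment: nine plausible readings and one outlier at 42.0.
readings = [36.5, 36.6, 36.5, 36.6, 36.5, 36.6, 36.5, 36.6, 36.5, 42.0]
segment = list(enumerate(readings))
cleaned = remove_out_of_range(segment)
print(len(segment), len(cleaned))  # 10 9
```

Because the outlier itself inflates both the mean and the spread, a smaller coefficient or a second pass may be needed when a segment contains several outliers.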
  • the step of removing the data in the first data segment whose volatility is greater than the fluctuation threshold may be: calculating a differential of the data in the first data segment; and removing the data in the first data segment whose differential has an absolute value greater than the differential threshold.
  • the volatility of the data is represented by the differential and, correspondingly, the fluctuation threshold is embodied by a differential threshold. The absolute values of the differentials of all data in the data sequence can be compared with the differential threshold; differentials that exceed the differential threshold generally appear in contiguous runs.
  • data whose differential has an absolute value greater than the differential threshold is data that changes abnormally. Such data may come, for example, from the initial stage of the acquisition, from the end of the acquisition, or from a moment when the acquisition object is lost for some reason (for example, the temperature measuring device falls off). Such data is generally abnormal data.
  • in one form, the differential is calculated as a backward difference: $dT(n) = \frac{T(n) - T(n-1)}{t(n) - t(n-1)}$, where n is a positive integer, dT(n) represents the differential at the nth time point, T(n) and T(n-1) represent the data at the nth time point and at the (n-1)th time point, respectively, and t(n) and t(n-1) represent the nth time point and the (n-1)th time point, respectively.
  • in another form, the differential is calculated as a forward difference: $dT(n) = \frac{T(n+1) - T(n)}{t(n+1) - t(n)}$, where n is a non-negative integer, dT(n) represents the differential at the nth time point, T(n) and T(n+1) represent the data at the nth time point and at the (n+1)th time point, respectively, t(n) and t(n+1) represent the nth time point and the (n+1)th time point, respectively, and dT(end) and dT(end-1) represent the differential at the last time point and the differential at the penultimate time point, respectively.
  • in a further form, the differential calculation formula is a central difference: $dT(n) = \frac{T(n+1) - T(n-1)}{t(n+1) - t(n-1)}$, where n is a positive integer, dT(n) represents the differential at the nth time point, T(n-1) and T(n+1) represent the data at the (n-1)th time point and at the (n+1)th time point, respectively, t(n-1) and t(n+1) represent the (n-1)th time point and the (n+1)th time point, respectively, and dT(end) and dT(end-1) represent the differential at the last time point and the differential at the penultimate time point, respectively.
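The quantities defined above (T(n), t(n), dT(n)) correspond to difference quotients over non-uniform time points. The sketch below implements backward, forward, and central variants and a threshold filter built on them. How the end points are handled is an assumption made here for illustration (the nearest available differential is reused, or a one-sided difference is taken); the function names are hypothetical.

```python
def differentials(times, values, scheme="backward"):
    """Difference quotient dT(n) for each sample; times must be strictly increasing."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    d = [0.0] * n
    if scheme == "backward":
        for i in range(1, n):
            d[i] = (values[i] - values[i - 1]) / (times[i] - times[i - 1])
        d[0] = d[1]  # assumption: first point reuses its neighbour's value
    elif scheme == "forward":
        for i in range(n - 1):
            d[i] = (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        d[-1] = d[-2]  # assumption: last point reuses the penultimate value
    elif scheme == "central":
        for i in range(1, n - 1):
            d[i] = (values[i + 1] - values[i - 1]) / (times[i + 1] - times[i - 1])
        # assumption: one-sided differences at both ends
        d[0] = (values[1] - values[0]) / (times[1] - times[0])
        d[-1] = (values[-1] - values[-2]) / (times[-1] - times[-2])
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return d


def remove_fast_changes(times, values, threshold, scheme="backward"):
    """Drop samples whose differential exceeds the threshold in absolute value."""
    d = differentials(times, values, scheme)
    return [(t, v) for t, v, dv in zip(times, values, d) if abs(dv) <= threshold]


# Hypothetical readings once per second; the third one is a dropout.
times = [0, 1, 2, 3, 4]
values = [36.50, 36.52, 30.00, 36.55, 36.56]
kept = remove_fast_changes(times, values, threshold=0.05)
print(kept)  # [(0, 36.5), (1, 36.52), (4, 36.56)]
```

With the backward difference, the sample right after the dropout is removed as well, because the jump back to a normal value also exceeds the threshold. This matches the observation that excessive differentials tend to appear in runs.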
  • abnormal data in the second and third data segments in the at least one data segment may be removed in the same manner as the first data segment, but is not limited thereto.
  • the step of uniformly processing the continuous data segments in time may be: identifying the continuous data segment from the data sequence after the abnormal data is removed according to the set time interval threshold; and uniformly processing the continuous data segment in time by means of data interpolation.
  • the step of identifying the continuous data segment may be: determining, according to the time interval threshold, the data interruption points in the data sequence after the abnormal data is removed; and identifying the data segments delimited by the data interruption points in the data sequence after the abnormal data is removed as continuous data segments. Specifically, the difference between the timestamps of adjacent data in the data sequence after the abnormal data is removed is compared with the time interval threshold, and adjacent data whose timestamp difference is greater than the time interval threshold is taken as a data interruption point.
  • the data interruption points are used as segmentation points to divide the data sequence into at least one continuous data segment. In each continuous data segment, the difference between the timestamps of adjacent data is less than or equal to the time interval threshold.
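Splitting at interruption points amounts to a single pass over the timestamps. A minimal sketch, assuming timestamps in seconds and a sequence sorted by time; the function name `split_continuous` and the 30-second threshold are illustrative only.

```python
def split_continuous(sequence, gap_threshold_s):
    """Split (time_s, value) pairs wherever adjacent timestamps differ by more than the threshold."""
    segments = []
    current = []
    for point in sequence:
        if current and point[0] - current[-1][0] > gap_threshold_s:
            segments.append(current)  # interruption point: close the current segment
            current = []
        current.append(point)
    if current:
        segments.append(current)
    return segments


# Hypothetical cleaned sequence with an 80-second hole after t = 20.
cleaned = [(0, 36.5), (10, 36.5), (20, 36.6), (100, 36.7), (110, 36.7)]
parts = split_continuous(cleaned, gap_threshold_s=30)
print([[t for t, _ in part] for part in parts])  # [[0, 10, 20], [100, 110]]
```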
  • the step of uniformly processing the continuous data segment may be: determining at least one uniformization time point corresponding to the continuous data segment; and using data in the continuous data segment, interpolating data corresponding to the at least one uniformization time point.
  • if a uniformization time point is the same as the timestamp of some data in the continuous data segment, that data may be directly used as the data corresponding to the uniformization time point; if the uniformization time point is not the same as the timestamp of any data in the continuous data segment, the data corresponding to the uniformization time point can be obtained by interpolation.
  • the interpolation method may be linear interpolation, spline interpolation, or the like.
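The uniformization step can be sketched with linear interpolation onto an evenly spaced grid. The assumptions here are that the grid starts at the first timestamp of the segment, that timestamps are strictly increasing, and that the step size is chosen by the caller; `resample_uniform` is a hypothetical name. Grid points that coincide with an existing timestamp take the stored value directly.

```python
def resample_uniform(segment, step_s):
    """Linearly interpolate a continuous (time_s, value) segment onto a uniform grid."""
    if len(segment) < 2:
        return list(segment)
    times = [t for t, _ in segment]
    values = [v for _, v in segment]
    result = []
    j = 0  # index of the sample at or before the current grid point
    k = 0
    while True:
        t = times[0] + k * step_s
        if t > times[-1]:
            break
        while j < len(times) - 2 and times[j + 1] <= t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        if t == t0:
            value = values[j]  # grid point coincides with a timestamp
        elif t == t1:
            value = values[j + 1]
        else:
            value = values[j] + (values[j + 1] - values[j]) * (t - t0) / (t1 - t0)
        result.append((t, value))
        k += 1
    return result


# Hypothetical segment with irregular timestamps, resampled every 5 seconds.
segment = [(0, 36.0), (8, 36.5), (10, 37.0), (20, 38.0)]
uniform = resample_uniform(segment, step_s=5)
print(uniform)  # [(0, 36.0), (5, 36.3125), (10, 37.0), (15, 37.5), (20, 38.0)]
```

Spline interpolation would replace only the formula in the final `else` branch; the grid construction stays the same.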
  • in the application scenario of collecting human body temperature, a data cleaning method is shown in FIG. 4, including:
  • the wearable device continuously collects body temperature and stores the collected body temperature data into a database.
  • the wearable device can add a timestamp to the collected body temperature data. Alternatively, timestamps are added to the data while it is being stored in the database.
  • the data cleaning device acquires a data sequence corresponding to the body temperature of the human body from the database, and the data sequence includes a series of body temperature data.
  • the data cleaning device sets a time window based on a time domain characteristic of the physical quantity.
  • the data cleaning device divides the data sequence into at least one data segment by using a time window.
  • the data cleaning device calculates a mean and a variance of body temperature data in each of the at least one data segment.
  • the data cleaning device sets upper and lower boundaries of the data range corresponding to each data segment according to the mean and variance of the body temperature data in each data segment.
  • the data cleaning device separately removes body temperature data larger than the upper boundary and body temperature data smaller than the lower boundary in each data segment.
  • the data cleaning device identifies the continuous data segment from the data sequence after the abnormal data is removed according to the set time interval threshold.
  • the data cleaning device adopts a data interpolation manner to uniformly process the continuous data segment in time.
  • in the application scenario of collecting human body temperature, another data cleaning method is shown in FIG. 5, including:
  • the wearable device continuously collects human body temperature, and stores the collected body temperature data into a database.
  • the wearable device can add a timestamp to the collected body temperature data. Alternatively, timestamps are added to the data while it is being stored in the database.
  • the data cleaning device acquires a data sequence corresponding to the body temperature of the human body from the database, and the data sequence includes a series of body temperature data.
  • the data cleaning device sets a time window based on a time domain characteristic of the physical quantity.
  • the data cleaning device divides the data sequence into at least one data segment by using a time window.
  • the data cleaning device calculates a differential of body temperature data in each of the at least one data segment.
  • the data cleaning device removes, from each data segment, the body temperature data whose differential has an absolute value greater than a differential threshold.
  • the data cleaning device identifies the continuous data segment from the data sequence after the abnormal data is removed according to the set time interval threshold.
  • the data cleaning device adopts a data interpolation manner to uniformly process the continuous data segment in time.
  • in the application scenario of collecting human body temperature, another data cleaning method is shown in FIG. 6, including:
  • the wearable device continuously collects body temperature and stores the collected body temperature data in a database.
  • the wearable device can add a timestamp to the collected body temperature data. Alternatively, timestamps are added to the data while it is being stored in the database.
  • the data cleaning device acquires a data sequence corresponding to the body temperature of the human body from the database, and the data sequence includes a series of body temperature data.
  • the data cleaning device sets a time window based on a time domain characteristic of the physical quantity.
  • the data cleaning device divides the data sequence into at least one data segment by using a time window.
  • the data cleaning device calculates a mean and a variance of body temperature data in each of the at least one data segment.
  • the data cleaning device sets upper and lower boundaries of the data range corresponding to each data segment according to the mean and variance of the body temperature data in each data segment.
  • the data cleaning device separately removes body temperature data larger than the upper boundary and body temperature data smaller than the lower boundary in each data segment.
  • the data cleaning device calculates a differential of body temperature data in each of the at least one data segment.
  • the data cleaning device removes, from each data segment, the body temperature data whose differential has an absolute value greater than a differential threshold.
  • the data cleaning device identifies the continuous data segment from the data sequence after the abnormal data is removed according to the set time interval threshold.
  • the data cleaning device uses data interpolation to uniformly process the continuous data segment in time.
  • steps 605-607 and steps 608-609 are not limited to the order described in this embodiment; the operations described in steps 608-609 may be performed first, followed by the operations described in steps 605-607. Performing the operations of steps 605-607 first and then those of steps 608-609 is a preferred embodiment.
  • in this embodiment, for the data sequence corresponding to the human body temperature, the abnormal data is first identified and removed based on the time domain characteristics of the human body temperature; the continuous data segments in the resulting data sequence are then identified and homogenized in time to obtain reliable and accurate body temperature data. This provides a good basis for subsequent analysis based on the body temperature data and helps improve the accuracy of the subsequent analysis results.
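  • as an illustrative sketch only, the step of dividing the data sequence into data segments by a time window could look like the following. The function name `split_by_time_window`, the `(timestamp, value)` tuple layout, and the choice of starting each window at the first sample that falls outside the previous one are assumptions made for illustration; the description does not prescribe them.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, measured value).
Sample = Tuple[float, float]


def split_by_time_window(samples: List[Sample], window_s: float) -> List[List[Sample]]:
    """Divide a time-ordered data sequence into data segments.

    Each segment covers less than `window_s` seconds, measured from the
    timestamp of the first sample in that segment.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    segments: List[List[Sample]] = []
    window_start = None
    for t, v in samples:
        # Open a new segment when the current sample leaves the window.
        if window_start is None or t - window_start >= window_s:
            segments.append([])
            window_start = t
        segments[-1].append((t, v))
    return segments
```

  • for example, twelve samples taken one minute apart and a five-minute window (`window_s=300.0`) give three segments holding 5, 5 and 2 samples.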
  • the execution bodies of the steps of the method provided by the foregoing embodiments may all be the same device, or the method may also be performed by different devices.
  • for example, the execution body of steps 201 to 203 may be device A; as another example, the execution body of steps 201 and 202 may be device A and the execution body of step 203 may be device B, and so on.
  • FIG. 7 is a schematic structural diagram of a data cleaning apparatus according to another embodiment of the present application. As shown in FIG. 7, the apparatus includes an obtaining unit 71, a removing unit 72, and a uniform processing unit 73.
  • the obtaining unit 71 is configured to acquire a time-domain sampled data sequence of a physical quantity.
  • the removing unit 72 is configured to remove the abnormal data in the data sequence based on the time domain characteristic of the physical quantity.
  • the uniform processing unit 73 is configured to uniformly process, in time, the continuous data segments in the data sequence after the abnormal data is removed.
  • an implementation structure of the removing unit 72 includes:
  • a removing subunit, configured to respectively remove abnormal data in the at least one data segment.
  • the removing subunit is specifically configured to: remove data exceeding the data range in the first data segment; and/or remove data in the first data segment whose volatility is greater than a fluctuation threshold.
  • when removing data exceeding the data range in the first data segment, the removing subunit is specifically configured to: calculate a mean and a variance of the data in the first data segment; set an upper boundary and a lower boundary of the data range according to the mean and the variance; and remove data larger than the upper boundary and data smaller than the lower boundary in the first data segment.
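  • a minimal sketch of this range-based removal is given below. The description only states that the boundaries are set according to the mean and the variance; the specific rule used here (mean plus or minus `k` standard deviations), the default `k=3.0`, and the name `remove_out_of_range` are assumptions for illustration.

```python
import statistics
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, measured value).
Sample = Tuple[float, float]


def remove_out_of_range(segment: List[Sample], k: float = 3.0) -> List[Sample]:
    """Drop samples outside [mean - k*std, mean + k*std] within one data segment."""
    if len(segment) < 2:
        return list(segment)
    values = [v for _, v in segment]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # square root of the population variance
    if std == 0.0:
        # All values are identical, so nothing can be out of range.
        return list(segment)
    lower = mean - k * std  # lower boundary of the data range
    upper = mean + k * std  # upper boundary of the data range
    return [(t, v) for t, v in segment if lower <= v <= upper]
```

  • because a single extreme value also inflates the variance of a short segment, a smaller `k` may be needed when a segment holds only a few samples.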
  • when removing the data whose volatility is greater than the fluctuation threshold in the first data segment, the removing subunit is specifically configured to: calculate a differential of the data in the first data segment; and remove the data for which the absolute value of the differential in the first data segment is greater than the differential threshold.
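  • the differential-based removal can be sketched as follows. The description does not define how the differential is computed; this sketch assumes a first-order rate of change (value difference divided by time difference) measured against the most recently retained sample, so that a single spike does not also cause the following good sample to be dropped. The name `remove_high_differential` and the parameter `max_rate` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, measured value).
Sample = Tuple[float, float]


def remove_high_differential(segment: List[Sample], max_rate: float) -> List[Sample]:
    """Drop samples whose rate of change exceeds `max_rate` (value units per second)."""
    if not segment:
        return []
    kept = [segment[0]]  # assumes the first sample of the segment is valid
    for t, v in segment[1:]:
        t_prev, v_prev = kept[-1]
        dt = t - t_prev
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        if abs((v - v_prev) / dt) <= max_rate:
            kept.append((t, v))
    return kept
```

  • a known limitation of this sketch is that an abnormal first sample is never removed by the differential test; running the range-based removal first, as in the preferred order of the method, reduces that risk.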
  • an implementation structure of the uniform processing unit includes:
  • an identifying subunit, configured to identify the continuous data segment from the data sequence after the abnormal data is removed, according to the set time interval threshold;
  • an interpolation subunit, configured to uniformly process the continuous data segment in time by using data interpolation.
  • the identifying subunit is specifically configured to: determine, according to the time interval threshold, data interruption points in the data sequence after the abnormal data is removed; and take each data segment delimited by the data interruption points in that data sequence as a continuous data segment.
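  • the identification of continuous data segments can be illustrated with the short sketch below, in which a data interruption point is assumed to be any place where two adjacent samples are more than `max_gap_s` seconds apart. The function name `split_at_gaps` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, measured value).
Sample = Tuple[float, float]


def split_at_gaps(samples: List[Sample], max_gap_s: float) -> List[List[Sample]]:
    """Split a time-ordered sequence wherever adjacent samples are too far apart."""
    segments: List[List[Sample]] = []
    for t, v in samples:
        # Compare with the timestamp of the last sample in the current segment.
        if segments and t - segments[-1][-1][0] <= max_gap_s:
            segments[-1].append((t, v))
        else:
            segments.append([(t, v)])  # interruption point: start a new segment
    return segments
```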
  • the interpolation subunit is specifically configured to: determine at least one homogenization time point corresponding to the continuous data segment; and interpolate, by using the data in the continuous data segment, the data corresponding to the at least one homogenization time point.
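  • one possible reading of this interpolation step is sketched below. It assumes that the homogenization time points form an evenly spaced grid starting at the first sample of the segment, and that linear interpolation between the two neighbouring samples is used; the description names neither choice, and `resample_uniform` and `step_s` are hypothetical names. Timestamps are assumed to be strictly increasing.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, measured value).
Sample = Tuple[float, float]


def resample_uniform(segment: List[Sample], step_s: float) -> List[Sample]:
    """Linearly interpolate one continuous data segment onto an even time grid."""
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    if len(segment) < 2:
        return list(segment)
    start, end = segment[0][0], segment[-1][0]
    n_steps = int((end - start) // step_s)
    out: List[Sample] = []
    i = 0
    for k in range(n_steps + 1):
        t = start + k * step_s  # homogenization time point
        # Advance until segment[i] and segment[i + 1] bracket t.
        while i + 2 < len(segment) and segment[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = segment[i], segment[i + 1]
        out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
    return out
```

  • applying this function to each continuous data segment separately avoids interpolating across a data interruption, where no measurement supports the interpolated values.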
  • the data cleaning device provided in this embodiment may be used to perform the process provided by the foregoing method embodiments, and details are not described herein again.
  • the data cleaning device provided in this embodiment takes the data sampling scenario into account and considers the physical and mathematical characteristics of the data itself. For a time-domain sampled data sequence of a physical quantity, it removes abnormal data based on the time domain characteristics of the physical quantity and then uniformly processes, in time, the continuous data segments in the data sequence after the abnormal data is removed. This purifies the time-domain sampled data sequence and yields reliable and accurate sampling data, thereby improving the accuracy of related analysis based on the sampled data.
  • the data cleaning device can be implemented as an electronic device, including: a memory 81, a processor 82, and a communication interface 83.
  • the memory 81 is configured to store a computer program.
  • the memory 81 can also be configured to store other various data to support operation on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 81 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable programmable read only memory (EPROM), programmable read only memory (PROM), read only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the communication interface 83 is configured to implement communication between the electronic device and other devices, such as wired or wireless communication.
  • the electronic device can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof.
  • the communication interface 83 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel.
  • communication interface 83 also includes a near field communication (NFC) module to facilitate short range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • a processor 82 coupled to the memory 81 and the communication interface 83, is configured to execute a computer program in the memory 81 for:
  • when removing the abnormal data, the processor 82 is specifically configured to: set a time window based on the time domain characteristic of the physical quantity; divide the data sequence into at least one data segment by using the time window; and respectively remove abnormal data in the at least one data segment.
  • when removing the abnormal data in a first data segment of the at least one data segment, the processor 82 is specifically configured to: remove data in the first data segment that is out of the data range; and/or remove data in the first data segment whose volatility is greater than a fluctuation threshold.
  • when removing the data beyond the data range, the processor 82 is specifically configured to: calculate a mean and a variance of the data in the first data segment; set an upper boundary and a lower boundary of the data range according to the mean and the variance; and remove data larger than the upper boundary and data smaller than the lower boundary in the first data segment.
  • when removing the data whose volatility is greater than the fluctuation threshold, the processor 82 is specifically configured to: calculate a differential of the data in the first data segment; and remove data for which the absolute value of the differential in the first data segment is greater than a differential threshold.
  • when uniformly processing the continuous data segments, the processor 82 is specifically configured to: identify, according to the set time interval threshold, the continuous data segment from the data sequence after the abnormal data is removed; and uniformly process the continuous data segment in time by using data interpolation.
  • when identifying the continuous data segment, the processor 82 is specifically configured to: determine, according to the time interval threshold, data interruption points in the data sequence after the abnormal data is removed; and take each data segment delimited by the data interruption points in that data sequence as a continuous data segment.
  • when uniformly processing the continuous data segment by data interpolation, the processor 82 is specifically configured to: determine at least one homogenization time point corresponding to the continuous data segment; and interpolate, by using the data in the continuous data segment, the data corresponding to the at least one homogenization time point.
  • the electronic device further includes: a display 84, a power supply component 85, an audio component 86, and the like. Only some of the components are schematically illustrated in FIG. 8, which does not mean that the electronic device includes only the components shown in FIG. 8.
  • the display 84 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may sense not only the boundary of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
  • the power supply component 85 provides power to the various components of the electronic device.
  • the power supply component 85 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device.
  • the audio component 86 is configured to output and/or input an audio signal.
  • the audio component 86 includes a microphone (MIC) that is configured to receive an external audio signal when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode.
  • the received audio signal may be further stored in the memory 81 or transmitted via the communication interface 83.
  • the audio component 86 also includes a speaker for outputting an audio signal.
  • the embodiment of the present application further provides a computer storage medium suitable for a computer program, where the computer storage medium stores the following program instructions:
  • a second program instruction configured to remove abnormal data in the data sequence based on a time domain characteristic of the physical quantity
  • a third program instruction, configured to uniformly process, in time, the continuous data segments in the data sequence after the abnormal data is removed.
  • by means of these program instructions, the process provided by the foregoing method embodiments can be implemented: the time-domain sampled data sequence is cleaned and reliable, accurate data is obtained, which provides a good basis for subsequent analysis based on the sampled data and helps improve the accuracy of the subsequent analysis results.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • the computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include, among computer readable media, non-persistent memory, random access memory (RAM), and/or non-volatile memory such as read only memory (ROM) or flash memory (flash RAM).
  • Memory is an example of a computer readable medium.
  • Computer readable media include both persistent and non-persistent, removable and non-removable media.
  • Such media can implement information storage by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • as used herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
  • embodiments of the present application can be provided as a method, system, or computer program product.
  • the present application can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment in combination of software and hardware.
  • the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

Disclosed are a data cleaning method and device. The data cleaning method comprises: obtaining a time-domain sampled data sequence of a physical quantity (201); removing abnormal data from the data sequence based on the time-domain characteristic of the physical quantity (202); and uniformly processing, in time, the continuous data segments in the data sequence from which the abnormal data has been removed (203). By means of the method, reliable and accurate time-domain sampled data can be obtained, and the accuracy of related analysis based on the obtained sampled data is improved.
PCT/CN2017/070190 2017-01-04 2017-01-04 Procédé et dispositif de nettoyage de données WO2018126367A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/070190 WO2018126367A1 (fr) 2017-01-04 2017-01-04 Procédé et dispositif de nettoyage de données

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/070190 WO2018126367A1 (fr) 2017-01-04 2017-01-04 Procédé et dispositif de nettoyage de données

Publications (1)

Publication Number Publication Date
WO2018126367A1 true WO2018126367A1 (fr) 2018-07-12

Family

ID=62788934

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/070190 WO2018126367A1 (fr) 2017-01-04 2017-01-04 Procédé et dispositif de nettoyage de données

Country Status (1)

Country Link
WO (1) WO2018126367A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522806A (zh) * 2020-04-26 2020-08-11 陈文海 大数据清洗处理方法、装置、服务器及可读存储介质
CN111625413A (zh) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 指标异常分析方法、装置及存储介质
US20210091866A1 (en) * 2015-07-17 2021-03-25 Feng Zhang Method, apparatus, and system for accurate wireless monitoring

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332042A (zh) * 2011-09-13 2012-01-25 东南大学 一种石英挠性加速度计启动模型的建模方法
US20120084323A1 (en) * 2010-10-02 2012-04-05 Microsoft Corporation Geographic text search using image-mined data
CN102609501A (zh) * 2012-02-02 2012-07-25 北京华电天仁电力控制技术有限公司 一种基于实时历史数据库的数据清洗方法
CN105719019A (zh) * 2016-01-21 2016-06-29 华南理工大学 一种考虑用户预约数据的公共自行车高峰期需求预测方法
CN105740627A (zh) * 2016-01-29 2016-07-06 深圳市奋达科技股份有限公司 一种心率计算方法及装置

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210091866A1 (en) * 2015-07-17 2021-03-25 Feng Zhang Method, apparatus, and system for accurate wireless monitoring
US11770197B2 (en) * 2015-07-17 2023-09-26 Origin Wireless, Inc. Method, apparatus, and system for accurate wireless monitoring
CN111625413A (zh) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 指标异常分析方法、装置及存储介质
CN111522806A (zh) * 2020-04-26 2020-08-11 陈文海 大数据清洗处理方法、装置、服务器及可读存储介质
CN111522806B (zh) * 2020-04-26 2023-07-07 上海聚均科技有限公司 大数据清洗处理方法、装置、服务器及可读存储介质

Similar Documents

Publication Publication Date Title
TWI620547B (zh) Information processing device, information processing method and information processing system
CN106653059B (zh) 婴儿啼哭原因的自动识别方法及其系统
CN109800483A (zh) 一种预测方法、装置、电子设备和计算机可读存储介质
KR101700656B1 (ko) 사용자 정보를 수집하기 위한 방법 및 장치
CN106161705B (zh) 音频设备测试方法及装置
WO2015196601A1 (fr) Procédé, appareil et dispositif pour tester un temps de réponse d'interface utilisateur, et support de stockage
JPWO2019159252A1 (ja) 生体信号を用いるストレス推定装置およびストレス推定方法
WO2018126367A1 (fr) Procédé et dispositif de nettoyage de données
US20140194756A1 (en) Biological rhythm disturbance degree calculating device, biological rhythm disturbance degree calculating system, biological rhythm disturbance degree calculating method, program, and recording medium
US20140067838A1 (en) Analysis module, cloud analysis system and method thereof
CN104636164B (zh) 启动页面生成方法及装置
WO2018126366A1 (fr) Procédé et appareil de mesure de température
CN111584035A (zh) 一种菜谱的推荐方法及装置和冰箱
CN112735563A (zh) 推荐信息的生成方法、装置和处理器
CN106775403A (zh) 获取卡顿信息的方法及装置
CN104297542A (zh) 一种基于用电量的提示方法及装置
CN106341712A (zh) 多媒体数据的处理方法及装置
WO2016131244A1 (fr) Procédé de surveillance de la santé d'un utilisateur, dispositif de surveillance et terminal de surveillance
CN110069468B (zh) 一种获取用户需求的方法及装置、电子设备
CN109870172B (zh) 计步检测方法、装置、设备及存储介质
CN105657575B (zh) 视频标注方法和装置
CN111093481B (zh) 温度展示方法和装置
CN105706409B (zh) 用于增强用户对于服务的参与度的方法、设备及系统
CN111414074A (zh) 屏幕浏览数据处理方法、装置、介质及电子设备
CN105551206A (zh) 一种基于情绪的提醒方法和相关装置及提醒系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17890037

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC , EPO FORM 1205A DATED 28.10.19.

122 Ep: pct application non-entry in european phase

Ref document number: 17890037

Country of ref document: EP

Kind code of ref document: A1