CN115151900A - Method and apparatus for filling in missing industrial longitudinal data - Google Patents
- Publication number
- CN115151900A
- Authority
- CN
- China
- Prior art keywords
- missing
- slices
- slice
- industrial
- trend
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Mathematical Analysis (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Computational Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Operations Research (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Algebra (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- General Factory Administration (AREA)
Abstract
Description
Technical Field
The present invention relates to industrial data processing and, more particularly, to a method, an apparatus and a computer-readable storage medium for filling in missing industrial longitudinal data.
Background
Industrial data is widely used in industrial applications such as condition monitoring of systems and devices and predictive maintenance. Some industrial data is time series data. For example, the load rate of a power grid may be collected at discrete points in time and varies over the course of a day.
Furthermore, data from a power grid can share similar patterns across different days, which may indicate different working modes of electricity consumers. In such cases, the time series data collected from the grid can be presented as longitudinal data, in which each slice is the observation for the corresponding day and represents the load rate of the grid, as shown in FIG. 1. Because grid data is periodic, this periodicity can be expressed in the form of longitudinal data: each slice (or instance) is one day of operating data, and all slices are arranged in chronological order. Here, missing data is defined as missing slices; in FIG. 1, several slices from May 2015 to July 2015 are missing.
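The slice representation described above can be sketched as follows. This is a minimal illustration, assuming one reading per hour (24 points per daily slice) and assuming that a day with any absent reading is treated as a missing slice; the name `to_daily_slices` and the use of `None` as the missing marker are hypothetical choices, not part of the disclosure.

```python
from typing import List, Optional

HOURS_PER_DAY = 24  # assumption: one reading per hour, so 24 points per slice


def to_daily_slices(
    readings: List[Optional[float]],
) -> List[Optional[List[float]]]:
    """Cut a flat hourly series into daily slices in chronological order.

    A day that contains any absent reading (None) is returned as None,
    i.e. it is treated as a missing slice.
    """
    if len(readings) % HOURS_PER_DAY != 0:
        raise ValueError("series length must be a whole number of days")
    slices: List[Optional[List[float]]] = []
    for start in range(0, len(readings), HOURS_PER_DAY):
        day = readings[start:start + HOURS_PER_DAY]
        slices.append(None if any(v is None for v in day) else list(day))
    return slices
```

Each element of the returned list is then one slice of the longitudinal data, and a `None` entry marks a slice to be filled.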
Missing data occurs when data values are not stored or collected, and it can have a significant effect on conclusions drawn from the data. Missing data is common in general and particularly common in longitudinal data.
Various methods have been proposed for filling in missing data in longitudinal data. There are also general-purpose imputation methods for time series data (non-longitudinal data), such as interpolation. However, to the best of our knowledge, no existing method handles longitudinal data in which each slice is itself a piece of time series data rather than a general multidimensional feature.
Summary of the Invention
In this disclosure, we propose a solution for filling in missing data in longitudinal data in the industrial domain, where each slice is time series data. In contrast to existing linear regression or interpolation, each slice is treated as a whole and the trend of the slices over time is also taken into account, so that missing data can be filled in a more meaningful way that reflects the real physical state.
Embodiments of the present disclosure include methods and apparatuses for filling in missing industrial longitudinal data.
According to a first aspect of the present disclosure, a method for filling in missing industrial longitudinal data is presented. The method comprises the following steps: collecting industrial longitudinal data, wherein the industrial longitudinal data includes missing slices and each slice corresponds to a collection time point; estimating an overall trend over time of all slices of the industrial longitudinal data; calculating a trend value for each missing slice based on the overall trend; for each missing slice, finding at least one similar slice based on the trend value; and filling each missing slice based on the at least one similar slice.
According to a second aspect of the present disclosure, an apparatus for filling in missing industrial longitudinal data is presented. The apparatus comprises: a data collection module configured to collect industrial longitudinal data, wherein the industrial longitudinal data includes missing slices and each slice corresponds to a collection time point; and a data processing module configured to: estimate an overall trend over time of all slices of the industrial longitudinal data; calculate a trend value for each missing slice based on the overall trend; for each missing slice, find at least one similar slice based on the trend value; and fill each missing slice based on the at least one similar slice.
According to a third aspect of the present disclosure, an apparatus for filling in missing industrial longitudinal data is presented. The apparatus comprises: at least one processor; and at least one memory coupled to the at least one processor, configured to perform the method according to the first aspect.
According to a fourth aspect of the present disclosure, a computer-readable medium for filling in missing industrial longitudinal data is presented. The computer-readable medium stores computer-executable instructions which, when executed, cause at least one processor to perform the method according to the first aspect.
Compared with currently used solutions, the solution provided in this disclosure estimates the overall trend of the slices of the longitudinal data and fills each missing slice as a whole based on that overall trend and on existing data with similar trend values. This yields a more reasonable filling solution that can be widely applied to periodic time series data.
Optionally, before the overall trend over time of all slices of the industrial longitudinal data is estimated, each slice may be normalized. It may then be determined whether all normalized slices have the same shape; if not, all slices may be divided into parts such that the slices within each part have the same shape. For each part that has missing slices, the overall trend of the slices in the part may be estimated, a trend value may be calculated for each missing slice based on the overall trend of the part, at least one similar slice may be found within the part for each missing slice based on the trend value, and each missing slice may be filled based on the at least one similar slice. To bring the filling result closer to the real state, slices with the same shape should first be identified for reference. However, slices with the same shape but markedly different amplitudes may be regarded as having different shapes because of the amplitude difference. To make more slices available for reference, the effect of differing amplitudes should first be removed by normalization. If genuinely different shapes remain over time after normalization, slices with the same shape should be handled as a separate part to ensure that the closest slices are referenced, and existing slices can then be selected from that part to fill the missing slices. With normalization and division by slice shape, the filling result can be more accurate and closer to the real state. The normalization method can be customized to the customer's requirements.
Optionally, the slices within a time period are determined to have the same shape if the shape differences between all normalized slices within that period are smaller than a predefined threshold.
Optionally, the trend value is the mean of the slice.
Brief Description of the Drawings
The above-mentioned attributes and other features and advantages of the present technology, and the manner of achieving them, will become more apparent, and the present technology itself will be better understood, by reference to the following description of embodiments of the present technology taken in conjunction with the accompanying drawings, in which:
FIG. 1 depicts industrial longitudinal data and missing data.
FIG. 2 depicts a block diagram of an apparatus for filling in missing industrial longitudinal data according to one embodiment of the present disclosure.
FIG. 3A depicts a flowchart of a method for filling in missing industrial longitudinal data according to one embodiment of the present disclosure.
FIG. 3B depicts a flowchart of step S302.
FIG. 4 depicts the normalized slices of the data in FIG. 1.
FIG. 5A and FIG. 5B depict the trends extracted from the slices in FIG. 1.
FIG. 6 depicts the filling results obtained with the solution provided in this disclosure.
Reference numerals:
10, apparatus for filling in missing industrial longitudinal data
101, data collection module
102, data processing module
103, at least one processor
104, at least one memory
105, communication module
30, data processing program
31, collected longitudinal data
300, method for filling in missing industrial longitudinal data
S301~S305, steps of method 300
S3021~S3023, sub-steps of S302
Detailed Description
Hereinafter, the above-described features and other features of the present technology are described in detail. Various embodiments are described with reference to the drawings, in which like reference numerals refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. The illustrated embodiments are intended to explain, not to limit, the invention. Such embodiments may evidently be practiced without these specific details.
When introducing elements of various embodiments of the present disclosure, the articles "a", "an" and "the" are intended to mean that there are one or more of the elements. The terms "comprising", "including" and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements.
The present disclosure will now be described in detail with reference to FIG. 2 to FIG. 6.
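FIG. 2 depicts a block diagram of an apparatus according to one embodiment of the present disclosure. The apparatus 10 for filling in missing industrial longitudinal data presented in this disclosure may be implemented as a network of computer processors to perform the method 300 for filling in missing industrial longitudinal data presented below. The apparatus 10 may also be a single computer, as shown in FIG. 2, comprising at least one memory 104 that includes a computer-readable medium such as random access memory (RAM). The apparatus 10 also comprises at least one processor 103 coupled to the at least one memory 104. Computer-executable instructions are stored in the at least one memory 104 and, when executed by the at least one processor 103, cause the at least one processor 103 to perform the steps described herein. The at least one processor 103 may include a microprocessor, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a state machine, and the like. Examples of computer-readable media include, but are not limited to, floppy disks, CD-ROMs, magnetic disks, memory chips, ROM, RAM, ASICs, configured processors, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions. Various other forms of computer-readable media may also transmit or carry instructions to a computer, including routers, private or public networks, and other wired or wireless transmission devices or channels. The instructions may comprise code in any computer programming language, including, for example, C, C++, C#, Visual Basic, Java and JavaScript.
The at least one memory 104 shown in FIG. 2 may contain a data processing program 30 which, when executed by the at least one processor 103, causes the at least one processor 103 to perform the method 300 for filling in missing industrial longitudinal data presented in this disclosure. The longitudinal data 31 may also be stored in the at least one memory 104. These data may be received via the communication module 105 of the apparatus 10.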
The data processing program 30 may include:
- a data collection module 101 configured to collect industrial longitudinal data 31;
- a data processing module 102 configured to process the collected industrial longitudinal data 31.
Here, the industrial longitudinal data 31 includes missing slices, and each slice corresponds to a collection time point.
In detail, the data processing module 102 is configured to:
- estimate the overall trend over time of all slices of the industrial longitudinal data 31;
- calculate a trend value for each missing slice based on the overall trend;
- for each missing slice, find at least one similar slice based on the trend value;
- fill each missing slice based on the at least one similar slice.
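Optionally, the data processing module 102 is further configured to: normalize each slice of the industrial longitudinal data before estimating the overall trend over time of all slices; when estimating the overall trend, determine whether all normalized slices have the same shape and, if not, divide all slices into parts such that the slices in each part have the same shape, and, for each part that has missing slices, estimate the overall trend of the slices in the part; when calculating the trend value of each missing slice, calculate it, for each part that has missing slices, based on the overall trend of the part; when finding at least one similar slice for each missing slice, find it, for each part that has missing slices, within the part based on the trend value; and when filling each missing slice, fill it, for each part that has missing slices, based on the at least one similar slice.
Optionally, the data processing module 102 is further configured, when determining whether all normalized slices of the industrial longitudinal data have the same shape, to determine that the slices within a time period have the same shape if the shape differences between all normalized slices within that period are smaller than a predefined threshold.
Optionally, the trend value is the mean of the slice.
The details of the processing performed by the data processing module 102 are described later with reference to FIG. 3A and FIG. 3B.
Although the data collection module 101 and the data processing module 102 are described above as software modules of the data processing program 30, they may also be implemented in hardware, for example as ASIC chips. The two modules may be integrated into one chip, or implemented separately and electrically connected.
It should be noted that the present disclosure may cover apparatuses whose architecture differs from the one shown in FIG. 2. The above architecture is merely exemplary and serves to explain the exemplary method 300 shown in FIG. 3A and FIG. 3B.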
Various methods according to the present disclosure can be carried out. An exemplary method 300 according to the present disclosure includes the following steps:
S301: collect industrial longitudinal data, wherein the industrial longitudinal data includes missing slices and each slice corresponds to a collection time point;
S302: estimate the overall trend over time of all slices of the industrial longitudinal data;
S303: calculate a trend value for each missing slice based on the overall trend;
S304: for each missing slice, find at least one similar slice based on the trend value;
S305: fill each missing slice based on the at least one similar slice.
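Steps S302 to S305 can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the trend value is the slice mean (one of the options named in this disclosure), the trend of a missing slice is obtained by linear interpolation in time between the nearest existing slices (a simple stand-in for the estimation techniques discussed later), and a missing slice is filled with a copy of the single existing slice whose trend value is closest. The names `fill_missing_slices` and `mean` are hypothetical.

```python
from typing import Dict, List, Optional

Slice = List[float]


def mean(s: Slice) -> float:
    return sum(s) / len(s)


def fill_missing_slices(slices: List[Optional[Slice]]) -> List[Slice]:
    """Fill None entries (missing slices) using trend-similar existing slices."""
    # S302: overall trend, here the mean of every existing slice over time.
    known: Dict[int, float] = {
        t: mean(s) for t, s in enumerate(slices) if s is not None
    }
    if not known:
        raise ValueError("no existing slices to learn from")
    times = sorted(known)
    filled: List[Slice] = []
    for t, s in enumerate(slices):
        if s is not None:
            filled.append(list(s))
            continue
        # S303: trend value of the missing slice (assumed: linear interpolation
        # in time; at the edges the nearest known trend is reused).
        before = [k for k in times if k < t]
        after = [k for k in times if k > t]
        if before and after:
            t0, t1 = before[-1], after[0]
            w = (t - t0) / (t1 - t0)
            trend = (1 - w) * known[t0] + w * known[t1]
        else:
            trend = known[before[-1] if before else after[0]]
        # S304: existing slice whose trend value is closest to the estimate.
        best = min(times, key=lambda k: abs(known[k] - trend))
        # S305: fill the missing slice as a whole from the similar slice.
        filled.append(list(slices[best]))
    return filled
```

Copying one similar slice is the simplest possible fill; blending several similar slices is an equally valid reading of "at least one similar slice".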
Optionally, before step S302 of estimating the overall trend over time of all slices of the industrial longitudinal data, the method 300 may further include:
S301': normalize each slice of the industrial longitudinal data.
Referring next to FIG. 3B, step S302 may include the following sub-steps:
S3021: determine whether all normalized slices of the industrial longitudinal data have the same shape; if not, the procedure proceeds to sub-step S3022.
S3022: divide all slices of the industrial longitudinal data into parts, wherein the slices in each part have the same shape; and
S3023: for each part that has missing slices, estimate the overall trend of the slices in the part.
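Sub-step S3022 can be sketched as a simple grouping of normalized slices. The disclosure does not prescribe a partitioning algorithm, so the approach below is an assumption: each slice joins the first part whose representative (the first slice placed in that part) lies within a Euclidean distance threshold, and otherwise starts a new part. The name `split_into_parts` and the threshold value in the usage note are hypothetical.

```python
import math
from typing import List


def split_into_parts(
    normalized: List[List[float]], threshold: float
) -> List[List[int]]:
    """Group slice indices so that slices in one part have a similar shape."""
    parts: List[List[int]] = []  # part[0] is the index of the representative
    for i, s in enumerate(normalized):
        for part in parts:
            representative = normalized[part[0]]
            if math.dist(s, representative) < threshold:
                part.append(i)
                break
        else:
            # No existing part is close enough, so this slice starts a new one.
            parts.append([i])
    return parts
```

For example, with a threshold of 0.3, three slices that rise over the day and one slice that falls over the day end up in two parts.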
Step S303 may then include: for each part that has missing slices, calculating a trend value for each missing slice based on the overall trend of the part. Step S304 may include: for each part that has missing slices, finding within the part at least one similar slice for each missing slice based on the trend value. Step S305 may include: for each part that has missing slices, filling each missing slice based on the at least one similar slice.
Optionally, in sub-step S3021, the slices within a time period may be determined to have the same shape if the shape differences between all normalized slices within that period are smaller than a predefined threshold.
Next, an exemplary embodiment is described using the power grid data shown in FIG. 1 as an example.
As shown in FIG. 1, power data of the grid was collected in the form of longitudinal data from January 2015 to June 2017. Several slices from May 2015 to June 2015 are missing. With the solution provided in this disclosure, the missing slices can be filled by reference to the data trend along the time axis and to existing slices.
The basic idea is to fill in missing data with the help of the overall trend and of other slices. Optionally, to eliminate the effect of amplitude differences on the shape judgment, the collected data may be normalized in step S301'; otherwise, slices that differ only in amplitude might be regarded as having different shapes. Referring to FIG. 1, the amplitude of the load rate in winter (around January and February) is markedly lower than in summer (around July, August and September of the same year), while the shape of the slices is the same (from 0:00 to 24:00, the load rate first stays low, then rises gradually, stays flat and finally falls).
Then, in sub-step S3021, the normalized slices can be compared with one another with respect to their shape. The Euclidean distance can be used to measure the difference in shape between slices. If the shape difference of the normalized slices over time is greater than a predefined threshold, all slices can be split into parts in sub-step S3022 such that all slices within each part share a similar shape. Then, in sub-step S3023, the overall trend of the slices over time can be calculated for each part (or for all slices, if there is no significant shape difference among them). Optionally, the mean of each slice can serve as the calculated trend value of the slice.
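The shape comparison of sub-step S3021 can be sketched with the Euclidean distance named above. The function names `shape_distance` and `same_shape` are hypothetical, and comparing every pair of slices is one straightforward reading of "the shape differences between all normalized slices"; the threshold value is application-specific.

```python
import math
from typing import List


def shape_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two normalized slices of equal length."""
    if len(a) != len(b):
        raise ValueError("slices must have the same number of points")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def same_shape(slices: List[List[float]], threshold: float) -> bool:
    """True if every pair of normalized slices differs by less than threshold."""
    for i in range(len(slices)):
        for j in range(i + 1, len(slices)):
            if shape_distance(slices[i], slices[j]) >= threshold:
                return False
    return True
```

If `same_shape` returns `False` for the whole data set, the slices would be divided into parts as described in sub-step S3022.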
Next, in step S303, several techniques, such as polynomial curve fitting and Gaussian processes, can be applied to estimate the trend values of the missing slices in each part. In steps S304 and S305, given the estimated trend value of a missing slice, the missing slice can be filled based on other slices with similar trend values.
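The trend estimation of step S303 can be sketched with the simplest case of polynomial curve fitting, a degree-1 least-squares line fitted to the means of the existing slices. This is an illustrative stand-in, not the disclosed implementation; a higher-degree polynomial or a Gaussian process could be substituted. The names `slice_mean` and `estimate_trends` are hypothetical.

```python
from typing import List, Optional


def slice_mean(s: List[float]) -> float:
    return sum(s) / len(s)


def estimate_trends(slices: List[Optional[List[float]]]) -> List[float]:
    """Return one trend value per slice; None entries get a fitted value."""
    known = [(t, slice_mean(s)) for t, s in enumerate(slices) if s is not None]
    if len(known) < 2:
        raise ValueError("need at least two existing slices to fit a line")
    n = len(known)
    t_bar = sum(t for t, _ in known) / n
    y_bar = sum(y for _, y in known) / n
    # Closed-form least-squares slope and intercept of the line y = a*t + b.
    sxx = sum((t - t_bar) ** 2 for t, _ in known)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in known)
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    known_trend = dict(known)
    # Existing slices keep their own mean; missing slices use the fitted line.
    return [known_trend.get(t, slope * t + intercept) for t in range(len(slices))]
```

A straight line is unlikely to capture seasonal behaviour over long gaps, which is one reason a more flexible model may be preferred in practice.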
In step S301', for each slice, feature scaling can be used as the normalization method:
where x is the original slice and x_normalized is the normalized slice. All normalized slices of the data are plotted in FIG. 4. Each curve in FIG. 4 can be regarded as the shape of the corresponding slice. FIG. 4 shows that almost all slices share the same shape, so the data need not be cut into different parts. Otherwise, techniques such as clustering can be used to separate the slices so that the slices in the same part have the same shape.
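The scaling formula itself is not reproduced in the text. A minimal sketch, assuming that feature scaling here means min-max rescaling of each slice to the range 0 to 1, i.e. x_normalized = (x - min(x)) / (max(x) - min(x)); the function name `normalize_slice` and the handling of a flat slice are illustrative assumptions, not taken from the disclosure.

```python
from typing import List


def normalize_slice(x: List[float]) -> List[float]:
    """Min-max rescale one slice so that its values span the range 0 to 1."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Assumption: a completely flat slice carries no shape, so map it to 0.
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]
```

After this step two slices with the same daily profile but different amplitudes map to the same curve, which is what allows their shapes to be compared directly.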
FIG. 5A and FIG. 5B show the trend extracted from each slice. Here, the mean of each slice (bold dots) is used as the trend. Referring to FIG. 5B, in steps S302 and S303 a Gaussian process can be applied to estimate the trend of the missing slices (dotted line). Where there are multiple parts because the slice shape is not constant, this step is performed on each part separately.
In step S304, at least one existing slice with the most similar trend value can be found. Here, using two existing slices, slice_a and slice_b, the missing slice can be expressed as:
where slice_a and slice_b are the two slices whose trend values are closest to that of slice_missing, and trend_a, trend_b and trend_missing are the trend values of the respective slices. The number of existing slices used to compute a missing slice can be chosen according to the requirements of different applications.
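The expression for the missing slice is not reproduced in the text. One possible reading, offered purely as an assumption, is a linear blend of slice_a and slice_b weighted by where trend_missing lies between trend_a and trend_b. With the slice mean as the trend value, this blend has the convenient property that the mean of the filled slice equals trend_missing. The function name `fill_slice` and the equal-weight fallback are hypothetical.

```python
from typing import List


def fill_slice(
    slice_a: List[float],
    slice_b: List[float],
    trend_a: float,
    trend_b: float,
    trend_missing: float,
) -> List[float]:
    """Blend two similar slices so the result matches the estimated trend."""
    if len(slice_a) != len(slice_b):
        raise ValueError("slices must have the same number of points")
    if trend_a == trend_b:
        w = 0.5  # assumption: equal weights when both trends coincide
    else:
        # Position of the missing trend between the two reference trends.
        w = (trend_missing - trend_a) / (trend_b - trend_a)
    return [(1 - w) * p + w * q for p, q in zip(slice_a, slice_b)]
```

Because the whole slice is produced from whole reference slices, the daily profile is preserved rather than being rebuilt point by point.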
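FIG. 6 shows the filling results of the present disclosure and of other methods. In the middle panel of FIG. 6, linear regression is used to fill in the missing data. The points on a slice are its dimensions; here each slice has 24 dimensions, each representing a specific hour of the day. The percentages on the right are the values of the points on a slice, in the range 0 to 1. Because the missing values are filled separately for each dimension of each slice, the filled slices are no longer meaningful. Techniques other than linear regression could also be applied here, but they share the same drawback. In the bottom panel, where the missing slices are filled according to the present disclosure, the result is more reasonable: a sharp increase can be seen on May 1, 2015, which matches the trend over the same period in 2016 and 2017. In the middle panel, by contrast, the missing slices increase only gradually, so such methods lose information about important changes.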
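The per-dimension baseline criticized above can be sketched to show why it produces a gradual ramp. This sketch uses linear interpolation between the two existing slices that border the gap, which is a simplification of the linear regression mentioned in the text and is labeled here as an assumption; the name `fill_per_dimension` is hypothetical.

```python
from typing import List


def fill_per_dimension(
    before: List[float], after: List[float], n_missing: int
) -> List[List[float]]:
    """Fill a gap of n_missing slices hour by hour, ignoring slice shape."""
    if len(before) != len(after):
        raise ValueError("slices must have the same number of points")
    filled = []
    for k in range(1, n_missing + 1):
        w = k / (n_missing + 1)  # fraction of the way through the gap
        # Every dimension (hour) is interpolated independently of the others.
        filled.append([(1 - w) * p + w * q for p, q in zip(before, after)])
    return filled
```

Each filled slice is only a fixed mixture of the two bordering slices, so any abrupt change inside the gap is smoothed into an even transition.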
The following are two use cases in which the solution provided in this disclosure can be employed.
Use case 1: condition assessment manager for transformers
Data is collected at the transformer and transferred to a data management system. After data processing and analysis, a health report for the transformer and load-shifting recommendations can be provided to the customer. For many reasons the collected data will be incomplete, so a method for filling in missing data is needed in the data processing and analysis stage.
Use case 2: distributed energy systems
There are various applications in the field of distributed energy systems, for example load balancing, peak avoidance and theft prevention. All of these applications rely on continuous monitoring of the relevant devices, which has a low tolerance for missing data, making a filling method an indispensable part of data processing.
The present disclosure also provides a computer-readable medium storing executable instructions which, when executed by a computer, cause the computer to perform any of the methods presented in this disclosure.
Also provided is a computer program which, when executed by at least one processor, performs any of the methods presented in this disclosure.
Although the present technology has been described in detail with reference to certain embodiments, it should be understood that the present technology is not limited to those precise embodiments. Rather, in view of this disclosure, which describes exemplary modes for practicing the invention, many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the invention. The scope of the invention is therefore indicated by the appended claims rather than by the foregoing description. All changes, modifications and variations that come within the meaning and range of equivalency of the claims are to be considered within their scope.
Claims (10)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2020/076273 WO2021164028A1 (en) | 2020-02-21 | 2020-02-21 | Method and apparatus for filling missing industrial longitudinal data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115151900A true CN115151900A (en) | 2022-10-04 |
Family
ID=77391429
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202080097170.XA Pending CN115151900A (en) | 2020-02-21 | 2020-02-21 | Method and apparatus for filling in missing industrial longitudinal data |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN115151900A (en) |
| WO (1) | WO2021164028A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20060026156A1 (en) * | 2004-07-28 | 2006-02-02 | Heather Zuleba | Method for linking de-identified patients using encrypted and unencrypted demographic and healthcare information from multiple data sources |
| US20130226613A1 (en) * | 2012-02-23 | 2013-08-29 | Robert Bosch Gmbh | System and Method for Estimation of Missing Data in a Multivariate Longitudinal Setup |
| WO2017044082A1 (en) * | 2015-09-09 | 2017-03-16 | Intel Corporation | Separated application security management |
| CN109947812A (en) * | 2018-07-09 | 2019-06-28 | 平安科技(深圳)有限公司 | Consecutive miss value fill method, data analysis set-up, terminal and storage medium |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8941652B1 (en) * | 2012-05-23 | 2015-01-27 | Google Inc. | Incremental surface hole filling |
| CN106844781B (en) * | 2017-03-10 | 2020-04-21 | 广州视源电子科技股份有限公司 | Data processing method and device |
| US11775873B2 (en) * | 2018-06-11 | 2023-10-03 | Oracle International Corporation | Missing value imputation technique to facilitate prognostic analysis of time-series sensor data |
| CN109460775B (en) * | 2018-09-20 | 2020-09-11 | 国家计算机网络与信息安全管理中心 | A data filling method and device based on information entropy |
-
2020
- 2020-02-21 CN CN202080097170.XA patent/CN115151900A/en active Pending
- 2020-02-21 WO PCT/CN2020/076273 patent/WO2021164028A1/en not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| WO2021164028A1 (en) | 2021-08-26 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||