
CN115151900A - Method and apparatus for filling in missing industrial longitudinal data - Google Patents

Method and apparatus for filling in missing industrial longitudinal data

Info

Publication number
CN115151900A
Authority
CN
China
Prior art keywords
missing
slices
slice
industrial
trend
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080097170.XA
Other languages
Chinese (zh)
Inventor
周林飞
李晶
丹尼尔·施尼盖斯
田鹏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens Corp
Original Assignee
Siemens Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • General Factory Administration (AREA)

Abstract

A method, apparatus, system, and computer-readable medium for filling in missing industrial longitudinal data are presented. In contrast to current linear regression or interpolation, the solution treats each slice as a whole and also takes into account the trend of the slices over time, so that missing data can be filled in a more meaningful way that reflects the true physical state.

Description

Method and apparatus for filling in missing industrial longitudinal data

Technical Field

The present invention relates to industrial data processing and, more particularly, to a method, an apparatus and a computer-readable storage medium for filling in missing industrial longitudinal data.

Background

Industrial data is widely used in industrial fields such as condition monitoring of systems and devices and predictive maintenance. Some industrial data is time series data. For example, data on the load rate of a power grid may be collected at discrete points in time and varies over the course of a day.

Furthermore, data from a grid can share similar patterns across different days, which can indicate different working patterns of electricity consumers. In such cases, the time series data collected from the grid can be presented as longitudinal data, where each slice is the observation for the corresponding day and represents the load rate of the grid, as shown in FIG. 1. Because grid data is periodic, the periodicity can be represented in the form of longitudinal data. For example, each slice (or instance) is one day of operating data, and all slices are arranged in chronological order. Here, missing data is defined as missing slices, as shown in FIG. 1, where several slices from May 2015 to July 2015 are missing.
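The slice representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `to_daily_slices`, the `(day, hour, value)` reading format, and the rule that a day without all 24 hourly readings counts as a missing slice are hypothetical choices, not taken from the patent.

```python
from datetime import date, timedelta


def to_daily_slices(readings, start, end):
    """Group (day, hour, value) readings into one 24-value slice per day.

    Assumption for this sketch: a day that lacks any of its 24 hourly
    readings is treated as a missing slice and stored as None.
    """
    by_day = {}
    for day, hour, value in readings:
        by_day.setdefault(day, {})[hour] = value

    slices = {}
    day = start
    while day <= end:
        hours = by_day.get(day, {})
        if all(h in hours for h in range(24)):
            slices[day] = [hours[h] for h in range(24)]
        else:
            slices[day] = None  # missing slice
        day += timedelta(days=1)
    return slices
```

Because Python dictionaries preserve insertion order, the returned mapping keeps the slices in chronological order, which matches the arrangement the text describes.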

Missing data occurs when data values are not stored or collected, and it can have a significant impact on conclusions drawn from the data. Missing data is widespread and is certainly common in longitudinal data.

Various methods have been proposed for filling in missing data in longitudinal data. There are also general-purpose imputation methods for time series data (non-longitudinal data), e.g., interpolation. However, to the best of our knowledge, no existing method handles longitudinal data in which each slice is a piece of time series data rather than a general multidimensional feature.

Summary of the Invention

In this disclosure, we propose a solution for filling in missing data in longitudinal data in the industrial domain, where each slice is time series data. In contrast to current linear regression or interpolation, the solution treats each slice as a whole and also considers the trend of the slices over time, so that missing data can be filled in a more meaningful way that reflects the true physical state.

Embodiments of the present disclosure include methods and apparatuses for filling in missing industrial longitudinal data.

According to a first aspect of the present disclosure, a method for filling in missing industrial longitudinal data is presented. The method comprises the steps of: collecting industrial longitudinal data, wherein the industrial longitudinal data includes missing slices and each slice corresponds to a collection time point; estimating an overall trend over time of all slices of the industrial longitudinal data; calculating a trend value for each missing slice based on the overall trend; for each missing slice, finding at least one similar slice based on the trend value; and filling each missing slice based on the at least one similar slice.

According to a second aspect of the present disclosure, an apparatus for filling in missing industrial longitudinal data is presented. The apparatus comprises: a data collection module configured to collect industrial longitudinal data, wherein the industrial longitudinal data includes missing slices and each slice corresponds to a collection time point; and a data processing module configured to: estimate an overall trend over time of all slices of the industrial longitudinal data; calculate a trend value for each missing slice based on the overall trend; for each missing slice, find at least one similar slice based on the trend value; and fill each missing slice based on the at least one similar slice.

According to a third aspect of the present disclosure, an apparatus for filling in missing industrial longitudinal data is presented. The apparatus comprises: at least one processor; and at least one memory coupled to the at least one processor, the apparatus being configured to perform the method according to the first aspect.

According to a fourth aspect of the present disclosure, a computer-readable medium for filling in missing industrial longitudinal data is presented. The computer-readable medium stores computer-executable instructions which, when executed, cause at least one processor to perform the method according to the first aspect.

Compared with currently used solutions, the solution provided in this disclosure estimates the overall trend of the slices of the longitudinal data and fills each missing slice as a whole, based on that overall trend and on existing data with similar trend values. This gives a more reasonable filling solution that can be widely applied to periodic time series data.

Optionally, each slice of the industrial longitudinal data can be normalized before the overall trend over time of all slices is estimated. It can then be determined whether all normalized slices of the industrial longitudinal data have the same shape; if not, all slices can be divided into parts such that the slices in each part have the same shape. For each part with missing slices, the overall trend of the slices in the part can be estimated, a trend value can be calculated for each missing slice based on the overall trend of the part, at least one similar slice can be found in the part for each missing slice based on the trend value, and each missing slice can be filled based on the at least one similar slice. To bring the fill result closer to the true state, slices with the same shape should be found first for reference. However, slices that have the same shape but clearly different amplitudes may be mistaken for slices of different shapes. To make more slices available for reference, the effect of differing amplitudes should therefore first be removed by normalization. If the shapes still genuinely differ over time after normalization, slices with the same shape should be handled as a separate part, so that the closest slices are used for reference, and existing slices from that part are selected to fill the missing slices. With normalization and division by slice shape, the fill result can be more accurate and closer to the true state. The normalization method can be customized to the customer's requirements.

Optionally, the slices during a time period are determined to have the same shape if the shape differences between all normalized slices during that time period are less than a predefined threshold.

Optionally, the trend value is the mean of the slice.

Brief Description of the Drawings

The above-mentioned attributes and other features and advantages of the present technology, and the manner of attaining them, will become more apparent, and the present technology itself will be better understood, by reference to the following description of embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 depicts industrial longitudinal data and missing data.

FIG. 2 depicts a block diagram of an apparatus for filling in missing industrial longitudinal data according to one embodiment of the present disclosure.

FIG. 3A depicts a flowchart of a method for filling in missing industrial longitudinal data according to one embodiment of the present disclosure.

FIG. 3B depicts a flowchart of step S302.

FIG. 4 depicts normalized slices of the data in FIG. 1.

FIGS. 5A and 5B depict the trends extracted from the slices in FIG. 1.

FIG. 6 depicts the fill results obtained with the solution provided in this disclosure.

Reference numerals:

10: apparatus for filling in missing industrial longitudinal data

101: data collection module

102: data processing module

103: at least one processor

104: at least one memory

105: communication module

30: data processing program

31: collected longitudinal data

300: method for filling in missing industrial longitudinal data

S301~S305: steps of method 300

S3021~S3023: sub-steps of S302

Detailed Description

In the following, the above-described features and other features of the present technology are described in detail. Various embodiments are described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. The illustrated embodiments are intended to explain, not to limit, the invention. Such embodiments may evidently be practiced without these specific details.

When introducing elements of various embodiments of the present disclosure, the articles "a", "an" and "the" are intended to mean that there are one or more of the elements. The terms "comprising", "including" and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements.

The present disclosure is now described in detail with reference to FIGS. 2 to 6.

FIG. 2 depicts a block diagram of an apparatus according to one embodiment of the present disclosure. The apparatus 10 for filling in missing industrial longitudinal data may be implemented as a network of computer processors that executes the method 300 described below. The apparatus 10 may also be a single computer, as shown in FIG. 2, which includes at least one memory 104 comprising a computer-readable medium, such as random access memory (RAM). The apparatus 10 also includes at least one processor 103 coupled to the at least one memory 104. Computer-executable instructions are stored in the at least one memory 104 and, when executed by the at least one processor 103, cause the at least one processor 103 to perform the steps described herein. The at least one processor 103 may include a microprocessor, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a state machine, and the like. Examples of computer-readable media include, but are not limited to, floppy disks, CD-ROMs, magnetic disks, memory chips, ROM, RAM, ASICs, configured processors, all optical media, all magnetic tapes or other magnetic media, or any other medium from which a computer processor can read instructions. Various other forms of computer-readable media can also transmit or carry instructions to a computer, including routers, private or public networks, or other transmission devices or channels, both wired and wireless. The instructions may comprise code from any computer programming language, including, for example, C, C++, C#, Visual Basic, Java, and JavaScript.

The at least one memory 104 shown in FIG. 2 may contain a data processing program 30 which, when executed by the at least one processor 103, causes the at least one processor 103 to perform the method 300 for filling in missing industrial longitudinal data presented in this disclosure. The longitudinal data 31 may also be stored in the at least one memory 104. This data may be received via the communication module 105 of the apparatus 10.

The data processing program 30 may include:

- a data collection module 101 configured to collect industrial longitudinal data 31;

- a data processing module 102 configured to process the collected industrial longitudinal data 31.

Here, the industrial longitudinal data 31 includes missing slices, and each slice corresponds to a collection time point.

In detail, the data processing module 102 is configured to:

- estimate the overall trend over time of all slices of the industrial longitudinal data 31;

- calculate the trend value of each missing slice based on the overall trend;

- for each missing slice, find at least one similar slice based on the trend value;

- fill each missing slice based on the at least one similar slice.

Optionally, the data processing module 102 is further configured to normalize each slice of the industrial longitudinal data before estimating the overall trend of all slices over time. When estimating the overall trend, the module determines whether all normalized slices have the same shape; if not, it divides all slices into parts such that the slices in each part have the same shape and, for each part with missing slices, estimates the overall trend of the slices in that part. Then, for each part with missing slices, the module calculates the trend value of each missing slice based on the overall trend of the part, finds at least one similar slice in the part for each missing slice based on the trend value, and fills each missing slice based on the at least one similar slice.

Optionally, when determining whether all normalized slices of the industrial longitudinal data have the same shape, the data processing module 102 is further configured to determine that the slices during a time period have the same shape if the shape differences between all normalized slices during that time period are less than a predefined threshold.

Optionally, the trend value is the mean of the slice.

Details of the data processing performed by the data processing module 102 are described later with reference to FIGS. 3A and 3B.

Although the data collection module 101 and the data processing module 102 are described above as software modules of the data processing program 30, they may also be implemented in hardware, for example as ASIC chips. The data collection module and the data processing module can be integrated into one chip, or implemented separately and electrically connected.

It should be mentioned that the present disclosure may encompass apparatuses whose architecture differs from that shown in FIG. 2. The above architecture is exemplary only and serves to explain the exemplary method 300 shown in FIGS. 3A and 3B.

Various methods may be carried out in accordance with the present disclosure. One exemplary method 300 according to the present disclosure includes the following steps:

S301: collecting industrial longitudinal data, wherein the industrial longitudinal data includes missing slices and each slice corresponds to a collection time point;

S302: estimating the overall trend over time of all slices of the industrial longitudinal data;

S303: calculating the trend value of each missing slice based on the overall trend;

S304: for each missing slice, finding at least one similar slice based on the trend value;

S305: filling each missing slice based on the at least one similar slice.
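Steps S302 to S305 can be sketched end to end as follows. This is a deliberately minimal illustration, not the patented implementation: the function name `fill_missing_slices` is hypothetical, the trend value is taken to be the slice mean (one of the options the text mentions), the trend of a missing slice is estimated by plain linear interpolation between the neighbouring observed trends (a stand-in assumption for the curve-fitting techniques named in the description), and only a single similar slice is used for filling.

```python
from statistics import mean


def fill_missing_slices(slices):
    """Fill None entries in a chronologically ordered list of slices.

    Each observed slice is a list of floats; a missing slice is None.
    """
    known = [i for i, s in enumerate(slices) if s is not None]
    if not known:
        raise ValueError("at least one observed slice is required")

    # S302: overall trend, here the mean of every observed slice.
    trend = {i: mean(slices[i]) for i in known}

    filled = list(slices)
    for i, s in enumerate(slices):
        if s is not None:
            continue
        # S303: trend value of the missing slice (linear interpolation
        # between the nearest observed neighbours; an assumption).
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            w = (i - lo) / (hi - lo)
            target = (1 - w) * trend[lo] + w * trend[hi]
        else:
            target = trend[before[-1] if before else after[0]]
        # S304: observed slice whose trend value is closest to the target.
        nearest = min(known, key=lambda k: abs(trend[k] - target))
        # S305: fill the missing slice as a whole from that similar slice.
        filled[i] = list(slices[nearest])
    return filled
```

Note that the similar slice is chosen by trend value, not by position in time, so a missing day may be filled from a day far away in the series if that day has the closest trend.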

Optionally, before step S302 of estimating the overall trend over time of all slices of the industrial longitudinal data, the method 300 may further include:

S301': normalizing each slice of the industrial longitudinal data.

Referring next to FIG. 3B, step S302 may then include the following sub-steps:

S3021: determining whether all normalized slices of the industrial longitudinal data have the same shape and, if not, proceeding to sub-step S3022;

S3022: dividing all slices of the industrial longitudinal data into parts, wherein the slices in each part have the same shape; and

S3023: for each part with missing slices, estimating the overall trend of the slices in the part.

Step S303 may then include: for each part with missing slices, calculating the trend value of each missing slice based on the overall trend of the part. Step S304 may include: for each part with missing slices, finding at least one similar slice in the part for each missing slice based on the trend value. Step S305 may include: for each part with missing slices, filling each missing slice based on the at least one similar slice.

Optionally, in sub-step S3021, the slices during a time period may be determined to have the same shape if the shape differences between all normalized slices during that time period are less than a predefined threshold.

Next, an exemplary embodiment is described, taking the data from the grid shown in FIG. 1 as an example.

As shown in FIG. 1, grid power data in the form of longitudinal data was collected from January 2015 to June 2017. Several slices from May 2015 to June 2015 are missing. With the solution provided in this disclosure, the missing slices can be filled with reference to the data trend along the time axis and to the existing slices.

The basic idea is to fill in the missing data with the help of the overall trend and the other slices. Optionally, to eliminate the influence of amplitude differences on the shape judgment, normalization may be performed on the collected data in step S301'; otherwise, slices that differ only in amplitude may be regarded as having different shapes. Referring to FIG. 1, the amplitude of the load rate in winter (around January and February) is clearly lower than in summer (around July, August and September of the same year), while the shape of the slices is consistent: from 0:00 to 24:00 the load first stays low, then rises gradually, stays flat, and finally falls.

Then, in sub-step S3021, the normalized slices can be compared with one another with respect to their shape. The Euclidean distance can be used to measure the difference in shape between slices. If the shape difference between normalized slices over time is greater than a predefined threshold, then in sub-step S3022 all slices may be split into parts such that, within each part, all slices share a similar shape. Then, in sub-step S3023, the overall trend of the slices over time may be calculated for each part (or for all slices, if there is no clear shape difference among them). Optionally, the mean of each slice can serve as the calculated trend value of the slice.
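A minimal sketch of the shape comparison in sub-steps S3021 and S3022 is shown below. The Euclidean distance and the threshold test come from the text; the function names and the simple chronological splitting rule (start a new part whenever a slice is too far from the first slice of the current part) are assumptions made for illustration. The inputs are expected to be observed, already normalized slices of equal length.

```python
import math
from itertools import combinations


def shape_distance(a, b):
    """Euclidean distance between two normalized slices of equal length."""
    return math.dist(a, b)


def have_same_shape(normalized_slices, threshold):
    """S3021: True if every pairwise shape difference is below the threshold."""
    return all(
        shape_distance(a, b) < threshold
        for a, b in combinations(normalized_slices, 2)
    )


def split_by_shape(normalized_slices, threshold):
    """S3022 (illustrative rule): split chronologically into index groups.

    A slice joins the current part if it is closer than the threshold to
    the first slice of that part; otherwise it opens a new part.
    """
    parts = []
    for idx, s in enumerate(normalized_slices):
        if parts and shape_distance(normalized_slices[parts[-1][0]], s) < threshold:
            parts[-1].append(idx)
        else:
            parts.append([idx])
    return parts
```

The chronological rule keeps each part contiguous in time, which suits shapes that change at a particular date; a clustering algorithm would be an alternative when similar shapes recur at different times.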

Next, in step S303, several techniques can be applied to estimate the trend values of the missing slices in each part, for example polynomial curve fitting or a Gaussian process. In steps S304 and S305, given the estimated trend values of the missing slices, the missing slices can be filled based on other slices with similar trend values.
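As an illustration of the polynomial curve fitting option for step S303, the sketch below fits a least-squares polynomial to the trend values of the observed slices and evaluates it at the positions of the missing slices. The function names, the use of the slice index as the time variable, and the default degree of 2 are assumptions; the fit is written in plain Python via the normal equations so that no external library is needed, and it requires at least `degree + 1` observed trend values.

```python
def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... + c[degree]*x**degree."""
    n = degree + 1
    # Normal equations A c = b for the polynomial least-squares problem.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


def estimate_missing_trends(trends, degree=2):
    """Return {index: estimated trend} for every None entry in `trends`."""
    xs = [i for i, t in enumerate(trends) if t is not None]
    ys = [trends[i] for i in xs]
    coeffs = polyfit(xs, ys, degree)
    return {
        i: sum(c * i ** k for k, c in enumerate(coeffs))
        for i, t in enumerate(trends)
        if t is None
    }
```

For long series or high degrees the normal equations become ill-conditioned, so a production implementation would more likely rely on a numerical library or on a Gaussian process as mentioned in the text.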

In step S301', feature scaling can be used as the normalization method for each slice:

[Equation image BDA0003806505850000091, not reproduced in this text]

where x is the original slice and x_normalized is the normalized slice. All normalized slices of the data are shown in FIG. 4. Each curve in FIG. 4 can be viewed as the shape of the corresponding slice. FIG. 4 shows that almost all slices share the same shape, so the data does not need to be split into different parts. Otherwise, a technique such as clustering could be used to separate the slices so that the slices in the same part have the same shape.
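The normalization formula itself is only available as an image in the source, so the following sketch rests on an assumption: that "feature scaling" here means the common min-max rescaling, x_normalized = (x - min(x)) / (max(x) - min(x)), applied to each slice separately. The function name `normalize_slice` and the handling of a constant slice are also illustrative choices.

```python
def normalize_slice(values):
    """Min-max scale one slice to the range [0, 1].

    Assumption: a constant slice (max == min) is mapped to all zeros to
    avoid division by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Under this scaling, two slices that differ only by an offset and a positive scale factor normalize to the same curve, which is the property needed for comparing shapes independently of amplitude.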

FIGS. 5A and 5B show the trend extracted from each slice. Here, the mean of each slice (thick dots) is used as the trend. Referring to FIG. 5B, in steps S302 and S303 a Gaussian process may be applied to estimate the trend of the missing slices (dotted line). If there are multiple parts because the shape of the slices is not constant, this step is performed separately for each part.

In step S304, at least one existing slice with the most similar trend value may be found. Here, using two existing slices, slice_a and slice_b, the missing slice can be expressed as:

[Equation image BDA0003806505850000092, not reproduced in this text]

where slice_a and slice_b are the two slices whose trend values are closest to that of slice_missing, and trend_a, trend_b and trend_missing are the trend values of the respective slices. The number of existing slices used to calculate a missing slice can be chosen according to the requirements of different applications.
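The formula that combines slice_a and slice_b is an image that did not survive extraction, so the sketch below does not reproduce it. It shows one plausible combination under a clearly labeled assumption: a weighted average of the two slices, with weights inversely related to the distance between each slice's trend value and trend_missing. The function name and the input format are hypothetical.

```python
def fill_from_two_nearest(observed, trend_missing):
    """Build a missing slice from the two observed slices closest in trend.

    `observed` is a list of (trend_value, slice_values) pairs with at
    least two entries. The distance-based weighting is an assumption,
    not the formula from the patent.
    """
    ranked = sorted(observed, key=lambda item: abs(item[0] - trend_missing))
    (trend_a, slice_a), (trend_b, slice_b) = ranked[0], ranked[1]
    dist_a = abs(trend_a - trend_missing)
    dist_b = abs(trend_b - trend_missing)
    if dist_a + dist_b == 0:
        w_a = 0.5  # both candidates match the target trend exactly
    else:
        w_a = dist_b / (dist_a + dist_b)  # the closer slice gets more weight
    return [w_a * va + (1 - w_a) * vb for va, vb in zip(slice_a, slice_b)]
```

With the slice mean as the trend value, this weighting has a convenient property: when trend_missing lies between trend_a and trend_b, the mean of the filled slice equals trend_missing.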

FIG. 6 compares the fill results of the present disclosure with those of other methods. In the middle of the figure, linear regression is used to fill in the missing data. The points on a slice are its dimensions; here each slice has 24 dimensions, each representing a specific hour of the day. The percentage on the right is the value of a point on the slice, in the range 0 to 1. Because the missing values are filled separately for each dimension of each slice, the filled slices no longer make sense. Techniques other than linear regression can also be applied here, but they have the same drawback. At the bottom of the figure, the missing slices are filled according to the present disclosure, which achieves a more reasonable result: a sharp increase can be seen on May 1, 2015, consistent with the trend for the same period in 2016 and 2017. In the middle of the figure, by contrast, the filled slices increase gradually, so this important change information is lost with such methods.
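For contrast, the per-dimension baseline criticized above can be sketched as follows. Linear interpolation between the nearest observed slices is used here as a simple stand-in for the linear regression shown in FIG. 6 (an assumption); the point is only that each of the 24 hourly dimensions is filled independently, so a gap is bridged by a gradual ramp. The function name is hypothetical.

```python
def interpolate_per_dimension(slices):
    """Fill None slices dimension by dimension from the nearest neighbours.

    Slices are ordered in time; a gap at the very start or end of the
    series has no neighbour on one side and is left as None.
    """
    known = [i for i, s in enumerate(slices) if s is not None]
    filled = list(slices)
    for i, s in enumerate(slices):
        if s is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if not before or not after:
            continue
        lo, hi = before[-1], after[0]
        w = (i - lo) / (hi - lo)
        # Every dimension (hour) is interpolated on its own.
        filled[i] = [(1 - w) * a + w * b for a, b in zip(slices[lo], slices[hi])]
    return filled
```

Every filled slice is a blend of the two slices at the edges of the gap, so an abrupt change inside the gap, such as the jump described for May 1, 2015, cannot be reproduced by this kind of method.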

以下是可采用本公开中提供的解决方案的2种使用情况。Below are 2 use cases where the solutions provided in this disclosure can be employed.

使用情况1:用于变压器的条件评估管理器Use Case 1: Condition Evaluation Manager for Transformers

在变压器处收集数据且将数据传送到数据管理系统中。在数据处理和分析之后,可将变压器的健康报告和负载-位移建议提供给消费者。归因于许多原因,所收集的数据将是不完整的,且数据处理和分析部分中需要缺失数据填充方法。Data is collected at the transformer and transmitted into a data management system. Following data processing and analysis, a transformer health report and load-displacement recommendations can be provided to consumers. The collected data will be incomplete due to many reasons and missing data imputation methods are required in the data processing and analysis section.

Use Case 2: Distributed Energy Systems

Various applications exist in the field of distributed energy systems, e.g., load balancing, peak avoidance, and theft prevention. All of these applications rely on continuous monitoring of the relevant devices, which has a low tolerance for missing data, making the filling method an essential part of data processing.

Also provided in the present disclosure is a computer-readable medium storing computer-executable instructions that, when executed by a computer, cause the computer to perform any of the methods presented in the present disclosure.

Also provided is a computer program which, when executed by at least one processor, performs any of the methods presented in the present disclosure.

Although the present technology has been described in detail with reference to certain embodiments, it is to be understood that the present technology is not limited to those precise embodiments. Indeed, in view of this disclosure, which describes exemplary modes for practicing the invention, many modifications and variations can be made by those skilled in the art without departing from the scope and spirit of the invention. Accordingly, the scope of the invention is indicated by the appended claims rather than by the foregoing description. All changes, modifications, and variations that come within the meaning and range of equivalency of the claims are to be considered within their scope.

Claims (10)

1. A method (300) for filling missing industrial longitudinal data, comprising:
- collecting (S301) industrial longitudinal data, wherein the industrial longitudinal data comprises missing slices, each slice corresponding to a collection time point;
- estimating (S302) an overall trend of all slices of the industrial longitudinal data over time;
- calculating (S303) a trend value for each missing slice based on the overall trend;
- for each missing slice, finding (S304) at least one similar slice based on the trend value;
- filling (S305) each missing slice based on the at least one similar slice.
2. The method (300) of claim 1, wherein:
- before estimating (S302) the overall trend of all slices of the industrial longitudinal data over time, the method further comprises: normalizing (S301') each slice of the industrial longitudinal data;
- estimating (S302) the overall trend of all slices of the industrial longitudinal data over time comprises:
  - determining (S3021) whether all normalized slices of the industrial longitudinal data have the same shape, and if not,
  - dividing (S3022) all slices of the industrial longitudinal data into portions, wherein the slices in each portion have the same shape;
  - for each portion having missing slices, estimating (S3023) an overall trend of the slices in the portion;
- calculating (S303) a trend value for each missing slice based on the overall trend comprises: for each portion having missing slices, calculating (S303) a trend value for each missing slice based on the overall trend of the portion;
- for each missing slice, finding (S304) at least one similar slice based on the trend value comprises: for each portion having missing slices, finding (S304) at least one similar slice for each missing slice in the portion based on the trend value;
- filling (S305) each missing slice based on the at least one similar slice comprises: for each portion having missing slices, filling (S305) each missing slice based on the at least one similar slice.
3. The method (300) of claim 2, wherein determining (S3021) whether all normalized slices of the industrial longitudinal data have the same shape comprises: determining (S3021) that the slices during a time period have the same shape if the shape differences between all normalized slices during the time period are smaller than a predefined threshold.
4. The method (300) of claim 1, wherein the trend value is a mean of the slices.
5. An apparatus (10) for filling missing industrial longitudinal data, comprising:
- a data collection module (101) configured to collect industrial longitudinal data, wherein the industrial longitudinal data comprises missing slices, each slice corresponding to a collection time point;
- a data processing module (102) configured to:
  - estimate an overall trend of all slices of the industrial longitudinal data over time;
  - calculate a trend value for each missing slice based on the overall trend;
  - for each missing slice, find at least one similar slice based on the trend value;
  - fill each missing slice based on the at least one similar slice.
6. The apparatus (10) of claim 5, wherein:
- the data processing module (102) is further configured to normalize each slice of the industrial longitudinal data before estimating the overall trend of all slices of the industrial longitudinal data over time;
- the data processing module (102) is further configured to, when estimating the overall trend of all slices of the industrial longitudinal data over time:
  - determine whether all normalized slices of the industrial longitudinal data have the same shape and, if not,
  - divide all slices of the industrial longitudinal data into portions, wherein the slices in each portion have the same shape;
  - for each portion having missing slices, estimate an overall trend of the slices in the portion;
- the data processing module (102) is further configured to, when calculating a trend value for each missing slice based on the overall trend: for each portion having missing slices, calculate a trend value for each missing slice based on the overall trend of the portion;
- the data processing module (102) is further configured to, when finding at least one similar slice for each missing slice based on the trend value: for each portion having missing slices, find at least one similar slice for each missing slice in the portion based on the trend value;
- the data processing module (102) is further configured to, when filling each missing slice based on the at least one similar slice: for each portion having missing slices, fill each missing slice based on the at least one similar slice.
7. The apparatus (10) of claim 6, wherein the data processing module (102) is further configured to, when determining whether all normalized slices of the industrial longitudinal data have the same shape: determine that the slices during a time period have the same shape if the shape differences between all normalized slices during the time period are smaller than a predefined threshold.
8. The apparatus (10) of claim 5, wherein the trend value is a mean of the slices.
9. An apparatus (10) for filling missing industrial longitudinal data, comprising:
-at least one processor (103);
-at least one memory (104), coupled to the at least one processor (103), configured to perform the method according to any one of claims 1 to 4.
10. A computer-readable medium for filling missing industrial longitudinal data, storing computer-executable instructions, wherein the computer-executable instructions, when executed, cause at least one processor to perform the method of any one of claims 1 to 4.
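The steps recited in method claims 1 to 4 can be illustrated with a short Python sketch. It is only one possible reading under stated assumptions, not the claimed implementation: min-max scaling is assumed for normalization, the largest point-wise gap is assumed as the shape difference, a least-squares line over the slice means is assumed as the overall trend, and the similar slices are simply averaged. All function names, the default `threshold`, and the default `k` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Slice = List[float]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def normalize(values: Sequence[float]) -> Slice:
    """Min-max scale one slice to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def shape_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest point-wise gap between two normalized slices (assumed metric)."""
    return max(abs(x - y) for x, y in zip(a, b))


def split_into_portions(slices: Sequence[Optional[Slice]], threshold: float) -> List[List[int]]:
    """Group consecutive time indices whose known slices share one shape."""
    portions: List[List[int]] = []
    shapes: List[Slice] = []  # normalized known slices of the open portion
    for t, s in enumerate(slices):
        if s is None:
            # A missing slice stays in whichever portion is currently open.
            if portions:
                portions[-1].append(t)
            else:
                portions.append([t])
            continue
        shape = normalize(s)
        if portions and all(shape_difference(shape, o) < threshold for o in shapes):
            portions[-1].append(t)
        else:
            portions.append([t])
            shapes = []
        shapes.append(shape)
    return portions


def fit_line(times: Sequence[float], values: Sequence[float]) -> Callable[[float], float]:
    """Least-squares line through (time, trend value) pairs (assumed trend model)."""
    mean_t, mean_v = mean(times), mean(values)
    var_t = sum((t - mean_t) ** 2 for t in times)
    slope = 0.0
    if var_t != 0:
        slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / var_t
    return lambda t: mean_v + slope * (t - mean_t)


def fill_missing_slices(
    slices: Sequence[Optional[Slice]], threshold: float = 0.1, k: int = 2
) -> List[Optional[Slice]]:
    """Fill each missing slice (None) from the k known slices with the closest trend value."""
    filled: List[Optional[Slice]] = [list(s) if s is not None else None for s in slices]
    for portion in split_into_portions(slices, threshold):
        known = [t for t in portion if slices[t] is not None]
        missing = [t for t in portion if slices[t] is None]
        if not known or not missing:
            continue
        trend_values = {t: mean(slices[t]) for t in known}  # claim 4: mean of the slice
        trend = fit_line(known, [trend_values[t] for t in known])
        for t in missing:
            target = trend(t)  # trend value of the missing slice
            similar = sorted(known, key=lambda u: abs(trend_values[u] - target))[:k]
            n_dims = len(slices[similar[0]])
            filled[t] = [mean([slices[u][d] for u in similar]) for d in range(n_dims)]
    return filled


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], None, [4.0, 5.0, 6.0],
            [6.0, 5.0, 4.0], None, [8.0, 7.0, 6.0]]
    for row in fill_missing_slices(data):
        print(row)
```

In the sample data the first four time points hold rising slices and the last three hold falling ones, so they are treated as two portions and each missing slice is filled only from slices of its own portion. A slice that remains `None` in the output belongs to a portion with no known slices.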
CN202080097170.XA 2020-02-21 2020-02-21 Method and apparatus for filling in missing industrial longitudinal data Pending CN115151900A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/076273 WO2021164028A1 (en) 2020-02-21 2020-02-21 Method and apparatus for filling missing industrial longitudinal data

Publications (1)

Publication Number Publication Date
CN115151900A true CN115151900A (en) 2022-10-04

Family

ID=77391429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080097170.XA Pending CN115151900A (en) 2020-02-21 2020-02-21 Method and apparatus for filling in missing industrial longitudinal data

Country Status (2)

Country Link
CN (1) CN115151900A (en)
WO (1) WO2021164028A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060026156A1 (en) * 2004-07-28 2006-02-02 Heather Zuleba Method for linking de-identified patients using encrypted and unencrypted demographic and healthcare information from multiple data sources
US20130226613A1 (en) * 2012-02-23 2013-08-29 Robert Bosch Gmbh System and Method for Estimation of Missing Data in a Multivariate Longitudinal Setup
WO2017044082A1 (en) * 2015-09-09 2017-03-16 Intel Corporation Separated application security management
CN109947812A (en) * 2018-07-09 2019-06-28 平安科技(深圳)有限公司 Consecutive miss value fill method, data analysis set-up, terminal and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8941652B1 (en) * 2012-05-23 2015-01-27 Google Inc. Incremental surface hole filling
CN106844781B (en) * 2017-03-10 2020-04-21 广州视源电子科技股份有限公司 Data processing method and device
US11775873B2 (en) * 2018-06-11 2023-10-03 Oracle International Corporation Missing value imputation technique to facilitate prognostic analysis of time-series sensor data
CN109460775B (en) * 2018-09-20 2020-09-11 国家计算机网络与信息安全管理中心 A data filling method and device based on information entropy

Also Published As

Publication number Publication date
WO2021164028A1 (en) 2021-08-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination