[go: up one dir, main page]

CN108140047B - Data processing apparatus and method and data container structure - Google Patents

Data processing apparatus and method and data container structure Download PDF

Info

Publication number
CN108140047B
CN108140047B CN201680059351.7A CN201680059351A CN108140047B CN 108140047 B CN108140047 B CN 108140047B CN 201680059351 A CN201680059351 A CN 201680059351A CN 108140047 B CN108140047 B CN 108140047B
Authority
CN
China
Prior art keywords
data
container structure
data container
elements
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201680059351.7A
Other languages
Chinese (zh)
Other versions
CN108140047A (en
Inventor
拉杜·图多兰
戈兹·布兰切
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN108140047A publication Critical patent/CN108140047A/en
Application granted granted Critical
Publication of CN108140047B publication Critical patent/CN108140047B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明涉及一种数据处理装置和方法以及一种对应数据容器结构。所述数据处理装置(300)用于处理数据流(301),所述数据流包括按时间顺序布置的多个数据元素(301a、b)。所述数据处理装置(300)包括处理器(303),用于基于所述数据流(301)产生多个数据容器结构(305a、b),其中每个数据容器结构(305a、b)包括按时间顺序的所述数据流(301)的所述多个数据元素(301a、b)的子集,其中所述处理器(303)进一步用于向每个数据容器结构(305a、b)提供元数据(307a、b),其中所述元数据(307a、b)界定每个数据容器结构(305a、b)相对于所述多个数据容器结构(305a、b)中的其它数据容器结构(305a、b)的所述时间顺序。

Figure 201680059351

The present invention relates to a data processing device and method and a corresponding data container structure. The data processing device (300) is adapted to process a data stream (301) comprising a plurality of data elements (301a,b) arranged in time sequence. The data processing apparatus (300) includes a processor (303) for generating a plurality of data container structures (305a,b) based on the data stream (301), wherein each data container structure (305a,b) includes a a chronological subset of the plurality of data elements (301a,b) of the data stream (301), wherein the processor (303) is further configured to provide elements to each data container structure (305a,b) data (307a,b), wherein the metadata (307a,b) defines each data container structure (305a,b) relative to other data container structures (305a) in the plurality of data container structures (305a,b) , the time sequence of b).

Figure 201680059351

Description

Data processing apparatus and method, and data container structure
Technical Field
The present invention relates generally to the field of data processing. More particularly, the invention relates to a data processing apparatus and method for processing a data stream and a corresponding data container structure.
Background
In today's information rich environment, rapidly processing large amounts of data can be challenging and very important. This data is typically provided in the form of a data stream, i.e., a continuous or semi-continuous data stream, where in many cases the data elements are generated in real time as events occur. For example, sensors used in radio-frequency identification (RFID) in tracking and access applications may provide streaming data about the location of the tracked object. Quickly responding to a particular signal in streaming data is often a key aspect of many applications. For example, a network monitoring system for detecting security threats requires detecting and reporting events represented in data streams collected by monitoring.
Conventionally, processing of streaming data is performed by first storing the data in a database. The database may then be queried to retrieve the data for further processing. Therefore, analyzing data in real time is very difficult because the number of database accesses is limited, and in particular, a flow having a high data rate is limited. To address this problem, several traditional software technologies have been redesigned, such as main memory database management systems.
In recent years, a technique called "complex event processing" or "event stream processing" has been developed. By means of these techniques, events embodied as meaningful patterns within a data stream can be detected as a result of the processing of the data stream. In this context, a new class of infrastructure in the form of STREAM processing engines has emerged, e.g., "Aurora," "STREAM," "telegraph cq," to support high-capacity, low-latency data STREAM processing applications in particular.
A conventional stream processing paradigm involves a set of operations applied across all elements/events of a data stream. In many cases, processing a data stream involves specific calculations that must be repeated in time based on continuously changing data elements or events of the data stream, e.g., calculating a running average. In other words, a new data element or event received as part of a data flow typically must be visually observed to be an earlier data flow in the data flow. Alternatively, in some application scenarios, each new data element triggers a duplicate computation based on all available data elements in the data stream, i.e., the early data element and the new data element.
This typically involves splitting the data stream into different windows, i.e. time intervals, to which the computation is applied. A window is a delimitation relative to a time or sequence defined by data elements or events of a data stream that contains a subset of the data elements or events. This concept is illustrated in fig. 1. Conventional stream processing engines use windows to apply processing functions (e.g., computational operations) to data streams or events. The same mechanism is used if the events are received in real time, or if the events are stored in a database and processed repeatedly while the events are replayed to mimic the flow behavior.
A particular aspect is that conventional windows are transferable. The window is formed for the purpose of applying the computing operation to a data element or event of the data stream, but is not saved after the computing operation is completed. Because the window corresponding to the processing of the stream will typically be discarded, any successive operations on the data stream will require the window to be regenerated and recalculated. This behavior occurs even for small updates of the window, e.g., temporal recalibration. In fact, this is a general operation, considering that for a plurality of analysis scenarios, the division of the flow into a plurality of windows is done with respect to the moment of analysis. As time progresses, the reference time of the system also advances and this requires readjustment of the window.
An exemplary scenario that may occur in a conventional stream processing engine is illustrated in fig. 2. For example, at time point X, data elements or events of a data stream may be grouped into windows having a size of 1 hour, starting from the current time and going back 6 months. Based on this grouping of data elements or events into windows, different statistical measures may be calculated, e.g. an average of the data elements or events of the windows. These statistical measures must be calculated again after a new data element or event arrives as part of the data stream at time point X + Δ, since the window must be regenerated from scratch for time point X + Δ.
In view of the above, there is a need for an improved data processing apparatus and method and an improved data container structure, which in particular allows to utilize the temporal structure of data elements in a data stream in an improved way.
Disclosure of Invention
It is an object of the invention to provide an improved data processing apparatus and method and an improved data container structure, which in particular allow to utilize the temporal structure of data elements in a data stream in an improved way.
The foregoing and other objects are achieved by the subject matter of the independent claims. Further embodiments are apparent from the dependent claims, the description and the drawings.
According to a first aspect, the invention relates to a data processing apparatus for processing a data stream, the data stream comprising a plurality of data elements arranged in a time sequence within the data stream. The data processing apparatus comprises a processor for generating a plurality of data container structures based on a data stream, wherein each data container structure comprises a subset of a plurality of data elements in the data stream in a chronological order. The processor is further configured to provide metadata to each data container structure, wherein the metadata defines a temporal order of each data container structure relative to other data container structures of the plurality of data container structures, i.e., a temporal order of each data container structure in the data stream.
Hence, an improved data processing apparatus is provided, which in particular allows to exploit the temporal structure of data elements in a data stream. Based on the metadata, time-based management of data container structures by the data processing apparatus is provided. The performance advantage provided is that computational cycles triggered by updates to reoccur streaming operations are saved. The amount of computation cycles saved varies depending on the particular operation and context, reaching more than 90% of the conventional options in some cases.
In a first possible implementation form of the data processing apparatus according to the first aspect as such, each data container structure comprises at least one data window comprising at least one data element of a subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of a subset of the plurality of data elements.
For example, providing this hierarchy of data container structures, data sub-containers, data windows, and/or data sub-windows allows different levels of time granularity to be utilized.
In a second possible implementation form of the data processing apparatus according to the first aspect as such or according to the first implementation form thereof, the metadata of each data container structure comprises a pointer to a data element of the subset of the plurality of data elements, the pointer being the first data element in the temporal order of the subset of the plurality of data elements. In an implementation form, the metadata of each data container structure may further include a data container structure identifier and/or a start time or end time of the data container structure.
In a third possible implementation form of the data processing apparatus according to the first aspect as such or the first or second implementation form thereof, the processor is further configured to provide to each data container structure statistical data associated with a subset of the plurality of data elements of the data container structure. In an implementation form, the statistical data may comprise at least one of: a count, a sum, a mean, a median, a variance, a standard deviation, and/or a coefficient of variation for a plurality of data elements. The statistical data may include a plurality of pairs, where each pair includes a statistical function and a value of the statistical function applied to a data element or window thereof within the data container structure.
In a fourth possible implementation form of the data processing apparatus according to the third implementation form of the first aspect, the metadata of each data container structure further comprises pointers to statistics associated with a subset of the plurality of data elements of the data container structure.
In a fifth possible implementation form of the data processing apparatus according to the first aspect as such or any of the first to fourth implementation forms thereof, the processor is further configured to adapt each data container structure by adding data elements of the data stream to a subset of data elements of the data container structure or by clearing data elements from a subset of data elements of the data container structure. In an implementation form, the processor is configured to utilize a deferred update scheme or an aggressive update scheme for the statistical data associated with the data elements of the data container structure when updating or adjusting the data container structure.
In a sixth possible implementation form of the data processing device according to the first aspect as such or any one of the first to fifth implementation forms thereof, the data processing device further comprises a memory for storing a plurality of data container structures.
According to a second aspect, the invention relates to a data container structure for storing a plurality of data elements arranged in a time sequence in a data stream. The data container structure includes a subset of the plurality of data elements in the data stream in a chronological order, and metadata defining the chronological order of the data container structure in the data stream, i.e., the data container structure is based on the chronological order of the data stream relative to other container structures.
In a first possible implementation form of the data container structure according to the second aspect as such, the data container structure comprises at least one data window comprising at least one data element of the subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of the subset of the plurality of data elements.
In a second possible implementation form of the data container structure according to the second aspect as such or the first implementation form thereof, the metadata of the data container structure comprises a pointer to a data element of the subset of the plurality of data elements, the pointer being the first data element in the temporal order of the subset of the plurality of data elements.
In a third possible implementation form of the data container structure according to the second aspect as such or the first or second implementation form thereof, the data container structure further comprises statistical data associated with a subset of the plurality of data elements of the data container structure.
In a fourth possible implementation form of the data container structure according to the third implementation form of the second aspect, the metadata of the data container structure comprises pointers to statistics associated with a subset of the plurality of data elements of the data container structure.
In a fifth possible implementation form of the data container structure according to the second aspect as such or any of the first to fourth implementation forms thereof, the data container structure is adapted to be adjustable by adding data elements of the data stream to a subset of data elements of the data container structure or by clearing data elements from a subset of data elements of the data container structure.
According to a third aspect, the invention relates to a data processing method for processing a data stream, the data stream comprising a plurality of data elements arranged in a temporal order. The data processing method comprises the following steps: generating a plurality of data container structures based on the data stream, wherein each data container structure comprises a subset of a plurality of data elements of the data stream in a chronological order; and providing metadata to each data container structure, wherein the metadata defines a temporal order of each data container structure relative to other data container structures of the plurality of data container structures.
The data processing method according to the third aspect of the present invention may be performed by the data processing apparatus according to the first aspect of the present invention. Further features of the data processing method according to the third aspect of the invention result directly from the functionality of the data processing device according to the first aspect of the invention and its different embodiments described above.
According to a fourth aspect, the invention relates to a computer program comprising program code for performing the data processing method according to the third aspect of the invention or any of its embodiments when run on a computer.
The present invention may be implemented in hardware and/or software.
Drawings
Other embodiments of the invention will be described with respect to the following drawings, in which:
FIG. 1 shows a schematic diagram illustrating aspects of a conventional stream processing engine;
FIG. 2 shows a schematic diagram illustrating aspects of a conventional stream processing engine;
FIG. 3 shows a schematic diagram illustrating a data processing apparatus for processing a data stream according to an embodiment;
FIG. 4 shows a schematic diagram illustrating a data processing method for processing a data stream according to an embodiment;
fig. 5 shows a schematic diagram illustrating a plurality of data container structures according to an embodiment provided by a data processing apparatus according to an embodiment;
fig. 6 shows a schematic diagram illustrating different aspects of a data container structure according to an embodiment provided by a data processing apparatus according to an embodiment;
fig. 7 shows a schematic diagram illustrating different aspects of a data container structure according to an embodiment provided by a data processing apparatus according to an embodiment;
FIG. 8 shows a schematic diagram illustrating different aspects of a data processing apparatus according to an embodiment and a plurality of data container structures according to an embodiment; and
fig. 9 shows a table illustrating the performance of the data processing apparatus and method according to the embodiment.
Detailed Description
Reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific aspects in which the invention may be practiced. It is to be understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, as the scope of the present invention is defined by the appended claims.
For example, it should be understood that the disclosure in connection with the described methods may equally apply to the corresponding apparatus or system for performing the methods, and vice versa. For example, if a particular method step is described, the corresponding apparatus may comprise means for performing the described method step, even if this means is not explicitly described or illustrated in the figures. Further, it is to be understood that features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
Fig. 3 shows a schematic diagram illustrating a data processing apparatus 300 for processing a data stream 301 according to an embodiment. The data stream 301 is composed of a plurality of data elements 301a, b arranged within the data stream 301 in a time sequence, e.g. due to the arrival times of the respective data elements 301a, b at the data processing apparatus 300.
The data processing apparatus 300 comprises a processor 301 for generating a plurality of data container structures 305a, b based on a data stream 301, wherein each data container structure 305a, b comprises a subset of a plurality of data elements 301a, b of the data stream 301 in a temporal order. A memory 309 may be provided as part of the data processing apparatus 300 for storing the plurality of data container structures 305a, b.
The processor 301 is further configured to provide metadata 307a, b to each data container structure 305a, b, wherein the metadata 307a, b defines a temporal order of each data container structure 305a, b relative to other data container structures 305a, b of the plurality of data container structures 305a, b, i.e., a temporal order of each data container structure 305a, b in the data stream 301. For example, in the embodiment shown in fig. 3, the metadata 307b of the data container structure 305b may define a chronological order of the data container structure 305b relative to the data container structure 305 a.
Thus, the data container structure 305a, b shown in fig. 3 is used to store the chronologically arranged data elements 301a, b in the data stream 301. Each data container structure 305a, b comprises a subset of the plurality of data elements 301a, b in the data stream 301 in a temporal order and metadata 307a, b defining the temporal order of the data container structures 305a, b in the data stream 301, i.e. the data container structures 305a, b are based on the temporal order of the data stream 301 relative to the other container structures 305a, b.
Fig. 4 shows a schematic illustration of a data processing method 400 for processing a data stream, such as the data stream 301 shown in fig. 3, according to an embodiment, wherein the data stream 301 comprises a plurality of data elements 301a, b arranged in a time sequence.
The data processing method 400 comprises a step 401 of generating a plurality of data container structures 305a, b based on a data stream 301, wherein each data container structure 305a, b comprises a subset of a plurality of data elements 301a, b in the data stream 301 in a temporal order, and a step 403 of providing metadata 307a, b to each data container structure 305a, b, wherein the metadata 307a, b defines the temporal order of each data container structure 305a, b with respect to other data container structures 305a, b of the plurality of data container structures 305a, b.
Further embodiments, embodiments and aspects of the data processing device 300, the data container structures 305a, b and the data processing method 400 will be described below. As has been described above, the present invention provides a data container structure, such as the data container structures 305a, b shown in fig. 3, that allows data elements or events of the data stream 301 to be organized into windows for storage. This structure encapsulates the concept of grouping data elements or events into a data container structure or window and preserving this partition when storing data. In an embodiment, the data processing means and the data container structure employ time-based references as internal references for delimiting, managing, accessing and navigating. In an embodiment, the organization and saving of events or data elements in a window is done in a hierarchical manner. That is, in an embodiment, the data container holding the window will consist of sub-containers corresponding to smaller granularity/partitioning of data. In embodiments, these granularities may be reduced to the level of events or data elements. In an embodiment, the data container structure may have a time boundary as a delimiter and a navigation reference. Additionally, in embodiments, the internal structure forming the data container structure may also have its own corresponding time reference. In an embodiment, the data processing apparatus 300 is configured to apply a plurality of operations on the data container structures 305a, b. The purpose is to save computation cycles when the event-to-window segmentation or statistics computed thereon need to be updated. By preserving the window partition of the stream, the window can only be rescaled according to a particular operation when an update needs to be applied, rather than being regenerated from the original data. Furthermore, the organization of the data in the window is flexible in the sense that the organization of the data in the window enables combining the window to obtain a new partition of data or re-adjusting the window time boundaries to delete, add, and insert new events or containers. To achieve this later goal, the data container structure holding the data may be extensible and allow internal changes. In addition to this, the data container structure may record predefined statistics about the windows that are linked to metadata that enables navigation of the corresponding container. In an embodiment, these statistics may be saved and updated as new updates are generated. This allows fast access to the common composition information, which is usually computed on the event stream. Several of the above aspects are described in further detail below.
In an embodiment of the data processing apparatus 300, each data container structure 305a, b may comprise at least one data window comprising at least one data element of a subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of a subset of the plurality of data elements. This embodiment is described in more detail below in the context of FIG. 5, with FIG. 5 illustrating a hierarchical structure of a plurality of data container structures 305a-c according to an embodiment.
The data container structures 305a-c shown in FIG. 5 may hold other containers (referred to herein as data sub-containers) or multiple events or data elements. Each upper level in the hierarchy will thus correspond to a window corresponding to a particular time size granularity. This structure encapsulates the temporal references and meta information that enable navigation through the data based on time.
Thus, in an embodiment, the metadata 307a, b of each data container structure 305a, b may include a pointer to a data element in the subset of the plurality of data elements, the pointer being the first data element in the temporal order of the subset of the plurality of data elements. In an implementation form, the metadata of each data container structure may further include a data container structure identifier and/or a start time or end time of the data container structure.
Thus, the data container structures 305a-c shown in FIG. 5 can hold any type of partition generated based on a stream, such as the data stream 301 shown in FIG. 3. In an embodiment, this metadata regarding windows, time-granular organization, and required navigation information is collected in metadata 307 a.
In addition to metadata 307a, information about certain statistics computed based on the window may also be stored for quick retrieval. Thus, in an embodiment, the processor 303 is further configured to provide each data container structure 305a-c with statistical data 311a associated with a subset of the plurality of data elements of the respective data container structure 305 a-c. For clarity, only the statistics 311a of the data container structure 305 are indicated in fig. 5.
In an embodiment, the statistical data 311a may include at least one of: a count, a sum, a mean, a median, a variance, a standard deviation, and/or a coefficient of variation for a plurality of data elements. The statistical data 311a may include a plurality of pairs, where each pair includes a statistical function and a value of the statistical function applied to a data element or window thereof within the data container structure.
In an embodiment, the metadata 307a of the data container structure 305a further includes pointers to statistics 311a associated with a subset of the plurality of data elements of the data container structure 305 a.
At each level in the hierarchy, control, expansion, and readjustment may be performed. Some of these operations are supported only by readjusting the metadata information without actually accessing the data itself. Furthermore, the independent objective is to be able to control the respective data container structure at any granularity, such that the data container structure can be reused when an update is needed, e.g. recalibration of a window, event, insertion/removal of a data element or window, etc. Further, having a mechanism for loosely linking windows may perform window-based operations, such as composition or decomposition of windows to obtain other time granularities.
Thus, in an embodiment, the processor 303 is further configured to adjust each data container structure 305a-c by adding data elements of the data stream 301 to a subset of data elements of the data container structure 305a-c or by purging data elements from a subset of data elements of the data container structure 305 a-c.
Fig. 6 illustrates metadata 307a of a data container structure 305a, according to an embodiment. As already described above, the main purpose of the metadata 307a is to preserve the temporal partitioning of the stream with respect to events, i.e. the information about the delimitation of each window itself. In addition, metadata 307a may contain references to containers holding corresponding data in the window and statistics associated therewith. To hold this information, each element in metadata 307a may be constructed as a record (i.e., an n-gram), where each field in the record holds a particular type of this information: window ID, start time, end time, pointer to container header, pointer to container end, pointer to statistics, and reference ID of package window. The organization of records in practice may take a number of forms, such as hash tables or trees.
Further, FIG. 6 illustrates exemplary statistics 311a associated with data elements of the data container structure 305 a. If the statistics are to be calculated in addition to for each window (or closed window), a list of pairs of function types and function values may be kept as part of the statistics 311 a. When a new statistical function is applied over a window of data flow, the type of statistical function and the resulting value may be recorded in the statistical data 311 a. An example of such a statistical function is: counts, sums, means, medians, variances, standard deviations, coefficients of variation, and the like. Each window (across the different granularity layers used) may have its own list of pairs. The set of all these lists of pairs may be organized, as in the case of metadata 307a that ranges in different forms from hash tables into trees or lists. The key point is that each item in the statistics 311a (i.e., the list of pairs) must be allowed to be referenced so that a pointer to the item can be saved in the metadata 307 a.
When an update (e.g., recalibration, new item, delete … …) is to be applied to the data container structure, two options are available for statistics 311 a: delayed updates or proactive updates. Updates to be used may be set when the structure is generated and may be updated during its lifecycle. The deferred evaluation would imply that when any operation is applied to the data container structure (or a data container structure), the dirty flag would be set to not update the flag of the corresponding statistic 311 a. When one of the functions recorded in the statistical data 311a is recalculated in the future, the record is updated and the dirty flag is set to clear. In the case of the proactive update option, each update applied to the data container structure or window will trigger a new calculation of all statistics. In this way, the correct statistical information about the windows in the data stream is immediately available in the future.
The data container structure is responsible for saving events in streams organized in windows. The data container structure will contain events between certain boundaries. Often the delimitation boundary refers to time. Where the data container structure can hold a window, the time boundary will refer to the time delimitation of the window. When the stream is split in multiple windows, each such window will be recorded into a different container, thereby preserving the time definition of the application. These containers may be linked to each other based on the same logical order that the input streams have. The time granularity used to define the window may be one time granularity used when the partition is completed or set when operations on the data container structure are applied, e.g., a combination of different data container structures or a readjustment of the window to a new time granularity.
In an embodiment, the data processing apparatus 300 is for providing time-based management of data. To this end, the start and end time delimitations of the data container structure are not only time references. Within the data container structure, there may be time references corresponding to finer granularity. This internal granularity is set when a structure is generated or when an operation is applied that changes the temporal organization in the structure. As already explained further above, in this case a layered structure is obtained. The containers may thus be internally organized on a sub-container basis, which divides the time corresponding to the time interval in which the containers are packaged. At the lowest granularity, a child container will only hold data elements or events. Fig. 7 shows a hierarchy of data container structures according to an embodiment with an exemplary hierarchical depth 3. However, larger hierarchies can be created using the same principles, based on the particular needs of the situation and the sequential operations applied.
Thus, in embodiments, a container may be formed as a collection of other interconnected containers (i.e., sub-containers) or as a collection of events or data elements. The top level data container structures themselves may be interconnected with each other, thereby capturing partitioning of the stream. Each data container structure may have a corresponding record in metadata 307a and optionally corresponding statistics 311 a. In embodiments, this is not the case for events or data elements, as this would result in a large number of metadata records. At any granularity, the data container structure supports expansion or compression only by moving time boundaries in the metadata 307a and by updating the links of the affected subset in the child container or event. From an implementation perspective, the above-described functionality of the data container structure may be obtained by a doubly linked list of fixed-size events or data elements with the temporal order of the elements.
As has been described above, it is a primary object of embodiments of the present invention to provide a mechanism for saving partitioning of streams in a structure with the aim of increasing performance. This is achieved by reusing existing calculations. To this end, multiple operations of the proposed format should be enabled to perform time-based operations on the resulting data container structure. In addition, performance improvements may also be obtained by enabling operations on the data container structure directly at the metadata level, e.g., retrieving pre-computed statistics. Finally, another advantage comes from the option for managing data container structures in the form of windows in isolated mode or across other similar structures. Each such functionality is exposed through a particular operation. An example of such an operation is shown in fig. 8. In the following, some exemplary operations from each category are described in the context of fig. 8.
As has been described above, embodiments of the present invention allow avoiding re-computing partitions of a data stream whenever an update is available. This update involves the insertion/addition, deletion, updating, addition of new events or data elements in the stream. These operations may be supported by generating the required adjustments at the level of the metadata and by performing events or window movements at the level of the data container structure. For example, at the level of metadata, a modification may imply an adjustment of the temporal boundaries of the data container structure and possibly a reference to the encapsulation container for some sub-containers. At the level of the data container structure, the operation may involve moving a portion of the data in the data container structure to an adjacent data container structure. As an optimization, fine-grained time-based access to events or data elements may be used in order to reduce the amount of data that is read and subsequently written on a storage backend of a retention data container structure, such as storage 309 shown in fig. 3.
Furthermore, embodiments of the present invention allow for time control of stored data. To this end, the data container structure may utilize a temporal reference that segments the data, which may be used for fine-grained navigation. However, the window of split data has a static time dimension. The data container structure according to the present invention supports operations for changing this time partition or grouping time into larger intervals. The operations for changing the time partition are generally used to divide the data container structure or window into sub-intervals. This would imply generating a child container into which the events would be grouped when the data elements would be moved. Although a similar mechanism may be used to reconstruct the data into a new time partition, it should be understood that this is equivalent to generating an entirely new data container structure corresponding to the new partition of the stream. Alternatively, when a window is involved into a larger interval of packets, this can be done by generating a new container in which existing windows will be grouped. This also implies that corresponding fields of metadata will be generated. This is an important operation because it allows to obtain a high-level view summary of the stream.
As mentioned previously, this statistical data allows to record mathematical functions and operators applied on the stream and to record the values of said mathematical functions and operators. This enables fast access to this pre-computed data. The operations to be enabled at the level of statistical operations are reading the value, searching for a pre-computed value of the function, recording/updating a pair of function type and result, and marking whether the result is valid (in particular in case of delayed evaluation).
The set of operations that are enabled is a function of applying one data container structure on another. This function includes cascading, combining, differentiating, comparing, etc. Depending on the particular function, the containers of the source structure will be copied into the resulting structure (which may be one of the source structures) and logically linked according to the function. This would also imply an update of the level of metadata and statistics.
To highlight some of the advantages provided by embodiments of the present invention, an exemplary scenario will be described below in the context of fig. 8. In this example, the corresponding average for the bank account is calculated every 5 minutes over each hourly interval for the last 6 months. This is one example of the processing required in fraud prevention analysis (for credit card payments, prevention of bank money laundering, etc.). The value calculated for each window is used as a ground reference for validating the new transaction. This analysis is common across multiple business domains, which can therefore be optimized by this solution that optimizes basic processing operations.
Conventionally, the above situation will be handled in the following manner. All events are stored in a table. When a new computation is triggered, the stream processing engine will start to read each event starting with the last one. All 1 hour windows will be recalculated upon reading the event. The average value in each window is recalculated. The overhead of these operations is, where N is the number of events: o (n) read from disk operation, o (n) "+" operation, and 4392(6 x 30.5 x 24) "/" operation.
According to an embodiment of the present invention, the above situation can be handled in the following manner. All events are stored and formatted with an internal time reference of 5 minutes relative to a data container structure according to an embodiment, which has a 1 hour window container. The metadata for each window container is read and the cleanup and add operations are completed within a series of 5 minute schedules for each container. The overhead is: the number of reads from disk is reduced to 8% (only the 1/125 minute schedule in each 1 hour window is read) and then written (depending on the implementation, if all data is saved in the same file compared to reads and writes can be completely cleared and replaced only by updates in the metadata table). The number of "+" operations (which can be considered to be 8% of all data) is reduced according to the number of events in the boundary 5 minute window.
For example, embodiments of the present invention provide the following advantages. An apparatus and method are provided for implementing a temporal concept of stored data derived from a data stream. A data container structure for stream partitioning in a data window is provided that allows computational loop optimization of streams to be performed through memory. A data container structure is provided for efficiently handling discrete updates on a stream. A data container structure for window-based control is provided. A data container structure supporting a hierarchical format for organizing stream events is provided. A data container structure is provided for supporting event functions and statistics.
While a particular feature or aspect of the invention may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes," "has," "having," or any other variation thereof, are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted. Moreover, the terms "exemplary," "by way of example," and "such as" are merely meant as examples, rather than the best or optimal. The terms "coupled" and "connected," along with their derivatives, may be used. It should be understood that these terms may be used to indicate that two elements co-operate or interact with each other, whether in direct physical or electrical contact, or not in direct contact with each other.
Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.
Although the elements of the above claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily limited to being implemented in that particular sequence.
Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art will readily recognize that there are numerous other applications of the present invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art will recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims (8)

1.一种用于处理包括按时间顺序布置的多个数据元素(301a、b)的数据流(301)的数据处理装置(300),其特征在于,所述数据处理装置(300)包括:1. A data processing device (300) for processing a data stream (301) comprising a plurality of data elements (301a, b) arranged in time sequence, characterized in that the data processing device (300) comprises: 处理器(303),用于基于所述数据流(301)产生多个数据容器结构(305a、b),其中每个数据容器结构(305a、b)包括按时间顺序的所述数据流(301)中的所述多个数据元素(301a、b)的子集,a processor (303) for generating a plurality of data container structures (305a,b) based on the data stream (301), wherein each data container structure (305a,b) comprises a chronological sequence of the data stream (301) a subset of the plurality of data elements (301a,b) in ), 其中所述处理器(303)进一步用于向每个数据容器结构(305a、b)提供元数据(307a、b),其中所述元数据(307a、b)界定每个数据容器结构(305a、b)相对于所述多个数据容器结构(305a、b)中的其它数据容器结构(305a、b)的所述时间顺序;wherein the processor (303) is further configured to provide metadata (307a,b) to each data container structure (305a,b), wherein the metadata (307a,b) defines each data container structure (305a,b) b) said temporal order relative to other data container structures (305a,b) in said plurality of data container structures (305a,b); 每个数据容器结构(305a、b)包括至少一个数据子容器,所述至少一个数据子容器包括至少两个数据子窗口,其中每个数据子窗口包括所述多个数据元素的所述子集中的至少一个数据元素。Each data container structure (305a, b) includes at least one data subcontainer including at least two data subwindows, wherein each data subwindow includes the subset of the plurality of data elements at least one data element of . 2.根据权利要求1所述的数据处理装置(300),其特征在于,每个数据容器结构(305a、b)包括至少一个数据窗口和至少一个数据子容器,所述至少一个数据窗口包括所述多个数据元素的所述子集中的至少一个数据元素。2. The data processing apparatus (300) according to claim 1, wherein each data container structure (305a, b) comprises at least one data window and at least one data subcontainer, the at least one data window comprising all at least one data element in the subset of the plurality of data elements. 3.根据权利要求1或2所述的数据处理装置(300),其特征在于,每个数据容器结构(305a、b)的所述元数据(307a、b)包括所述多个数据元素(301a、b)的所述子集中的所述数据元素的指针,所述指针是所述多个数据元素(301a、b)的所述子集的所述时间顺序中的第一数据元素。3. The data processing apparatus (300) according to claim 1 or 2, characterized in that the metadata (307a,b) of each data container structure (305a,b) comprises the plurality of data elements ( 301a,b) a pointer to said data element in said subset of said plurality of data elements (301a,b), said pointer being a first data element in said temporal order of said subset of said plurality of data elements (301a,b). 4.根据权利要求1或2所述的数据处理装置(300),其特征在于,所述处理器(303)进一步用于向每个数据容器结构(305a、b)提供与所述数据容器结构(305a、b)的所述多个数据元素(301a、b)的所述子集相关联的统计数据。4. The data processing apparatus (300) according to claim 1 or 2, characterized in that the processor (303) is further configured to provide each data container structure (305a, b) with the data container structure Statistics associated with the subset of the plurality of data elements (301a,b) of (305a,b). 5.根据权利要求4所述的数据处理装置(300),其特征在于,每个数据容器结构(305a、b)的所述元数据(307a、b)包括与所述数据容器结构(305a、b)的所述多个数据元素(301a、b)的所述子集相关联的所述统计数据的指针。5. The data processing apparatus (300) according to claim 4, characterized in that the metadata (307a, b) of each data container structure (305a, b) comprises the same data as the data container structure (305a, b) A pointer to said statistical data associated with said subset of said plurality of data elements (301a, b) of b). 6.根据权利要求1或2所述的数据处理装置(300),其特征在于,所述处理器(303)进一步用于通过将所述数据流(301)的数据元素添加到所述数据容器结构(305a、b)的数据元素的所述子集或通过从所述数据容器结构(305a、b)的数据元素的所述子集清除数据元素来调整每个数据容器结构(305a、b)。6. The data processing apparatus (300) according to claim 1 or 2, characterized in that the processor (303) is further configured to add data elements of the data stream (301) to the data container by adding The subset of data elements of the structure (305a,b) or each data container structure (305a,b) is adjusted by clearing data elements from the subset of data elements of the data container structure (305a,b) . 7.根据权利要求1或2所述的数据处理装置(300),其特征在于,所述数据处理装置(300)进一步包括用于存储所述多个数据容器结构(305a、b)的存储器(309)。7. The data processing apparatus (300) according to claim 1 or 2, characterized in that the data processing apparatus (300) further comprises a memory (300) for storing the plurality of data container structures (305a, b) 309). 8.一种用于处理包括按时间顺序布置的多个数据元素(301a、b)的数据流(301)的数据处理方法(400),其特征在于,所述数据处理方法(400)包括:8. A data processing method (400) for processing a data stream (301) comprising a plurality of data elements (301a, b) arranged in time sequence, characterized in that the data processing method (400) comprises: 基于所述数据流(301)产生(401)多个数据容器结构(305a、b),其中每个数据容器结构(305a、b)包括按时间顺序的所述数据流(301)的所述多个数据元素的子集;以及A plurality of data container structures (305a,b) are generated (401) based on the data stream (301), wherein each data container structure (305a,b) includes the plurality of data stream (301) in chronological order. a subset of data elements; and 向每个数据容器结构(305a、b)提供(403)元数据(307a、b),其中所述元数据(307a、b)界定每个数据容器结构(305a、b)相对于所述多个数据容器结构(305a、b)中的其它数据容器结构(305a、b)的所述时间顺序;Each data container structure (305a,b) is provided (403) with metadata (307a,b), wherein the metadata (307a,b) defines each data container structure (305a,b) relative to the plurality of the chronological order of the other data container structures (305a,b) in the data container structures (305a,b); 其中,所述数据容器结构(305a、b)包括至少一个数据子容器,所述至少一个数据子容器包括至少两个数据子窗口,其中每个数据子窗口包括所述多个数据元素(301a、b)的所述子集中的至少一个数据元素。Wherein, the data container structure (305a, b) includes at least one data subcontainer, the at least one data subcontainer includes at least two data subwindows, wherein each data subwindow includes the plurality of data elements (301a, at least one data element in said subset of b).
CN201680059351.7A 2016-01-05 2016-01-05 Data processing apparatus and method and data container structure Active CN108140047B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2016/050053 WO2017118474A1 (en) 2016-01-05 2016-01-05 A data processing apparatus and method and a data container structure

Publications (2)

Publication Number Publication Date
CN108140047A CN108140047A (en) 2018-06-08
CN108140047B true CN108140047B (en) 2021-06-29

Family

ID=55080111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680059351.7A Active CN108140047B (en) 2016-01-05 2016-01-05 Data processing apparatus and method and data container structure

Country Status (2)

Country Link
CN (1) CN108140047B (en)
WO (1) WO2017118474A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3789864A1 (en) * 2019-09-06 2021-03-10 dspace digital signal processing and control engineering GmbH Method for testing a control software of a control device
US11436033B2 (en) * 2019-10-11 2022-09-06 International Business Machines Corporation Scalable virtual memory metadata management
US11599545B2 (en) * 2020-02-19 2023-03-07 EMC IP Holding Company LLC Stream retention in a data storage system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7383253B1 (en) * 2004-12-17 2008-06-03 Coral 8, Inc. Publish and subscribe capable continuous query processor for real-time data streams
CN102456065A (en) * 2011-07-01 2012-05-16 中国人民解放军国防科学技术大学 Offline historical statistical data storage and query method for data streams
CN102456069A (en) * 2011-08-03 2012-05-16 中国人民解放军国防科学技术大学 Data stream increment aggregation statistics and query method and query system
CN103916478A (en) * 2014-04-11 2014-07-09 华为技术有限公司 Streaming data cube establishing method and device based on distributed system
CN104090952A (en) * 2014-07-02 2014-10-08 华中科技大学 Method and system for estimating average value of data flow under sliding window

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU5624700A (en) * 1999-06-18 2001-01-09 Torrent Systems, Inc. Segmentation and processing of continuous data streams using transactional semantics
US6792433B2 (en) * 2000-04-07 2004-09-14 Avid Technology, Inc. Indexing interleaved media data
JP4687253B2 (en) * 2005-06-03 2011-05-25 株式会社日立製作所 Query processing method for stream data processing system
US7512700B2 (en) * 2005-09-30 2009-03-31 International Business Machines Corporation Real-time mining and reduction of streamed data
US8768895B2 (en) * 2007-04-11 2014-07-01 Emc Corporation Subsegmenting for efficient storage, resemblance determination, and transmission
US7945540B2 (en) * 2007-05-04 2011-05-17 Oracle International Corporation Method to create a partition-by time/tuple-based window in an event processing service
CA2695645C (en) * 2007-08-20 2017-05-23 Nokia Corporation Segmented metadata and indexes for streamed multimedia data
US8195648B2 (en) * 2009-10-21 2012-06-05 Microsoft Corporation Partitioned query execution in event processing systems
WO2011070552A1 (en) * 2009-12-11 2011-06-16 Nokia Corporation Apparatus and methods for describing and timing representations in streaming media files
US8983952B1 (en) * 2010-07-29 2015-03-17 Symantec Corporation System and method for partitioning backup data streams in a deduplication based storage system
US8949240B2 (en) * 2012-07-03 2015-02-03 General Instrument Corporation System for correlating metadata
US9390135B2 (en) * 2013-02-19 2016-07-12 Oracle International Corporation Executing continuous event processing (CEP) queries in parallel
US9087082B2 (en) * 2013-03-07 2015-07-21 International Business Machines Corporation Processing control in a streaming application
CN103218423B (en) * 2013-04-02 2016-09-07 中国科学院信息工程研究所 Data query method and device
US9244978B2 (en) * 2014-06-11 2016-01-26 Oracle International Corporation Custom partitioning of a data stream

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7383253B1 (en) * 2004-12-17 2008-06-03 Coral 8, Inc. Publish and subscribe capable continuous query processor for real-time data streams
CN102456065A (en) * 2011-07-01 2012-05-16 中国人民解放军国防科学技术大学 Offline historical statistical data storage and query method for data streams
CN102456069A (en) * 2011-08-03 2012-05-16 中国人民解放军国防科学技术大学 Data stream increment aggregation statistics and query method and query system
CN103916478A (en) * 2014-04-11 2014-07-09 华为技术有限公司 Streaming data cube establishing method and device based on distributed system
CN104090952A (en) * 2014-07-02 2014-10-08 华中科技大学 Method and system for estimating average value of data flow under sliding window

Also Published As

Publication number Publication date
CN108140047A (en) 2018-06-08
WO2017118474A1 (en) 2017-07-13

Similar Documents

Publication Publication Date Title
US9767174B2 (en) Efficient query processing using histograms in a columnar database
CN111737265B (en) Block data access method, block data storage method and device
CN111309720B (en) Time sequence data storage and reading method and device, electronic equipment and storage medium
US10417265B2 (en) High performance parallel indexing for forensics and electronic discovery
US10552460B2 (en) Sensor data management apparatus, sensor data management method, and computer program product
US20160062651A1 (en) Cache management for sequential write storage
CN104731816A (en) Method and device for processing abnormal business data
CN105868216B (en) A kind of method, apparatus and equipment for realizing the expired operation of object
CN113946552B (en) Data processing method and electronic device
CN108140047B (en) Data processing apparatus and method and data container structure
US7562199B2 (en) Method, apparatus and program for management of access history, storage unit, and information processing apparatus
CN106570005A (en) Database cleaning method and device
US10025816B2 (en) Managing a data set
US9405786B2 (en) System and method for database flow management
CN111737266B (en) Block data access method, block data storage method and device
US10552059B2 (en) Data migration with placement based on access patterns
US11841865B2 (en) Database management system and associated methods
US20180081942A1 (en) Managing Data Obsolescence in Relational Databases
CN109255001A (en) Maintaining method and device, the electronic equipment in interface instance library
US20210133194A1 (en) Tag coexistence detection
US12113677B2 (en) Efficient transfer of collected discovery data
KR102139578B1 (en) Method for restoring data of database through analysis of disc block pattern
US20240264977A1 (en) Storing and searching for data in data stores
HK40041830B (en) Block data access method, block data storage method and apparatus
HK40041830A (en) Block data access method, block data storage method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220210

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technologies Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.