CN108140047B

CN108140047B - Data processing apparatus and method and data container structure

Info

Publication number: CN108140047B
Application number: CN201680059351.7A
Authority: CN
Inventors: 拉杜·图多兰; 戈兹·布兰切
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2016-01-05
Filing date: 2016-01-05
Publication date: 2021-06-29
Anticipated expiration: 2036-01-05
Also published as: CN108140047A; WO2017118474A1

Abstract

The present invention relates to a data processing device and method and a corresponding data container structure. The data processing device (300) is adapted to process a data stream (301) comprising a plurality of data elements (301a,b) arranged in time sequence. The data processing apparatus (300) includes a processor (303) for generating a plurality of data container structures (305a,b) based on the data stream (301), wherein each data container structure (305a,b) includes a a chronological subset of the plurality of data elements (301a,b) of the data stream (301), wherein the processor (303) is further configured to provide elements to each data container structure (305a,b) data (307a,b), wherein the metadata (307a,b) defines each data container structure (305a,b) relative to other data container structures (305a) in the plurality of data container structures (305a,b) , the time sequence of b).

Description

Data processing apparatus and method, and data container structure

Technical Field

The present invention relates generally to the field of data processing. More particularly, the invention relates to a data processing apparatus and method for processing a data stream and a corresponding data container structure.

Background

In today's information rich environment, rapidly processing large amounts of data can be challenging and very important. This data is typically provided in the form of a data stream, i.e., a continuous or semi-continuous data stream, where in many cases the data elements are generated in real time as events occur. For example, sensors used in radio-frequency identification (RFID) in tracking and access applications may provide streaming data about the location of the tracked object. Quickly responding to a particular signal in streaming data is often a key aspect of many applications. For example, a network monitoring system for detecting security threats requires detecting and reporting events represented in data streams collected by monitoring.

Conventionally, processing of streaming data is performed by first storing the data in a database. The database may then be queried to retrieve the data for further processing. Therefore, analyzing data in real time is very difficult because the number of database accesses is limited, and in particular, a flow having a high data rate is limited. To address this problem, several traditional software technologies have been redesigned, such as main memory database management systems.

In recent years, a technique called "complex event processing" or "event stream processing" has been developed. By means of these techniques, events embodied as meaningful patterns within a data stream can be detected as a result of the processing of the data stream. In this context, a new class of infrastructure in the form of STREAM processing engines has emerged, e.g., "Aurora," "STREAM," "telegraph cq," to support high-capacity, low-latency data STREAM processing applications in particular.

A conventional stream processing paradigm involves a set of operations applied across all elements/events of a data stream. In many cases, processing a data stream involves specific calculations that must be repeated in time based on continuously changing data elements or events of the data stream, e.g., calculating a running average. In other words, a new data element or event received as part of a data flow typically must be visually observed to be an earlier data flow in the data flow. Alternatively, in some application scenarios, each new data element triggers a duplicate computation based on all available data elements in the data stream, i.e., the early data element and the new data element.

This typically involves splitting the data stream into different windows, i.e. time intervals, to which the computation is applied. A window is a delimitation relative to a time or sequence defined by data elements or events of a data stream that contains a subset of the data elements or events. This concept is illustrated in fig. 1. Conventional stream processing engines use windows to apply processing functions (e.g., computational operations) to data streams or events. The same mechanism is used if the events are received in real time, or if the events are stored in a database and processed repeatedly while the events are replayed to mimic the flow behavior.

A particular aspect is that conventional windows are transferable. The window is formed for the purpose of applying the computing operation to a data element or event of the data stream, but is not saved after the computing operation is completed. Because the window corresponding to the processing of the stream will typically be discarded, any successive operations on the data stream will require the window to be regenerated and recalculated. This behavior occurs even for small updates of the window, e.g., temporal recalibration. In fact, this is a general operation, considering that for a plurality of analysis scenarios, the division of the flow into a plurality of windows is done with respect to the moment of analysis. As time progresses, the reference time of the system also advances and this requires readjustment of the window.

An exemplary scenario that may occur in a conventional stream processing engine is illustrated in fig. 2. For example, at time point X, data elements or events of a data stream may be grouped into windows having a size of 1 hour, starting from the current time and going back 6 months. Based on this grouping of data elements or events into windows, different statistical measures may be calculated, e.g. an average of the data elements or events of the windows. These statistical measures must be calculated again after a new data element or event arrives as part of the data stream at time point X + Δ, since the window must be regenerated from scratch for time point X + Δ.

In view of the above, there is a need for an improved data processing apparatus and method and an improved data container structure, which in particular allows to utilize the temporal structure of data elements in a data stream in an improved way.

Disclosure of Invention

It is an object of the invention to provide an improved data processing apparatus and method and an improved data container structure, which in particular allow to utilize the temporal structure of data elements in a data stream in an improved way.

The foregoing and other objects are achieved by the subject matter of the independent claims. Further embodiments are apparent from the dependent claims, the description and the drawings.

According to a first aspect, the invention relates to a data processing apparatus for processing a data stream, the data stream comprising a plurality of data elements arranged in a time sequence within the data stream. The data processing apparatus comprises a processor for generating a plurality of data container structures based on a data stream, wherein each data container structure comprises a subset of a plurality of data elements in the data stream in a chronological order. The processor is further configured to provide metadata to each data container structure, wherein the metadata defines a temporal order of each data container structure relative to other data container structures of the plurality of data container structures, i.e., a temporal order of each data container structure in the data stream.

Hence, an improved data processing apparatus is provided, which in particular allows to exploit the temporal structure of data elements in a data stream. Based on the metadata, time-based management of data container structures by the data processing apparatus is provided. The performance advantage provided is that computational cycles triggered by updates to reoccur streaming operations are saved. The amount of computation cycles saved varies depending on the particular operation and context, reaching more than 90% of the conventional options in some cases.

In a first possible implementation form of the data processing apparatus according to the first aspect as such, each data container structure comprises at least one data window comprising at least one data element of a subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of a subset of the plurality of data elements.

For example, providing this hierarchy of data container structures, data sub-containers, data windows, and/or data sub-windows allows different levels of time granularity to be utilized.

In a second possible implementation form of the data processing apparatus according to the first aspect as such or according to the first implementation form thereof, the metadata of each data container structure comprises a pointer to a data element of the subset of the plurality of data elements, the pointer being the first data element in the temporal order of the subset of the plurality of data elements. In an implementation form, the metadata of each data container structure may further include a data container structure identifier and/or a start time or end time of the data container structure.

In a third possible implementation form of the data processing apparatus according to the first aspect as such or the first or second implementation form thereof, the processor is further configured to provide to each data container structure statistical data associated with a subset of the plurality of data elements of the data container structure. In an implementation form, the statistical data may comprise at least one of: a count, a sum, a mean, a median, a variance, a standard deviation, and/or a coefficient of variation for a plurality of data elements. The statistical data may include a plurality of pairs, where each pair includes a statistical function and a value of the statistical function applied to a data element or window thereof within the data container structure.

In a fourth possible implementation form of the data processing apparatus according to the third implementation form of the first aspect, the metadata of each data container structure further comprises pointers to statistics associated with a subset of the plurality of data elements of the data container structure.

In a fifth possible implementation form of the data processing apparatus according to the first aspect as such or any of the first to fourth implementation forms thereof, the processor is further configured to adapt each data container structure by adding data elements of the data stream to a subset of data elements of the data container structure or by clearing data elements from a subset of data elements of the data container structure. In an implementation form, the processor is configured to utilize a deferred update scheme or an aggressive update scheme for the statistical data associated with the data elements of the data container structure when updating or adjusting the data container structure.

In a sixth possible implementation form of the data processing device according to the first aspect as such or any one of the first to fifth implementation forms thereof, the data processing device further comprises a memory for storing a plurality of data container structures.

According to a second aspect, the invention relates to a data container structure for storing a plurality of data elements arranged in a time sequence in a data stream. The data container structure includes a subset of the plurality of data elements in the data stream in a chronological order, and metadata defining the chronological order of the data container structure in the data stream, i.e., the data container structure is based on the chronological order of the data stream relative to other container structures.

In a first possible implementation form of the data container structure according to the second aspect as such, the data container structure comprises at least one data window comprising at least one data element of the subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of the subset of the plurality of data elements.

In a second possible implementation form of the data container structure according to the second aspect as such or the first implementation form thereof, the metadata of the data container structure comprises a pointer to a data element of the subset of the plurality of data elements, the pointer being the first data element in the temporal order of the subset of the plurality of data elements.

In a third possible implementation form of the data container structure according to the second aspect as such or the first or second implementation form thereof, the data container structure further comprises statistical data associated with a subset of the plurality of data elements of the data container structure.

In a fourth possible implementation form of the data container structure according to the third implementation form of the second aspect, the metadata of the data container structure comprises pointers to statistics associated with a subset of the plurality of data elements of the data container structure.

In a fifth possible implementation form of the data container structure according to the second aspect as such or any of the first to fourth implementation forms thereof, the data container structure is adapted to be adjustable by adding data elements of the data stream to a subset of data elements of the data container structure or by clearing data elements from a subset of data elements of the data container structure.

According to a third aspect, the invention relates to a data processing method for processing a data stream, the data stream comprising a plurality of data elements arranged in a temporal order. The data processing method comprises the following steps: generating a plurality of data container structures based on the data stream, wherein each data container structure comprises a subset of a plurality of data elements of the data stream in a chronological order; and providing metadata to each data container structure, wherein the metadata defines a temporal order of each data container structure relative to other data container structures of the plurality of data container structures.

The data processing method according to the third aspect of the present invention may be performed by the data processing apparatus according to the first aspect of the present invention. Further features of the data processing method according to the third aspect of the invention result directly from the functionality of the data processing device according to the first aspect of the invention and its different embodiments described above.

According to a fourth aspect, the invention relates to a computer program comprising program code for performing the data processing method according to the third aspect of the invention or any of its embodiments when run on a computer.

The present invention may be implemented in hardware and/or software.

Drawings

Other embodiments of the invention will be described with respect to the following drawings, in which:

FIG. 1 shows a schematic diagram illustrating aspects of a conventional stream processing engine;

FIG. 2 shows a schematic diagram illustrating aspects of a conventional stream processing engine;

FIG. 3 shows a schematic diagram illustrating a data processing apparatus for processing a data stream according to an embodiment;

FIG. 4 shows a schematic diagram illustrating a data processing method for processing a data stream according to an embodiment;

fig. 5 shows a schematic diagram illustrating a plurality of data container structures according to an embodiment provided by a data processing apparatus according to an embodiment;

fig. 6 shows a schematic diagram illustrating different aspects of a data container structure according to an embodiment provided by a data processing apparatus according to an embodiment;

fig. 7 shows a schematic diagram illustrating different aspects of a data container structure according to an embodiment provided by a data processing apparatus according to an embodiment;

FIG. 8 shows a schematic diagram illustrating different aspects of a data processing apparatus according to an embodiment and a plurality of data container structures according to an embodiment; and

fig. 9 shows a table illustrating the performance of the data processing apparatus and method according to the embodiment.

Detailed Description

Reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific aspects in which the invention may be practiced. It is to be understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, as the scope of the present invention is defined by the appended claims.

For example, it should be understood that the disclosure in connection with the described methods may equally apply to the corresponding apparatus or system for performing the methods, and vice versa. For example, if a particular method step is described, the corresponding apparatus may comprise means for performing the described method step, even if this means is not explicitly described or illustrated in the figures. Further, it is to be understood that features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.

Fig. 3 shows a schematic diagram illustrating a data processing apparatus 300 for processing a data stream 301 according to an embodiment. The data stream 301 is composed of a plurality of data elements 301a, b arranged within the data stream 301 in a time sequence, e.g. due to the arrival times of the respective data elements 301a, b at the data processing apparatus 300.

The data processing apparatus 300 comprises a processor 301 for generating a plurality of data container structures 305a, b based on a data stream 301, wherein each data container structure 305a, b comprises a subset of a plurality of data elements 301a, b of the data stream 301 in a temporal order. A memory 309 may be provided as part of the data processing apparatus 300 for storing the plurality of data container structures 305a, b.

The processor 301 is further configured to provide metadata 307a, b to each data container structure 305a, b, wherein the metadata 307a, b defines a temporal order of each data container structure 305a, b relative to other data container structures 305a, b of the plurality of data container structures 305a, b, i.e., a temporal order of each data container structure 305a, b in the data stream 301. For example, in the embodiment shown in fig. 3, the metadata 307b of the data container structure 305b may define a chronological order of the data container structure 305b relative to the data container structure 305 a.

Thus, the data container structure 305a, b shown in fig. 3 is used to store the chronologically arranged data elements 301a, b in the data stream 301. Each data container structure 305a, b comprises a subset of the plurality of data elements 301a, b in the data stream 301 in a temporal order and metadata 307a, b defining the temporal order of the data container structures 305a, b in the data stream 301, i.e. the data container structures 305a, b are based on the temporal order of the data stream 301 relative to the other container structures 305a, b.

Fig. 4 shows a schematic illustration of a data processing method 400 for processing a data stream, such as the data stream 301 shown in fig. 3, according to an embodiment, wherein the data stream 301 comprises a plurality of data elements 301a, b arranged in a time sequence.

The data processing method 400 comprises a step 401 of generating a plurality of data container structures 305a, b based on a data stream 301, wherein each data container structure 305a, b comprises a subset of a plurality of data elements 301a, b in the data stream 301 in a temporal order, and a step 403 of providing metadata 307a, b to each data container structure 305a, b, wherein the metadata 307a, b defines the temporal order of each data container structure 305a, b with respect to other data container structures 305a, b of the plurality of data container structures 305a, b.

Further embodiments, embodiments and aspects of the data processing device 300, the data container structures 305a, b and the data processing method 400 will be described below. As has been described above, the present invention provides a data container structure, such as the data container structures 305a, b shown in fig. 3, that allows data elements or events of the data stream 301 to be organized into windows for storage. This structure encapsulates the concept of grouping data elements or events into a data container structure or window and preserving this partition when storing data. In an embodiment, the data processing means and the data container structure employ time-based references as internal references for delimiting, managing, accessing and navigating. In an embodiment, the organization and saving of events or data elements in a window is done in a hierarchical manner. That is, in an embodiment, the data container holding the window will consist of sub-containers corresponding to smaller granularity/partitioning of data. In embodiments, these granularities may be reduced to the level of events or data elements. In an embodiment, the data container structure may have a time boundary as a delimiter and a navigation reference. Additionally, in embodiments, the internal structure forming the data container structure may also have its own corresponding time reference. In an embodiment, the data processing apparatus 300 is configured to apply a plurality of operations on the data container structures 305a, b. The purpose is to save computation cycles when the event-to-window segmentation or statistics computed thereon need to be updated. By preserving the window partition of the stream, the window can only be rescaled according to a particular operation when an update needs to be applied, rather than being regenerated from the original data. Furthermore, the organization of the data in the window is flexible in the sense that the organization of the data in the window enables combining the window to obtain a new partition of data or re-adjusting the window time boundaries to delete, add, and insert new events or containers. To achieve this later goal, the data container structure holding the data may be extensible and allow internal changes. In addition to this, the data container structure may record predefined statistics about the windows that are linked to metadata that enables navigation of the corresponding container. In an embodiment, these statistics may be saved and updated as new updates are generated. This allows fast access to the common composition information, which is usually computed on the event stream. Several of the above aspects are described in further detail below.

In an embodiment of the data processing apparatus 300, each data container structure 305a, b may comprise at least one data window comprising at least one data element of a subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of a subset of the plurality of data elements. This embodiment is described in more detail below in the context of FIG. 5, with FIG. 5 illustrating a hierarchical structure of a plurality of data container structures 305a-c according to an embodiment.

The data container structures 305a-c shown in FIG. 5 may hold other containers (referred to herein as data sub-containers) or multiple events or data elements. Each upper level in the hierarchy will thus correspond to a window corresponding to a particular time size granularity. This structure encapsulates the temporal references and meta information that enable navigation through the data based on time.

Thus, in an embodiment, the metadata 307a, b of each data container structure 305a, b may include a pointer to a data element in the subset of the plurality of data elements, the pointer being the first data element in the temporal order of the subset of the plurality of data elements. In an implementation form, the metadata of each data container structure may further include a data container structure identifier and/or a start time or end time of the data container structure.

Thus, the data container structures 305a-c shown in FIG. 5 can hold any type of partition generated based on a stream, such as the data stream 301 shown in FIG. 3. In an embodiment, this metadata regarding windows, time-granular organization, and required navigation information is collected in metadata 307 a.

In addition to metadata 307a, information about certain statistics computed based on the window may also be stored for quick retrieval. Thus, in an embodiment, the processor 303 is further configured to provide each data container structure 305a-c with statistical data 311a associated with a subset of the plurality of data elements of the respective data container structure 305 a-c. For clarity, only the statistics 311a of the data container structure 305 are indicated in fig. 5.

In an embodiment, the statistical data 311a may include at least one of: a count, a sum, a mean, a median, a variance, a standard deviation, and/or a coefficient of variation for a plurality of data elements. The statistical data 311a may include a plurality of pairs, where each pair includes a statistical function and a value of the statistical function applied to a data element or window thereof within the data container structure.

In an embodiment, the metadata 307a of the data container structure 305a further includes pointers to statistics 311a associated with a subset of the plurality of data elements of the data container structure 305 a.

At each level in the hierarchy, control, expansion, and readjustment may be performed. Some of these operations are supported only by readjusting the metadata information without actually accessing the data itself. Furthermore, the independent objective is to be able to control the respective data container structure at any granularity, such that the data container structure can be reused when an update is needed, e.g. recalibration of a window, event, insertion/removal of a data element or window, etc. Further, having a mechanism for loosely linking windows may perform window-based operations, such as composition or decomposition of windows to obtain other time granularities.

Thus, in an embodiment, the processor 303 is further configured to adjust each data container structure 305a-c by adding data elements of the data stream 301 to a subset of data elements of the data container structure 305a-c or by purging data elements from a subset of data elements of the data container structure 305 a-c.

Fig. 6 illustrates metadata 307a of a data container structure 305a, according to an embodiment. As already described above, the main purpose of the metadata 307a is to preserve the temporal partitioning of the stream with respect to events, i.e. the information about the delimitation of each window itself. In addition, metadata 307a may contain references to containers holding corresponding data in the window and statistics associated therewith. To hold this information, each element in metadata 307a may be constructed as a record (i.e., an n-gram), where each field in the record holds a particular type of this information: window ID, start time, end time, pointer to container header, pointer to container end, pointer to statistics, and reference ID of package window. The organization of records in practice may take a number of forms, such as hash tables or trees.

Further, FIG. 6 illustrates exemplary statistics 311a associated with data elements of the data container structure 305 a. If the statistics are to be calculated in addition to for each window (or closed window), a list of pairs of function types and function values may be kept as part of the statistics 311 a. When a new statistical function is applied over a window of data flow, the type of statistical function and the resulting value may be recorded in the statistical data 311 a. An example of such a statistical function is: counts, sums, means, medians, variances, standard deviations, coefficients of variation, and the like. Each window (across the different granularity layers used) may have its own list of pairs. The set of all these lists of pairs may be organized, as in the case of metadata 307a that ranges in different forms from hash tables into trees or lists. The key point is that each item in the statistics 311a (i.e., the list of pairs) must be allowed to be referenced so that a pointer to the item can be saved in the metadata 307 a.

When an update (e.g., recalibration, new item, delete … …) is to be applied to the data container structure, two options are available for statistics 311 a: delayed updates or proactive updates. Updates to be used may be set when the structure is generated and may be updated during its lifecycle. The deferred evaluation would imply that when any operation is applied to the data container structure (or a data container structure), the dirty flag would be set to not update the flag of the corresponding statistic 311 a. When one of the functions recorded in the statistical data 311a is recalculated in the future, the record is updated and the dirty flag is set to clear. In the case of the proactive update option, each update applied to the data container structure or window will trigger a new calculation of all statistics. In this way, the correct statistical information about the windows in the data stream is immediately available in the future.

The data container structure is responsible for saving events in streams organized in windows. The data container structure will contain events between certain boundaries. Often the delimitation boundary refers to time. Where the data container structure can hold a window, the time boundary will refer to the time delimitation of the window. When the stream is split in multiple windows, each such window will be recorded into a different container, thereby preserving the time definition of the application. These containers may be linked to each other based on the same logical order that the input streams have. The time granularity used to define the window may be one time granularity used when the partition is completed or set when operations on the data container structure are applied, e.g., a combination of different data container structures or a readjustment of the window to a new time granularity.

In an embodiment, the data processing apparatus 300 is for providing time-based management of data. To this end, the start and end time delimitations of the data container structure are not only time references. Within the data container structure, there may be time references corresponding to finer granularity. This internal granularity is set when a structure is generated or when an operation is applied that changes the temporal organization in the structure. As already explained further above, in this case a layered structure is obtained. The containers may thus be internally organized on a sub-container basis, which divides the time corresponding to the time interval in which the containers are packaged. At the lowest granularity, a child container will only hold data elements or events. Fig. 7 shows a hierarchy of data container structures according to an embodiment with an exemplary hierarchical depth 3. However, larger hierarchies can be created using the same principles, based on the particular needs of the situation and the sequential operations applied.

Thus, in embodiments, a container may be formed as a collection of other interconnected containers (i.e., sub-containers) or as a collection of events or data elements. The top level data container structures themselves may be interconnected with each other, thereby capturing partitioning of the stream. Each data container structure may have a corresponding record in metadata 307a and optionally corresponding statistics 311 a. In embodiments, this is not the case for events or data elements, as this would result in a large number of metadata records. At any granularity, the data container structure supports expansion or compression only by moving time boundaries in the metadata 307a and by updating the links of the affected subset in the child container or event. From an implementation perspective, the above-described functionality of the data container structure may be obtained by a doubly linked list of fixed-size events or data elements with the temporal order of the elements.

As has been described above, it is a primary object of embodiments of the present invention to provide a mechanism for saving partitioning of streams in a structure with the aim of increasing performance. This is achieved by reusing existing calculations. To this end, multiple operations of the proposed format should be enabled to perform time-based operations on the resulting data container structure. In addition, performance improvements may also be obtained by enabling operations on the data container structure directly at the metadata level, e.g., retrieving pre-computed statistics. Finally, another advantage comes from the option for managing data container structures in the form of windows in isolated mode or across other similar structures. Each such functionality is exposed through a particular operation. An example of such an operation is shown in fig. 8. In the following, some exemplary operations from each category are described in the context of fig. 8.

As has been described above, embodiments of the present invention allow avoiding re-computing partitions of a data stream whenever an update is available. This update involves the insertion/addition, deletion, updating, addition of new events or data elements in the stream. These operations may be supported by generating the required adjustments at the level of the metadata and by performing events or window movements at the level of the data container structure. For example, at the level of metadata, a modification may imply an adjustment of the temporal boundaries of the data container structure and possibly a reference to the encapsulation container for some sub-containers. At the level of the data container structure, the operation may involve moving a portion of the data in the data container structure to an adjacent data container structure. As an optimization, fine-grained time-based access to events or data elements may be used in order to reduce the amount of data that is read and subsequently written on a storage backend of a retention data container structure, such as storage 309 shown in fig. 3.

Furthermore, embodiments of the present invention allow for time control of stored data. To this end, the data container structure may utilize a temporal reference that segments the data, which may be used for fine-grained navigation. However, the window of split data has a static time dimension. The data container structure according to the present invention supports operations for changing this time partition or grouping time into larger intervals. The operations for changing the time partition are generally used to divide the data container structure or window into sub-intervals. This would imply generating a child container into which the events would be grouped when the data elements would be moved. Although a similar mechanism may be used to reconstruct the data into a new time partition, it should be understood that this is equivalent to generating an entirely new data container structure corresponding to the new partition of the stream. Alternatively, when a window is involved into a larger interval of packets, this can be done by generating a new container in which existing windows will be grouped. This also implies that corresponding fields of metadata will be generated. This is an important operation because it allows to obtain a high-level view summary of the stream.

As mentioned previously, this statistical data allows to record mathematical functions and operators applied on the stream and to record the values of said mathematical functions and operators. This enables fast access to this pre-computed data. The operations to be enabled at the level of statistical operations are reading the value, searching for a pre-computed value of the function, recording/updating a pair of function type and result, and marking whether the result is valid (in particular in case of delayed evaluation).

The set of operations that are enabled is a function of applying one data container structure on another. This function includes cascading, combining, differentiating, comparing, etc. Depending on the particular function, the containers of the source structure will be copied into the resulting structure (which may be one of the source structures) and logically linked according to the function. This would also imply an update of the level of metadata and statistics.

To highlight some of the advantages provided by embodiments of the present invention, an exemplary scenario will be described below in the context of fig. 8. In this example, the corresponding average for the bank account is calculated every 5 minutes over each hourly interval for the last 6 months. This is one example of the processing required in fraud prevention analysis (for credit card payments, prevention of bank money laundering, etc.). The value calculated for each window is used as a ground reference for validating the new transaction. This analysis is common across multiple business domains, which can therefore be optimized by this solution that optimizes basic processing operations.

Conventionally, the above situation will be handled in the following manner. All events are stored in a table. When a new computation is triggered, the stream processing engine will start to read each event starting with the last one. All 1 hour windows will be recalculated upon reading the event. The average value in each window is recalculated. The overhead of these operations is, where N is the number of events: o (n) read from disk operation, o (n) "+" operation, and 4392(6 x 30.5 x 24) "/" operation.

According to an embodiment of the present invention, the above situation can be handled in the following manner. All events are stored and formatted with an internal time reference of 5 minutes relative to a data container structure according to an embodiment, which has a 1 hour window container. The metadata for each window container is read and the cleanup and add operations are completed within a series of 5 minute schedules for each container. The overhead is: the number of reads from disk is reduced to 8% (only the 1/125 minute schedule in each 1 hour window is read) and then written (depending on the implementation, if all data is saved in the same file compared to reads and writes can be completely cleared and replaced only by updates in the metadata table). The number of "+" operations (which can be considered to be 8% of all data) is reduced according to the number of events in the boundary 5 minute window.

For example, embodiments of the present invention provide the following advantages. An apparatus and method are provided for implementing a temporal concept of stored data derived from a data stream. A data container structure for stream partitioning in a data window is provided that allows computational loop optimization of streams to be performed through memory. A data container structure is provided for efficiently handling discrete updates on a stream. A data container structure for window-based control is provided. A data container structure supporting a hierarchical format for organizing stream events is provided. A data container structure is provided for supporting event functions and statistics.

While a particular feature or aspect of the invention may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes," "has," "having," or any other variation thereof, are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted. Moreover, the terms "exemplary," "by way of example," and "such as" are merely meant as examples, rather than the best or optimal. The terms "coupled" and "connected," along with their derivatives, may be used. It should be understood that these terms may be used to indicate that two elements co-operate or interact with each other, whether in direct physical or electrical contact, or not in direct contact with each other.

Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.

Although the elements of the above claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily limited to being implemented in that particular sequence.

Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art will readily recognize that there are numerous other applications of the present invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art will recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

1. A data processing device (300) for processing a data stream (301) comprising a plurality of data elements (301a, b) arranged in time sequence, characterized in that the data processing device (300) comprises:

a processor (303) for generating a plurality of data container structures (305a,b) based on the data stream (301), wherein each data container structure (305a,b) comprises a chronological sequence of the data stream (301) a subset of the plurality of data elements (301a,b) in ),

wherein the processor (303) is further configured to provide metadata (307a,b) to each data container structure (305a,b), wherein the metadata (307a,b) defines each data container structure (305a,b) b) said temporal order relative to other data container structures (305a,b) in said plurality of data container structures (305a,b);

Each data container structure (305a, b) includes at least one data subcontainer including at least two data subwindows, wherein each data subwindow includes the subset of the plurality of data elements at least one data element of .

2. The data processing apparatus (300) according to claim 1, wherein each data container structure (305a, b) comprises at least one data window and at least one data subcontainer, the at least one data window comprising all at least one data element in the subset of the plurality of data elements.

3. The data processing apparatus (300) according to claim 1 or 2, characterized in that the metadata (307a,b) of each data container structure (305a,b) comprises the plurality of data elements ( 301a,b) a pointer to said data element in said subset of said plurality of data elements (301a,b), said pointer being a first data element in said temporal order of said subset of said plurality of data elements (301a,b).

4. The data processing apparatus (300) according to claim 1 or 2, characterized in that the processor (303) is further configured to provide each data container structure (305a, b) with the data container structure Statistics associated with the subset of the plurality of data elements (301a,b) of (305a,b).

5. The data processing apparatus (300) according to claim 4, characterized in that the metadata (307a, b) of each data container structure (305a, b) comprises the same data as the data container structure (305a, b) A pointer to said statistical data associated with said subset of said plurality of data elements (301a, b) of b).

6. The data processing apparatus (300) according to claim 1 or 2, characterized in that the processor (303) is further configured to add data elements of the data stream (301) to the data container by adding The subset of data elements of the structure (305a,b) or each data container structure (305a,b) is adjusted by clearing data elements from the subset of data elements of the data container structure (305a,b) .

7. The data processing apparatus (300) according to claim 1 or 2, characterized in that the data processing apparatus (300) further comprises a memory (300) for storing the plurality of data container structures (305a, b) 309).

8. A data processing method (400) for processing a data stream (301) comprising a plurality of data elements (301a, b) arranged in time sequence, characterized in that the data processing method (400) comprises:

A plurality of data container structures (305a,b) are generated (401) based on the data stream (301), wherein each data container structure (305a,b) includes the plurality of data stream (301) in chronological order. a subset of data elements; and

Each data container structure (305a,b) is provided (403) with metadata (307a,b), wherein the metadata (307a,b) defines each data container structure (305a,b) relative to the plurality of the chronological order of the other data container structures (305a,b) in the data container structures (305a,b);

Wherein, the data container structure (305a, b) includes at least one data subcontainer, the at least one data subcontainer includes at least two data subwindows, wherein each data subwindow includes the plurality of data elements (301a, at least one data element in said subset of b).