DESCRIPTION
"DATA WAREHOUSE DISTRIBUTED SYSTEM AND ARCHITECTURE TO SUPPORT DISTRIBUTED QUERY EXECUTION"
Field of the invention

This invention relates to data warehousing and decision support systems. In particular, the invention relates to a method to distribute a data warehouse over a large number of low-cost computers, assuring near linear speed up and scale up.
The invention provides the necessary tools to load and distribute the data in the cluster, to execute queries in the system, and manage the system.
Background of the invention

Data warehousing has seen explosive growth in recent years. In fact, today's highly competitive market drives organizations to acquire information systems dedicated to helping business managers make faster and more effective decisions. The information stored in these systems - generically called data warehouses - is historical in nature and can include detailed transactional data from operational databases, legacy systems, spreadsheets, or even external data sources. All this data must be integrated and reconciled before it is stored in the data warehouse for decision support purposes.
Data warehouses range from comprehensive enterprise-wide data warehouses to subject- or application-oriented data marts. Regardless of the data size, data warehouses must have efficient Online Analytical Processing (OLAP) tools to explore the data and to provide the users with real insight into the data in the warehouse.
In logical terms, a data warehouse is organized according to the multidimensional model. Each dimension of this model represents a different perspective for the analysis of the business. For instance, in the classical example of a chain-of-stores business, some of the dimensions are products, stores, and time. Each cell within the multidimensional structure (a cube in this simple three-dimensional example) contains data (typically numerical facts) along each of the dimensions. For example, a single cell may contain the total sales for a given product in a given store in a single day.
Although the most flexible way to store the data in a data warehouse could be a multidimensional database server, most of the data warehouses and OLAP applications store the data in a relational database. That is, the multidimensional model is implemented as a star schema formed by a large central fact table surrounded by several dimensional tables related to the fact table by foreign keys.
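By way of illustration, the star schema just described may be sketched as a relational design: a central fact table related to the dimension tables by foreign keys. The table and column names below are hypothetical, chosen only to match the chain-of-stores example; they are not part of the invention.

```python
import sqlite3

# Illustrative star schema for the chain-of-stores example: a central
# fact table (fact_sales) referencing the product, store, and time
# dimension tables by foreign keys. All names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, day  TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    amount     REAL   -- the numerical fact: total sales for one cell
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

A single cell of the multidimensional cube corresponds here to one row of `fact_sales` identified by its three foreign keys.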
One well-known problem of data warehouses is that they tend to be extremely large, which causes heavy storage and performance problems. Data warehouses also tend to grow quite rapidly. Thus, a scalable architecture is crucial in a warehouse environment, not only to handle very large amounts of data but also to assure interactive response times to OLAP users.
Typical warehouse queries are very complex and ad hoc in nature; they generally access huge volumes of warehouse data and perform many joins and aggregations. Additionally, the decision-making process using OLAP is often based on a sequence of interactive queries. That is, the answer to one query immediately sets the need for a second query, the answer to this second query raises another query, and so on in an ad hoc manner. Thus, efficient query processing is a critical requirement to cope with the usually large amount of data involved and to assure interactive response time.
As can be seen in FIG. 1 - Single Server Architecture, prior art data warehousing systems were comprised of a single server (1d) with a storage system (1c) that receives queries sent by the clients (1a) through a connection (1b) to the server. The connection was established using a driver that could be ODBC (open database connectivity) or any other driver that provides access to the database.
This architecture has undergone several evolutions and variations over time to increase processing capability. SMP (symmetric multiprocessing) is one such evolution, in which processors are added to the system, sharing the same bus, memory, and I/O. This architecture is also known as Shared-Everything or Shared-Memory. Communication between processors is performed through shared memory, and so scalability was thought to be very high. However, all processors access all data, and the amount of data that crosses the bus creates a bottleneck that severely limits the scalability of these systems.
MPP (massively parallel processing), as can be seen in FIG. 2 - Shared Disk Architecture, introduced a large server (2a) comprised of several nodes (2b), each one with its own memory, bus, I/O, and optionally its own disks. In the illustration presented in FIG. 2 the storage (1c) is separated and the system is called Shared Disk. MPP removed the bus-sharing bottleneck present in SMP, enabling the system to scale to several hundreds of processors. Due to the lack of shared memory, the major issue of these architectures is the communication between nodes. The performance and scalability of these systems are directly linked to the amount of data that must be exchanged between nodes through the network (2c). In order to achieve good performance, an efficient distribution of data must be achieved together with an efficient query execution method. The methods proposed to distribute the data are general purpose and do not consider the special characteristics of data warehouses, and so are not optimized for data warehouse systems. In spite of the queries being executed in parallel, the high overhead introduced by the communication through the network leads to poor speed up and scale up. In practice, each query does not take advantage of the entire distributed system.
The prior art architectures described up to this point are able to manage high volumes of data. However, they require very expensive hardware and software. Furthermore, the prior art architectures have two additional major problems: the lack of optimal data partitioning methods and the lack of efficient query execution methods able to take advantage of the processing power of the distributed system.
Regarding the partitioning method, there are several approaches. Range partitioning maps data into partitions based on ranges of partition keys. It can be used, for example, to partition a table into monthly partitions, each partition being assigned to one node. It is commonly used in data warehouse environments, with good performance results when the queries access only a small number of partitions, which is rarely the case. Hash partitioning maps data into partitions based on a hash function applied to partition keys. It can be used to distribute the data through the partitions without a semantic meaning. This is typically used to increase I/O performance, because it is possible to read from all partitions at the same time. However, as hash partitioning does not assume any prior knowledge of the most frequent queries, it cannot take real advantage of local partition indexing, as is the case with range partitioning.
Regarding the execution method, there are two kinds of parallelism: inter-query and intra-query. Inter-query parallelism means that several queries are executed at the same time, and it is widely used in commercially available database management systems. Intra-query parallelism means that each single query is partitioned and executed in parallel by more than one node. Intra-query parallelism is not widely used, and few database management systems incorporate some degree of it. The query execution is divided into several steps (e.g. table scan, merge, sort, join, etc.) and, when possible, more than one of these operations is executed in parallel. This kind of parallelism is not specifically conceived for distributed systems and does not take full advantage of them, because only a few nodes, and not the entire processing power available, are used for each query.
Prior art patent US 6,178,418 describes an architecture and method of operation for servicing data queries within a distributed data warehouse system composed of a plurality of physical servers housing the data to be accessed. Queries are generated by data accessing and processing applications executing on client computers within the distributed data warehouse system. The method enables queried data to be requested by, and delivered to, the respective client computers without having to pass through a middle-tier server. The method further provides a manageable and scalable system architecture which readily supports a variety of physical servers, including legacy databases, and query format translation when required. Although it scales in the number of physical servers, the inventors do not discuss performance considerations in using the system for data warehouses, which, as already discussed, can grow to hundreds of gigabytes of information, representing serious performance problems for current data warehouse systems and architectures.
Prior art patent US 6,438,538 provides a method, apparatus, and program storage device for optimizing a query, including aggregate and grouping functions, in a relational database management system. An application table is preferably located in a source site and an aggregation operation is performed from a target site. After an initial aggregation operation performed from the source-based application table, further aggregation operations are performed incrementally, by aggregating only the newly input relational database data into the target-based base aggregates table. This procedure allows the transformed query to perform more efficiently than the original query, while providing the same results.
Prior art patent US 6,418,450 describes a method, apparatus, and article of manufacture for a data warehouse programs architecture. Data in a database stored on a data storage device is transformed. Initially, a program template that contains parameters is retrieved. A business view is invoked with the program template. Under control of the business view, a transformation program is invoked that transforms source data to generate target data using the parameters provided in the program template.
Therefore, when considering prior art data warehousing systems, there is the need for a scalable architecture, a data distribution method and query execution method, and low-cost arrangements to overcome the problems stated above. Objects of the present invention are, therefore, to provide new and improved data warehousing systems and architectures to support the processing of queries over large and growing amounts of data, having one or more of the following capabilities, features, characteristics, and/or advantages:
- A scalable architecture comprised of a cluster of computers, possibly personal computers or low-cost workstations, able to handle large data warehouses and to scale up to accommodate the large growth of data warehouses;
- A data distribution method assuring an optimal load balance:
All nodes have approximately the same amount of information; each query requires approximately the same amount of data to be processed in each node;
- A query execution method that re-writes each query in a manner such that it is executed in parallel by all nodes, with all nodes having about the same amount of data to process within their local data;
- The data distribution method combined with the execution method assures low communication between nodes;
- High performance achieved using low-cost personal computers;
- Intra-query load balancing with a query re-writer that assures that each query is executed in a truly parallel and distributed manner;
- Inter-query load balancing assuring that the control of the execution of each query is assigned to the least busy node;
- Each node is a complete computer with an instance of a database management system. This way, each node takes advantage of all optimization methods provided by the database for the execution of a query;
- Easy reconfiguration on the addition and/or removal of nodes;
- Easy configuration and maintenance of the cluster;
- Easy and fast data load into the system;
- Transparent query execution through both clients (data analysis tools) and data management systems;
- Higher performance can be achieved by using more powerful nodes instead of personal computers;
- Adding nodes to process the same amount of data will reduce the processing time assuring a near linear speed up;
- Adding nodes to process more data will maintain the same processing time assuring a near linear scale up;
The above listed advantages of the invention, as well as others, will become more apparent with a careful review of the descriptions and figures provided herein. Attention is called to the fact, however, that the drawings and the associated descriptions are illustrative and exemplary only, and variations are certainly possible.
Summary of the invention

In accordance with the present invention, an architecture for distributed data warehouses named data warehouse star partitioning (DW-SP) using a cluster of computers, a method for loading and partitioning the DW-SP distributed data warehouse, and a query re-writer method and execution method for querying DW-SP distributed data warehouses are provided.
In the present invention a round-robin and probabilistic row-by-row data partitioning method is applied to the DW star schema. In an embodiment of the method the fact tables are partitioned across all nodes according to the partitioning methods that are objects of the current invention, and the dimension tables are replicated on all nodes. The data to be loaded into the data warehouse must pass through a traditional process of extraction, transformation and loading. Once the transformation process has finished and the data is ready to be loaded into the DW, the objects of the current invention load the data according to the partitioning methods, partitioning the fact tables and replicating the dimensions.
In the present invention a query re-write and execution method is also applied to the DW. The partitioning approaches assure an optimal data load balance across all nodes of the cluster. The query re-write and execution method takes advantage of the specific characteristics of star schemas, of the typical data warehouse query profile, and of the optimal data load balancing achieved by the data partitioning objects of the current invention to guarantee an optimal load balance of query execution, optimal intra-query parallelism, near linear speed up, and high scalability.
A key aspect of the current invention method is that all the queries (or, at least, the vast majority of queries) can be converted into queries that compute partial results, and the global result can be computed very fast from these partial results. The (partial) queries are executed independently in the different computers, which is the necessary condition to achieve optimal load balance and performance speed up.
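The partial-result property described above can be illustrated with a minimal sketch. Assuming a hypothetical numeric fact uniformly distributed over three nodes, the basic aggregates computed independently on each node combine very cheaply into the exact global result:

```python
# Sketch of the partial-result property the method relies on: each node
# aggregates its local fact rows only, and the global result is computed
# very fast from the small partial results. Data layout is hypothetical.
nodes = [[3.0, 5.0, 2.0], [7.0, 1.0], [4.0, 6.0, 8.0]]  # fact values per node

# partial results, computed independently per node
partials = [(sum(rows), len(rows), min(rows), max(rows)) for rows in nodes]

global_sum   = sum(p[0] for p in partials)   # SUM   = sum of partial sums
global_count = sum(p[1] for p in partials)   # COUNT = sum of partial counts
global_min   = min(p[2] for p in partials)   # MIN   = min of partial mins
global_max   = max(p[3] for p in partials)   # MAX   = max of partial maxes
global_avg   = global_sum / global_count     # AVG   = SUM / COUNT

# sanity check against a centralized computation over all rows
all_rows = [v for rows in nodes for v in rows]
assert (global_sum, global_count, global_min, global_max) == (
    sum(all_rows), len(all_rows), min(all_rows), max(all_rows))
print(global_sum, global_count, global_min, global_max)
```

The merge step touches only one small tuple per node, which is why the communication between nodes stays low regardless of the fact table size.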
Using the present invention a huge data warehouse can be distributed over an arbitrary number of relatively small computers, guaranteeing a scalable and cost-effective solution, as a large and very expensive server can be replaced by a number of inexpensive computers (i.e., computers with the best cost/performance ratio).
In a most preferred embodiment of the current invention, the system is comprised of an arbitrary number of single servers (possibly personal computers) forming a cluster. (It is assumed that all nodes of the cluster and the client computers are connected through a network.) One or more of the nodes receive queries from the clients (data analysis tools), and one node is assigned to control the execution of each query. That node also requests the query to be re-written, sends the re-written queries to all cluster nodes, and obtains the partial answers. Afterwards it merges the partial results into the final result and returns it to the client as if the request had been made to a single server. Cases might occur where the query can be answered using one node only.
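The controller-node flow of this embodiment can be sketched in a few lines. All function names below are hypothetical stand-ins: `rewrite` plays the role of the query re-writer, and `execute_on_node` stands in for running a partial query against one node's local database; real nodes would be remote database servers reached over the network.

```python
from concurrent.futures import ThreadPoolExecutor

# Minimal sketch of the controller flow, assuming hypothetical helpers:
# receive a query, obtain its re-written form, execute the partial query
# on every cluster node in parallel, merge the partial answers, and
# return the final result as if a single server had answered.
def rewrite(query):
    # stand-in for the query re-writer: a partial query to run on every
    # node, plus a merge function that combines the partial results
    return query, lambda partials: sum(partials)

def execute_on_node(node_rows, partial_query):
    # stand-in for executing the partial query on one node's local data
    return sum(node_rows)

def controller_answer(query, cluster):
    partial_query, merge = rewrite(query)
    with ThreadPoolExecutor(max_workers=len(cluster)) as pool:
        # the partial queries run independently, one per node
        partials = list(pool.map(
            lambda rows: execute_on_node(rows, partial_query), cluster))
    return merge(partials)

cluster = [[1, 2, 3], [4, 5], [6]]   # three nodes' hypothetical local data
print(controller_answer("SELECT SUM(x) FROM facts", cluster))
```

The client sees only the merged answer; the scatter to the nodes and the gathering of partial results happen entirely inside the controller node.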
Brief description of the drawings

In the drawings, like elements are assigned like reference numerals. It is important to note that each of the embodiments of the invention depicted is but one of a number of possible arrangements utilizing the fundamental concepts of the present invention. The drawings are briefly described as follows:
FIG. 1 - provides a simple diagram of a single-server prior art architecture, with storage and client computer arrangements for user access.
FIG. 2 - provides a simple diagram of a shared-disk distributed architecture of the prior art, with storage, nodes, and client computer arrangements for user access.
FIG. 3 - illustrates a simplified diagram of an embodiment of the loading and partitioning method.
FIG. 4 - illustrates a simplified diagram of a logical representation of an embodiment of the system architecture of the present invention.
FIG. 5 - illustrates a simplified diagram of an embodiment of the software architecture responsible for receiving queries and executing them in the system.
FIG. 6 - presents a flow chart of an embodiment of a query execution in the system.
Detailed description of the preferred embodiments

It is important to establish the definition of a number of terms and expressions that are used throughout the disclosure.
The term 'query', as generated or produced by an application (or program), may be assumed to be any request for any volume of data needed for commencement, continuation, and/or completion of data accessing and/or data processing activities. It should be understood that the query may be made in a common query language employing, for example, a structured query language (SQL) instruction, or a custom query format/language.
The term 'client computer' may be any system ranging from the personal computer (PC), which may also be termed workstation, network computer, thin client, web terminal, etc., to computing devices such as a mini-computer, a laptop, a palmtop, a handheld computer, a super computer, or any other device able to create a query in the above listed formats. As such, the term client computer is to be broadly defined. The expression 'clients (data analysis tools)', or the like, is indicative of one or more software programs that are executing, or possibly in a suspended state but ready to execute, in a client computer as defined above, and that are able to access and process data. Such applications may vary from a simple spreadsheet type of program, to a data processing software program, or even a mathematical program performing complex analysis.
The term 'DW-SP Node' may be assumed to be a personal computer or any other computer with the capacity to execute a database management system. The term 'DW-SP Controller Node' may be assumed to be a DW-SP Node which is also executing the DW-SP software. The DW-SP software may be assumed to be an embodiment of the objects of the present invention responsible for processing queries and returning the results to the clients. The term 'cluster' may be assumed to be a set of DW-SP Nodes and DW-SP Controller Nodes connected to one another with any kind of network apparatus.
The term 'software component' may be assumed to be a software program, or part of a program, with specific functionalities that is executed in a client computer (as defined above), in a DW-SP Node (as defined above), or in a DW-SP Controller Node (as defined above).
The term 'star schema' may be assumed to be a database design model used for data warehouses, comprised of fact tables and dimension tables forming a configuration similar to a star, with a fact table in the centre of the star and the dimension tables around the fact table. Several stars can exist in the same star schema. The term star schema may also be assumed to cover any database schema comprised only of fact tables and dimension tables, or any variation of the star schema model (e.g. snowflakes, mini-dimensions).
Referring now to FIG. 3, it shows a simplified diagram that illustrates an embodiment of the loading of data into the data warehouse and of the partitioning method for the star schema in accordance with the present invention. As can be seen in FIG. 3, there is a data staging area 3d containing a database with a star schema model with two dimension tables 3b and 3c and one fact table 3a. The data to be loaded into the data warehouse is stored in the tables 3a, 3b, and 3c of the data staging area 3d. It must be noted that the data staging area is only illustrative and the data to be loaded can be stored in any database, file, or any other apparatus able to store data. Before the data is in the state of being loaded into the data warehouse, as illustrated in FIG. 3, it must pass through a process of extraction and transformation (not shown). It must be noted that the star schema presented is only illustrative and can be comprised of any number of fact and/or dimension tables; furthermore, variations to the star schema are certainly possible (e.g. snowflakes, mini-dimensions). As can be seen in FIG. 3, a cluster is comprised of three nodes, 3e, 3f, and 3g (at least one node must be a DW-SP Controller Node). The cluster nodes 3e, 3f, and 3g are connected through the network connection 3h. Each node contains the same star schema model as the data staging area 3d, comprised of two dimension tables and one fact table. The data staging area 3d is connected to the nodes 3e, 3f, and 3g through the network connection 3i. The data of the dimension table 3b of the data staging area 3d is totally loaded (replicated) to the dimension table 3b of all the nodes 3e, 3f, and 3g. The data of the dimension table 3c of the data staging area 3d is totally loaded (replicated) to the dimension table 3c of all the nodes 3e, 3f, and 3g. As can be seen in FIG. 3, the data of the fact table 3a is distributed according to the round-robin partitioning method to the tables 3a1, 3a2, and 3a3 of the nodes 3e, 3f, and 3g respectively. It must be noted that the distribution according to the round-robin partitioning method is only illustrative and the probabilistic partitioning method, an object of the present invention, could also be applied, as could other partitioning methods that assure a uniform data distribution.
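The two row-by-row fact-table partitioning strategies mentioned above can be sketched as follows. Round-robin assigns row i to node i mod N; the probabilistic variant draws the target node uniformly at random, which approaches the same uniform balance for large fact tables. The row contents and node count are hypothetical.

```python
import random

# Sketch of row-by-row fact-table partitioning. Dimension tables are not
# partitioned at all: they are simply replicated to every node.
def round_robin_partition(rows, n_nodes):
    # row i goes to node i mod n_nodes, giving a perfectly even spread
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts

def probabilistic_partition(rows, n_nodes, rng=random.Random(0)):
    # each row is sent to a node drawn uniformly at random
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[rng.randrange(n_nodes)].append(row)
    return parts

fact_rows = list(range(9))            # nine hypothetical fact rows
parts = round_robin_partition(fact_rows, 3)
print([len(p) for p in parts])        # perfectly balanced across 3 nodes
```

Either way, every node ends up holding roughly the same number of fact rows, which is the load-balance property the query execution method relies on.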
Referring now to FIG. 4, there is provided a high-level block diagram of a logical representation of a portion of an embodiment of a distributed data warehouse system in accordance with the present invention. It should be noted that the data distribution is not included in FIG. 4, as it was shown in FIG. 3, and the DW-SP software architecture is not included in FIG. 4, as it will be shown in FIG. 5. As can be seen in FIG. 4, a cluster 4d is comprised of two types of nodes: DW-SP Nodes 4a and DW-SP Controller Nodes 4b. All DW-SP Nodes 4a and DW-SP Controller Nodes 4b are executing a database management system application, have their portion of the data warehouse data, are all connected through a network connection 4c, and share nothing among them. The Client Computers 1a are executing data analysis tools (not shown) and are connected to the DW-SP Nodes 4a and DW-SP Controller Nodes 4b through a network connection 1b. The DW-SP Controller Nodes 4b are executing the DW-SP software and are ready to receive queries from the data analysis tools being executed on the Client Computers 1a. The data analysis tool generates a query and forwards it through 1b, using a generic client connection driver (not shown), to one of the DW-SP Controller Nodes 4b. The forwarding of the query may be via any suitable communication channel or network including, for example, Ethernet LAN, Fast Ethernet LAN, Internet, Wireless, etc.
The DW-SP Controller Node 4b that has received the query is responsible for the query execution in the cluster 4d. It uses the DW-SP software to analyse and re-write the query, execute the re-written queries in the nodes (DW-SP Nodes 4a and DW-SP Controller Nodes 4b) to obtain the necessary data (named partial results) to answer the original query, merge the partial results to obtain the final result, and send the results that answer the query to the Client Computer 1a through the network connection 1b.
FIG. 5 shows a simplified diagram representing an embodiment of the DW-SP software architecture. The Client Computers 1a are connected to the DW-SP Controller Node 5h, more specifically to the DW-SP software being executed on 5h. The Query Receiver 5a, the Query Execution Controller 5b, the Query Re-Writer 5c, the Query Executor 5d, and the Results Merger 5e are all software components of the DW-SP software. The cluster is comprised of DW-SP Controller Nodes 4b, DW-SP Nodes 4a, and a DW-SP Controller Node 5h, responsible for the execution of queries in the current illustration, connected through the network connection 5g.
The Query Receiver 5a is responsible for handling the connections of the data analysis tools (not shown) being executed on the Client Computers 1a to the DW-SP system. It receives the queries generated by the data analysis tools. The communication is performed using a client driver connection and is illustrated by 5i. After the reception of a query, the Query Receiver 5a forwards the query internally in the DW-SP software to the Query Execution Controller 5b. The internal communication is represented in FIG. 5 by the dashed arrows. The Query Execution Controller 5b is responsible for controlling the process of executing a query in the DW-SP system. It uses the Query Re-Writer 5c, the Query Executor 5d, and the Results Merger 5e to obtain the final result. The Query Re-Writer 5c analyses the query to verify the need for re-writing and, when necessary, constructs a group of partial queries to be executed in all cluster nodes to obtain the partial results. For each partial query the Query Re-Writer 5c constructs a merge query to merge the partial results obtained as the result of the execution of the partial query in all nodes of the cluster. The Query Executor 5d is responsible for executing a given partial query in all system nodes. The connection to all nodes 5f is established using a client connection driver (not shown), through the physical network connection 5g. The Results Merger 5e is responsible for merging the partial results obtained by the Query Executor 5d and building the final results, according to the merge queries built by the Query Re-Writer 5c, to be sent to the data analysis tools that requested them. To merge the results, the Results Merger 5e needs to have a connection to all cluster nodes. The connection to all nodes 5j is established using a client connection driver (not shown), through the physical network connection 5g.
Turning to FIG. 6, there is provided therein a flow chart of another embodiment of the present invention concerning the execution of queries. It must be noted that the data loading and distribution method is not represented in FIG. 6, as it was presented in FIG. 3.
At 10 a query is generated by data analysis tools being executed on the client computers and passed at 15 through a client connection driver to the DW-SP software being executed on a DW-SP Controller Node. As discussed, the passing of the query may be realized by employing any suitable communication network. At 20 the query is received by the DW-SP software. At 30 the query is analysed by the Query Re-Writer in order to verify whether the query needs to be re-written to be executed in the DW-SP system. If the query requires re-writing, at 40 the query is re-written into one or more partial queries and one merge query per partial query.
Each query is partitioned according to the following characteristics: the existence of aggregation functions and the access to fact tables. The queries or sub-queries that access fact tables need to be executed in all nodes. The aggregation functions in queries and sub-queries that access fact tables must be decomposed into the count, sum, min, and max basic aggregation functions. For each query that needs to be executed in all nodes, a partial query and a merge query must be constructed; the partial query must be executed in all nodes to obtain the partial results, and the merge query must gather the partial results and compute the final result. A query that does not access fact tables must be executed in one and only one node. The sub-queries that do not access fact tables need neither to be partitioned nor executed independently in the system; they are executed together with the query that contains them.
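The rewrite step above can be illustrated with a hedged sketch. An aggregation query over the fact table becomes (a) a partial query executed unchanged against every node's local partition and (b) a merge query that combines the partial results; an average, for example, is decomposed into the basic SUM and COUNT functions and recomposed only at merge time. The schema, data, and node layout below are hypothetical, and three in-memory SQLite databases stand in for the cluster nodes.

```python
import sqlite3

# Hypothetical rewrite of: SELECT store_id, AVG(amount) FROM fact_sales
# GROUP BY store_id. AVG is decomposed into SUM and COUNT in the partial
# query; the merge query recombines them over the gathered partials.
partial_query = ("SELECT store_id, SUM(amount) AS s, COUNT(*) AS c "
                 "FROM fact_sales GROUP BY store_id")
merge_query = ("SELECT store_id, SUM(s), SUM(c), SUM(s) / SUM(c) "
               "FROM partials GROUP BY store_id ORDER BY store_id")

node_data = [  # (store_id, amount) fact rows held locally by each node
    [(1, 10.0), (2, 20.0)],
    [(1, 30.0)],
    [(2, 40.0), (1, 50.0)],
]

partial_results = []
for rows in node_data:                       # partial query runs per node
    node = sqlite3.connect(":memory:")
    node.execute("CREATE TABLE fact_sales (store_id INTEGER, amount REAL)")
    node.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    partial_results.extend(node.execute(partial_query).fetchall())

merger = sqlite3.connect(":memory:")         # controller gathers and merges
merger.execute("CREATE TABLE partials (store_id INTEGER, s REAL, c INTEGER)")
merger.executemany("INSERT INTO partials VALUES (?, ?, ?)", partial_results)
merged = merger.execute(merge_query).fetchall()
print(merged)
```

Note that the merge query operates only on the small table of partial results, never on the fact rows themselves, so the final step is cheap regardless of how large the distributed fact table is.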
At 50, one partial query is executed by the Query Executor in all nodes of the cluster. At 60, the partial results obtained at 50 are merged by the Results Merger using the adequate merge query built for this purpose by the Query Re-Writer at 40. At 70 it is verified whether there are more partial queries to be executed; if there are, execution goes back to 50; if there are not, the final result has been obtained and is sent back, at 80, to the data analysis tool that requested it.
If at 30 the Query Re-Writer verifies that the query does not require re-writing, then the query is executed in a single node of the system and the results are sent back to the data analysis tools at 80.
It is important to understand that the description of the embodiments of the present invention is illustrative only, and other equivalent arrangements are certainly possible. Therefore, while there have been described herein the currently preferred embodiments of a distributed data warehouse using a cluster of shared-nothing computers with a star schema model data distribution method and query partitioning and execution method, those skilled in the art will recognize that other and further modifications may be made without departing from the present invention, and it is intended to claim all modifications and variations that fall within the scope of the appended claims.