
WO2005076160A1 - Data warehouse distributed system and architecture to support distributed query execution - Google Patents


Info

Publication number: WO2005076160A1
Authority: WO (WIPO, PCT)
Application number: PCT/PT2004/000001
Other languages: French (fr)
Prior art keywords: query, data, nodes, cluster, tables
Inventors: Jorge Vieira, Jorge Bernardino, Henrique Madeira, Marco Costa
Original assignee: Critical Software, SA
Application filed by Critical Software, SA; priority to PCT/PT2004/000001; published as WO2005076160A1.

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval of structured data, e.g. relational data
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2458 - Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2471 - Distributed queries
    • G06F 16/2453 - Query optimisation
    • G06F 16/24532 - Query optimisation of parallel queries

Abstract

A distributed system architecture, data distribution method, and query partition and execution method for data warehousing distributed systems comprised of a cluster of independent nodes connected through a network. Queries are generated by data accessing and data analysis applications executing in client computers. The method distributes data through all physical servers (nodes) according to a uniform distribution, assuring an optimal load balance in which all nodes have approximately the same amount of information and each query requires approximately the same amount of data to be processed in each node. The method also enables each query to be re-written so that it is executed in parallel by all nodes, each node processing an equivalent amount of its local data, providing a near linear speed-up and scale-up.

Description

DESCRIPTION
"DATA WAREHOUSE DISTRIBUTED SYSTEM AND ARCHITECTURE TO SUPPORT DISTRIBUTED QUERY EXECUTION"
Field of the invention

This invention relates to data warehousing and decision support systems. Particularly, the invention relates to a method to distribute a data warehouse over a large number of low cost computers, assuring a near linear speed-up and scale-up.
The invention provides the necessary tools to load and distribute the data in the cluster, to execute queries in the system, and to manage the system.
Background of the invention

Data warehousing has had an explosive growth in recent years. In fact, today's highly competitive market drives organizations to acquire information systems dedicated to helping business managers make faster and more effective decisions. The information stored in these systems - generically called data warehouses - is historical in nature and can include detailed transactional data from operational databases, legacy systems, worksheets or even external data sources. All this data must be integrated and reconciled before it is stored in the data warehouse for decision support purposes.
Data warehouses range from comprehensive enterprise-wide data warehouses to subject or application oriented data marts. No matter the data size, data warehouses must have efficient Online Analytical Processing (OLAP) tools to explore the data and to provide users with real insight into the data in the warehouse.
In logical terms, a data warehouse is organized according to the multidimensional model. Each dimension of this model represents a different perspective for the analysis of the business. For instance, in the classical example of a chain of stores business, some of the dimensions are products, stores, and time. Each cell within the multidimensional structure (a cube in this simple three-dimensional example) contains data (typically numerical facts) along each of the dimensions. For example, a single cell may contain the total sales for a given product in a given store on a single day.
Although the most flexible way to store the data in a data warehouse could be a multidimensional database server, most data warehouses and OLAP applications store the data in a relational database. That is, the multidimensional model is implemented as a star schema formed by a large central fact table surrounded by several dimension tables related to the fact table by foreign keys. One well-known problem of data warehouses is that they tend to be extremely large, which causes heavy storage and performance problems. Data warehouses also tend to grow quite rapidly. Thus, a scalable architecture is crucial in a warehouse environment, not only to handle very large amounts of data but also to assure interactive response time to OLAP users.
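As a concrete illustration of such a star schema, the following is a minimal sketch using SQLite; the table and column names (a sales fact table with product, store, and time dimensions) are hypothetical choices matching the chain-of-stores example above, and are not prescribed by this disclosure.

    # A minimal star schema for the chain-of-stores example, using SQLite.
    # All table and column names are illustrative assumptions.
    import sqlite3

    DDL = """
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE store    (store_id   INTEGER PRIMARY KEY, city TEXT, region TEXT);
    CREATE TABLE time_dim (time_id    INTEGER PRIMARY KEY, day TEXT, month INTEGER, year INTEGER);

    -- Central fact table: one row per (product, store, day) with numeric
    -- measures, related to each dimension table by a foreign key.
    CREATE TABLE sales (
        product_id INTEGER REFERENCES product(product_id),
        store_id   INTEGER REFERENCES store(store_id),
        time_id    INTEGER REFERENCES time_dim(time_id),
        amount     REAL,
        quantity   INTEGER
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)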
Typical warehouse queries are very complex and ad hoc in nature; they generally access huge volumes of warehouse data and perform many joins and aggregations. Additionally, the decision making process using OLAP is often based on a sequence of interactive queries. That is, the answer to one query immediately sets the need for a second query, the answer to this second query raises another query, and so on in an ad hoc manner. Thus, efficient query processing is a critical requirement to cope with the large amounts of data involved and to assure interactive response time.
As can be seen in FIG. 1 - Single Server Architecture, prior art data warehousing systems were comprised of a single server (1d) with a storage system (1c) that receives queries sent by the clients (1a) through a connection (1b) to the server. The connection was established using a driver that could be ODBC (Open Database Connectivity) or any other driver that provides access to the database. This architecture has evolved over time, with variations intended to increase processing capability. SMP (Symmetric Multiprocessing) is one of these evolutions, in which processors were added to the system, sharing the same bus, memory, and I/O. This architecture is also known as Shared-Everything or Shared-Memory. Communication between processors is performed through shared memory, so scalability was thought to be very high; however, all processors access all data, and the amount of data that crosses the bus creates a bottleneck that severely limits the scalability of these systems.
MPP (Massively Parallel Processing), as can be seen in FIG. 2 - Shared Disk Architecture, introduced a big server (2a) comprised of several nodes (2b), each one with its own memory, bus, I/O, and optionally its own disks. In the illustration presented in FIG. 2 the storage (1c) is separated and the system is called Shared Disk. MPP removed the bus sharing bottleneck present in SMP, enabling the system to scale to several hundreds of processors. Due to the lack of shared memory, the major issue of these architectures is the communication between nodes. The performance and scalability of these systems is directly linked to the amount of data that must be exchanged between nodes through the network (2c). In order to achieve good performance, an efficient distribution of data must be combined with an efficient query execution method. The methods proposed to distribute the data are general purpose and do not consider the special characteristics of data warehouses, and so are not optimized for data warehouse systems. In spite of the queries being executed in parallel, the high overhead introduced by communication through the network leads to poor speed-up and scale-up. In practice, each query does not take advantage of the entire distributed system.
The prior art architectures described to this point are able to manage high volumes of data. However, they require very expensive hardware and software. Furthermore, they have two additional major problems: lack of optimal data partitioning methods and lack of efficient query execution methods to take advantage of the processing power of the distributed system.
Regarding the partitioning method, there have been several approaches. Range partitioning maps data into partitions based on ranges of partition keys. It could be used, for example, to partition a table into monthly partitions, each partition being assigned to one node. It is commonly used in data warehouse environments, with good performance results when the queries access only a small number of partitions, which is rarely the case. Hash partitioning maps data into partitions based on a hash function applied to partition keys. It can be used to distribute the data through the partitions without a semantic meaning. This is typically used to increase I/O performance, because it is possible to read from all partitions at the same time. However, as hash partitioning does not assume any prior knowledge of the most frequent queries, it cannot take real advantage of local partition indexing, as is the case with range partitioning.
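The two prior art schemes can be sketched as simple row-to-node mapping functions; the month key, the three-node cluster, and the use of Python's built-in hash() are illustrative assumptions, not part of the prior art systems themselves.

    # Sketch of the two prior art partitioning schemes described above.
    # The month key and the three-node cluster are illustrative assumptions.

    def range_partition(month: int, num_nodes: int) -> int:
        """Range partitioning: contiguous key ranges map to nodes (here,
        blocks of months). A query touching few ranges hits few nodes,
        which, as noted above, is rarely the case for warehouse queries."""
        months_per_node = 12 // num_nodes  # assumes num_nodes divides 12
        return min((month - 1) // months_per_node, num_nodes - 1)

    def hash_partition(key: int, num_nodes: int) -> int:
        """Hash partitioning: spreads rows evenly with no semantic meaning,
        so all partitions can be read in parallel for better I/O."""
        return hash(key) % num_nodes

    # Months 1-4 land on node 0, 5-8 on node 1, 9-12 on node 2:
    print([range_partition(m, 3) for m in range(1, 13)])
    # Hash spreads arbitrary keys evenly, without any range locality:
    print([hash_partition(k, 3) for k in range(10)])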
Regarding the execution method, there are two kinds of parallelism: inter-query and intra-query. Inter-query parallelism means that several queries are being executed at the same time, and it is widely used in commercially available database management systems. Intra-query parallelism means that each single query is partitioned and executed in parallel by more than one node. Intra-query parallelism is not widely used, and few database management systems incorporate some degree of it. The query execution is divided into several steps (e.g. table scan, merge, sort, join, etc.) and, when possible, more than one of these operations is executed in parallel. This kind of parallelism is not specifically conceived for distributed systems and does not take full advantage of them, because only a few nodes, and not the entire processing power available, are used for each query.
Patent number US 6,178,418 of the prior art describes an architecture and method of operation for servicing data queries within a distributed data warehouse system composed of a plurality of physical servers housing the data to be accessed. Queries are generated by data accessing and processing applications executing on client computers within the distributed data warehouse system. The method enables queried data to be requested by, and delivered to, the respective client computers without having to pass through a middle-tier server. The method further provides a manageable and scalable system architecture which readily supports a variety of physical servers, including legacy databases, and query format translation when required. Although it scales in the number of physical servers, the inventors do not discuss performance considerations in using the system for data warehouses, which, as already discussed, can grow to hundreds of gigabytes of information, representing serious performance problems for current data warehouse systems and architectures.
Patent number US 6,438,538 of the prior art provides a method, apparatus and program storage device for optimizing a query, including aggregate and grouping functions, in a relational database management system. An application table is preferably located in a source site and an aggregation operation is performed from a target site. After an initial aggregation operation performed from the source-based application table, further aggregation operations are performed incrementally, by aggregating only the newly inputted relational database data into the target-based base aggregates table. This procedure allows the transformed query to perform more efficiently than the original query, while providing the same results.
Patent number US 6,418,450 of the prior art describes a method, apparatus, and article of manufacture for a data warehouse programs architecture, in which data in a database stored on a data storage device is transformed. Initially, a program template is retrieved that contains parameters. A business view is invoked with the program template. Under control of the business view, a transformation program is invoked that transforms source data to generate target data using the parameters provided in the program template.
Therefore, when considering prior art data warehousing systems, there is the need for a scalable architecture, a data distribution method and query execution method, and low cost arrangements to overcome the problems stated above. Objects of the present invention are, therefore, to provide new and improved data warehousing systems and architectures to support the processing of queries over large and scaling amounts of data, having one or more of the following capabilities, features, characteristics, and/or advantages:
- A scalable architecture comprised of a cluster of computers, possibly personal computers or low cost workstations, able to handle large data warehouses and able to scale up to accommodate the large growth of data warehouses;
- A data distribution method assuring an optimal load balance: all nodes have approximately the same amount of information, and each query requires approximately the same amount of data to be processed in each node;
- A query execution method that re-writes each query in a manner that it is executed in parallel by all nodes, with all nodes having about the same amount of data to process within their local data;
- The data distribution method combined with the execution method assures low communication between nodes;
- High performance achieved using low cost personal computers;
- Intra-query load balancing with a query re-writer that assures that each query is executed in a truly parallel and distributed manner;
- Inter-query load balancing assuring that the control of the execution of each query is assigned to the less busy node;
- Each node is a complete computer with an instance of a database management system. This way, each node takes advantage of all optimization methods provided by the database for the execution of a query;
- Easy reconfiguration on the addition and/or removal of nodes;
- Easy configuration and maintenance of the cluster;
- Easy and fast data load into the system;
- Transparent query execution for both clients (data analysis tools) and data management systems;
- Higher performance can be achieved by using more powerful nodes instead of personal computers;
- Adding nodes to process the same amount of data will reduce the processing time, assuring a near linear speed-up;
- Adding nodes to process more data will maintain the same processing time, assuring a near linear scale-up.

The above listed advantages of the invention, as well as others, will become more apparent with a careful review of the descriptions and figures provided herein. Attention is called to the fact, however, that the drawings and the associated descriptions are illustrative and exemplary only, and variations are certainly possible.
Summary of the invention

In accordance with the present invention, an architecture for distributed data warehouses named data warehouse star partitioning (DW-SP) using a cluster of computers, a method for loading and partitioning the DW-SP distributed data warehouse, and a query re-write method and execution method for querying DW-SP distributed data warehouses are provided.
In the present invention a round-robin and probabilistic row-by-row data partitioning method is applied to the DW star schema. In an embodiment of the method, the fact tables are partitioned through all nodes according to the partitioning methods that are objects of the current invention, and dimension tables are replicated in all nodes. The data to be loaded into the data warehouse must pass through a traditional process of extraction, transformation and loading. Once the transformation process has finished and the data is ready to be loaded into the DW, objects of the current invention load the data according to the partitioning methods, partitioning the fact tables and replicating the dimensions. In the present invention a query re-write and execution method is also applied to the DW. The partitioning approaches assure an optimal data load balance through all nodes of the cluster. The query re-write and execution method takes advantage of the specific characteristics of star schemas, the typical data warehouse query profile, and the optimal data load balancing achieved by the data partitioning objects of the current invention, to guarantee an optimal load balance of query execution and optimal intra-query parallelism. Together, the data partitioning approach and the query re-write and execution method guarantee an optimal load balance and a near linear speed-up of query execution, and assure high scalability.
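A minimal sketch of this loading step follows, assuming a three-node cluster and in-memory Python lists in place of real database tables; both the round-robin assignment and the uniform random ("probabilistic") alternative are shown.

    # Sketch of the DW-SP loading step: dimension rows are replicated to
    # every node, fact rows are assigned one node each, either round-robin
    # or by a uniform random choice. Lists stand in for database tables.
    import itertools
    import random

    def load(dim_rows, fact_rows, num_nodes, probabilistic=False):
        nodes = [{"dims": list(dim_rows), "facts": []} for _ in range(num_nodes)]
        rr = itertools.cycle(range(num_nodes))
        for row in fact_rows:
            target = random.randrange(num_nodes) if probabilistic else next(rr)
            nodes[target]["facts"].append(row)
        return nodes

    cluster = load(dim_rows=[("p1", "soap"), ("p2", "tea")],
                   fact_rows=[(i, 1.5 * i) for i in range(9)],
                   num_nodes=3)
    assert all(len(node["dims"]) == 2 for node in cluster)        # replicated
    assert [len(node["facts"]) for node in cluster] == [3, 3, 3]  # balanced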
A key aspect of the method of the current invention is that all queries (or, at least, the vast majority of queries) can be converted into queries that compute partial results, and the global result can be computed very fast from these partial results. The (partial) queries are executed independently in the different computers, which is the necessary condition to achieve optimal load balance and performance speed-up.
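The idea can be seen in miniature below: with the fact rows split evenly across nodes, a global SUM is just the sum of the per-node sums. The node contents and the query are hypothetical.

    # Sketch of the partial-result principle: each node holds a disjoint
    # slice of the fact table and runs the same partial query; the global
    # result is computed cheaply from the partial results.
    node_fact_slices = [[10.0, 20.0], [5.0, 15.0], [30.0]]

    # Partial query (run independently on each node):
    #   SELECT SUM(amount) FROM sales
    partials = [sum(rows) for rows in node_fact_slices]

    # Merge step: the global SUM is the SUM of the partial sums.
    total = sum(partials)
    assert total == 80.0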
Using the present invention, a huge data warehouse can be distributed over an arbitrary number of relatively small computers, guaranteeing a scalable and cost-effective solution, as a large and very expensive server can be replaced by a number of inexpensive computers (i.e., computers with the best cost/performance relation).
In a most preferred embodiment of the current invention, the system is comprised of an arbitrary number of single servers (possibly personal computers) forming a cluster. (It is assumed that all nodes of the cluster and the client computers are connected through a network.) One or more of the nodes receive queries from the clients (data analysis tools), and one node is assigned to control the execution of each query. That node requests the query re-write and sends the re-written queries to all cluster nodes, collecting the partial answers. Afterwards it merges the partial results into the final result and returns it to the client as if the request had been made to a single server. Cases might occur where the query can be answered using one node only.
Brief description of the drawings

In the drawings, like elements are assigned like reference numerals. It is important to note that each of the embodiments of the invention depicted is but one of a number of possible arrangements utilizing the fundamental concepts of the present invention. The drawings are briefly described as follows:

FIG. 1 - provides a simple diagram of a prior art single server architecture, with storage and client computer arrangements for user access.
FIG. 2 - provides a simple diagram of a prior art shared-disk distributed architecture, with storage, nodes, and client computer arrangements for user access.
FIG. 3 - illustrates a simplified diagram of an embodiment of the loading and partitioning method.
FIG. 4 - illustrates a simplified diagram of a logical representation of an embodiment of the system architecture of the present invention.
FIG. 5 - illustrates a simplified diagram of an embodiment of the software architecture responsible for receiving queries and executing them in the system.
FIG. 6 - presents a flow chart of an embodiment of query execution in the system.
Detailed description of the preferred embodiments

It is important to establish the definition of a number of terms and expressions that are used throughout the disclosure.
The term 'query', as generated or produced by an application (or program), may be assumed to be any request for any volume of data needed for commencement, continuation, and/or completion of data accessing and/or data processing activities. It should be understood that the query may be made in a common query language employing, for example, a structured query language (SQL) instruction, or a custom query format/language.
The term 'client computer' may be any system ranging from the personal computer (PC), which may also be termed a workstation, network computer, thin client, web terminal, etc., to computer devices such as a mini-computer, a laptop, a palmtop, a handheld computer, a super computer, or any other device able to create a query in the above listed formats. As such, the term client computer is to be broadly defined. The expression 'clients (data analysis tools)' or the like is indicative of one or more software programs that are executing, or possibly in a suspended state but ready to execute, in a client computer as defined above, and that are able to access and process data. Such applications may vary from a simple spreadsheet type of program, to a data processing software program, or even a mathematical program performing complex analysis.
The term 'DW-SP Node' may be assumed to be a personal computer or any other computer with the capacity of executing a database management system. The term 'DW-SP Controller Node' may be assumed to be a DW-SP Node which is also executing the DW-SP software. The DW-SP software may be assumed to be an embodiment of the objects of the present invention responsible for processing queries and returning the results to the clients. The term 'cluster' may be assumed to be a set of DW-SP Nodes and DW-SP Controller Nodes connected between them by any kind of network apparatus. The term 'software component' may be assumed to be a software program, or part of a program, with specific functionalities, that is executed in a client computer (as defined above), in a DW-SP Node (as defined above), or in a DW-SP Controller Node (as defined above).
The term 'star schema' may be assumed to be a database design model used for data warehouses, comprised of fact tables and dimension tables forming a configuration similar to a star, with a fact table in the centre of the star and the dimension tables around the fact table. Several stars can exist in the same star schema. The term star schema may also be assumed to cover a database schema comprised only of fact tables and dimension tables, or any variation of the star schema model (e.g. snow-flakes, mini-dimensions).
Referring now to FIG. 3, it shows a simplified diagram that illustrates an embodiment of the loading of data into the data warehouse and the partitioning method of the star schema object in accordance with the present invention. As can be seen in FIG. 3, there is a data staging area 3d containing a database with a star schema model with two dimension tables 3b and 3c and one fact table 3a. The data to be loaded into the data warehouse is stored in the tables 3a, 3b, and 3c of the data staging area 3d. It must be noted that the data staging area is only illustrative and the data to be loaded can be stored in any database, file, or any other apparatus able to store data. Before the data is in the state of being loaded into the data warehouse, as illustrated in FIG. 3, it must pass through a process of extraction and transformation (not shown). It must be noted that the star schema presented is only illustrative and can be comprised of any number of fact and/or dimension tables; furthermore, variations to the star schema are certainly possible (e.g. snow-flakes, mini-dimensions). As can be seen in FIG. 3, a cluster of nodes is comprised of three nodes, 3e, 3f, and 3g (at least one node must be a DW-SP Controller Node). The cluster nodes 3e, 3f, and 3g are connected through the network connection 3h. Each node contains the same star schema model as in the data staging area 3d, comprised of two dimension tables and one fact table. The data staging area 3d is connected to the nodes 3e, 3f, and 3g through the network connection 3i. The data of the dimension table 3b of the data staging area 3d is totally loaded (replicated) to the dimension table 3b of all the nodes 3e, 3f and 3g. The data of the dimension table 3c of the data staging area 3d is totally loaded (replicated) to the dimension table 3c of all the nodes 3e, 3f and 3g. As can be seen in FIG. 3, the data of the fact table 3a is distributed according to the round-robin partitioning method to the tables 3a1, 3a2, and 3a3 of the nodes 3e, 3f, and 3g respectively. It must be noted that the distribution according to the round-robin partitioning method is only illustrative; the probabilistic partitioning method, object of the present invention, could also be applied, as could other partitioning methods that assure a uniform data distribution.

Referring now to FIG. 4, there is provided a high-level block diagram of a logical representation of a portion of an embodiment of a distributed data warehouse system in accordance with the present invention. It should be noted that the data distribution is not included in FIG. 4, but was shown in FIG. 3, and the DW-SP software architecture is not included in FIG. 4 but will be shown in FIG. 5. As can be seen in FIG. 4, a cluster 4d is comprised of two types of nodes: DW-SP Nodes 4a and DW-SP Controller Nodes 4b. All DW-SP Nodes 4a and DW-SP Controller Nodes 4b are executing a database management system application, have their portion of the data warehouse data, are all connected through a network connection 4c, and share nothing among them. The Client Computers 1a are executing data analysis tools (not shown) and are connected to the DW-SP Nodes 4a and DW-SP Controller Nodes 4b through a network connection 1b. The DW-SP Controller Nodes 4b are executing the DW-SP software and are ready to receive queries from the data analysis tools being executed in the Client Computers 1a.
The data analysis tool generates a query and forwards it through 1b, using a generic client connection driver (not shown), to one of the DW-SP Controller Nodes 4b. The forwarding of the query may be via any suitable communication channel or network including, for example, Ethernet LAN, Fast Ethernet LAN, Internet, wireless, etc.
The DW-SP Controller Node 4b that has received the query is responsible for the query execution in the cluster 4d. It uses the DW-SP software to analyse and re-write the query, execute the re-written queries in the nodes (DW-SP Nodes 4a and DW-SP Controller Nodes 4b) to obtain the necessary data (named partial results) to answer the original query, merge the partial results to obtain the final result, and send the results that answer the query to the Client Computer 1a through 1b.
FIG. 5 shows a simplified diagram representing an embodiment of the DW-SP Software architecture. The Client Computers 1a are connected to the DW-SP Controller Node 5h; more specifically, they are connected to the DW-SP Software being executed in 5h. The Query Receiver 5a, the Query Execution Controller 5b, the Query Re-Writer 5c, the Query Executor 5d, and the Results Merger 5e are all software components of the DW-SP software. The cluster is comprised of DW-SP Controller Nodes 4b, DW-SP Nodes 4a, and a DW-SP Controller Node 5h, responsible for the execution of queries in the current illustration, connected through the network connection 5g.
The Query Receiver 5a is responsible for handling the connections of the data analysis tools (not shown), being executed in the Client Computers 1a, to the DW-SP System. It receives the queries generated by the data analysis tools. The communication is performed using a client driver connection and is illustrated by 5i. After the reception of the query, the Query Receiver 5a forwards the query internally in the DW-SP Software to the Query Execution Controller 5b. The internal communication is represented in FIG. 5 by the dashed arrows.

The Query Execution Controller 5b is responsible for controlling the process of executing a query in the DW-SP System. It uses the Query Re-Writer 5c, the Query Executor 5d, and the Results Merger 5e to obtain the final result. The Query Re-Writer 5c analyses the query to verify the need for re-writing and, when necessary, constructs a group of partial queries to be executed in all cluster nodes to obtain the partial results. For each partial query, the Query Re-Writer 5c constructs a merge query to merge the partial results obtained as the result of the execution of that partial query in all nodes of the cluster.

The Query Executor 5d is responsible for executing a given partial query in all system nodes. The connection to all nodes 5f is established using a client connection driver (not shown), through the physical network connection 5g. The Results Merger 5e is responsible for merging the partial results obtained by the Query Executor 5d and building the final results, according to the merge queries built by the Query Re-Writer 5c, to be sent to the data analysis tools that requested them. To merge the results, the Results Merger 5e needs to have a connection to all cluster nodes. The connection to all nodes 5j is established using a client connection driver (not shown), through the physical network connection 5g.
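A sketch of how these components could fit together is given below: a stubbed node stands in for a DW-SP Node reached through a client connection driver, a thread pool plays the role of the Query Executor fanning one partial query out to every node, and a merge function plays the Results Merger. All names and the hard-coded query are illustrative assumptions, not the actual DW-SP software.

    # Sketch of the controller flow: run one partial query on all nodes in
    # parallel, then merge the partial results into the final result.
    from concurrent.futures import ThreadPoolExecutor

    class NodeStub:
        """Stands in for a DW-SP Node; a real deployment would execute the
        partial query in the node's local DBMS over the network."""
        def __init__(self, fact_rows):
            self.fact_rows = fact_rows

        def execute(self, partial_query):
            # Only one hard-coded aggregate is supported, for illustration.
            assert partial_query == "SELECT SUM(amount) FROM sales"
            return sum(self.fact_rows)

    def run_query(nodes, partial_query, merge):
        # Query Executor role: the same partial query goes to every node.
        with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
            partials = list(pool.map(lambda n: n.execute(partial_query), nodes))
        # Results Merger role: combine partial results into the final result.
        return merge(partials)

    nodes = [NodeStub([10.0, 20.0]), NodeStub([5.0, 15.0]), NodeStub([30.0])]
    print(run_query(nodes, "SELECT SUM(amount) FROM sales", merge=sum))  # 80.0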
Turning to FIG. 6, there is provided a flow chart of another embodiment of the present invention, concerning the execution of queries. It must be noted that the data loading and distribution method is not represented in FIG. 6; it was presented in FIG. 3.
At 10 a query is generated by data analysis tools being executed in the client computers and passed, at 15, through a client connection driver to the DW-SP Software being executed in a DW-SP Controller Node. As discussed, the passing of the query may be realized by employing any suitable communication network. At 20 the query is received by the DW-SP Software. At 30 the query is analysed by the Query Re-Writer in order to verify if the query needs to be re-written to be executed in the DW-SP System. If the query requires re-writing, at 40 the query is re-written into one or more partial queries and one merge query per partial query.
Each query is partitioned according to the following characteristics: existence of aggregation functions and access to fact tables. The queries or sub-queries that access fact tables need to be executed in all nodes. The aggregation functions in queries and sub-queries that access fact tables must be partitioned into count, sum, min, and max basic aggregation functions. For each query that needs to be executed in all nodes, a partial query and a merge query must be constructed; the partial query must be executed in all nodes to obtain the partial results, and the merge query must gather the partial results and compute a final result. A query that does not access fact tables must be executed in one and only one node. The sub-queries that do not access fact tables need neither to be partitioned nor executed independently in the system, but are executed together with the query that contains them.
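The rule for aggregation functions can be sketched as a table of merge operations over partial results; COUNT, SUM, MIN and MAX merge trivially, and AVG is shown as a derived example that must be rewritten into SUM and COUNT partials (a standard consequence of the rule, illustrated here with assumed values).

    # Sketch of merging the basic aggregation functions named above.
    MERGE_RULES = {
        "COUNT": sum,  # global count = sum of per-node counts
        "SUM":   sum,
        "MIN":   min,
        "MAX":   max,
    }

    def merge_avg(sum_parts, count_parts):
        # AVG cannot be merged directly: an average of per-node averages is
        # wrong when nodes hold different row counts, so the partial query
        # computes SUM and COUNT and the merge query divides them.
        return sum(sum_parts) / sum(count_parts)

    # Hypothetical partial results from three nodes:
    sums, counts = [30.0, 20.0, 30.0], [2, 2, 1]
    assert MERGE_RULES["SUM"](sums) == 80.0
    assert MERGE_RULES["MIN"]([4.0, 2.5, 9.0]) == 2.5
    assert merge_avg(sums, counts) == 16.0  # 80.0 / 5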
At 50, one partial query is executed by the Query Executor in all nodes of the cluster. At 60, the partial results obtained at 50 are merged by the Results Merger using the adequate merge query built for this purpose by the Query Re-Writer at 40. At 70 it is verified whether there are more partial queries to be executed; if there are, execution goes back to 50; if there are not, the final result has been obtained and is sent back, at 80, to the data analysis tool that requested it.
If at 30 the Query Re-Writer verifies that the query does not require re-writing, then the query is executed in a single node of the system and the results are sent back to the data analysis tools at 80.
It is important to understand that the descriptions of the embodiments of the present invention are illustrative only, and other equivalent arrangements are certainly possible. Therefore, while there have been described herein the currently preferred embodiments of a distributed data warehouse using a cluster of shared nothing computers with a star schema model data distribution method and query partition and execution method, those skilled in the art will recognize that other and further modifications may be made without departing from the present invention, and it is intended to claim all modifications and variations as fall within the scope of the appended claims.

Claims

1. Data warehousing distributed system and architecture to support distributed query execution, comprising: a scalable architecture comprised of a cluster of computers, possibly personal computers, able to handle large data warehouses and able to accommodate the large growth of data warehouses;
- a data distribution method assuring an optimal load balance in which all physical servers (nodes) have approximately the same amount of information and each query requires approximately the same amount of data to be processed in each node;
- a query execution method that re-writes each query in a manner that it is executed in parallel by all nodes, with all nodes having a similar amount of data to process within their local data.
2. A method according to claim 1, wherein the data is loaded to the star schema tables of a data warehouse system comprised of a plurality of physical servers (nodes), sharing nothing among them and connected through a network, forming what can be denominated a cluster, each node of the cluster having its own database and all nodes having the same star schema model, the method comprising: a) loading and distributing the fact tables to all nodes of the cluster, each row of a fact table being loaded to only one node; b) loading the dimension tables to all nodes of the cluster, all rows of the dimension tables being loaded to all nodes; c) loading the tables of the star schema, other than dimension and fact tables, containing information related to dimension tables, to all nodes of the cluster, all rows of these tables being loaded to all nodes.
3. The method according to claim 2, wherein the star schema is distributed over all nodes of the cluster.
4. The method according to claim 2, wherein the rows of the fact tables are to be distributed over all nodes in a disjoint manner, according to the round-robin partitioning method, the probabilistic partitioning method, or any other partitioning method that assures that the fact table data is distributed according to a uniform distribution.
5. The method according to claim 2, wherein all rows of the dimension tables are to be loaded to all nodes of the cluster.
6. The method according to claim 2, wherein all rows of the tables other than dimension and fact tables are to be loaded to all nodes of the cluster.
7. A method according to claim 1 for servicing data queries, generated by at least one data accessing application executing on at least one client computer, to a distributed data warehouse system, the distributed data warehouse system being comprised of a plurality of physical servers (nodes) housing the data to be accessed, the method comprising: a) generating a query for data required by an application executing on the client computer; b) passing the query to a connection driver of the client computer; c) forwarding the query to the Query Receiver; d) receiving the query at the Query Receiver; e) determining whether the query needs to be re-written to be executed in the distributed system; f) re-writing the query and its sub-queries into partial and merge queries; g) executing each partial query in all nodes to obtain the partial results; h) executing the merge query of each executed partial query to merge the partial results into a final result; i) transmitting the result set back to the connection driver of the client; j) receiving the required data at the client computer and providing said data to the applications.
8. The method according to claim 7, wherein the Query Receiver receives the queries created in the data analysis tools.
9. The method according to claim 7, wherein the query is analysed concerning the need for re-writing in order to be executed in the distributed system.
10. The method according to claim 7, wherein the query or sub-query that accesses fact tables is partitioned into partial queries and merge queries.
11. The method according to claim 7, wherein one partial query and one merge query are created for each sub-query of the query that accesses fact tables.
12. The method according to claim 7, wherein the aggregation functions of queries that access fact tables are partitioned into the basic aggregation functions count, sum, min, and max.
13. The method according to claim 7, wherein a node is assigned to merge the results obtained from the execution of the partial queries.
14. The method according to claim 7, wherein each partial query is executed in all nodes of the cluster, followed by the execution of the corresponding merge query to obtain a merged result.
15. The method according to claim 7, wherein the final result is returned to the data analysis tool.
16. The method according to claim 7, wherein more than one partial query can be executed in parallel in all nodes of the cluster.
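By way of illustration only, and not as part of the claims, the following sketch shows a loading routine in the spirit of claims 2 and 4: fact rows are distributed round-robin over the node databases while every dimension row is replicated to all nodes. The connection API (insert), the table names, and the row format are assumptions of this sketch.

    from itertools import cycle

    def load_star_schema(nodes, fact_rows, dimension_tables):
        # `nodes` is a list of per-node database connections (hypothetical
        # API), `fact_rows` an iterable of rows of the fact table, and
        # `dimension_tables` a mapping of table name to its list of rows.

        # Claims 2a) and 4: each fact row goes to exactly one node, and
        # round-robin assignment yields a uniform, disjoint distribution.
        node_cycle = cycle(nodes)
        for row in fact_rows:
            next(node_cycle).insert("sales_fact", row)

        # Claims 2b), 2c), 5 and 6: dimension tables and the remaining
        # non-fact tables are fully replicated to every node, so joins
        # with the local fact slice never cross node boundaries.
        for table_name, rows in dimension_tables.items():
            for node in nodes:
                for row in rows:
                    node.insert(table_name, row)

Because every non-fact table is present on every node, each partial query of the execution method can be answered from purely local data.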

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/PT2004/000001 WO2005076160A1 (en) 2004-02-06 2004-02-06 Data warehouse distributed system and architecture to support distributed query execution

Publications (1)

Publication Number Publication Date
WO2005076160A1 (en) 2005-08-18

Family

ID=34836919

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/PT2004/000001 WO2005076160A1 (en) 2004-02-06 2004-02-06 Data warehouse distributed system and architecture to support distributed query execution

Country Status (1)

Country Link
WO (1) WO2005076160A1 (en)

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
DATABASE INSPEC [online] THE INSTITUTION OF ELECTRICAL ENGINEERS, STEVENAGE, GB; 1998, DATTA A ET AL: "A case for parallelism in data warehousing and OLAP", XP002296778, Database accession no. 6034650 *
DATABASE INSPEC [online] THE INSTITUTION OF ELECTRICAL ENGINEERS, STEVENAGE, GB; 2000, BELLATRECHE L ET AL: "OLAP query processing for partitioned data warehouses", XP002296779, Database accession no. 6604944 *
DATABASE INSPEC [online] THE INSTITUTION OF ELECTRICAL ENGINEERS, STEVENAGE, GB; 2001, BERNARDINO J ET AL: "Experimental evaluation of a new distributed partitioning technique for data warehouses", XP002296777, Database accession no. 7024493 *
PROCEEDINGS 2001 INTERNATIONAL DATABASE ENGINEERING AND APPLICATIONS SYMPOSIUM 16-18 JULY 2001 GRENOBLE, FRANCE, July 2001 (2001-07-01), Proceedings 2001 International Database Engineering and Applications Symposium IEEE Comput. Soc Los Alamitos, CA, USA, pages 312 - 321, XP002296683, ISBN: 0-7695-1140-6, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/iel5/7469/20303/00938099.pdf?tp=&arnumber=938099&isnumber=20303&arSt=312&ared=321&arAuthor=Bernardino%2C+J.%3B+Madeira%2C+H.%3B> [retrieved on 20040916] *
PROCEEDINGS NINTH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS 26-28 AUG. 1998 VIENNA, AUSTRIA, 1998, Proceedings Ninth International Workshop on Database and Expert Systems Applications (Cat. No.98EX130) IEEE Comput. Soc Los Alamitos, CA, USA, pages 226 - 231, XP002296684, ISBN: 0-8186-8353-8, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/iel4/5718/15304/00707407.pdf?tp=&arnumber=707407&isnumber=15304&arSt=226&ared=231&arAuthor=Datta%2C+A.%3B+Bongki+Moon%3B+Thomas%2C+H.%3B> [retrieved on 20040916] *
PROCEEDINGS OF 1999 INTERNATIONAL SYMPOSIUM ON DATABASE APPLICATIONS IN NON-TRADITIONAL ENVIRONMENTS (DANTE'99) 28-30 NOV. 1999 KYOTO, JAPAN, 2000, Proceedings 1999 International Symposium on Database Applications in Non-Traditional Environments (DANTE'99) (Cat. No.PR00496) IEEE Comput. Soc Los Alamitos, CA, USA, pages 35 - 42, XP002296685, ISBN: 0-7695-0496-5, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/iel5/6800/18245/00844939.pdf?tp=&arnumber=844939&isnumber=18245&arSt=35&ared=42&arAuthor=Bellatreche%2C+L.%3B+Karlapalem%2C+K.%3B+Mohania%2C+M.%3B> [retrieved on 20040916] *
RUDIN K: "WHEN PARALLEL LINES MEET", BYTE, MCGRAW-HILL INC. ST PETERBOROUGH, US, vol. 23, no. 5, 1 May 1998 (1998-05-01), pages 81 - 84,86,88, XP000774153, ISSN: 0360-5280 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008018969A1 (en) * 2006-08-04 2008-02-14 Parallel Computers Technology, Inc. Apparatus and method of optimizing database clustering with zero transaction loss
US7680765B2 (en) 2006-12-27 2010-03-16 Microsoft Corporation Iterate-aggregate query parallelization
US8046750B2 (en) 2007-06-13 2011-10-25 Microsoft Corporation Disco: a simplified distributed computing library
US9195710B2 (en) * 2007-08-07 2015-11-24 International Business Machines Corporation Query optimization in a parallel computer system to reduce network traffic
US7917541B2 (en) 2008-03-31 2011-03-29 Microsoft Corporation Collecting and aggregating data using distributed resources
US7930294B2 (en) 2008-08-12 2011-04-19 International Business Machines Corporation Method for partitioning a query
US9396228B2 (en) 2010-02-22 2016-07-19 Data Accelerator Ltd. Method of optimizing the interaction between a software application and a database server or other kind of remote data source
GB2478189A (en) * 2010-02-22 2011-08-31 Sean Corbett Method of optimizing data flow between a client and a database server
US8543642B2 (en) 2010-02-22 2013-09-24 Data Accelerator Limited Method of optimizing data flow between a software application and a database server
WO2012025884A1 (en) * 2010-08-23 2012-03-01 Nokia Corporation Method and apparatus for processing search requests for a partitioned index
CN103069421B (en) * 2010-08-23 2017-02-08 诺基亚技术有限公司 Method and apparatus for processing search requests for a partitioned index
CN103069421A (en) * 2010-08-23 2013-04-24 诺基亚公司 Method and apparatus for processing search requests for a partitioned index
US9229946B2 (en) 2010-08-23 2016-01-05 Nokia Technologies Oy Method and apparatus for processing search request for a partitioned index
US8700679B2 (en) * 2012-04-17 2014-04-15 Sap Ag Classic to in-memory cube conversion
CN104903894A (en) * 2013-01-07 2015-09-09 脸谱公司 System and method for distributed database query engine
US9361344B2 (en) 2013-01-07 2016-06-07 Facebook, Inc. System and method for distributed database query engines
US9081826B2 (en) 2013-01-07 2015-07-14 Facebook, Inc. System and method for distributed database query engines
WO2014107359A1 (en) * 2013-01-07 2014-07-10 Facebook, Inc. System and method for distributed database query engines
CN104903894B (en) * 2013-01-07 2018-12-28 脸谱公司 System and method for distributed database query engine
US10210221B2 (en) 2013-01-07 2019-02-19 Facebook, Inc. System and method for distributed database query engines
US10698913B2 (en) 2013-01-07 2020-06-30 Facebook, Inc. System and methods for distributed database query engines
US11347761B1 (en) 2013-01-07 2022-05-31 Meta Platforms, Inc. System and methods for distributed database query engines
CN113646767A (en) * 2019-03-19 2021-11-12 西格玛计算机有限公司 Enable editable tables on cloud-based data warehouse
US12153577B1 (en) 2019-03-19 2024-11-26 Sigma Computing, Inc. Enabling editable tables on a cloud-based data warehouse

Similar Documents

Publication Publication Date Title
US9946780B2 (en) Interpreting relational database statements using a virtual multidimensional data model
US20060218123A1 (en) System and Methodology for Parallel Query Optimization Using Semantic-Based Partitioning
EP1164509A2 (en) Aggregate join index for relational databases
AU2005201996B2 (en) Combining multidimensional expressions and data mining extensions to mine OLAP cubes
JP2003526159A (en) Multidimensional database and integrated aggregation server
Dehne et al. The cgmCUBE project: Optimizing parallel data cube generation for ROLAP
Chattopadhyay et al. Procella
Oussous et al. NoSQL databases for big data
US9934304B2 (en) Systems and methods for memory optimization interest-driven business intelligence systems
Hua et al. ANTELOPE: A semantic-aware data cube scheme for cloud data center networks
WO2005076160A1 (en) Data warehouse distributed system and architecture to support distributed query execution
Özsu et al. NoSQL, NewSQL, and polystores
Gueidi et al. Towards unified modeling for NoSQL solution based on mapping approach
Cuzzocrea et al. A rewrite/merge approach for supporting real-time data warehousing via lightweight data integration.
Schreiner et al. A hybrid partitioning strategy for NewSQL databases: the VoltDB case
Wehrle et al. A model for distributing and querying a data warehouse on a computing grid
Johnson et al. Hierarchically split cube forests for decision support: description and tuned design
Khedr Decomposable algorithm for computing k-nearest neighbours across partitioned data
Sergey et al. Applying map-reduce paradigm for parallel closed cube computation
Ranawade et al. Online analytical processing on hadoop using apache kylin
Lawrence et al. The OLAP-enabled grid: Model and query processing algorithms
Valduriez Principles of distributed data management in 2020?
Tian et al. A fast location service for partial spatial replicas
Ordonez et al. A survey on parallel database systems from a storage perspective: rows versus columns
Fiore et al. Towards high performance data analytics for climate change

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase