CN103927346B

CN103927346B - Query connection method on basis of data volumes

Info

Publication number: CN103927346B
Application number: CN201410124531.1A
Authority: CN
Inventors: 陈岭; 周强
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-03-28
Filing date: 2014-03-28
Publication date: 2017-02-15
Anticipated expiration: 2034-03-28
Also published as: CN103927346A

Abstract

The invention discloses a query connection method based on data volume. In the real-time query of big data, this method deeply considers the characteristics of columnar file reading and other characteristics to estimate the cost to ensure the generation of the best connection sequence. It mainly includes: first constructing the metadata server; then completing the collection of statistical information; Secondly, by querying the metadata server to obtain the relevant statistical information of each table participating in the connection; then, according to the statistical information, the selection and data volume and other related parameters are estimated; finally, the optimal connection sequence is found by calculating the corresponding cost of each execution plan. This method can improve the accuracy of cost estimation, so as to ensure that the execution plan is optimal, and effectively improve the efficiency of the overall query.

Description

Query join method based on data volume

技术领域technical field

本发明涉及大数据实时查询优化技术领域，尤其涉及一种基于数据量的查询连接方法。The invention relates to the technical field of big data real-time query optimization, in particular to a query connection method based on data volume.

背景技术Background technique

大数据实时查询是重要的大数据技术，现有的大数据查询系统有Google Dremel、Cloudera Impala、Berkeley Shark、Apache Drill等。大数据实时查询一般采用分布式计算架构，由于弱化了对事务等功能的支持，所以相对于关系型数据库集群具有更高的可扩展性。同时由于大数据实时查询能很好的满足实时查询的用户需求，因此其在互联网、智慧城市等领域有广阔的应用空间。Big data real-time query is an important big data technology. Existing big data query systems include Google Dremel, Cloudera Impala, Berkeley Shark, Apache Drill, etc. Big data real-time query generally adopts a distributed computing architecture, which has higher scalability than relational database clusters due to weakened support for functions such as transactions. At the same time, because big data real-time query can well meet the needs of real-time query users, it has broad application space in the Internet, smart cities and other fields.

多连接顺序查询优化是数据库管理系统的重要组成部分，在大数据实时查询技术领域中同样具备不可替代的重要性。其通过采用一定的优化方法，不断遍历执行计划的搜索空间，找出最佳的连接顺序，以生成最佳的执行计划，从而提升大数据查询系统的性能，满足用户查询的实时性需求。Multi-connection sequential query optimization is an important part of the database management system, and it is also of irreplaceable importance in the field of big data real-time query technology. By adopting a certain optimization method, it continuously traverses the search space of the execution plan to find the best connection sequence to generate the best execution plan, thereby improving the performance of the big data query system and meeting the real-time requirements of user queries.

由于代价估计是多连接顺序查询优化过程中非常重要的部分，能否给出一种有效的查询代价估计方法是查询优化有效实现的关键。传统的代价估计方法是一种基于表基数的方法，通过该方法能够有效的解决传统代价估算问题，从而保证找到符合代价模型的最佳执行计划。但在分布式数据库系统或数据仓库中，存在以列式文件格式存储的数据表，该格式文件是为了优化底层数据进行读取时的I/O性能及减少数据传输数据量，以RCFile文件为例，该文件是一种先按行横向切分然后按列纵向切分的文件格式，其将仅读取和传输所需的数据列。在对有列式文件格式存储的数据表参与连接时，采用传统基于表基数的代价估计方法进行估算时，其结果可能会产生严重的偏差，进而导致连接顺序优化算法找出符合代价模型的执行计划并非为最佳，即找到的连接顺序并非为最优，以致使得整个查询延迟更高。Since cost estimation is a very important part in the multi-join sequential query optimization process, whether an effective query cost estimation method can be given is the key to the effective implementation of query optimization. The traditional cost estimation method is a method based on table cardinality, which can effectively solve the traditional cost estimation problem, so as to ensure that the best execution plan conforming to the cost model is found. However, in a distributed database system or data warehouse, there are data tables stored in a columnar file format. This format file is used to optimize the I/O performance when the underlying data is read and reduce the amount of data transmitted. The RCFile file is used as the For example, the file is a file format split horizontally by rows and then vertically by columns, which will read and transfer only the required columns of data. When the data tables stored in the columnar file format participate in the connection, when the traditional cost estimation method based on the cardinality of the table is used for estimation, the result may have serious deviations, which will cause the connection order optimization algorithm to find the execution that conforms to the cost model The plan is not optimal, that is, the connection order found is not optimal, which makes the entire query delay higher.

发明内容Contents of the invention

本发明要解决的技术问题是如何确保大数据实时查询系统进行多连接顺序优化时提升其代价估计的准确性，从而提升查询整体的效率。为了解决上述传统基于表基数进行代价估算存在的问题，本发明提出了基于数据量的多连接查询代价估计方法，考虑了用户提交的查询中参与连接的部分关系可能以列式文件存储，通过深入考虑列式文件读取等特性，增加更细粒度的统计信息，利用每个字段的平均长度以估算查询的连接中间结果大小，从而有效的确保代价估算的准确性。The technical problem to be solved by the present invention is how to ensure that the big data real-time query system improves the accuracy of its cost estimation when optimizing the multi-connection sequence, thereby improving the overall efficiency of the query. In order to solve the above-mentioned problems existing in traditional cost estimation based on table cardinality, the present invention proposes a multi-join query cost estimation method based on data volume, which considers that the part of the relationship in the query submitted by the user may be stored in a columnar file, and through in-depth Consider features such as columnar file reading, add finer-grained statistical information, and use the average length of each field to estimate the size of the intermediate result of the query connection, thereby effectively ensuring the accuracy of cost estimation.

一种基于数据量的查询连接方法，包括：A query connection method based on data volume, including:

步骤1，向元数据服务器提交查询请求，获取参与连接的各表所对应的统计信息；Step 1. Submit a query request to the metadata server to obtain statistical information corresponding to each table participating in the connection;

步骤2，根据获取到的统计信息估算得到当前执行计划中所有表的数据量；Step 2, estimate the data volume of all tables in the current execution plan according to the obtained statistical information;

步骤3，重复步骤1及步骤2，直至遍历执行计划的搜索空间，找出具有合适数据量使得查询代价最小的执行计划，按该执行计划中的连接顺序进行表的连接。Step 3, repeat step 1 and step 2 until traversing the search space of the execution plan, find out the execution plan with the appropriate amount of data to minimize the query cost, and connect the tables according to the connection sequence in the execution plan.

其中执行计划的搜索空间是指所有执行计划所得到的表连接顺序的集合。The search space of the execution plan refers to the collection of table connection sequences obtained by all execution plans.

本发明以数据量作为查询代价来确定多连接查询中连接的顺序，从而确保大数据实时查询系统进行多连接顺序优化时提升其代价估计的准确性，从而提升查询整体的效率。The present invention uses the amount of data as the query cost to determine the sequence of connections in the multi-connection query, thereby ensuring that the big data real-time query system improves the accuracy of cost estimation when optimizing the multi-connection sequence, thereby improving the overall efficiency of the query.

其中，元数据服务器构建方式为，选取关系型数据库并设计列级别的表模式，根据设计好的表模式在相应的关系型数据库中创建元数据库及表关系，得到元数据服务器。Among them, the metadata server is constructed by selecting a relational database and designing a column-level table schema, and creating a metadata database and table relationships in the corresponding relational database according to the designed table schema to obtain a metadata server.

为了能够为查询系统提供表级别、分区级别和列级别等三种粒度的统计信息，设计相应的表模式需要符合适当的范式，同时在能够完成代价估计的前提下，尽量减少不必要的存储开销。In order to be able to provide the query system with statistical information at three granularities of table level, partition level, and column level, the design of the corresponding table schema needs to conform to the appropriate paradigm, and at the same time, under the premise of completing the cost estimation, the unnecessary storage overhead should be minimized .

元数据服务器中的统计信息为每张表所对应的统计信息，所述统计信息根据设计的表模式对表进行统计得到。The statistical information in the metadata server is the statistical information corresponding to each table, and the statistical information is obtained by counting the tables according to the designed table schema.

统计信息的细粒度根据表模式的细粒度得到由于表模式为列级别模式，因此统计信息包括列级别的统计信息。The fine-grainedness of the statistical information is obtained from the fine-grainedness of the table schema. Since the table schema is a column-level schema, the statistical information includes column-level statistical information.

所述关系型数据库为：MYSQL数据库，Derby数据库或Oracle数据库。The relational database is: MYSQL database, Derby database or Oracle database.

根据企业用户及系统的实际需求，选取合适的关系型数据库作为大数据实时查询系统的元数据服务器。According to the actual needs of enterprise users and systems, an appropriate relational database is selected as the metadata server of the big data real-time query system.

统计信息包括：列名、列中数据值的下界、列中数据值的上界、表中列数据为空的数量、表中列数据不同值的数量、列中字段数据的平均长度以及列中字段数据的最大长度、表或视图的总行数。Statistics include: column name, lower bound of data value in column, upper bound of data value in column, number of empty column data in table, number of different values of column data in table, average length of field data in column, and The maximum length of the field data, total number of rows for the table or view.

元数据服务器以及统计信息在元数据服务器中的存储均在离线状态下完成。The metadata server and the storage of statistical information in the metadata server are completed offline.

由于元数据服务器的构建及统计信息的收集都是离线完成，使得实际进行查询时进行统计信息的返回不需要耗费多少运行时开销，极大地降低了代价估计的时间延迟。Since the construction of the metadata server and the collection of statistical information are completed offline, the return of statistical information during actual query does not require much runtime overhead, which greatly reduces the time delay of cost estimation.

在步骤2中，每个表的数据量根据该表所对应的选择度、字段平均数据量和表的总行数计算得到。In step 2, the data volume of each table is calculated according to the selectivity of the table, the average data volume of the field and the total number of rows of the table.

选择度根据统计信息所得的列中数据值的下界、列中数据值的上界等统计信息以及连接查询中相关条件，其中一般用selectivity表示。The degree of selectivity is based on statistical information such as the lower bound of the data value in the column, the upper bound of the data value in the column, and related conditions in the connection query, which are generally expressed by selectivity.

选择度的估算方法为，根据查询中的查询条件及统计信息进行相应计算，得到表中满足查询条件的行在所要查询的对象集合中所占的比例。The method for estimating the selectivity is to perform corresponding calculations according to the query conditions and statistical information in the query, and obtain the proportion of the rows in the table that meet the query conditions in the object set to be queried.

其中的对象集合为是表、视图或者中间结果的集合。The collection of objects is a collection of tables, views or intermediate results.

数据量size的计算公式如下：The formula for calculating the data volume size is as follows:

selectivity表示查询的选择度，numsOfTableLine为表或视图的总行数，avgColSize_i表示需要返回的表中第i列字段的平均数据量，j为表的列数。selectivity indicates the selectivity of the query, numsOfTableLine is the total number of rows in the table or view, avgColSize _i indicates the average amount of data in column i of the table to be returned, and j is the number of columns in the table.

相较于传统基于表基数的估算方法，其不仅依赖于查询中间结果产生的行数大小，而且同时也将估算的数据量考虑在内，从而提升代价估算的准确性。Compared with the traditional estimation method based on table cardinality, it not only depends on the number of rows generated by the query intermediate results, but also takes the estimated data volume into consideration, thereby improving the accuracy of cost estimation.

本发明的优点包括：Advantages of the present invention include:

针对传统基于表基数的代价方法存在估算不准确的问题，深入考虑列式文件读取等特性，增加更细粒度的统计信息，有效提升了代价估算的准确性。In view of the problem of inaccurate estimation in the traditional cost method based on table cardinality, in-depth consideration of features such as columnar file reading and more fine-grained statistical information has effectively improved the accuracy of cost estimation.

通过元数据服务器存储和维护表相关统计信息，避免多次进行大量的分析工作，减少了运行时开销，提升了代价估计的效率。The metadata server stores and maintains table-related statistical information, avoiding multiple analysis tasks, reducing runtime overhead, and improving the efficiency of cost estimation.

附图说明Description of drawings

图1为本发明方法一个实施例基于数据量的查询连接方法总体流程图；Fig. 1 is an overall flow chart of the query connection method based on the amount of data in an embodiment of the method of the present invention;

图2为本发明当前实施例所采用的查询处理架构图；FIG. 2 is a query processing architecture diagram adopted in the current embodiment of the present invention;

图3为本发明当前实施例中元数据服务器构建流程图；FIG. 3 is a flow chart of constructing a metadata server in the current embodiment of the present invention;

图4为本发明当前实施例中统计信息收集流程图；Fig. 4 is a flow chart of statistical information collection in the current embodiment of the present invention;

图5为本发明当前实施例中统计信息查询流程图；Fig. 5 is a flowchart of statistical information query in the current embodiment of the present invention;

图6为本发明当前实施例中数据量估算流程图；Fig. 6 is a flow chart of data volume estimation in the current embodiment of the present invention;

图7为本发明当前实施例中连接顺序生成流程图。Fig. 7 is a flowchart of connection sequence generation in the current embodiment of the present invention.

具体实施方式detailed description

本发明提出了基于数据量的查询连接方法，在进行查询时对多连接查询进行代价估计，代价估计方法的总体流程如图1所示。其首先进行元数据服务器的构建工作；然后完成统计信息的收集；其次通过查询元数据服务器获取参与连接的各表的相关统计信息；接着根据统计信息进行选择度及数据量等相关参数的估算工作；最后采用基于数据量的估计方法计算各个执行计划相应代价找出最佳的连接顺序。The present invention proposes a query connection method based on data volume, and performs cost estimation on multi-connection queries when performing a query. The overall flow of the cost estimation method is shown in FIG. 1 . It first constructs the metadata server; then completes the collection of statistical information; secondly, obtains the relevant statistical information of each table participating in the connection by querying the metadata server; and then estimates the relevant parameters such as selectivity and data volume according to the statistical information ;Finally, use the estimation method based on the amount of data to calculate the corresponding cost of each execution plan to find the best connection sequence.

为了更直观的介绍本发明提出的方法在查询优化中的作用，现给出查询处理的架构如图2所示，其阐述了基于数据量的代价估计模块与连接顺序生成模块之间的关系。其中，连接顺序生成模块中由相关优化方法进行执行计划搜索的工作，而基于数据量的代价估计模块主要由代价模型及MetaStore两部分组成，以完成代价估计的工作。对于用户提交的查询，经过解析后将由多连接顺序查询优化方法以完成顺序优化的工作，其在进行执行计划搜索的过程中，需要调用相关代价估计模块进行代价的估算工作，以确保找到符合给定代价模型的最佳连接顺序。In order to more intuitively introduce the function of the method proposed in the present invention in query optimization, the architecture of query processing is shown in Figure 2, which illustrates the relationship between the cost estimation module based on data volume and the connection sequence generation module. Among them, in the connection sequence generation module, the relevant optimization method is used to search for the execution plan, and the cost estimation module based on data volume is mainly composed of two parts: the cost model and the MetaStore, to complete the cost estimation work. For the query submitted by the user, after parsing, the multi-connection sequential query optimization method will be used to complete the sequential optimization work. During the execution plan search process, it needs to call the relevant cost estimation module to estimate the cost, so as to ensure that it finds a query that meets the given requirements. Determines the optimal join order for the cost model.

本发明提出的基于数据量的多连接查询代价估计方法的步骤包括：The steps of the multi-connection query cost estimation method based on the amount of data proposed by the present invention include:

在进行查询连接之前首先需要构建元数据服务器并将元数据服务器中的表中所存储的统计信息。Before making a query connection, it is first necessary to build a metadata server and store statistical information in tables in the metadata server.

关系型数据库并设计表模式，构建元数据服务器。Relational database and design table schema, build metadata server.

为了基于数据量的代价估计方法能够得以高效实现，首先需要进行元数据服务器的构建工作，其流程如图3所示，具体步骤如下：In order to efficiently implement the cost estimation method based on the amount of data, it is first necessary to construct a metadata server. The process is shown in Figure 3, and the specific steps are as follows:

根据企业用户及系统的实际需求，选取合适的关系型数据库（如MYSQL数据库、Derby数据库）作为大数据实时查询系统的元数据服务器；According to the actual needs of enterprise users and systems, select a suitable relational database (such as MYSQL database, Derby database) as the metadata server of the big data real-time query system;

为了能够为查询系统提供表级别、分区级别和列级别等三种粒度的统计信息，设计相应的表模式需要符合适当的范式，同时在能够完成代价估计的前提下，尽量减少不必要的存储开销；In order to be able to provide the query system with statistical information at three granularities of table level, partition level, and column level, the design of the corresponding table schema needs to conform to the appropriate paradigm, and at the same time, under the premise of completing the cost estimation, the unnecessary storage overhead should be minimized ;

根据设计好的表模式在相应的数据库服务器中创建元数据库及表关系，以供后续步骤使用。Create metadata and table relationships in the corresponding database server according to the designed table schema for use in subsequent steps.

根据所设计好的表模式，分析每张表中的关系并将相应统计信息存储到元数据服务器中以完成统计信息的收集；According to the designed table schema, analyze the relationship in each table and store the corresponding statistical information in the metadata server to complete the collection of statistical information;

为了对解析后的查询进行连接顺序优化工作，在创建元数据服务器之后需要完成统计信息收集的工作，其流程如图4所示，具体步骤如下：In order to optimize the connection order of the parsed query, the collection of statistical information needs to be completed after the metadata server is created. The process is shown in Figure 4, and the specific steps are as follows:

为了减少连接顺序优化过程中代价估计获取统计信息的开销，首先通过相应的分析语句或工具对经常进行连接查询的表进行分析工作；In order to reduce the overhead of cost estimation and statistical information acquisition during the connection sequence optimization process, first analyze the tables that are frequently connected and queried through corresponding analysis statements or tools;

对分析后的表进行相关统计信息的收集工作，并将该统计信息存储到元数据服务器的相应表中，为了更好的完成基于数据量的代价估计，需要收集包括字段平均长度AVG_COL_LEN等列级别的统计信息，其在进行表模式设计的过程中给出。其中统计信息包括：列名、列中数据值的下界、列中数据值的上界、表中列数据为空的数量、表中列数据不同值的数量、列中字段数据的平均数据量以及列中字段数据的最大数据量、表或视图的总行数。Collect relevant statistical information on the analyzed table, and store the statistical information in the corresponding table of the metadata server. In order to better complete the cost estimation based on the amount of data, it is necessary to collect column levels including the average field length AVG_COL_LEN , which is given during the table schema design process. The statistical information includes: column name, the lower bound of the data value in the column, the upper bound of the data value in the column, the number of empty column data in the table, the number of different values of the column data in the table, the average data volume of the field data in the column, and The maximum amount of data for field data in a column, total number of rows for a table or view.

元数据服务器（即元数据库）的创建以及统计信息的收集均为离线完成，接着进行查询。The creation of the metadata server (that is, the metadata database) and the collection of statistical information are completed offline, and then the query is performed.

步骤1，通过向元数据服务器提交查询请求以获取参与连接的各表的相关统计信息；Step 1, by submitting a query request to the metadata server to obtain the relevant statistical information of each table participating in the connection;

该步骤主要完成相关统计信息的查询及获取工作，其流程如图5所示，具体步骤如下：This step mainly completes the query and acquisition of relevant statistical information. The process is shown in Figure 5. The specific steps are as follows:

为了得到查询中参与连接的各表的相应统计信息，需要由查询优化模块向相应元数据服务器提交查询请求；In order to obtain the corresponding statistical information of each table participating in the connection in the query, the query optimization module needs to submit a query request to the corresponding metadata server;

由元数据服务器返回各表关系相应的统计信息，以完成统计信息的获取工作，从而用于下阶段相关参数的计算。The statistical information corresponding to each table relationship is returned by the metadata server to complete the acquisition of statistical information, which is used for the calculation of relevant parameters in the next stage.

由于元数据服务器的构建及统计信息的收集都是离线完成，故该步不需要耗费多少运行时开销，极大的降低了代价估计的时间延迟。Since the construction of the metadata server and the collection of statistical information are completed offline, this step does not require much runtime overhead, which greatly reduces the time delay of cost estimation.

步骤2，根据获取到的统计信息估算得到当前执行计划中所有表的数据量。Step 2, estimate the data volume of all tables in the current execution plan according to the obtained statistical information.

其中执行计划是指以不同的表连接顺序进行的查询。The execution plan refers to queries performed in different table join sequences.

在进行执行计划的相应代价估计之前，需要完成相关参数的估算工作，相关参数包括选择度和数据量的计算其流程如图6所示，具体步骤如下：Before estimating the corresponding cost of the execution plan, it is necessary to complete the estimation of relevant parameters. The calculation process of relevant parameters including the degree of selectivity and data volume is shown in Figure 6. The specific steps are as follows:

通过上一步骤中获取到的相关统计信息，首先进行参与连接的各表选择度的计算，步骤2-1，根据连接查询中的查询条件及统计信息进行相应计算，得到满足条件的行在所要查询的对象集合中所占的比例。Based on the relevant statistical information obtained in the previous step, first calculate the selectivity of each table participating in the connection, step 2-1, perform corresponding calculations according to the query conditions and statistical information in the connection query, and obtain the rows that meet the conditions in the desired query The proportion of objects in the collection.

对于查询中包含的任意两个查询条件，满足的不同关系相应的计算公式不同：For any two query conditions contained in the query, the corresponding calculation formulas are different for different relationships satisfied:

查询同时满足查询条件A和查询条件B时的选择度selectivity_(AandB)的计算公式为：The formula for calculating selectivity _(AandB) when the query satisfies both query condition A and query condition B is:

selectivity_(AandB)＝selectivity_(A)×selectivity_(B) （1）selectivity (A and B ₎ = selectivity _(A) × selectivity _(B) (1)

其中，selectivity_(A)表示单个查询条件A的选择度，selectivity_(B)表示单个查询条件B的选择度；Among them, selectivity _(A) represents the selectivity of a single query condition A, and selectivity _(B) represents the selectivity of a single query condition B;

查询满足查询条件A或查询条件B时的选择度selevtivity_(AorB)计算公式为：The formula for calculating the selectivity _(AorB) when the query satisfies query condition A or query condition B is:

selevtivity_(AorB)＝P(A)+P(B)-selectivity_(AandB) （2）Selevtivity _(AorB) = P(A)+P(B)-selectivity _(AandB) (2)

P（A）表示查询条件A的出现概率，P（B）表示查询条件B的出现概率；P(A) represents the occurrence probability of query condition A, and P(B) represents the occurrence probability of query condition B;

查询满足排除查询条件A时选择度selectivity_(notA)的计算公式：The formula for calculating the selectivity _(notA) when the query meets the exclusion query condition A:

selectivity_(ntoA)=1-selectivity_(A) （3）selectivity _(ntoA) =1-selectivity _(A) (3)

任意两个查询条件A和B之间所满足的关系为：同时满足、满足A或满足B，查询条件还可能为不包含A。包含多个查询条件且包含查询条件之间多个关系时，可分别根据上述公式对其中的查询条件进行两两组合，根据各个组合所满足的关系进行计算，得到最终的选择度。The satisfied relationship between any two query conditions A and B is: satisfy at the same time, satisfy A or satisfy B, and the query condition may not include A. When there are multiple query conditions and multiple relationships between the query conditions, the query conditions can be combined in pairs according to the above formulas, and calculated according to the relationship satisfied by each combination to obtain the final selectivity.

步骤2-2，根据步骤2-1所得的选择度计算每个表的数据量，计算公式如下：Step 2-2, calculate the amount of data in each table according to the selectivity obtained in step 2-1, the calculation formula is as follows:

selectivity表示步骤2-1计算所得选择度，numsOfTableLine为表或视图的总行数，avgColSize_i表示需要返回的表中第i列字段的平均数据量，j为表的列数。selectivity indicates the selectivity calculated in step 2-1, numsOfTableLine is the total number of rows in the table or view, avgColSize _i indicates the average data volume of the column i in the table to be returned, and j is the number of columns in the table.

将式(4)计算所得的各个表数据量输入代价模型，进行多连接查询的代价估算，从而得到不同执行计划所得的代价。相较于传统基于表基数的估算方法，其不仅依赖于查询中间结果产生的行数大小，而且同时也将估算的数据量考虑在内，从而提升代价估算的准确性。Input the amount of data in each table calculated by formula (4) into the cost model to estimate the cost of multi-join query, so as to obtain the cost of different execution plans. Compared with the traditional estimation method based on table cardinality, it not only depends on the number of rows generated by the query intermediate results, but also takes the estimated data volume into consideration, thereby improving the accuracy of cost estimation.

步骤3，重复步骤1及步骤2，直至遍历执行计划的搜索空间，找出数据量最小的表连接顺序进行连接。Step 3, repeat step 1 and step 2, until the search space of the execution plan is traversed, and the table connection order with the smallest amount of data is found for connection.

为了能够找到最佳的连接顺序，执行计划的搜索过程中需要使用本发明提出的基于数据量的代价估计方法，其流程如图7所示，具体步骤如下：In order to be able to find the best connection sequence, the cost estimation method based on the amount of data proposed by the present invention needs to be used in the search process of the execution plan. The process flow is shown in Figure 7, and the specific steps are as follows:

根据所采用的连接顺序优化方法进行执行计划的空间搜索工作（即重复步骤1及步骤2），其通过考虑实时查询系统的特性并同时增加相应的剪枝技术以优化执行计划搜索的性能，减少算法本身执行的查询延迟；According to the connection order optimization method adopted, the spatial search of the execution plan is performed (that is, repeating steps 1 and 2), which optimizes the performance of the execution plan search by considering the characteristics of the real-time query system and adding corresponding pruning techniques, reducing query latency performed by the algorithm itself;

通过步骤2得到相应执行计划的数据量的估算值，找出满足给定代价模型的执行计划，并进行存储；Obtain the estimated value of the data volume of the corresponding execution plan through step 2, find out the execution plan that satisfies the given cost model, and store it;

根据上述步骤找出的最佳执行计划，以生成最佳的连接顺序，由于采用了本发明提出的代价估算方法，从而有效提高了代价估计的准确性。According to the optimal execution plan found in the above steps, the optimal connection sequence is generated, and the accuracy of the cost estimation is effectively improved due to the adoption of the cost estimation method proposed by the present invention.

Claims

1. A query connection method based on data volume, characterized in that, comprising:

Step 1. Submit a query request to the metadata server to obtain statistical information corresponding to each table participating in the connection;

Step 2, estimate the data volume of all tables in the current query execution plan according to the obtained statistical information;

Step 3, repeat step 1 and step 2 until traversing the search space of the query execution plan, find out the execution plan with the appropriate amount of data to minimize the query cost, and connect the tables according to the connection sequence in the execution plan;

In step 1, the method of constructing the metadata server is to select a relational database and design a column-level table schema, create a metadata database and table relationships in the corresponding relational database according to the designed table schema, and build a metadata server;

In step 2, the data volume of each table is calculated based on the corresponding selectivity of the table, the average field data volume and the total number of rows in the table; the selection method is calculated according to the query conditions and statistical information in the query , to get the proportion of rows in the table that meet the query conditions in the object set to be queried; the calculation formula for the data size of each table is as follows:

s the s i i z z e e = = s the s e e l l e e c c t t i i v v i i t t y the y \times \times n no u u m m O o f f T T a a b b l l e e L L i i n no e e \times \times {Σ Σ}_{i i = = 11}^{j j} {avgColSize avgColSize}_{i i}

selectivity indicates the selectivity of the query, numsOfTableLine is the total number of rows in the table or view, avgColSize _i indicates the average amount of data in column i of the table to be returned, and j is the number of columns in the table.

2. The query connection method based on data volume as claimed in claim 1, wherein the statistical information stored in the metadata server is the statistical information corresponding to each table, and the statistical information is performed on the table according to the designed table mode Statistics are obtained.

3. The query connection method based on data volume according to claim 1, wherein the relational database is: MYSQL database, Derby database or Oracle database.

4. The query connection method based on data volume as claimed in claim 1, wherein the statistical information includes: the column name, the lower bound of the data value in the column, the upper bound of the data value in the column, and the number of empty column data in the table , the number of distinct values of column data in a table, the average and maximum data size of field data in a column, and the total number of rows in a table or view.

5 . The query connection method based on data volume according to claim 1 , wherein the metadata server and the storage of statistical information in the metadata server are both completed in an offline state. 6 .