[go: up one dir, main page]

CN106649754A - Method and system for testing data warehouse performance - Google Patents

Method and system for testing data warehouse performance Download PDF

Info

Publication number
CN106649754A
CN106649754A CN201611220973.1A CN201611220973A CN106649754A CN 106649754 A CN106649754 A CN 106649754A CN 201611220973 A CN201611220973 A CN 201611220973A CN 106649754 A CN106649754 A CN 106649754A
Authority
CN
China
Prior art keywords
data
data warehouse
jobs
warehouse
machining path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611220973.1A
Other languages
Chinese (zh)
Other versions
CN106649754B (en
Inventor
姜兆龙
成银乾
李蒙
李修全
金妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN201611220973.1A priority Critical patent/CN106649754B/en
Publication of CN106649754A publication Critical patent/CN106649754A/en
Application granted granted Critical
Publication of CN106649754B publication Critical patent/CN106649754B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Prevention of errors by analysis, debugging or testing of software
    • G06F11/3604Analysis of software for verifying properties of programs
    • G06F11/3612Analysis of software for verifying properties of programs by runtime analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for testing data warehouse performance. According to the method, by conducting detection and amendment on data types used by data in a data warehouse, storage space occupied by the data in the data warehouse is reduced, and thus the operating performance of the data warehouse is improved; by detecting distribution keys of physical tables, rank storing requirements of data tables, table partitions of the physical tables and table connection relations among the physical tables in which connection relations exist in the data warehouse, and conducting detection and amendment on ranked sentences of the physical tables, the operating speed of the data warehouse is increased, thus the purpose of conducting comprehensive performance testing and optimization on the data warehouse is achieved, and a practicable method is provided for the comprehensive performance testing to find out problems before the data warehouse is used online.

Description

A kind of data warehouse performance method of testing and system
Technical field
The application is related to data warehouse performance technical field of measurement and test, more particularly, it relates to a kind of data warehouse performance is surveyed Method for testing and system.
Background technology
Data warehouse (Data Warehouse, DWH) is the collection for preserving the data produced in enterprise's production or R&D process Close.Data warehouse is generally used for recording information data of the enterprise from past a certain time point to each current stage, and manager is led to These data are crossed, quantitative analysis and prediction can be made to the development course of enterprise and future trend.Due to the work of data warehouse Once causing certain data to enter after data warehouse with characteristic, generally will be retained for a long time, that is to say, that counting It is usually inquiry operation according to the operation carried out in warehouse, modification and deletion action seldom, generally only need to regularly data loading And refreshing.
Typically the data warehouse that a certain enterprise designs is being directed to the performance test of the data warehouse in prior art Reach the standard grade after use, within the observation period of certain length, enterprise personnel carries out artificial test to the performance of the data warehouse, In this process if it find that the problem for existing in a certain respect of the data warehouse cannot stand, then enter for this problem The performance optimization of row local, lacks the method for carrying out comprehensive test for the performance of the data warehouse.
The content of the invention
To solve above-mentioned technical problem, the invention provides a kind of data warehouse performance method of testing and system, to realize For the purpose of comprehensive test of the performance of data warehouse.
To realize above-mentioned technical purpose, following technical scheme is embodiments provided:
A kind of data warehouse performance method of testing, is applied to data warehouse, and the data warehouse performance method of testing includes:
Detect whether the data type that the data in the data warehouse are used matches with the size of data, if It is then to be changed to the data type of unmatched data to meet the minimum data type of the call data storage;
The distribution key for detecting the physical table in the data warehouse chooses whether meet Even distribution entails, if it is not, then According to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse;
Detect whether the tables of data in the data warehouse meets ranks memory requirement, if it is not, then adjustment no enough row The ranks storage mode of the tables of data of row memory requirement;
Detect whether the physical table in the data warehouse meets table zoning requirements, if it is, to meeting table subregion Physical table carries out table subregion;
Detect and whether meet table connection request between the physical table that there is annexation in the data warehouse, if not, Then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request;
Detect whether an ordering statement of the physical table in the data warehouse meets sentence and select to require, if it is not, then right Not meeting sentence selects an ordering statement of desired physical table to be corrected.
Optionally, whether an ordering statement of the physical table in the detection data warehouse meets sentence and selects to require, If it is not, then an ordering statement of the physical table required not meeting sentence to select also includes after being corrected:
It is reverse to obtain all of All Jobs according to job dependence relation by the data output end of the data warehouse Operation machining path;
Obtain operation average latency of the All Jobs machining path of each operation, operation average performance times and be System resource averagely takes;
When the operation average latency, operation horizontal according to the All Jobs machining path of each operation for obtaining is performed both by Between and current system resource averagely take, obtain the shortest path of each operation;
The operation average latency, operation average performance times and system resource according to the shortest path of each operation puts down Take the optimization data warehouse.
Optionally, the data output end by the data warehouse, it is reverse to obtain all according to job dependence relation The All Jobs machining path of operation includes:
By the data output end of the data warehouse, the work of last level that All Jobs data genaration needs is obtained Industry;
The operation of last level needed according to All Jobs data genaration and job dependence relation, inversely obtain institute There is the operation machining path of operation.
Optionally, the operation average latency of the All Jobs machining path for obtaining each operation, operation are average Execution time and system resource averagely take including:
Operation stand-by period of All Jobs machining path of each operation in record predetermined period, the Job execution time and System resource takes;
Operation stand-by period, Job execution time according to the All Jobs machining path of each operation in predetermined period, System resource takes operation average latency, the operation average performance times of the All Jobs machining path for calculating each operation Averagely take with system resource.
Optionally, the operation average latency of the shortest path according to each operation, operation average performance times Averagely taking the optimization data warehouse with system resource includes:
The normal form requirement of data warehouse is reduced, increases data redundancy;
Operation is adjusted according to operation average latency of the shortest path of each operation and operation average performance times Priority;
The concurrency for adjusting the data warehouse is averagely taken according to system resource.
A kind of data warehouse performance test system, including:
Data type detection module, for detecting the data warehouse in the data type that used of data and the number Whether match according to size, if it is, the data type of unmatched data is changed to meet the call data storage most Small data type;
Distribution key detection module, for detecting the data warehouse in the distribution key of physical table choose whether meet uniform Spreading requirements, if it is not, then according to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse;
Storage mode detection module, for detecting the data warehouse in tables of data whether meet ranks memory requirement, If it is not, then adjustment is unsatisfactory for the ranks storage mode of the tables of data of ranks memory requirement;
Subarea detecting module, for detecting the data warehouse in physical table whether meet table zoning requirements, if it is, Then the physical table to meeting table subregion carries out table subregion;
Table connection detection module, for detecting the data warehouse in exist and whether meet between the physical table of annexation Table connection request, if it is not, then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request;
Sequence detection module, for detecting the data warehouse in physical table an ordering statement whether meet sentence selection Require, if it is not, then an ordering statement of the physical table required not meeting sentence to select is corrected.
Optionally, also include:
Machining path acquisition module, it is inverse according to job dependence relation for by the data output end of the data warehouse To the All Jobs machining path for obtaining All Jobs;
Time-obtaining module, for obtaining the operation average latency of the All Jobs machining path of each operation, making Industry average performance times and system resource averagely take;
Shortest path acquisition module, for impartial according to the operation horizontal of the All Jobs machining path of each operation for obtaining Treat that time, operation average performance times and current system resource averagely take, obtain the shortest path of each operation;
Optimization module, during for being performed both by according to the operation average latency of the shortest path of each operation, operation horizontal Between and system resource averagely take the optimization data warehouse.
Optionally, the machining path acquisition module includes:
Date Conversion Unit, needs for by the data output end of the data warehouse, obtaining All Jobs data genaration The operation of last level wanted;
Backstepping unit, the operation of last level and job dependence for being needed according to All Jobs data genaration is closed System, the reverse operation machining path for obtaining All Jobs.
Optionally, the time-obtaining module includes:
Recording unit, for record the All Jobs machining path of each operation in predetermined period the operation stand-by period, Job execution time and system resource take;
Computing unit, for according to the operation stand-by period of the All Jobs machining path of each operation in predetermined period, Job execution time, system resource take the operation average latency of the All Jobs machining path for calculating each operation, make Industry average performance times and system resource averagely take.
Optionally, the optimization module includes:
Normal form unit, for reducing the normal form requirement of data warehouse, increases data redundancy;
Priority adjustment unit, for average according to the operation average latency of the shortest path of each operation and operation The execution time adjusts the priority of operation;
Concurrency adjustment unit, for averagely taking the concurrency for adjusting the data warehouse according to system resource.
From above-mentioned technical proposal as can be seen that embodiments providing a kind of data warehouse performance method of testing and being System, wherein, the data warehouse performance method of testing to the data type that the data in the data warehouse are used by entering Row is detected and corrected, to reduce the data warehouse in data shared by memory space, lift the fortune of the data warehouse Row performance;The table point of the distribution key of physical table, the ranks memory requirement of tables of data, physical table in by detecting the data warehouse Area, there is table annexation between the physical table of annexation and an ordering statement of physical table is detected and corrected, with The speed of service of the data warehouse is lifted, so as to realize that the mesh of all round properties test and optimization is carried out to the data warehouse , find out problem and provide a feasible method using front carrying out all round properties test reaching the standard grade for the data warehouse.
Description of the drawings
In order to be illustrated more clearly that the embodiment of the present invention or technical scheme of the prior art, below will be to embodiment or existing The accompanying drawing to be used needed for having technology description is briefly described, it should be apparent that, drawings in the following description are only this Inventive embodiment, for those of ordinary skill in the art, on the premise of not paying creative work, can be with basis The accompanying drawing of offer obtains other accompanying drawings.
A kind of schematic flow sheet of data warehouse performance method of testing that Fig. 1 is provided for one embodiment of the application;
A kind of flow process of data warehouse performance method of testing that Fig. 2 is provided for a preferred embodiment of the application is illustrated Figure;
A kind of structural representation of data warehouse performance test system that Fig. 3 is provided for one embodiment of the application;
A kind of structural representation of data warehouse performance test system that Fig. 4 is provided for a preferred embodiment of the application Figure.
Specific embodiment
Below in conjunction with the accompanying drawing in the embodiment of the present invention, the technical scheme in the embodiment of the present invention is carried out clear, complete Site preparation is described, it is clear that described embodiment is only a part of embodiment of the invention, rather than the embodiment of whole.It is based on Embodiment in the present invention, it is every other that those of ordinary skill in the art are obtained under the premise of creative work is not made Embodiment, belongs to the scope of protection of the invention.
The embodiment of the present application provides a kind of data warehouse performance method of testing, as shown in figure 1, including:
S101:Detect whether the data type that the data in the data warehouse are used matches with the size of data, If it is, the data type of unmatched data is changed to meet the minimum data type of the call data storage.
It should be noted that following during the selection of data type that used of the data in the data warehouse Choosing the minimum data type of memory space occupancy as far as possible can greatly save data space, lift the data bins The speed of service in storehouse.
Specifically, when the data in the data warehouse are stored, if a certain data simultaneously can be with data When type char and varchar are stored, optimization is stored with varchar data type.This is because data type Between char and varchar and nonexistence energy difference, but the data length of varchar is variable, and the data length of char It is immutable, if the data that storage is stored with char data type, if the length of the data is less than can holding for char The data length received, then vacant position with space character supplement, so these space characters will be taken needed for the data storage Memory space beyond memory space, cause the waste of memory space, therefore optimize using data type varchar to the number According to being stored.
For the data of numeric type, if the magnitude range of a certain data can (can with data type SMALLINT Storage digital data scope -231-231- 1, take byte 2) stored, and if data data type BIGINT (digital data scope -2 can be stored63-263- 1, take byte 2) when being stored, will result in a large amount of waves of memory space Take.
In addition, the uniformity and normalization in order to keep data warehouse type as far as possible, and in order to the number When being migrated according to the data in warehouse, the complexity of data conversion is reduced, the dtd--data type definition in data warehouse should not mistake In complexity.
It is described to detect in the data warehouse in one embodiment of the application on the basis of above-described embodiment The data type that data are used and the size of data whether match including:
The data type that data in the data warehouse are used is used as target type;
All data types of the memory space less than target type will be taken as set to be selected;
Judge with the presence or absence of the data type that disclosure satisfy that the call data storage in the set to be selected, if sentenced The data of breaking are mismatched with the target type.
S102:The distribution key for detecting the physical table in the data warehouse chooses whether meet Even distribution entails, if It is no, then according to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse.
It should be noted that the distribution key of the physical table needs to meet to require:
Each physical table must show formulation distribution key, not allow to be set up using the mode of acquiescence distribution key;
Distribution key field is in principle 1, at most no more than 3;
The distribution key of associated physical table is as far as possible consistent;
For dimension table, code table, associate field should be used as distribution key;
Distribution key field can not perform update operations;
Avoid using random distribution strategy as far as possible, although its data distribution is uniform, but random distribution strategy always leads Cause in query execution, data exchange among the nodes and migration affect execution performance.
S103:Detect whether the tables of data in the data warehouse meets ranks memory requirement, if it is not, then adjustment is discontented The ranks storage mode of the tables of data of sufficient ranks memory requirement.
It should be noted that the tables of data in the data warehouse generally can be by the way of line storage Stored.But if a certain tables of data can be stored in the way of column storage, then to this in the way of column storage Tables of data carries out storing can save the internal memory shared when the tables of data is inquired about, so as to lift the operation of the data warehouse Performance.So described ranks memory requirement is, and can be stored in the way of column storage in a certain tables of data, then with row The mode of formula storage is stored to the tables of data.
One embodiment of the application is provided a kind of tables of data that can be stored in the way of column storage and is musted The condition that must be met, including:
A) must add table using the table of column storage, principle upper table can only do insertion operation, seldom do deletion and more Newly;
B) insertion operation in principle can not be very frequent;
C) under normal circumstances using row storage, only frequently inquiring about, and access module is only to access a small amount of field In the case of, it is just preferential using row storage.
D) tables of data of the Column quantity more than 10.
S104:Detect whether the physical table in the data warehouse meets table zoning requirements, if it is, to meeting table point The physical table in area carries out table subregion.
It should be noted that when a certain physical table in the data warehouse is very big, to it table multidomain treat-ment is carried out It is divided into a lot of little and manageable parts, has carried out the physical table of table subregion when given query statement is performed, scanning is related The data of partial data rather than whole physical tables, so as to improve query performance.Table subregion is for the management of data warehouse It is helpful, such as it is easy to roll old data in data warehouse.
But not all physical table is suitable for doing table subregion, the physical table should be judged whether from the following aspects Meet table zoning requirements:
A) whether table is sufficiently large
The fact that only very big, table was just adapted to do table subregion (generally, when the data record of a true table surpasses When crossing 100,000,000, we are considered as it and meet this requirement).If having several hundred million records in a true table, logically True table is divided into less subregion can improve performance.And table for only tens thousand of or the fact that less record, to point The administration overhead that area is carried out in advance is by much larger than the performance improvement that can be obtained.
Wherein, true table refers to the physical table for needing constantly to carry out data renewal.
B) whether querying condition can match subregion condition
Check whether the WHERE conditions of query statement are consistent with the field for considering table subregion.For example, if most look into Ask use date condition, then the date zoning design according to the moon or week is perhaps very useful, and if querying condition is more It is territory of use's condition, it may be considered that true table is done the subregion of list type for territory of use.
C) whether data warehouse needs to roll historical data
The rolling demand of historical data is also the Consideration of zoning design.Such as, only need to retain in data warehouse Go bimestrial data.If data carry out subregion according to the moon, two months data before can will be easily deleted, and Nearest data are stored in the subregion in nearest month.
D) whether can uniformly be decoupled according to certain regular data
Data are uniformly decoupled rule as far as possible should be selected.If the data volume of each subregion storage is suitable, then inquiry The improvement of performance will be related to the quantity of subregion.For example, a table is divided into 10 subregions, hits the inquiry of single subregion condition Sweep table performance will than non-subregion in the case of it is high 10 times.
S105:Detect and whether meet table connection request between the physical table that there is annexation in the data warehouse, such as Really no, then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request.
Wherein, the table connection request is specifically included:
The quantity of multilist association does not exceed 15 in one SQL statement.
The type of attachment for producing cartesian product is prohibitted the use of (if CROSS JOIN or two tables of association are in associate field It is upper to there is a large amount of repetition).
When multiple physical tables are associated, the less physical table of association results preferentially does table connection;
When big/big/little three physical tables are inline, it is to avoid first two big tables are carried out table connection, unless filterability is very By force;
When big/little/little three tables are inline, preferentially two little tables are carried out table connection;
When two big physical tables are associated, two physical tables that there are a large amount of repetition records are prohibitted the use of to be closed Connection, in order to avoid producing huge intermediate result because repeating record association, causes increasing substantially for disk occupation proportion;For example, if The physical table table of the repetition record of 1,000,000 and the physical table association of the record of the repetition of 10,000, as a result can be up to 1,000,000 * 10000=10,000,000,000 records.
S106:Detect whether an ordering statement of the physical table in the data warehouse meets sentence and select to require, if not, An ordering statement of the physical table for then requiring not meeting sentence to select is corrected.
It should be noted that the sentence selects to require to include:
A) ORDER BY an ordering statements should not be in the view used, in the view, an ordering statement can be ignored;
B) ORDER BY an ordering statements executory cost is very high, if other optional an ordering statements, it is to avoid use;
C) sorting operation should not be performed on big data set;
D) PartitI/On By an ordering statements inside realizes that needs sort the data, if data volume is more than hundred million ranks, Realized using group by an ordering statements.
On the basis of above-described embodiment, in a preferred embodiment of the application, as shown in Fig. 2 detection institute Whether an ordering statement for stating the physical table in data warehouse meets sentence selection requirement, if it is not, then selecting not meeting sentence An ordering statement of the physical table of requirement also includes after being corrected:
S107:By the data output end of the data warehouse, according to job dependence relation, reverse acquisition All Jobs All Jobs machining path.
It should be noted that should be noted operation during the reverse All Jobs machining path for obtaining All Jobs The integrality of dependence, reasonability.The problem of Circular dependency is checked whether there is emphatically.For the Circular dependency for occurring, need Further confirm that with developer, it is to avoid due to reprocessabilty caused by job dependence configuration error, cause the wave of computing resource Take.
On the basis of above-described embodiment, in one embodiment of the application, the number by the data warehouse According to output end, according to job dependence relation, the reverse All Jobs machining path for obtaining All Jobs includes:
S1071:By the data output end of the data warehouse, last layer that All Jobs data genaration needs is obtained The operation of level;
S1072:The operation of last level needed according to All Jobs data genaration and job dependence relation, inversely Obtain the operation machining path of All Jobs.
S108:When obtaining the operation average latency of the All Jobs machining path of each operation, operation horizontal and being performed both by Between and system resource averagely take.
On the basis of above-described embodiment, in another preferred embodiment of the application, each operation of acquisition Operation average latency of All Jobs machining path, operation average performance times and system resource averagely take including:
S1081:Operation stand-by period, the Job execution of the All Jobs machining path of each operation in record predetermined period Time and system resource take;
Explanation is needed exist for, the resource such as system CPU and I/O disappears when the system resource occupancy refers to Job execution Consumption.
S1082:According to operation stand-by period, the Job execution of the All Jobs machining path of each operation in predetermined period Time, system resource take and calculate operation average latency of All Jobs machining path of each operation, operation and averagely hold Row time and system resource averagely take.
It should be noted that the length of the predetermined period is typically chosen for one week, but it is also possible to elect as two weeks, January or 1 year etc..The application is not limited this, concrete depending on actual conditions.
S109:The operation average latency, operation according to the All Jobs machining path of each operation for obtaining is average Execution time and current system resource averagely take, and obtain the shortest path of each operation;
S110:The operation average latency, operation average performance times and system according to the shortest path of each operation Resource averagely takes the optimization data warehouse.
It is described according to each operation in another preferred embodiment of the application on the basis of above-described embodiment Operation average latency of shortest path, operation average performance times and system resource averagely take the optimization data warehouse Including:
S111:The normal form requirement of data warehouse is reduced, increases data redundancy.
The purpose for increasing data redundancy is to exchange subtracting for calculating time and I/O operation time for many occupancy of memory space It is few, so as to improve the ageing of data processing.
S112:The operation average latency and operation average performance times adjustment work according to the shortest path of each operation The priority of industry.
It should be noted that according to data warehouse rule general in the industry, job priority can be generally divided into A, B, C etc. Different stage.Average latency longer job priority in critical path is improved, is shorten the average latency most simple Single effectively method.After improving priority, there can be bigger priority to obtain related resource in the case of contention for resources.
S113:The concurrency for adjusting the data warehouse is averagely taken according to system resource.
It should be noted that during operation average performance times, with Job execution all kinds of resources of system busy extent It is in close relations.If during Job execution, the resource utilization such as data warehouse CPU, I/O it is higher (>80%), then can be with The concurrency (while the operation quantity for carrying out) of the appropriate reduction data warehouse, limits the requisition of resource.So can protect Operation in card critical path, can obtain more resources, relative to can ensure that operation is faster finished.
Accordingly, the embodiment of the present application additionally provides a kind of data warehouse performance test system, as shown in figure 3, including:
Data type detection module 100, for detecting the data warehouse in the data type that used of data and institute State whether size of data matches, if it is, the data type of unmatched data is changed to meet the call data storage Minimum data type;
Distribution key detection module 200, for detecting the data warehouse in physical table distribution key choose whether meet Even distribution entails, if it is not, then according to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse;
Storage mode detection module 300, for detecting the data warehouse in tables of data whether meet ranks storage will Ask, if it is not, then adjustment is unsatisfactory for the ranks storage mode of the tables of data of ranks memory requirement;
Subarea detecting module 400, for detecting the data warehouse in physical table whether meet table zoning requirements, if It is that then the physical table to meeting table subregion carries out table subregion;
Table connection detection module 500, for detecting the data warehouse in whether exist between the physical table of annexation Table connection request is met, if it is not, then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request;
Sequence detection module 600, for detecting the data warehouse in an ordering statement of physical table whether meet sentence Select to require, if it is not, then an ordering statement of the physical table required not meeting sentence to select is corrected.
It should be noted that what the data type detection module 100 performed is data type to the data warehouse Detection, follow during the selection of the data type that the data in the data warehouse are used as far as possible choose storage The minimum data type of space hold can greatly save data space, lift the speed of service of the data warehouse.
Specifically, when the data in the data warehouse are stored, if a certain data simultaneously can be with data When type char and varchar are stored, optimization is stored with varchar data type.This is because data type Between char and varchar and nonexistence energy difference, but the data length of varchar is variable, and the data length of char It is immutable, if the data that storage is stored with char data type, if the length of the data is less than can holding for char The data length received, then vacant position with space character supplement, so these space characters will be taken needed for the data storage Memory space beyond memory space, cause the waste of memory space, therefore optimize using data type varchar to the number According to being stored.
For the data of numeric type, if the magnitude range of a certain data can (can with data type SMALLINT Storage digital data scope -231-231- 1, take byte 2) stored, and if data data type BIGINT (digital data scope -2 can be stored63-263- 1, take byte 2) when being stored, will result in a large amount of waves of memory space Take.
In addition, the uniformity and normalization in order to keep data warehouse type as far as possible, and in order to the number When being migrated according to the data in warehouse, the complexity of data conversion is reduced, the dtd--data type definition in data warehouse should not mistake In complexity.
It is described to detect in the data warehouse in one embodiment of the application on the basis of above-described embodiment The data type that data are used and the size of data whether match including:
The data type that data in the data warehouse are used is used as target type;
All data types of the memory space less than target type will be taken as set to be selected;
Judge with the presence or absence of the data type that disclosure satisfy that the call data storage in the set to be selected, if sentenced The data of breaking are mismatched with the target type.
The distribution key detection module 200 mainly follows following original when the distribution key to the physical table is detected Then:
Each physical table must show formulation distribution key, not allow to be set up using the mode of acquiescence distribution key;
Distribution key field is in principle 1, at most no more than 3;
The distribution key of associated physical table is as far as possible consistent;
For dimension table, code table, associate field should be used as distribution key;
Distribution key field can not perform update operations;
Avoid using random distribution strategy as far as possible, although its data distribution is uniform, but random distribution strategy always leads Cause in query execution, data exchange among the nodes and migration affect execution performance.
Also, it should be noted that the tables of data in the data warehouse can generally adopt the side of line storage Formula is stored.But if a certain tables of data can by column storage in the way of be stored, then by column storage in the way of pair The tables of data carries out storing can save the internal memory shared when the tables of data is inquired about, so as to lift the fortune of the data warehouse Row performance.So described ranks memory requirement is, a certain tables of data can by column storage in the way of be stored, then with The mode of column storage is stored to the tables of data.
One embodiment of the application is provided a kind of tables of data that can be stored in the way of column storage and is musted The condition that must be met, including:
A) must add table using the table of column storage, principle upper table can only do insertion operation, seldom do deletion and more Newly;
B) insertion operation in principle can not be very frequent;
C) under normal circumstances using row storage, only frequently inquiring about, and access module is only to access a small amount of field In the case of, it is just preferential using row storage.
D) tables of data of the Column quantity more than 10.
When a certain physical table in the data warehouse is very big, it should using the Subarea detecting module 400 to it Carry out table multidomain treat-ment and be divided into a lot of little and manageable parts, the physical table for having carried out table subregion is performing given inquiry language During sentence, the data of relevant portion rather than the data of whole physical tables are scanned, so as to improve query performance.Table subregion is for data The management in warehouse is also helpful, such as be easy to roll old data in data warehouse.
But not all physical table is suitable for doing table subregion, the physical table should be judged whether from the following aspects Meet table zoning requirements:
A) whether table is sufficiently large
The fact that only very big, table was just adapted to do table subregion (generally, when the data record of a true table surpasses When crossing 100,000,000, we are considered as it and meet this requirement).If having several hundred million records in a true table, logically True table is divided into less subregion can improve performance.And table for only tens thousand of or the fact that less record, to point The administration overhead that area is carried out in advance is by much larger than the performance improvement that can be obtained.
Wherein, true table refers to the physical table for needing constantly to carry out data renewal.
B) whether querying condition can match subregion condition
Check whether the WHERE conditions of query statement are consistent with the field for considering table subregion.For example, if most look into Ask use date condition, then the date zoning design according to the moon or week is perhaps very useful, and if querying condition is more It is territory of use's condition, it may be considered that true table is done the subregion of list type for territory of use.
C) whether data warehouse needs to roll historical data
The rolling demand of historical data is also the Consideration of zoning design.Such as, only need to retain in data warehouse Go bimestrial data.If data carry out subregion according to the moon, two months data before can will be easily deleted, and Nearest data are stored in the subregion in nearest month.
D) whether can uniformly be decoupled according to certain regular data
Data are uniformly decoupled rule as far as possible should be selected.If the data volume of each subregion storage is suitable, then inquiry The improvement of performance will be related to the quantity of subregion.For example, a table is divided into 10 subregions, hits the inquiry of single subregion condition Sweep table performance will than non-subregion in the case of it is high 10 times.
The table connection request that the table connection detection module 500 is followed when table connecting detection is carried out to the data warehouse Specifically include:
The quantity of multilist association does not exceed 15 in one SQL statement.
The type of attachment for producing cartesian product is prohibitted the use of (if CROSS JOIN or two tables of association are in associate field It is upper to there is a large amount of repetition).
When multiple physical tables are associated, the less physical table of association results preferentially does table connection;
When big/big/little three physical tables are inline, it is to avoid first two big tables are carried out table connection, unless filterability is very By force;
When big/little/little three tables are inline, preferentially two little tables are carried out table connection;
When two big physical tables are associated, two physical tables that there are a large amount of repetition records are prohibitted the use of to be closed Connection, in order to avoid producing huge intermediate result because repeating record association, causes increasing substantially for disk occupation proportion;For example, if The physical table table of the repetition record of 1,000,000 and the physical table association of the record of the repetition of 10,000, as a result can be up to 1,000,000 * 10000=10,000,000,000 records.
The sentence detection that the sequence detection module 600 is followed is required to include:
A) ORDER BY an ordering statements should not be in the view used, in the view, an ordering statement can be ignored;
B) ORDER BY an ordering statements executory cost is very high, if other optional an ordering statements, it is to avoid use;
C) sorting operation should not be performed on big data set;
D) PartitI/On By an ordering statements inside realizes that needs sort the data, if data volume is more than hundred million ranks, Realized using group by an ordering statements.
On the basis of above-described embodiment, in a preferred embodiment of the application, as shown in figure 4, the data bins Storehouse Performance Test System also includes:
Machining path acquisition module 700, for by the data output end of the data warehouse, closing according to job dependence System, the reverse All Jobs machining path for obtaining All Jobs;
Time-obtaining module 800, for obtain the All Jobs machining path of each operation the operation average latency, Operation average performance times and system resource averagely take;
Shortest path acquisition module 900, for according to the operation horizontal of the All Jobs machining path of each operation for obtaining Stand-by period, operation average performance times and current system resource averagely takes, and obtains the shortest path of each operation;
Optimization module 1000, for averagely being held according to the operation average latency of the shortest path of each operation, operation Row time and system resource averagely take the optimization data warehouse.
It should be noted that should be noted operation during the reverse All Jobs machining path for obtaining All Jobs The integrality of dependence, reasonability.The problem of Circular dependency is checked whether there is emphatically.For the Circular dependency for occurring, need Further confirm that with developer, it is to avoid due to reprocessabilty caused by job dependence configuration error, cause the wave of computing resource Take.
On the basis of above-described embodiment, in one embodiment of the application, the machining path acquisition module 700 is wrapped Include:
Date Conversion Unit, needs for by the data output end of the data warehouse, obtaining All Jobs data genaration The operation of last level wanted;
Backstepping unit, the operation of last level and job dependence for being needed according to All Jobs data genaration is closed System, the reverse operation machining path for obtaining All Jobs.
On the basis of above-described embodiment, in another preferred embodiment of the application, the time-obtaining module 800 Including:
Recording unit, for record the All Jobs machining path of each operation in predetermined period the operation stand-by period, Job execution time and system resource take;
Computing unit, for according to the operation stand-by period of the All Jobs machining path of each operation in predetermined period, Job execution time, system resource take the operation average latency of the All Jobs machining path for calculating each operation, make Industry average performance times and system resource averagely take.
Explanation is needed exist for, the resource such as system CPU and I/O disappears when the system resource occupancy refers to Job execution Consumption.
The length of the predetermined period is typically chosen for one week, but it is also possible to elect two weeks, January or 1 year etc. as.The application This is not limited, it is concrete depending on actual conditions.
On the basis of above-described embodiment, in another preferred embodiment of the application, the optimization module 1000 is wrapped Include:
Normal form unit, for reducing the normal form requirement of data warehouse, increases data redundancy;
Priority adjustment unit, for average according to the operation average latency of the shortest path of each operation and operation The execution time adjusts the priority of operation;
Concurrency adjustment unit, for averagely taking the concurrency for adjusting the data warehouse according to system resource.
It should be noted that the purpose for increasing data redundancy is to exchange calculating time and I/O for many occupancy of memory space The reduction of operating time, so as to improve the ageing of data processing.
According to the rule that data warehouse is general in the industry, job priority can be generally divided into the different stages such as A, B, C.Improve Average latency longer job priority in critical path, is the most simple effective method for shortening the average latency. After improving priority, there can be bigger priority to obtain related resource in the case of contention for resources.
Also, it should be noted that during operation average performance times and Job execution all kinds of resources of system busy extent It is in close relations.If during Job execution, the resource utilization such as data warehouse CPU, I/O it is higher (>80%), then can be with The concurrency (while the operation quantity for carrying out) of the appropriate reduction data warehouse, limits the requisition of resource.So can protect Operation in card critical path, can obtain more resources, relative to can ensure that operation is faster finished.
In sum, the embodiment of the present application provides a kind of data warehouse performance method of testing and system, wherein, the number According to warehouse performance test methods by being detected and corrected to the data type that the data in the data warehouse are used, with The memory space shared by the data in the data warehouse is reduced, the runnability of the data warehouse is lifted;By detection The distribution key of physical table in the data warehouse, the ranks memory requirement of tables of data, the table subregion of physical table, there is annexation Physical table between table annexation and an ordering statement of physical table be detected and corrected, to lift the data warehouse The speed of service, be the data warehouse so as to realize that the purpose of all round properties test and optimization is carried out to the data warehouse Find out problem there is provided a feasible method using front carrying out all round properties test reaching the standard grade.
Each embodiment is described by the way of progressive in this specification, and what each embodiment was stressed is and other The difference of embodiment, between each embodiment identical similar portion mutually referring to.
The foregoing description of the disclosed embodiments, enables professional and technical personnel in the field to realize or using the present invention. Various modifications to these embodiments will be apparent for those skilled in the art, as defined herein General Principle can be realized in other embodiments without departing from the spirit or scope of the present invention.Therefore, the present invention The embodiments shown herein is not intended to be limited to, and is to fit to and principles disclosed herein and features of novelty phase one The most wide scope for causing.

Claims (10)

1. a kind of data warehouse performance method of testing, it is characterised in that be applied to data warehouse, the data warehouse performance test Method includes:
Detect whether the data type that the data in the data warehouse are used matches with the size of data, if it is, The data type of unmatched data is changed to meet the minimum data type of the call data storage;
The distribution key for detecting the physical table in the data warehouse chooses whether meet Even distribution entails, if it is not, then foundation It is uniformly distributed the distribution key that principle chooses the physical table in the data warehouse;
Detect whether the tables of data in the data warehouse meets ranks memory requirement, if it is not, then adjustment is unsatisfactory for ranks and deposits The ranks storage mode of the tables of data that storage is required;
Detect whether the physical table in the data warehouse meets table zoning requirements, if it is, the physics to meeting table subregion Table carries out table subregion;
Detect and whether meet table connection request between the physical table that there is annexation in the data warehouse, if it is not, then right The physical table for being unsatisfactory for table connection request is reconnected according to table connection request;
Detect whether an ordering statement of the physical table in the data warehouse meets sentence and select to require, if it is not, then to not being inconsistent Closing sentence selects an ordering statement of desired physical table to be corrected.
2. data warehouse performance method of testing according to claim 1, it is characterised in that the detection data warehouse In physical table an ordering statement whether meet sentence select require, if it is not, then to do not meet sentence select require physics An ordering statement of table also includes after being corrected:
By the data output end of the data warehouse, according to job dependence relation, the reverse All Jobs for obtaining All Jobs Machining path;
Obtain operation average latency, operation average performance times and the system money of the All Jobs machining path of each operation Source averagely takes;
Operation average latency of All Jobs machining path, operation average performance times according to each operation for obtaining and Current system resource averagely takes, and obtains the shortest path of each operation;
The operation average latency, operation average performance times and system resource according to the shortest path of each operation averagely accounts for With the optimization data warehouse.
3. data warehouse performance method of testing according to claim 2, it is characterised in that described by the data warehouse Data output end, according to job dependence relation, the reverse All Jobs machining path for obtaining All Jobs includes:
By the data output end of the data warehouse, the operation of last level that All Jobs data genaration needs is obtained;
The operation of last level needed according to All Jobs data genaration and job dependence relation, inversely obtain all works The operation machining path of industry.
4. data warehouse performance method of testing according to claim 2, it is characterised in that the institute of each operation of acquisition Operation average latency, operation average performance times and the system resource for having operation machining path averagely take including:
The operation stand-by period of the All Jobs machining path of each operation, Job execution time and system in record predetermined period Resource occupation;
Operation stand-by period, Job execution time according to the All Jobs machining path of each operation in predetermined period, system Resource occupation calculates operation average latency of the All Jobs machining path of each operation, operation average performance times and is System resource averagely takes.
5. data warehouse performance method of testing according to claim 2, it is characterised in that it is described according to each operation most Operation average latency of short path, operation average performance times and system resource averagely take the optimization data warehouse bag Include:
The normal form requirement of data warehouse is reduced, increases data redundancy;
The preferential of operation is adjusted according to operation average latency of the shortest path of each operation and operation average performance times Level;
The concurrency for adjusting the data warehouse is averagely taken according to system resource.
6. a kind of data warehouse performance test system, it is characterised in that include:
Data type detection module, for detecting the data warehouse in the data type that used of data it is big with the data It is little whether to match, if it is, the data type of unmatched data is changed to meet the minimum number of the call data storage According to type;
Distribution key detection module, for detecting the data warehouse in the distribution key of physical table choose and whether meet and be uniformly distributed Require, if it is not, then according to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse;
Storage mode detection module, for detecting the data warehouse in tables of data whether meet ranks memory requirement, if No, then adjustment is unsatisfactory for the ranks storage mode of the tables of data of ranks memory requirement;
Subarea detecting module, for detecting the data warehouse in physical table whether meet table zoning requirements, if it is, right Meeting the physical table of table subregion carries out table subregion;
Table connection detection module, for detecting the data warehouse in have table company whether is met between the physical table of annexation Requirement is connect, if it is not, then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request;
Sequence detection module, for detecting the data warehouse in physical table an ordering statement whether meet sentence select will Ask, if it is not, then an ordering statement of the physical table required not meeting sentence to select is corrected.
7. data warehouse performance test system according to claim 6, it is characterised in that also include:
Machining path acquisition module, for by the data output end of the data warehouse, according to job dependence relation, inversely obtaining Take the All Jobs machining path of All Jobs;
Time-obtaining module, for obtaining operation average latency, the operation horizontal of the All Jobs machining path of each operation It is performed both by the time and system resource averagely takes;
Shortest path acquisition module, during for according to the operation average waiting of the All Jobs machining path of each operation for obtaining Between, operation average performance times and current system resource averagely take, obtain the shortest path of each operation;
Optimization module, for according to the operation average latency of the shortest path of each operation, operation average performance times and System resource averagely takes the optimization data warehouse.
8. data warehouse performance test system according to claim 7, it is characterised in that the machining path acquisition module Including:
Date Conversion Unit, for by the data output end of the data warehouse, obtaining what All Jobs data genaration needed The operation of last level;
Backstepping unit, for the operation of last level that needed according to All Jobs data genaration and job dependence relation, The reverse operation machining path for obtaining All Jobs.
9. data warehouse performance test system according to claim 7, it is characterised in that the time-obtaining module bag Include:
Recording unit, for recording operation stand-by period, the operation of the All Jobs machining path of each operation in predetermined period Execution time and system resource take;
Computing unit, for according to operation stand-by period, the operation of the All Jobs machining path of each operation in predetermined period Execution time, system resource take operation average latency, the operation horizontal of the All Jobs machining path for calculating each operation It is performed both by the time and system resource averagely takes.
10. data warehouse performance test system according to claim 7, it is characterised in that the optimization module includes:
Normal form unit, for reducing the normal form requirement of data warehouse, increases data redundancy;
Priority adjustment unit, for being performed both by according to the operation average latency of the shortest path of each operation and operation horizontal Time adjusts the priority of operation;
Concurrency adjustment unit, for averagely taking the concurrency for adjusting the data warehouse according to system resource.
CN201611220973.1A 2016-12-26 2016-12-26 Data warehouse performance test method and system Active CN106649754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611220973.1A CN106649754B (en) 2016-12-26 2016-12-26 Data warehouse performance test method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611220973.1A CN106649754B (en) 2016-12-26 2016-12-26 Data warehouse performance test method and system

Publications (2)

Publication Number Publication Date
CN106649754A true CN106649754A (en) 2017-05-10
CN106649754B CN106649754B (en) 2021-04-20

Family

ID=58827633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611220973.1A Active CN106649754B (en) 2016-12-26 2016-12-26 Data warehouse performance test method and system

Country Status (1)

Country Link
CN (1) CN106649754B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416752A (en) * 2020-11-02 2021-02-26 四川新网银行股份有限公司 ETL (extract transform load) layered test method based on data warehouse

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8019795B2 (en) * 2007-12-05 2011-09-13 Microsoft Corporation Data warehouse test automation framework
CN103390020A (en) * 2012-05-10 2013-11-13 西门子公司 Method and system for storing data in database

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8019795B2 (en) * 2007-12-05 2011-09-13 Microsoft Corporation Data warehouse test automation framework
CN103390020A (en) * 2012-05-10 2013-11-13 西门子公司 Method and system for storing data in database

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
1生有你: "SQL性能优化", 《HTTPS://WWW.CNBLOGS.COM/YSYN/P/5445064.HTML,博客园》 *
刘南海等: "大数据时代运营商分析支撑域转型的实践与思考", 《电信科学》 *
李小庆: "银行数据仓库性能的深度优化", 《中国金融电脑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416752A (en) * 2020-11-02 2021-02-26 四川新网银行股份有限公司 ETL (extract transform load) layered test method based on data warehouse
CN112416752B (en) * 2020-11-02 2023-06-06 四川新网银行股份有限公司 Data warehouse ETL (extract-transform-load) layered test method

Also Published As

Publication number Publication date
CN106649754B (en) 2021-04-20

Similar Documents

Publication Publication Date Title
US5758146A (en) Method and apparatus for optimizing data retrieval using index scanning
US7676451B2 (en) Selective database statistics recollection
US20190303876A1 (en) Nearest known person directory function
CN105095413B (en) A kind of method and device solving data skew
US20050131947A1 (en) Data processing system and method
US6654756B1 (en) Combination of mass storage sizer, comparator, OLTP user defined workload sizer, and design
WO2017183065A1 (en) Device and method for tuning relational database
US8650224B2 (en) Batching content management operations to facilitate efficient database interactions
CN105069134A (en) A Method of Automatic Collection of Oracle Statistical Information
CN103984726A (en) Local revision method for database execution plan
US11726975B2 (en) Auto unload
Zhang et al. HyBench: A new benchmark for HTAP databases
CN100367278C (en) Device and method for archiving and inquiry historical data
CN106485409A (en) A kind of workload apparatus for evaluating and method
US6304871B1 (en) Method and system for characterizing applications for use with databases having structured query language interfaces
US7752181B2 (en) System and method for performing a data uniqueness check in a sorted data set
CN108805458A (en) A kind of enterprise technology Competitiveness Assessment method and device
CN106649754A (en) Method and system for testing data warehouse performance
Müller et al. Efficient view maintenance for enterprise applications in columnar in-memory databases
Dvoretskyi et al. Data Utility Assessment while Optimizing the Structure and Minimizing the Volume of a Distributed Database Node.
CN114969025A (en) Data wide table optimization processing method, system, device and medium
JP2017010376A (en) Mart-less verification support system and mart-less verification support method
Dean et al. Performance optimization of the open xdmod datawarehouse
CN114184219B (en) Metering device verification distribution method, metering device verification distribution system, storage medium and electronic equipment
US8620902B2 (en) Computer program products and methods for query collection optimization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant