Background technology
In recent years, spatial database becomes more and more important in many application, as Geographic Information System (GIS), computer-aided design (CAD) (CAD), image processing, multimedia system etc.Because space querying can be processed the multidimensional data in spatial database and satisfy application type widely, therefore quite concerned in recent years to the research of space querying, this is because space data quantity is huge, data structure is complicated, operation costs dearly, query optimization Cost Model in relational database can not be applicable to spatial data fully, so the optimization of space querying certainly will become difficult point and the breakthrough point that use in the space.
It is most important in space querying that the space is selected to be connected with the space, is also two kinds of more expensive operations of cost (I/O and CPU cost), about their query optimization attracts wide attention.At present, the Cost Model based on the Query Cost estimation is the most frequently used a kind of enquiring and optimizing method.The subject matter of estimation cost is the size of estimation Query Result (selection rate).
In Database Systems, for given inquiry, the cost that the more various differences of query optimizer carry into execution a plan, the person carries out therefrom to select the Least-cost.The whether accurate accuracy that directly has influence on the corresponding scheme cost estimation of selection rate estimation, thus the quality of query optimizer there is substantial impact.Therefore, the selectivity estimation of space querying is very important when seeking the most effective executive plan.
The algorithm that is used for the estimation selection rate mainly contains following three kinds: sampling method, parametric method and histogram.In numerous methods, be a kind of the most frequently used method based on histogrammic space querying selectivity estimation.Its basic thought: adopt certain strategy that data space is divided into several subspaces, the corresponding sub spaces of record cell; Statistics drops on the object number in its corresponding subspace in record cell; With someway these statistical values being estimated, obtain the estimated value of query results size, estimate thus Query Cost.These record cells are called bucket, and the set of bucket is called histogram.There is a serious defective in the spatial histogram of setting up in this way: the repeat count problem.
Chinese scholars has successively proposed the histograms such as MinSkew, SQ, CD, Euler, MP, GH.Wherein, MinSkew and SQ histogram are to adapt to the arbitary inquiry window by effective dividing data space, but there is the repeat count problem in they in the application for line and polygon data, therefore the error of its selection rate estimation is larger.The CD histogram is divided into equal-sized grid unit with data space, four angle points with data object MBR are corresponding respectively with four sub-histograms, a bucket storage in sub-histogram falls into the number of the corresponding angle point of this barrel, and it can return to the geographical geometric element number that intersects with query window accurately; The Euler histogram is the histogram that utilizes the Euler principle to build, its speciality to some extent aspect the topological relation selectivity estimation, but owing to there being " boundary problem ", can't accurately estimate the selection rate when query region does not overlap with the bins netting twine; Although CD and Euler histogram can solve the repeat count problem, but they can not distinguish more meticulous spatial relationship (as cotains, overlap etc.), two kinds of different situations are arranged in Fig. 2, CD and Euler histogram have but been stored respectively identical information (seeing accompanying drawing 2).The people such as Sun have studied more meticulous 4 kinds of topological relations (disjoint, cotains on Euelr histogrammic basis, within, overlap) selectivity estimation, 3 kinds of approximate algorithms have been proposed, but there is larger error when calculating Nie, makes its selectivity estimation error larger.Afterwards, the people such as liu had proposed the selectivity estimation algorithm that a kind of accurate multi-scalability solves above-mentioned 4 kinds of relations.The histogrammic improvement algorithm of MinSkew, SQ, CD, Euler and Euler is all selected operation for the space and is proposed, even can not apply in spatial join operation.
The histogram research that relevant space connects is less.Mainly contain at present MP algorithm and GH histogram.The basic assumption of MP algorithm is: the data object that intersects with grid unit is not only equally distributed, and similar width and height will be arranged.Therefore, under meticulous graticule mesh granularity, the non-constant of estimation effect.Generally only have under this extreme case of sizableness of the size of data object and query window, the error of MP algorithm is just smaller.The GH histogram is based on the observation that two crossing rectangles is always produced four crossing points, and the selection rate that the space connects can by at first estimating the number of joining between two data sets, then obtain divided by 4.The histogrammic degree of accuracy of GH depends on the height and width hypothesis relatively less than the height and width of query window of data object, if namely data centralization is some less data objects, the GH estimation precision is higher, otherwise, data centralization is take larger object as main, and the GH estimation precision is lower.MP algorithm and GH histogram can only be used for spatial join operation, can not be used for the selectivity estimation that the space is selected.
Above-mentioned existing various histograms, except the selectivity estimation of support space selection simultaneously and spatial join operation, also do not point out from the histogrammic method of existing histogram generated query result data, namely they only set up histogram in advance to raw data.And a query manipulation is carried out step by step according to query plan tree, and in query tree, the selection rate of each node is very important for the parameter in Cost Model.But do not estimate the histogramming algorithm of subsequent node selection rate at present, the well requirement of meeting spatial query optimization.
Summary of the invention
The objective of the invention is to propose a kind of based on the histogrammic space querying selectivity estimation of AB method.The method can more accurately be estimated disjoint, intersect, within, contains is connected the space of 5 kinds of spatial relationships to select the selection rate that is connected with the space with overlap, and can from the histogram of the histogram generated query result data of raw data, satisfy the demand of subsequent node selectivity estimation in query tree.
It is a kind of based on the histogrammic space querying selectivity estimation of AB method that the present invention proposes, and comprises the following steps:
The first step is pressed the histogrammic method for building up of AB, carries out the vector data of selectivity estimation for needs and sets up histogram;
Second step, the given query manipulation condition of user;
In the 3rd step, if only relate to a vector data set in described query manipulation condition, this is operating as the space and selects operation so, estimates first that each barrel of histogram satisfies the probability of given space alternative condition, then estimates the selection rate of this space selection operation;
The 4th step, if relate to two above vector data set in described query manipulation condition, this is operating as spatial join operation so, estimates first that each barrel of histogram satisfies the probability of given space condition of contact, then estimates the selection rate of two data ensemble spaces attended operation;
The 5th goes on foot, and utilizes the AB histogram of original AB histogram generated query result data, the selection rate of subsequent node in the estimation query tree.
Each barrel of estimation histogram of mentioning in above-mentioned the 3rd step satisfies the probability of given space alternative condition, and given space alternative condition refers to disjoint herein, intersect, within, contains, one or more in the overlap relation.Wherein, the probabilistic method of estimation disjoint and overlap is illustrated.
The 9-I model (9-intersection model) that proposes according to people such as Max J.Egenhofer as can be known, the spatial relationship between two spatial object P, Q can represent with following 3 * 3 matrixes:
Wherein, P.i, Q.i, P.e, Q.e, P.b, Q.b represent respectively inside, outside and the border of two objects.If only relatively whether intersect the inside of two spatial objects, spatial relationship can be divided into disjoint and two kinds of complementary situations of intersect; If pass through to compare the inside of two spatial objects and outside crossing situation, can obtain disjoint, overlap, contains, five kinds of objectionable intermingling situations of within and equal, be that the intersect relation can further be decomposed into overlap, contains, these four kinds of relations of within and equal.
When carrying out Probability estimate, it is 0 that there is the probability of equal relation in general hypothesis.Because in actual applications, the number that the MBR of query window and spatial object consists of the equal relation mostly is 1 (being 0 under normal circumstances) most, for a lot of spatial objects in whole data acquisition, the probability of equal relation is approximately 0, therefore this hypothesis is rational.Owing to being mutual exclusive two kinds of situations from disjoint and crossing intersect and consisting of complementary relationship, according to the fundamental property of probability, certain barrel of probability that satisfies disjoint equals 1 and deducts the probability that this barrel satisfies intersect.Because the intersect relation can be decomposed into overlap, contains, the mutual exclusive four kinds of situations of within and equal, and the probability of equal is 0, according to the addition formula of probability, certain barrel is satisfied the overlap probability and equals the probability that this barrel satisfy intersect and deduct respectively the probability that this barrel satisfies within, contains.
Above-mentioned estimated probability method is equally applicable to estimate in the 4th step that each barrel of histogram satisfies the probability of given space condition of contact.
It is a kind of based on the histogrammic space querying selectivity estimation of AB method that the present invention proposes, and its advantage is as follows:
(1) the AB histogram is the spatial histogram of a kind of more summary, practicality.The selectivity estimation that it can not only support space be selected, and the selectivity estimation that can support space connects.
(2) the AB histogram is to disjoint, intersect, and within, the selectivity estimation of contains and five kinds of spatial relationships of overlap has higher precision.In addition, utilize the rule of mutually changing between spatial relationship, the AB histogram can solve the selectivity estimation problem of more meticulousr spatial relationships.
(3) the AB histogram can from the histogram of the histogram generated query result data of raw data, satisfy the demand of subsequent node selectivity estimation in query tree.The histogram transformation approach makes the selection rate of estimating subsequent node in query tree and whole query tree become possibility.The selection rate of these nodes the optimization of space querying with carry out in be an important parameter.
(4) in spatial database, the AB histogram is easy to traditional attribute histogram integrated.Because except calculate different spaces concern the logical method of probability belong to the AB histogram peculiar, all the attribute histogram with traditional is identical for the histogrammic most of concepts of AB (such as the probability that represents various spatial relationships with patterned method etc.).Like this, can fully use for reference the existing good nature of traditional attribute histogram in the Spatial Query Optimization process.
Embodiment
1, take vector data shown in Figure 3 as example, the concrete implementation step that in the present invention, the space selects the query selection rate to estimate is described:
(1.1) be that vector data shown in Figure 3 generates the AB histogram;
MBR with each data object replaces corresponding data object, whole data space is divided 12 row * 12 row grid units, annular bucket according to each object MBR position judgement under it, if do not have the annular bucket to comprise this object, increase an annular bucket in the AB histogram, record the scope of annular bucket, and numerical value in bucket is set to 1; If have corresponding annular bucket to comprise this object, numerical value in bucket increased by 1.Repeat this process, until all data objects all have been assigned in corresponding annular bucket.The MBR of data object, and the annular bucket is as shown in Fig. 3 (a); The AB histogram of setting up is as shown in Fig. 3 (b).Total A, B, C, four annular buckets of D in this histogram, scope is respectively (2,2; 5,5), (6,2; 10,5), (2,6; 7,10), (0,0; 12,12), in each annular bucket, 2,3,1,2 data objects are arranged respectively.
(1.2) the given space querying operating conditions of user is selected operation for the space herein;
(i) user need to specify a query window, and query window herein is as shown in Fig. 3 (a) lower left corner.
(ii) estimate that each annular bucket satisfies the probability of intersect relation, as shown in Fig. 4 (a);
After estimating, A, B, C, four annular buckets of D are respectively 1.0,0.3,0.0,1.0 with the probability that query window satisfies the intersect relation.
(iii) estimate that each annular bucket satisfies the probability of within relation, as shown in Fig. 4 (b);
After estimating, A, B, C, four annular buckets of D are respectively 0.5,0.0,0.0,0.0 with the probability that query window satisfies the within relation.
(iv) estimate that each annular bucket satisfies the probability of contains relation, as shown in Fig. 4 (c);
After estimating, A, B, C, four annular buckets of D are respectively 0.0,0.0,0.0,0.5 with the probability that query window satisfies the contains relation.
(v) satisfy the overlap probability according to certain barrel and equal probability that this barrel satisfy intersect and deduct respectively the probability that this barrel satisfy within, contains and calculate probability that each annular bucket satisfies the overlap relation, as shown in Fig. 4 (d);
As calculated, A, B, C, four annular buckets of D are respectively 0.5,0.3,0.0,0.5 with the probability that query window satisfies the overlap relation.
(1.3) selection rate of selecting according to the space equals the summation that object number and this barrel in each annular bucket satisfy the probability product of given alternative condition, satisfies the spatial object selection rate of given space condition in the given query window of calculating user.
In Fig. 3 (a), the estimated value that query window and spatial object consist of the intersect relation is: 2 * 1.0+3 * 0.3+1 * 0.0+2 * 1=4.9; The estimated value that query window and spatial object consist of the overlap relation is: 2 * 0.5+3 * 0.3+1 * 0.0+2 * 0.5=2.9.In like manner can get the estimated value of other spatial relationships.
2, take Fig. 3 and vector data shown in Figure 5 as example, the concrete implementation step of intersect Spatial Join Query selectivity estimation in the present invention is described:
(2.1) be that Fig. 3 and vector data shown in Figure 5 generate the AB histogram;
Data space to two vector datas all is divided into 12 row * 12 row grid units, sets up respectively the AB histogram, and step is as described in concrete implementation step 1.1.Because grid unit does not become, therefore the AB histogram that in Fig. 3, vector data generates is also constant.The total A ' of the histogram that in Fig. 5, vector data generates, two annular buckets of B ', scope is respectively (5,7; 9,10), (1,1; 6,3), in each annular bucket, 1,1 data object is arranged respectively.
(2.2) the given space querying operating conditions of user is the intersect spatial join operation herein;
(i) obtain respectively the MBR distribution plan of data object;
The MBR of Fig. 3 data set is used as the master data set, and the MBR of Fig. 5 data set is as the query window set.
For the purpose of following narration is convenient, the data object MBR of annular bucket A ' in Fig. 5 is designated as I, the data object in annular bucket B ' is designated as II, namely always has 2 query windows in Fig. 5.
(ii) probability of the interior intersect spatial relationship of each annular bucket in drawing for estimate 3.
If bucket[i] outer shroud and query window rectangle from, P (i)=0.0; If bucket[i] outer shroud and query window rectangle intersection and interior ring and query window rectangle from, at this moment, need to calculate bucket[i] with length l ength and the width wide of query window rectangle lap, the distance of P (i)=min (length, wide) ÷ outer shroud and interior interannular; Otherwise, P (i)=1.Result of calculation is as follows:
1. the probability that after estimating, in Fig. 3, in A, B, C, four annular buckets of D, in data objects and Fig. 5, query window I satisfies the intersect relation is respectively 0.0,0.0,1.0,1.0;
2. the probability that after estimating, in Fig. 3, in A, B, C, four annular buckets of D, in data objects and Fig. 5, query window II satisfies the intersect relation is respectively 0.8,0.0,0.0,1.0;
(2.3) selection rate according to master data set space connection is to multiply by by the number that calculates the interior object of each annular bucket in this data acquisition the probability sum that this barrel satisfies given space condition of contact, and master data set (Fig. 3) with respect to the intersect space connection selection rate of query window data acquisition (Fig. 5) is:
In like manner, the MBR of Fig. 3 data set can be used as the query window set, the MBR of Fig. 5 data set is as the master data set, and calculating chart 5 data sets are with respect to the intersect space connection selection rate of Fig. 3 data set.
Similar to embodiment 2, can estimate the selection rate of the Spatial Join Queries such as within, contains and overlap.
3, take intersect as example, the AB Nogata drawing generating method of Query Result in the present invention is described.
(3.1) the Query Result AB histogram of operation is selected in the intersect space.
Select to be operating as example with intersect space in Fig. 3.By Spatial Select Query as a result in the histogrammic bucket of AB numerical value be to be satisfied long-pending the obtaining as can be known of probability of given space alternative condition by numerical value in each annular bucket in the AB histogram of raw data and this barrel, in the histogrammic annular bucket of the AB of intersect Spatial Select Query result, numerical value is to be obtained by the amassing of probability that numerical value and this barrel in each annular bucket in the AB histogram of raw data satisfy intersect space alternative condition.In Query Result AB histogram, A, B, C, four interior numerical value of annular bucket of D are respectively: 2 * 1.0=2.0; 3 * 0.3=0.9; 1 * 0.0=0.0; 2 * 1.0=2.0.Generate the AB histogram of result as shown in Fig. 6 (a).
(3.2) the Query Result AB histogram of intersect spatial join operation.
The intersect spatial join operation is as example in Fig. 3 and Fig. 5.By numerical value in newly-generated AB histogram bucket be multiply by all probability that satisfy given space condition of contact in this barrel and query window data acquisition between a plurality of objects by numerical value in former annular bucket average as can be known, after vector data in Fig. 3 carried out the intersect spatial join operation, in newly-generated AB histogram annular bucket, numerical value was to multiply by by numerical value in former annular bucket the mathematical expectation of probability that satisfies intersect space condition of contact in this barrel and query window data acquisition between a plurality of objects to obtain.According to
In Fig. 3 in the newly-generated AB histogram of data in A, B, C, four annular buckets of D numerical value be respectively: 2 * (0.0+0.8)/(1+1)=0.8; 3 * (0.0+0.0)/(1+1)=0.0; 1 * (1.0+0.0)/(1+1)=0.5; 2 * (1.0+1.0)/(1+1)=2.0.Generate the AB histogram of result as shown in Fig. 6 (b).In like manner can get the Query Result AB histogram after data intersect spatial join operation in Fig. 5, be omitted herein.
Similar to embodiment 3, in like manner can obtain disjoint, within, the AB histogram of the selection of the spaces such as contains and overlap and Spatial Join Query result.
The comparative analysis experiment:
We were with 1: 10 of certain city of China, and 000 land use data is experimental data, and it comprises 17207 polygons, and its MBR distributes as shown in Fig. 7 (a), and we define 9 rectangles as the query window data in addition, as shown in Fig. 7 (b).Select in test in the space, we respectively as different query windows, estimate to satisfy the number of the land use data of particular space relation with 9 rectangles.In the connecting test of space, 9 rectangles are regarded as another data set, are used as the set of query window.Use the inventive method, we respectively the space of the relations such as disjoint, intersect, within, contains and overlap are selected and the selectivity estimation of spatial join operation is tested, and have calculated the relative error between selectivity estimation value and true value.The experimental data statistics that the space is selected to be connected with the space is seen respectively Fig. 8 and Fig. 9.
As seen, the AB histogram can be used for be estimated disjoint, and intersect, within, contains are connected the space and are selected the selection rate that is connected with the space with the overlap relation.The error of method of the present invention is less, has higher accuracy rate.
The content that is not described in detail in instructions of the present invention belongs to the known prior art of this area professional and technical personnel.
The above is only the preferred embodiment of the present invention; should be pointed out that for those skilled in the art, under the prerequisite that does not break away from the principle of the invention; can also make some improvements and modifications, these improvements and modifications also should be considered as protection scope of the present invention.