A kind of QR decomposition method for the direction of energy Jacobian matrix that GPU accelerates
    
      Technical field
      The direction of energy accelerated the invention belongs to High performance computing in power system application field more particularly to a kind of GPU is refined
The GPU threaded design method decomposed than the QR of matrix.
    
    
      Background technique
      Load flow calculation is most widely used, most basic and most important a kind of electrical operation in electric system.In power train
In the research of the method for operation of uniting and programme, require to carry out Load flow calculation to compare the method for operation or plan power supply plan
Feasibility, reliability and economy.Meanwhile in order to monitor the operating status of electric system in real time, it is also desirable to carry out a large amount of and fast
The Load flow calculation of speed.Therefore, in the method for operation of programming and planning and schedule system, using offline Load flow calculation;In electricity
In the real time monitoring of Force system operating status, then calculated using online power flow.
      And in actual production process, no matter offline trend and online power flow calculating all there is this to compare the calculating speed of trend
High requirement.In being related to planning and designing and the offline trend for arranging the method for operation, situations such as landing scheme because of equipment, is complicated, needs
Want the type of simulation run more, Load flow calculation amount is big, and single Load flow calculation time effects integrally emulate duration;And in electric system
The online power flow carried out in operation is calculated to temporal sensitivity height is calculated, and is needed to provide calculation of tidal current in real time, is such as being envisioned
In accident, the equipment Load flow calculation out of service to the influence of static security, system needs to calculate trend under a large amount of forecast accidents
Distribution, and the method for operation Adjusted Option of anticipation is made in real time.
      In traditional Newton-Laphson method Load flow calculation, the solution of update equation group accounts for the 70% of the Load flow calculation time, amendment
The calculating speed of solving equations influences the overall performance of program.And slowing down with the promotion of CPU calculating speed, list at this stage
A Load flow calculation calculating time has reached a bottleneck.The accelerated method of Load flow calculation is concentrated mainly on using cluster at present
Coarseness acceleration carried out to more trends with multiple-core server, it is practical at the research that single trend internal arithmetic is accelerated in production compared with
It is few.
      GPU is a kind of many-core parallel processor, will be considerably beyond CPU in the quantity of processing unit.GPU traditionally is only
It is responsible for figure rendering, and CPU has all been given in most processing.Method battle array is a kind of multicore, multithreading, tool to present GPU
There are powerful calculating ability and high bandwidth of memory, programmable processor.Under universal computer model, association of the GPU as CPU
Processor work, is decomposed by task reasonable distribution and completes high-performance calculation.
      Sparse vectors, which solve to calculate, has concurrency.After carrying out the decomposition of QR symbol to equation system matrix number, obtain
To Household transformation matrix V and R gusts of upper triangular matrix of sparsity structure, according to R gusts of sparsity structure, progress is respectively arranged matrix J
Parallelization layering.Wherein the calculating of the column in every layer is mutually indepedent, naturally can be by parallel calculating without dependence
Reason is suitble to GPU to accelerate.Therefore by the way that reasonably equation group coefficient matrix progress QR can be rapidly completed in scheduling between CPU and GPU
It decomposes, and solves sparse vectors, domestic and foreign scholars, which have begun, carries out what sparse vectors acceleration solved to GPU
Method is studied, but not deep optimization threaded design, is studied computational threads from the distribution of calculation amount merely and is set
Meter, to thread calculation, data directory mode is not furtherd investigate, and program can not be made to give full play to the advantage of GPU.
      It would therefore be highly desirable to solve the above problems.
    
    
      Summary of the invention
      Goal of the invention: in view of the deficiencies of the prior art, the present invention, which provides, a kind of can be greatly decreased direction of energy Jacobean matrix
The QR for the direction of energy Jacobian matrix that battle array QR decomposition computation time and a kind of GPU that can promote Load flow calculation speed accelerate is decomposed
Method.
      Load flow calculation: electrodynamic noun refers in given power system network topology, component parameters and power generation, load parameter
Under the conditions of, calculate the distribution of active power, reactive power and voltage in power network.
      Parallel computation: being a kind of algorithm of primary executable multiple instruction, it is therefore an objective to improve and calculate relative to serial arithmetic
Speed, and by expanding problem solving scale, solve large-scale and complicated computational problem.
      GPU: graphics processor (English: GraphicsProcessingUnit, abbreviation: GPU).
      The invention discloses a kind of QR decomposition methods of GPU direction of energy Jacobian matrix accelerated, which comprises
      (1) decomposition of QR symbol is carried out to Jacobian matrix J in CPU, obtains Household transformation matrix V and upper triangular matrix R
The sparsity structure of battle array;According to R gusts of sparsity structure, matrix J is respectively arranged and carries out parallelization layering;
      (2) layering QR is calculated by the sequence that level is incremented by GPU decompose kernel function SparseQR.
      Wherein, in the step (1), parallelization, which is layered, is integrated into the n column of matrix J in MaxLevel layers, belongs to same
Column in layer carry out QR decomposition parallel;The quantity of every layer of column for including is Levelnum (k), and k indicates level number;It stores in kth layer
All row numbers are to mapping table Mapk。
      Preferably, in the step (2), layering QR decomposes kernel function and is defined as SparseQR < Nblocks, Nthreads>,
Its thread block size Nthread128 are fixed as, when calculating k layers, thread number of blocks Nblocks=Levelnum (k), always
Number of threads are as follows: Nblocks×Nthreads;According to the sequence that level is incremented by, call kernel function SparseQR < Levelnum (k),
Nthreads> decompose all column for belonging to kth layer;SparseQR < Levelnum (k), Nthreads> calculation process are as follows:
      ((2.1) CUDA is the thread index in per thread distribution thread block index blockID and thread block automatically
threadID;
      (2.2) blockID and threadID are assigned to variable bid and t, index bid line by bid and t later
T thread in journey block;
      (2.3) bid thread blocks are responsible for jth=Map that QR decomposes Jacobian matrix Jk(bid) it arranges;
      In (2.4) bid thread blocks, variable i is incremented to j-1 from 1, if R (i, j) ≠ 0, executes following calculate:
      Firstly, calculating intermediate variable β=2V (i:n, i)TJ (i:n, j), wherein V (i:n, i) is Household transformation
The vector that i-th~n row element of the i-th column is constituted in matrix V, J (i:n, j) are i-th~n row members that jth arranges in Jacobian matrix J
The vector that element is constituted;Then, it is arranged using the jth that formula J (i:n, j)=J (i:n, j)-β V (i:n, i) updates Jacobian matrix J;
      (2.5) Household for calculating jth column converts vector;
      (2.6) J matrix jth column: J (j, j)=a, J (j+1:n, j)=0 are updated.
      Preferably, in the step (2.4):
      Firstly, calculating intermediate variable β=2V (i:n, i)TJ (i:n, j), specifically calculating step is,
      1) judge whether thread number t is less than n-i, otherwise the thread terminates to execute;
      2) the variable Temp+=2V (i+t, i) in thread × J (i+t, j);
      3) t=t+128 is returned and 1) is executed;
      4) value of Temp is assigned to shared drive variable Cache [threadID] in thread;
      5) thread in thread block is synchronized;
      6) variable M=128/2;
      7) it is executed 8) when M is not equal to 0;Otherwise terminate to calculate, jump to 12);
      8) judge that threadID is executed less than M: Cache [threadID] +=Cache [threadID+M], otherwise thread
It does not execute;
      9) thread in thread block is synchronized;
      10) M/=2;
      11) it returns and 7) executes;
      12) (0) β=Cache;
      Then, it is arranged using the jth that formula J (i:n, j)=J (i:n, j)-β V (i:n, i) updates Jacobian matrix J, specifically
Step is,
      1) judge whether thread number t is less than n-i, otherwise thread terminates to execute;
      2) J (t+i, j)=J (t+i, j)-β V (t+i, i);
      3) 1) t=t+128 is returned;
      Further, in the step (2.5),
      Firstly, calculating variable a2=J (j:n, j)TJ (j:n, j), steps are as follows for specific calculating:
      1) judge whether thread number t is less than n-j, otherwise the thread terminates to execute;
      2) the variable Temp, Temp+=J (j+t, j) in thread × J (j+t, j);
      3) t=t+128 is returned and 1) is executed;
      4) value of Temp is assigned to shared drive variable Cache [threadID] in thread;
      5) thread in thread block is synchronized;
      6) variable M=128/2;
      7) it is executed 8) when M is not equal to 0;Otherwise terminate to calculate, jump to 12);
      8) judge that threadID is executed less than M: Cache [threadID] +=Cache [threadID+M], otherwise thread
It does not execute;
      9) thread in thread block is synchronized;
      10) M/=2;
      11) it returns and 7) executes;
      12)a2=Cache (0);
      Secondly, calculating, V (j:n, j)=J (j:n, j)-aej(j:n), wherein being ejBe j-th of element be 1 unit to
Amount;
      Specific step is as follows:
      1) judge whether thread number t is less than n-j, otherwise thread terminates to execute;
      2) V (t+j, j)=J (t+j, j)-aej(t+j);
      3) 1) t=t+128 is returned;
      Then, variable b is calculated2=V (j:n, j)TV (j:n, j);
      Steps are as follows for specific calculating:
      1) judge whether thread number t is less than n-j, otherwise the thread terminates to execute;
      2) variable in thread is Temp, Temp+=V (j+t, j) × V (j+t, j);
      3) t=t+128 is returned and 1) is executed;
      4) value of Temp is assigned to shared drive variable Cache [threadID] in thread;
      5) thread in thread block is synchronized;
      6) variable M=128/2;
      7) it is executed 8) when M is not equal to 0;Otherwise terminate to calculate, jump to 12);
      8) judge that threadID is executed less than M: Cache [threadID] +=Cache [threadID+M], otherwise thread
It does not execute;
      9) thread in thread block is synchronized;
      10) M/=2;
      11) it returns and 7) executes;
      12)b2=Cache (0);
      Finally, calculating V (j:n, j)=V (j:n, j)/b;
      Specific step is as follows:
      1) judge whether thread number t is less than n-j, otherwise thread terminates to execute;
      2) V (t+j, j)=V (t+j, j)/b;
      3) 1) t=t+128 is returned.
      The utility model has the advantages that compared with the prior art, the invention has the benefit that the present invention is using CPU to the direction of energy first
Jacobian matrix J carry out the decomposition of QR symbol, according to R gusts of sparse format, it is possible to reduce unnecessary Floating-point Computation;Secondly root
Assigning to J gusts of each column according to R gusts of sparsity structures can be with the different levels of parallel computation, and layering result is transmitted to GPU;Furthermore
Layering QR is calculated by the sequence that level is incremented by GPU and decomposes kernel function SparseQR, and is used reduction in GPU thread and calculated
Method handles dot-product operation, improves computational efficiency;The last present invention using CPU control program process and handle basic data and
GPU handles the efficiency that the mode that intensive floating-point operation combines improves direction of energy Jacobian matrix QR decomposition, solves
The time-consuming big problem of Load flow calculation in power system static safety analysis.
    
    
      Detailed description of the invention:
      Fig. 1 is the example calculation time of the invention;
      Fig. 2 is that example of the invention is layered result;
      Fig. 3 is flow diagram of the invention.
    
    
      Specific embodiment:
      As shown in figure 3, the QR decomposition method for the direction of energy Jacobian matrix that a kind of GPU of the present invention accelerates, the method
Include:
      (1) decomposition of QR symbol is carried out to Jacobian matrix J in CPU, obtains Household transformation matrix V and upper three angular moment
The sparsity structure of the sparsity structure of battle array R matrix, the J after symbol decomposition is equal to V+R;According to R gusts of sparsity structure, to matrix J
Each column carry out parallelization layering.
      (2) layering QR is calculated by the sequence that level is incremented by GPU decompose kernel function SparseQR.
      One, QR symbol decomposition method is carried out to direction of energy Jacobian matrix J in CPU
      Firstly, carrying out the decomposition of QR symbol to Jacobian matrix J in CPU, Household transformation matrix V and upper three are obtained
The sparsity structure of the sparsity structure of R gusts of angle battle array, the J after symbol decomposition is equal to V+R;Then, parallelization is layered the n of matrix J
Column are integrated into MaxLevel layers, and the column belonged in same layer carry out QR decomposition parallel;The quantity of every layer of column for including is
Levelnum (k), k indicate level number;All row numbers are stored in kth layer to mapping table Mapk.Finally, CPU is counted needed for calculating GPU
According to GPU is transferred to, data include: Jacobian matrix J, dimension n, and R gusts of upper triangular matrix, number of plies MaxLevel, every layer includes
Columns Levelnum and mapping table Map.
      Wherein QR symbol decomposition principle and the parallelization principle of stratification referring to
      " DirectMethodsforSparseLinearSystems " TimothyA.Davis, SIAM,
Philadelphia, 2006.The QR symbol that this patent uses decomposes and parallelization blocking routine is referring to CSparse:
      AConciseSparseMatrixpackage.VERSION3.1.4, Copyright (c) 2006-2014,
TimothyA.Davis, Oct10,2014.Household shift theory is referring to document: a kind of use of Hu Bingxin, Li Ning, Lv Jun
Household converts complex QR decomposition algorithm [J] Journal of System Simulation of Recursive Implementation, 2004,16 (11): 2432-
2434。
      Two, the sequence starting layering QR being incremented by GPU by level decomposes kernel function SparseQR
      Layering QR decomposes kernel function and is defined as SparseQR < Nblocks, Nthreads>, thread block size NthreadIt is fixed
It is 128, when calculating k layers, thread number of blocks Nblocks=Levelnum (k), total number of threads are as follows: Nblocks×
Nthreads;According to the sequence that level is incremented by, call kernel function SparseQR < Levelnum (k), Nthreads> to decompose belong to
K layers of all column.
      SparseQR < Levelnum (k), Nthreads> calculation process are as follows:
      (1) CUDA is the thread index in per thread distribution thread block index blockID and thread block automatically
threadID;
      (2) blockID and threadID are assigned to variable bid and t, index bid thread by bid and t later
T thread in block;
      (3) bid thread blocks are responsible for jth=Map that QR decomposes Jacobian matrix Jk(bid) it arranges;
      In (4) bid thread blocks, variable i is incremented to j-1 from 1, if R (i, j) ≠ 0, executes following calculate:
      Firstly, calculating intermediate variable β=2V (i:n, i)TJ (i:n, j), wherein V (i:n, i) is Household transformation
The vector that i-th~n row element of the i-th column is constituted in matrix V, J (i:n, j) are i-th~n row members that jth arranges in Jacobian matrix J
The vector that element is constituted;Then, it is arranged using the jth that formula J (i:n, j)=J (i:n, j)-β V (i:n, i) updates Jacobian matrix J;
      Firstly, calculating intermediate variable β=2V (i:n, i)TJ (i:n, j), specifically calculating step is,
      1) judge whether thread number t is less than n-i, otherwise the thread terminates to execute;
      2) the variable Temp+=2V (i+t, i) in thread × J (i+t, j);
      3) t=t+128 is returned and 1) is executed;
      4) value of Temp is assigned to shared drive variable Cache [threadID] in thread;
      5) thread in thread block is synchronized;
      6) variable M=128/2;
      7) it is executed 8) when M is not equal to 0;Otherwise terminate to calculate, jump to 12);
      8) judge that threadID is executed less than M: Cache [threadID] +=Cache [threadID+M], otherwise thread
It does not execute;
      9) thread in thread block is synchronized;
      10) M/=2;
      11) it returns and 7) executes;
      12) (0) β=Cache;
      Then, it is arranged using the jth that formula J (i:n, j)=J (i:n, j)-β V (i:n, i) updates Jacobian matrix J, specifically
Step is,
      1) judge whether thread number t is less than n-i, otherwise thread terminates to execute;
      2) J (t+i, j)=J (t+i, j)-β V (t+i, i);
      3) 1) t=t+128 is returned.
      (5) Household for calculating jth column converts vector:
      Firstly, calculating variable a2=J (j:n, j)TJ (j:n, j), steps are as follows for specific calculating:
      1) judge whether thread number t is less than n-j, otherwise the thread terminates to execute;
      2) the variable Temp, Temp+=J (j+t, j) in thread × J (j+t, j);
      3) t=t+128 is returned and 1) is executed;
      4) value of Temp is assigned to shared drive variable Cache [threadID] in thread;
      5) thread in thread block is synchronized;
      6) variable M=128/2;
      7) it is executed 8) when M is not equal to 0;Otherwise terminate to calculate, jump to 12);
      8) judge that threadID is executed less than M: Cache [threadID] +=Cache [threadID+M], otherwise thread
It does not execute;
      9) thread in thread block is synchronized;
      10) M/=2;
      11) it returns and 7) executes;
      12)a2=Cache (0);
      Secondly, calculating, V (j:n, j)=J (j:n, j)-aej(j:n), wherein being ejBe j-th of element be 1 unit to
Amount;
      Specific step is as follows:
      1) judge whether thread number t is less than n-j, otherwise thread terminates to execute;
      2) V (t+j, j)=J (t+j, j)-aej(t+j);
      3) 1) t=t+128 is returned;
      Then, variable b is calculated2=V (j:n, j)TV (j:n, j);
      Steps are as follows for specific calculating:
      1) judge whether thread number t is less than n-j, otherwise the thread terminates to execute;
      2) variable in thread is Temp, Temp+=V (j+t, j) × V (j+t, j);
      3) t=t+128 is returned and 1) is executed;
      4) value of Temp is assigned to shared drive variable Cache [threadID] in thread;
      5) thread in thread block is synchronized;
      6) variable M=128/2;
      7) it is executed 8) when M is not equal to 0;Otherwise terminate to calculate, jump to 12);
      8) judge that threadID is executed less than M: Cache [threadID] +=Cache [threadID+M], otherwise thread
It does not execute;
      9) thread in thread block is synchronized;
      10) M/=2;
      11) it returns and 7) executes;
      12)b2=Cache (0);
      Finally, calculating V (j:n, j)=V (j:n, j)/b;
      Specific step is as follows:
      1) judge whether thread number t is less than n-j, otherwise thread terminates to execute;
      2) V (t+j, j)=V (t+j, j)/b;
      3) 1) t=t+128 is returned.
      (6) J gusts of jth column: J (j, j)=a, J (j+1:n, j)=0 are updated.
      GPU computing platform used in the present invention is equipped with a TeslaK20CGPU card and IntelXeonE5-2620CPU,
The peak bandwidth of GPU is up to 208GB/s, and single-precision floating point calculation amount peak value is up to 3.52Tflops, CPU frequency 2GHz.CPU
The CPU of computing platform outfit IntelCorei7-3520M2.90GHz.It is dilute to 16557 dimensions respectively in CPU and GPU computing platform
It dredges Jacobian matrix example to be tested, Fig. 1 is the 16557 sparse Jacobian matrixes of dimension in the calculating on different calculating libraries
Between.The time is solved as 67ms using the QR algorithm of two-stage paralleling tactic on GPU, lags behind the value decomposition time of KLU, slightly fastly
With the calculating time of UMPACK.GPU accelerating algorithm lag behind KLU calculate the time the reason of the mainly every layer columns for including with
The number of plies increases reduction rapidly, and the 1st layer includes 1642 column, and first 30 layers every layer is greater than 100 column, but only remains 1~2 later to 207 layers
Column.In the level of front can parallel computation number of columns it is more, behind column in level need it is serial execute, be unable to give full play GPU
Calculated performance.Fig. 2 is that (every layer of the elimination-tree columns for including is indulged for the degree of parallelism change curves of the 16557 sparse Jacobian matrixes of dimension
Coordinate denary logarithm indicates).