CN113885941A - A method, device and related equipment for realizing singular value decomposition operation - Google Patents
- Publication number
- CN113885941A (application CN202111040096.0A)
- Authority
- CN
- China
- Prior art keywords
- singular value
- value decomposition
- vector
- operator
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
Abstract
The invention discloses a method, an apparatus and related equipment for implementing a singular value decomposition operation, wherein the method comprises the following steps: constructing a singular value decomposition operator, wherein the singular value decomposition operator is used for transferring data within a target device and performing the singular value decomposition operation; deploying the singular value decomposition operator to the target device, wherein the target device is an Ascend AI processor; and acquiring data to be processed, and performing the singular value decomposition operation on the data to be processed based on the deployed singular value decomposition operator. Compared with the prior art, the scheme of the invention constructs a singular value decomposition operator that can perform data transfer and the singular value decomposition operation within the Ascend AI processor, and deploys this operator to the Ascend AI processor, which makes it possible to fully utilize the computing power of the Ascend AI processor and to perform the SVD operation on the data to be processed directly on the Ascend AI processor.
Description
Technical Field
The invention relates to the technical field of singular value decomposition, in particular to a singular value decomposition operation implementation method, a singular value decomposition operation implementation device and related equipment.
Background
The development of AI techniques often relies on the processing of massive amounts of data, which places very high demands on computing power. The Ascend AI processor is a computationally intensive processor that was developed for the characteristics of such computationally intensive tasks. The Ascend AI processor achieves good performance in AI task processing, but its support for low-level operators is still deficient.
Singular Value Decomposition (SVD) is a common matrix computation in mathematics; it can be used to accelerate matrix inversion, and it plays an important role in signal processing, image compression, tensor networks, second-order optimization, and the like. In the prior art, corresponding SVD functions can be run on computing platforms such as x86, ARM and GPU to implement SVD operations. The problem with the prior art is that there is no SVD operator that can run on the Ascend AI processor, so for data that needs to undergo singular value decomposition, the SVD operation cannot be performed directly on the Ascend AI processor, which makes it difficult to fully utilize the computing power of the Ascend processor.
Thus, there is still a need for improvement and development of the prior art.
Disclosure of Invention
The main objective of the present invention is to provide a method, an apparatus and related equipment for implementing a singular value decomposition operation, which are used to solve the problems in the prior art that no singular value decomposition operator can run on the Ascend AI processor, that the singular value decomposition operation cannot be performed directly on the Ascend AI processor for data that needs to undergo it, and that the computing power of the Ascend AI processor therefore cannot be fully utilized.
In order to achieve the above object, a first aspect of the present invention provides a method for implementing singular value decomposition operation, wherein the method includes:
constructing a singular value decomposition operator, wherein the singular value decomposition operator is used for transferring data within a target device and performing the singular value decomposition operation;
deploying the singular value decomposition operator to the target device, wherein the target device is an Ascend AI processor;
and acquiring data to be processed, and performing the singular value decomposition operation on the data to be processed based on the deployed singular value decomposition operator.
Optionally, the constructing the singular value decomposition operator includes:
and constructing a subfunction of the singular value decomposition operator based on an algorithm flow of singular value decomposition, wherein the algorithm flow is a flow corresponding to a power iteration method.
Optionally, the subfunction of the singular value decomposition operator includes: vector normalization subfunctions, matrix-by-vector subfunctions, vector-by-matrix subfunctions, and vector orthogonalization subfunctions.
Optionally, the vector normalization subfunction is used for: performing block-wise data transfer and computation on the vector to be normalized, to obtain the normalized vector and the modulus of the vector to be normalized.
Optionally, the matrix-by-vector subfunction is used for: performing batched data transfer and computation on the matrix and vector corresponding to the matrix-by-vector subfunction, to obtain an output vector.
Optionally, the vector-by-matrix subfunction is used for: performing batched data transfer and computation on the vector and matrix corresponding to the vector-by-matrix subfunction, to obtain an output vector.
Optionally, the vector orthogonalization sub-function is used for: and calculating to obtain an output vector based on the input vector and the normalized vector of the vector orthogonalization subfunction, wherein the inner product of the input vector and the output vector is 0.
Optionally, the constructing the singular value decomposition operator based on all the sub-functions includes:
and completing the implementation of the singular value decomposition operator through the TIK, based on the vector normalization subfunction, the matrix-by-vector subfunction, the vector-by-matrix subfunction and the vector orthogonalization subfunction.
Optionally, the obtaining of the data to be processed and the performing of singular value decomposition operation on the data to be processed based on the deployed singular value decomposition operator include:
and acquiring data to be processed, and calling a subfunction of the singular value decomposition operator to perform singular value decomposition operation on the data to be processed based on a power iteration method.
A second aspect of the present invention provides an apparatus for implementing singular value decomposition operation, wherein the apparatus includes:
an operator construction module, configured to construct a singular value decomposition operator, wherein the singular value decomposition operator is used for transferring data within the target device and performing the singular value decomposition operation;
an operator deployment module, configured to deploy the singular value decomposition operator to the target device, wherein the target device is an Ascend AI processor;
and an operation module, configured to acquire the data to be processed and perform the singular value decomposition operation on the data to be processed based on the deployed singular value decomposition operator.
Optionally, the operator building module is specifically configured to: and constructing a subfunction of the singular value decomposition operator based on an algorithm flow of singular value decomposition, wherein the algorithm flow is a flow corresponding to a power iteration method.
A third aspect of the present invention provides an intelligent terminal, including a memory, a processor, and a singular value decomposition operation implementation program stored in the memory and executable on the processor, wherein the singular value decomposition operation implementation program, when executed by the processor, implements any one of the steps of the singular value decomposition operation implementation method.
A fourth aspect of the present invention provides a computer-readable storage medium having a singular value decomposition operation implementation program stored thereon, wherein the singular value decomposition operation implementation program, when executed by a processor, implements any one of the steps of the singular value decomposition operation implementation method.
As can be seen from the above, in the scheme of the present invention, a singular value decomposition operator is constructed, wherein the singular value decomposition operator is used for transferring data within a target device and performing the singular value decomposition operation; the singular value decomposition operator is deployed to the target device, wherein the target device is an Ascend AI processor; and data to be processed is acquired, and the singular value decomposition operation is performed on the data to be processed based on the deployed operator. Compared with the prior art, the scheme of the invention constructs a singular value decomposition operator that can perform data transfer and the singular value decomposition operation within the Ascend AI processor, and deploys this operator to the Ascend AI processor, which makes it possible to fully utilize the computing power of the Ascend AI processor and to perform the SVD operation on the data to be processed directly on the Ascend AI processor.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic diagram of the DaVinci AI Core architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an SVD decomposition of a matrix according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a method for implementing singular value decomposition according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a data transfer process of a vector calculation unit in AI Core according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a matrix multiplication vector according to an embodiment of the present invention;
FIG. 6 is a diagram of a vector multiplication matrix according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an apparatus for implementing singular value decomposition operation according to an embodiment of the present invention;
fig. 8 is a schematic block diagram of an internal structure of an intelligent terminal according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when …" or "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted depending on the context to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
The development of AI techniques often relies on the processing of massive amounts of data, which places very high demands on computing power. The Ascend AI processor is a computationally intensive processor that was developed for the characteristics of such computationally intensive tasks. The Ascend AI processor employs the Huawei DaVinci architecture, whose computation core consists mainly of AI Cores. FIG. 1 is a schematic diagram of the Huawei DaVinci AI Core architecture according to an embodiment of the present invention. As shown in fig. 1, the AI Core includes three basic computing units, namely a matrix computing unit (Cube Unit), a vector computing unit (Vector Unit) and a scalar computing unit (Scalar Unit), which correspond to three common computation types: matrix, vector and scalar. The Cube unit is responsible for matrix multiplication C = A × B: if the input data type is float16, the multiplication of two 16 × 16 matrices can be completed each time; if the input data type is int8, a 16 × 32 matrix is multiplied with a 32 × 16 matrix each time. Here A, B and C are stored in the buffers L0A, L0B and L0C, respectively. The Vector unit is responsible for vector computation; its computing power is lower than that of the Cube unit, but it is more flexible, and its common data types are float16 and float32. It can process not only vectors but also tensors of arbitrary dimensions, as long as they are treated as one-dimensional vectors. The Scalar unit is mainly used for scalar operations on data of various types and for the flow control of programs. The three computing units form three mutually independent pipelines in the computation process and are coordinated under the unified scheduling of the system, so that the strong computing power of the AI Core is exploited to the maximum extent.
In addition, an Ascend AI chip includes a plurality of AI Cores, all of which share a Global Memory (GM). The performance of the Ascend AI processor can be further greatly enhanced by using this multi-core parallel computing capability.
CANN (Compute Architecture for Neural Networks) is a heterogeneous computing architecture proposed by Huawei for AI scenarios; it comprises a chip operator library and highly automated operator development tools. It connects the underlying hardware with the upper AI framework layer and enables users to fully exploit the computing power of the Ascend AI processor through multi-level programming interfaces. To meet diverse user requirements, CANN provides a custom operator development capability based on the TVM (Tensor Virtual Machine) framework, and the development of a corresponding neural network operator can be completed through the APIs and the custom operator programming interface provided by TBE (Tensor Boost Engine). Currently, the Ascend AI software stack provides two development methods. The first is DSL (Domain-Specific Language) development: schedules of some common operations are provided in advance and packaged into single interfaces; the developer expresses the computation logic of the operator using these predefined interfaces, and an automatic scheduling mechanism then generates the target code, completing the writing of the operator. The second is development with the TIK (Tensor Iterator Kernel) module, a dynamic programming framework based on the Python language that provides mechanisms for buffer management and automatic data synchronization. The two operator development modes offer users different levels of development abstraction. Generally, the TIK development mode is more difficult, but the resulting operators can often achieve higher performance.
The Ascend AI processor achieves good performance in AI task processing, but its support for low-level operators is still insufficient, so that some related upper-layer applications cannot be built. For example, in scientific computing, the SVD operation is a common matrix computation in mathematics; it can be used to accelerate matrix inversion and the approximate decomposition of large matrices, and plays an important role in signal processing, image compression, tensor networks, second-order optimization, and the like. In the prior art, corresponding SVD functions can be run on computing platforms such as x86, ARM and GPU to implement SVD operations. However, CANN, the base library of the Ascend AI processor, currently does not provide a matching API interface, which obviously limits the processor's development potential. Therefore, the problem with the prior art is that there is no SVD operator that can run on the Ascend AI processor, and it is difficult to fully utilize the computing power of the Ascend AI processor to perform SVD operations.
FIG. 2 is a schematic diagram of the SVD decomposition of a matrix according to an embodiment of the present invention. As shown in FIG. 2, the SVD decomposition of a matrix refers to decomposing an arbitrary matrix A (M × N) into the product of three matrices, that is, A_{M×N} ≈ U_{M×K} × S_{K×K} × V^T_{K×N}, wherein S_{K×K} is a diagonal matrix whose diagonal elements are referred to as the singular values of matrix A. When K = min(M, N), the above formula holds with strict equality; when K < min(M, N), the matrix product on the right is an approximate representation of the matrix on the left. K represents the dimension of the decomposition, i.e., the decomposition retains the K largest singular values of the matrix. SVD is a function in the well-known linear algebra computing library LAPACK, which can already be run on multiple computing platforms such as x86, ARM and GPU. The algorithm library currently provided by CANN does not implement it, so the Ascend AI processor cannot support SVD matrix computation; the root cause is that the hardware and underlying instructions of the Ascend AI processor differ from those of a general-purpose CPU. For example, the BLAS math library, which can be called on the CPU, cannot be run on GPU hardware. Therefore, applications that need to rely on an SVD operator cannot be built on the Ascend AI processor.
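The decomposition described above can be illustrated with a short NumPy sketch (an illustration of the mathematics only, not the patent's implementation; matrix sizes are arbitrary examples). With K = min(M, N) the reconstruction is exact, and for K < min(M, N) the truncated product approximates A, with Frobenius error equal to the root of the discarded squared singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 6, 4, 2
A = rng.standard_normal((M, N))

# Thin SVD: U is M x min(M,N), S is diagonal, Vt is min(M,N) x N.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_full = U @ np.diag(s) @ Vt              # K = min(M, N): strict equality

# Truncated product keeping only the K largest singular values.
A_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
err = np.linalg.norm(A - A_k)

print(np.allclose(A_full, A))                        # exact reconstruction
print(np.isclose(err, np.sqrt(np.sum(s[K:] ** 2))))  # error from discarded values
```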
In order to solve the above problems in the prior art, the present invention provides a method for implementing a singular value decomposition operation. In the embodiment of the present invention, a singular value decomposition operator is constructed, wherein the singular value decomposition operator is used for transferring data within a target device and performing the singular value decomposition operation; the singular value decomposition operator is deployed to the target device, wherein the target device is an Ascend AI processor; and data to be processed is acquired, and the singular value decomposition operation is performed on it based on the deployed operator. Compared with the prior art, the scheme of the invention constructs a singular value decomposition operator that can perform data transfer and the singular value decomposition operation within the Ascend AI processor, and deploys this operator to the Ascend AI processor, which makes it possible to fully utilize the computing power of the Ascend AI processor and to perform the SVD operation on the data to be processed directly on the Ascend AI processor.
As shown in fig. 3, an embodiment of the present invention provides a method for implementing singular value decomposition operation, and specifically, the method includes the following steps:
Step S100, constructing a singular value decomposition operator, wherein the singular value decomposition operator is used for transferring data within the target device and performing the singular value decomposition operation.
In this embodiment, the singular value decomposition operator is a function that can be compiled and run in the target device. The singular value decomposition operator is used for performing singular value decomposition operation on the matrix.
Step S200, deploying the singular value decomposition operator to the target device, wherein the target device is an Ascend AI processor.
Specifically, the singular value decomposition operator is deployed in the target device, where it can be compiled and run, so that the target device can call the operator to transfer data and perform the singular value decomposition operation. In this embodiment, the target device is an Ascend AI processor. In practical use, the target device may also be another processor or device, which is not specifically limited herein.
And S300, acquiring data to be processed, and performing singular value decomposition operation on the data to be processed based on the deployed singular value decomposition operator.
The data to be processed is data on which singular value decomposition needs to be performed, for example, a matrix to be decomposed. In one application scenario, the matrix to be decomposed can be input into the Ascend AI processor as a data stream, and the data stream is partitioned, by presetting or inputting the number of rows and columns of the matrix in real time, to determine the corresponding matrix.
As can be seen from the above, in the singular value decomposition operation implementation method provided by the embodiment of the present invention, a singular value decomposition operator is constructed, wherein the singular value decomposition operator is used for transferring data within a target device and performing the singular value decomposition operation; the operator is deployed to the target device, wherein the target device is an Ascend AI processor; and data to be processed is acquired, and the singular value decomposition operation is performed on it based on the deployed operator. Compared with the prior art, the scheme of the invention constructs a singular value decomposition operator that can perform data transfer and the singular value decomposition operation within the Ascend AI processor, and deploys this operator to the Ascend AI processor, which makes it possible to fully utilize the computing power of the Ascend AI processor and to perform the SVD operation on the data to be processed directly on the Ascend AI processor.
Specifically, in this embodiment, the above-mentioned constructing a singular value decomposition operator (i.e. SVD operator) includes: and constructing a subfunction of the singular value decomposition operator based on an algorithm flow of singular value decomposition, wherein the algorithm flow is a flow corresponding to a power iteration method.
In order to implement a high-performance SVD operator, the TIK is used to write the operator, and the algorithm adopted is the power iteration method. The power iteration method is a mathematical algorithm for solving matrix eigenvalues; its underlying idea is that the eigenvector corresponding to the largest eigenvalue of a matrix carries the most information about the matrix. The specific procedure is to multiply the original matrix by a random vector, normalize the resulting vector, multiply it by the original matrix again, and repeat this process continuously; the vector finally obtained is the eigenvector corresponding to the largest eigenvalue of the matrix. Since the SVD decomposition process basically consists of vector and matrix-vector operations and does not involve matrix-matrix multiplication, the SVD is implemented with the vector computing unit. In the AI Core, one vector computing unit is divided into 8 blocks, each of size 32 bytes. If the input data type is float16 (2 bytes per element), the number of elements that can be computed simultaneously by one call of the vector computing unit is 128. Calling the vector computing unit many times causes extra overhead, so the AI Core allows one invocation of the vector computing unit to repeat the computation. Due to hardware limitations, the maximum number of repetitions is 255. Therefore, one invocation of the vector computing unit can process up to 255 × 128 = 32640 float16 elements. In addition, the vector computing unit has its own cache, the Unified Buffer (UB), with a size of 256 KB. During computation, the vector computing unit automatically reads data from the UB and writes results back to it, and the data must be 32-byte aligned in the cache. Data in the UB must be carried in from the GM, which has a large capacity; for example, on the Ascend 910 training chip its size is 32 GB.
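The power-iteration procedure described above can be sketched in NumPy as follows (an illustration of the algorithm only; the patent implements it with TIK on the Ascend AI processor, and the function name, iteration count and test matrix here are illustrative assumptions):

```python
import numpy as np

def top_singular_triplet(A, iters=200):
    """Estimate (sigma, u, v) for the largest singular value of A by
    repeatedly multiplying by A and normalizing, as in power iteration."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)            # vector normalization
    for _ in range(iters):
        u = A @ v                     # matrix-by-vector
        u /= np.linalg.norm(u)
        v = u @ A                     # vector-by-matrix (u^T A)
        v /= np.linalg.norm(v)
    sigma = u @ A @ v                 # singular value estimate
    return sigma, u, v

# Matrix with known largest singular value 3.
A = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
sigma, u, v = top_singular_triplet(A)
print(round(sigma, 6))  # → 3.0
```

The iteration converges to the dominant singular triplet because each multiplication by A amplifies the component along the leading singular vector faster than all others.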
For a large tensor, the data blocks need to be transferred from the GM to the UB, then the data blocks are put into the vector calculation unit for calculation, and after the calculation, the data blocks are transferred from the UB to the GM, and the data transfer process is as shown in fig. 4.
The operator implementation process is divided into two stages. In the first stage, the algorithm flow of the SVD is analyzed, and the modules that are computed repeatedly in this flow are designed as sub-functions of the SVD.
Specifically, the subfunctions of the singular value decomposition operator include: a vector normalization subfunction, a matrix-by-vector subfunction, a vector-by-matrix subfunction, and a vector orthogonalization subfunction. The singular value decomposition operator is thus decomposed into 4 sub-functions, which facilitates designing and deploying each sub-function separately and implementing the SVD operation.
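The roles of the four sub-functions can be sketched in NumPy form (the patent builds them from TIK vector instructions; the function names and shapes here are illustrative assumptions, not the patent's interfaces):

```python
import numpy as np

def normalize(v):
    """Vector normalization: return the unit vector and its modulus."""
    norm = np.sqrt(np.sum(v * v))
    return v / norm, norm

def mat_vec(A, v):
    """Matrix-by-vector product."""
    return A @ v

def vec_mat(u, A):
    """Vector-by-matrix product (u^T A)."""
    return u @ A

def orthogonalize(x, q):
    """Vector orthogonalization: subtract from x its component along the
    unit vector q, so the result has inner product 0 with q."""
    return x - (x @ q) * q

q, _ = normalize(np.array([1.0, 1.0]))
y = orthogonalize(np.array([2.0, 0.0]), q)
print(np.isclose(y @ q, 0.0))  # inner product with q is 0
```

Orthogonalization is what allows the power iteration to be repeated for the second and later singular vectors: each new candidate vector is kept orthogonal to the vectors already found.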
In this embodiment, the vector normalization subfunction is used for: and carrying out data transportation and calculation on the blocks of the vector to be normalized to obtain the normalized vector of the vector to be normalized and the modulus of the vector to be normalized.
Specifically, the vector normalization subfunction takes as input a vector to be normalized V_in and returns a normalized vector V_out and the modulus of the vector, where the sum of squares of the elements of V_out is 1. The size of the UB space is 256 KB, of which 248 KB can be used for the vector computing unit, holding 248 × 1024 / 2 = 126976 float16 elements. For the SVD operator, the length of the vectors used is much smaller than this value. Therefore, a vector can be stored entirely in the UB, and the process of moving data from the GM to the UB in batches can be omitted. The specific implementation steps of the vector normalization subfunction are as follows:
(1) In order to reduce the number of data transfers and improve efficiency, the data V_in in the UB is sent to the vector computing unit to participate in computation to the maximum possible extent. The maximum amount of data the vector computing unit can process in one invocation is obtained, and the data in the UB is sent to the vector computing unit block by block based on this maximum. Specifically, for float16 data, the vector computing unit can be filled with 128 elements at a time (for float32, the amount is halved). Since one call of the vector computing unit interface can repeat the computation at most 255 times, the data in the UB can be partitioned into blocks of 255 × 128 = 32640 elements (that is, the maximum amount processed per invocation is 32640, and each block contains 32640 elements) and sent to the vector computing unit for computation. Then, the built-in TIK function vec_mul is called to compute the element-wise product of the vector with itself (with the repeat count set to 255), and the built-in vec_reduce_add function is called to sum the products.
(2) For the remaining data from step (1) that cannot fill a whole block, compute the repeat count (less than 255) for the vector function call, that is, the remaining data amount divided by 128 and rounded down, and then call the vector calculation unit. As before, vec_mul is called to compute the product of each element with itself, and vec_reduce_add is called to accumulate the sum.
(3) Finally, for the data left over from step (2) that cannot fill one pass of the vector calculation unit (fewer than 128 float16 values), only one more vector calculation call is needed. Note that when vec_reduce_add is called, an appropriate mask must be set so that unrelated data is not added into the vector's sum of squares, ensuring a correct result.
(4) Summing the reduction results of the above three steps yields the sum of squares of all elements of the input vector V_in. The scalar calculation unit's interface function scalar_sqrt is then used to take the square root, giving the norm of the vector (the modulus, i.e., the square root of the sum of squares of all elements), and its reciprocal inv_norm is computed. Finally, the vector-scalar product function vec_muls is called and, following the blocking of steps (1) to (3), the input vector is multiplied block by block with inv_norm to obtain the output vector V_out.
Step (1) handles the blocks that need the maximum number of vector-unit repetitions (255), step (2) handles the repetitions fewer than 255, and step (3) handles the remaining data that can be completed in a single pass after step (2). Because the buffer is small and the amount of data the vector calculation unit can process at one time is limited, the data can only be sent to the vector calculation unit in batches, made as large as possible.
vec_mul and vec_reduce_add are underlying function interfaces built into the TIK: vec_mul computes the element-wise product of two vectors, producing a new vector; vec_reduce_add sums all elements within a vector.
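The blocked flow of steps (1) to (4) above can be mirrored in plain NumPy. This is an illustrative sketch only, not the patent's TIK code: the function name `normalize` is hypothetical, and the constants simply mirror the figures in the text (128 lanes per repeat, at most 255 repeats per call).

```python
import numpy as np

# Assumed constants from the text: a float16 vector instruction covers
# 128 elements per repeat, at most 255 repeats per call, so one call
# processes at most 255 * 128 = 32640 elements.
LANES, MAX_REPEAT = 128, 255
BLOCK = LANES * MAX_REPEAT  # 32640

def normalize(v_in):
    """Return (v_out, norm) with v_out = v_in / ||v_in||, mimicking the
    three-stage blocking of the vector normalization subfunction."""
    sq_sum = 0.0
    n = len(v_in)
    # Step (1): full blocks of 32640 elements (repeat = 255).
    full = (n // BLOCK) * BLOCK
    for start in range(0, full, BLOCK):
        chunk = v_in[start:start + BLOCK]
        sq_sum += float(np.sum(chunk * chunk))   # vec_mul + vec_reduce_add
    # Step (2): remaining whole 128-element groups (repeat < 255).
    tail = n - full
    groups = (tail // LANES) * LANES
    chunk = v_in[full:full + groups]
    sq_sum += float(np.sum(chunk * chunk))
    # Step (3): final partial group (< 128 elements) -- the masked call.
    chunk = v_in[full + groups:]
    sq_sum += float(np.sum(chunk * chunk))
    # Step (4): scalar sqrt, reciprocal, then blocked scalar multiply.
    norm = sq_sum ** 0.5
    v_out = v_in * (1.0 / norm)                  # vec_muls, also blocked
    return v_out, norm
```

The three stages correspond one-to-one to the repeat-255 blocks, the sub-255 repeats, and the masked tail described in the steps.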
It should be noted that in the subfunctions and operators described below, data likewise needs to be carried in blocks before computation; the details are not repeated.
In this embodiment, the matrix-by-vector subfunction is used to carry and compute, in batches, the matrix and vector passed to it, obtaining an output vector.
The matrix and vector corresponding to the matrix-by-vector subfunction are the matrix and vector input to it for processing. Specifically, the inputs are a matrix A(M, N) and a vector V_in(N), and the output is a vector V_out(M), where M and N are the dimensions of the matrix. Since the vectors in SVD occupy relatively little space, both input and output vectors can be assumed to reside in UB. The input matrix, however, may be large, so two cases arise: the matrix stored in UB and the matrix stored in GM, suited respectively to computing the SVD of small-scale and large-scale matrices; during computation, data is carried only as needed. Fig. 5 is a schematic diagram of matrix-by-vector according to an embodiment of the present invention; as shown in Fig. 5, one row of the matrix is taken out at a time and its inner product with the vector gives one element of the output vector. The specific implementation steps of the matrix-by-vector subfunction are as follows:
(1) Matrix preprocessing: when the matrix is stored in UB, the 32-byte alignment constraint must be observed, and the number of columns N must be padded up to 32-byte alignment (usually with zeros); denote the padded column count by N_ex. Here N is the number of columns of the original matrix and N_ex is the expanded column count after byte alignment: for float16 data, 32-byte alignment makes N_ex a multiple of 16; for float32 data, N_ex is a multiple of 8. Specifically, for float16 data, one byte is 8 bits, so 32 bytes are 32 × 8 = 256 bits, which is exactly 256 / 16 = 16 values of the float16 type (16 bits each). When the matrix is stored in GM, this restriction does not apply.
(2) Through the TIK for_range loop (with loop variable i, meaning the loop body executes while i is within a given range; for_range can recognize scalar data, i.e., data stored in the scalar calculation unit of the Ascend AI processor), one row of matrix A is extracted for computation per iteration. Viewing matrix A as a one-dimensional vector, when the matrix is in UB the starting offset of the ith row is offset = i × N_ex; when the matrix is in GM, the offset is offset = i × N. For the case where the matrix is stored in GM, the TIK built-in data_move function must be called to copy the data of the ith row into UB. Let the vector fetched to participate in the next computation be V_tmp.
(3) Compute the inner product of V_tmp (a row of matrix A) with V_in: call vec_mul to compute the products of corresponding elements, then call vec_reduce_add to accumulate them. When these two functions are called, they still need to be reused multiple times via the repeat parameter. Further, when the reduction function vec_reduce_add is called, attention must be paid to the data tail: when the remaining data cannot fill one pass of the vector calculation unit, the mask parameter is used to ensure the accumulated result is correct.
The repeat parameter limits the number of times the vector calculation unit repeats a computation and can be set and adjusted according to actual needs. Since each call to the vector calculation unit carries a certain time overhead, the Ascend processor provides the ability to repeat a computation multiple times within one call; repeat is a parameter of the vector calculation unit's function interface (upper limit 255). For example, with 255 × 128 + 256 + 20 values to process, first send the maximum amount the vector calculation unit can handle, i.e., the maximum repeat count of 255 times the single-pass fill amount of 128; for the next 256 values, repeat need only be set to 2 (since 128 values are processed per pass); for the last 20 values, repeat is set to 1, but since 20 values cannot fill one pass of the vector calculation unit, the mask parameter must be set to 20 to tell the unit that only the first 20 values need be computed. The overall goal is to minimize the number of data transfers through the repeat parameter settings, since every transfer costs time. The mask parameter limits the number of elements the vector calculation unit considers in the result.
(4) Loop through steps (2) to (3) M times, placing each result into V_out[i]. Here i denotes the ith iteration; i may start at 0 (consistent with the row offset i × N_ex above) and increase by 1 each iteration until all rows of matrix A have been computed.
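The matrix-by-vector steps above can be sketched in NumPy. This is illustrative only, not the patent's TIK code: the names `padded_cols` and `mat_vec` are hypothetical, and the padding helper simply encodes the alignment rule stated in step (1).

```python
import numpy as np

def padded_cols(n, dtype_bytes):
    """Round column count n up to 32-byte alignment (N_ex from step (1)).

    float16 (2 bytes) -> multiple of 16; float32 (4 bytes) -> multiple of 8.
    """
    per_block = 32 // dtype_bytes            # elements per 32-byte block
    return ((n + per_block - 1) // per_block) * per_block

def mat_vec(A, v_in):
    """V_out[i] = <row i of A, v_in>, one row fetched per loop iteration
    (in TIK: a for_range loop, data_move for GM-resident matrices, then
    vec_mul + vec_reduce_add with repeat/mask handling)."""
    M, N = A.shape
    v_out = np.empty(M)
    for i in range(M):                       # loop over the M rows
        row = A[i]                           # offset i*N (GM) or i*N_ex (UB)
        v_out[i] = np.sum(row * v_in)        # inner product of the row
    return v_out
```

Zero padding of the extra columns leaves every inner product unchanged, which is why step (1) can pad with zeros safely.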
In this embodiment, the vector-by-matrix subfunction is used to carry and compute, in batches, the vector and matrix passed to it, obtaining an output vector.
The vector and matrix corresponding to the vector-by-matrix subfunction are the vector and matrix input to it for processing. Specifically, the inputs are a vector V_in(M) and a matrix A(M, N), and the output is V_out(N). As with the matrix-by-vector operator, the matrix may be stored in UB or in GM; this is not repeated here. Note that vector-by-matrix is computed differently from matrix-by-vector, because the matrix is stored linearly by rows and accessing it row by row is most efficient. Fig. 6 is a schematic diagram of vector-by-matrix according to an embodiment of the present invention; as shown in Fig. 6, all elements of each row of the matrix are multiplied by the corresponding element of the vector (e.g., V1, V2, V3), producing a new vector, and all such vectors are then accumulated to give the output vector V_out. The specific implementation steps of the vector-by-matrix subfunction are as follows:
(1) Set up a loop of M iterations with loop variable i (starting from 0), where M is the number of rows of matrix A; in each iteration, take out the ith element of V_in and call vec_muls to multiply it with the ith row vector of matrix A.
(2) Call the TIK built-in vec_add function to accumulate the scaled row vectors of matrix A; the data-tail issue need not be considered here.
(3) After the M iterations of steps (1) and (2), the output V_out is obtained.
It should be noted that vec_add is a TIK built-in function interface that sums two vectors, yielding a vector; vec_reduce_add sums all elements within a vector, yielding a single number (a scalar).
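The vector-by-matrix accumulation above can be sketched in NumPy as follows. This is an illustrative sketch, not the patent's TIK code; the name `vec_mat` is hypothetical.

```python
import numpy as np

def vec_mat(v_in, A):
    """V_out = v_in(M) * A(M, N): scale each row of A by the matching
    element of v_in (vec_muls) and accumulate the scaled rows (vec_add),
    exploiting the row-major storage of the matrix."""
    M, N = A.shape
    v_out = np.zeros(N)
    for i in range(M):                 # one iteration per row of A
        v_out += v_in[i] * A[i]        # vec_muls then vec_add
    return v_out
```

Accumulating whole scaled rows avoids any reduction across a row, which is why no data-tail handling is needed in step (2).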
In this embodiment, the vector orthogonalization subfunction is used to compute an output vector from its input vector and normalized vector, such that the inner product of the normalized vector and the output vector is 0.
Specifically, the vector orthogonalization subfunction takes a vector V1_in and a normalized vector V2_in and outputs a vector V_out, where the inner product of V2_in and V_out is required to be 0. The specific implementation steps of the vector orthogonalization subfunction are as follows:
(1) Call the TIK built-in function vec_mul to compute the element-wise product of V1_in and V2_in, obtaining a temporary vector V_tmp.
(2) Call the TIK built-in function vec_reduce_add to accumulate the values of V_tmp, obtaining sum_tmp (the inner product of V1_in and V2_in). Attention must be paid to the data tail, i.e., the data that cannot fill a whole pass of the vector calculation unit after the original block of data is partitioned; tail handling follows the description of steps (2) and (3) in the vector normalization subfunction.
(3) Call the TIK built-in function vec_muls to multiply the vector V2_in by the scalar sum_tmp, giving the component of V1_in along V2_in.
(4) Call the TIK built-in function vec_sub to subtract the result of (3) from V1_in, obtaining V_out = V1_in - sum_tmp × V2_in.
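The four steps above are a single Gram-Schmidt projection step, which can be sketched in NumPy as below. The name `orthogonalize` is illustrative; the result is orthogonal to V2_in provided V2_in is normalized, matching how the operator is used against the already-normalized rows of U and V in the power iteration.

```python
import numpy as np

def orthogonalize(v1_in, v2_in):
    """Gram-Schmidt step: V_out = V1_in - <V1_in, V2_in> * V2_in.

    Assuming ||v2_in|| = 1, the output satisfies <v2_in, V_out> = 0.
    """
    sum_tmp = float(np.sum(v1_in * v2_in))   # vec_mul + vec_reduce_add
    return v1_in - sum_tmp * v2_in           # vec_muls then vec_sub
```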
Specifically, in this embodiment, constructing the singular value decomposition operator based on all the subfunctions comprises: completing the writing of the singular value decomposition operator through the TIK based on the vector normalization subfunction, the matrix-by-vector subfunction, the vector-by-matrix subfunction, and the vector orthogonalization subfunction, thereby obtaining an efficient singular value decomposition operator.
Further, acquiring the data to be processed and performing the singular value decomposition operation on it based on the deployed singular value decomposition operator comprises: acquiring the data to be processed and calling the subfunctions of the singular value decomposition operator to perform the singular value decomposition operation based on the power iteration method.
In this embodiment, the singular value decomposition operation is specifically performed by the following steps:
define the maximum number of iterations per singular value, max_iter, and a numerical precision constant eps;
define left-eigenvector temporary variables L(M) and L'(M) and right-eigenvector temporary variables R(N) and R'(N);
define the singular value vector S(K), with the left and right eigenvector matrices U(K, M) and V(K, N) respectively;
define diff as the modulus of the difference between the left eigenvectors of two successive iterations, and norm as the normalization factor of the vectors during iteration;
input the matrix A(M, N) to be decomposed and a parameter K, where K specifies the first K largest singular values in the decomposition, i.e., the number of singular values to compute;
let ind take the values 0, 1, …, K-1 in turn;
randomly initialize the left eigenvector L, and reset diff to 1.0;
start the singular value iteration, with the iteration count iter taking the values 0, 1, 2, …, max_iter-1, max_iter in turn:
if iter equals max_iter, exit the current loop and proceed to the next step; otherwise execute the following:
call the vector-by-matrix operator with inputs matrix A and vector L, obtaining the vector R';
call the vector normalization operator with input R', returning the normalized R' and the normalization factor norm;
if the normalization factor norm is smaller than the precision eps, exit the current loop;
call the vector orthogonalization operator to make R' orthogonal to the leading ind rows of the right eigenvector matrix V, inputting R' and one of the leading ind rows of V each time, and outputting R';
call the vector normalization operator with input R', returning the normalized R' and the normalization factor norm;
call the matrix-by-vector operator with inputs matrix A and vector R', outputting L';
call the vector normalization operator with input L', returning the normalized L' and the normalization factor norm;
store norm into s;
if the current normalization factor norm is smaller than the precision eps, exit the current loop and proceed to the next step; otherwise continue the loop;
call the vector orthogonalization operator to make L' orthogonal to the leading ind rows of the left eigenvector matrix U, inputting L' and one of the leading ind rows of U each time, and outputting L';
call the vector normalization operator with input L', returning the normalized L' and the normalization factor norm;
compute diff, the modulus of the difference between L and L';
replace the values of L and R with L' and R' respectively;
if the value of s is greater than the precision eps:
store s into S[ind];
store the left eigenvector L into row ind of the left eigenvector matrix U;
store the right eigenvector R into row ind of the right eigenvector matrix V;
return the singular value vector S, the left eigenvectors U, and the right eigenvectors V.
Here max_iter is the maximum number of iterations for computing each singular value by the power iteration method, and eps is the numerical calculation precision. If the precision eps is reached early, the iteration terminates and the next singular value is computed. If the number of iterations exceeds max_iter without reaching the specified precision eps, the iteration for the current singular value is terminated directly and the next singular value is computed. K is user-definable, with K ≤ min(M, N). L(M) and L'(M) are vectors of length M; likewise, R(N) and R'(N) are vectors of length N; they represent eigenvectors of the matrix. diff is the modulus of the difference between the left eigenvectors of two successive iterations; norm is the modulus of an eigenvector, which in the SVD power iteration method is exactly a singular value of the matrix being decomposed. s is the variable that holds a singular value of the matrix. L, the left eigenvector, may in principle be initialized randomly, but the TIK provides no random-function interface, so the actual computation uses a normalized all-ones vector instead. Normalizing R'(N) modifies the value of R'(N) directly in place. In the SVD decomposition, K specifies that the first K largest singular values are to be computed; ind takes values from 0 to K-1, indexing the largest singular value, the second largest, and so on.
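The overall power-iteration flow above can be mirrored in plain NumPy. This is a sketch under stated assumptions, not the patent's TIK operator: the function name `svd_power_iteration` is hypothetical, and the deflation-by-orthogonalization and convergence checks are one reasonable reading of the listed steps.

```python
import numpy as np

def svd_power_iteration(A, K, max_iter=200, eps=1e-8):
    """Top-K SVD via power iteration with deflation by orthogonalization.

    Returns (S, U, V) with rows of U/V the left/right singular vectors,
    so that A ~= sum_k S[k] * outer(U[k], V[k]). The random start is
    replaced by a normalized all-ones vector, as the text describes."""
    M, N = A.shape
    S, U, V = np.zeros(K), np.zeros((K, M)), np.zeros((K, N))
    for ind in range(K):
        L = np.ones(M) / np.sqrt(M)              # normalized all-1 vector
        s = 0.0
        for _ in range(max_iter):
            R = L @ A                            # vector-by-matrix
            R -= V[:ind].T @ (V[:ind] @ R)       # orthogonalize vs found V rows
            norm = np.linalg.norm(R)
            if norm < eps:
                break
            R /= norm                            # vector normalization
            L_new = A @ R                        # matrix-by-vector
            norm = np.linalg.norm(L_new)
            if norm < eps:
                break
            s = norm                             # norm is the singular value
            L_new /= norm
            L_new -= U[:ind].T @ (U[:ind] @ L_new)  # orthogonalize vs U rows
            L_new /= np.linalg.norm(L_new)
            diff = np.linalg.norm(L - L_new)     # change between iterations
            L = L_new
            if diff < eps:
                break                            # converged to precision eps
        if s > eps:
            S[ind], U[ind], V[ind] = s, L, R
    return S, U, V
```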
Thus, the SVD scheme based on the Ascend AI processor supports matrices of arbitrary size, and by specifying K one can flexibly decide to compute the SVD for the first K largest singular values of the matrix, providing the necessary support for applications on the Ascend AI processor that depend on an SVD operator.
It should be noted that the SVD operator can be applied in the following scenarios, for example. Matrix inversion: for the SVD of matrix A, A = U S V^T, the inverse of A can be expressed as A^-1 = V S^-1 U^T; since S is a diagonal matrix, inverting it amounts to taking the reciprocal of every diagonal element, and matrix inversion is widely used in matrix computation. Matrix approximation: after the SVD of matrix A, the singular values are arranged from largest to smallest along the diagonal of matrix S, and their magnitudes reflect the weight of the corresponding eigenvectors in the whole matrix; therefore the K largest singular values of the decomposition can be selected to approximate the original matrix. Tensor network computation: in tensor networks, SVD is a very common computational step; by truncating the singular values of a high-dimensional tensor, SVD can greatly reduce the amount of computation and accelerate the tensor network calculation while maintaining high precision. Other applications include: solving homogeneous systems of linear equations, total least squares, separable models, signal processing, image processing, quantum information, recommendation systems, and so on. In this way, the computing power of the Ascend AI processor can be used effectively to increase data processing speed. The singular value decomposition operation implementation method of this embodiment uses the TIK, based on the characteristics of the Ascend AI processor, to implement a high-performance SVD operator comprising four basic subfunctions. Based on the characteristics of the vector calculation unit, the data carrying, memory access, and computation of the four subfunctions are finely controlled, exerting the computing capability of the AI Core to the greatest extent while guaranteeing the accuracy of the calculation results.
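The inversion and approximation applications above can be illustrated with NumPy's reference SVD (numpy.linalg.svd), not the TIK operator itself; the example matrix is arbitrary.

```python
import numpy as np

A = np.array([[4.0, 0.0],
              [3.0, -5.0]])
U, s, Vt = np.linalg.svd(A)

# Matrix inversion: A = U S V^T  =>  A^-1 = V S^-1 U^T; inverting the
# diagonal S is just taking reciprocals of its diagonal elements.
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T

# Matrix approximation: keep only the K largest singular values.
K = 1
A_approx = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
```

By the Eckart-Young theorem, the rank-K truncation is the best rank-K approximation, with Frobenius error equal to the first discarded singular value here.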
In this embodiment, the performance of the singular value decomposition operation implementation method is illustrated by specific test results, shown in the following table:
specifically, the singular value decomposition operation implementation method provided by the embodiment of the present invention can obtain very high accuracy under the condition of a given data type, for example, by adopting the float16 data type, the calculated singular value can reach 2-bit significant digits. If greater accuracy is required, it may be convenient to change to the float32 data type. As shown in the above table, compared with the method written by C + + or numpy, the SVD operator written by TIK provided by the embodiment of the present invention has better performance and lower decomposition time consumption, and the advantage of decomposition time consumption is more obvious with the increase of the size of the matrix to be decomposed. For example, when the input matrix is 128 × 256, the SVD performance written in TIK is more than 20 times that of the C + + version (running on the aarch64 CPU).
As shown in fig. 7, in correspondence to the singular value decomposition operation implementation method, an embodiment of the present invention further provides a singular value decomposition operation implementation device, where the singular value decomposition operation implementation device includes:
An operator constructing module 410, configured to construct a singular value decomposition operator, where the singular value decomposition operator is used to carry data in the target device and perform the singular value decomposition operation.
In this embodiment, the singular value decomposition operator is a function that can be compiled and run in the target device. The singular value decomposition operator is used for performing singular value decomposition operation on the matrix.
An operator deployment module 420, configured to deploy the singular value decomposition operator to the target device, wherein the target device is an Ascend AI processor.
Specifically, the singular value decomposition operator deployed in the target device can be compiled and run, so that the target device can call it to carry data and perform the singular value decomposition operation. In this embodiment, the target device is an Ascend AI processor; in practical use, the target device may also be another processor or device, which is not specifically limited here.
An operation module 430, configured to acquire data to be processed and perform the singular value decomposition operation on it based on the deployed singular value decomposition operator.
The data to be processed is data that needs singular value decomposition, for example, a matrix to be decomposed. In one application scenario, the matrix to be decomposed can be input to the Ascend AI processor as a data stream, and the stream is divided by presetting or inputting in real time the numbers of rows and columns of the matrix, so as to determine the corresponding matrix.
As can be seen from the above, the singular value decomposition operation implementation apparatus provided by the embodiment of the present invention can be used to: construct a singular value decomposition operator, where the operator is used to carry data in the target device and perform the singular value decomposition operation; deploy the singular value decomposition operator to the target device, the target device being an Ascend AI processor; and acquire data to be processed and perform the singular value decomposition operation on it based on the deployed operator. Compared with the prior art, the scheme of the present invention constructs a singular value decomposition operator capable of carrying data and performing the singular value decomposition operation within the Ascend AI processor and deploys it to the Ascend AI processor, which helps to fully utilize the computing power of the Ascend AI processor and to perform the SVD operation on the data to be processed directly on the Ascend AI processor.
Specifically, the operator constructing module 410 is configured to construct the subfunctions of the singular value decomposition operator based on the algorithm flow of singular value decomposition, where the algorithm flow corresponds to the power iteration method.
It should be noted that, the specific functions corresponding to the singular value decomposition operation implementation apparatus and the specific modules thereof may be set and adjusted by referring to the singular value decomposition operation implementation method, which is not described herein again.
Based on the above embodiment, the present invention further provides an intelligent terminal, and a schematic block diagram thereof may be as shown in fig. 8. The intelligent terminal comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. Wherein, the processor of the intelligent terminal is used for providing calculation and control capability. The memory of the intelligent terminal comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a singular value decomposition operation implementation program. The internal memory provides an environment for an operating system and a singular value decomposition operation implementation program in the nonvolatile storage medium to run. The network interface of the intelligent terminal is used for being connected and communicated with an external terminal through a network. The singular value decomposition operation implementation program is executed by a processor to implement the steps of any one of the singular value decomposition operation implementation methods. The display screen of the intelligent terminal can be a liquid crystal display screen or an electronic ink display screen.
It will be understood by those skilled in the art that the block diagram of fig. 8 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the intelligent terminal to which the solution of the present invention is applied, and a specific intelligent terminal may include more or less components than those shown in the figure, or combine some components, or have different arrangements of components.
In one embodiment, an intelligent terminal is provided, where the intelligent terminal includes a memory, a processor, and a singular value decomposition operation implementation program stored on the memory and executable on the processor, and the singular value decomposition operation implementation program, when executed by the processor, performs the following operation instructions:
constructing a singular value decomposition operator, wherein the singular value decomposition operator is used for carrying data in target equipment and carrying out singular value decomposition operation;
deploying the singular value decomposition operator to the target device, wherein the target device is an Ascend AI processor;
and acquiring data to be processed, and performing singular value decomposition operation on the data to be processed based on the deployed singular value decomposition operator.
The embodiment of the present invention further provides a computer-readable storage medium, where a singular value decomposition operation implementation program is stored in the computer-readable storage medium, and when being executed by a processor, the singular value decomposition operation implementation program implements any one of the steps of the singular value decomposition operation implementation method provided in the embodiment of the present invention.
Optionally, the intelligent terminal and the computer-readable storage medium may also store a singular value decomposition operation implementation program to implement the steps of the singular value decomposition operation implementation method.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art would appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the above modules or units is only one logical division, and the actual implementation may be implemented by another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The integrated modules/units described above, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and can implement the steps of the embodiments of the method when the computer program is executed by a processor. The computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the above-mentioned computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signal, telecommunication signal, software distribution medium, etc. It should be noted that the contents contained in the computer-readable storage medium can be increased or decreased as required by legislation and patent practice in the jurisdiction.
The above-mentioned embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.
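As background to the claims that follow: the operation the claimed operator performs on-device is singular value decomposition. Purely for orientation — this is an illustrative sketch, not the algorithm disclosed or claimed in this patent — the one-sided Jacobi method, a classic hardware-friendly formulation of SVD, can be written in plain Python (the name `jacobi_svd` and all parameters are hypothetical):

```python
import math

def jacobi_svd(a, sweeps=30, tol=1e-12):
    """Illustrative one-sided Jacobi SVD of an m x n matrix (list of rows).

    Returns (u, s, v) with a ~= u @ diag(s) @ v^T; singular values are unsorted.
    This is a teaching sketch, not the operator implementation from the patent.
    """
    m, n = len(a), len(a[0])
    u = [row[:] for row in a]  # working copy; its columns are orthogonalized in place
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = 0.0  # largest normalized off-diagonal seen this sweep
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(u[i][p] * u[i][p] for i in range(m))
                beta = sum(u[i][q] * u[i][q] for i in range(m))
                gamma = sum(u[i][p] * u[i][q] for i in range(m))
                if alpha * beta > 0.0:
                    off = max(off, abs(gamma) / math.sqrt(alpha * beta))
                if abs(gamma) < tol:
                    continue
                # Jacobi rotation that annihilates the (p, q) inner product.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):  # rotate columns p and q of the working matrix
                    up, uq = u[i][p], u[i][q]
                    u[i][p], u[i][q] = c * up - s * uq, s * up + c * uq
                for i in range(n):  # accumulate the same rotation into v
                    vp, vq = v[i][p], v[i][q]
                    v[i][p], v[i][q] = c * vp - s * vq, s * vp + c * vq
        if off < tol:
            break
    # Singular values are the column norms; normalize columns to get u.
    sing = [math.sqrt(sum(u[i][j] ** 2 for i in range(m))) for j in range(n)]
    for j in range(n):
        if sing[j] > 0.0:
            for i in range(m):
                u[i][j] /= sing[j]
    return u, sing, v
```

One-sided Jacobi is often cited for accelerator implementations because each column-pair rotation is an independent, vectorizable update; whether the claimed operator uses it is not stated in this excerpt.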
Claims (13)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111040096.0A CN113885941B (en) | 2021-09-06 | 2021-09-06 | A method, device and related equipment for implementing singular value decomposition operation |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111040096.0A CN113885941B (en) | 2021-09-06 | 2021-09-06 | A method, device and related equipment for implementing singular value decomposition operation |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113885941A true CN113885941A (en) | 2022-01-04 |
| CN113885941B CN113885941B (en) | 2025-08-08 |
Family
ID=79008355
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111040096.0A Active CN113885941B (en) | 2021-09-06 | 2021-09-06 | A method, device and related equipment for implementing singular value decomposition operation |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113885941B (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114327630A (*) | 2022-01-05 | 2022-04-12 | Peking University | A high-performance operator generation method suitable for Huawei Ascend chips |
| CN115550118A (*) | 2022-09-19 | 2022-12-30 | Peng Cheng Laboratory | Digital signal processing method and related equipment |
| CN116610513A (*) | 2023-07-20 | 2023-08-18 | Sichuan Huakun Zhenyu Intelligent Technology Co., Ltd. | Method and system for automatically constructing and verifying an Ascend environment |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107506173A (*) | 2017-08-30 | 2017-12-22 | Zhengzhou Yunhai Information Technology Co., Ltd. | Acceleration method, apparatus and system for singular value decomposition operations |
| US10326511B1 (en) * | 2018-01-12 | 2019-06-18 | Samsung Electronics Co., Ltd | Apparatus and method of non-iterative singular-value decomposition |
| CN113065639A (*) | 2021-03-08 | 2021-07-02 | Shenzhen Intellifusion Technologies Co., Ltd. | Operator fusion method, system, device and storage medium |
- 2021-09-06: CN CN202111040096.0A patent/CN113885941B/en active Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107506173A (*) | 2017-08-30 | 2017-12-22 | Zhengzhou Yunhai Information Technology Co., Ltd. | Acceleration method, apparatus and system for singular value decomposition operations |
| US10326511B1 (en) * | 2018-01-12 | 2019-06-18 | Samsung Electronics Co., Ltd | Apparatus and method of non-iterative singular-value decomposition |
| CN113065639A (*) | 2021-03-08 | 2021-07-02 | Shenzhen Intellifusion Technologies Co., Ltd. | Operator fusion method, system, device and storage medium |
Non-Patent Citations (2)
| Title |
|---|
| AI科技大本营 (AI Tech Base Camp): "Demystifying the Huawei Ascend AI Processor Architecture! \| Huawei Ascend Faculty Training Salon, Wuhan Session", pages 1 - 11, Retrieved from the Internet <URL:https://blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/108162027> * |
| 张超 (Zhang Chao): "Application of singular value decomposition in ground-penetrating radar data processing", 南方农机 (South China Agricultural Machinery), vol. 50, no. 2019, 15 December 2019 (2019-12-15), page 49 * |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114327630A (*) | 2022-01-05 | 2022-04-12 | Peking University | A high-performance operator generation method suitable for Huawei Ascend chips |
| CN114327630B (*) | 2022-01-05 | 2023-02-10 | Peking University | A high-performance operator generation method suitable for Huawei Ascend chips |
| CN115550118A (*) | 2022-09-19 | 2022-12-30 | Peng Cheng Laboratory | Digital signal processing method and related equipment |
| CN115550118B (*) | 2022-09-19 | 2024-06-25 | Peng Cheng Laboratory | Digital signal processing method and related equipment |
| CN116610513A (*) | 2023-07-20 | 2023-08-18 | Sichuan Huakun Zhenyu Intelligent Technology Co., Ltd. | Method and system for automatically constructing and verifying an Ascend environment |
| CN116610513B (*) | 2023-07-20 | 2023-09-26 | Sichuan Huakun Zhenyu Intelligent Technology Co., Ltd. | Method and system for automatically constructing and verifying an Ascend environment |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113885941B (en) | 2025-08-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11574195B2 (en) | Operation method | |
| US20230267319A1 (en) | Training neural network accelerators using mixed precision data formats | |
| Gschwend | ZynqNet: An FPGA-accelerated embedded convolutional neural network | |
| CN107844828B (en) | Convolutional Computational Methods and Electronic Devices in Neural Networks | |
| CN113885941A (en) | A method, device and related equipment for realizing singular value decomposition operation | |
| CN108108811B (en) | Convolutional Computational Methods and Electronic Devices in Neural Networks | |
| US20180260710A1 (en) | Calculating device and method for a sparsely connected artificial neural network | |
| CN111240640A (en) | Data quantization method and device based on hardware environment and readable storage medium | |
| CN111027691A (en) | Computing device for neural network operation and integrated circuit board card thereof | |
| US11366875B2 (en) | Method and device for matrix multiplication optimization using vector registers | |
| JP2022116266A (en) | Neural network processing unit, neural network processing method and apparatus | |
| EP4446909A1 (en) | Operator processing method and computer device | |
| Cho et al. | FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks | |
| CN113888390A (en) | Feature map processing method and device, electronic equipment and computer readable medium | |
| Tsai et al. | Performance-portable autotuning of opencl kernels for convolutional layers of deep neural networks | |
| CN113887730B (en) | Quantum simulator realization method, quantum simulator realization device, related equipment and quantum simulation method | |
| CN114357371B (en) | A matrix data processing method, device, electronic device and storage medium | |
| CN111144559A (en) | Apparatus, method and integrated circuit board for training neural network | |
| US20230325464A1 (en) | Hpc framework for accelerating sparse cholesky factorization on fpgas | |
| CN117540137A (en) | Symmetrical matrix reading calculation method and equipment applied to parallel system | |
| CN114327630B (en) | A high-performance operator generation method suitable for Huawei Ascend chips | |
| CN114548352A (en) | Matrix data processing method and device, electronic equipment and storage medium | |
| Stinson | Deep Learning with Go | |
| Balamurali | A Tunable Accelerator for the YOLOv4-tiny Object Detector using Vitis Unified Software Platform | |
| EP4567634A1 (en) | Irregular cadence data processing units |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |