CN120526066B - Panoramic image three-dimensional reconstruction method based on 3DGS - Google Patents

Panoramic image three-dimensional reconstruction method based on 3DGS

Info

Publication number
CN120526066B
CN120526066B (application CN202511030517.XA)
Authority
CN
China
Prior art keywords
gaussian
image
view
primitive
panoramic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202511030517.XA
Other languages
Chinese (zh)
Other versions
CN120526066A (en)
Inventor
刘哲强
王俊杰
冯锦鸿
梁晨
白晓阳
张傲宇
徐鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanjian Suzhou Technology Co ltd
Original Assignee
Lanjian Suzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanjian Suzhou Technology Co ltd filed Critical Lanjian Suzhou Technology Co ltd
Priority to CN202511030517.XA priority Critical patent/CN120526066B/en
Publication of CN120526066A publication Critical patent/CN120526066A/en
Application granted granted Critical
Publication of CN120526066B publication Critical patent/CN120526066B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Processing (AREA)

Abstract


This invention provides a method for 3D reconstruction of panoramic images based on 3DGS, which relates to the fields of computer vision and image processing. The method comprises: Step 1: Acquiring a sequence of panoramic images of the scene to be reconstructed and uniformly encoding them via equirectangular projection to obtain a coded dataset; Step 2: Performing panoramic structure-from-motion processing on the coded dataset to generate six-degree-of-freedom camera pose parameters for each image and a corresponding sparse 3D point cloud. This method achieves high-precision and high-efficiency 3D reconstruction of panoramic scenes, capable of outputting high-quality panoramic images or dense 3D models, improving both the quality and efficiency of 3D reconstruction.

Description

Panoramic image three-dimensional reconstruction method based on 3DGS
Technical Field
The invention relates to the technical field of computer vision and image processing, and in particular to a panoramic image three-dimensional reconstruction method based on 3DGS.
Background
With the development of VR, AR and digital twin technologies, three-dimensional reconstruction based on panoramic images has broad prospects in scenarios such as city modeling, but the prior art still leaves room for improvement:
Feature point extraction and outlier rejection still depend to a certain extent on traditional algorithms; feature matching accuracy can be affected in partially occluded areas, repetitive-texture or weak-texture scenes; and stability in scenes with a large viewing-angle span can also be improved.
When handling extreme viewing angles, irregularly structured scenes or regions with large parallax, point cloud density and structural definition leave room for optimization. Under the influence of regional image distortion, the reconstruction accuracy of the prior art may be reduced; in repeated-scene reconstruction, resource utilization efficiency needs to be improved and convergence accelerated; the spatial consistency between the traditional projection model and the panoramic viewing angle is insufficient; and the Gaussian modeling process faces challenges such as the need to strengthen detail control and further optimize redundant points.
Disclosure of Invention
The invention aims to solve the technical problem of providing a panoramic image three-dimensional reconstruction method based on 3DGS, which effectively improves the accuracy and efficiency of panoramic image three-dimensional reconstruction and reduces the consumption of training resources.
In order to solve the technical problems, the technical scheme of the invention is as follows:
In a first aspect, a 3DGS-based panoramic image three-dimensional reconstruction method, the method comprising:
Step 1, collecting a panoramic image sequence of a scene to be reconstructed, and uniformly encoding it through equidistant columnar projection (equirectangular projection) to obtain an encoded data set;
Step 2, performing panoramic structure-from-motion (SfM) processing on the encoded data set to generate six-degree-of-freedom camera pose parameters and a corresponding sparse three-dimensional point cloud for each image;
Step 3, initializing a 3D Gaussian primitive set based on a sparse three-dimensional point cloud, endowing each three-dimensional point with a position, a rotation parameter, a scale initial value, a color parameter and an opacity parameter, and fusing the history reconstruction result with the current Gaussian primitive set through pose registration when the history reconstruction result overlapped with the current scene exists, so as to obtain a fused Gaussian primitive set;
Step 4, performing a distortion correction operation on the top or bottom region of the panoramic image sequence by adopting local spherical remapping or cube projection reconstruction to obtain a corrected image set, and performing iterative training based on the fused Gaussian primitive set and the corrected image set to obtain an optimized Gaussian primitive set;
Step 5, performing panoramic rendering on the optimized Gaussian primitive set, and outputting a panoramic image or a dense three-dimensional model.
Further, collecting a panoramic image sequence of a scene to be reconstructed, uniformly encoding through equidistant columnar projections, and obtaining an encoded data set, including:
acquiring a multi-view spherical image sequence of a scene through a panoramic camera;
Converting each spherical image into a two-dimensional plane image by adopting equidistant columnar projection, and establishing a mapping relation between spherical view angle coordinates and plane pixel coordinates;
and carrying out resolution standardization processing on all the converted images to generate a coding data set with a uniform format.
Further, performing panoramic structure-from-motion processing on the encoded data set to generate six-degree-of-freedom camera pose parameters and corresponding sparse three-dimensional point clouds for each image, including:
extracting feature points with panorama invariance from images in the unified format coding data set, and screening matching point pairs according to the similarity of spherical viewing angles to obtain screened matching point pairs;
Based on the screened matching point pairs, solving the camera pose through an incremental structure-from-motion algorithm to generate initial three-dimensional point coordinates;
Jointly optimizing the camera pose and three-dimensional point coordinates by bundle adjustment, checking the panoramic projection consistency of each three-dimensional point, and outputting an optimized camera pose parameter set and a sparse three-dimensional point cloud satisfying the consistency constraint.
Further, based on the screened matching point pairs, solving the camera pose through an incremental structure-from-motion algorithm to generate initial three-dimensional point coordinates, including:
Initializing the pose of the first camera based on the screened matching point pair by taking the camera position of the first image as the origin of the world coordinate system, and marking the corresponding view as a registered view;
Selecting a new view with common view characteristic points with the registered view from the rest unregistered views, and solving a rotation matrix and a translation vector of the new view relative to a world coordinate system through a PnP algorithm to obtain a six-degree-of-freedom camera pose;
and performing triangulation operation on the new view and the registered view based on the pose of the new view and the common view feature points, and generating initial three-dimensional point coordinates bound to a world coordinate system.
Further, initializing a 3D gaussian primitive set based on a sparse three-dimensional point cloud, assigning a position, a rotation parameter, a scale initial value, a color parameter and an opacity parameter to each three-dimensional point, and fusing a history reconstruction result with a current gaussian primitive set through pose registration when the history reconstruction result overlapped with a current scene exists, so as to obtain a fused gaussian primitive set, including:
Converting the sparse three-dimensional point cloud into an initial Gaussian primitive set, wherein each Gaussian primitive inherits the position and color parameters of a three-dimensional point, and initializes rotation parameters into a non-rotation state, a microscopic cube in scale and an opacity of 0.8;
When the history reconstruction model is detected, performing feature matching based on the initial Gaussian primitive set, calculating the re-projection error of the history Gaussian primitive by using the camera pose parameter, and eliminating invalid Gaussian with the error exceeding 2 pixels;
and merging the history Gaussian primitives with effective verification with the initial Gaussian primitive set, and outputting a final merged Gaussian primitive set.
Further, performing distortion correction operation on top or bottom regions of the panoramic image sequence, and performing iterative training based on the fused Gaussian primitive set and the corrected image set by adopting local spherical remapping or cube projection reconstruction, performing densification control on the Gaussian primitives based on the spatial gradient of the panoramic screen during training, and stopping training when reaching an upper iteration limit to obtain an optimized Gaussian primitive set, wherein the method comprises the following steps:
the method comprises the steps of (1) reconstructing a top or bottom region of a panoramic image sequence by adopting local spherical remapping or cube projection, and outputting a correction image set;
And correcting the image set based on the fused Gaussian primitive set, performing iterative training, and stopping training and outputting the optimized Gaussian primitive set when the iteration times reach the upper limit.
Further, based on the corrected image set of the fused Gaussian primitive set, performing iterative training, and when the iteration number reaches an upper limit, terminating the training and outputting an optimized Gaussian primitive set, including:
taking the corrected image set as training input data, and initializing parameters of the Gaussian primitive set after fusion;
Randomly selecting a training view and a corresponding camera pose from the corrected image set, and generating a simulation view through an equidistant columnar projection rendering pipeline based on the camera pose and the current Gaussian primitive set;
calculating a loss function between the simulation view and the real training view, and updating the position, rotation, scale, color and opacity parameters of the Gaussian primitives according to the back propagation of the loss function to obtain an updated Gaussian primitive set;
Based on the updated Gaussian primitive set, calculating the gradient amplitude of each Gaussian primitive in the panoramic screen space, if the gradient amplitude is larger than a splitting threshold value, executing splitting operation to generate sub-Gaussian, if the gradient amplitude is smaller than a cloning threshold value, executing cloning operation, and if the gradient amplitude is equal to the cloning threshold value, keeping the original state;
and when the iteration times reach a preset upper limit, stopping training and outputting the optimized Gaussian primitive set.
Further, performing panorama rendering on the optimized Gaussian primitive set, and outputting a panoramic image or a dense three-dimensional model, including:
Inputting the optimized Gaussian primitive set into an equidistant columnar projection rendering pipeline, and generating equidistant columnar projection grids according to the panoramic image output resolution;
for each pixel point in the grid, generating rays emitted from the center of the camera according to the spherical sight direction corresponding to the plane coordinate mapping;
Calculating the intersection weight of the ray and all Gaussian primitives in the scene, and generating the final color of the current pixel by weighting and accumulating the color value and the opacity of each Gaussian contribution;
And traversing all pixels to finish panoramic image rendering output, or generating a dense three-dimensional point cloud model based on the spatial position and the scale parameter of the optimized Gaussian primitive set.
In a second aspect, a computing device includes:
One or more processors;
and a storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method.
In a third aspect, a computer readable storage medium has a program stored therein, which when executed by a processor, implements the method.
The scheme of the invention at least comprises the following beneficial effects:
And coding a panoramic image through equidistant columnar projection, and generating a camera pose and sparse point cloud by utilizing the panorama SfM, so as to ensure the consistency of data formats and the accuracy of camera parameters. When a history reconstruction result exists, the Gaussian primitives are fused through pose registration, repeated training is avoided, and the consumption of calculation resources is reduced. For example, in the same or overlapping scenes, the historical high-confidence Gaussian can be directly used as an initial parameter, so that optimization convergence is accelerated, and reconstruction efficiency is improved. Aiming at the serious distortion problem of equidistant columnar projection in a polar region (+/-90-degree view angle), partial spherical remapping or cube map splicing is adopted to correct the spatial distortion of the polar region image, misleading of projection errors on reconstruction is avoided, and the modeling stability and precision of an edge region (such as the top/bottom of the image) are improved.
Based on the panoramic screen spatial gradient, the Gaussian distribution is dynamically adjusted, the adaptive modeling of the spatial non-uniform structure is realized, the key region is optimized preferentially in a resource-limited scene, and the reconstruction efficiency and detail performance are balanced. The rendering frame based on equidistant columnar projection supports 360-degree omnidirectional visual angles, and the ray consistency and visual continuity of the rendered image are ensured by combining a joint loss function (MSE+SSIM) of photometric errors and structural similarity errors, so that a high-precision panoramic image or a dense three-dimensional model is finally output. By means of mechanisms such as distortion correction and gradient driving densification, the problem of matching errors of the traditional method in polar region, weak texture and other scenes is solved.
Drawings
Fig. 1 is a schematic flow chart of a 3DGS-based panoramic image three-dimensional reconstruction method according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, an embodiment of the present invention proposes a 3DGS-based panoramic image three-dimensional reconstruction method, which includes the following steps:
Step 1, collecting a panoramic image sequence of a scene to be reconstructed, and uniformly coding through equidistant columnar projections to obtain a coding data set;
Step 2, performing panoramic structure-from-motion (SfM) processing on the encoded data set to generate six-degree-of-freedom camera pose parameters and a corresponding sparse three-dimensional point cloud for each image;
Step 3, initializing a 3D Gaussian primitive set based on a sparse three-dimensional point cloud, endowing each three-dimensional point with a position, a rotation parameter, a scale initial value, a color parameter and an opacity parameter, and fusing the history reconstruction result with the current Gaussian primitive set through pose registration when the history reconstruction result overlapped with the current scene exists, so as to obtain a fused Gaussian primitive set;
Step 4, performing a distortion correction operation on the top or bottom region of the panoramic image sequence by adopting local spherical remapping or cube projection reconstruction to obtain a corrected image set, and performing iterative training based on the fused Gaussian primitive set and the corrected image set to obtain an optimized Gaussian primitive set;
Step 5, performing panoramic rendering on the optimized Gaussian primitive set, and outputting a panoramic image or a dense three-dimensional model.
In the embodiment of the invention, the panoramic image sequence is uniformly encoded through equidistant columnar projection, the consistency of the data format is ensured, the pose and sparse point cloud of the six-degree-of-freedom camera are generated, the view angle and scene structure of the camera are precisely positioned, and a space coordinate foundation is laid for the initialization of a Gaussian model. Based on sparse point cloud initialization Gaussian primitives and by combining pose registration fusion of historical reconstruction results, repeated training of repeated scenes is avoided, model convergence is accelerated, and calculation resource consumption is reduced. The modeling precision of the edge region is improved by correcting the projection distortion of the polar region through spherical remapping or cube mapping, the Gaussian distribution is dynamically optimized based on densification control (splitting, cloning and trimming) of the panoramic gradient, the encryption of the detail region and the elimination of redundant points are realized, and the reconstruction efficiency and precision are balanced. And generating an omnidirectional continuous panoramic image or a dense three-dimensional model based on the optimized Gaussian set, adapting to high-precision visual requirements of scenes such as VR/AR and the like, and ensuring light consistency and structural integrity.
In a preferred embodiment of the present invention, the step 1 of collecting a panoramic image sequence of a scene to be reconstructed, and uniformly encoding the panoramic image sequence by equidistant columnar projection to obtain an encoded data set may include:
step 100, acquiring a multi-view spherical image sequence of a scene through a panoramic camera;
step 101, converting each spherical image into a two-dimensional plane image by adopting equidistant columnar projection, and establishing a mapping relation between spherical view angle coordinates and plane pixel coordinates;
And 102, performing resolution standardization processing on all the converted images to generate a coded data set in a unified format.
In the embodiment of the invention, a panoramic camera (such as a fisheye lens splicing camera or a multi-lens array camera) is adopted, and the field angle of the lens needs to cover a range of 360 degrees horizontally and 180 degrees vertically, so that no visual angle blind area is ensured. Acquisition parameters are set, exposure time is dynamically adjusted according to scene illumination (such as ISO 100-800), resolution is generally set to 4096×2048 or higher to keep enough details, and frame rate is required to meet scene static or dynamic shooting requirements (such as static scene 1-5fps, dynamic scene is more than or equal to 30 fps). If the camera is used for fixed position acquisition, the camera keeps the central position motionless, a group of images are shot according to a preset angle (such as every 10 degrees) through the rotary cradle head, so that the overlapping rate of adjacent images is more than or equal to 50%, and if the camera track is used for mobile acquisition, the camera track is required to be recorded (such as through SLAM real-time positioning), so that the position continuity between continuous frames is ensured.
The image sequence acquisition process comprises the following steps:
Shooting proceeds sequentially in a clockwise or anticlockwise direction, covering the full viewing-angle range and forming a closed image sequence. For example, the horizontal direction starts from 0 degrees and is shot every 10 degrees, 36 frames in total; the vertical direction can be collected in multiple layers (such as zenith, horizontal and nadir) to ensure that every region of the sphere is covered. Image sharpness, exposure consistency and viewing-angle overlap are checked in real time during acquisition, and blurred, overexposed or insufficiently overlapping images are removed.
Step 101, each pixel of the spherical image corresponds to a direction vector in three-dimensional space, represented by a longitude θ (range 0 to 2π) and a latitude φ (range -π/2 to π/2). Equidistant columnar projection unfolds the sphere into a rectangular plane: the transverse coordinate x of the plane image corresponds to θ and the longitudinal coordinate y corresponds to φ, with the mapping relation: plane transverse coordinate x = (θ / 2π) × W (plane image width), where θ = 0 corresponds to the left edge of the plane and θ = 2π corresponds to the right edge (loop stitching is required); plane longitudinal coordinate y = (φ/π + 0.5) × H (plane image height), where φ = -π/2 corresponds to the bottom of the plane (nadir) and φ = π/2 corresponds to the top (zenith).
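For illustration only, a minimal Python sketch of the forward and inverse mapping just described, assuming that W and H denote the plane image width and height in pixels and that angles are in radians; the function names are ours, not the patent's:

```python
import math

def sphere_to_plane(theta, phi, W, H):
    """Map a spherical direction (theta in [0, 2*pi), phi in [-pi/2, pi/2])
    to equirectangular plane coordinates (x, y)."""
    x = (theta / (2.0 * math.pi)) * W      # longitude -> transverse coordinate
    y = (phi / math.pi + 0.5) * H          # latitude  -> longitudinal coordinate (nadir at y = 0)
    return x, y

def plane_to_sphere(x, y, W, H):
    """Inverse mapping: plane pixel coordinates back to (theta, phi)."""
    theta = (x / W) * 2.0 * math.pi
    phi = (y / H - 0.5) * math.pi
    return theta, phi
```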
Pixel value interpolation calculation process:
For target pixel coordinates (x, y) in the planar image, the direction coordinates of the target pixel coordinates in the original spherical image are reversely deduced according to the mapping relation of equidistant columnar projections:
longitude θ= (x/W) ×2pi, where W is the planar image width, and θ ranges from [0, 2pi);
Latitude φ = (y/H - 0.5) × π, where H is the planar image height, and φ ranges over [-π/2, π/2]. Since x and y are integer pixel coordinates, the computed θ and φ are generally non-integer (e.g. θ = 3.1416 radians), while the pixel coordinates of the original spherical image are integers (e.g. θ1 = 3, θ2 = 4), so the pixel value at the non-integer coordinate must be determined by interpolation.
Separating the integer portions of θ and φ from the fractional portion:
Let the integer part of θ be θ_int=floor (θ), the fractional part be θ_frac=θ - θ_int (range [0, 1));
Let the integer part of phi be phi_int=floor (phi) and the fractional part be phi_frac=phi-phi_int (range [0, 1)).
Thereby determining four pixels adjacent to (θ, φ) in the original spherical image:
upper left point (θ1, Φ1) = (θ_int, Φ_int);
upper right point (θ2, Φ1) = (θ_int+1, Φ_int);
lower left point (θ1, Φ2) = (θ_int, Φ_int+1);
bottom right point (θ2, Φ2) = (θ_int+1, Φ_int+1);
(if θint=2pi-1, θ2 needs to be looped to 0 to process the left-right boundary continuity of the spherical image).
The weighting coefficients are based on the fractional parts θ_frac and Φ_frac of θ and Φ, following the principle of "the closer the distance is, the higher the weight:
lateral weight (θ direction, controlling left and right pixel impact):
the weights of the upper left point and the lower left point are wtheta _left=1-thetafrac, wherein thetafrac is a decimal part of the longitude theta and takes a value range of [0,1 ], and when theta is an integer, thetafrac=0, wtheta _left=1.
The weights of the upper right point and the lower right point are wtheta _right=θfrac, when θ approaches the next integer coordinate, θfrac approaches 1, and wtheta_right approaches 1.
Longitudinal weight (phi direction, controlling the upper and lower pixel impact):
The weights of the upper left point and the upper right point are wphi _top=1-phi frac, wherein phi frac is the decimal part of the latitude phi, the value range is [0,1 ], and when phi is an integer, phi frac=0, and wphi _top=1.
The weights of the lower left and lower right points are wphi _bottom=Φfrac, and Φfrac approaches 1 and wphi_bottom approaches 1 when Φ approaches the next integer coordinate.
Four-point combining weights (spatial position weights):
The upper left point weight is w1= wtheta _left× wphi _top, and represents the combined effect of the left and upper pixels, and w1=1 directly takes the upper left point pixel value when θfrac=0 and Φfrac=0.
The upper right point weight is w2= wtheta _right× wphi _top, and represents the combined effect of the right and upper pixels, w2=1 when θfrac=1 and Φfrac=0 (limit case).
The lower left dot weight is w3= wtheta _left× wphi _bottom, representing the combined effect of the left and lower pixels, w3=1 when θfrac=0 and Φfrac=1 (limit case).
The lower right point weight is w4 = wtheta_right × wphi_bottom, representing the combined effect of the right and lower pixels; w4 = 1 when θfrac = 1 and φfrac = 1 (limit case). Regardless of how θfrac and φfrac are taken, the weights always satisfy w1 + w2 + w3 + w4 = (wtheta_left + wtheta_right) × (wphi_top + wphi_bottom) = 1 × 1 = 1.
Extracting RGB values of four adjacent pixels:
upper left dot pixel value: RGB 1= (R1, G1, B1);
upper right dot pixel value: RGB 2= (R2, G2, B2);
lower left pixel value: RGB 3= (R3, G3, B3);
Lower right dot pixel value: RGB 4= (R4, G4, B4);
The RGB value of the target pixel is obtained by weighting and summing:
R=w1×R1+w2×R2+w3×R3+w4×R4;
G=w1×G1+w2×G2+w3×G3+w4×G4;
B=w1×B1+w2×B2+w3×B3+w4×B4。
the process avoids jagged edges or abrupt color changes caused by direct rounding by smoothly transiting the colors of adjacent pixels, and ensures the visual continuity of the plane image after projection.
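A compact sketch of this four-neighbor weighting, assuming the spherical image is stored as a row-major array indexed by integer (latitude row, longitude column) pixel coordinates; the helper name bilinear_sample is illustrative only:

```python
import math

def bilinear_sample(img, theta_px, phi_px, W_sph):
    """Bilinear interpolation at non-integer spherical pixel coordinates
    (theta_px, phi_px), using the four-neighbor weights described above.
    img[phi_idx][theta_idx] returns an (R, G, B) tuple; W_sph is the number
    of pixel columns, used to wrap theta at the left/right seam."""
    t_int, p_int = math.floor(theta_px), math.floor(phi_px)
    t_frac, p_frac = theta_px - t_int, phi_px - p_int

    t2 = (t_int + 1) % W_sph                  # wrap longitude for seam continuity
    p2 = min(p_int + 1, len(img) - 1)         # clamp latitude at the bottom edge

    w1 = (1 - t_frac) * (1 - p_frac)          # upper-left
    w2 = t_frac * (1 - p_frac)                # upper-right
    w3 = (1 - t_frac) * p_frac                # lower-left
    w4 = t_frac * p_frac                      # lower-right (w1+w2+w3+w4 == 1)

    c1, c2 = img[p_int][t_int], img[p_int][t2]
    c3, c4 = img[p2][t_int], img[p2][t2]
    return tuple(w1 * a + w2 * b + w3 * c + w4 * d
                 for a, b, c, d in zip(c1, c2, c3, c4))
```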
Region distortion preprocessing:
The latitude φ lies within the range [-π/2, π/2] and corresponds to the longitudinal coordinate of the planar image (from nadir to zenith). When |φ| > 80° (about 1.396 radians), the region is defined as the polar region (top: φ > 80°, bottom: φ < -80°); in equidistant cylindrical projection this region suffers severe longitudinal compression:
in normal (non-polar) regions, the number of plane pixels corresponding to the unit latitude is about H/6 (H is the image height);
in the polar region, the number of plane pixels corresponding to the unit latitude exceeds H/2, so the pixel density of the polar region is more than 3 times that of a normal region, forming a pixel stacking phenomenon.
The spatial distribution of pixels in this region is uneven, so gradient calculation during rendering can be biased, leading to an imbalanced density distribution in the Gaussian modeling.
The longitudinal pixel coordinates y of the planar image are converted into latitudes phi:
Top region: when φ > 80°, y > H × (0.5 + 80°/180°) ≈ H × 0.944;
Bottom region: when φ < -80°, y < H × (0.5 - 80°/180°) ≈ H × 0.056.
The polar region is divided into n×n grids (e.g., n=16), each corresponding to a small area of the spherical polar region, so as to facilitate the local scaling operation.
Compressing the latitude φ of the polar region from [80°, 90°] to [70°, 80°] alleviates the longitudinal pixel-density problem; the scaling factor k(φ) is defined as a function of the latitude φ and a target latitude φ_target (for example φ_target = 80°). The pixels within each grid of the polar region are scaled longitudinally by k(φ):
Original pixel coordinates (y, x) →new coordinates (y ', x), where y' =y+ (y-y_p) × (1-k (Φ)), y_p is the polar region starting pixel coordinate;
And filling blank pixels by using bicubic interpolation after scaling, and keeping the image smooth. At the junction of the polar region and the non-polar region (such as y (0.944H) and y (0.056H)) a transition zone with the width of 50-100 pixels is arranged, and a weighted fusion mode (such as the linear change of the weight along with y) is adopted, so that obvious boundary faults are avoided after scaling. For the projected plane image, calculating the corresponding latitude phi of the plane image pixel by pixel, if phi >80 degrees, marking the plane image as a polar region pixel (for example, setting a mask mask=1 and a non-polar region mask=0), and according to the projection mapping relation, calculating phi= (y/H-0.5) x 180 degrees (converted into an angle system) for the pixel (y, x), and marking the plane image as a polar region, and generating a binary mask matrix with the same size as the image or storing the pixel coordinate range of the polar region (for example, a top polar region y epsilon [ y_top_start, H ], a bottom polar region y epsilon [0, y_bottom_end ]).
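As one possible realization of the polar-region marking, a short NumPy sketch that builds the binary mask from the row latitude; the 80° limit and the relation φ = (y/H - 0.5) × 180° are taken from the text, while the function name is ours:

```python
import numpy as np

def polar_region_mask(H, W, phi_limit_deg=80.0):
    """Binary mask of an equirectangular image: mask[y, x] = 1 where the
    row latitude satisfies |phi| > phi_limit_deg, otherwise 0."""
    rows = np.arange(H).reshape(-1, 1)
    phi_deg = (rows / H - 0.5) * 180.0           # latitude of each pixel row
    row_mask = (np.abs(phi_deg) > phi_limit_deg).astype(np.uint8)
    return np.repeat(row_mask, W, axis=1)        # same flag across each row
```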
Step 102, defining a standard resolution (e.g. width w0=2048, height h0=1024), and calculating a scaling according to the resolution (W1, H1) of the original projection image:
Lateral scaling s_w = W0/W1 and vertical scaling s_h = H0/H1. If s_w ≠ s_h, either equal-scale scaling (maintaining the aspect ratio) or clipping/padding is selected:
scaling by taking s=min (s_w, s_h), scaling the image by s first, and then filling black edges or repeating pixels to standard size at the edges;
Clipping and filling, namely scaling according to s_w and s_h respectively, clipping the excess part or filling the blank, and ensuring the final size to be (W0, H0).
Assume that the original image has width W1 and height H1, and the target image has width W2 and height H2. A horizontal scaling factor s_w = W1/W2 and a vertical scaling factor s_h = H1/H2 are calculated. For any pixel point (x0, y0) in the target image (x0 ranges from 0 to W2-1, y0 from 0 to H2-1), its corresponding position in the original image is computed: the horizontal coordinate x1 = x0 × s_w is generally a floating-point value (e.g. x0 = 10 and s_w = 1.52 give x1 = 15.2), and likewise the vertical coordinate y1 = y0 × s_h is a floating-point value (e.g. y0 = 5 and s_h = 1.28 give y1 = 6.4).
Finding the integer parts of x1 and y1, determining the pixel range of 4×4 around it:
Taking the integer part of x1 as ix=floor (x 1), the 4 pixel positions in the horizontal direction are ix-1, ix, ix+1, ix+2, taking the integer part of y1 as iy=floor (y 1), the 4 pixel positions in the vertical direction are iy-1, iy, iy+1, iy+2, if ix-1 is less than 0, it is set to 0 (or mirrored symmetry such as 1 when ix-1= -1 is adopted), if ix+2 exceeds the original image width W1-1, it is set to W1-1, ensuring that the neighborhood does not exceed the image range.
For each neighborhood pixel point (i, j) (i is a horizontal position and j is a vertical position), its distance from the target point (x 1, y 1) is calculated:
horizontal distance dx= |x1-i|, such as x1=15.2, i=15, dx=0.2, i=16, dx=0.8;
Vertical distance dy= |y1-j|, such as y1=6.4, j=6, dy=0.4, j=7, dy=0.6;
According to the logic of the bicubic kernel function, the closer the distance is, the greater the pixel weight is, and the weight decays in a curve with the change of the distance:
when the distance is between 0 and 1, the weight decreases rapidly with increasing distance;
when the distance is between 1 and 2, the weight-down speed becomes slow, but still contributes;
The pixel weights with distances exceeding 2 are considered as 0 and do not participate in the calculation (only 4 x 4 neighborhood is actually considered, the distances being at most 2).
For each neighborhood pixel (i, j), a weight is calculated based on dx and dy, then the RGB value (or RGBA) of the pixel is multiplied, and then the weighted values of the 16 pixels are accumulated. For example, R value of a pixel in the neighborhood is 200, weight is 0.3, R value of another pixel is 150, weight is 0.2, 200×0.3+150×0.2 is added, RGB three channels are respectively calculated to obtain R, G, B value of target pixel, if sum of all weights is not 1 (possibly caused by boundary processing), the added result is divided by the sum of weights, so as to ensure that the pixel value is in reasonable range (such as 0-255).
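The patent only states the qualitative decay of the bicubic weights; a commonly used kernel that matches this behavior (the Keys kernel with a = -0.5) is sketched below as an assumption, not as the patent's exact formula:

```python
def bicubic_weight(d, a=-0.5):
    """1-D bicubic kernel weight for a sample at distance d from the target;
    the weight falls quickly for 0 <= |d| <= 1, more slowly for 1 < |d| < 2,
    and is zero beyond 2, matching the description above."""
    d = abs(d)
    if d <= 1.0:
        return (a + 2.0) * d**3 - (a + 3.0) * d**2 + 1.0
    if d < 2.0:
        return a * d**3 - 5.0 * a * d**2 + 8.0 * a * d - 4.0 * a
    return 0.0

# 2-D weight of neighborhood pixel (i, j) relative to the target (x1, y1):
# w_ij = bicubic_weight(x1 - i) * bicubic_weight(y1 - j)
```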
Whether the original image is in JPG, BMP, or other format, format conversion by an image processing tool is required:
reading original image data, and analyzing information such as width, height, color channels and the like;
creating a file structure in PNG format, and setting a compression mode (such as lossless compression);
And writing the pixel data of the original image into the PNG file to ensure that the pixel value is not lost.
The unified color space is RGB or RGBA.
Color space detection and conversion:
If the original image is in CMYK format (commonly seen in printed images), it needs to be converted to RGB by a specific conversion matrix, converting the cyan, magenta, yellow, black four channel values to red, green, blue three channels, if it is in YCbCr format (commonly seen in video coding), calculating RGB values from the luminance and color difference channels, if the image contains transparency channels (Alpha), it is preserved and converted to RGBA format, if not, it adds a fully transparent Alpha channel (value 255). The bit depth of each color channel is checked (e.g., 16 bits, 32 bits, etc.) and converted to 8 bits:
for a 16-bit image, the value range of each channel is 0-65535; each value is divided by 256 and rounded, compressing it to 0-255 (e.g. 65535/256 ≈ 256, clamped to 255);
for a 4-bit image, there are only 16 possible values (0-15) per channel, which is multiplied by 17 to extend to 0-255 (e.g., 15×17=255);
ensuring that the bit depth of each channel after conversion is consistent to 8 bits.
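A small helper illustrating the bit-depth rules above (divide 16-bit values by 256, multiply 4-bit values by 17); the generic fallback for other depths is our addition:

```python
def to_8bit(value, src_bits):
    """Normalize one channel sample to 8-bit depth."""
    if src_bits == 8:
        return value
    if src_bits == 16:
        return min(value // 256, 255)            # 65535 // 256 -> 255
    if src_bits == 4:
        return value * 17                        # 15 * 17 -> 255
    # fallback (assumption): linear rescale from [0, 2**src_bits - 1] to [0, 255]
    return round(value * 255 / (2 ** src_bits - 1))
```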
Detecting whether the image contains color configuration files (such as AdobeRGB, proPhotoRGB, etc.), if the image contains non-sRGB configuration files, converting the image data into sRGB color gamut by using a color management tool, for example, the AdobeRGB color gamut is wider, colors beyond the sRGB range need to be mapped to the nearest displayable color during conversion, and after conversion, removing the original configuration files and embedding standard sRGB configuration files to ensure that all the images are displayed consistently on different devices.
Through multi-view acquisition with a panoramic camera, it is ensured that no 360-degree spatial information of the scene is missing, improving the accuracy and integrity of three-dimensional point cloud reconstruction. The equidistant columnar projection converts the spherical image into a two-dimensional plane and establishes a linear mapping relation between the spherical viewing angle and the plane pixels; the regional distortion preprocessing lays a foundation for the distortion correction of the subsequent step and reduces the influence of projection errors on reconstruction. The resolution standardization processing eliminates calculation deviation caused by resolution differences between images, ensures consistency of the input data during model training, and improves reconstruction efficiency and accuracy.
In a preferred embodiment of the present invention, the step 2 of performing panoramic structure-from-motion processing on the encoded data set to generate six-degree-of-freedom camera pose parameters and corresponding sparse three-dimensional point clouds for each image may include:
Step 200, extracting feature points with panorama invariance from images in a unified format coding dataset, and screening matching point pairs according to the similarity of spherical viewing angles to obtain screened matching point pairs;
Step 201, based on the screened matching point pairs, solving the camera pose through an incremental structure-from-motion algorithm to generate initial three-dimensional point coordinates, which specifically comprises:
Step 2010, initializing a first camera pose based on the screened matching point pair by taking the camera position of the first image as the origin of the world coordinate system, and marking a corresponding view as a registered view;
Step 2011, selecting a new view with common view feature points with the registered view from the rest unregistered views, and solving a rotation matrix and a translation vector of the new view relative to a world coordinate system through a PnP algorithm to obtain a six-degree-of-freedom camera pose;
Step 2012, based on the pose of the new view and the common-view feature points, performing a triangulation operation on the new view and the registered view, generating initial three-dimensional point coordinates bound to the world coordinate system;
Step 202, the pose and three-dimensional point coordinates of a camera are jointly optimized by utilizing a beam method adjustment, the panoramic projection consistency of each three-dimensional point is checked, and an optimized camera pose parameter set and a sparse three-dimensional point cloud meeting consistency constraint are output.
In the embodiment of the invention, each plane image is converted into spherical coordinate representation, and three-dimensional space mapping is established by using longitude theta (0-360 degrees in the horizontal direction) and latitude phi (-90 degrees in the vertical direction), so that the continuity of panoramic viewing angles is ensured to be considered during feature extraction. For spherical coordinate images, 4 layers of scale spaces (such as original scale, 1/2, 1/4 and 1/8) are respectively constructed in the horizontal direction and the vertical direction, each layer is smoothed through a Gaussian convolution kernel (standard deviation sigma=1.6), and then a Gaussian difference (DoG) image is generated by difference with the next layer of image. In the three-dimensional neighborhood of the DoG image (3 x 3 windows of the current layer and upper and lower layers), searching local maximum value or minimum value by comparing pixel values, determining characteristic point position, each extreme point is required to be accurate to sub-pixel precision by fitting a quadratic function, and meanwhile, edge effect and low-contrast points are removed (e.g. curvature is judged by a Hessian matrix, and points with a main curvature ratio larger than 10 are removed).
A 16×16 local area is defined in spherical coordinates centered on each feature point (taking the longitude cyclic property into account, parts exceeding the right edge are stitched from the left); the area is divided into 4×4 sub-blocks, a gradient histogram with 8 directions (based on pixel gradients in the θ and φ directions) is computed for each sub-block, and a 128-dimensional descriptor is finally generated. A principal direction in spherical coordinates is introduced when generating the descriptor (with the feature point as center, the dominant gradient direction of the surrounding area serves as the direction reference of the descriptor), ensuring rotation invariance under panoramic viewing angles. For the feature points of every pair of images, the Euclidean distance between descriptors is computed, and candidate point pairs whose distance is less than 40% of the maximum descriptor distance are retained (e.g. if the maximum descriptor distance is 2.0, the threshold is 0.8, and only pairs with descriptor distance below 0.8 are kept).
And reversely calculating the plane coordinates of the candidate point pairs to spherical coordinates (theta 1, phi 1) and (theta 2, phi 2), and calculating the angular distance of the two points on the spherical surface, wherein if the angular distance is smaller than 15 degrees (about 1/3 of the camera view angle), the two points are considered to belong to neighboring areas in the spherical view angle, and remain as effective candidates. Randomly selecting 4 pairs of candidate point pairs, assuming correct matching, calculating a basic matrix (FundamentalMatrix) by using an 8-point method, and solving through SVD decomposition. And checking all candidate point pairs by using the basic matrix, calculating the epipolar line distance (namely the vertical distance from the two-dimensional point to the epipolar line) of each point pair, counting the number of internal points with the epipolar line distance smaller than 1 pixel, finally selecting the basic matrix with the maximum number of internal points, and reserving all the internal points (the epipolar line distance is less than or equal to 1 pixel) as the final matching point pair.
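A sketch of the spherical viewing-angle screening, assuming candidate matches are given as pairs of (θ, φ) coordinates in radians; the great-circle distance formula is standard and the function names are illustrative:

```python
import math

def angular_distance_deg(theta1, phi1, theta2, phi2):
    """Great-circle angular distance (degrees) between two spherical
    directions given as longitude/latitude in radians."""
    c = (math.sin(phi1) * math.sin(phi2)
         + math.cos(phi1) * math.cos(phi2) * math.cos(theta1 - theta2))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def screen_by_view_similarity(candidates, max_angle_deg=15.0):
    """Keep candidate pairs whose two spherical coordinates are within
    max_angle_deg of each other; each candidate is ((t1, p1), (t2, p2))."""
    return [pair for pair in candidates
            if angular_distance_deg(*pair[0], *pair[1]) < max_angle_deg]
```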
Step 2010, traversing the encoded data set, calculating the sharpness (e.g., variance of laplace operator) and exposure uniformity (e.g., standard deviation of pixel values) of each image, and selecting the image with highest sharpness and most uniform exposure as the first reference view (if the indexes of the plurality of images are similar, the sequence number is the first). The camera optical center of the reference view is set as the origin (0, 0) of the world coordinate system, the optical axis direction is defined as the positive direction of the z axis, the x axis points to the horizontal right of the image, and the y axis points to the vertical upward of the image, so that the right-hand coordinate system is formed.
Initializing camera internal parameters:
Let the camera parameters be focal length f=image width/2 (e.g. f=1024 when the width is 2048 pixels), principal point (cx, cy) = (image width/2, image height/2), distortion coefficient set to 0 (subsequent adjustment by beam method optimization).
And (5) pose marking:
The pose parameters are allocated for the reference view, namely the rotation matrix R is a 3 multiplied by 3 identity matrix, the translation vector t is (0, 0), the registered view is marked in the data structure, and the pose parameters and the feature point coordinates are recorded.
In step 2011, an index structure (such as a KD tree or a hash table) is built according to the descriptors for all feature points of the registered view, so as to facilitate quick query. Each feature point stores its descriptor vector, spherical coordinates (θ, φ), and planar coordinates (u, v) in the image. For each unregistered view, all its feature points (assuming the number is N) are traversed, and for each feature point f_unreg:
Extracting the description sub-vector D_unreg, and searching the most similar feature points in the feature point index of the registered view (such as K=2 through K neighbor search).
And calculating the distance (such as Euclidean distance) between the D_unreg and the candidate feature point descriptors, and if the ratio of the nearest distance to the next nearest distance is smaller than 0.7 (experience threshold), considering that the matching point F_reg is found.
And establishing a matching point pair set, avoiding repeated counting of the same three-dimensional point (such as de-duplication through a local descriptor hash value of the feature point), counting the number of effective matching point pairs of each unregistered view and registered view, and recording as the number of common-view feature points. Traversing all unregistered views, screening out views with the number of common view feature points being more than or equal to 20 pairs (20 pairs are experience thresholds, and the pose resolving stability is ensured), selecting the view with the largest number of common view feature points as a current new view if a plurality of views meeting the conditions exist, and terminating the incremental SfM flow if all the views are less than 20 pairs.
PnP algorithm pose solving process:
If three-dimensional points generated by triangulation do not exist, virtual three-dimensional points are generated on the basis of the feature points of the registered view by assuming that the depth of all matching points is 1 meter. For example, the plane coordinates (u, v) of the feature point f_reg are back projected as rays under the camera coordinate system by the internal reference K, taking a point on the rays 1 meter away from the optical center as a virtual three-dimensional point p_virtual= (X, Y, 1). Using three-dimensional points generated by early triangulation ensures that each three-dimensional point is observed by at least two views and that the reprojection error is less than 2 pixels. At least 6 pairs of three-dimensional-two-dimensional matching points (p_i, g_i) are selected from the pair of common view feature points, where p_i is the three-dimensional point coordinates and g_i is the two-dimensional feature point coordinates in the new view. Converting the three-dimensional point p_i from the world coordinate system to the camera coordinate system of the registered view:
The pose of the registered view is (R_reg, t_reg); the coordinates of the point in that camera coordinate system are P_cam = R_reg × P_i + t_reg (assuming the world coordinate system origin is the registered-view optical center, so t_reg = 0 and this simplifies to P_cam = R_reg × P_i). For each matching point pair (P_cam, g_i), with g_i = (u, v) denoting the projected position of the three-dimensional point on the image plane and the internal reference K = (f_x, f_y, c_x, c_y), the projection equations u = f_x × (X_cam / Z_cam) + c_x and v = f_y × (Y_cam / Z_cam) + c_y are established, where X_cam / Z_cam and Y_cam / Z_cam are the normalized direction-vector components of the three-dimensional point in the camera coordinate system and c_x, c_y are the principal-point coordinates of the camera internal reference. Stacking these constraints forms a linear system A × x = 0, where x is the camera pose parameter vector and A is the coefficient matrix (each row corresponds to the constraint of one matching point). Singular value decomposition of the coefficient matrix A yields the singular vector corresponding to the smallest singular value as the initial solution of the pose parameters, containing initial estimates of the rotation matrix R and the translation vector t (at this point R may not be orthogonal and t is a scaled vector). If the R obtained from the SVD is not orthogonal (i.e. R^T × R ≠ I, where R^T is the transpose of the rotation matrix and I is the identity matrix), R is decomposed by QR decomposition into an orthogonal matrix Q and an upper triangular matrix R', and Q is taken as the orthogonalized rotation matrix, ensuring the geometric correctness of the rotation. The reprojection error is the Euclidean distance between the projected coordinates of a three-dimensional point in the new view and the actual feature point coordinates, and the total error is the sum of squared errors over all matching points.
Iterative optimization process:
The initial pose (R0, t0) is obtained from the SVD decomposition and an initial total error E0 is calculated; a small perturbation is added to each pose parameter (rotation quaternion and translation vector) and the Jacobian matrix J (the partial derivatives of the error with respect to the parameters) is computed. The normal equation (J^T × J + λ × I) × Δx = -J^T × e is then solved, where Δx is the parameter update quantity, e is the error vector and λ is a damping factor; during the iterative process λ is dynamically changed according to the following strategies (see the sketch after this list):
when the iteration reduces the error, lambda is reduced by a factor of 0.1 to 0.5 (such as lambda +. Lambda. X0.1), and the convergence is accelerated by approaching Gaussian-Newton method;
when the error increases, λ is amplified by a factor of 10 to 20 (e.g., λ+.λ×10), enhancing the stability of the gradient drop, avoiding divergence.
λ must be a positive number (λ > 0) with a preset upper bound (an excessively large λ would degrade the optimization to a pure gradient descent with very slow convergence);
Updating pose parameters, wherein R=R+delta R (through quaternion interpolation), and t=t+delta t;
A new total error E1 is then calculated; if |E1 - E0| < 0.1 pixel or the number of iterations reaches 10, the iteration ends.
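The damping schedule above can be sketched as follows; compute_error and solve_lm_step are hypothetical placeholders for the Jacobian/normal-equation step described in the text, and the upper bound lam_max is an assumed value since the patent does not state one:

```python
def optimize_pose_lm(pose0, compute_error, solve_lm_step,
                     max_iters=10, tol=0.1, lam=1e-3, lam_max=1e7):
    """Levenberg-Marquardt style pose refinement with the lambda schedule
    described above (shrink by 0.1 on improvement, grow by 10 on failure)."""
    pose, err = pose0, compute_error(pose0)
    for _ in range(max_iters):
        candidate = solve_lm_step(pose, lam)      # solve (J^T J + lam I) dx = -J^T e
        new_err = compute_error(candidate)
        if new_err < err:
            if err - new_err < tol:               # error change below 0.1 pixel
                return candidate
            pose, err = candidate, new_err
            lam = max(lam * 0.1, 1e-12)           # move toward Gauss-Newton
        else:
            lam = min(lam * 10.0, lam_max)        # stabilize the step
    return pose
```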
Step 2012, for each pair of matching feature points (u 1, v 1) and (u 2, v 2) of the new view and the registered view:
Through the internal reference K1 and the pose (R1, t1) of the registered view, (u1, v1) is back-projected as a ray in the camera coordinate system, with ray direction K1^(-1) × (u1, v1, 1) and starting point at the camera optical center (0, 0, 0). Through the internal reference K2 and the pose (R2, t2) of the new view, (u2, v2) is back-projected as a ray with direction K2^(-1) × (u2, v2, 1) and starting point at the camera optical center (0, 0, 0); this ray is converted into the world coordinate system through R2 and t2.
Solving the nearest point of the two rays in the world coordinate system:
Constructing a least square problem, searching a point P to enable the square sum of the distances between the P and two rays to be minimum, and solving the point coordinate through matrix operation, wherein the specific steps are as follows:
The two rays are L1(t) = O1 + t × d1 and L2(s) = O2 + s × d2, where O1 = 0 (the registered-view optical center), O2 = t2 (the new-view optical center), and d1, d2 are unit direction vectors. The projection of the vector O2 - O1 onto L1 and the cross product of the two ray direction vectors are calculated, and t and s are solved from a linear system to obtain the closest point P. The re-projection error of P in the two views is then calculated, and the point is rejected if the error in either view exceeds 2 pixels, or if the depth (distance from P to the origin) is less than 1 meter (near-point interference) or more than 100 meters (far-point noise). Valid three-dimensional points are added to the sparse point cloud set, their coordinates recorded, a timestamp created, and the corresponding two view indices associated (for error calculation in the subsequent bundle adjustment).
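A self-contained sketch of the closest-point (midpoint) triangulation of two rays, solving the small least-squares system for t and s; the NumPy formulation and function name are ours:

```python
import numpy as np

def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays L1(t) = o1 + t*d1 and
    L2(s) = o2 + s*d2 (d1, d2 unit direction vectors); returns None for
    (nearly) parallel rays."""
    o1, d1, o2, d2 = map(np.asarray, (o1, d1, o2, d2))
    w0 = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    p1 = o1 + t * d1                    # closest point on ray 1
    p2 = o2 + s * d2                    # closest point on ray 2
    return (p1 + p2) / 2.0
```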
Step 202, for each registered view, a rotation matrix R (represented by a quaternion, avoiding gimbal lock) and translation vector t are stored for a total of 6 degrees of freedom, and each point stores world coordinates (x, y, z), and corresponding camera references (shared global references, or independent references for each view). For each three-dimensional point P and its corresponding view i, the re-projection coordinates (u_i, v_i) =ki× (ri×p+ti) of P are calculated from the pose (Ri, ti) of view i and the internal reference Ki, and the pixel distance errors from the actual feature point coordinates (u_i ', v_i') are calculated, with the total error being the sum of the squares of the errors of all the point-view pairs.
A sparse BundleAdjustment algorithm is employed (e.g., using an incremental solver):
for each error term, its partial derivatives are calculated with respect to the camera pose and three-dimensional point coordinates, forming a sparse matrix (non-zero element preservation only).
The normal equation J^T × J × Δx = -J^T × b is solved, where Δx is the variable update quantity and b is the residual vector; it is solved by Cholesky decomposition or a preconditioned conjugate gradient method, the camera poses are updated by quaternion interpolation, and the update quantity is added directly to the three-dimensional point coordinates. The process of constructing the Jacobian matrix, solving the normal equation for the update quantity, and updating the camera poses and three-dimensional point coordinates is repeated until 20 iterations have been performed or the change in total error is smaller than 0.01 pixel, at which point the mean reprojection error can be reduced to within 1 pixel.
For each optimized three-dimensional point P:
The re-projection errors in all relevant views are calculated, and the pixel errors are converted into spherical angle errors, i.e. angle error ≈ pixel error / focal length (in radians), then converted into degrees. The mean and standard deviation of the angle errors over all views are computed; if the mean is greater than 5 degrees or the standard deviation is greater than 3 degrees, the projection of the point is inconsistent across panoramic viewing angles, it is likely a mismatched point, and it is eliminated. The optimized camera pose parameter set (quaternion and translation vector of each view) and the sparse three-dimensional point cloud (coordinates and corresponding view list) are then output.
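The consistency check can be summarized in a few lines; the conversion angle ≈ pixel error / focal length and the 5°/3° thresholds come from the text, while the function name is ours:

```python
import math
import statistics

def passes_panoramic_consistency(pixel_errors, focal_length,
                                 mean_limit_deg=5.0, std_limit_deg=3.0):
    """Convert the per-view reprojection errors (pixels) of one 3D point to
    spherical angle errors and apply the mean / standard-deviation limits."""
    angles = [math.degrees(err / focal_length) for err in pixel_errors]
    mean_err = statistics.mean(angles)
    std_err = statistics.pstdev(angles) if len(angles) > 1 else 0.0
    return mean_err <= mean_limit_deg and std_err <= std_limit_deg
```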
The panoramic invariance feature points are extracted, and matching point pairs are screened, so that the mismatching rate of the panoramic image caused by regional distortion and large visual angle span can be reduced, the accuracy of subsequent pose calculation is improved, and a reliable feature matching basis is provided for three-dimensional reconstruction. The incremental SfM algorithm can adaptively construct a three-dimensional frame of the panoramic scene by gradually adding new views and solving pose, so that the explosion of computational complexity caused by processing a large amount of data at one time is avoided, and an initial three-dimensional point cloud generated by triangulation provides a basic space structure for the scene. The method has the advantages that the method can eliminate accumulated errors and promote global consistency by combining adjustment of the camera pose and the three-dimensional points through a beam method, ensures projection accuracy of the three-dimensional points under the full view angle through panoramic projection consistency check, avoids reconstruction deviation of a polar region and an edge region, and generates a more accurate sparse point cloud structure.
In a preferred embodiment of the present invention, the step 3 of initializing a 3D gaussian primitive set based on a sparse three-dimensional point cloud, assigning a position, a rotation parameter, a scale initial value, a color parameter and an opacity parameter to each three-dimensional point, and fusing the history reconstruction result and the current gaussian primitive set through pose registration when there is a history reconstruction result coinciding with the current scene, so as to obtain a fused gaussian primitive set, which may include:
step 300, converting the sparse three-dimensional point cloud into an initial Gaussian primitive set, wherein each Gaussian primitive inherits the position and color parameters of a three-dimensional point, and initializes rotation parameters into a non-rotation state, a microscopic cube in scale and an opacity of 0.8;
Step 301, when a history reconstruction model is detected, performing feature matching based on an initial Gaussian primitive set, calculating a re-projection error of the history Gaussian primitive by using camera pose parameters, and eliminating invalid Gaussian with the error exceeding 2 pixels;
Step 302, merging the history Gaussian primitives with valid verification with the initial Gaussian primitive set, and outputting the final merged Gaussian primitive set.
In the embodiment of the invention, firstly, sparse three-dimensional point cloud data are read from storage equipment (such as a hard disk and a memory), the data exist in the form of a set of points, each point comprises three-dimensional coordinates (X, Y, Z) and color information (RGB values), and abnormal points are identified and removed through a statistical method. For example, the average distance from all points to their K adjacent points is calculated, if the distance from a certain point exceeds 3 times of the global average distance, the point is considered as an outlier and removed, or a RANSAC algorithm is used to randomly select partial points to fit a plane or space model, and the points which do not accord with the model are considered as outliers. And creating a Gaussian primitive for each three-dimensional point, and directly assigning the three-dimensional coordinates (X, Y, Z) of the point to be the center position of the corresponding Gaussian primitive, so as to ensure that the positioning of the Gaussian primitive in space is consistent with the original point. The RGB color values of the three-dimensional points are directly given to the Gaussian primitives, for example, the colors of the points are (255, 0), and then the colors of the corresponding Gaussian primitives are set to be red, so that the color characteristics of the original point cloud are reserved. Initialization to a non-rotated state, specifically represented by a unit quaternion (1, 0) or a3 x 3 unit matrix, means that the gaussian primitive does not undergo any rotation in the initial stage, and maintains an isotropic spherical or cubic morphology. The dimensions of the gaussian primitives are initialized to a tiny cube, and the side length of each dimension is set to 0.01 (the units can be adjusted according to the scene size, such as 0.01 meters), so that the dimension approximates a tiny point light source when visualized. The opacity of all gaussian primitives is set to 0.8 in a unified way, so that the gaussian primitives exhibit semitransparent effects when rendered. The parameters (position, rotation, scale, color, opacity) of each gaussian primitive are packaged into a separate data unit (e.g., structure or object), wherein a storage and access interface containing 5 sets of parameters stores the data units of all gaussian primitives in a set (e.g., array, linked list, or hash table) in sequence to form an initial gaussian primitive set.
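A minimal sketch of the initialization described above, using the stated defaults (no rotation, 0.01-sided cube, opacity 0.8); the dataclass and function names are illustrative, not the patent's data structures:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GaussianPrimitive:
    position: Tuple[float, float, float]                                # inherited 3D point position
    color: Tuple[int, int, int]                                         # inherited RGB color
    rotation: Tuple[float, float, float, float] = (1.0, 0.0, 0.0, 0.0)  # unit quaternion: no rotation
    scale: Tuple[float, float, float] = (0.01, 0.01, 0.01)              # tiny cube
    opacity: float = 0.8

def init_gaussians(points: List[Tuple[Tuple[float, float, float],
                                      Tuple[int, int, int]]]) -> List[GaussianPrimitive]:
    """Build the initial Gaussian primitive set from the filtered sparse
    point cloud, given as (xyz, rgb) tuples."""
    return [GaussianPrimitive(position=xyz, color=rgb) for xyz, rgb in points]
```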
In step 301, key features of the current scene (such as landmark objects and geometric structures) are extracted and matched against features in a historical scene library. If the matching degree exceeds a threshold (e.g., 80%), an overlap is considered to exist; likewise, if the camera pose change is small (e.g., translation below 0.5 m and rotation below 15°) and the time interval between the current frame and the historical frame is short, the scene is judged to be a continuation of the same scene. When an overlap is detected, the corresponding historical Gaussian primitive set, containing the previously reconstructed Gaussian primitives and their parameters, is loaded from the database or file that stores the historical data.
For the current Gaussian primitive set, the center of each Gaussian primitive is selected as a feature point, and its local geometric features (such as surface normal and curvature) and color features (such as an RGB histogram) are extracted; feature points and their feature information are extracted from the historical Gaussian primitive set in the same way. A descriptor matching algorithm (such as SIFT descriptor matching based on Euclidean distance) establishes correspondences between feature points of the current and historical primitives: the distance between two descriptors is computed, and if it is below a threshold (e.g., 0.6 times the descriptor length) the pair is accepted as a match. Using the matched pairs, the transformation between the current camera pose and the historical scene is estimated with the iterative closest point (ICP) algorithm; the points of the historical Gaussian primitive set are then transformed into the current coordinate system by the resulting rotation matrix R and translation vector t so that they align in space. For each historical Gaussian primitive, its center is projected onto the current image plane according to the current camera's intrinsic matrix (focal length, principal point) and extrinsic parameters (pose), yielding projection coordinates (u, v), and the pixel distance between (u, v) and the corresponding current observation (such as the position of the matched feature point) is computed. An error threshold (e.g., 2 pixels) is set: a historical primitive whose reprojection error exceeds the threshold is considered inconsistent with the current scene and is removed from the set, and only primitives whose error stays within the threshold are kept, ensuring geometric consistency between the historical information and the current data.
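A compact sketch of the reprojection-error screening is given below, assuming the historical Gaussian centers and their matched image observations are already paired one-to-one by the descriptor matching described above; array names, the pinhole-style intrinsic matrix K and the helper name are assumptions of this sketch.

```python
import numpy as np

def reprojection_error_mask(hist_centers, observed_uv, R, t, K, thresh_px=2.0):
    """Step 301 screening: project the centers of the historical Gaussian
    primitives with the current camera pose (R, t) and intrinsics K, compare
    against the matched observations (one (u, v) per primitive), and keep
    only primitives whose reprojection error is within the pixel threshold."""
    cam = hist_centers @ R.T + t          # world -> camera coordinates
    uv = cam @ K.T                        # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]           # divide by depth
    err = np.linalg.norm(uv - observed_uv, axis=1)
    return err <= thresh_px               # boolean mask of valid history Gaussians
```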
In step 302, the current Gaussian primitive set and the retained historical set are traversed, and the spatial distance between the centers of any two primitives (one from each set) is computed. If the distance is below a threshold (e.g., 0.05 m), the two primitives are considered to represent the same spatial position and are merged: the new position is a weighted average, with weights determined by the number of observations or by the opacity of each primitive. For example, if the current primitive has opacity 0.8 and the historical one 0.7, the position weights are 0.8/(0.8+0.7) and 0.7/(0.8+0.7) respectively. The RGB color values are averaged with the same weights, preserving the more reliable color information. For the rotation parameter, the one with smaller variance (i.e., the more stable rotation state) is selected; the scale parameter takes the average or the larger of the two according to their size relation; and the opacity is a weighted average, keeping the fused opacity reasonable. Gaussian primitives whose center distance exceeds the threshold are added directly to the final set without merging, so that useful information is not deleted by mistake.
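The fusion of one overlapping pair could look as follows, reusing the GaussianPrimitive container from the earlier sketch; the 0.05 m threshold and opacity-based weights follow the example in the text, and observation-count weights would be a drop-in alternative.

```python
import numpy as np

def fuse_pair(cur, hist, dist_thresh=0.05):
    """Step 302 fusion of one current / one historical Gaussian primitive.
    Returns the fused primitive when their centers are within dist_thresh,
    otherwise None (both primitives are then kept unmerged)."""
    if np.linalg.norm(cur.position - hist.position) >= dist_thresh:
        return None
    w_c, w_h = cur.opacity, hist.opacity           # opacity used as the weight
    w_sum = w_c + w_h
    cur.position = (w_c * cur.position + w_h * hist.position) / w_sum
    cur.color = (w_c * cur.color + w_h * hist.color) / w_sum
    cur.scale = np.maximum(cur.scale, hist.scale)  # take the larger extent
    cur.opacity = (w_c * w_c + w_h * w_h) / w_sum  # opacity-weighted mean of opacities
    # rotation: the text keeps the lower-variance rotation; here the current one is kept
    return cur
```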
Gaussian primitives can represent complex scenes in a compact parametric form and offer better continuity and visual quality than traditional point clouds. Fusing historical reconstruction results avoids repeated work and improves reconstruction efficiency, which is particularly suitable for long-term monitoring or incremental mapping scenarios; the reprojection-error screening mechanism ensures that only reliable historical information is kept, improving the accuracy and consistency of the final model. The multi-parameter nature of the Gaussian primitives (position, rotation, scale, color, opacity) lets them adapt to different scene requirements, such as dynamic scenes or semantic enhancement, and the method can be combined with other three-dimensional reconstruction techniques (such as neural radiance fields) to further improve the quality and flexibility of the scene representation.
In a preferred embodiment of the present invention, the step 4 of performing a distortion correction operation on the top or bottom region of the panoramic image sequence using local spherical remapping or cube projection reconstruction, performing iterative training based on the fused Gaussian primitive set and the corrected image set, applying densification control to the Gaussian primitives based on the panoramic screen-space gradient during training, and terminating training when the iteration upper limit is reached to obtain an optimized Gaussian primitive set, may include:
step 400, applying local spherical remapping or cube projection reconstruction to the top or bottom region of the panoramic image sequence, and outputting a corrected image set;
Step 401, performing iterative training based on the fused Gaussian primitive set and the corrected image set, terminating the training when the iteration number reaches the upper limit, and outputting the optimized Gaussian primitive set, specifically comprising:
Step 4010, using the corrected image set as training input data, initializing parameters of the Gaussian primitive set after fusion;
Step 4011, randomly selecting a training view and a corresponding camera pose from a corrected image set, and generating a simulation view through an equidistant columnar projection rendering pipeline based on the camera pose and a current Gaussian primitive set;
Step 4012, calculating a loss function between the simulation view and the real training view, and updating the position, rotation, scale, color and opacity parameters of the Gaussian primitives according to the back propagation of the loss function to obtain an updated Gaussian primitive set;
step 4013, based on the updated Gaussian primitive set, calculating the gradient amplitude of each Gaussian primitive in panoramic screen space; if the gradient amplitude is larger than the splitting threshold, executing a splitting operation to generate sub-Gaussians; if it is smaller than the cloning threshold, executing a cloning operation; if it equals the cloning threshold, keeping the original state;
Step 4014, when the iteration number reaches a preset upper limit, the training is terminated and the optimized Gaussian primitive set is output.
In the embodiment of the invention, for each frame of the panoramic image sequence, the polar regions (e.g., the areas above φ = 80° at the top or below φ = −80° at the bottom) are determined by latitude mapping. Under equidistant columnar projection these regions suffer from excessive longitudinal pixel density, which distorts object shapes (straight lines appear as curves). The spatial distribution of polar pixels is such that a spherical region of unit angle occupies more than 3 times as many pixels on the plane image as the same region in the middle latitudes, so the pixel density must be redistributed through a coordinate transformation.
Local sphere remapping:
The polar region is divided into M×N small grids (e.g., 64×64), each corresponding to a local patch of the spherical polar cap, and the original latitude φ and longitude θ range of each grid is recorded. For each pixel in a grid, the latitude φ of the [80°, 90°] cap is compressed via φ' = 70° + (φ − 80°) × 0.5, which halves the number of vertical pixels occupied by the polar region, while the longitude θ is kept unchanged. Blank areas appear between the grids after the transformation; their RGB values are computed from the pixel values of neighboring grids by bicubic interpolation, ensuring a smooth image without seams.

Cube projection reconstruction:

The spherical region corresponding to the polar cap is projected onto one face of a cube (the top or bottom face), with each cube face aligned to the meridians of the sphere; for example, the center of the top face corresponds to the zenith (φ = 90°) and its four edges correspond to the intersections of φ = 80° with θ = 0°/90°/180°/270°. The top face of the cube is unfolded into a square plane whose side length is 1/2 of the original image height, so that the polar pixels are distributed uniformly over the square and the longitudinal compression is eliminated. When stitching the cube-projected image with the non-polar region image, a transition band 100 pixels wide is placed at the boundary and blended by linear weighting (the first 50 pixels take the cube-projection pixel values and the last 50 pixels take the original image pixel values), avoiding visible stitching traces.
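A small sketch of the latitude-compression step of the local spherical remapping branch follows, handling the top cap only (the bottom cap is symmetric); it applies the formula stated above and omits the bicubic fill of the blank bands and the cube-projection alternative. Parameter names are illustrative.

```python
import numpy as np

def compress_polar_latitude(phi_deg, pole_start=80.0, squeeze=0.5, base=70.0):
    """Latitude compression of the local spherical remapping branch:
    phi' = base + (phi - pole_start) * squeeze for pixels in the polar cap
    (the formula given in the text); longitudes are left untouched."""
    phi = np.asarray(phi_deg, dtype=np.float64)
    out = phi.copy()
    cap = phi >= pole_start                      # top polar cap, phi in [80°, 90°]
    out[cap] = base + (phi[cap] - pole_start) * squeeze
    return out
```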
In step 4010, the corrected image set is arranged in viewing-angle order, and the camera pose parameters (rotation matrix R, translation vector t) and intrinsics (focal length f, principal point (cx, cy)) of each image are extracted and stored as training data pairs (image, pose). The position, rotation, scale, color and opacity parameters of the fused Gaussian primitive set are taken as initial values, wherein:

the position keeps the fused coordinates, the rotation is reset to the unit quaternion, the scale is uniformly adjusted to 0.02 m × 0.02 m × 0.02 m, the color takes the fused RGB value, and the opacity is set to 0.7 (slightly lowering the initial opacity makes iterative optimization easier). An iteration upper limit is preset (e.g., 500), the learning rate is initialized to 0.01, and it is decayed exponentially with the iteration count (e.g., multiplied by 0.5 every 100 iterations).
In step 4011, a frame is randomly drawn from the corrected image set as the current training view, and its camera pose (R_cam, t_cam) and intrinsics K are obtained. For each Gaussian primitive, its coordinates in the camera coordinate system are computed as P_cam = R_cam × P_world + t_cam (P_world being the world coordinate).
The 3D Gaussians are then converted into a 2D image through the equidistant columnar projection pipeline: the longitude θ and latitude φ of each Gaussian center are computed as θ = atan2(P_cam.y, P_cam.x) and φ = asin(P_cam.z / |P_cam|), where P_cam = (P_cam.x, P_cam.y, P_cam.z) is the three-dimensional point in the camera coordinate system and |P_cam| its norm; these are converted into plane coordinates x = (θ / 2π) × W and y = (φ / π + 0.5) × H (W and H being the image width and height). An elliptical region is rendered at the plane coordinates based on the scale, rotation and opacity of the Gaussian, its color being the Gaussian's RGB value and its opacity composited by parameter (e.g., alpha blending). The rendering results of all Gaussian primitives are superimposed to generate a simulation view whose size matches the training view.
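The center-projection step could be sketched as below, assuming the point is already in camera coordinates; the modulo wrap of the horizontal coordinate is an added convenience, and the elliptical footprint rasterization and alpha blending are not included.

```python
import numpy as np

def project_equirectangular(p_cam, width, height):
    """Map a Gaussian center given in camera coordinates to pixel coordinates
    of the W x H equirectangular simulation view (step 4011)."""
    x, y, z = p_cam
    r = np.sqrt(x * x + y * y + z * z)
    theta = np.arctan2(y, x)                      # longitude in (-pi, pi]
    phi = np.arcsin(z / r)                        # latitude in [-pi/2, pi/2]
    u = ((theta / (2.0 * np.pi)) % 1.0) * width   # x = (theta / 2pi) * W, wrapped to [0, W)
    v = (phi / np.pi + 0.5) * height              # y = (phi / pi + 0.5) * H
    return u, v
```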
In step 4012, the mean square error (MSE) between the simulation view and the training view is computed as the average of the squared RGB differences over all pixels, with polar-region pixels given a higher weight (e.g., 1.5). Feature maps of the two images are extracted with a pretrained CNN and their cosine similarity is used as a structural loss that enforces structural consistency; the total loss = 0.8 × MSE + 0.2 × structural loss, and the gradient of the loss with respect to the parameters of each Gaussian primitive is computed:
Position gradient, namely adjusting along the optical axis direction of the camera to reduce projection errors;
the rotation gradient is that the influence of rotation change on projection is calculated through quaternion disturbance;
the scale gradient is that the scale is increased or decreased, so that the Gaussian coverage area is matched with a real object;
color gradient, namely adjusting RGB values to be close to training view colors;
an opacity gradient, namely adjusting an alpha value to enable rendering brightness to be consistent with a real image;
The parameters are updated along the negative gradient direction using the current learning rate: P_new = P_old − η × ∂L/∂P, where P_old is the parameter value of the Gaussian primitive before this iteration and η is the current learning rate. The rotation parameter is updated by quaternion interpolation, and the quaternion is normalized after the update so that its norm remains 1, preserving the orthogonality of the rotation matrix.
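A minimal sketch of the composite loss and the update rule is given below; the CNN-based structural loss itself is assumed to be computed elsewhere, and the rotation update is written as an additive step plus renormalization, whereas the text mentions quaternion interpolation as an alternative.

```python
import numpy as np

def total_loss(mse, structural):
    """Composite loss of step 4012: 0.8 x MSE + 0.2 x structural loss."""
    return 0.8 * mse + 0.2 * structural

def gradient_step(p_old, grad, lr):
    """Generic parameter update: P_new = P_old - lr * dL/dP."""
    return p_old - lr * grad

def rotation_step(q_old, q_grad, lr):
    """Rotation update followed by renormalisation, so the quaternion keeps
    unit norm and the corresponding rotation matrix stays orthogonal."""
    q_new = q_old - lr * q_grad
    return q_new / np.linalg.norm(q_new)
```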
In step 4013, for each Gaussian primitive, the pixel gradient magnitude of its projection area in the latest simulation view is computed (e.g., the Sobel operator gives the x and y gradients, and the L2 norm is taken). The gradient magnitude reflects how much detail the Gaussian has to represent: a large gradient indicates edges or rich texture and calls for a denser Gaussian representation.
Split operation (gradient magnitude > split threshold):
The Gaussian is split into two sub-Gaussians whose positions are offset by ±δ from the original center along the gradient direction (δ = 0.01 m), whose scale is reduced to 0.8 times the original, and which inherit the original color and opacity with a small random perturbation added to the rotation (e.g., multiplication by a near-identity quaternion with perturbation ε = 0.1). The original Gaussian is removed and the two sub-Gaussians are added to the set so that the total opacity is unchanged (e.g., original α = 0.7, each sub-Gaussian α = 0.35).
Cloning operation (gradient magnitude < cloning threshold):
The current Gaussian is copied with its position unchanged, its scale enlarged by a factor of 1.2, random noise added to the color (e.g., an RGB offset of ±5) and its opacity reduced by 0.1. This fills sparse regions: the cloned Gaussian complements the original and improves coverage of low-gradient areas. The split threshold is set to 15 (in pixel-gradient units) and the clone threshold to 5; Gaussians with intermediate values (between 5 and 15) remain unchanged.
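The split/clone decision for a single primitive could be sketched as follows, reusing the GaussianPrimitive container from the earlier sketch; the thresholds 15 and 5 come from the text, while treating `grad_dir` as a unit vector lifted from screen space to world space is an assumption of this sketch.

```python
import copy
import numpy as np

SPLIT_THRESH, CLONE_THRESH = 15.0, 5.0           # pixel-gradient thresholds from the text

def densify(g, grad_mag, grad_dir, delta=0.01):
    """Step 4013 decision for one Gaussian: returns the primitives that replace `g`."""
    if grad_mag > SPLIT_THRESH:                  # split into two offset children
        children = []
        for sign in (+1.0, -1.0):
            child = copy.deepcopy(g)
            child.position = g.position + sign * delta * np.asarray(grad_dir)
            child.scale = g.scale * 0.8
            child.opacity = g.opacity / 2.0      # total opacity roughly preserved
            children.append(child)
        return children
    if grad_mag < CLONE_THRESH:                  # clone to fill a sparse region
        clone = copy.deepcopy(g)
        clone.scale = g.scale * 1.2
        clone.color = np.clip(g.color + np.random.uniform(-5, 5, 3), 0, 255)
        clone.opacity = max(g.opacity - 0.1, 0.0)
        return [g, clone]
    return [g]                                   # between thresholds: leave unchanged
```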
In step 4014, training is terminated when the iteration count reaches the preset upper limit (e.g., 500), or when the change in total loss stays below 0.01 for 10 consecutive iterations. Gaussians with opacity < 0.1 are removed, and Gaussians whose centers are closer than 0.02 m with similarity > 0.8 (similarity computed from color and scale) are merged. The optimized Gaussian primitive set is saved, each Gaussian containing its final position, rotation (quaternion), scale (three-dimensional vector), color (RGB) and opacity (α) parameters.
Polar distortion correction removes the pixel-density problem at the top and bottom of the panoramic image, so that object shapes in the polar regions are restored and pixels are distributed uniformly, while iterative training improves the positional accuracy of the Gaussian primitives by comparing the simulated views against the real views. Screen-space gradient control dynamically adjusts the Gaussian density: Gaussians in texture-rich areas (such as edges) are split automatically, improving detail representation, while Gaussians in smooth areas are cloned to avoid holes.
In a preferred embodiment of the present invention, the step 5 of panoramic rendering the optimized gaussian primitive set to output a panoramic image or a dense three-dimensional model may include:
Step 500, inputting the optimized Gaussian primitive set into an equidistant columnar projection rendering pipeline, and generating equidistant columnar projection grids according to the output resolution of the panoramic image;
Step 501, for each pixel point in the grid, generating a ray emitted from the center of the camera according to the spherical sight direction corresponding to the planar coordinate mapping;
step 502, calculating the intersection weights of the ray and all Gaussian primitives in the scene, and generating the final color of the current pixel by weighting and accumulating the color value and the opacity of each Gaussian contribution;
Step 503, traversing all pixels to complete panoramic image rendering output, or generating a dense three-dimensional point cloud model based on the spatial position and scale parameters of the optimized Gaussian primitive set.
In an embodiment of the invention, the optimized set of Gaussian primitives is obtained, each primitive comprising a position (X, Y, Z), a rotation quaternion (w, x, y, z), a three-dimensional scale (sx, sy, sz), a color (R, G, B) and an opacity α, and the resolution of the output panoramic image is determined (e.g., 4096 × 2048 pixels), which fixes the density of the projection grid.
Horizontal direction: the longitude range [0°, 360°] is uniformly mapped onto the W pixels of the image width, each pixel corresponding to a longitude increment of 360°/W;

Vertical direction: the latitude range from 90° S to 90° N (a total span of 180°) is uniformly mapped onto the H pixels of the image height, each pixel corresponding to a latitude increment of 180°/H.
For each pixel (i, j) (0 ≤ i < W, 0 ≤ j < H), the corresponding spherical coordinates are calculated as:

longitude θ = i × (360°/W);

latitude φ = 90° − j × (180°/H).
A two-dimensional array is created storing the spherical coordinates (θ, Φ) of each grid point while recording its pixel index (i, j) in the image, forming an equidistant columnar projection grid.
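The grid construction of step 500 could be written as the following sketch; the function name is hypothetical and the per-pixel angles are returned in degrees.

```python
import numpy as np

def equirect_grid(width, height):
    """Per-pixel spherical look-up table of step 500: longitude spans
    0°..360° across the width, latitude runs from +90° (top row) to -90°."""
    theta = np.arange(width) * (360.0 / width)         # longitude per column, degrees
    phi = 90.0 - np.arange(height) * (180.0 / height)  # latitude per row, degrees
    theta_grid, phi_grid = np.meshgrid(theta, phi)     # both of shape (H, W)
    return theta_grid, phi_grid
```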
Step 501 converts the spherical coordinates of each grid point (θ, Φ) into a three-dimensional direction vector (vx, vy, vz) on a unit sphere, where vx=cos (Φ) ×cos (θ), vy=cos (Φ) ×sin (θ), and vz=sin (Φ).
Starting from the camera center C, a ray r(e) = C + e × (vx, vy, vz) is generated along the direction vector (vx, vy, vz), where e ≥ 0 is the ray parameter; the ray represents the viewing direction from the camera through pixel (i, j) on the image plane.
Step 502, converting ray r (e) from the world coordinate system to the local coordinate system of each gaussian cell:
The ray origin and direction vector are transformed into local space using the inverse rotation (the conjugate of the Gaussian primitive's rotation quaternion) and translation (the negative of the Gaussian primitive's position). In the local coordinate system the Gaussian primitive is a scaled ellipsoid centered at the origin with scale parameters (sx, sy, sz). The ray is intersected with this ellipsoid; if umin ≤ umax and umin ≥ 0, the ray intersects the Gaussian primitive, and the intersection interval [umin, umax] is recorded. For each point u within the intersection interval, its weight under the Gaussian distribution is calculated:
weight w(u) = exp(−d(u)² / (2σ²)), where d(u) is the distance of the point at parameter u from the center of the Gaussian primitive and σ is the standard deviation of the Gaussian distribution (related to the Gaussian primitive's scale). The intersection interval [umin, umax] is sampled discretely (e.g., 10 sample points) and the weights of the sample points are accumulated to obtain the total intersection weight W of the ray with the Gaussian primitive.
The contribution of each Gaussian primitive to the final color is then accumulated along the ray, using the transmittance (1 − α_acc) remaining before the primitive is composited:

updating the final color: C = C + c_k × α_k × W_k × (1 − α_acc);

accumulating the opacity: α_acc = α_acc + α_k × W_k × (1 − α_acc),

where c_k, α_k and W_k are the color, opacity and intersection weight of the k-th Gaussian primitive.
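The per-pixel accumulation could be sketched as below, assuming the Gaussians hit by the ray are visited front to back (the text does not specify the ordering) and that `colors`, `alphas` and `weights` are per-Gaussian sequences; the early-exit test is an added optimization.

```python
import numpy as np

def composite_ray(colors, alphas, weights):
    """Step 502 accumulation of one pixel's color from all intersected Gaussians."""
    final_color = np.zeros(3)
    alpha_acc = 0.0
    for c, a, w in zip(colors, alphas, weights):
        transmittance = 1.0 - alpha_acc          # light not yet absorbed
        final_color += np.asarray(c) * a * w * transmittance
        alpha_acc += a * w * transmittance
        if alpha_acc > 0.999:                    # early exit once nearly opaque
            break
    return final_color
```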
In step 503, each pixel (i, j) of the equidistant columnar projection grid is traversed, steps 501 and 502 are repeated to compute its final color, and the color values of all pixels are written to the corresponding positions of the output image to form a complete panoramic image. Alternatively, for each Gaussian primitive, N points (e.g., N = 100) are sampled uniformly on its surface, the sampling positions being determined by the primitive's position, rotation and scale, and each sample inherits the primitive's color (R, G, B) and opacity α. The sample points of all Gaussian primitives are merged, and duplicate or overly close points are removed (e.g., with a distance threshold of 0.01 m) to form the final dense three-dimensional point cloud model.
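The per-primitive surface sampling for the dense point cloud export might look as follows, reusing the GaussianPrimitive container from the earlier sketch; the duplicate-point removal across primitives is omitted, and the helper names are illustrative.

```python
import numpy as np

def quaternion_to_matrix(q):
    """3x3 rotation matrix from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def sample_gaussian_surface(g, n=100):
    """Step 503 export for one primitive: draw n points on the ellipsoid defined
    by its scale and rotation, centered at its position, each point inheriting
    the primitive's color and opacity."""
    d = np.random.normal(size=(n, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)     # random unit directions
    local = d * g.scale                               # stretch to the ellipsoid surface
    pts = local @ quaternion_to_matrix(g.rotation).T + g.position
    colors = np.tile(g.color, (n, 1))
    alphas = np.full(n, g.opacity)
    return pts, colors, alphas
```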
Equidistant columnar projection guarantees complete coverage of the 360° × 180° field of view without stitching seams. Weighted rendering based on Gaussian primitives preserves the fine structure of the scene, especially the polar details obtained through iterative optimization. Pre-computing the projection grid and the ray intersection weights makes rendering faster than conventional ray tracing, and because each Gaussian primitive generates multiple sample points, the point-cloud density is higher than that of the original sparse cloud. Based on the optimized Gaussian primitive parameters, the geometric accuracy of the point cloud reaches centimeter level and it can be used directly for three-dimensional modeling and measurement. From polar correction to Gaussian primitive optimization to rendering, a complete quality-improvement chain is formed, the mean reprojection error of the panoramic image is below 1 pixel, and the same Gaussian primitive set can simultaneously support panoramic image and three-dimensional point cloud output.
Embodiments of the invention also provide a computing device comprising a processor, a memory storing a computer program which, when executed by the processor, performs a method as described above. All the implementation manners in the method embodiment are applicable to the embodiment, and the same technical effect can be achieved.
Embodiments of the present invention also provide a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform a method as described above. All the implementation manners in the method embodiment are applicable to the embodiment, and the same technical effect can be achieved.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (6)

1. The panoramic image three-dimensional reconstruction method based on 3DGS is characterized by comprising the following steps:
Step 1, collecting a panoramic image sequence of a scene to be reconstructed, and uniformly coding through equidistant columnar projections to obtain a coding data set;
Step 2, performing panoramic motion restoration structure processing on the coded data set to generate six-degree-of-freedom camera pose parameters and corresponding sparse three-dimensional point clouds of each image, wherein the steps include extracting characteristic points with panoramic invariance for the images in the coded data set in a unified format, screening matching point pairs according to spherical view similarity to obtain screened matching point pairs;
Step 3, initializing a 3D Gaussian primitive set based on the sparse three-dimensional point cloud, assigning each three-dimensional point a position, a rotation parameter, a scale initial value, a color parameter and an opacity parameter; when a history reconstruction result coinciding with the current scene exists, fusing the history reconstruction result with the current Gaussian primitive set through pose registration to obtain a fused Gaussian primitive set; converting the sparse three-dimensional point cloud into an initial Gaussian primitive set, wherein each Gaussian primitive inherits the position and color parameters of its three-dimensional point, and the rotation parameter is initialized to a non-rotated state, the scale to a tiny cube, and the opacity to 0.8;
Step 4, performing a distortion correction operation on the top or bottom region of the panoramic image sequence, and obtaining a corrected image set by local spherical remapping or cube projection reconstruction; performing iterative training based on the fused Gaussian primitive set and the corrected image set, applying densification control to the Gaussian primitives based on the panoramic screen-space gradient during training, and terminating training when the iteration upper limit is reached to obtain an optimized Gaussian primitive set; applying local spherical remapping or cube projection reconstruction to the top or bottom region of the panoramic image sequence to output the corrected image set; taking the corrected image set as training input data and initializing the parameters of the fused Gaussian primitive set; randomly selecting a training view and the corresponding camera pose from the corrected image set, and generating a simulation view through an equidistant columnar projection rendering pipeline based on the camera pose and the current Gaussian primitive set; calculating a loss function between the simulation view and the real training view, and updating the position, rotation, scale, color and opacity parameters of the Gaussian primitives by back-propagating the loss function to obtain an updated Gaussian primitive set; based on the updated Gaussian primitive set, calculating the gradient amplitude of each Gaussian primitive in panoramic screen space; if the gradient amplitude is larger than the splitting threshold, executing a splitting operation to generate sub-Gaussian primitives; if the gradient amplitude is smaller than the cloning threshold, executing a cloning operation; if the gradient amplitude is equal to the cloning threshold, keeping the original state; and terminating training when the iteration count reaches the preset upper limit;
And 5, performing panoramic rendering on the optimized Gaussian primitive set, and outputting a panoramic image or a dense three-dimensional model.
2. The 3DGS-based panoramic image three-dimensional reconstruction method according to claim 1, wherein collecting a panoramic image sequence of a scene to be reconstructed, uniformly encoding by equidistant columnar projections, obtaining an encoded data set, comprises:
acquiring a multi-view spherical image sequence of a scene through a panoramic camera;
Converting each spherical image into a two-dimensional plane image by adopting equidistant columnar projection, and establishing a mapping relation between spherical view angle coordinates and plane pixel coordinates;
and carrying out resolution standardization processing on all the converted images to generate a coding data set with a uniform format.
3. The 3DGS-based panoramic image three-dimensional reconstruction method of claim 2, wherein solving the camera pose by an incremental motion restoration structure algorithm based on the screened matched point pairs to generate initial three-dimensional point coordinates comprises:
Initializing the pose of the first camera based on the screened matching point pair by taking the camera position of the first image as the origin of the world coordinate system, and marking the corresponding view as a registered view;
Selecting a new view with common view characteristic points with the registered view from the rest unregistered views, and solving a rotation matrix and a translation vector of the new view relative to a world coordinate system through a PnP algorithm to obtain a six-degree-of-freedom camera pose;
and performing triangulation operation on the new view and the registered view based on the pose of the new view and the common view feature points, and generating initial three-dimensional point coordinates bound to a world coordinate system.
4. A 3DGS-based panoramic image three-dimensional reconstruction method according to claim 3, wherein panoramic rendering is performed on the optimized Gaussian primitive set, and a panoramic image or a dense three-dimensional model is output, comprising:
Inputting the optimized Gaussian primitive set into an equidistant columnar projection rendering pipeline, and generating equidistant columnar projection grids according to the panoramic image output resolution;
for each pixel point in the grid, generating rays emitted from the center of the camera according to the spherical sight direction corresponding to the plane coordinate mapping;
Calculating the intersection weight of the ray and all Gaussian primitives in the scene, and generating the final color of the current pixel by weighting and accumulating the color value and the opacity of each Gaussian contribution;
And traversing all pixels to finish panoramic image rendering output, or generating a dense three-dimensional point cloud model based on the spatial position and the scale parameter of the optimized Gaussian primitive set.
5. A computing device, comprising:
One or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 4.
6. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program which, when executed by a processor, implements the method according to any of claims 1 to 4.
CN202511030517.XA 2025-07-25 2025-07-25 Panoramic image three-dimensional reconstruction method based on 3DGS Active CN120526066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511030517.XA CN120526066B (en) 2025-07-25 2025-07-25 Panoramic image three-dimensional reconstruction method based on 3DGS

Publications (2)

Publication Number Publication Date
CN120526066A CN120526066A (en) 2025-08-22
CN120526066B true CN120526066B (en) 2025-10-03

Family

ID=96751317

Country Status (1)

Country Link
CN (1) CN120526066B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119888133A (en) * 2025-03-28 2025-04-25 成都信息工程大学 Three-dimensional scene reconstruction method and device for structure perception
CN120182509A (en) * 2025-05-21 2025-06-20 香港科技大学(广州) Method, device, storage medium and equipment for indoor scene reconstruction

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115690324A (en) * 2022-11-15 2023-02-03 广州中思人工智能科技有限公司 Neural radiation field reconstruction optimization method and device based on point cloud
CN120298613A (en) * 2025-04-02 2025-07-11 北京航空航天大学 A method for 3D reconstruction of global terrain of asteroids under illumination variation
CN120047543A (en) * 2025-04-23 2025-05-27 深圳市吉科源电子技术有限公司 Panoramic monitoring camera video processing method and device based on cloud storage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant