
CN114494395B - Depth map generation method, device, equipment and storage medium based on plane prior - Google Patents

Depth map generation method, device, equipment and storage medium based on plane prior

Info

Publication number
CN114494395B
CN114494395B CN202210127177.2A CN202210127177A CN114494395B CN 114494395 B CN114494395 B CN 114494395B CN 202210127177 A CN202210127177 A CN 202210127177A CN 114494395 B CN114494395 B CN 114494395B
Authority
CN
China
Prior art keywords
plane
pixel
pixel point
image
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210127177.2A
Other languages
Chinese (zh)
Other versions
CN114494395A (en)
Inventor
田泽藩
暴林超
张浩贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210127177.2A priority Critical patent/CN114494395B/en
Publication of CN114494395A publication Critical patent/CN114494395A/en
Application granted granted Critical
Publication of CN114494395B publication Critical patent/CN114494395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • G06T7/593 - Depth or shape recovery from multiple images from stereo images
    • G06T7/596 - Depth or shape recovery from multiple images from stereo images from three or more stereo images
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 - Finite element generation, e.g. wire-frame surface description, tesselation
    • G06T17/205 - Re-meshing
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10028 - Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract


The present application discloses a method, device, equipment and storage medium for generating a depth map based on plane prior, which relates to the field of image processing technology and is used to improve the accuracy of image depth estimation. The method comprises: acquiring multiple scene images of a target scene shot from different perspectives, and performing plane detection on a reference perspective image in the multiple scene images to obtain plane information in the reference perspective image; wherein the plane information is used to indicate pixel points belonging to the same plane; based on the pixel feature correlation between the multiple scene images, determining a reference pixel point set whose pixel feature correlation meets a set correlation condition from each pixel point included in the reference perspective image; based on the plane information and the reference pixel point set, obtaining a plane prior condition of the reference perspective image; and taking the plane prior condition as a constraint and based on the pixel feature correlation between the multiple scene images, obtaining a depth map of the reference perspective image.

Description

Depth map generation method, device, equipment and storage medium based on plane prior
Technical Field
The application relates to the technical field of Artificial Intelligence (AI), and in particular to the technical field of image processing, and provides a depth map generation method, device, equipment and storage medium based on plane priors.
Background
Depth Estimation is the process of estimating the distance between each pixel in an image and the shooting source using a single image or multiple images from different viewing angles. It is a key step in 3-dimensional (3D) scene reconstruction, a very important part of the field of computer vision, and has very important applications in scenes such as Virtual Reality (VR) and Augmented Reality (AR).
The PatchMatch Multi-View Stereo (PatchMatch MVS) technique is a three-dimensional vision technique that uses a stereo matching algorithm to recover a stereo structure from color images of multiple viewing angles with known camera intrinsic and extrinsic parameters, and it has been able to obtain good results for depth estimation.
However, in weak-texture regions, the pixels within the region differ little from one another and cannot be reliably distinguished, so the depth values cannot be accurately estimated; as a result, the related technology performs poorly at depth estimation in weak-texture regions.
Disclosure of Invention
The embodiment of the application provides a depth map generation method, device and equipment based on plane prior and a storage medium, which are used for improving the accuracy of image depth estimation.
In one aspect, a depth map generating method based on plane priors is provided, and the method includes:
Acquiring a plurality of scene images of a target scene shot based on different view angles, and performing plane detection on reference view angle images in the plurality of scene images to obtain plane information in the reference view angle images, wherein the plane information is used for indicating the pixel points belonging to the same plane;
Determining a reference pixel point set with the pixel characteristic association degree meeting a set association condition from all pixel points included in the reference view angle image based on the pixel characteristic association degrees among the plurality of scene images;
Based on the plane information and the reference pixel point set, obtaining a plane prior condition of the reference view angle image;
and taking the plane prior condition as a constraint, and obtaining a depth map of the reference view angle image based on the pixel characteristic association degree among the plurality of scene images.
In one aspect, a three-dimensional modeling method based on the above method, the method comprising:
acquiring a plurality of scene images of a target scene shot based on different visual angles, and acquiring depth maps corresponding to the plurality of scene images by adopting the method;
and based on the obtained depth maps, obtaining a three-dimensional stereogram corresponding to the target scene.
In one aspect, an automatic driving method based on the above method, the method includes:
acquiring a plurality of scene images of a target scene shot based on different visual angles, and acquiring depth maps corresponding to the plurality of scene images by adopting the method;
Based on the obtained respective depth maps, obstacle distances around the target vehicle are determined, and automatic driving is performed based on the obtained respective obstacle distances.
In one aspect, a depth map generating apparatus based on plane priors is provided, the apparatus comprising:
the plane detection unit is used for acquiring a plurality of scene images of a target scene shot based on different view angles, and carrying out plane detection on reference view angle images in the plurality of scene images to obtain plane information in the reference view angle images, wherein the plane information is used for indicating the pixel points belonging to the same plane;
The pixel screening unit is used for determining a reference pixel point set with the pixel characteristic association degree meeting a set association condition from all pixel points included in the reference view angle image based on the pixel characteristic association degrees among the plurality of scene images;
The plane prior generation unit is used for obtaining plane prior conditions of the reference view angle image based on the plane information and the reference pixel point set;
and the depth estimation unit is used for obtaining the depth map of the reference view image based on the pixel characteristic association degree among the plurality of scene images by taking the plane prior condition as a constraint.
Optionally, the plane prior generation unit is specifically configured to:
performing plane fitting processing by adopting a triangulation method based on the reference pixel point set to obtain a triangular mesh structure, wherein each triangular plane in the triangular mesh structure comprises three reference pixel points in the reference pixel point set;
Based on the plane information, merging triangular planes corresponding to the reference pixel points positioned on the same plane in the triangular mesh structure to obtain a merged triangular mesh structure;
and obtaining the plane prior condition based on the combined triangular mesh structure.
Optionally, the plane prior generation unit is specifically configured to:
Determining a plurality of reference pixel point groups belonging to the same plane in the reference pixel point set based on the plane information, wherein each reference pixel point group comprises at least one reference pixel point;
performing plane fitting processing on the obtained multiple reference pixel point groups respectively to obtain a plane combination structure consisting of a corresponding multiple fitting planes, wherein each reference pixel point group corresponds to one fitting plane in the multiple fitting planes;
based on the planar combination structure, the planar prior condition is obtained.
Optionally, the plane detection unit is specifically configured to:
Performing image semantic segmentation on the reference view angle image to obtain respective corresponding plane masks of each plane area in the reference view angle image;
Performing depth estimation processing on the reference view angle image to obtain an estimated depth map of the reference view angle image;
Based on the estimated depth map, the plane mask is updated to obtain plane information of the reference view image.
Optionally, the pixel screening unit is specifically configured to:
for each pixel point, the following steps are respectively executed:
For the pixel point, determining a stereo matching area of the pixel point in the reference view angle image, wherein each stereo matching area comprises an image area of a set range with the pixel point as a reference point;
respectively determining the mapping areas of the stereo matching areas in other scene images and the pixel feature association degree between the stereo matching areas, wherein the other scene images are scene images except the reference view angle image in the plurality of scene images;
And determining a reference pixel point set with the pixel characteristic association degree meeting the set association condition from the pixel points.
Optionally, the pixel screening unit is further configured to:
respectively initializing parameters of the pixel points to obtain initial parallax plane parameters corresponding to the pixel points;
Based on the pixel characteristic association degree of the corresponding areas in the reference view angle image and the other scene images, respectively performing repeated iterative updating processes on the initial parallax plane parameters corresponding to each pixel point to obtain estimated parallax plane parameters corresponding to each pixel point;
and determining, based on the obtained estimated parallax plane parameters, the stereo matching areas corresponding to the pixel points respectively and their mapping areas in the other scene images.
Optionally, the pixel screening unit is specifically configured to:
acquiring a plurality of sampling pixel points from the pixel points in the set area around the pixel point;
determining an iterative updating loss value of the pixel point based on the pixel feature correlation degree between the stereo matching area corresponding to the pixel point and each corresponding mapping area;
determining iterative updating loss values corresponding to the sampling pixel points respectively, based on the pixel feature correlation degree between the stereo matching area corresponding to each sampling pixel point and each corresponding mapping area;
If the target sampling pixel point with the iteration updating loss value smaller than the iteration updating loss value corresponding to the pixel point exists in the sampling pixel points, the current parallax plane parameter of the pixel point is updated according to the current parallax plane parameter corresponding to the target sampling pixel point.
Optionally, the pixel screening unit is specifically configured to:
determining, based on the current parallax plane parameter corresponding to the pixel point and the homography transformation matrices between the view angles, the stereo matching area corresponding to the pixel point and its mapping areas in other scene images;
Based on the stereo matching region and pixel feature values in each mapping region, correspondingly determining the pixel feature association degree between the stereo matching region and each mapping region;
And determining an iterative updating loss value corresponding to the pixel point based on the obtained pixel characteristic association degree, wherein the iterative updating loss value and the pixel characteristic association degree are in negative correlation.
Optionally, the depth estimation unit is specifically configured to:
Performing repeated iterative updating on the current parallax plane parameters corresponding to each pixel point based on the plane prior condition and the pixel characteristic association degree to obtain target parallax plane parameters corresponding to each pixel point;
And determining depth values corresponding to the pixel points respectively based on the obtained target parallax plane parameters so as to obtain the depth map.
Optionally, the depth estimation unit is specifically configured to:
starting from a designated pixel point of the reference view angle image, performing the following steps on the pixel points one by one in a traversal manner:
for one pixel point, acquiring a plurality of sampling pixel points from the pixel points in a set area around the one pixel point;
Determining the iterative updating loss values of the pixel point and of each sampling pixel point respectively, based on the plane on which each pixel point is located under the plane prior condition and on the pixel feature correlation degree between the corresponding stereo matching area and mapping area;
If a target sampling pixel point with a corresponding iteration updating loss value smaller than the iteration updating loss value corresponding to the pixel point exists in the plurality of sampling pixel points, updating the current parallax plane parameter of the pixel point according to the current parallax plane parameter corresponding to the target sampling pixel point.
Optionally, the pixel screening unit and the depth estimation unit are further specifically configured to:
Dividing the reference view angle image into N channel images according to a set number N of processing channels, wherein, for any two adjacent pixel points included in one channel image, their original positions in the reference view angle image are spaced N-1 pixel points apart;
and acquiring the plurality of sampling pixel points from the pixel points of the surrounding set area included in the channel image where the pixel point is located.
In one aspect, a computer device is provided comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods described above when the computer program is executed.
In one aspect, there is provided a computer storage medium having stored thereon computer program instructions which, when executed by a processor, perform the steps of any of the methods described above.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the steps of any of the methods described above.
In the embodiment of the application, when depth estimation is performed for a target scene, plane detection is performed on the reference view angle image to obtain the plane information in the reference view angle image, i.e. which pixel points in the reference view angle image belong to the same plane. A reference pixel point set whose pixel feature association degree meets a set association condition is then selected from the reference view angle image according to the pixel feature association degrees between scene images of different view angles of the target scene, the plane prior condition of the reference view angle image is generated based on the obtained plane information and the reference pixel point set, and the depth map corresponding to the reference view angle image is generated with the plane prior condition as a constraint. Therefore, by introducing plane detection, the embodiment of the application makes full use of the plane information in the image and improves the accuracy of the plane prior condition, so that the depth values of weak-texture regions in the depth map solved in combination with the plane prior condition are more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to the provided drawings without inventive effort for those skilled in the art.
FIG. 1 is a schematic diagram of patch mapping in images with different viewing angles according to an embodiment of the present application;
FIGS. 2a-2d are depth maps obtained by performing depth estimation according to the related art;
fig. 3 is a schematic view of an application scenario provided in an embodiment of the present application;
Fig. 4 is a flow chart of a depth estimation method based on plane prior according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a plane detection process according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of determining a reference pixel point set from each pixel point according to an embodiment of the present application;
FIGS. 7a-7c are schematic diagrams illustrating several traversal methods according to embodiments of the present application;
fig. 8 is a flowchart illustrating an iterative updating process of a pixel Pi according to an embodiment of the present application;
FIGS. 9a and 9b are schematic diagrams illustrating a set of sampling pixels according to an embodiment of the present application;
Fig. 10 is a schematic diagram of a process for obtaining a plane prior based on plane information and a reference pixel point set according to an embodiment of the present application;
FIG. 11 is a schematic flow chart of obtaining a plane prior condition according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a planar prior condition provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of another process for obtaining plane prior conditions according to an embodiment of the present application;
fig. 14 is a flowchart of obtaining a depth map of a reference view image according to an embodiment of the present application;
FIGS. 15a-15b are depth maps obtained by the method according to the embodiment of the application;
Fig. 16 is a schematic structural diagram of a depth map generating device based on planar prior according to an embodiment of the present application;
Fig. 17 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
The target scene is formed by one or more scene elements, where a scene element can be a person, a game character, an object, or the environment in which they are located. For example, game characters or the environment in a game can serve as scene elements to form a game scene, and during driving, vehicles ahead and the road environment can serve as scene elements to form a driving scene.
The stereo matching area, or patch, is defined on a pixel of the reference view image and corresponds to a tiny plane in three-dimensional space, so that a depth and a normal vector can be expressed. Referring to fig. 1, for a pixel Pi in the reference view, the patch in the reference view image is Qi as shown in fig. 1, where the size and shape of Qi may be defined freely; for example, Qi may be an n×n plane centered on Pi.
The pixel feature correlation, or normalized cross correlation (NCC), is used to describe the degree of similarity between two patches: the greater the NCC value, the more similar the two patches; the smaller the NCC value, the more dissimilar they are. In multiple scene images of different viewing angles, although the viewing angles differ, the presented scene elements are the same, so the pixel features in the images have a certain similarity, and when depth estimation is performed, the parallax plane of each pixel can be solved based on the features of the multi-view images.
Taking luminosity as the pixel characteristic as an example, photometric consistency means that no matter from which viewing angle a certain point of the target object is observed, the luminosity received from that point in space is the same, so a point to be reconstructed should have the same color in the images of all viewing angles; in other words, the core of three-dimensional reconstruction is recovering the points in space that satisfy photometric consistency. As shown in fig. 1, a patch of a certain size is taken with the pixel Pi as the center; after the patch is mapped into the scene images of other viewing angles (for example, the patch centered on Pi1 in C1 is Qi1, as shown in fig. 1), the similarity of the patch colors is calculated to judge whether Pi corresponds to the same point as Pi1, Pi2 and Pi3 respectively.
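For illustration only, the following minimal Python sketch computes NCC between two equally sized patches; the patch extraction and the zero-denominator handling are assumptions for this sketch, not the embodiment's exact procedure.

```python
import numpy as np

def ncc(patch_a: np.ndarray, patch_b: np.ndarray) -> float:
    # Normalized cross correlation between two equally sized patches.
    # Returns a value in [-1, 1]; larger values mean the patches are more similar.
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < 1e-12:
        # Nearly constant (weak-texture) patches: correlation is not well defined.
        return 0.0
    return float(np.dot(a, b) / denom)
```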
The homography matrix here mainly refers to the planar homography matrix, through which a projective mapping from one plane to another can be realized. Homography matrices can be used to solve two problems: one is to express the perspective transformation between a plane in the real world and its corresponding image, and the other is to transform an image from one view to another by a perspective transformation.
When there are N scene images of a target scene taken from different viewing angles, suppose the depth map of the i-th scene image is to be solved; the i-th scene image is then set as the reference image (REFERENCE IMAGE), and the other images are all Source Images. For a pixel coordinate (x0, y0) of the reference view image, a plane parameter (a, b, c, d) is randomly assigned to it, meaning that the pixel belongs to the plane ax + by + cz + d = 0 in space; (a, b, c, d) is then the plane hypothesis parameter of the pixel (x0, y0), where d is the depth corresponding to the pixel (x0, y0). According to the plane hypothesis parameter, a patch centered on (x0, y0) with size n×n can be mapped into the images of the other viewing angles by homography transformation, and the mapping result is shown in fig. 1.
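As a hedged illustration of this mapping (not the patent's exact formulation), the standard plane-induced homography between two calibrated views can be sketched as follows; the variable names, the plane convention n·X + d = 0 in the reference camera frame, and the relative-pose convention are assumptions.

```python
import numpy as np

def plane_induced_homography(K_ref, K_src, R, t, n, d):
    # Homography induced by the plane n.X + d = 0 expressed in the reference
    # camera frame; [R | t] is the pose of the source camera relative to the
    # reference camera, K_ref / K_src are the camera intrinsic matrices.
    n = np.asarray(n, dtype=np.float64).reshape(3, 1)
    t = np.asarray(t, dtype=np.float64).reshape(3, 1)
    H = K_src @ (R - (t @ n.T) / d) @ np.linalg.inv(K_ref)
    return H / H[2, 2]

def warp_pixel(H, x, y):
    # Map a reference-view pixel (x, y) into the source view.
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]
```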
In general, the process of solving the depth map is the process of determining the parallax plane parameter of each pixel point. An image may contain weak-texture regions whose depth cannot be accurately perceived through edge features, for example the plane region of the table in fig. 2a: in theory, the depth of this region should transition smoothly in the depth map, but in practice the algorithm cannot perceive its depth accurately. From a practical point of view, since the region belongs to one plane, a plane constraint can be added when solving the depth map, so that when the parallax plane parameters are selected, a plane constraint factor is added in the weak-texture region, forcing the parallax plane parameters of the pixels in the region to meet certain constraint requirements; this plane constraint is the plane prior condition. In general, the depth values of some reliable pixels can be retained through a strict geometric-consistency check, and the plane regions in the image can be detected more accurately through plane detection; therefore, a triangular mesh can be constructed on the image plane based on the reliable pixels combined with the plane detection results, and the triangular mesh constructed in this way can be used as the plane prior condition for depth estimation.
Triangulation: in geometry, triangulation refers to subdividing a planar object into triangles and, by extension, subdividing a higher-dimensional geometric object into simplices. For a given set of points there are many ways of triangulation; Delaunay triangulation is a common one that satisfies two characteristics (a small code sketch follows the two characteristics below):
(1) The empty-circle characteristic: the Delaunay triangulation is unique (assuming no four points are co-circular), and the circumscribed circle of any triangle in the Delaunay triangulation contains no other point of the set.
(2) The maximized minimum-angle characteristic: among all triangulations that the scatter set may form, the minimum angle of the triangles formed by Delaunay triangulation is the largest; in this sense, Delaunay triangulation is the triangulation "closest to regular". Specifically, after two adjacent triangles form a convex quadrilateral, exchanging its diagonal does not increase the minimum of the six interior angles.
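The following is a minimal sketch of Delaunay triangulation of a 2D point set using SciPy; the points are made up for illustration and do not come from the patent.

```python
import numpy as np
from scipy.spatial import Delaunay

# Illustrative reliable reference pixels (x, y); real values would come from
# the geometric-consistency check described above.
points = np.array([[10, 12], [40, 15], [25, 60], [70, 55], [55, 90]], dtype=float)

tri = Delaunay(points)   # Delaunay triangulation of the scatter set
print(tri.simplices)     # each row lists the indices of one triangle's three vertices
```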
Embodiments of the present application relate to artificial intelligence and machine learning techniques, designed primarily based on computer vision techniques in artificial intelligence.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as identifying and measuring targets, and further performs graphics processing so that the computer produces images more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation, etc., as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Machine learning is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
An Artificial Neural Network (ANN) abstracts the human brain's neural network from an information-processing point of view, builds a simple model, and forms different networks according to different connection modes. A neural network is a computational model formed by a large number of interconnected nodes (or neurons); each node represents a specific output function, called an activation function, and the connection between every two nodes represents a weighting value for the signal passing through that connection, called a weight, which is equivalent to the memory of the artificial neural network. The output of the network differs according to the connection mode of the network, the weight values and the activation functions; the network itself is usually an approximation of some algorithm or function in nature, or an expression of a logical strategy.
With the research and progress of artificial intelligence technology, artificial intelligence is being researched and applied in various fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, robots, smart medical care, smart customer service, the Internet of Vehicles and intelligent transportation, and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and will be of increasing importance.
The scheme provided by the embodiment of the application relates to the technology of computer vision of artificial intelligence, machine learning and the like. According to the embodiment of the application, the plane detection model of machine learning is adopted to carry out plane detection on the reference view angle image so as to acquire plane information in the reference view angle image, and then a more accurate plane priori condition is generated by combining the plane information, and the plane priori condition is taken as a constraint to carry out depth estimation of the reference view angle image.
The following briefly describes the design concept of the embodiment of the present application.
The multi-view stereo matching technology is a three-dimensional visual technology for recovering a stereo structure from color images of multiple view angles with known camera internal and external parameters by utilizing a stereo matching algorithm, and has better effect on depth estimation. However, in the weak texture region, the pixels in the region cannot be reliably distinguished due to the small difference of the pixels in the region, so that the depth value cannot be accurately estimated, and therefore, the depth estimation effect of the related technology in the weak texture region is poor.
Referring to fig. 2a to 2d, which show depth maps obtained by depth estimation using the related technology: in fig. 2a, the depth at the position of the round table should in theory transition smoothly, but in fig. 2b the depth estimated for the round-table area enclosed by the corresponding box obviously does not match the actual depth. Similarly, in fig. 2c the door on the right side belongs to one plane and its depth should in theory transition smoothly, but in fig. 2d the depth estimated for the part enclosed by the corresponding box obviously does not match the actual depth. Hence the depth estimation results obtained by the related technology are not accurate.
Considering that the round-table position shown in fig. 2a belongs to the same plane and the pixel points in that area differ little from one another, a plane constraint can be applied at the round-table position so that a certain constraint relation exists between the depth values in the area; and, to better fuse the plane information, plane detection can be performed on the image so as to obtain more accurate plane information.
In view of this, an embodiment of the present application provides a depth estimation method based on plane priors. In this method, when depth estimation is performed for a target scene, plane detection is performed on the reference view image to obtain the plane information in the reference view image, that is, which pixels in the reference view image belong to the same plane. Furthermore, according to the pixel feature association degrees between scene images of different views of the target scene, a reference pixel point set whose pixel feature association degree satisfies a set association condition is selected from the reference view image; then, based on the obtained plane information and the reference pixel point set, the plane prior condition of the reference view image is generated, and the depth map corresponding to the reference view image is generated with the plane prior condition as a constraint. Therefore, by introducing plane detection, the embodiment of the application makes full use of the plane information in the image and improves the accuracy of the plane prior condition, so that the depth values of weak-texture regions in the depth map solved in combination with the plane prior condition are more accurate.
The following description is made for some simple descriptions of application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application, but not limiting. In the specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The scheme provided by the embodiment of the application can be applied to most application scenes related to depth estimation, such as VR, AR, 3D games, 3D film and television works, short videos or automatic driving scenes. As shown in fig. 3, a schematic view of an application scenario provided in an embodiment of the present application may include an image acquisition device 301, a depth estimation device 302, and a depth map processing device 303.
The image acquisition device 301 may be, for example, an image capturing device for acquiring a scene image of a target scene. In one possible implementation, the image capturing apparatus 301 may be a binocular image capturing apparatus or a multi-view image capturing apparatus in order to capture images of a scene at different perspectives.
The depth estimation device 302 is configured to implement a depth estimation function, which may include one or more processors 3021, a memory 3022, and an I/O interface 3023 for interacting with the image acquisition device 301, etc. The memory 3022 of the depth estimation device 302 may further store program instructions of the depth estimation method based on plane priors provided by the embodiment of the present application, where the program instructions when executed by the processor 3021 may be configured to implement the steps of the depth estimation method based on plane priors provided by the embodiment of the present application, so as to implement a depth estimation process and obtain a corresponding depth map.
In an embodiment, the depth map processing device 303 may be, for example, a display device, which may directly output the depth map obtained by the depth estimation device 302.
In one embodiment, the above-mentioned scene may be an automatic driving scene. The image acquisition device 301 may be disposed on the vehicle to capture images of the scene around the vehicle and transmit them to the depth estimation device 302; after the depth estimation device 302 obtains the depth map, the depth map processing device 303 determines the distances to surrounding obstacles based on the depth map, so as to assist driving.
In practical application, the depth estimation device 302 may be a computer device and be disposed on a vehicle together with the image acquisition device 301, or the depth estimation device 302 may also be a background server, where after the image acquisition device 301 acquires the scene image, the scene image may be transmitted to the depth estimation device 302 through a network to generate a depth map.
In one embodiment, the above-mentioned scene may be an application scene such as VR, AR, 3D game, 3D movie and television works, etc., and the image capturing device 301 may be used to capture a scene image of a real scene, or if there are a plurality of scene images with different viewing angles in the scene, the image capturing device 301 may not be provided. Further, the depth estimation device 302 may estimate the depth of each object in the image based on the scene image, obtain a corresponding depth map, and the depth map processing device 303 may restore the 3D structure corresponding to each object based on the depth map.
The devices may be directly or indirectly connected through one or more networks. The network may be a wired network, or may be a Wireless network, for example, a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, or may be other possible networks, which embodiments of the present invention are not limited in this respect.
It should be noted that each of the above devices may be deployed separately or in combination according to an actual application scenario, for example, one or more depth estimation devices 302 may be deployed, or the depth estimation device 302 and the depth map processing device 303 may be implemented by the same device.
Of course, the method provided by the embodiment of the present application is not limited to the application scenario shown in fig. 3, but may be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 3 will be described together in the following method embodiments, which are not described in detail herein.
The method flow provided in embodiments of the present application may be performed by depth estimation device 302 in fig. 3. Referring to fig. 4, a schematic flow chart of a depth estimation method based on plane prior according to an embodiment of the present application is shown.
Step 401, acquiring a plurality of scene images of a target scene photographed based on different viewing angles.
In the embodiment of the application, in order to more accurately generate the depth map of the target scene, a multi-view stereo matching method is adopted, and the depth map of the reference view is generated based on scene images of a plurality of different views of the target scene.
In one embodiment, such as an automatic driving scene, a plurality of scene images around the vehicle, for example a plurality of scene images in front of the vehicle, may be captured by a multi-view image capturing apparatus provided on the vehicle; because the cameras are laid out with a certain spatial distance between them, parallax is generally produced, so that a plurality of scene images of different angles are obtained.
In another embodiment, in a scene related to 3D modeling to generate a corresponding 3D stereogram, for example, an application scene such as AR/VR, a 3D game, a 3D film and television work, etc., the image capturing device may be moved to capture scene images of multiple frames of different perspectives for a real scene or a built solid model scene.
Of course, other possible ways of acquiring the scene image may be used, and embodiments of the present application are not limited in this respect.
And 402, performing plane detection on reference view images in a plurality of scene images to obtain plane information in the reference view images, wherein the plane information is used for indicating pixel points belonging to the same plane.
In the embodiment of the application, in order to improve the accuracy of the finally generated depth map, the plane detection is performed on the reference view angle image so as to extract the plane information in the image, and the plane prior is generated by fully utilizing the plane information, so that the weak texture area in the reference view angle image can still obtain a better depth estimation effect.
Specifically, the plane detection (plane estimation) refers to detecting pixels belonging to the same plane in the scene image, and further outputting which pixels in the scene image belong to the same plane.
In one embodiment, when the plane detection is performed on the reference view image, the plane detection can be performed only according to the reference view image, so that the plane information in the reference view image can be obtained.
In another embodiment, considering that the image feature information of the reference view image is also contained in the scene images of other views, the plane information in the reference view image may be obtained by integrating the image feature information of the scene images of other views. For example, plane detection may also be performed on the scene images of other views and the plane detection results of the scene images may be integrated; or, in the image feature extraction stage, the image features of the scene images of other views may be fused into the corresponding image features of the reference view image, where the corresponding image features may be corresponding pixel features or corresponding region features. This fuses the image information of multi-view images, can improve the accuracy of the plane information, and correspondingly improves the accuracy of the depth estimation result.
In practical application, any plane detection method may be used, for example a 3D Hough transform point-cloud plane detection algorithm or a region-based plane detection method using convolutional neural network features (PlaneRCNN); of course, other possible plane detection methods may also be used, which is not limited in the embodiment of the present application.
Next, a plane detection of the above-described reference view angle image based on a plane detection model of one of the plane detection methods will be described as an example.
In the practical application process, the plane detection model needs to be trained in advance by adopting a machine learning method until reaching a convergence condition, and the plane detection model can be applied to the practical plane detection process.
In one embodiment, during the machine learning phase, a supervised learning-based approach may be employed for model training. Specifically, a plurality of training samples can be acquired and obtained, each training sample can comprise a training image and a training label of the training image, the training label can be used for indicating whether each image area in the training image is a plane or not, and then a plane detection model obtained through training can accurately detect which areas in the image are planes and which areas are non-planes. Of course, the training label may be appropriately adjusted based on the actual training requirements, which is not limited by the embodiment of the present application.
In one embodiment, during the machine learning phase, a self-supervised learning-based approach may also be employed for model training. Specifically, a plurality of training samples may be acquired, where each training sample may include a training image, and during the training process, whether each region in the image belongs to the same plane may be identified based on a correlation degree of the images in the training image.
After the plane detection model is obtained through training in the machine learning stage, the plane detection model can be applied to the plane detection model. Referring to fig. 5, a schematic flow chart of the plane detection process is shown.
S4021, performing image semantic segmentation on the reference view angle image to obtain the corresponding plane masks of each plane area in the reference view angle image.
In the embodiment of the application, the plane detection model comprises a semantic segmentation part, and the semantic segmentation part is used for carrying out image semantic segmentation on the reference view image so as to obtain the plane masks corresponding to each plane area in the reference view image.
The semantic segmentation part may be, for example, a mask-CNN network structure, and of course, other possible network structures may also be used, which is not limited in the embodiment of the present application.
S4022, performing depth estimation processing on the reference view image to obtain an estimated depth map of the reference view image.
Specifically, to estimate the depth map of the whole image, a decoder may be added behind the feature pyramid network (FPN) of Mask R-CNN. Each layer of the depth map decoder uses a convolution kernel of size 3×3 with stride 1 followed by a convolution kernel of size 4×4 with stride 2, and finally a bilinear upsampling method is used to obtain a depth map with the same size as the input image.
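A rough PyTorch sketch of such a decoder layer is given below; the number of layers, the channel widths, the use of a transposed convolution for the 4×4 stride-2 kernel, and the activation function are all assumptions made for illustration, not details taken from the embodiment.

```python
import torch.nn as nn
import torch.nn.functional as F

class DepthDecoderLayer(nn.Module):
    # One decoder layer: a 3x3 stride-1 convolution followed by a 4x4 stride-2
    # (here transposed) convolution that doubles the spatial resolution.
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=1, padding=1)
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        return F.relu(self.up(F.relu(self.conv(x))))

def decode_depth(fpn_feature, layers, out_hw):
    # Run the stacked decoder layers on an FPN feature map, then bilinearly
    # upsample to the input-image resolution, as described above.
    x = fpn_feature
    for layer in layers:
        x = layer(x)
    return F.interpolate(x, size=out_hw, mode="bilinear", align_corners=False)
```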
And S4023, updating the plane mask based on the estimated depth map to obtain plane information of the reference view image.
In the implementation, the plane mask obtained through the semantic segmentation is only a rough plane estimation, and the edge part of the plane mask has a large error, so that the plane mask obtained through the process can be optimized and updated by combining an estimated depth map to obtain the plane information of the reference view image more accurately.
Step 403, determining a reference pixel point set with the pixel characteristic association degree meeting the set association condition from all the pixel points included in the reference view angle image based on the pixel characteristic association degrees among the plurality of scene images.
In the embodiment of the application, since the plurality of scene images are obtained by shooting the same target scene, the same objects exist in the plurality of scene images, so a certain correlation exists between them, and corresponding pixel points in the plurality of scene images can be located based on this correlation. A pixel point whose corresponding pixel points in different scene images have a high similarity, that is, a pixel point that is highly likely to correspond to the same spatial point across views, can be used as a reference pixel point.
Specifically, a multi-view stereo matching method can be used to determine an optimal plane parameter for each pixel point, i.e. the plane that minimizes the matching cost of the pixel point, and the reference pixel point set whose pixel feature association degree meets the set association condition is then determined from all the pixel points according to the optimal plane parameters. Alternatively, the similarity between pixel features in each scene image can be calculated, the most similar pixel of each pixel can be found, and whether a pixel is a reference pixel is determined according to the similarity between the pixel and its similar pixel.
In the following, multi-view stereo matching is taken as an example. Its basic idea is to randomly initialize a plane hypothesis parameter for each pixel point and then try replacing the current plane hypothesis parameter with those of its surrounding pixels, calculating a cost for each candidate; the plane hypothesis parameter with the minimum cost is taken as the new plane hypothesis parameter of the pixel, and the cost can usually be calculated using photometric consistency. The method mainly comprises four basic steps: random initialization, propagation, view selection and refinement.
Referring to fig. 6, a flow chart of determining a reference pixel set from each pixel is shown.
Step 4031, initializing parameters of each pixel point to obtain initial parallax plane parameters corresponding to each pixel point.
Specifically, a random initialization manner may be adopted to randomly assign a parallax plane to each pixel, and by random assignment, it is expected that at least one pixel in each plane area can be randomly assigned to the correct parallax plane parameter.
Of course, any other possible initialization method may be used, for example, assigning values according to pixel values in the reference view image, which is not limited by the embodiment of the present application.
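A minimal sketch of the random initialization in step 4031 is shown below; the depth range, the unit-normal parameterization, and forcing the normal to face the camera are assumptions made for illustration.

```python
import numpy as np

def random_init_planes(h, w, depth_min=0.5, depth_max=10.0, seed=None):
    # Randomly assign a parallax-plane hypothesis (unit normal + depth) to every pixel.
    rng = np.random.default_rng(seed)
    normals = rng.normal(size=(h, w, 3))
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)  # unit normals
    normals[..., 2] = -np.abs(normals[..., 2])                  # face the camera (assumption)
    depths = rng.uniform(depth_min, depth_max, size=(h, w))
    return normals, depths
```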
Step 4032, determining a stereo matching area of each pixel point in the reference view image for one pixel point, wherein each stereo matching area comprises an image area with a set range taking each pixel point as a reference point.
In the embodiment of the present application, the stereo matching area corresponding to each pixel point, i.e. the patch, refers to an image area of size n×n centered on the pixel point, where the value of n may be set according to actual requirements and is not limited here.
Step 4033, performing iterative update processes on the initial parallax plane parameters corresponding to each pixel point respectively based on the pixel feature association degree of the corresponding areas in the reference view angle image and other scene images, so as to obtain the estimated parallax plane parameters corresponding to each pixel point respectively.
In the embodiment of the application, after the initialization process is completed, parameter propagation can be performed according to the current parallax plane parameters of each pixel point, namely, the initial parallax plane parameters corresponding to each pixel point are respectively subjected to a plurality of iterative updating processes, and in the iterative updating process, the better parallax plane parameters are selected, so that the estimated parallax plane parameters corresponding to each pixel point are finally obtained. The iterative updating process can be performed according to the designated pixel sequence, so as to complete the iterative updating process of all the pixels.
In one embodiment, the parallax plane parameters may be iteratively updated pixel by pixel, starting from the first pixel point in the upper left corner of the reference view image. Referring to fig. 7a, which is a schematic diagram of one traversal manner: during the iterative update, the pixel points may be traversed one by one from left to right, starting from the first pixel point in the upper left corner, until the first row of pixel points has been traversed; the next row of pixel points is then traversed, until the iterative update of the parallax plane parameters of all pixel points is completed.
Of course, the above-mentioned traversing direction is only one possible direction, and other traversing directions may be set in practical implementation, as long as the iterative updating of the parallax plane parameters of all the pixels can be completed, which is not limited in the embodiment of the present application.
In one embodiment, in order to allow the graphics processing unit (GPU) to process the image in multiple channels in parallel, so as to fully utilize the processing resources of the GPU and improve image processing efficiency, the reference view image may be divided into N channel images according to a set number N of processing channels, where any two adjacent pixel points within a channel image are separated by N-1 pixel points at their original positions in the reference view image.

Referring to fig. 7b and fig. 7c, which are schematic diagrams of another traversal manner, taking the number of channels N as 2 as an example: as shown in fig. 7b or fig. 7c, all pixel points included in the reference view image may be divided into 2 channels, adjacent pixels of each channel being separated by one pixel in the original image. Dots of different colors in fig. 7b or fig. 7c represent pixels of different channels, and the pixel points of each channel may be traversed row by row, starting from the first pixel point in the upper left corner as shown by the arrows in fig. 7b or fig. 7c, so as to perform the iterative update process of each pixel point.
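As an illustration, the channel split for N = 2 could look like the sketch below; splitting by column parity is only one assumed way to realize the interleaving, the embodiment merely requires that pixels of the same channel are spaced apart in the original image.

```python
import numpy as np

def split_into_channels(image, num_channels=2):
    """Split an H x W image into interleaved channel images.

    Within one channel image, horizontally adjacent pixels are
    num_channels - 1 pixels apart in the original image, so each channel
    can be handed to a separate GPU stream or kernel launch.
    """
    return [image[:, c::num_channels] for c in range(num_channels)]

# Usage: two channel images whose columns alternate in the original image.
img = np.arange(6 * 8).reshape(6, 8)
ch0, ch1 = split_into_channels(img, num_channels=2)
```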
In the following, since the iterative updating of each pixel is similar, a sequential iterative updating process of one of the pixels Pi will be described as an example. Referring to fig. 8, a flowchart of an iterative updating process of the pixel Pi is shown.
Step 40331, acquiring a plurality of sampling pixel points from the pixel points in a set area around the pixel point Pi.

In the embodiment of the present application, different traversal manners correspond to different sampling manners, so as to obtain the sampling point set S = (s1, s2, ..., sn) corresponding to the pixel point Pi.

In one embodiment, when the traversal manner shown in fig. 7a is adopted, pixel sampling may be performed directly from the pixels in the set area around the pixel point Pi in the reference view image, so as to obtain a plurality of sampling pixel points. Referring to fig. 9a, which is a schematic diagram of a set of sampling pixel points, the sampling area around a pixel point may be preset; the black dot in the center of fig. 9a represents the pixel point Pi, and the surrounding white dots are the pixels of the set sampling area.

In one embodiment, when the traversal manner shown in fig. 7b or fig. 7c is adopted, pixel sampling may be performed from the pixels of the surrounding set area included in the channel image where the pixel point Pi is located, so as to obtain a plurality of sampling pixel points. Referring to fig. 9b, which is a schematic diagram of another set of sampling pixel points, the black dot in the center of fig. 9b represents the pixel point Pi, and the surrounding white dots are the pixels of the set sampling area; it can be seen that the sampled pixels are all one pixel apart.
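The two sampling patterns can be sketched as follows; the concrete offset set is an illustrative assumption consistent with fig. 9a and fig. 9b (direct neighbours versus neighbours spaced by one pixel).

```python
def sample_neighbours(y, x, height, width, stride=1):
    """Return sampling pixel coordinates around (y, x).

    stride=1 samples directly adjacent pixels (fig. 9a style); stride=2
    samples within one channel image, where pixels of the same channel are
    one pixel apart (fig. 9b style).
    """
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1),
               (-1, -1), (-1, 1), (1, -1), (1, 1)]
    samples = []
    for dy, dx in offsets:
        ny, nx = y + dy * stride, x + dx * stride
        if 0 <= ny < height and 0 <= nx < width:
            samples.append((ny, nx))
    return samples
```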
Step 40332, determining an iterative update loss value of the pixel point Pi based on the pixel feature association degree between the stereo matching area corresponding to the pixel point Pi and each corresponding mapping area.

In the embodiment of the application, a patch can be transformed into the image of any other view angle through the homography transformation matrix between the view angles. Based on this property, the patch corresponding to the pixel point Pi can be determined from the current parallax plane parameter corresponding to the pixel point Pi, and the patch can be mapped into the other scene images through the homography transformation matrices between the view angles.
Specifically, referring to fig. 1, a patch with a certain size is taken with a pixel Pi as the center, and after the patch is mapped into a scene image with other view angles, the patch is mapped to a patch with Pi1 as the center in C1 in fig. 1, namely qi1, and similarly mapped to qi2 and qi3 in C2 and C3 respectively.
Theoretically, if the current parallax plane parameter corresponding to Pi is accurate enough, the points Pi1 to Pi3 obtained by mapping should be the projections, in the scene images of the other view angles, of the same spatial point as Pi, and the corresponding patches should cover the same region, so the similarity between the patches should be sufficiently high. Therefore, the similarity between patches can be calculated to determine whether Pi corresponds to the same point as Pi1, Pi2 and Pi3 respectively.
Specifically, the pixel feature association degree between the stereo matching region and each mapping region can be determined correspondingly based on the stereo matching region corresponding to the pixel point Pi and the pixel feature values in each mapping region, and the pixel feature association degree is used for representing the similarity degree between two patches.
In one embodiment, the pixel feature may be a photometric feature. Photometric consistency means that a point on the target object has the same luminosity regardless of the view angle from which it is observed, so that the point to be reconstructed has the same color in the image of each view angle; the similarity of patch colors can therefore be calculated to determine whether Pi is the same point as Pi1, Pi2 and Pi3 respectively. Thus, the pixel feature association degree between two patches can be calculated by normalized cross correlation:

NCC(f, t) = Σ_x (f(x) - μf)(t(x) - μt) / (n · σf · σt)

wherein NCC characterizes the pixel feature association degree between the two patches, f and t are two different patches, the sum runs over the n pixels of a patch, μf and μt are the mean values of the pixel features in patch f and patch t, σf and σt are the standard deviations of the pixel features of patch f and patch t respectively, patch f is a patch in the reference view image, and patch t is a mapped patch in a scene image of another view angle.
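A minimal NumPy sketch of this NCC computation between a reference patch and a mapped patch (both assumed to be equally sized arrays already extracted from the images):

```python
import numpy as np

def ncc(patch_f, patch_t, eps=1e-8):
    """Normalized cross correlation between two equally sized patches.

    Returns a value in [-1, 1]; higher means the patches are more similar.
    """
    f = patch_f.astype(np.float64).ravel()
    t = patch_t.astype(np.float64).ravel()
    f_centered = f - f.mean()
    t_centered = t - t.mean()
    denom = f.size * f.std() * t.std() + eps  # eps guards against flat patches
    return float((f_centered * t_centered).sum() / denom)

def patch_cost(patch_f, patch_t):
    # Matching cost used below: m = 1 - NCC.
    return 1.0 - ncc(patch_f, patch_t)
```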
Through the above process, the relevance of each pixel characteristic between the patch in the reference view image and the mapping patch in the scene image of other views can be respectively calculated, and further, the iterative updating loss value corresponding to the pixel point Pi can be determined based on the obtained relevance of each pixel characteristic, wherein the iterative updating loss value and the relevance of the pixel characteristic are in negative correlation.
Specifically, a certain weight value may be given to each view angle, and then, from the above pixel feature association degrees, a pixel feature association degree corresponding to the view angle with the largest weight value is selected according to the weight value of each view angle, and an iterative update loss value corresponding to the pixel point Pi is determined according to the pixel feature association degree. Or, the weighted summation can be performed according to the weight value, so as to determine the iterative update loss value corresponding to the pixel point Pi according to the pixel characteristic association degree obtained by the weighted summation. It should be noted that the weight values of the respective views corresponding to the different pixels may be different.
In one embodiment, the iteratively updated penalty value cost may be calculated as follows:
m(pi,l)=1-NCC
wherein l is the current parallax plane parameter corresponding to the pixel pi, the function m is a cost function, namely the cost when the corresponding current parallax plane of the pixel p is l, and NCC in the formula is the final obtained pixel characteristic association degree. Of course, the cost may be calculated in other manners, which are not limited in this embodiment of the present application.
Step 40333, determining the iterative update loss values corresponding to the sampling pixel points respectively, based on the pixel feature association degree between the stereo matching area corresponding to each sampling pixel point and each corresponding mapping area.

Similar to the calculation of the iterative update loss value of the pixel point Pi, the iterative update loss values corresponding to the other sampling pixel points can be calculated in the same way, so the process for the other sampling pixel points is not repeated here.

Step 40334, if there is a target sampling pixel point, among the plurality of sampling pixel points, whose iterative update loss value is smaller than the iterative update loss value corresponding to the pixel point Pi, updating the current parallax plane parameter of the pixel point Pi with the current parallax plane parameter corresponding to the target sampling pixel point.
In the embodiment of the application, parameter propagation updates the current parallax plane parameter of each pixel point with either its own parallax plane parameter or the parallax plane parameters of neighboring pixel points; the aim is to find, for each pixel point, the parallax plane parameter with the minimum cost, namely the objective function for the pixel point pi is:

l*(pi) = argmin_{l ∈ F} m(pi, l)

wherein F is a set of candidate parallax plane parameters, for example, the set formed by the parallax plane parameters of the sampling point set S together with the current parallax plane parameter corresponding to the pixel point Pi.
Specifically, for each sampling pixel point si in the sampling pixel point set, it is judged whether the following inequality holds, where li is the current parallax plane parameter of si:
m(pi,l)>m(si,li)
If the above inequality holds, it indicates that li is more suitable for the current pixel point Pi, and the current parallax plane parameter of the pixel point Pi is updated with the current parallax plane parameter corresponding to that target sampling pixel point, i.e. the current parallax plane parameter of Pi is updated to li.
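The check above can be sketched as follows, reusing the per-pixel planes and costs arrays and the matching_cost helper assumed in the earlier sketches; re-evaluating Pi's cost under the adopted plane is an illustrative choice.

```python
def propagate(pi, samples, planes, costs, matching_cost):
    """One propagation step for pixel pi following steps 40331 to 40334.

    pi is a (y, x) coordinate and samples is a list of (y, x) coordinates of
    the sampling pixel points; if a sampling pixel has a smaller iterative
    update loss value, pi adopts its current parallax plane parameter.
    """
    y, x = pi
    for sy, sx in samples:
        if costs[sy, sx] < costs[y, x]:
            planes[y, x] = planes[sy, sx]
            costs[y, x] = matching_cost(y, x, planes[y, x])
```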
Step 4034, determining a stereo matching area corresponding to each pixel point based on the obtained estimated parallax plane parameters corresponding to each pixel point, and mapping areas in the other scene images.
In the embodiment of the application, the iterative update process can be performed multiple times until a convergence condition is met. The convergence condition may be that the number of iterative updates reaches a preset number, or that the update rate of the parallax plane parameters meets a certain condition; for example, if in one iterative update the parallax plane parameters of 90% of the pixel points no longer change, the convergence condition can be considered to be met. The current parallax plane parameters finally obtained in the last iteration are then taken as the estimated parallax plane parameters.
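Such a convergence test can be sketched as follows; the 90% threshold follows the example above, and max_iters is an assumed safeguard.

```python
import numpy as np

def run_until_converged(planes, update_fn, unchanged_ratio=0.9, max_iters=20):
    """Repeat an update pass until enough plane parameters stop changing.

    update_fn(planes) performs one full iterative update pass and returns the
    updated array; convergence is declared when the fraction of pixels whose
    parameters did not change reaches unchanged_ratio.
    """
    for _ in range(max_iters):
        before = planes.copy()
        planes = update_fn(planes)
        unchanged = np.all(planes == before, axis=-1).mean()
        if unchanged >= unchanged_ratio:
            break
    return planes
```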
Furthermore, based on the estimated parallax plane parameters corresponding to each pixel point and the homography transformation matrix between each view angle, the patches of each pixel point in the reference view angle image can be mapped to the scene images of other view angles respectively, and the mapping patches in other scene images can be obtained.
Step 4035, determining the mapping areas of the stereo matching areas in other scene images and the pixel feature association degree between the stereo matching areas, wherein the other scene images are scene images except the reference view angle image in the plurality of scene images.
The calculation manner of the pixel feature association degree corresponding to each pixel point may be the same as that in the above iterative updating process, so the description in the above iterative updating process may be referred to, and no further description is given here.
Step 4036, determining a reference pixel point set with the pixel characteristic association degree meeting the set association condition from all the pixel points.
Specifically, for each pixel point pi, if the pixel feature association degree corresponding to the pixel point pi is greater than the set association degree threshold, the patch of the pixel point pi is considered highly similar to a patch in another view angle, so the estimated parallax plane parameter corresponding to the pixel point pi is regarded as accurate and reliable, and the pixel point pi can be used as a reference pixel point. A reference pixel point set is thus obtained; since the number of such pixel points is usually not large, they are also called sparse reliable points.
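A sketch of selecting these sparse reliable points from a per-pixel association-degree map; the array assoc and the threshold value 0.8 are assumed inputs, not values fixed by the embodiment.

```python
import numpy as np

def select_reference_pixels(assoc, threshold=0.8):
    """Return the (y, x) coordinates whose pixel feature association degree
    exceeds the threshold; these form the sparse reliable reference set."""
    ys, xs = np.where(assoc > threshold)
    return list(zip(ys.tolist(), xs.tolist()))
```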
Step 404, obtaining a plane prior condition of the reference view image based on the plane information and the reference pixel point set.
In the embodiment of the application, plane subdivision can be performed on the reference view image based on the reference pixel point set, and by combining the plane information it can be determined which pixels in the reference view image belong to the same plane, so that the plane prior condition obtained in the plane subdivision process of the reference view image can be more accurate.

Referring to fig. 10, which shows the process of obtaining the plane prior based on the plane information and the reference pixel point set: the reference pixel point set shown in fig. 10 is obtained by extracting reference pixel points from the reference view image, and fig. 10 also shows the depth map of each pixel point obtained through the above iterative update process. In addition, plane detection can be performed on the reference view image through PLANERCNN to obtain the plane information shown in fig. 10, which indicates which image areas form planes; the depth map can then be optimized by combining it with this plane information to obtain the plane prior shown in fig. 10. It can be clearly seen that after the plane prior optimization, the depth within the plane regions is noticeably improved, matches the actual situation in the image better, and has higher accuracy.
In one embodiment, referring to fig. 11, a schematic flow chart of a planar prior condition for obtaining a reference view image is shown.
Step 4041a, performing plane fitting processing by adopting a triangulation method based on the reference pixel point set to obtain a triangular mesh structure, wherein each triangular plane in the triangular mesh structure comprises three reference pixel points in the reference pixel point set.
Specifically, referring to fig. 12, each pixel point in the reference pixel point set may be taken as a vertex, and a Delaunay triangulation method may be adopted to obtain a triangular mesh structure shown on the right side, where each triangular plane in the triangular mesh structure includes three reference pixel points in the reference pixel point set.
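For example, this triangulation step could be sketched with SciPy as follows; the use of scipy.spatial.Delaunay is an illustrative choice rather than a requirement of the embodiment.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate_reference_points(points_xy):
    """Delaunay-triangulate the reference pixel points.

    points_xy: (M, 2) array of (x, y) image coordinates of reference pixels.
    Returns a (T, 3) array of vertex indices, one row per triangular plane.
    """
    tri = Delaunay(np.asarray(points_xy, dtype=np.float64))
    return tri.simplices
```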
Step 4042a, based on the plane information, merging the triangular planes corresponding to the reference pixel points located on the same plane in the triangular mesh structure to obtain a merged triangular mesh structure.
Referring to fig. 12, every three reference pixel points form a triangular plane, and different triangular planes correspond to different parallax planes. In the actual reference view image, however, multiple reference pixel points may belong to the same plane, so the fitting result of the triangular mesh structure may not match the actual situation. To alleviate the inaccurate final depth estimation caused by this situation, the triangular mesh structure is modified by introducing the plane detection result.

Specifically, from the plane information obtained through plane detection it is known which pixel points belong to the same plane, and accordingly which reference pixel points belong to the same plane. Therefore, if there are reference pixel points that should belong to the same plane but fall into several different triangular planes, those triangular planes should be merged; for example, plane fitting is performed again based on the reference pixel points belonging to the same plane, so as to update the triangular mesh structure and obtain the merged triangular mesh structure.

Referring to fig. 12, based on the plane information it is determined that the 5 reference pixel points in the dashed frame should belong to the same plane, but in the triangular mesh structure obtained by triangulation these reference pixel points are clearly not located on the same plane, so the corresponding triangular planes can be merged to obtain the merged triangular mesh structure in fig. 12.
Step 4043a, obtaining a plane prior condition based on the combined triangular mesh structure.
Specifically, based on the combined triangular mesh structure, corresponding plane parameters are obtained and used as plane priori conditions.
In another embodiment, referring to fig. 13, another flow chart of obtaining a planar prior condition of a reference view image is shown.
Step 4041b, determining a plurality of reference pixel point groups belonging to the same plane in the reference pixel point set based on the plane information, wherein each reference pixel point group comprises at least one reference pixel point.
Specifically, in the plane information, which pixel points belong to the same plane is specified, then it may be determined based on the plane information, which reference pixel points belong to the same plane, and a plurality of reference pixel points belonging to the same plane are merged into the same reference pixel point group, so as to obtain a plurality of reference pixel point groups, where each reference pixel point group includes one or more reference pixel points.
Step 4042b, performing plane fitting processing on the obtained plurality of reference pixel point groups respectively to obtain a plane combination structure consisting of a plurality of fitting planes, wherein each reference pixel point group corresponds to one fitting plane in the plurality of fitting planes.
For each reference pixel point group, plane fitting processing can be performed based on the reference pixel points included in the reference pixel point group, so that a fitting plane corresponding to the reference pixel point group is obtained, and a plane combination structure formed by a plurality of fitting planes is obtained.
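A minimal sketch of this per-group plane fitting; it assumes every reference pixel point already has an associated 3D position (for example, back-projected with its estimated depth), which is an assumption about the input rather than a step spelled out here.

```python
import numpy as np

def fit_plane(points_xyz):
    """Least-squares plane fit: returns (a, b, c, d) with ax + by + cz + d = 0.

    points_xyz: (M, 3) array of 3D positions of the reference pixel points in
    one reference pixel point group (M >= 3, not all collinear).
    """
    pts = np.asarray(points_xyz, dtype=np.float64)
    centroid = pts.mean(axis=0)
    # The plane normal is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]
    d = -float(np.dot(normal, centroid))
    return normal[0], normal[1], normal[2], d

def fit_planes_per_group(groups):
    """Fit one plane for each reference pixel point group (list of point arrays)."""
    return [fit_plane(g) for g in groups]
```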
Step 4043b, obtaining a plane prior condition based on the plane combination structure.
Specifically, based on the plane combination structure, corresponding plane parameters are obtained and used as the plane prior condition.
Step 405, obtaining a depth map of the reference view image based on the pixel characteristic association degree among a plurality of scene images by taking the plane prior condition as a constraint.
In the embodiment of the present application, the procedure of step 405 is similar to the procedure of step 403 described above, except that a plane prior condition is added as a constraint in the iterative update procedure.
Referring to fig. 14, a flow chart of obtaining a depth map of a reference view image is shown.
Step 4051, initializing parameters of each pixel point to obtain initial parallax plane parameters corresponding to each pixel point.
Specifically, when the parameter initialization is performed, the parameter initialization may be performed by a random initialization method, or may be performed according to plane information indicated by the plane prior condition.
Step 4052, performing multiple iterative updating processes on the current parallax plane parameters corresponding to each pixel point based on the plane prior condition and the pixel feature association degree, so as to obtain the target parallax plane parameters corresponding to each pixel point.
Similarly to the description of step 403 above, starting from a designated pixel point of the reference view image, the iterative update process may be performed on the pixel points one by one in a traversal manner, so as to sample and propagate the parallax plane parameter of each pixel point and update it; the difference from step 403 is that here the cost combines photometric consistency with the plane prior. For the traversal manner, reference may be made to the description of step 403, which is not repeated here.
Here, taking the pixel Pi as an example, a plurality of sampling pixels may be obtained from pixels in a set region around the pixel Pi.
Furthermore, the iterative updating loss values corresponding to the pixel points Pi and the sampling pixel points can be respectively determined based on the plane in which each pixel point is located in the plane prior condition and the correlation degree of the corresponding stereo matching area and the pixel characteristics in each mapping area.
Specifically, for each pixel point, a stereo matching area of the pixel point in the reference view angle image can be determined, and based on the current parallax plane parameter and the homography transformation matrix between the view angles, the stereo matching area is mapped to the scene images of other view angles, and the mapping area in the scene images of other view angles is determined.
In the first iterative update, the current parallax plane parameters are the initial parallax plane parameters; in subsequent iterative updates, they are the current parallax plane parameters obtained after the previous iterative update.
Through the above process, the relevance of each pixel characteristic between the patch in the reference view image and the mapping patch in the scene image of other views can be respectively calculated, and further, the iterative updating loss value corresponding to the pixel point Pi can be determined based on the obtained relevance of each pixel characteristic and the plane where the pixel point Pi is located in the plane prior condition, wherein the iterative updating loss value and the relevance of the pixel characteristic are in negative correlation.
In one embodiment, the iterative update loss value cost can be obtained by adding, to the photometric cost part, a penalty term that measures how far the current parallax plane deviates from the plane prior condition.

Here m(Pi, l) represents the cost part obtained from the pixel feature association degree, and α and γ are constants. d_i and n_i represent the current parallax plane parameter of the pixel point Pi: d_i is the depth value of the pixel point Pi under the current parallax plane parameter, and n_i is the normal vector of the current parallax plane of the pixel point Pi, i.e., the vector corresponding to (a, b, c) in the current parallax plane parameter. d_p and n_p represent the plane prior condition of the pixel point Pi: d_p is the depth value of the pixel point Pi under the plane prior condition, and n_p is the normal vector of the plane corresponding to the pixel point Pi under the plane prior condition, i.e., the vector corresponding to (a, b, c) in the plane prior condition. λ_d is the coefficient of the depth difference between the current parallax plane parameter and the plane prior condition, and λ_n is the coefficient of the normal vector difference between the current parallax plane parameter and the plane prior condition.
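The exact functional form of the combined loss is not reproduced here; the sketch below is only one plausible combination consistent with the variables described above (an additive penalty weighted by λ_d and λ_n), not the embodiment's exact formula, and the coefficient values are assumptions.

```python
import numpy as np

def prior_assisted_cost(photo_cost, d_i, n_i, d_p, n_p,
                        lambda_d=0.1, lambda_n=0.5):
    """Illustrative combination of the photometric cost and a plane prior penalty.

    photo_cost: m(Pi, l), e.g. 1 - NCC under the current plane hypothesis.
    (d_i, n_i): depth and unit normal of the current parallax plane at Pi.
    (d_p, n_p): depth and unit normal given by the plane prior condition at Pi.
    """
    n_i = np.asarray(n_i, dtype=np.float64)
    n_p = np.asarray(n_p, dtype=np.float64)
    depth_penalty = lambda_d * abs(d_i - d_p)
    normal_penalty = lambda_n * (1.0 - float(np.dot(n_i, n_p)))  # 0 when aligned
    return photo_cost + depth_penalty + normal_penalty
```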
Similar to the calculation of the iterative update loss value of the pixel point Pi, the iterative update loss values corresponding to the other sampling pixel points can be calculated in the same way, so the process for the other sampling pixel points is not repeated here.

After the iterative update loss values are obtained, if there is a target sampling pixel point, among the plurality of sampling pixel points, whose iterative update loss value is smaller than the iterative update loss value corresponding to the pixel point Pi, the current parallax plane parameter of the pixel point Pi is updated with the current parallax plane parameter corresponding to the target sampling pixel point.
In the embodiment of the application, the iterative updating process can be performed for a plurality of times until the convergence condition is met, so that the current parallax plane parameter obtained at one time finally is the target parallax plane parameter.
Step 4053, determining respective corresponding depth values of the pixel points based on the obtained target parallax plane parameters, so as to obtain a depth map corresponding to the reference view image.
Specifically, the obtained target parallax plane parameter corresponding to each pixel point includes a plane parameter (a, b, c, d), that is, the pixel point belongs to a plane ax+by+cz+d=0 in space, where d is a depth value corresponding to the pixel point.
It should be noted that the execution sequence of the steps is not strictly according to the above sequence, and may be adjusted according to actual situations in practical applications, that is, the steps may be executed according to the above execution sequence, or the steps may be executed simultaneously, which is not limited in the embodiment of the present application.
Referring to fig. 15a and fig. 15b, which show depth maps obtained by the method of the embodiment of the present application: fig. 15a is the depth map obtained by taking the image shown in fig. 2a as the reference view image, and fig. 15b is the depth map obtained by taking the image shown in fig. 2c as the reference view image. It can be clearly seen that, compared with fig. 2b, the depth in the round table area of fig. 15a shows a smooth transition that is more consistent with the actual depth, and, compared with fig. 2d, the depth in the box-selected area of fig. 15b is also more consistent with the actual depth, so the depth estimation effect in weak texture areas is effectively improved.
In the embodiment of the application, the depth map corresponding to the reference view angle image can be obtained through the process, and likewise, the depth map of any scene image of other view angles can also be obtained through the process, so that the obtained depth map can be used in downstream application.
In one embodiment, the three-dimensional modeling may be performed on the target scene based on the obtained depth maps, so as to obtain a three-dimensional stereo map corresponding to the target scene.
In one embodiment, it is also possible to determine obstacle distances around the target vehicle based on the obtained respective depth maps, and to perform automatic driving based on the obtained respective obstacle distances.
In summary, in the embodiment of the application, by introducing plane detection, the plane information in an image is fully mined and used as a plane prior for solving the depth map, so that information in extremely weak texture regions can be recovered and the quality of depth estimation in weak texture regions is better improved. The method can be applied to application scenarios such as AR/VR, 3D games, 3D film and television works, short videos, automatic driving and free viewpoints, can effectively assist in recovering a better 3D structure, generates more visually pleasing results, and improves the product and user experience.
Referring to fig. 16, based on the same inventive concept, an embodiment of the present application further provides a depth map generating apparatus 160 based on plane priors, where the apparatus includes:
A plane detection unit 1601, configured to acquire a plurality of scene images of a target scene captured based on different perspectives, and perform plane detection on reference perspective images in the plurality of scene images to obtain plane information in the reference perspective images, where the plane information is used to indicate pixel points belonging to a same plane;
a pixel screening unit 1602, configured to determine, based on pixel feature association degrees among a plurality of scene images, a reference pixel point set in which the pixel feature association degrees satisfy a set association condition from among the pixel points included in the reference view image;
A plane prior generating unit 1603, configured to obtain a plane prior condition of the reference view image based on the plane information and the reference pixel point set;
The depth estimation unit 1604 is configured to obtain a depth map of the reference view image based on a degree of correlation of pixel features between a plurality of scene images with a plane prior condition as a constraint.
Optionally, the plane prior generation unit 1603 is specifically configured to:
Performing plane fitting processing by adopting a triangulation method based on the reference pixel point set to obtain a triangular mesh structure, wherein each triangular plane in the triangular mesh structure comprises three reference pixel points in the reference pixel point set;
Based on plane information, combining triangular planes corresponding to reference pixel points positioned on the same plane in the triangular mesh structure to obtain a combined triangular mesh structure;
and obtaining a plane prior condition based on the combined triangular mesh structure.
Optionally, the plane prior generation unit 1603 is specifically configured to:
determining a plurality of reference pixel point groups belonging to the same plane in a reference pixel point set based on plane information, wherein each reference pixel point group comprises at least one reference pixel point;
Performing plane fitting processing on the obtained multiple reference pixel point groups respectively to obtain a plane combination structure consisting of a corresponding multiple fitting planes, wherein each reference pixel point group corresponds to one fitting plane in the multiple fitting planes;
Based on the planar combination structure, a planar prior condition is obtained.
Optionally, the plane detection unit 1601 is specifically configured to:
Performing image semantic segmentation on the reference view angle image to obtain respective corresponding plane masks of each plane area in the reference view angle image;
Performing depth estimation processing on the reference view angle image to obtain an estimated depth map of the reference view angle image;
Based on the estimated depth map, the plane mask is updated to obtain plane information of the reference view image.
Optionally, the pixel filtering unit 1602 is specifically configured to:
for each pixel point, the following steps are respectively executed:
Determining, for one pixel point, a three-dimensional matching area of the pixel point in the reference view image, wherein each three-dimensional matching area comprises an image area of a set range with the pixel point as a reference point;
Respectively determining the mapping areas of the stereo matching areas in other scene images and the pixel feature association degree between the stereo matching areas, wherein the other scene images are scene images except for the reference view angle image in the plurality of scene images;
And determining a reference pixel point set with the pixel characteristic association degree meeting the set association condition from all the pixel points.
Optionally, the pixel filtering unit 1602 is further configured to:
Respectively initializing parameters of each pixel point to obtain initial parallax plane parameters corresponding to each pixel point;
Based on the pixel characteristic association degree of the corresponding areas in the reference view angle image and other scene images, respectively performing iterative updating processes on the initial parallax plane parameters corresponding to each pixel point for a plurality of times to obtain estimated parallax plane parameters corresponding to each pixel point;
And determining a stereo matching area corresponding to each pixel point based on each obtained estimated parallax plane parameter, and mapping areas in other scene images.
Optionally, the pixel filtering unit 1602 is specifically configured to:
acquiring a plurality of sampling pixel points from the pixel points in a set area around one pixel point;
determining an iterative updating loss value of a pixel point based on the correlation degree of the three-dimensional matching area corresponding to the pixel point and the pixel characteristics in each corresponding mapping area;
Determining iterative updating loss values corresponding to the sampling pixel points respectively based on the correlation degree of the three-dimensional matching areas corresponding to the sampling pixel points and the pixel characteristics in the corresponding mapping areas;
if the target sampling pixel point with the iteration updating loss value smaller than the iteration updating loss value corresponding to one pixel point exists in the plurality of sampling pixel points, updating the current parallax plane parameter of the one pixel point by the current parallax plane parameter corresponding to the target sampling pixel point.
Optionally, the pixel filtering unit 1602 is specifically configured to:
Determining a three-dimensional matching area corresponding to one pixel point based on the current parallax plane parameter corresponding to the one pixel point and a homography transformation matrix between each view angle, and mapping areas in other scene images;
based on the three-dimensional matching area and the pixel characteristic values in each mapping area, correspondingly determining the pixel characteristic association degree between the three-dimensional matching area and each mapping area;
and determining an iterative updating loss value corresponding to one pixel point based on the obtained pixel characteristic association degree, wherein the iterative updating loss value and the pixel characteristic association degree are in negative correlation.
Optionally, the depth estimation unit 1604 is specifically configured to:
Based on the plane prior condition and the pixel characteristic association degree, performing repeated iterative updating process on the current parallax plane parameters corresponding to each pixel point to obtain target parallax plane parameters corresponding to each pixel point;
And determining respective corresponding depth values of the pixel points based on the obtained target parallax plane parameters so as to obtain a depth map.
Optionally, the depth estimation unit 1604 is specifically configured to:
starting from the appointed pixel point of the reference view angle image, adopting a traversing mode to execute the following steps for each pixel point one by one:
For one pixel point, acquiring a plurality of sampling pixel points from the pixel points in a set area around the pixel point;
Determining iterative updating loss values of a pixel point and a plurality of sampling pixel points respectively based on planes of the pixel points in the plane prior condition and pixel feature association degrees of the corresponding three-dimensional matching areas and the mapping areas;
If a plurality of sampling pixel points exist, and the corresponding iteration updating loss value is smaller than the target sampling pixel point of the iteration updating loss value corresponding to one pixel point, updating the current parallax plane parameter of one pixel point according to the current parallax plane parameter corresponding to the target sampling pixel point.
Optionally, the pixel screening unit and the depth estimation unit are further specifically configured to:
Dividing the reference view angle image into N channel images according to the set processing channel number N, wherein each channel image comprises any two adjacent pixel points, and N-1 pixel points are spaced between original positions in the reference view angle image;
And acquiring a plurality of sampling pixel points from the pixel points of the surrounding set area included in the channel image where one pixel point is positioned.
Through the device, by introducing plane detection, the plane information in an image can be fully mined and used as a plane prior for solving the depth map; the information of extremely weak texture regions can be recovered, the quality of depth estimation in weak texture regions can be better improved, a better 3D structure can be effectively recovered with this assistance, more visually pleasing results are generated, and the product and user experience is improved.
The apparatus may be used to perform the methods shown in the embodiments of the present application, and therefore, the description of the foregoing embodiments may be referred to for the functions that can be implemented by each functional module of the apparatus, and the like, which are not repeated.
Referring to fig. 17, based on the same technical concept, an embodiment of the present application further provides a computer device 170, which computer device 170 may be the depth estimation device shown in fig. 3, and the computer device 170 may include a memory 1701 and a processor 1702.
The memory 1701 is configured to store a computer program executed by the processor 1702. The memory 1701 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function, and the like, and the storage data area may store data created according to the use of the computer device, and the like. The processor 1702 may be a central processing unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 1701 and the processor 1702 is not limited in this embodiment of the present application. In the embodiment of the present application, the memory 1701 and the processor 1702 are connected by a bus 1703 in fig. 17, where the bus 1703 is shown by a thick line; the connection manner between other components is only schematically illustrated and is not limited thereto. The bus 1703 may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in fig. 17, but this does not mean there is only one bus or one type of bus.

The memory 1701 may be a volatile memory such as a random-access memory (RAM); the memory 1701 may also be a non-volatile memory such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid state disk (SSD); or the memory 1701 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1701 may also be a combination of the above memories.
The processor 1702 is configured to execute the method executed by the device in each embodiment of the present application when invoking the computer program stored in the memory 1701.
In some possible implementations, aspects of the methods provided by the present application may also be implemented in the form of a program product comprising program code for causing a computer device to carry out the steps of the methods according to the various exemplary embodiments of the application described above when the program product is run on the computer device, for example, the computer device may carry out the methods performed by the devices in the various embodiments of the application.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (15)

1. A depth map generation method based on plane priors, the method comprising:
acquiring a plurality of scene images of a target scene shot based on different view angles, and performing plane detection on reference view angle images in the plurality of scene images to obtain plane information in the reference view angle images, wherein the plane information is used for indicating pixel points belonging to the same plane;
Determining a reference pixel point set with the pixel characteristic association degree meeting a set association condition from all pixel points included in the reference view angle image based on the pixel characteristic association degrees among the plurality of scene images;
Determining a plurality of reference pixel point groups belonging to the same plane in the reference pixel point set based on the plane information, wherein at least one reference pixel point included in each reference pixel point group is positioned on the same plane;
Optimizing the plane determined by the reference pixel point set according to the plurality of reference pixel point groups to obtain a plane prior condition of the reference view angle image;
and taking the plane prior condition as a constraint, and obtaining a depth map of the reference view angle image based on the pixel characteristic association degree among the plurality of scene images.
2. The method of claim 1, wherein the determining, based on the plane information, a plurality of reference pixel groups belonging to a same plane in the reference pixel set comprises:
performing plane fitting processing by adopting a triangulation method based on the reference pixel point set to obtain a triangular mesh structure, wherein each triangular plane in the triangular mesh structure comprises three reference pixel points in the reference pixel point set;
Determining a triangular plane corresponding to a reference pixel point located on the same plane in the triangular mesh structure based on the plane information, wherein the reference pixel point included in the triangular plane corresponding to the reference pixel point on the same plane is included in a reference pixel point group;
The optimizing the plane determined by the reference pixel point set according to the plurality of reference pixel point groups to obtain a plane prior condition of the reference view image includes:
Combining triangular planes corresponding to reference pixel points positioned on the same plane in the triangular mesh structure to obtain a combined triangular mesh structure;
and obtaining the plane prior condition based on the combined triangular mesh structure.
3. The method of claim 1, wherein the optimizing the plane determined by the set of reference pixels according to the plurality of reference pixel groups to obtain the plane prior condition of the reference view image comprises:
performing plane fitting processing on the plurality of reference pixel point groups respectively to obtain a plane combination structure consisting of a plurality of corresponding fitting planes, wherein each reference pixel point group corresponds to one fitting plane in the plurality of fitting planes;
based on the planar combination structure, the planar prior condition is obtained.
4. The method of claim 1, wherein performing plane detection on a reference view image corresponding to a target view by using a plane detection method to obtain plane information in the reference view image, comprises:
Performing image semantic segmentation on the reference view angle image to obtain respective corresponding plane masks of each plane area in the reference view angle image;
Performing depth estimation processing on the reference view angle image to obtain an estimated depth map of the reference view angle image;
Based on the estimated depth map, the plane mask is updated to obtain plane information of the reference view image.
5. The method according to any one of claims 1 to 4, wherein determining, based on the pixel feature association degrees between the plurality of scene images, a reference pixel point set having a pixel feature association degree satisfying a set association condition from among the pixel points included in the reference view image includes:
for each pixel point, the following steps are respectively executed:
Determining, for one pixel point, a stereo matching area of the one pixel point in the reference view angle image, wherein each stereo matching area comprises an image area of a set range with the one pixel point as a reference point;
respectively determining the mapping areas of the stereo matching areas in other scene images and the pixel feature association degree between the stereo matching areas, wherein the other scene images are scene images except the reference view angle image in the plurality of scene images;
And determining a reference pixel point set with the pixel characteristic association degree meeting the set association condition from the pixel points.
6. The method of claim 5, wherein prior to separately determining the mapped regions of the stereo matching region in the other scene image and the degree of pixel feature association with the stereo matching region, the method further comprises:
respectively initializing parameters of the pixel points to obtain initial parallax plane parameters corresponding to the pixel points;
Based on the pixel characteristic association degree of the corresponding areas in the reference view angle image and the other scene images, respectively performing repeated iterative updating processes on the initial parallax plane parameters corresponding to each pixel point to obtain estimated parallax plane parameters corresponding to each pixel point;
and determining the stereo matching areas corresponding to the pixel points respectively based on the obtained estimated parallax plane parameters, and mapping areas in the other scene images.
7. The method of claim 6, wherein for the one pixel point, an iterative update process comprises the steps of:
acquiring a plurality of sampling pixel points from the pixel points in the set area around the pixel point;
determining an iterative updating loss value of the pixel point based on the correlation degree of the three-dimensional matching area corresponding to the pixel point and the pixel characteristics in each corresponding mapping area;
determining iterative updating loss values corresponding to the sampling pixel points respectively based on the correlation degree of the three-dimensional matching areas corresponding to the sampling pixel points and the pixel characteristics in the corresponding mapping areas;
If the target sampling pixel point with the iteration updating loss value smaller than the iteration updating loss value corresponding to the pixel point exists in the sampling pixel points, the current parallax plane parameter of the pixel point is updated according to the current parallax plane parameter corresponding to the target sampling pixel point.
8. The method of claim 7, wherein determining an iteratively updated loss value for the one pixel point based on a degree of correlation of the stereo matching region corresponding to the one pixel point with pixel features within respective ones of the mapped regions comprises:
determining a stereo matching area corresponding to the pixel point based on the current parallax plane parameter corresponding to the pixel point and a homography transformation matrix between each view angle, and mapping areas in other scene images;
Based on the stereo matching region and pixel feature values in each mapping region, correspondingly determining the pixel feature association degree between the stereo matching region and each mapping region;
And determining an iterative updating loss value corresponding to the pixel point based on the obtained pixel characteristic association degree, wherein the iterative updating loss value and the pixel characteristic association degree are in negative correlation.
9. The method of any of claims 1-4, wherein determining the depth map of the reference perspective image based on the degree of pixel feature association between the plurality of scene images, subject to the planar prior condition, comprises:
Performing repeated iterative updating on the current parallax plane parameters corresponding to each pixel point based on the plane prior condition and the pixel characteristic association degree to obtain target parallax plane parameters corresponding to each pixel point;
And determining depth values corresponding to the pixel points respectively based on the obtained target parallax plane parameters so as to obtain the depth map.
10. The method of claim 9, wherein one iterative update procedure comprises:
starting from the appointed pixel points of the reference visual angle image, performing the following steps on the pixel points one by one in a traversing mode:
for one pixel point, acquiring a plurality of sampling pixel points from the pixel points in a set area around the one pixel point;
Determining iterative updating loss values of the pixel point and the sampling pixel points respectively based on the plane where the pixel points are located in the plane prior condition and the correlation degree of the pixel characteristics in the corresponding three-dimensional matching area and the mapping area;
If a target sampling pixel point with a corresponding iteration updating loss value smaller than the iteration updating loss value corresponding to the pixel point exists in the plurality of sampling pixel points, updating the current parallax plane parameter of the pixel point according to the current parallax plane parameter corresponding to the target sampling pixel point.
11. The method of claim 10, wherein obtaining a plurality of sampling pixels from pixels of a set area around the one pixel comprises:
Dividing the reference view angle image into N channel images according to the set processing channel number N, wherein each channel image comprises any two adjacent pixel points, and N-1 pixel points are spaced between original positions in the reference view angle image;
and acquiring the plurality of sampling pixel points from the pixel points of the surrounding set area included in the channel image where the pixel point is located.
12. A depth map generation apparatus based on planar priors, the apparatus comprising:
The plane detection unit is used for acquiring a plurality of scene images of a target scene shot based on different view angles, and carrying out plane detection on reference view angle images in the plurality of scene images to obtain plane information in the reference view angle images, wherein the plane information is used for indicating pixel points belonging to the same plane;
The pixel screening unit is used for determining a reference pixel point set with the pixel characteristic association degree meeting a set association condition from all pixel points included in the reference view angle image based on the pixel characteristic association degrees among the plurality of scene images;
The plane prior generation unit is used for determining a plurality of reference pixel point groups belonging to the same plane in the reference pixel point set based on the plane information, wherein at least one reference pixel point included in each reference pixel point group is positioned on the same plane;
and the depth estimation unit is used for obtaining the depth map of the reference view image based on the pixel characteristic association degree among the plurality of scene images by taking the plane prior condition as a constraint.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that,
The processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 11.
14. A computer storage medium having stored thereon computer program instructions, characterized in that,
Which computer program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 11.
15. A computer program product comprising computer program instructions, characterized in that,
Which computer program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 11.
CN202210127177.2A 2022-02-11 2022-02-11 Depth map generation method, device, equipment and storage medium based on plane prior Active CN114494395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210127177.2A CN114494395B (en) 2022-02-11 2022-02-11 Depth map generation method, device, equipment and storage medium based on plane prior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210127177.2A CN114494395B (en) 2022-02-11 2022-02-11 Depth map generation method, device, equipment and storage medium based on plane prior

Publications (2)

Publication Number Publication Date
CN114494395A CN114494395A (en) 2022-05-13
CN114494395B true CN114494395B (en) 2025-01-21

Family

ID=81479176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210127177.2A Active CN114494395B (en) 2022-02-11 2022-02-11 Depth map generation method, device, equipment and storage medium based on plane prior

Country Status (1)

Country Link
CN (1) CN114494395B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117745790A (en) * 2022-09-15 2024-03-22 荣耀终端有限公司 Image depth calculation method
CN118067158A (en) * 2022-11-23 2024-05-24 中国科学院深圳先进技术研究院 Chemical shift coding imaging method based on conversion region and local field diagram iteration
CN116129036B (en) * 2022-12-02 2023-08-29 中国传媒大学 Depth information guided omnidirectional image three-dimensional structure automatic recovery method
CN116797587A (en) * 2023-06-29 2023-09-22 维沃移动通信有限公司 Plane detection method and device
CN117557617B (en) * 2024-01-12 2024-04-09 山东师范大学 Multi-view dense matching method, system and equipment based on plane priori optimization

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354838B (en) * 2015-10-20 2018-04-10 努比亚技术有限公司 The depth information acquisition method and terminal of weak texture region in image
KR102264962B1 (en) * 2017-12-18 2021-06-15 한국전자기술연구원 Stereo Depth Map Post-processing Method with Scene Layout
CN110770794A (en) * 2018-08-22 2020-02-07 深圳市大疆创新科技有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN113963117B (en) * 2021-10-29 2024-03-29 温州大学 Multi-view three-dimensional reconstruction method and device based on variable convolution depth network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Planar Prior Assisted PatchMatch Multi-View Stereo;Xu, Q., & Tao, W;Proceedings of the AAAI Conference on Artificial Intelligence;20200403;第12516-12523页 *
PlaneRCNN: 3D Plane Detection and Reconstruction From a Single Image;C. Liu, K. Kim, J. Gu, Y. Furukawa and J. Kautz,;2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR);20200109;第4445-4454页 *

Also Published As

Publication number Publication date
CN114494395A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN114494395B (en) Depth map generation method, device, equipment and storage medium based on plane prior
Schulter et al. Learning to look around objects for top-view representations of outdoor scenes
US9177381B2 (en) Depth estimate determination, systems and methods
CN108491848B (en) Image saliency detection method and device based on depth information
CN114170290B (en) Image processing method and related equipment
CN113256699B (en) Image processing method, image processing device, computer equipment and storage medium
Chelani et al. How privacy-preserving are line clouds? recovering scene details from 3d lines
CN114445265A (en) Equal-rectangular projection stereo matching two-stage depth estimation machine learning algorithm and spherical distortion layer
CN117333627B (en) A method, system and storage medium for reconstruction and completion of autonomous driving scenes
CN112489119A (en) Monocular vision positioning method for enhancing reliability
CN113034675A (en) Scene model construction method, intelligent terminal and computer readable storage medium
KR20240049098A (en) Method and appratus of neural rendering based on view augmentation
CN116977548A (en) Three-dimensional reconstruction method, device, equipment and computer readable storage medium
JP2023065296A (en) Plane detection device and method
CN115984583B (en) Data processing method, apparatus, computer device, storage medium, and program product
Khan et al. Towards monocular neural facial depth estimation: Past, present, and future
CN117542040A (en) Object identification method and device and electronic equipment
CN116957999A (en) Depth map optimization method, device, equipment and storage medium
CN116958393A (en) Incremental image rendering method and device
CN116523990A (en) Three-dimensional semantic scene completion method, device and medium
CN117011629A (en) Training method, device, equipment and storage medium of target detection model
HK40071021A (en) Method and apparatus for generating depth map based on plane prior, device and storage medium
Liu et al. 3d Street Object Detection from Monocular Images Using Deep Learning and Depth Information
KR102587233B1 (en) 360 rgbd image synthesis from a sparse set of images with narrow field-of-view
Ram 3D Reconstruction from Images: A Review

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40071021

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant