CN119963651B - Indoor plan positioning method, system, equipment and storage medium - Google Patents
- Publication number
- CN119963651B (application CN202510455316.8A)
- Authority
- CN
- China
- Prior art keywords
- indoor
- features
- view
- probability distribution
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Image Analysis (AREA)
Abstract
The invention relates to the technical field of indoor positioning, and in particular to an indoor plan positioning method, system, equipment and storage medium. The method comprises: obtaining multi-view indoor scene images acquired by a target to be positioned; extracting monocular depth features of the indoor scene images, and analyzing the monocular depth features of the multi-view indoor scene images to obtain multi-view depth features; fusing the monocular depth features and the multi-view depth features, and processing the fused depth features according to a preset rule to obtain a first posterior probability distribution of the pose of the target to be positioned; extracting semantic features of the indoor scene images, and processing the semantic features according to the preset rule to obtain a second posterior probability distribution of the pose of the target to be positioned; and determining a positioning result of the target to be positioned according to the first posterior probability distribution and the second posterior probability distribution. The invention can improve the accuracy, efficiency and robustness of indoor positioning in complex scenes.
    Description
Technical Field
      The present invention relates to the field of indoor positioning technologies, and in particular to an indoor plan positioning method, system, equipment and storage medium.
    Background
      In recent years, robots with autonomous navigation capabilities have become increasingly popular. Autonomous navigation refers to the ability of a robot to sense a known or unknown environment from sensor data and to plan routes autonomously based on the sensed data. In indoor environments, accurate positioning is critical for numerous applications such as Virtual Reality (VR) and the autonomous navigation of robots.
      Traditional technical schemes mainly rely on a pre-constructed 3D environment model or a high-precision sensor database to realize positioning. Such methods not only need a large amount of storage resources to maintain the environment data, but also suffer from lagging model updates caused by dynamic scene changes, which significantly increases the operation and maintenance cost of the system. In recent years, positioning methods based on a plan have gradually become an alternative; they realize device positioning by extracting the geometric and semantic features of a building plan. However, existing plan-based positioning methods still have obvious defects. First, they lack a robust algorithm for the spatio-temporal feature fusion of continuous image sequences, so that the positioning trajectory frequently drifts. Second, they adapt poorly to non-upright camera poses (such as tilt and pitch), which limits compatibility with complex observation viewpoints. Third, they cannot effectively mine the implicit semantic topological relations in the plan (such as porch connectivity and functional partitioning), so the utilization rate of environmental constraint information is low. Fourth, they fall short in balancing positioning accuracy against real-time performance, and can hardly meet millisecond-level response requirements in large-scale scenes.
    Disclosure of Invention
      In order to improve the accuracy, efficiency and robustness of indoor positioning in complex scenes and to meet the requirements of different indoor application scenes on positioning accuracy and real-time performance, the application provides an indoor plan positioning method, system, equipment and storage medium.
      The invention provides an indoor plan positioning method, which comprises the following steps:
       acquiring multi-view indoor scene images acquired by a target to be positioned;
       extracting monocular depth features of the indoor scene images, and analyzing the monocular depth features of the multi-view indoor scene images to obtain multi-view depth features;
       fusing the monocular depth features and the multi-view depth features to obtain fused depth features, and processing the fused depth features according to a preset rule to obtain a first posterior probability distribution of the pose of the target to be positioned, wherein the pose represents the position and the direction of the target to be positioned in the indoor space;
       extracting semantic features of the indoor scene images, and processing the semantic features according to a preset rule to obtain a second posterior probability distribution of the pose of the target to be positioned;
       weighting the first posterior probability distribution and the second posterior probability distribution according to a preset proportion to obtain a weighted posterior probability distribution, inputting the weighted posterior probability distribution into a histogram filter, and outputting, by the histogram filter, a three-dimensional probability volume of the pose of the target to be positioned to obtain a positioning result of the target to be positioned.
      Optionally, the extracting the monocular depth feature of the indoor scene image and analyzing the monocular depth feature of the multi-view indoor scene image to obtain the multi-view depth feature include:
       respectively extracting monocular depth characteristics of the multi-view indoor scene images; 
       Extracting column features of the monocular depth features, and gathering the column features of the monocular depth features of the multi-view indoor scene image to obtain cross view features; 
       And calculating the depth value of each pixel point of the planar image of the indoor scene according to the feature variance of the cross view feature to obtain the multi-view depth feature. 
      Optionally, the fusing the monocular depth features and the multi-view depth features to obtain fused depth features, and processing the fused depth features according to a preset rule to obtain a first posterior probability distribution of the pose of the target to be positioned, includes:
       acquiring pose information of the indoor scene image, wherein the pose information represents the position and angle of the camera when the indoor scene image is shot;
       determining the relative pose between image frames of the multi-view indoor scene images according to the pose information of the multi-view indoor scene images;
       determining reference weights of the multi-view depth features and the monocular depth features for feature fusion according to the relative pose between image frames, and fusing the multi-view depth features and the monocular depth features according to the reference weights to obtain the fused depth features;
       and processing the fused depth features by using the Bayesian rule to obtain the first posterior probability distribution of the pose of the target to be positioned.
      Optionally, the calculation formula of the first posterior probability distribution of the pose of the target to be positioned is:
       $P_{f} = w_{mono} \cdot P_{mono} + w_{mv} \cdot \mathrm{Up}(P_{mv})$, $w_{mv} = 1 - w_{mono}$;
       wherein $P_{f}$ represents the first posterior probability distribution of the pose of the target to be positioned, $P_{mono}$ represents the probability distribution of the predicted pixel depth corresponding to the monocular depth features, $P_{mv}$ represents the probability distribution of the predicted pixel depth corresponding to the multi-view depth features, $w_{mono}$ represents the reference weight of the monocular depth features, $\mathrm{Up}(\cdot)$ represents an up-sampling operation, and $w_{mv}$ represents the reference weight of the multi-view depth features.
      Optionally, the extracting semantic features of the indoor scene image, and processing the semantic features according to a preset rule to obtain a second posterior probability distribution of the pose of the target to be positioned, includes:
       performing semantic segmentation on the indoor scene image to obtain semantic features of the indoor scene image;
       determining semantic identification labels of all pixel points of the indoor scene image according to the semantic features;
       and selecting a semantic likelihood field according to the semantic identification labels of the pixel points, performing semantic ray casting according to the semantic likelihood field, and determining the second posterior probability distribution of the pose of the target to be positioned.
      Optionally, the method further comprises:
       acquiring an indoor scene plane graph, wherein the indoor scene plane graph is provided with a reference semantic tag; 
       calculating the matching value of the semantic identification label of the pixel point and the reference semantic label; 
       And constructing a semantic likelihood field according to the matching value. 
      Optionally, the indoor scene image is an RGB image, and after the acquiring the multi-view indoor scene image, before the extracting the monocular depth feature of the indoor scene image, the method further includes:
       And adjusting the direction of the indoor scene image, and converting the indoor scene image into a gravity alignment image. 
      In order to solve the above problems, the present invention also provides an indoor plan positioning system, the system comprising:
       the acquisition module is used for acquiring multi-view indoor scene images acquired by the target to be positioned; 
       the feature extraction module is used for extracting monocular depth features of the indoor scene image and analyzing the monocular depth features of the multi-view indoor scene image to obtain multi-view depth features; 
       the first processing module is used for fusing the monocular depth features and the multi-view depth features to obtain fused depth features, and processing the fused depth features according to a preset rule to obtain a first posterior probability distribution of the pose of the target to be positioned, wherein the pose represents the position and the direction of the target to be positioned in the indoor space;
       the second processing module is used for extracting semantic features of the indoor scene image and processing the semantic features according to a preset rule to obtain a second posterior probability distribution of the pose of the target to be positioned;
       the output module is used for weighting the first posterior probability distribution and the second posterior probability distribution according to a preset proportion to obtain a weighted posterior probability distribution, inputting the weighted posterior probability distribution into the histogram filter, and outputting, by the histogram filter, a three-dimensional probability volume of the pose of the target to be positioned to obtain a positioning result of the target to be positioned.
      In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
       at least one processor, and 
      A memory communicatively coupled to the at least one processor, wherein,
      The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the indoor plan positioning method described above.
      In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the indoor plan positioning method described above.
      In summary, the application has the following beneficial technical effects:
       When multi-view indoor scene images are processed, the monocular depth features of the indoor scene image of each view are extracted, and multi-view depth features are then extracted from them. The first posterior probability distribution of the pose of the target to be positioned is calculated from the monocular depth features and the multi-view depth features and is used in calculating the positioning result, which reduces positioning ambiguity and errors. An efficient histogram filter fuses the positioning information calculated from the depth features of the indoor scene images with the positioning information calculated from the semantic features to obtain the final positioning information of the target to be positioned. Because the semantic information of each pixel point of the indoor scene image is taken into account when the positioning result is calculated, the accuracy, efficiency and robustness of indoor positioning in complex scenes are improved.
    Drawings
      FIG. 1 is a flow chart of an indoor plan positioning method according to an embodiment of the present invention;
       FIG. 2 is a system flow chart of an indoor plan positioning method according to an embodiment of the present invention; 
       FIG. 3 is a flowchart illustrating steps for extracting monocular depth features of an indoor scene image and resolving the monocular depth features of the indoor scene image from multiple views to obtain the multi-view depth features according to an embodiment of the present invention; 
       Fig. 4 is a schematic structural diagram of an electronic device for implementing the indoor plan positioning method according to an embodiment of the present invention. 
      Reference numerals: 10, processor; 11, memory; 12, communication bus; 13, communication interface.
      The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
    Detailed Description
      Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
      In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
      In the description of the present invention, unless otherwise specified and defined, it should be noted that the terms "mounted," "connected," and "coupled" are to be construed broadly, and may be, for example, mechanical or electrical, or may be in communication with each other between two elements, directly or indirectly through intermediaries, as would be understood by those skilled in the art, in view of the specific meaning of the terms described above.
      Referring to fig. 1 and fig. 2, a flow chart of an indoor plan positioning method according to an embodiment of the invention is shown. In this embodiment, the indoor plan positioning method includes:
       S1, acquiring a multi-view indoor scene image acquired by a target to be positioned. 
      In this embodiment, the indoor scene images are RGB images. A wide-angle camera with a horizontal field of view of 108 degrees is used to acquire the indoor scene image sequence, which ensures wide coverage of the surrounding environment. In addition, the target to be positioned can use its own sensor system (a wheel speed sensor) to acquire motion data that emulates an odometer, assisting the subsequent indoor plan positioning.
      In a preferred implementation of the present embodiment, after acquiring the multi-view indoor scene image, before extracting the monocular depth features of the indoor scene image, an operation of adjusting the direction of the indoor scene image, converting the indoor scene image into a gravity-aligned image is performed.
      In converting an indoor scene image into a gravity-aligned image, a rotation matrix $R$ is first calculated from the roll angle $\psi$ and the pitch angle $\theta$, so that the indoor scene image can subsequently be rotated from its current pitched attitude to the horizontal attitude.
      Then, pixel coordinates in the original image are converted into coordinates in the gravity-aligned image by using a homography matrix. A homography matrix is a 3×3 matrix used in computer vision to describe the projective transformation between two planes; it represents the mapping of the same plane under different viewing angles. The homography is expressed as $u' = K R K^{-1} u$, wherein $K$ is the intrinsic matrix of the camera, $K^{-1}$ is the inverse of the intrinsic matrix, $u$ is the homogeneous image coordinate of the original pixel, and $u'$ is the corresponding pixel coordinate in the gravity-aligned image.
      And then, screening out pixel points which cannot be seen at the current pitch angle, shielding out invisible pixels, and finally obtaining a gravity aligned image. It should be noted that, after the gravity alignment, the image rotates to a horizontal ground-based viewing angle, and some pixels (such as the ceiling or overhead content) that were originally on top of the image may be moved out of the image frame and become invisible. The rotated out pixels are marked as invisible pixels (masked out) avoiding the network from trying to make meaningless depth estimations for these pixels.
      After the indoor scene image is converted into the gravity-aligned image, buildings appear upright in it (as in a picture shot by a camera in a horizontal attitude). The tilt of buildings in the original image is thereby corrected, which ensures the accuracy and robustness of feature matching and semantic segmentation in the subsequent depth estimation.
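The gravity-alignment steps above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names are hypothetical, the axis conventions (roll about the optical z-axis, pitch about the camera x-axis) are assumptions, and the half-pixel tolerance in the visibility test is a choice made here for numerical robustness.

```python
import numpy as np

def rotation_from_roll_pitch(roll, pitch):
    """Rotation R built from roll (about z) and pitch (about x), in radians.
    The axis convention is an assumption of this sketch."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    r_roll = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    r_pitch = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    return r_pitch @ r_roll

def gravity_align_homography(K, roll, pitch):
    """Homography H = K R K^{-1} mapping original pixels to aligned pixels."""
    return K @ rotation_from_roll_pitch(roll, pitch) @ np.linalg.inv(K)

def warp_pixels(H, pixels):
    """Apply H to an (N, 2) array of pixel coordinates."""
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def visibility_mask(H, width, height):
    """True where a pixel of the aligned image has a source pixel inside
    the original frame; False marks pixels to be masked out."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)], axis=1)
    source = target @ np.linalg.inv(H).T
    in_front = source[:, 2] > 1e-9
    z = np.where(in_front, source[:, 2], 1.0)
    sx, sy = source[:, 0] / z, source[:, 1] / z
    inside = (in_front & (sx >= -0.5) & (sx <= width - 0.5)
              & (sy >= -0.5) & (sy <= height - 0.5))
    return inside.reshape(height, width)
```

In practice the homography would be passed to an image-warping routine to resample the whole image; the mask then tells the depth network which columns or pixels to ignore.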
      S2, extracting monocular depth features of the indoor scene image, and analyzing the monocular depth features of the multi-view indoor scene image to obtain multi-view depth features.
      Referring to fig. 3, extracting monocular depth features of an indoor scene image, and resolving the monocular depth features of a multi-view indoor scene image to obtain multi-view depth features, comprising:
       S21, respectively extracting monocular depth features of the multi-view indoor scene images.
      Specifically, a ResNet (Residual Neural Network) and an attention-mechanism network may be used to extract the features of a single-frame indoor scene image. The attention network can mask out invisible pixels and outputs a probability distribution over depth hypotheses; the expectation of this distribution is the plan depth prediction, which is used to construct the equiangular ray scan for positioning.
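As a rough illustration of the two ideas in this step (masking invisible pixels inside the attention weights, and taking the expectation of a depth-hypothesis distribution), consider the following sketch. The function names and array layouts are hypothetical, and the real network layers are not reproduced; only the masking and the expectation are shown.

```python
import numpy as np

def masked_softmax(scores, visible):
    """Softmax over axis 0 that assigns zero weight to invisible entries.
    scores, visible: (rows, columns). Assumes every column contains at
    least one visible entry."""
    masked = np.where(visible, scores, -np.inf)
    masked = masked - masked.max(axis=0, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=0, keepdims=True)

def expected_depth(prob, depth_hypotheses):
    """Expectation of a (D, columns) depth distribution over D hypotheses."""
    return (prob * depth_hypotheses[:, None]).sum(axis=0)
```

Masking before the softmax, rather than zeroing weights afterwards, keeps the remaining weights of each column summing to one.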
      S22, extracting column features of the monocular depth features, and gathering the column features of the monocular depth features of the multi-view indoor scene image to obtain cross view features.
      Cross-view features refer to column features extracted from images from different perspectives, which can correspond to each other for subsequent depth estimation. Specifically, a predicted depth of each pixel point under a single frame of indoor scene image is obtained. The prediction depth of each pixel point under the reference frame indoor scene image can be obtained through multi-view depth estimation (the reference frame and the single frame RGB image are the same frame).
      Inspired by multi-view stereo vision, an MVS (Multi-View Stereo) network is used to estimate the plan depth from multiple image frames. Multi-view stereo is a computer vision technique that uses images from multiple viewpoints to recover the three-dimensional structure of a scene; it is typically used to estimate depth information from multiple two-dimensional images and thereby reconstruct a three-dimensional model. The MVS network analyzes the image features under different viewpoints, computes pixel or feature-point matches between viewpoints, estimates the depth of each pixel point, and finally reconstructs the three-dimensional scene.
      In step S22, the column features of each single-frame indoor scene image are first extracted one by one, and the column features of the indoor scene images of different views are gathered by plane sweeping. A 2D cost distribution is then constructed from the variance of the cross-view features, and the final depth is calculated by soft-argmin, which supplements the positioning with multi-view geometric cues. Soft-argmin is a differentiable method for estimating an optimum: it achieves regression in continuous space by weighting discrete candidate values with their probabilities.
      Specifically, once the feature variance under each depth hypothesis has been obtained, it corresponds directly to the 2D cost distribution. For each pair of matched feature points, a cost value is calculated from the feature variance (by weighted summation or averaging); the cost is typically based on the difference between feature descriptions. The smaller the difference, the lower the cost, the more reliable the match, and the higher the probability that the depth hypothesis is correct.
      S23, calculating the depth value of each pixel point of the planar image of the indoor scene according to the feature variance of the cross view feature to obtain the multi-view depth feature.
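Steps S22 and S23 can be sketched as follows, assuming the column features of all views have already been gathered by plane sweeping into one array of shape (views, depth hypotheses, columns, channels). The function names and this layout are hypothetical; averaging the variance over channels is one of the two aggregation options the description mentions.

```python
import numpy as np

def variance_cost(gathered):
    """gathered: (V, D, C, F) cross-view features. Returns a (D, C) cost:
    variance across the V views, averaged over the F feature channels."""
    return gathered.var(axis=0).mean(axis=-1)

def soft_argmin(cost, depth_hypotheses):
    """Differentiable depth estimate per column.
    cost: (D, C); depth_hypotheses: (D,). Returns (depth (C,), prob (D, C))."""
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    depth = (prob * depth_hypotheses[:, None]).sum(axis=0)
    return depth, prob
```

A hypothesis at which all views agree has zero variance, so it receives most of the probability mass and dominates the expected depth.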
      S3, fusing the monocular depth features and the multi-view depth features to obtain fused depth features, and processing the fused depth features according to a preset rule to obtain a first posterior probability distribution of the pose of the target to be positioned.
      The pose of the target to be positioned represents its position and direction in the indoor space. The target to be positioned may be an autonomous navigation robot, an autonomous navigation unmanned aerial vehicle, or a VR device; specifically, its direction in the indoor space refers to the direction it faces or moves in. Fusing the monocular depth features and the multi-view depth features for depth estimation makes full use of multi-modal cues and reduces positioning ambiguity and errors.
      Referring to fig. 2, the specific steps of S3 include:
       S31, acquiring pose information of the indoor scene image, wherein the pose information represents the position of the camera when the indoor scene image is shot and the angle between the lens and the horizontal ground at that time.
      S32, determining the relative pose between image frames of the multi-view indoor scene image according to the pose information of the multi-view indoor scene image.
      S33, determining reference weights of the multi-view depth features and the monocular depth features during feature fusion according to the relative pose among the image frames, and fusing the multi-view depth features and the monocular depth features according to the reference weights to obtain fusion depth features.
      S34, processing the fused depth features by using the Bayesian rule to obtain the first posterior probability distribution of the pose of the target to be positioned.
      The calculation formula of the first posterior probability distribution of the pose of the target to be positioned is:
       $P_{f} = w_{mono} \cdot P_{mono} + w_{mv} \cdot \mathrm{Up}(P_{mv})$, $w_{mv} = 1 - w_{mono}$;
       wherein $P_{f}$ represents the first posterior probability distribution of the pose of the target to be positioned, $P_{mono}$ represents the probability distribution of the predicted pixel depth corresponding to the monocular depth features, $P_{mv}$ represents the probability distribution of the predicted pixel depth corresponding to the multi-view depth features, $w_{mono}$ represents the reference weight of the monocular depth features, $w_{mv}$ represents the reference weight of the multi-view depth features, and $\mathrm{Up}(\cdot)$ represents an up-sampling operation that ensures the validity of the addition. The first posterior probability distribution $P_{f}$ of the pose of the target to be positioned is used to provide the final depth prediction.
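A minimal sketch of such a weighted fusion is given below. Several details are assumptions made for illustration and are not stated in the source: the two reference weights are taken to sum to one, the up-sampling is applied to the multi-view distribution by linear interpolation across image columns, and the result is renormalized per column. All function names are hypothetical.

```python
import numpy as np

def upsample_columns(prob, out_columns):
    """Linearly interpolate a (D, C_in) distribution to (D, out_columns)
    and renormalize each column so that it sums to one."""
    src = np.linspace(0.0, 1.0, prob.shape[1])
    dst = np.linspace(0.0, 1.0, out_columns)
    up = np.stack([np.interp(dst, src, row) for row in prob])
    return up / up.sum(axis=0, keepdims=True)

def fuse_depth_distributions(p_mono, p_mv, w_mono):
    """Weighted sum of monocular and (up-sampled) multi-view distributions.
    p_mono: (D, C); p_mv: (D, C_mv); w_mono: scalar or (C,) in [0, 1]."""
    w_mv = 1.0 - w_mono  # assumption: the weights sum to one
    fused = w_mono * p_mono + w_mv * upsample_columns(p_mv, p_mono.shape[1])
    return fused / fused.sum(axis=0, keepdims=True)
```

In the described method the weight would come from the MLP selection network; here it is simply passed in as an argument.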
      The current pose of the camera in the positioning model corresponding to the target to be positioned is set as $s_t = [s_{t,x}, s_{t,y}, s_{t,\phi}]$, wherein $s_{t,x}$ represents the x-coordinate position of the camera on the two-dimensional plane, $s_{t,y}$ represents the y-coordinate position of the camera on the two-dimensional plane, and $s_{t,\phi}$ represents the orientation angle of the camera relative to a reference direction (the x-axis of the global coordinate system). The camera pose is located by fusing multi-modal features, based on the acquired data and the semantic occupancy plan.
      Monocular depth estimation is independent of camera motion but is prone to scale ambiguity, whereas multi-view stereo methods can provide the correct scale but rely on a sufficient baseline and camera overlap. Based on these observations, an MLP network is adopted to make a soft selection between the two predictions (from the single-frame indoor scene image and from the multi-frame indoor scene images). MLP stands for Multi-Layer Perceptron; an MLP network performs nonlinear feature transformation by stacking fully connected layers and is widely used for classification, regression and feature encoding in deep learning.
      The MLP network adaptively weights and fuses the probability distributions of the monocular depth estimation and the multi-view depth estimation according to the relative pose between image frames and the average depth predictions of the monocular depth features and the multi-view depth features, which improves the accuracy and reliability of depth estimation. Specifically, the predicted plan depth is used as the observation, and the following observation model is adopted:
       $p(o_t \mid s_t) = e^{-\lambda \lVert r_{s_t} - \hat{r}_t \rVert_1}$,
       wherein $r_{s_t}$ is the plan ray at the pose $s_t$, $\hat{r}_t$ is the ray interpolated from the plan depth prediction, $\lambda$ is a constant factor, $o_t$ represents the observation made by the camera at time $t$, and $t$ is the time index.
       That is, at time $t$, the depth prediction at the camera pose is obtained from the camera observation by complementarily fusing the monocular depth and the multi-view depth. The rays interpolated from this depth prediction are compared with the plan rays at every possible camera pose over the whole plan, which yields the probability of each pose.
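The comparison of predicted rays against plan rays can be sketched as below, assuming the likelihood is an exponential of the negative, scaled L1 distance between ray lengths. The function name, the array layout and the default value of the constant factor are hypothetical choices for this illustration.

```python
import numpy as np

def observation_likelihood(plan_rays, predicted_rays, lam=1.0):
    """Unnormalized likelihood of each candidate pose.
    plan_rays: (P, R) ray lengths rendered from the plan at P candidate poses.
    predicted_rays: (R,) ray lengths interpolated from the depth prediction.
    lam: constant factor (hypothetical default)."""
    l1 = np.abs(plan_rays - predicted_rays[None, :]).sum(axis=1)
    return np.exp(-lam * l1)
```

Evaluating this for every cell of a discretized pose grid gives the per-pose probabilities that the filter described next consumes.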
      In this embodiment, the probability distribution is calculated as follows.
       The relative pose between frames, i.e., the ego-motion, is used as the transition model $s_{t+1} = s_t \oplus u_t + \omega_t$,
      wherein $s_t$ is the pose at time $t$, $u_t = [u_{t,x}, u_{t,y}, u_{t,\phi}]$ represents the ego-motion at time $t$, $\omega_t = [\omega_{t,x}, \omega_{t,y}, \omega_{t,\phi}]$ represents the transition noise at time $t$, and the operator $\oplus$ represents applying the ego-motion to a state. $u_{t,x}$ represents the translation of the ego-motion along the x coordinate at time $t$, $u_{t,y}$ represents the translation of the ego-motion along the y coordinate at time $t$, and $u_{t,\phi}$ represents the rotation of the ego-motion at time $t$ relative to the reference direction (the x-axis of the global coordinate system). Likewise, $\omega_{t,x}$ represents the translation of the motion noise along the x coordinate at time $t$, $\omega_{t,y}$ represents the translation of the motion noise along the y coordinate at time $t$, and $\omega_{t,\phi}$ represents the rotation-angle variation of the motion noise at time $t$ relative to the reference direction (the x-axis of the global coordinate system).
      Further assuming that the transition noise $\omega_t$ obeys a Gaussian distribution, the transition probability is expressed as $p(s_{t+1} \mid s_t, u_t) \propto \exp\left(-\tfrac{1}{2}\,(s_{t+1} - s_t \oplus u_t)^{\top}\, \Sigma^{-1}\, (s_{t+1} - s_t \oplus u_t)\right)$, wherein $\top$ denotes the transpose and $\Sigma^{-1}$ is the inverse of the covariance matrix, which represents the uncertainty of the state transition error. The covariance matrix $\Sigma$ is a symmetric positive definite matrix, and its inverse $\Sigma^{-1}$ is used to measure the magnitude of the error vector. The state transition probability $p(s_{t+1} \mid s_t, u_t)$ is the probability distribution of the next state $s_{t+1}$ derived from the current state $s_t$ and the ego-motion $u_t$; it is used to evaluate the uncertainty of the modeled state transition, which is typically modeled as a Gaussian distribution whose mean is the predicted state $s_t \oplus u_t$ and whose covariance $\Sigma$ describes the uncertainty.
      $\Sigma = \mathrm{diag}(\sigma_x^2, \sigma_y^2, \sigma_\phi^2)$ is modeled as the covariance of the Gaussian distribution, wherein $\sigma_x^2$ represents the variance in the x-coordinate direction, $\sigma_y^2$ represents the variance in the y-coordinate direction, and $\sigma_\phi^2$ represents the variance of the angle relative to the reference direction (the x-axis of the global coordinate system). Applying the Bayes rule gives $p(s_{t+1} \mid o_{1:t+1}, u_{1:t}) = \frac{1}{Z}\, p(o_{t+1} \mid s_{t+1}) \sum_{s_t \in S} p(s_{t+1} \mid s_t, u_t)\, p(s_t \mid o_{1:t}, u_{1:t-1})$, wherein $Z$ is a normalization factor, $S$ is the pose space, $p(o_{t+1} \mid s_{t+1})$ is the observation likelihood, and $p(s_{t+1} \mid s_t, u_t)$ is the probability distribution of the next state $s_{t+1}$ derived from the state of the camera at time $t$ and the ego-motion $u_t$.
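The prediction and update just described amount to one step of a discrete Bayes (histogram) filter. The sketch below works on a flat list of candidate poses rather than a dense three-dimensional grid. It assumes that the ego-motion is expressed in the camera frame, that angles are in radians, and that the covariance is diagonal; the function names are hypothetical.

```python
import numpy as np

def apply_ego_motion(state, u):
    """s (+) u: move by (dx, dy) in the camera frame, then rotate by dphi."""
    x, y, phi = state
    dx, dy, dphi = u
    return np.array([x + np.cos(phi) * dx - np.sin(phi) * dy,
                     y + np.sin(phi) * dx + np.cos(phi) * dy,
                     phi + dphi])

def transition_matrix(states, u, sigma):
    """T[j, i] = p(s_{t+1} = states[j] | s_t = states[i], u), Gaussian in
    the residual s_{t+1} - (s_t (+) u). sigma: std devs (x, y, phi)."""
    predicted = np.array([apply_ego_motion(s, u) for s in states])
    diff = states[:, None, :] - predicted[None, :, :]          # (next, prev, 3)
    diff[..., 2] = (diff[..., 2] + np.pi) % (2 * np.pi) - np.pi  # wrap angle
    inv_cov = np.diag(1.0 / np.asarray(sigma, dtype=float) ** 2)
    expo = np.einsum('jik,kl,jil->ji', diff, inv_cov, diff)
    T = np.exp(-0.5 * expo)
    return T / np.maximum(T.sum(axis=0, keepdims=True), 1e-300)

def bayes_filter_step(prior, transition, likelihood):
    """posterior(j) = (1/Z) * likelihood(j) * sum_i T[j, i] * prior(i)."""
    predicted = transition @ prior
    posterior = likelihood * predicted
    return posterior / posterior.sum()
```

On a dense pose grid the explicit transition matrix would normally be replaced by shifting the probability volume and blurring it with a Gaussian kernel, which is cheaper for large plans.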
      S4, extracting semantic features of the indoor scene image, and processing the semantic features according to a preset rule to obtain a second posterior probability distribution of the pose of the target to be positioned.
      Specifically, step S4 comprises:
       S41, performing semantic segmentation on the indoor scene image to obtain semantic features of the indoor scene image; 
       Semantic segmentation is performed by using a Convolutional Neural Network (CNN) encoder-decoder architecture, and semantic objects such as walls, doors, windows and the like in an input RGB image (indoor scene image) are identified. The semantic information is fused with the geometric features, so that the understanding and distinguishing capability of the positioning on the scene is enhanced. 
      S42, determining semantic identification labels of all pixel points of the indoor scene image according to the semantic features;
       s43, selecting a semantic likelihood field according to the semantic identification tag of the pixel point, carrying out semantic ray projection according to the semantic likelihood field, and determining a second posterior probability distribution of the gesture of the target to be positioned. 
In a preferred implementation of this embodiment, the indoor plan positioning method further includes:
S401, acquiring an indoor scene plan, wherein the indoor scene plan is provided with reference semantic labels;
S402, calculating a matching value between the semantic identification label of the pixel point and the reference semantic label;
S403, constructing a semantic likelihood field according to the matching value.
      The reference semantic tags are known semantic information (such as walls, doors, windows, etc.) extracted from the indoor plan, and the semantic identification tags are semantic information identified from the indoor scene images acquired in real time through the semantic segmentation network. The calculation of the matching value may be achieved by comparing the consistency of the semantic identification label of each pixel with the reference semantic label, using a semantic similarity measure. The higher the matching value, the more consistent the semantic information representing the pixel point is with the reference information in the plan, so that the greater the contribution of the pixel point to positioning when constructing the semantic likelihood field.
The semantic likelihood field is a probabilistic model that incorporates environmental semantic information to enhance the positioning and navigation capabilities of a robot or system. It provides a richer constraint for positioning by calculating the probability that a particular semantic label (e.g., wall, door, or window) is observed at a given location and orientation. Compared with traditional positioning methods based only on geometric information, the semantic likelihood field can identify and use key landmarks in the environment more accurately, which improves positioning accuracy and robustness. It performs particularly well in complex environments or when the map is inaccurate.
      Specifically, the reference semantic tags are extracted from the existing indoor scene plan, and in this embodiment, the reference semantic tags include walls, doors and windows, because the walls, doors and windows are easily extracted from the plan automatically and are critical to human localization, thereby improving localization effects.
      In order to enable the indoor scene plan with the reference semantic tags to be read by the robot, the indoor scene plan is converted into an occupied grid map. The occupancy grid is a two-dimensional representation of the world, and each cell in the occupancy grid has an occupancy probability that is determined by its normalized gray value.
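The conversion from normalized gray values to occupancy probabilities can be sketched as follows. This is only an illustration under the assumption that dark pixels (walls drawn in black) are occupied and that gray values range from 0 to 255; the function name `to_occupancy_grid` is hypothetical.

```python
def to_occupancy_grid(gray_image, max_gray=255):
    """Convert a gray-scale plan (list of rows) into occupancy probabilities.

    Assumption for this sketch: dark pixels are occupied, so a gray value
    of 0 maps to probability 1.0 and max_gray maps to probability 0.0.
    """
    return [[1.0 - pixel / max_gray for pixel in row] for row in gray_image]
```

If a plan uses the opposite convention (bright walls on a dark background), the expression would simply become `pixel / max_gray`.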
The semantic likelihood and the geometric feature likelihood are combined, and the pose posterior probability of the camera is updated with Bayes' rule, which strengthens the robustness of positioning in complex environments and with inaccurate maps. A joint approach is used here: when semantic labels are available, it uses likelihood fields to incorporate the semantic information. More importantly, it can also use ray projection within the likelihood field to operate without distance measurements.
The MCL motion model is typically represented by the distribution $p(s_t^i \mid s_{t-1}^i, u_t)$, where MCL stands for Monte Carlo Localization. The distribution represents the probability that, given the current state $s_{t-1}^i$ and the odometer measurement $u_t$, the system transitions to the state $s_t^i$ at the next time step $t$. Here $s_t^i$ is one of the possible states at time $t$, and $i$ is the index of the possible states, representing different state hypotheses; $u_t$ is the odometer measurement at time $t$; $s_{t-1}^i$ is the current state at time $t-1$, indexed in the same way.
The previous set of particles $S_{t-1}$ is propagated to the current set of particles $S_t$ using the odometer measurement $u_t$.
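The propagation step can be sketched as sampling from the motion model. The example below is a hedged illustration: it assumes each particle is an `(x, y, phi)` tuple, that the odometer reports a displacement in the body frame of the camera, and that the noise is independent Gaussian per axis. The names `propagate`, `odom`, and `noise_std` are not from the patent.

```python
import math
import random


def propagate(particles, odom, noise_std=(0.0, 0.0, 0.0), rng=None):
    """Move every particle by the odometry increment plus Gaussian noise.

    particles: list of (x, y, phi) poses in the map frame.
    odom: (dx, dy, dphi) displacement expressed in the body frame (assumption).
    noise_std: per-axis standard deviations of the transition noise.
    """
    rng = rng or random.Random()
    dx, dy, dphi = odom
    sx, sy, sphi = noise_std
    moved = []
    for x, y, phi in particles:
        # Rotate the body-frame displacement into the map frame.
        nx = x + dx * math.cos(phi) - dy * math.sin(phi) + rng.gauss(0.0, sx)
        ny = y + dx * math.sin(phi) + dy * math.cos(phi) + rng.gauss(0.0, sy)
        nphi = phi + dphi + rng.gauss(0.0, sphi)
        # Keep the orientation in (-pi, pi].
        nphi = math.atan2(math.sin(nphi), math.cos(nphi))
        moved.append((nx, ny, nphi))
    return moved
```

With zero noise the function is deterministic, which makes it easy to check; with non-zero noise each call spreads the particle set according to the assumed motion uncertainty.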
Here, $\eta$ is a normalization factor, $V$ is the set containing every cell in the map, and $s_t^i$ is one of the possible states at time $t$. The two likelihoods are treated as independent, and the motion model is defined in the same way as in RMCL. RMCL stands for Rao-Blackwellized Monte Carlo Localization, a positioning algorithm based on the Monte Carlo method.
The RMCL algorithm combines the random sampling of the Monte Carlo method with the optimization effect of the Rao-Blackwell theorem to improve the accuracy and efficiency of positioning. In RMCL, the algorithm maintains a set of particles representing possible positions of the robot, updates the weights of these particles based on the sensor data and control inputs, and then generates a new set of particles through a resampling step to reflect a more accurate positioning probability distribution. The prior consists of the occupancy probabilities of the cells of $V$, that is, the occupancy probability of the plan cell corresponding to the observation at time $t$. The semantic likelihood is a random variable expressing the probability of observing the semantic information $o$ in the state $s_t$. Specifically, $s_t$ represents the state at time $t$, which typically includes position and orientation (e.g., of the robot in the environment); $o$ represents the observed semantic information, for example semantic labels such as walls, doors, and windows; and $v$ is the semantic feature, representing the degree to which the semantic information $o$ observed in the state $s_t$ matches a particular semantic label $v$.
The likelihood field model computes a distance map. For each cell $v_i$, the distance $\delta_i$ to the nearest occupied cell is calculated and stored. The set of distances from each map cell $v_i$ to its nearest occupied cell is the matching value between the semantic identification label of the pixel point and the reference semantic label:

$$\delta_i = \min_{v_j \in V,\ v_j^{o} > \epsilon_o} \lVert v_i - v_j \rVert .$$

In the formula, $\delta_i$ represents the distance from the landmark $v_i$ to the nearest landmark $v_j$ in the set of landmarks that meet the condition; $\min$ represents taking the minimum over all possible landmarks $v_j$, that is, finding the landmark $v_j$ nearest to $v_i$. It should be noted that the landmark $v_i$ is the semantic identification label of the pixel point, and the landmark $v_j$ is the reference semantic label (e.g., doors, windows, and walls).
$\lVert v_i - v_j \rVert$ represents the distance between the landmarks $v_i$ and $v_j$, typically the Euclidean distance or another distance measure;
$v_j^{o}$ represents the occupancy of the landmark $v_j$;
$\epsilon_o$ represents a threshold for screening the landmarks $v_j$. Only when $v_j^{o} > \epsilon_o$ is the landmark $v_j$ taken into account.
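The distance map described above can be sketched with a brute-force search. This is an illustrative implementation only: it assumes the occupancy grid is a list of rows of probabilities, measures Euclidean distance in cell units, and uses a hypothetical default threshold of 0.5. A production system would more likely use a dedicated distance transform.

```python
import math


def distance_map(occupancy, threshold=0.5):
    """For every cell, compute the distance to the nearest occupied cell.

    A cell counts as occupied only if its occupancy probability exceeds
    `threshold`. Distances are Euclidean and measured in cell units.
    """
    occupied = [
        (r, c)
        for r, row in enumerate(occupancy)
        for c, prob in enumerate(row)
        if prob > threshold
    ]
    result = []
    for r, row in enumerate(occupancy):
        out_row = []
        for c in range(len(row)):
            # If nothing is occupied, every distance is infinite.
            nearest = min(
                (math.hypot(r - orow, c - ocol) for orow, ocol in occupied),
                default=math.inf,
            )
            out_row.append(nearest)
        result.append(out_row)
    return result
```

The brute-force search costs one pass over all occupied cells per map cell, which is acceptable for small plans and keeps the relation to the minimum in the formula easy to see.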
When an observation $z_t^k$ is received, its endpoint is estimated and used as an index into the distance map. Assuming a Gaussian error distribution, the weight of each particle $s_t^i$ can be estimated as

$$P(z_t^k \mid s_t^i, V) = \exp\left(-\frac{\delta_o^2}{2\sigma_o^2}\right),$$

where:
$z_t^k$ is the $k$-th observation at time $t$;
$s_t^i$ is one of the possible states at time $t$, where $i$ is the index of the states;
$V$ is the set containing every cell in the map;
$\delta_o$ is the distance from the endpoint of the observation $z_t^k$ to the nearest occupied cell;
$\sigma_o$ is the standard deviation of the distance error, which defines the width of the Gaussian distribution;
$\exp\left(-\delta_o^2 / (2\sigma_o^2)\right)$ is the exponential part of the probability density function of a Gaussian distribution, used to calculate the probability for a given distance difference $\delta_o$ and standard deviation $\sigma_o$.
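A small sketch of this lookup is given below. It assumes the distance map is a list of rows, that the endpoint has already been converted to integer `(row, col)` cell indices, and that endpoints falling outside the map receive zero likelihood; the last point is a choice made for this example, not something the patent specifies.

```python
import math


def observation_likelihood(dist_map, endpoint, sigma):
    """Gaussian likelihood of one observation endpoint.

    dist_map: distance to the nearest occupied cell, per map cell.
    endpoint: (row, col) cell index of the estimated ray endpoint.
    sigma: standard deviation of the distance error.
    """
    row, col = endpoint
    inside = 0 <= row < len(dist_map) and 0 <= col < len(dist_map[0])
    if not inside:
        return 0.0  # assumption: off-map endpoints are treated as impossible
    delta = dist_map[row][col]
    return math.exp(-(delta * delta) / (2.0 * sigma * sigma))
```

An endpoint that lands exactly on an occupied cell has distance 0 and therefore likelihood 1; the likelihood decays with the distance at a rate set by `sigma`.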
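For each reference semantic label (door, wall, window) present in the floor plan, we can calculate a distance map that stores the shortest distance to the cells with the same label. Formally, for each map cell $v_i$ we can estimate the distance to the nearest cell of each label $\ell$ as

$$\delta_i^{\ell} = \min_{v_j \in V,\ v_j^{\ell} > \epsilon_{\ell}} \lVert v_i - v_j \rVert, \qquad \ell \in \{\text{wall}, \text{door}, \text{window}\}.$$

The shortest distance is calculated among the cells that carry the same semantic label.
$v_j^{\ell}$ represents the occupancy of the landmark $v_j$ for the label $\ell$; $\epsilon_{\ell}$ represents a threshold for screening the landmarks $v_j$. Only when $v_j^{\ell} > \epsilon_{\ell}$ is the landmark $v_j$ taken into account.
$\delta_i^{\text{wall}}$, $\delta_i^{\text{door}}$, and $\delta_i^{\text{window}}$ are the distances to the nearest wall, door, and window, respectively; they construct the three semantic likelihood fields.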
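The three semantic likelihood fields can be sketched as one distance map per label. The example assumes the plan has already been thresholded into a grid whose cells hold a label string (`"wall"`, `"door"`, `"window"`) or `None` for free space; this label grid and the function name `semantic_distance_maps` are assumptions for illustration.

```python
import math

LABELS = ("wall", "door", "window")


def semantic_distance_maps(label_grid, labels=LABELS):
    """Build one distance map per semantic label.

    label_grid: list of rows; each cell is a label string or None.
    Returns a dict mapping each label to a grid of Euclidean distances
    (in cell units) to the nearest cell carrying that label.
    """
    maps = {}
    for label in labels:
        cells = [
            (r, c)
            for r, row in enumerate(label_grid)
            for c, value in enumerate(row)
            if value == label
        ]
        maps[label] = [
            [
                min(
                    (math.hypot(r - rr, c - cc) for rr, cc in cells),
                    default=math.inf,  # label absent from the plan
                )
                for c in range(len(row))
            ]
            for r, row in enumerate(label_grid)
        ]
    return maps
```

Keeping the maps in a dictionary keyed by label makes the later step of choosing a field from the label of an observation a single lookup.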
When an observation $z_t^k$ is received, its semantic label $\ell$ is used to determine which semantic likelihood field to use for semantic ray projection.
In this embodiment there are three semantic likelihood fields, for doors, walls, and windows respectively. When a semantic label, for example door, is present in the observed image, matching is performed by semantic ray projection: the door pixel points are set to 1 and all other pixel points to 0, and sliding-window equiangular ray semantic matching is carried out. The other two semantic likelihood fields (wall and window) are matched in the same way.
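One possible reading of semantic ray projection is sketched below: rays are cast at equal angular spacing from a pose hypothesis into a labeled plan, and the label hit by each ray is compared with the label observed in the image. This is an interpretation for illustration, not the patent's exact matching procedure; the grid layout (`label_grid[row][col]`, with `x` along columns and `y` along rows), the step size, and all function names are assumptions.

```python
import math


def cast_semantic_ray(label_grid, origin, angle, max_range=50.0, step=0.05):
    """March along a ray and return (label, distance) of the first labeled cell.

    Returns (None, distance) if the ray leaves the map or exceeds max_range.
    """
    x0, y0 = origin
    dist = 0.0
    while dist <= max_range:
        x = x0 + dist * math.cos(angle)
        y = y0 + dist * math.sin(angle)
        row, col = int(math.floor(y)), int(math.floor(x))
        if not (0 <= row < len(label_grid) and 0 <= col < len(label_grid[0])):
            return None, dist
        label = label_grid[row][col]
        if label is not None:
            return label, dist
        dist += step
    return None, max_range


def ray_label_agreement(label_grid, pose, ray_angles, observed_labels):
    """Fraction of rays whose predicted label equals the observed label.

    pose: (x, y, heading); ray_angles are relative to the heading.
    """
    x, y, heading = pose
    hits = 0
    for rel_angle, observed in zip(ray_angles, observed_labels):
        label, _ = cast_semantic_ray(label_grid, (x, y), heading + rel_angle)
        hits += int(label == observed)
    return hits / len(observed_labels)
```

Fixed-step ray marching is used here for brevity; an exact grid traversal would avoid skipping thin structures when the step is large relative to the cell size.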
After the observation probability is obtained, the weight $w_t^i$ of each particle is calculated as

$$w_t^i = w_{t-1}^i \cdot P(z_t^k \mid s_t^i, V),$$

where $w_{t-1}^i$ is the weight at the previous moment. After normalization, the new weights are used for state estimation, i.e.

$$\hat{s}_t = \sum_{i=1}^{N} w_t^i\, s_t^i .$$

Weight updating and state estimation together improve the performance of the positioning system, so that the robot can position itself and navigate accurately in a complex environment. In these formulas:
$w_t^i$ is the weight of the $i$-th particle at time $t$;
$w_{t-1}^i$ is the weight of the $i$-th particle at time $t-1$;
$P(z_t^k \mid s_t^i, V)$ is the probability of observing the landmark $z_t^k$ given the state $s_t^i$ and the set $V$ of every cell in the map;
$\sum_{i=1}^{N} P(z_t^k \mid s_t^i, V)$ sums the observation probability over all $N$ particles and is used to normalize the weights; $i$ is the particle index and $N$ is the total number of particles;
$P(s_t \mid u_t, z_t, s_{t-1}, V)$ is the probability that the system is in the state $s_t$ at time $t$, given the odometer input $u_t$, the current observation $z_t$, the previous state $s_{t-1}$, and the set $V$ of every cell in the map;
$\sum_{i=1}^{N} w_t^i\, s_t^i$ estimates the state of the system at time $t$ by multiplying the state $s_t^i$ of each particle by its corresponding weight $w_t^i$ and summing.
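The weight update and the weighted state estimate can be sketched as follows. Two choices in this example are assumptions rather than statements from the patent: the weights are normalized so that they sum to one, and the orientation is averaged with a circular mean so that angles near the wrap-around point are handled sensibly.

```python
import math


def update_weights(prev_weights, likelihoods):
    """Multiply previous weights by observation likelihoods and normalize."""
    raw = [w * p for w, p in zip(prev_weights, likelihoods)]
    total = sum(raw)
    if total == 0.0:
        # Assumption: fall back to uniform weights if every particle is impossible.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def estimate_state(particles, weights):
    """Weighted mean pose of (x, y, phi) particles; weights must sum to one."""
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    # Circular mean of the orientation (assumption made for this sketch).
    s = sum(w * math.sin(p[2]) for p, w in zip(particles, weights))
    c = sum(w * math.cos(p[2]) for p, w in zip(particles, weights))
    return x, y, math.atan2(s, c)
```

For example, two equally weighted particles with likelihoods 1.0 and 3.0 end up with weights 0.25 and 0.75, so the estimate is pulled three quarters of the way toward the second particle.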
Since a lower prior is more discriminative, the threshold $\epsilon_{\ell}$ is tied to the prior of each label. This is done not only because it leaves fewer parameters to tune, but also because it implicitly makes observing rare landmarks more beneficial than observing common landmarks.
Fusing semantic information increases tolerance to environmental changes and map errors. When handling complex scenes such as inaccurate maps, repeated scene structures, dynamic obstacles, and illumination changes, the method achieves stable and accurate positioning, maintains system performance, and reduces the risk of positioning failure and drift.
      S5, weighting the first posterior probability distribution and the second posterior probability distribution according to a preset proportion to obtain weighted posterior probability distribution, inputting the weighted posterior probability distribution into a histogram filter, and outputting a three-dimensional probability body of the gesture of the target to be positioned by the histogram filter to obtain a positioning result of the target to be positioned.
The three-dimensional probability body refers to a digital representation of a probability density field in a three-dimensional space, in which the geometric/semantic existence probability of each voxel is quantized. The position of the target to be positioned is read from the three-dimensional probability body, which makes it easier to position the target when the indoor scene is complex.
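Reading a positioning result from such a probability body can be sketched as a search for the most probable voxel. The example assumes the body is stored as nested lists indexed as `volume[ix][iy][itheta]`, i.e. two position bins and one orientation bin; that layout and the name `most_likely_pose` are assumptions made for this illustration.

```python
def most_likely_pose(volume):
    """Return ((ix, iy, itheta), probability) of the most probable voxel.

    volume: nested lists indexed as volume[ix][iy][itheta] (assumed layout).
    """
    best_index = None
    best_prob = float("-inf")
    for ix, plane in enumerate(volume):
        for iy, column in enumerate(plane):
            for itheta, prob in enumerate(column):
                if prob > best_prob:
                    best_index = (ix, iy, itheta)
                    best_prob = prob
    return best_index, best_prob
```

The returned indices would still have to be converted to metric coordinates and an angle using the cell size and the angular resolution of the grid.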
The probability likelihood distribution obtained by semantic ray projection (the second posterior probability distribution) is fused with the posterior probability distribution obtained by Bayes' rule (the first posterior probability distribution) as follows:
The weight calculated by the Bayesian likelihood function and the weight calculated by the semantic ray projection are summed according to a certain proportion, wherein the weight of the Bayesian likelihood function is $\alpha$ and the semantic ray projection weight is $\beta$.
The weighted probability distribution is expressed as:

$$P = \alpha\, P_1 + \beta\, P_2,$$

where $P_1$ represents the first posterior probability distribution and $P_2$ represents the second posterior probability distribution.
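The weighted fusion of the two posteriors can be sketched on flattened distributions over the same pose grid. The example assumes that the two proportions sum to one (so a single parameter `alpha` is enough) and renormalizes the result; both points, like the function name `fuse_posteriors`, are assumptions for illustration.

```python
def fuse_posteriors(p_first, p_second, alpha=0.5):
    """Blend two discrete posteriors defined over the same pose cells.

    alpha weights the first (geometric) posterior and 1 - alpha the
    second (semantic) posterior. The result is renormalized to sum to one.
    """
    if len(p_first) != len(p_second):
        raise ValueError("posteriors must cover the same pose cells")
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(p_first, p_second)]
    total = sum(fused)
    return [f / total for f in fused]
```

With `alpha=0.5`, the posteriors `[0.2, 0.8]` and `[0.6, 0.4]` blend to approximately `[0.4, 0.6]`; the fused distribution is what would then be passed on to the histogram filter.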
An efficient histogram filter is used to fuse the positioning information calculated by Bayes' rule with the positioning information calculated from the semantic features, which improves positioning stability and accuracy. The filter represents the pose posterior as a three-dimensional probability body, decomposes the relative pose between frames into translation and rotation, and performs the transition update efficiently as grouped convolutions over the different orientations, so that the probability distribution of the pose estimate converges quickly.
      Based on the same inventive concept, an embodiment of the invention provides an indoor plane map positioning system.
The indoor plane map positioning system can be installed in an electronic device. According to the functions it implements, the indoor plane map positioning system comprises an acquisition module, a feature extraction module, a first processing module, a second processing module, and an output module:
The acquisition module acquires the multi-view indoor scene images collected by the target to be positioned;
The feature extraction module extracts monocular depth features of the indoor scene images and analyzes the monocular depth features of the multi-view indoor scene images to obtain multi-view depth features;
The first processing module fuses the monocular depth features and the multi-view depth features to obtain fused depth features, and processes the fused depth features according to a preset rule to obtain a first posterior probability distribution of the pose of the target to be positioned, the pose representing the position and the direction of the target to be positioned in the indoor space;
The second processing module extracts semantic features of the indoor scene images and processes the semantic features according to a preset rule to obtain a second posterior probability distribution of the pose of the target to be positioned;
The output module weights the first posterior probability distribution and the second posterior probability distribution according to a preset proportion to obtain a weighted posterior probability distribution and inputs the weighted posterior probability distribution into a histogram filter; the histogram filter outputs a three-dimensional probability body of the pose of the target to be positioned, from which the positioning result of the target to be positioned is obtained.
      The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
      The various modifications and specific examples of the indoor plane view positioning method provided in the foregoing embodiment are applicable to the indoor plane view positioning system of this embodiment, and those skilled in the art will clearly know the implementation method of the indoor plane view positioning system in this embodiment through the foregoing detailed description of the indoor plane view positioning method, which is not described in detail herein for brevity of description.
The application also discloses an electronic device. Fig. 4 is a schematic structural diagram of an electronic device implementing the indoor plan positioning method according to an embodiment of the application. The electronic device may comprise at least one processor 10, a memory 11 in communicative connection with the at least one processor, a communication bus 12, and a communication interface 13, and may further comprise a computer program stored in the memory 11 and executable on the processor 10, such as an indoor plan positioning method program.
      The processor 10 may be formed by an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed by a plurality of integrated circuits packaged with the same function or different functions, including one or more central processing units (Central Processing Unit, CPU), a microprocessor, a digital processing chip, a combination of a graphics processor and various control chips, etc. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device and processes data by running or executing programs or modules stored in the memory 11 (for example, executing an indoor floor map positioning method, etc.), and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments the memory 11 may be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. In other embodiments the memory 11 may be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as the code of the indoor plan positioning method program, but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection and communication between the memory 11 and the at least one processor 10, among other components.
The communication interface 13 is used for communication between the above-described electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface or a Bluetooth interface), typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a display or an input unit such as a keyboard, or alternatively a standard wired or wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device and for displaying a visual user interface.
      Fig. 4 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 is not limiting of the electronic device and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power source (such as a battery) for powering the respective components. The power source may be logically connected to the at least one processor 10 through a power management device, so that charge management, discharge management, and power consumption management are performed through the power management device. The power source may also include one or more of a direct current or alternating current power supply, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, etc. The electronic device may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which are not described here.
It should be understood that the embodiments are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
      Further, the electronic device integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or nonvolatile.
      Embodiments of the present application provide a computer-readable storage medium including, for example, any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM). The computer readable storage medium stores a computer program that can be loaded by a processor and that performs the indoor plan positioning method of the above-described embodiment.
      In the description of the present specification, the descriptions of the terms "one embodiment," "some embodiments," "examples," "specific examples," "one implementation," "a preferred embodiment," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
      Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.
    Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202510455316.8A CN119963651B (en) | 2025-04-11 | 2025-04-11 | Indoor plan positioning method, system, equipment and storage medium | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN119963651A (en) | 2025-05-09 |
| CN119963651B (en) | 2025-06-20 |
Family
ID=95603064
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510455316.8A (Active) CN119963651B (en) | 2025-04-11 | 2025-04-11 | Indoor plan positioning method, system, equipment and storage medium |
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN119963651B (en) | 
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN110243370A (en) * | 2019-05-16 | 2019-09-17 | 西安理工大学 | A 3D Semantic Map Construction Method for Indoor Environment Based on Deep Learning | 
| CN111798475A (en) * | 2020-05-29 | 2020-10-20 | 浙江工业大学 | A method for constructing 3D semantic map of indoor environment based on point cloud deep learning | 
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| WO2021178537A1 (en) * | 2020-03-04 | 2021-09-10 | Magic Leap, Inc. | Systems and methods for efficient floorplan generation from 3d scans of indoor scenes | 
| CN113673400A (en) * | 2021-08-12 | 2021-11-19 | 土豆数据科技集团有限公司 | Real scene three-dimensional semantic reconstruction method and device based on deep learning and storage medium | 
| CN114460943B (en) * | 2022-02-10 | 2023-07-28 | 山东大学 | Self-adaptive target navigation method and system for service robot | 
| CN115239809A (en) * | 2022-07-11 | 2022-10-25 | 东北大学 | Object perception SLAM algorithm based on quadric surface initialization and joint data association | 
| US12175562B2 (en) * | 2022-11-11 | 2024-12-24 | MFTB Holdco, Inc. | Automated inter-image analysis of multiple building images for building information determination | 
| CN116229247A (en) * | 2023-03-02 | 2023-06-06 | 深圳市金地数字科技有限公司 | Indoor Scene Semantic Segmentation Method, Device, Equipment and Medium | 
| CN118485824A (en) * | 2024-04-09 | 2024-08-13 | 华北水利水电大学 | Semantic segmentation method for complex indoor scenes based on RGB-D feature fusion | 
| CN118823826A (en) * | 2024-06-20 | 2024-10-22 | 中科晶锐(苏州)科技有限公司 | A real-time human posture estimation method and computer-readable storage medium | 
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |