CN119629304B - A multi-view surveillance video splicing method and system based on energy map estimation - Google Patents
- Publication number
- CN119629304B, CN202411555475.7A
- Authority
- CN
- China
- Prior art keywords
- video
- splicing
- point
- energy
- pixel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
- H04N7/181—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a plurality of remote sources
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/698—Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a multi-view surveillance video stitching method and system based on energy map estimation. The method comprises: obtaining video data and camera positions from a plurality of surveillance cameras covering the road section to be stitched; extracting video frame features from each camera's video; classifying the pixels in each video frame as background pixels or foreground pixels; matching the background pixels; layering the video frames by visual depth using a multi-layer RANSAC method; performing local matching, layer-by-layer stitching and stitched-layer fusion on video frames with overlapping areas that are acquired at the same time by cameras at two different positions; and stitching the video frames captured by all cameras to complete the stitching of the multi-view surveillance video and obtain a panoramic video. The method requires little computation, runs quickly and stitches well, balances algorithm efficiency against stitching accuracy, and provides an efficient and robust method for the field of surveillance video stitching.
    Description
Technical Field
      The invention belongs to the technical field of video stitching, and particularly relates to a multi-view monitoring video stitching method and system based on energy map estimation.
    Background
      In recent years, as 'intelligent traffic' construction projects have been promoted in many places, the coverage of traffic monitoring has kept expanding. However, the large number of separate monitoring pictures makes it difficult for management staff to grasp the traffic condition of a monitored area intuitively. Current traffic monitoring systems mostly use a plurality of small-range monocular cameras to monitor road traffic, so operators must switch frequently between viewing angles and the spatial relationship between different pictures is broken, which hinders the rapid handling of traffic accidents, road congestion and similar problems.
      Video stitching is an extension of image stitching and refers to the technique of seamlessly stitching several video sequences with overlapping portions into a wide-view or even panoramic video. It creates a video with a wider field of view by merging a set of videos with overlapping fields of view. A typical video stitching procedure includes feature extraction, corresponding point matching, image registration, and image synthesis. Feature extraction and corresponding point matching find corresponding points among images from different viewing angles. In the image registration step, the images are transformed into a common coordinate system according to these corresponding points so that they are aligned. Because the aligned images have overlapping areas, the color information of the overlapping areas and their neighborhoods is determined in the image synthesis step.
      Video stitching technology can solve the above problems of traffic monitoring management: with stitched surveillance video, a manager can intuitively understand the current road traffic situation. However, existing video stitching technology has many problems in traffic monitoring scenes, which contain many moving objects. In the stage of establishing the initial stitching model, image registration has been widely studied to improve the quality of image alignment, but it remains challenging when there are large parallax, lens distortion, and exposure differences between the input images. Existing image registration algorithms assume that the input images are captured by cameras sufficiently far from the scene to reduce parallax. This limits the application of video stitching to surveillance video, where the cameras are far apart and the parallax, lens distortion, and exposure differences are large. In the stage of continuously stitching and fusing video frames, video stitching also faces time constraints and visual artifacts caused by temporal inconsistency. Changes in the stitching line position or in lighting conditions between successive frames can cause additional visual artifacts. In addition, traditional video fusion algorithms involve a large amount of computation and generate video slowly, so they are difficult to apply to surveillance video stitching, which requires a quick response.
    Disclosure of Invention
      Aiming at the defects of the prior art, the invention provides a multi-view surveillance video stitching method and system based on energy map estimation. In the stage of continuously stitching and fusing video frames, the stitching line is calculated dynamically by an energy map estimation algorithm, which searches for the stitching line that bypasses moving objects and changes least between video frames. This effectively solves the problems of object ghosting and of visible stitching lines on objects, which arise because surveillance video contains many fast-moving objects.
      In order to achieve the above object, the present invention provides a multi-view monitoring video stitching method based on energy map estimation, comprising the following steps:
       Step 1, obtaining video data and camera positions of a plurality of surveillance cameras of the road section to be stitched;
       Step 2, extracting the video frame features of each camera video by using a feature extraction algorithm, performing background modeling on the video frames by using a background modeling algorithm, and classifying the pixels in the video frames as background pixels or foreground pixels;
       Step 3, matching the background pixels in the video frames based on nearest neighbor search, completing the background matching of the video frames captured by different cameras;
       Step 4, based on the video frame features extracted in step 2, using a multi-layer RANSAC method to layer by visual depth the video frames with overlapping areas that are acquired at the same time by cameras at two different positions, separating objects at different visual depths into different layers;
       Step 5, locally matching, layer by layer, the video frames with overlapping areas that are acquired at the same time by the cameras at the two different positions;
       Step 6, stitching, layer by layer, the video frames with overlapping areas that are acquired at the same time by the cameras at the two different positions, using a minimum-energy stitching line search method, and fusing the stitched video frames of all layers to complete the stitching of the two video frames;
       Step 7, performing the stitching operation of steps 4 to 6 on the video frames captured by all cameras to complete the stitching of the multi-view surveillance video and obtain the panoramic video.
      Further, in step 2, a feature extraction algorithm is used to extract the video frame features of each camera video, including the color features and gray values of the pixels in the video frames, and a Gaussian mixture model (GMM) is used to perform background modeling on the video frames. The specific operations are: ① all extracted video frame features are organized into a data set, which is divided into a training set used to train the GMM and a test set used to test the training effect of the GMM; ② the parameters of the GMM, including the mean, variance and weight of each Gaussian component, are initialized randomly and then updated iteratively with the expectation-maximization algorithm to maximize the likelihood function, so that the trained GMM contains the Gaussian component parameters of each feature dimension; ③ the extracted video frame features are input into the trained GMM, the probability that each pixel in the video frame belongs to the background is calculated, and each pixel is classified as a background pixel or a foreground pixel according to this probability.
      Further, in step 3, the Euclidean distance between each background pixel in a video frame and the background pixels of the other video frames is calculated. If the Euclidean distance between two background pixels is smaller than a set threshold, the two background pixels are considered matched. These calculation and judgment operations are executed on all video frames in turn, which completes the background matching of all video frames.
      Further, in step 4, the homography matrix of each layer is calculated first. Let $F=\{(p_i, p_i')\}_{i=1}^{N}$ denote the set of matched feature point pairs, where $p_i$ and $p_i'$ are the $i$-th pair of feature points in the first and second video frames respectively and $N$ is the total number of matched feature point pairs. For each pair of feature points, an augmented matrix $A$ and a target vector $b$ are formed (formulas (1) and (2), whose images are missing from this text), where $(x_i, y_i)$ and $(x_i', y_i')$ are the coordinates of the matched points in the first and second video frames respectively;
       the homography matrix $H$ maps points in one video frame to points in the other, which matches video pictures taken from different angles; its parameters are estimated by the least squares method:
      $h=\arg\min_{h}\|Ah-b\|$ (3)
       where $h$ is a 9-dimensional vector of the homography matrix parameters;
       the homography matrix $H$ is obtained by rearranging the vector $h$ obtained by the least squares method:
      $H=\begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}$ (4)
       where $h_1, h_2, \dots, h_9$ are the components of $h$;
       The matrix $H$ has 9 parameters and 8 degrees of freedom, and 8 pairs of feature points are used to solve it. For the first RANSAC layer, 8 different pairs of feature points are drawn from all feature point pairs a number of times. For each draw, a homography matrix is solved from the 8 drawn pairs and the sum of the residuals produced by mapping the other feature point pairs with that homography matrix is calculated. The homography matrix $H_1$ with the smallest residual sum is used for the first layer of visual depth layering: a threshold $T$ is set, and after all feature point pairs are mapped by $H_1$, the pairs whose residual is smaller than $T$ are regarded as inliers and the others as outliers. All inliers obtained are regarded as points in the first visual depth layer. Then 8 pairs of feature points are drawn a number of times from the outliers, a homography matrix is solved from each draw, and the residual sum over the other feature point pairs is calculated. The homography matrix $H_2$ with the smallest residual sum is used for the second layer of visual depth layering: after the remaining feature point pairs are mapped by $H_2$, the pairs whose residual is smaller than $T$ are regarded as inliers and as points in the second visual depth layer, and the others as outliers.
      Further, in step 5, the video frames with overlapping areas that are acquired at the same time by the cameras at the two different positions are matched layer by layer and aligned separately. The source video frame is divided into M×N grids, and during the matching of each layer the grids are matched locally and separately. The deformation mapping of a grid is selected according to the distance between the grid center point and its nearest feature point in each layer. The grid $g_j$ is represented by its center point $c_j$, and its homography matrix $\hat{H}_j$ is obtained by combining the homography matrices $H_k$ of the $N_0$ layers with weights $w_j^k$:
      $\hat{H}_j=\sum_{k=1}^{N_0} w_j^k H_k,\qquad w_j^k=\dfrac{\exp\left(-\|c_j-\tilde{p}_k\|_2^2/\sigma^2\right)}{\sum_{l=1}^{N_0}\exp\left(-\|c_j-\tilde{p}_l\|_2^2/\sigma^2\right)}$
       where $H_k$ is the homography matrix of the $k$-th layer; $\tilde{p}_k$ is the feature point in layer $k$ closest to the grid center point $c_j$; $\|\cdot\|_2$ is the Euclidean norm, used to compute the distance between the feature point $\tilde{p}_k$ and the grid center point $c_j$; $\sigma$ is the standard deviation of the Gaussian function; the exponential term is a position-dependent Gaussian weight that converts distance into a weight value, so that closer positions receive higher weights and farther positions lower weights; and the fraction $w_j^k$ is the share of that Gaussian weight in the total.
      The argument of the exponential term is the square of the Euclidean distance between the two points divided by the variance of the Gaussian function. A smaller value indicates that the two points are closer, and the corresponding weight is larger.
      The homography matrix of each grid is calculated. Based on the grid homography matrix $\hat{H}_j$, the source video frame pixel at position $p'$ in grid $g_j$ is mapped by $p=\hat{H}_j p'$ to its target position $p$ in the target video frame, which completes the matching.
      Further, in step 6 the energy map $E_t(i,j)$ at pixel position $(i,j)$ at time $t$ is calculated (the formula image is missing from this text). In the formula, $E_t(i,j)$ and $E_{t-1}(i,j)$ are the energy maps at pixel position $(i,j)$ of the video frame at times $t$ and $t-1$ respectively, the attenuation factor $\alpha$ determines how much the energy map at the previous time contributes to the current energy map, and $S_t(i,j)$ is the union of the pixels contained in any foreground object detected in the video frames of the different views at time $t$, namely:
      $S_t(i,j)=S_t^{1}(i,j)\vee S_t^{2}(i,j)$
       where $\vee$ is the logical OR operation, and $S_t^{1}$ and $S_t^{2}$ are the instance segmentation results of the corresponding layer in the video frames of the two views. $S_t(i,j)$ takes the value 1 when it lies in the overlapping area and 0 otherwise, namely:
      $S_t(i,j)=\begin{cases}1, & (i,j)\in\Omega_0\\ 0, & \text{otherwise}\end{cases}$
       where $\Omega_0$ is the set of foreground object pixel positions belonging to the overlapping area;
       To solve the problem that long-remembered object energy may create invalid seam boundaries, the contributions of the energy maps of previous frames are adjusted by a threshold formula (image missing from this text), in which $\tau$ is a threshold defining the time window over which energy accumulates. Moments whose energy value is less than $\tau$ are not considered, which prevents redundant information from too many past moments from interfering with the current energy map estimate. $\tau$ is set from $N^*$, the number of frames for which energy is held, and the attenuation factor $\alpha$ (the expression is missing from this text). By adjusting the attenuation factor $\alpha$, the contribution of each frame to the output energy map can be controlled.
      The midpoint of the first row of the overlapping area is used as the starting point of the stitching line, the pixel with the minimum energy is searched row by row as the next stitching point, and the stitching points are connected to obtain the stitching line. If the midpoint of the first row is a foreground point, the non-foreground pixel with the smallest column offset from the midpoint is selected as the starting point instead. When searching the path of the stitching line, the left energy map and the right energy map of the current stitching point are compared first; they are calculated by a formula whose image is missing from this text.
       In that formula, $(x, y)$ is the position of the current stitching point and $(i, j)$ is a pixel position taking part in the energy map judgment; $E_t(i,j)$ is the energy map at pixel position $(i,j)$ at time $t$; $E_p(i,j)$ is the distance contribution coefficient at pixel position $(i,j)$; $H$ is the height of the overlapping area; $\lceil\cdot\rceil$ is the round-up operation; $\psi(\cdot)$ is an indicator function that takes the value 1 when its condition holds and 0 otherwise; and $\beta_1$ and $\beta_2$ are weight parameters.
      The side with the smaller energy map is selected, which ensures that the stitching line extends in the direction of the smallest estimated energy. To shorten the processing time, only 2p-1 pixels below the current stitching point (i, j), on the side with the smaller energy map, are checked each time. This can be implemented as a greedy search:
      $(x^*, y^*)=\arg\min_{k} E_t(i+1,k)$
       where $(x^*, y^*)$ denotes the pixel with the lowest energy and $E_t(i+1,k)$ is the energy map at position $(i+1,k)$ at time $t$. When the left energy map is smaller than the right energy map, $k\in\{y-p, y-p+1, \dots, y\}$; when the two are equal, $k\in\{y-p+1, \dots, y, \dots, y+p-1\}$; when the left energy map is larger, $k\in\{y, \dots, y+p-1, y+p\}$. Here $p$ is the set offset and $p\ge 2$.
      A minimum-energy pixel is assigned as the stitching point only when it is unique and does not belong to a foreground object. If the minimum-energy pixel does not belong to a foreground object but is not unique, the candidate with the smallest column offset from the current stitching point is selected as the stitching point. If the minimum-energy pixel is unique but belongs to a foreground object, the pixel with the smallest energy among the non-foreground pixels is selected as the stitching point.
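The row-by-row search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented formula: the left and right energies are approximated by simple sums over the p neighbouring pixels in the next row (the actual expression with the distance coefficient, the indicator function and the weights β1 and β2 is not reproduced), the mapping of the three comparison outcomes to the three candidate ranges is inferred from the text, and the name `find_seam` is hypothetical.

```python
def find_seam(energy, foreground, p=2):
    """Greedy seam search: returns one column index per row of the overlap area.

    energy[i][k]     -- energy map value at row i, column k
    foreground[i][k] -- truthy if the pixel belongs to a foreground object
    p                -- column offset limiting the search window (p >= 2)
    """
    rows, cols = len(energy), len(energy[0])
    mid = cols // 2
    # Start at the midpoint of the first row, or at the nearest non-foreground pixel.
    free = [c for c in range(cols) if not foreground[0][c]]
    y = min(free, key=lambda c: abs(c - mid)) if free else mid
    seam = [y]
    for i in range(rows - 1):
        nxt = energy[i + 1]
        # Assumption: side energies are plain sums over p pixels in the next row.
        left = sum(nxt[max(0, y - p):y])
        right = sum(nxt[y + 1:y + p + 1])
        if left < right:
            lo, hi = y - p, y              # search on the cheaper left side
        elif left > right:
            lo, hi = y, y + p              # search on the cheaper right side
        else:
            lo, hi = y - p + 1, y + p - 1  # 2p-1 pixels centred on y
        candidates = list(range(max(lo, 0), min(hi, cols - 1) + 1))
        # Prefer non-foreground pixels; break energy ties by smallest column offset.
        non_fg = [k for k in candidates if not foreground[i + 1][k]]
        pool = non_fg or candidates
        y = min(pool, key=lambda k: (nxt[k], abs(k - y)))
        seam.append(y)
    return seam
```

For a small overlap area with a foreground blob in the middle columns, the seam starts at the midpoint and steps to the left of the blob instead of crossing it.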
      Further, in step 7, the stitching operation of steps 4 to 6 is performed on the video frames captured by all cameras: the result of stitching two video frames is regarded as a single frame, and the stitching operation of steps 4 to 6 is performed on it and another video frame with which it has an overlapping area, until the stitching of the multi-view surveillance video is completed and the panoramic video is obtained.
      The invention also provides a multi-view monitoring video splicing system based on the energy map estimation, which is used for realizing the multi-view monitoring video splicing method based on the energy map estimation.
      Further, the system includes a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions in the memory to perform a multi-view surveillance video stitching method based on energy map estimation as described above.
      Alternatively, the system comprises a readable storage medium on which a computer program is stored; when executed, the computer program implements the multi-view surveillance video stitching method based on energy map estimation described above.
      Compared with the prior art, the invention has the following advantages:
       1) The method fully considers the characteristics of large parallax and more moving objects of the monitoring video, and adopts a local matching alignment algorithm in the stage of establishing the initial splicing template, thereby avoiding the problem of image deformation caused by large parallax; 
       2) In the stage of searching for the video stitching line, a path search method based on energy map estimation is adopted. Because the pixels of foreground objects have larger energy on the energy map, the problem is converted into searching for the path with minimum energy. Comparing the energy maps on the left and right sides of the current stitching point ensures that the line extends in the direction of the smallest estimated energy. This solves the problems of object ghosting and of visible stitching lines on objects, which arise because surveillance video contains many fast-moving objects;
       3) The method has the characteristics of small calculated amount, high calculated speed and good splicing effect, keeps the balance of algorithm efficiency and splicing precision, and provides an efficient method with parallax robustness for the field of monitoring video splicing. 
    Drawings
      Fig. 1 is a flowchart of a multi-view monitoring video stitching method based on energy map estimation according to an embodiment of the present invention.
      Fig. 2 is a schematic diagram of a stitching line searching process according to an embodiment of the present invention.
      FIG. 3 is a graph of a splice line search result according to an embodiment of the present invention.
    Detailed Description
      The invention provides a multi-view monitoring video splicing method and system based on energy map estimation, and the technical scheme of the invention is further described below with reference to the accompanying drawings.
      Example 1
      As shown in fig. 1, an embodiment of the present invention provides a multi-view monitoring video stitching method based on energy map estimation, which includes the following steps:
       Step 1, acquiring video data and camera positions of a plurality of surveillance cameras of the road section to be stitched.
      Step 2, extracting the video frame features of each camera video by using a feature extraction algorithm, performing background modeling on the video frames by using a background modeling algorithm, and classifying the pixels in the video frames as background pixels or foreground pixels.
      This embodiment uses the SIFT algorithm to extract the video frame features of each camera video, including the color features and gray values of the pixels in the video frames. A Gaussian mixture model (GMM) is used to perform background modeling on the video frames. Specifically: ① all extracted video frame features of the cameras are organized into a data set, which is divided into a training set used to train the GMM and a test set used to test the training effect of the GMM; ② the parameters of the GMM, including the mean, variance and weight of each Gaussian component, are initialized randomly and then updated iteratively with the expectation-maximization (EM) algorithm to maximize the likelihood function; ③ the extracted video frame features are input into the trained GMM, the probability that each pixel in the video frame belongs to the background is calculated, and each pixel is classified according to this probability as a background (non-moving object) pixel or a foreground (moving object) pixel.
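The EM fitting and the probability-based classification in ② and ③ can be illustrated with a one-dimensional sketch over gray values. It rests on assumptions that are not in the text: the means are initialized deterministically across the value range (the embodiment uses random initialization) so that the example is reproducible, the component with the largest weight is treated as the background, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, k=2, iters=50):
    """Fit a k-component 1-D GMM to gray values with the EM algorithm."""
    lo, hi = min(samples), max(samples)
    # Assumption: evenly spread initial means instead of random initialization.
    means = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    grand_mean = sum(samples) / len(samples)
    spread = sum((x - grand_mean) ** 2 for x in samples) / len(samples)
    variances = [max(spread, 1e-6)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue
            weights[j] = nj / len(samples)
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(var, 1e-6)
    return weights, means, variances


def background_probability(x, model):
    """Posterior probability that gray value x belongs to the background.

    Assumption: the component with the largest weight models the background.
    """
    weights, means, variances = model
    bg = max(range(len(weights)), key=lambda j: weights[j])
    p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(p)
    return p[bg] / total if total > 0 else 0.0
```

A pixel would then be labeled background when `background_probability` exceeds a chosen cut-off such as 0.5, which is also an assumption of this sketch.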
      Step 3, matching the background pixels in the video frames based on nearest neighbor search, completing the background matching of the video frames captured by different cameras.
      The Euclidean distance between each background pixel in a video frame and the background pixels of the other video frames is calculated. If the Euclidean distance between two background pixels is smaller than a set threshold, the two background pixels are considered matched. These calculation and judgment operations are carried out on all video frames in turn, which completes the background matching of all video frames. Background matching establishes the correspondence between same-name pixels in the background parts of the video frames of different cameras. This unifies the background parts of the video frames, reduces color inconsistency in the background, lays the foundation for the subsequent video stitching, and improves the quality of the stitched video.
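A brute-force version of this nearest-neighbor matching might look like the sketch below. The feature tuples, the threshold value and the name `match_background` are illustrative assumptions; the text does not specify the feature vector or the search structure.

```python
import math


def match_background(features_a, features_b, threshold):
    """Match each background feature in A to its nearest neighbour in B.

    A pair is kept only if the Euclidean distance is below `threshold`.
    Returns a list of (index_in_a, index_in_b, distance) tuples.
    """
    matches = []
    for ia, fa in enumerate(features_a):
        best_j, best_d = None, float("inf")
        for jb, fb in enumerate(features_b):
            d = math.dist(fa, fb)  # Euclidean distance between feature vectors
            if d < best_d:
                best_j, best_d = jb, d
        if best_j is not None and best_d < threshold:
            matches.append((ia, best_j, best_d))
    return matches
```

The exhaustive inner loop keeps the sketch short; for full frames a spatial index such as a k-d tree would be the usual replacement.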
      Step 4, based on the video frame features extracted in step 2, using a multi-layer RANSAC method to layer by visual depth the video frames with overlapping areas that are acquired at the same time by cameras at two different positions, separating objects at different visual depths into different layers.
      The homography matrix of each layer is calculated first. Let $F=\{(p_i, p_i')\}_{i=1}^{N}$ denote the set of matched feature point pairs, where $p_i$ and $p_i'$ are the $i$-th pair of feature points in the first and second video frames respectively and $N$ is the total number of matched feature point pairs. For each pair of feature points, an augmented matrix $A$ and a target vector $b$ are formed (formulas (1) and (2), whose images are missing from this text), where $(x_i, y_i)$ and $(x_i', y_i')$ are the coordinates of the matched points in the first and second video frames respectively.
      The homography matrix $H$ maps points in one video frame to points in the other, which matches video pictures taken from different angles. Its parameters are estimated by the least squares method:
      $h=\arg\min_{h}\|Ah-b\|$ (3)
       where $h$ is a 9-dimensional vector of the homography matrix parameters.
      The homography matrix $H$ is obtained by rearranging the vector $h$ obtained by the least squares method:
      $H=\begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}$ (4)
       where $h_1, h_2, \dots, h_9$ are the components of $h$.
      The matrix $H$ has 9 parameters and 8 degrees of freedom, and 8 pairs of feature points are used to solve it. For the first RANSAC layer, 8 different pairs of feature points are drawn from all feature point pairs a number of times. For each draw, a homography matrix is solved from the 8 drawn pairs and the sum of the residuals produced by mapping the other feature point pairs with that homography matrix is calculated. The homography matrix $H_1$ with the smallest residual sum is used for the first layer of visual depth layering: a threshold $T$ is set, and after all feature point pairs are mapped by $H_1$, the pairs whose residual is smaller than $T$ are regarded as inliers and the others as outliers. All inliers obtained are regarded as points in the first visual depth layer. Then 8 pairs of feature points are drawn a number of times from the outliers, a homography matrix is solved from each draw, and the residual sum over the other feature point pairs is calculated. The homography matrix $H_2$ with the smallest residual sum is used for the second layer of visual depth layering: after the remaining feature point pairs are mapped by $H_2$, the pairs whose residual is smaller than $T$ are regarded as inliers and as points in the second visual depth layer, and the others as outliers.
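The layer-peeling loop can be sketched independently of the model being fitted. To stay short and dependency-free, the example below plugs in a pure translation as a stand-in for the homography; that substitution, the function names, and the choice to sum residuals over all remaining pairs (rather than only the pairs outside the sample) are assumptions of this sketch. With a real homography solver, `fit` and `residual` would be replaced and `sample_size` set to 8 as in the text.

```python
import math
import random


def fit_translation(pairs):
    """Stand-in model: mean displacement between matched points."""
    n = len(pairs)
    dx = sum(q[0] - p[0] for p, q in pairs) / n
    dy = sum(q[1] - p[1] for p, q in pairs) / n
    return (dx, dy)


def translation_residual(model, pair):
    """Distance between the mapped source point and its matched target point."""
    (px, py), (qx, qy) = pair
    return math.hypot(px + model[0] - qx, py + model[1] - qy)


def layered_ransac(pairs, fit, residual, sample_size, threshold,
                   max_layers=3, iters=200, seed=0):
    """Peel off depth layers: fit on all pairs, then refit on the outliers."""
    rng = random.Random(seed)
    remaining = list(pairs)
    layers = []
    while len(remaining) >= sample_size and len(layers) < max_layers:
        best_model, best_cost = None, float("inf")
        for _ in range(iters):
            model = fit(rng.sample(remaining, sample_size))
            cost = sum(residual(model, pr) for pr in remaining)  # residual sum
            if cost < best_cost:
                best_model, best_cost = model, cost
        inliers = [pr for pr in remaining if residual(best_model, pr) < threshold]
        if not inliers:
            break
        layers.append((best_model, inliers))
        # The outliers of this layer become the input of the next layer.
        remaining = [pr for pr in remaining if residual(best_model, pr) >= threshold]
    return layers
```

Each returned entry holds the model of one visual depth layer and the point pairs assigned to it, in the order the layers were extracted.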
      Step 5, locally matching, layer by layer, the video frames with overlapping areas that are acquired at the same time by the cameras at the two different positions.
      The video frames acquired at the same time by the cameras at the two different positions are matched layer by layer and aligned separately, which avoids the object deformation caused by matching the images as a whole and gives an overall smooth transition. To simplify the matching calculation and reduce the computational complexity of the stitching method, the source video frame is divided into M×N grids, and during the matching of each layer the grids are matched locally and separately. The deformation mapping of a grid is selected according to the distance between the grid center point and its nearest feature point in each layer, and the grid $g_j$ is represented by its center point $c_j$. The homography matrix $\hat{H}_j$ of grid $g_j$ is obtained by combining the homography matrices $H_k$ of the $N_0$ layers with weights $w_j^k$:
      $\hat{H}_j=\sum_{k=1}^{N_0} w_j^k H_k,\qquad w_j^k=\dfrac{\exp\left(-\|c_j-\tilde{p}_k\|_2^2/\sigma^2\right)}{\sum_{l=1}^{N_0}\exp\left(-\|c_j-\tilde{p}_l\|_2^2/\sigma^2\right)}$
       where $H_k$ is the homography matrix of the $k$-th layer; $\tilde{p}_k$ is the feature point in layer $k$ closest to the grid center point $c_j$; $\|\cdot\|_2$ is the Euclidean norm, used to compute the distance between the feature point $\tilde{p}_k$ and the grid center point $c_j$; $\sigma$ is the standard deviation of the Gaussian function; the exponential term is a position-dependent Gaussian weight that converts distance into a weight value, so that closer positions receive higher weights and farther positions lower weights; and the fraction $w_j^k$ is the share of that Gaussian weight in the total.
      The argument of the exponential term is the square of the Euclidean distance between the two points divided by the variance of the Gaussian function. A smaller value indicates that the two points are closer, and the corresponding weight is larger.
      The homography matrix of each grid is calculated. Based on the grid homography matrix $\hat{H}_j$, the source video frame pixel at position $p'$ in grid $g_j$ is mapped by $p=\hat{H}_j p'$ to its target position $p$ in the target video frame, which completes the matching.
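The per-grid blending can be sketched as below. The weight follows the verbal description (squared Euclidean distance divided by the variance, normalized over the layers); since the formula image itself is not available, the exact form is an assumption, as are the function names and the fallback used when every weight underflows to zero.

```python
import math


def blend_homographies(center, layer_homographies, nearest_points, sigma):
    """Combine per-layer 3x3 homographies into one homography for a grid.

    center          -- (x, y) of the grid center point
    nearest_points  -- for each layer, the feature point closest to the center
    sigma           -- standard deviation of the Gaussian weighting
    """
    dists = [(center[0] - p[0]) ** 2 + (center[1] - p[1]) ** 2 for p in nearest_points]
    raw = [math.exp(-d / sigma ** 2) for d in dists]
    total = sum(raw)
    if total > 0:
        weights = [r / total for r in raw]
    else:
        # Assumption: if all weights underflow, use only the closest layer.
        nearest = dists.index(min(dists))
        weights = [1.0 if k == nearest else 0.0 for k in range(len(dists))]
    blended = [[sum(w * h[r][c] for w, h in zip(weights, layer_homographies))
                for c in range(3)] for r in range(3)]
    return blended, weights


def apply_homography(h, point):
    """Map a source pixel position to the target frame (homogeneous divide)."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)
```

When the nearest feature points of two layers are equally far from the grid center, both layers contribute equally; as one point moves closer, its layer dominates the blended mapping.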
      The corresponding relation between the pixel positions of the source video frame and the target video frame is stored in a pixel mapping table, and when the video frames are matched, the transformed video frames can be obtained by searching corresponding pixels in the source video frames instead of repeatedly transforming, so that calculation is accelerated.
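A pixel mapping table of this kind can be sketched as a dictionary built once and reused for every frame. The table layout (target position to source position), the rounding to integer pixels, and the function names are assumptions made for illustration.

```python
def build_mapping_table(width, height, map_fn):
    """Precompute target -> source pixel positions for one camera.

    map_fn(x, y) returns the (possibly fractional) target position of the
    source pixel (x, y); positions are rounded to the nearest pixel.
    """
    table = {}
    for y in range(height):
        for x in range(width):
            tx, ty = map_fn(x, y)
            table[(round(tx), round(ty))] = (x, y)
    return table


def warp_with_table(frame, table, out_width, out_height, fill=0):
    """Produce the transformed frame by table lookup only (no re-transforming)."""
    out = [[fill] * out_width for _ in range(out_height)]
    for (tx, ty), (sx, sy) in table.items():
        if 0 <= tx < out_width and 0 <= ty < out_height:
            out[ty][tx] = frame[sy][sx]
    return out
```

Because the cameras are fixed, the table depends only on the grid homographies, so each new frame costs one lookup per pixel.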
      Step 6, the video frames with overlapping areas, acquired at the same moment by two cameras at different positions, are spliced layer by layer using a minimum-energy stitching line search method, and the spliced video frames of all layers are fused to complete the splicing of the two video frames.
      The object energy representation (formulas (8)-(10)) is calculated for the overlapping area of the two video frames, acquired by cameras at different positions, in each layer. A foreground object has higher importance and therefore higher energy, so the stitching line should bypass its pixels. Otherwise the pixels belong to the background, through which the stitching line may pass, so that object motion does not produce ghosting.
      For the energy of foreground objects, the instance segmentation result of the foreground objects can be represented by a binary map S_t(i, j), which is the union of the pixels contained in any foreground object detected in the video frames of the different viewing angles:
       wherein ∨ denotes the logical OR operation, and its operands are the instance segmentation results of the corresponding layer in the video frames of the different viewing angles. S_t(i, j) takes the value 1 when the foreground pixel lies in the overlapping area and 0 otherwise, namely:
       $$S_t(i,j)=\begin{cases}1, & (i,j)\in\Omega_0\\ 0, & \text{otherwise}\end{cases}$$
       where Ω_0 is the set of foreground object pixel positions belonging to the overlapping area.
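As a small illustration of the binary map S_t, the following Python sketch combines two binary instance masks with a logical OR and keeps only pixels inside the overlapping area. The mask layout (nested lists of 0/1) and the function name `foreground_union` are assumptions made for the example.

```python
def foreground_union(mask_a, mask_b, overlap):
    """Return S_t: 1 where either view marks foreground inside the overlap, else 0."""
    rows, cols = len(mask_a), len(mask_a[0])
    return [
        [1 if overlap[i][j] and (mask_a[i][j] or mask_b[i][j]) else 0
         for j in range(cols)]
        for i in range(rows)
    ]

mask_view_1 = [[1, 0, 0], [0, 0, 0]]
mask_view_2 = [[0, 1, 0], [0, 0, 1]]
overlap_area = [[0, 1, 1], [0, 1, 1]]  # first column lies outside the overlap
s_t = foreground_union(mask_view_1, mask_view_2, overlap_area)
```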
      For a given pixel position at a given moment, the energy map at that moment is computed by combining the energy map at the previous moment with the union of the pixels contained in foreground objects at the current moment, so that the energy map is temporally consistent.
      The energy map at pixel position (i, j) at time t is calculated as follows:
       In the formula, the two energy terms are the energy maps at pixel position (i, j) of the video frame at time t and at time t-1 respectively, S_t(i, j) is the union of the pixels contained in any foreground object detected in the video frames of the different viewing angles at time t, and the attenuation factor α determines the contribution of the energy map at the previous moment to the formation of the current energy map.
      Owing to its recursive structure, the temporally consistent energy map is in effect a weighted sum of the energies from time 0 to time t, in which the importance of each moment decays over time. In other words, the current energy map remembers objects that moved through previous frames without storing those frames, which reduces the memory requirement. To address the problem that long-remembered object energy may create ineffective seam boundaries, the contributions of the energy maps of previous frames are adjusted by the following formula:
       where τ is a threshold that defines the time window over which energy is accumulated. Moments whose energy value is less than τ are not considered, which prevents redundant information from too many past moments from interfering with the current energy map estimate.
      Specifically, τ is set from N* and α, where N* is the number of frames over which energy is retained and α is the attenuation factor. Adjusting the attenuation factor α controls the contribution of each frame to the output energy map.
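The recursive update can be sketched in Python as follows. The exact formulas are not recoverable from this text, so the sketch rests on two labeled assumptions: the recursion is taken to be E_t = S_t + α·E_{t-1}, and the threshold is taken to be τ = α^{N*}, with decayed energy below τ reset to zero. The function name `update_energy_map` is hypothetical.

```python
def update_energy_map(prev_energy, foreground, alpha, n_keep):
    """One temporal update of the energy map.

    Assumptions (not confirmed by the source): E_t = S_t + alpha * E_{t-1},
    and decayed energy below tau = alpha ** n_keep is dropped.
    """
    tau = alpha ** n_keep
    new_energy = []
    for prev_row, fg_row in zip(prev_energy, foreground):
        row = []
        for e_prev, s in zip(prev_row, fg_row):
            decayed = alpha * e_prev
            if decayed < tau:
                decayed = 0.0  # too old: outside the accumulation window
            row.append(s + decayed)
        new_energy.append(row)
    return new_energy

# A single pixel that is foreground at the first frame only.
energy = [[0.0]]
history = []
for s in [1, 0, 0, 0]:
    energy = update_energy_map(energy, [[s]], alpha=0.5, n_keep=2)
    history.append(energy[0][0])
```

Under these assumptions the pixel keeps a decaying trace of the object for N* = 2 frames after it leaves and is then cleared, while only the previous energy map is ever stored.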
      The midpoint of the first row of the overlapping area is taken as the starting point of the stitching line; the minimum-energy pixel is then searched row by row as the next splice point, and the splice points are connected to obtain the stitching line. If the midpoint of the first row is a foreground point, the non-foreground pixel with the smallest column offset from the midpoint is selected as the starting point instead. To avoid abrupt changes of the stitching line from row to row during path finding, the left energy map and the right energy map of the current splice point are compared first, and the side with the smaller energy map is selected. This ensures that the stitching line continues into the region below that has lower overall energy (that is, contains fewer foreground objects); more pixels are taken into account, so the line is not misled by locally low-energy pixels.
      The left energy map and the right energy map of the current splice point are calculated as follows:
       Wherein (x, y) is the position of the current splice point and (i, j) is a pixel position participating in the energy map judgment; the energy term is the energy map at pixel position (i, j) at time t; E_p(i, j) is the distance contribution coefficient at pixel position (i, j); H is the height of the overlapping area; ⌈·⌉ is the round-up operation; ψ(·) is an indicator function that takes the value 1 when its condition holds and 0 otherwise; and β_1 and β_2 are weight parameters;
       The side with the smaller energy map is selected, which ensures that the stitching line extends in the direction of least estimated energy. To reduce processing time, only 2p-1 pixels below the current splice point (x, y), on the side with the smaller energy map, are checked each time. The calculation can be implemented with a greedy search:
       Where (x*, y*) denotes the pixel with the lowest energy, and the energy term is the energy map at position (x+1, k) at time t. When the left energy map is smaller than the right energy map, k ∈ {y-p, y-p+1, y}; when the two are equal, k ∈ {y-p+1, y, y+p-1}; when the left energy map is larger than the right energy map, k ∈ {y, y+p-1, y+p}. p is the set offset with p ≥ 2; in this embodiment p = 2.
      The searched minimum-energy pixel is assigned as the splice point only when it is unique and does not belong to a foreground object. If the minimum-energy pixel does not belong to a foreground object but is not unique, the pixel with the smallest column offset from the current splice point is selected as the splice point. If the minimum-energy pixel is unique but belongs to a foreground object, the lowest-energy pixel among the non-foreground pixels is selected as the splice point.
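The row-by-row search can be sketched in Python as follows. The candidate columns and the three selection rules follow the description above with p = 2. Two parts are simplifications introduced for the example: `side_energy` compares plain sums of the energy to the left and right of the current column in the rows below (the formulas with E_p, ψ, β_1 and β_2 are not reproduced), and when every candidate is foreground the lowest-energy candidate is used. All function names are hypothetical.

```python
def candidate_columns(y, p, side):
    """Columns examined in the next row, depending on the lower-energy side."""
    if side == "left":
        return [y - p, y - p + 1, y]
    if side == "right":
        return [y, y + p - 1, y + p]
    return [y - p + 1, y, y + p - 1]

def choose_next_column(energy_row, fg_row, y, p=2, side="equal"):
    """Pick the next splice point: prefer non-foreground pixels, then lowest
    energy, then smallest column offset from the current splice point."""
    width = len(energy_row)
    cands = [k for k in candidate_columns(y, p, side) if 0 <= k < width]
    non_fg = [k for k in cands if not fg_row[k]]
    pool = non_fg if non_fg else cands  # fallback is an assumption
    return min(pool, key=lambda k: (energy_row[k], abs(k - y), k))

def side_energy(energy, x, y):
    """Simplified stand-in for the left/right energy maps (an assumption)."""
    below = energy[x + 1:]
    left = sum(sum(row[:y]) for row in below)
    right = sum(sum(row[y + 1:]) for row in below)
    if left < right:
        return "left"
    if right < left:
        return "right"
    return "equal"

def find_seam(energy, foreground, p=2):
    """Return the seam column for each row of the overlapping area."""
    width = len(energy[0])
    mid = width // 2
    start_pool = [k for k in range(width) if not foreground[0][k]] or [mid]
    y = min(start_pool, key=lambda k: (abs(k - mid), k))
    seam = [y]
    for x in range(len(energy) - 1):
        side = side_energy(energy, x, y)
        y = choose_next_column(energy[x + 1], foreground[x + 1], y, p, side)
        seam.append(y)
    return seam

# A foreground object occupies the right part of rows 1 and 2.
fg = [[0, 0, 0, 0, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 1]]
seam = find_seam([[float(v) for v in row] for row in fg], fg)
```

Sorting candidates by the tuple (energy, column offset) after removing foreground pixels covers the three stated cases in one rule: a unique non-foreground minimum wins outright, ties go to the nearest column, and a foreground minimum is replaced by the best non-foreground candidate.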
      As shown in FIG. 2, each grid cell represents a pixel and the gray shaded cells are pixels of a foreground object. The red dashed box frames the left energy map of the current splice point (the black dot in the first row), and the green dashed box frames its right energy map. In this embodiment the left energy map is smaller than the right energy map, so the three dots within the blue box below the current splice point are compared. Since the white dot directly below the current splice point belongs to a foreground object, the pixel with the smaller energy of the two dots to its left is selected as the splice point; in this embodiment that is the black dot.
      Fig. 3 shows a splice path calculated by the proposed splice line finding algorithm of the present invention.
      Step 7, the splicing operation of steps 4-6 is performed on the video frames shot by all cameras to complete the splicing of the multi-view monitoring video and obtain the panoramic video.
      The splicing operation of steps 4-6 is performed on the video frames shot by all cameras: two video frames that have already been spliced are regarded as a single frame, and the splicing operation of steps 4-6 is performed again on this frame and another video frame with which it shares an overlapping area, until the splicing of the multi-view monitoring video is complete and the panoramic video is obtained.
      Example 2
      Based on the same inventive concept, the invention also provides a multi-view monitoring video splicing system based on energy map estimation, which comprises a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the program instructions in the memory to execute the multi-view monitoring video splicing method based on the energy map estimation.
      Example 3
      Based on the same inventive concept, the invention also provides a multi-view monitoring video splicing system based on energy map estimation, which comprises a readable storage medium, wherein the readable storage medium is stored with a computer program, and the computer program realizes the multi-view monitoring video splicing method based on energy map estimation when being executed.
      In particular, those skilled in the art may implement the method of the technical solution of the present invention as an automated operation flow using computer software technology. System apparatus for implementing the method, such as a computer-readable storage medium storing the corresponding computer program, or a computer device that runs the corresponding computer program, should also fall within the protection scope of the present invention.
      The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.
    Claims (9)
1. The multi-view monitoring video stitching method based on energy map estimation is characterized by comprising the following steps of:
       step 1, obtaining video data and camera positions of a plurality of monitoring cameras of a road section to be spliced; 
       step 2, extracting video frame characteristics of each camera video by using a characteristic extraction algorithm, carrying out background modeling on the video frame by using a background modeling algorithm, and dividing pixels in the video frame into background pixels or foreground pixels; 
       Step 3, matching background pixels in the video frames based on nearest neighbor searching, and completing background matching of the video frames shot by different cameras; 
       Step 4, based on the video frame characteristics extracted in the step 2, using a multilayer RANSAC method to carry out visual depth layering on video frames with overlapping areas, which are acquired by cameras at two different positions at the same time, and separating objects at different visual depths into different layers; 
       Step 5, locally matching, layer by layer, the video frames with overlapping areas acquired at the same moment by the two cameras at different positions; 
       Step 6, splicing video frames with overlapping areas, which are acquired by two cameras at different positions at the same time, layer by adopting a minimum energy-based stitching line searching method, and fusing the spliced video frames of all layers to finish the splicing of the two video frames; 
       the midpoint of the first row of the overlapping area is taken as the starting point of the stitching line, the minimum-energy pixel is searched row by row as the splice point, and the splice points are connected to obtain the stitching line; if the midpoint of the first row is a foreground point, the non-foreground pixel with the smallest column offset from the midpoint is selected as the starting point of the stitching line; during stitching line path finding, the left energy map and the right energy map of the current splice point are compared first, the two energy maps being calculated as follows:
       (12)
       (13)
       (14)
       In the formula, (x, y) is the position of the current splice point; (i, j) is a pixel position participating in the energy map judgment; the energy term is the energy map at pixel position (i, j) at time t; E_p(i, j) is the distance contribution coefficient at pixel position (i, j); H is the height of the overlapping area; ⌈·⌉ is the round-up operation; ψ(·) is an indicator function that takes the value 1 when its condition holds and 0 otherwise; β_1 and β_2 are weight parameters;
       the side with the smaller energy map is selected, ensuring that the stitching line extends in the direction of least estimated energy; to reduce processing time, only the 2p-1 pixels below the current splice point (x, y) on the side with the smaller energy map are checked each time, the calculation being implemented with a greedy search:
       (15)
       In the formula, (x*, y*) denotes the pixel with the lowest energy, and the energy term is the energy map at position (x+1, k) at time t; when the left energy map is smaller than the right energy map, k ∈ {y-p, y-p+1, y}; when the two are equal, k ∈ {y-p+1, y, y+p-1}; when the left energy map is larger than the right energy map, k ∈ {y, y+p-1, y+p}; p is the set offset, and p ≥ 2;
      the searched minimum-energy pixel is assigned as the splice point only when it is unique and does not belong to a foreground object; if the minimum-energy pixel does not belong to a foreground object but is not unique, the pixel with the smallest column offset from the current splice point is selected as the splice point; if the minimum-energy pixel is unique but belongs to a foreground object, the lowest-energy pixel among the non-foreground pixels is selected as the splice point;
       Step 7, performing the splicing operation of steps 4-6 on the video frames shot by all cameras to complete the splicing of the multi-view monitoring video and obtain the panoramic video. 
    2. The multi-view monitoring video stitching method based on energy map estimation according to claim 1, characterized in that in step 2, a feature extraction algorithm is used to extract the video frame features of each camera video, including the color features and gray values of the pixels in the video frames, and a Gaussian mixture model (GMM) is used to perform background modeling on the video frames; the extracted video frame features of all cameras are organized into a data set and divided into a training set and a test set, the training set being used to train the GMM and the test set being used to test its training effect; the parameters of the GMM, namely the mean, the variance and the weight of each Gaussian component, are initialized randomly; the parameters of the model are updated iteratively with the expectation-maximization algorithm to maximize the likelihood function; the trained GMM contains the Gaussian component parameters of each feature dimension; the extracted video frame features are input into the trained GMM, the probability that each pixel in the video frame belongs to the background is calculated, and each pixel is classified as a background pixel or a foreground pixel according to that probability.
    3. The multi-view monitoring video stitching method based on energy map estimation as set forth in claim 1, wherein in step 3, euclidean distance between each background pixel in a video frame and background pixels of other video frames is calculated, if Euclidean distance between background pixels is smaller than a set threshold, the two background pixels are considered to be matched, and the calculation and judgment operations are sequentially executed on all video frames to complete the background matching of all video frames.
    4. The multi-view monitoring video stitching method based on energy map estimation as set forth in claim 1, wherein in step 4 a homography matrix is first calculated for each layer: let F denote the set of matched feature point pairs, whose i-th pair consists of a feature point in the first video frame and the corresponding feature point in the second video frame, and let N be the total number of matched feature point pairs; for each pair of feature points, an augmented matrix A and a target vector b are formed, specifically expressed as:
       (1)
       (2)
       In the formula, the coordinates are those of the points of the matched feature point pair in the first video frame and in the second video frame respectively;
       the homography matrix H maps points in one video frame to points in the other video frame, thereby matching video pictures taken from different angles; the parameters of the homography matrix are estimated by the least squares method, calculated as follows: 
       (3)
       wherein h is a 9-dimensional vector representing parameters of the homography matrix; 
       the homography matrix H is obtained by rearranging the vector h obtained by the least squares method: 
       (4)
       In the formula, the nine entries of H are the components of the vector h;
       for the first layer of RANSAC, 8 different pairs of feature points are drawn from all feature point pairs multiple times; a homography matrix is solved from the 8 pairs drawn each time, and the sum of the residuals produced by mapping the remaining feature point pairs with that homography matrix is calculated; the homography matrix with the minimum residual sum is used for the first-layer visual depth layering: a threshold T is set, all feature point pairs are mapped with this homography matrix, pairs whose residual is smaller than T are regarded as inner points and the others as outer points, and all inner points obtained are taken as the points of the first visual depth layer; 8 different pairs of feature points are then drawn from all outer points multiple times, a homography matrix is solved from the 8 pairs drawn each time, the sum of the residuals produced by mapping the remaining feature points with that homography matrix is calculated, and the homography matrix with the minimum residual sum is used for the second-layer visual depth layering: after all remaining feature points are mapped with this homography matrix, those whose residual is smaller than T are regarded as inner points and taken as the points of the second visual depth layer; these steps are repeated until no outer points remain, which completes the visual depth layering.
    5. The multi-view monitoring video stitching method based on energy map estimation as set forth in claim 1, wherein in step 5, the video frames with overlapping areas acquired at the same moment by two cameras at different positions are matched and aligned layer by layer; the source video frame is divided into grids, and when each layer is matched, local matching is performed on the different grids separately; the deformation mapping of each grid is selected according to the distances between the grid center point and its nearest-neighbor feature points in the different layers; grid g_j is represented by its center point c_j, and the homography matrix of grid g_j is obtained by combining the homography matrices H_k of the individual layers using weights, calculated as follows:
       (5)
       (6)
       (7)
       In the formula, H_k represents the homography matrix of the k-th layer; the nearest-neighbor term denotes the feature point in layer k that is closest to the grid center point c_j; ‖·‖_2 denotes the Euclidean norm, used to compute the distance between that feature point and the grid center point c_j; σ denotes the standard deviation of the Gaussian function; the Gaussian weight is position-dependent and converts distance into a weight value, so that closer positions receive higher weights and farther positions receive lower weights; the normalized weight represents the share of each layer's Gaussian weight in the total; 
       the exponential term in the formula is the squared Euclidean distance between the two points divided by the variance of the Gaussian function; the smaller this value, the closer the two points are and the greater the corresponding weight;
       the homography matrix of each grid is calculated, and based on the homography matrix of the grid, the source video frame pixel at position p' in grid g_j is mapped to its target position p in the target video frame, which completes the matching.
    6. The multi-view monitoring video stitching method based on energy map estimation according to claim 1, wherein the energy map in step 6 has a calculation formula as follows:
       (8)
       In the formula, the two energy terms are the energy maps at pixel position (i, j) of the video frame at time t and at time t-1 respectively; the attenuation factor α determines the contribution of the energy map at the previous moment to the formation of the current energy map; S_t(i, j) is the union of the pixels contained in any foreground object detected in the video frames of the different viewing angles at time t, namely:
       (9)
       wherein ∨ denotes the logical OR operation, and its operands are the instance segmentation results of the corresponding layer in the video frames of the different viewing angles; S_t(i, j) takes the value 1 when located in the overlapping area and 0 otherwise, namely:
       (10)
       In the formula, Ω_0 is the set of foreground object pixel positions belonging to the overlapping area;
       to address the problem that long-remembered object energy may create ineffective seam boundaries, the contributions of the energy maps of previous frames are adjusted by the following formula: 
       (11)
       In the formula, τ is a threshold that defines the time window over which energy is accumulated; moments whose energy value is less than τ are not considered, which prevents redundant information from too many past moments from interfering with the current energy map estimate; τ is set from N* and α, where N* is the number of frames over which energy is retained and α is the attenuation factor, and adjusting the attenuation factor α controls the contribution of each frame to the output energy map.
    7. The multi-view monitoring video stitching method based on energy map estimation as set forth in claim 1, wherein in step 7, the splicing operation of steps 4-6 is performed on the video frames shot by all cameras: two video frames that have already been spliced are regarded as a single frame, and the splicing operation of steps 4-6 is performed again on this frame and another video frame with which it shares an overlapping area, until the splicing of the multi-view monitoring video is complete and the panoramic video is obtained.
    8. A multi-view monitoring video splicing system based on energy map estimation, comprising a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the program instructions in the memory to execute a multi-view monitoring video splicing method based on energy map estimation as claimed in any one of claims 1-7.
    9. A multi-view monitoring video splicing system based on energy map estimation, characterized by comprising a readable storage medium, wherein the readable storage medium is stored with a computer program, and when the computer program is executed, the method for splicing multi-view monitoring video based on energy map estimation is realized according to any one of claims 1-7.
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202411555475.7A CN119629304B (en) | 2024-11-04 | 2024-11-04 | A multi-view surveillance video splicing method and system based on energy map estimation | 
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202411555475.7A CN119629304B (en) | 2024-11-04 | 2024-11-04 | A multi-view surveillance video splicing method and system based on energy map estimation | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN119629304A (en) | 2025-03-14 | 
| CN119629304B (en) | 2025-09-16 | 
Family
ID=94905498
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN202411555475.7A Active CN119629304B (en) | 2024-11-04 | 2024-11-04 | A multi-view surveillance video splicing method and system based on energy map estimation | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN119629304B (en) | 
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN102426705A (en) * | 2011-09-30 | 2012-04-25 | 北京航空航天大学 | Behavior splicing method of video scene | 
| CN111800609A (en) * | 2020-06-29 | 2020-10-20 | 中国矿业大学 | Mine roadway video stitching method based on multi-plane and multi-sensing sutures | 
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US7747107B2 (en) * | 2007-03-06 | 2010-06-29 | Mitsubishi Electric Research Laboratories, Inc. | Method for retargeting images | 
| US9185284B2 (en) * | 2013-09-06 | 2015-11-10 | Qualcomm Incorporated | Interactive image composition | 
| CN113793382B (en) * | 2021-08-04 | 2024-10-18 | 北京旷视科技有限公司 | Video image seam search method, video image splicing method and device | 
| CN116132729A (en) * | 2022-12-30 | 2023-05-16 | 电子科技大学 | Panoramic video stitching method and system for landslide monitoring | 
| CN117336620B (en) * | 2023-11-24 | 2024-02-09 | 北京智汇云舟科技有限公司 | Adaptive video stitching method and system based on deep learning | 
Also Published As
| Publication number | Publication date | 
|---|---|
| CN119629304A (en) | 2025-03-14 | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |