WO2018109265A1 - Method and technical equipment for encoding media content - Google Patents
Method and technical equipment for encoding media content
- Publication number
- WO2018109265A1 (PCT/FI2017/050846)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- sparse voxel
- voxel octree
- octree
- frames
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/005—Tree description, e.g. octree, quadtree
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/001—Model-based coding, e.g. wire frame
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/40—Tree coding, e.g. quadtree, octree
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/204—Image signal generators using stereoscopic image cameras
- H04N13/243—Image signal generators using stereoscopic image cameras using three or more 2D image sensors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/61—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
- H04N19/96—Tree coding, e.g. quad-tree coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
Definitions
- the present solution generally relates to video encoding.
- the solution relates to volumetric encoding and virtual reality (VR).
- new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all axes).
- new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being "immersed" into the scene captured by the 360 degrees camera.
- the new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
- a method comprising receiving a video data sequence comprising volumetric frames; generating a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frames: generating a frame sparse voxel octree for a frame that is currently encoded; comparing the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; assigning an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and producing a frame change set for the frame; producing an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets.
- the method comprises selecting a reference frame from the volumetric frames, wherein the reference sparse voxel octree is generated based on said reference frame.
- the method further comprises receiving the video data sequence from a multicamera device.
- a change relates to one or more of the following: deletion of a node, addition of a node, or a change of the content of a node.
- the frame change set comprises a frame number within the video data sequence and a set of identifications associated with a sparse voxel subtree.
- an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: to receive a video data sequence comprising volumetric frames; to generate a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frames: to generate a frame sparse voxel octree for a frame that is currently encoded; to compare the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; to assign an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; to produce a frame change set for the frame; and to produce an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets.
- the reference sparse voxel octree is combined from more than one volumetric frame of the video sequence.
- the apparatus further comprises computer program code to cause the apparatus to select a reference frame from the volumetric frames, wherein the reference sparse voxel octree is generated based on said reference frame.
- the apparatus further comprises computer program code to cause the apparatus to receive the video data sequence from a multicamera device.
- a change relates to one or more of the following: deletion of a node, addition of a node, or a change of the content of a node.
- the frame change set comprises a frame number within the video data sequence and a set of identifications associated with a sparse voxel subtree.
- a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a video data sequence comprising volumetric frames; generate a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frames: generate a frame sparse voxel octree for a frame that is currently encoded; compare the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; assign an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; produce a frame change set for the frame; and produce an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets.
- Fig. 1 shows a system and apparatuses for stereo viewing
- Fig. 2a shows a camera device for stereo viewing
- Fig. 2b shows a head-mounted display for stereo viewing
- Fig. 3 shows a camera according to an embodiment
- Figs. 4a, 4b show examples of a multicamera capturing device
- Figs. 5a, 5b show an encoder and a decoder according to an embodiment
- Fig. 6 illustrates an example of processing steps of manipulating volumetric video data
- Fig. 7 shows an example of a volumetric video pipeline
- Fig. 8 shows an example of comparison of volumetric frame octrees as simplified to a quadtree representation
- Fig. 9 shows an example of an output data of an encoder
- Fig. 10 is a flowchart of a method according to an embodiment; and Figs. 11a and 11b show an apparatus according to an embodiment.
- the present embodiments relate to real-time computer graphics and virtual reality (VR).
- Volumetric video may be captured using one or more 3D cameras. Volumetric video is to virtual reality what traditional video is to 2D/3D displays. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
- a multicamera device comprises two or more cameras, wherein the two or more cameras may be arranged in pairs in said multicamera device. Each said camera has a respective field of view, and each said field of view covers the view direction of the multicamera device.
- the multicamera device may comprise cameras at locations corresponding to at least some of the eye positions of a human head at normal anatomical posture, eye positions of the human head at maximum flexion anatomical posture, eye positions of the human head at maximum extension anatomical postures, and/or eye positions of the human head at maximum left and right rotation anatomical postures.
- the multicamera device may comprise at least three cameras, the cameras being disposed such that their optical axes in the direction of the respective camera's field of view fall within a hemispheric field of view, the multicamera device comprising no cameras having their optical axes outside the hemispheric field of view, and the multicamera device having a total field of view covering a full sphere.
- the multicamera device described here may have cameras with wide-angle lenses.
- the multicamera device may be suitable for creating stereo viewing image data and/or multiview video, comprising a plurality of video sequences for the plurality of cameras.
- the multicamera device may be such that any pair of cameras of the at least two cameras has a parallax corresponding to the parallax (disparity) of human eyes for creating a stereo image.
- At least two cameras may have overlapping fields of view such that an overlap region for which every part is captured by said at least two cameras is defined, and such overlap area can be used in forming the image for stereo viewing.
- Fig. 1 shows a system and apparatuses for stereo viewing, that is, for 3D video and 3D audio digital capture and playback.
- the task of the system is that of capturing sufficient visual and auditory information from a specific location such that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations and optionally at a time later in the future.
- Such reproduction requires more information than can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears.
- two camera sources are used.
- at least two microphones are used (the commonly known stereo sound is created by recording two audio channels).
- the human auditory system can detect the cues, e.g. timing and phase differences of the recorded audio channels, to determine the direction of sound.
- a video capture device SRC1 comprises multiple cameras CAM1, CAM2, ..., CAMN with overlapping fields of view so that regions of the view around the video capture device are captured from at least two cameras.
- the device SRC1 may comprise multiple microphones to capture the timing and phase differences of audio originating from different directions.
- the device SRC1 may comprise a high resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras can be detected and recorded.
- the device SRC1 comprises or is functionally connected to a computer processor PROC1 and memory MEM1, the memory comprising computer program PROGR1 code for controlling the video capture device.
- the image stream captured by the video capture device may be stored on a memory device MEM2 for use in another device, e.g. a viewer, and/or transmitted to a server using a communication interface COMM1 .
- one or more sources SRC2 of synthetic images may be present in the system.
- Such sources of synthetic images may use a computer model of a virtual world to compute the various image streams it transmits.
- the source SRC2 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position.
- the viewer may see a three-dimensional virtual world.
- the device SRC2 comprises or is functionally connected to a computer processor PROC2 and memory MEM2, the memory comprising computer program PROGR2 code for controlling the synthetic sources device SRC2.
- the image stream captured by the device may be stored on a memory device MEM5 (e.g. memory card CARD1) for use in another device, e.g. a viewer, or transmitted to a server or the viewer using a communication interface COMM2.
- the device SERVER comprises or is functionally connected to a computer processor PROC3 and memory MEM3, the memory comprising computer program PROGR3 code for controlling the server.
- the device SERVER may be connected by a wired or wireless network connection, or both, to sources SRC1 and/or SRC2, as well as the viewer devices VIEWER1 and VIEWER2 over the communication interface COMM3.
- the devices may comprise or be functionally connected to a computer processor PROC4 and memory MEM4, the memory comprising computer program PROG4 code for controlling the viewing devices.
- the viewer (playback) devices may consist of a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through communications interface COMM4, or from a memory device MEM6 like a memory card CARD2.
- the viewer devices may have a graphics processing unit for processing of the data to a suitable format for viewing.
- the viewer VIEWER1 comprises a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence.
- the head-mounted display may have an orientation sensor DET1 and stereo audio headphones.
- the viewer VIEWER2 comprises a display enabled with 3D technology (for displaying stereo video), and the rendering device may have a head-orientation detector DET2 connected to it.
- the viewer VIEWER2 may comprise a 2D display, since the volumetric video rendering can be done in 2D by rendering the viewpoint from a single eye instead of a stereo eye pair.
- Any of the devices (SRC1, SRC2, SERVER, RENDERER, VIEWER1, VIEWER2) may be a computer or a portable computing device, or be connected to such.
- Such rendering devices may have computer program code for carrying out methods according to various examples described in this text.
- Fig. 2a shows a camera device 200 for stereo viewing.
- the camera comprises two or more cameras that are configured into camera pairs 201 for creating the left and right eye images, or that can be arranged to such pairs.
- the distances between cameras may correspond to the usual (or average) distance between the human eyes.
- the cameras may be arranged so that they have significant overlap in their fields of view. For example, wide-angle lenses of 180 degrees or more may be used, and there may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, or 20 cameras.
- the cameras may be regularly or irregularly spaced to access the whole sphere of view, or they may cover only part of the whole sphere.
- a camera device with all cameras in one hemisphere may be used.
- the number of cameras may be e.g., 2, 3, 4, 6, 8, 12, or more.
- the cameras may be placed to create a central field of view where stereo images can be formed from image data of two or more cameras, and a peripheral (extreme) field of view where one camera covers the scene and only a normal non-stereo image can be formed.
- Fig. 2b shows a head-mounted display (HMD) for stereo viewing.
- the head-mounted display comprises two screen sections or two screens DISP1 and DISP2 for displaying the left and right eye images.
- the displays are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes' field of view.
- the device is attached to the head of the user so that it stays in place even when the user turns his head.
- the device may have an orientation detecting module ORDET1 for determining the head movements and direction of the head.
- the head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.
- Fig. 3 illustrates a camera CAM1.
- the camera has a camera detector CAMDET1, comprising a plurality of sensor elements for sensing the intensity of the light hitting the sensor elements.
- the camera has a lens OBJ1 (or a lens arrangement of a plurality of lenses), the lens being positioned so that the light hitting the sensor elements travels through the lens to the sensor elements.
- the camera detector CAMDET1 has a nominal center point CP1 that is a middle point of the plurality of sensor elements, for example for a rectangular sensor the crossing point of the diagonals.
- the lens has a nominal center point PP1 as well, lying for example on the axis of symmetry of the lens.
- the direction of orientation of the camera is defined by the line passing through the center point CP1 of the camera sensor and the center point PP1 of the lens.
- the direction of the camera is a vector along this line pointing in the direction from the camera sensor to the lens.
- the optical axis of the camera is understood to be this line CP1-PP1.
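- As a minimal illustration of the optical-axis definition above, the camera direction can be computed as the unit vector from the sensor center point CP1 towards the lens center point PP1. This sketch is not part of the patent text; the Vec3 type and function name are assumptions made for illustration:

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

// Direction of the camera: the unit vector along the line CP1-PP1,
// pointing from the center of the camera sensor towards the lens.
Vec3 cameraDirection(const Vec3& cp1, const Vec3& pp1) {
    Vec3 d{pp1.x - cp1.x, pp1.y - cp1.y, pp1.z - cp1.z};
    const double len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    return {d.x / len, d.y / len, d.z / len};
}
```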
- Time-synchronized video, audio and orientation data is first recorded with the capture device. This can consist of multiple concurrent video and audio streams as described above. These are then transmitted immediately or later to the storage and processing network for processing and conversion into a format suitable for subsequent delivery to playback devices. The conversion can involve post-processing steps to the audio and video data in order to improve the quality and/or reduce the quantity of the data while preserving the quality at a desired level.
- each playback device receives a stream of the data from the network, and renders it into a stereo viewing reproduction of the original location which can be experienced by a user with the head-mounted display and headphones.
- Figs. 4a and 4b show an example of a camera device for being used as a source for media content, such as images and/or video.
- to create a video panorama, these images need to be shot simultaneously to keep the eyes in sync with each other.
- since one camera cannot physically cover the whole 360 degree view, at least without being obscured by another camera, multiple cameras are needed to form the whole 360 degree panorama. Additional cameras however increase the cost and size of the system and add more data streams to be processed. This problem becomes even more significant when mounting cameras on a sphere or platonic solid shaped arrangement to get more vertical field of view.
- the camera pairs will not achieve free angle parallax between the eye views.
- the parallax between eyes is fixed to the positions of the individual cameras in a pair, that is, in the direction perpendicular to the camera pair, no parallax can be achieved. This is problematic when the stereo content is viewed with a head-mounted display that allows free rotation of the viewing angle around the z-axis as well.
- a video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form.
- the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).
- An example of an encoding process is illustrated in Figure 5a.
- Figure 5a illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F).
- An example of a decoding process is illustrated in Figure 5b.
- Figure 5b illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
- Figure 6 demonstrates an example of processing steps of manipulating volumetric video data, starting from raw camera frames (from various locations within the world) and ending with a frame rendered at a freely-selected 3D viewpoint.
- the starting point 610 is media content obtained from one or more camera devices.
- the media content may comprise raw camera frame images, depth maps, and camera 3D positions.
- the recorded media content, i.e. image data, is used to construct an animated 3D model 620 of the world.
- the viewer is then freely able to choose his/her position and orientation within the world when the volumetric video is being played back 630.
- "A sparse voxel octree" is a central data structure to which the present embodiments are based.
- a voxel of a three-dimensional world corresponds to a pixel of a two-dimensional world. Voxels exist in a 3D grid layout.
- An octree is a tree data structure used to partition a three-dimensional space. Octrees are the three-dimensional analog of quadtrees.
- a sparse voxel octree describes a volume of a space containing a set of solid voxels of varying sizes. Empty areas within the volume are absent from the tree, which is why it is called "sparse".
- a volumetric video frame is a complete sparse voxel octree that models the world at a specific point in time in a video sequence.
- Voxel attributes contain information like color, opacity, surface normal vectors, and surface material properties. These are referenced in the sparse voxel octrees (e.g. color of a solid voxel), but can also be stored separately.
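- For concreteness, a sparse voxel octree of this kind could be sketched with the following data structure. This is an illustrative assumption, not the patent's actual storage format (which, as noted later in this text, may also store parent addresses); empty octants are simply null, which is what makes the tree sparse:

```cpp
#include <array>
#include <cstdint>
#include <memory>

// Voxel attributes as listed above: color, opacity, surface normal,
// and surface material properties.
struct VoxelAttributes {
    std::array<std::uint8_t, 4> rgba;    // color and opacity
    std::array<float, 3>        normal;  // surface normal vector
    std::uint16_t               material;
};

// One node of a sparse voxel octree. Null children represent empty space,
// so empty areas within the volume cost nothing to store.
struct SvoNode {
    VoxelAttributes attributes;   // mipmapped summary of the node's whole subtree
    std::uint32_t locationId = 0; // 0 (default): contents never change in the sequence
    std::array<std::unique_ptr<SvoNode>, 8> child; // eight octants, mostly null
};
```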
- Mipmaps are pre-calculated, optimized sequences of images, each of which is a progressively lower resolution representation of the same image.
- the height and width of each image, or level, in the mipmap is a power of two smaller than the previous level. They are intended to increase rendering speed and reduce aliasing artifacts.
- each level in the octree can be considered a 3D mipmap of the next lower level.
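- Under the assumptions of the sketch above, the 3D-mipmapped attribute of a node can be produced by averaging the attributes of its existing children, so that each octree level summarizes the level below it (only the color/opacity channel is averaged here for brevity):

```cpp
// Compute a node's mipmapped color as the average of its non-empty children.
// Applied bottom-up, this makes every octree level a 3D mip level of the
// next lower (finer) level.
VoxelAttributes mipmapAttributes(const SvoNode& node) {
    unsigned sum[4] = {0, 0, 0, 0};
    unsigned count = 0;
    for (const auto& c : node.child) {
        if (!c) continue;  // sparse: skip empty octants
        for (int i = 0; i < 4; ++i) sum[i] += c->attributes.rgba[i];
        ++count;
    }
    VoxelAttributes result = node.attributes;
    if (count > 0)
        for (int i = 0; i < 4; ++i)
            result.rgba[i] = static_cast<std::uint8_t>(sum[i] / count);
    return result;
}
```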
- each frame may produce several hundred megabytes or several gigabytes of voxel data which needs to be converted to a format that can be streamed to the viewer, and rendered in real-time.
- the amount of data depends on the world complexity and the number of cameras. The larger impact comes in a multi-device recording setup with a number of separate locations where the cameras are recording. Such a setup produces more information than a camera at a single location.
- the present embodiments are targeted to the problem of detecting and encoding changes that occur in a volumetric video sequence, thereby reducing the amount of data to transfer to the viewer.
- the encoding also needs to be done in a particular way to enable later rendering the video efficiently on a GPU (Graphics Processing Unit).
- the compared nodes are selected nodes of the frames' sparse voxel octrees. Due to sparseness, nodes may also be absent in either of the volumetric frames.
- the changed, added, or deleted node locations are given unique IDs within the video sequence, and the node subtrees are written as patches to a reference volumetric frame.
- Figure 7 illustrates an example of a volumetric video pipeline.
- the present embodiments are targeted to the "Change Detection" and "Frame Change Sets" stages of Voxel Encoding 740 in the pipeline.
- multiple cameras 710 capture video data of the world, which video data is input 720 to the pipeline.
- the video data comprises camera frames, positions and depth maps 730 which are transmitted to the Voxel Encoding 740.
- a volumetric reference frame may be chosen for each sequence.
- the reference frame can be the first frame in the sequence, or the reference frame can be any one of the other frames in the sequence. Alternatively, the reference frame can be combined from more than one volumetric frame of the video sequence.
- the encoder is configured to produce a sparse voxel octree for the sequence's volumetric reference frame, and the volumetric frame currently being encoded. Alternatively, the sparse voxel octree may be generated without any reference frame, but using statistics or any other available data.
- Figure 8 illustrates an example of how the volumetric frame octrees are compared (shown as quadtrees in Fig. 8 instead of octrees).
- in Figure 8 there are an initial octree 810, i.e. a reference frame, and a frame octree 820, i.e. a current frame.
- the gray nodes G are sparse (blank space).
- the selection of the nodes to be compared depends on the captured world contents.
- One way to do the comparison is to iterate through all the nodes of the tree at a given level. Locations that are far away from any camera positions can be compared on a higher level (larger nodes), because the captured voxel data also has less resolution if it is far away from cameras. This approach yields a roughly equal amount of complexity in each of the compared nodes.
- this may be applied to adjust the level of compared nodes as appropriate. For example, smaller nodes can be selected around objects known to be important in the scene (such as humans), or objects that have complex moving parts. Unimportant parts of the world can be omitted from comparisons entirely.
- Changes may be detected within nodes. For example, let us assume a node existing in both the volumetric reference frame 810 and the current frame 820 (e.g. nodes E1 , E2 respectively). The encoder needs to determine if the contents of the node have changed significantly enough to warrant including the new contents in the encoded output.
- the information available in each node is: mipmapped voxel attributes (e.g., color) representing the node's entire subtree as a whole; and the sum of the volume of the solid voxels within the node's subtree (which may be determined when the mipmapped attributes are prepared).
- if these values differ between the reference frame and the current frame by more than a given threshold, the node is considered changed (node C in the changed set 830).
- These thresholds can be determined based on preset encoder parameters or by analyzing the captured world contents. Comparisons done using mipmapped values are much faster than comparing all the contained nodes and solid voxels individually. Mipmapped information also compensates for noise in the captured data. It is also possible to select larger nodes for comparison, but then the comparison checks are to be performed using the child nodes (one or more levels down). The end result is that the likelihood of a false negative decreases while avoiding issues with too many changed nodes being detected in the sequence.
- if a node is absent in the current frame 820 but present P1 in the reference frame 810, it is considered to be deleted D and thus also changed in the changed set 830.
- if a node P2 is present in the current frame 820 but absent G in the reference frame 810, it is always considered as changed.
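- A sketch of this comparison logic, building on the SvoNode type above and assuming the summed solid volume of each subtree is supplied alongside the node; the threshold parameters and all names are illustrative, not taken from the patent:

```cpp
#include <cmath>

enum class NodeChange { Unchanged, Changed, Added, Deleted };

// Compare corresponding nodes of the reference and current frames using only
// their mipmapped summaries, as described above; a null pointer means the
// node is absent (sparse) in that frame.
NodeChange compareNodes(const SvoNode* ref, const SvoNode* cur,
                        double refVolume, double curVolume,
                        double colorThreshold, double volumeThreshold) {
    if (ref && !cur) return NodeChange::Deleted;    // present only in reference (P1)
    if (!ref && cur) return NodeChange::Added;      // present only in current (P2)
    if (!ref && !cur) return NodeChange::Unchanged; // empty space in both frames

    // Squared distance between the mipmapped colors of the two subtrees.
    double colorDiff = 0.0;
    for (int i = 0; i < 4; ++i) {
        const double d = double(ref->attributes.rgba[i]) - double(cur->attributes.rgba[i]);
        colorDiff += d * d;
    }
    // A large enough change in the summarized color or in the solid volume
    // marks the node as changed; no per-voxel comparison is needed.
    if (colorDiff > colorThreshold * colorThreshold ||
        std::abs(refVolume - curVolume) > volumeThreshold)
        return NodeChange::Changed;
    return NodeChange::Unchanged;
}
```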
- each changed node location is given a location ID that is persistent and unique within the sequence. All changes occurring later on in the sequence in the same location will use the same ID number.
- the location ID is written in the volumetric reference frame's octree (for example as a node attribute). If the reference octree already has an ID for this node, the existing ID is used. Location ID zero (the default value) is used for all nodes whose contents will not change during the sequence.
- when the changed node itself does not exist in both octrees, the location ID is assigned to the nearest parent node that exists in both octrees.
- This shared parent node is copied from the reference frame into the output data, and the changed node's subtree is copied into it. Additional empty nodes may be added as padding if there are octree levels missing between the compared location and the shared parent node.
- the encoder produces a frame change set 830. This is also illustrated in Figure 9.
- the change sets comprise at least a frame number (e.g. within the encoded sequence of frames); and a set of location IDs 901, 902, each associated with a sparse voxel subtree 905.
- a deleted subtree is encoded as a special value that identifies that no subtree exists for that location ("X" in Figure 9). If no changes were detected in the compared frames, the change set can be omitted from the output entirely.
- each node may have the addresses of eight child nodes, and the address of one parent node.
- for subtrees in a change set, the root node's parent node is the parent node of the corresponding node in the reference octree.
- the output data for the entire sequence contains the full reference octree (that contains the location IDs) plus all the frame change sets.
- attributes can be shared between the octrees/subtrees to reduce total data size. This is useful also in the case when nodes have been added and the change set thus also contains unmodified contents from the reference octree.
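- The output described above can be summarized with the following sketch (all names are assumptions made for illustration); a null patch plays the role of the special "deleted" value, i.e. the "X" of Figure 9:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Patch for one changed location: a replacement subtree, or null as the
// special marker meaning no subtree exists for that location ("X" in Fig. 9).
using SubtreePatch = std::unique_ptr<SvoNode>;

// Change set of a single volumetric frame: the frame number within the
// encoded sequence plus one subtree patch per changed location ID.
struct FrameChangeSet {
    std::uint32_t frameNumber;
    std::map<std::uint32_t, SubtreePatch> patches; // location ID -> subtree
};

// Full encoder output for one sequence: the reference octree annotated with
// location IDs, plus the change sets of all frames. Frames in which no
// changes were detected simply contribute no change set.
struct EncodedSequence {
    std::unique_ptr<SvoNode> referenceOctree;
    std::vector<FrameChangeSet> changeSets;
};
```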
- the outcome of the Voxel Encoding 740 is a SVOX (Sparse VOXel) file 750, which is transmitted for playback 760.
- the SVOX file 750 is streamed 770, which creates stream packets 780.
- a voxel rendering 790 is applied which provides viewer state (e.g. current time, view frustum) 795 to the streaming 770.
- FIG. 10 is a flowchart illustrating a method according to an embodiment.
- a method comprises receiving 1010 a video data sequence comprising volumetric frames; optionally selecting 1020 a reference frame from the volumetric frames; generating 1030 a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frames 1040: generating 1050 a frame sparse voxel octree for a frame that is currently encoded; comparing 1060 the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; assigning 1070 an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and producing 1080 a frame change set for the frame; and producing an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets.
- An apparatus comprises means for receiving a video data sequence comprising volumetric frames; means for generating a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frames: means for generating a frame sparse voxel octree for a frame that is currently encoded; means for comparing the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; means for assigning an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; means for producing a frame change set for the frame; and means for producing an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets.
- the various embodiments may provide advantages.
- the method of encoding changes according to the present embodiments does not require any special modelling tools for setting up the scene. It can be applied to any 3D A/R footage where depth information and camera positions can be extracted.
- the method produces a data set that can be fed to a GPU with minimal preprocessing. This allows rendering the animated volumetric frames efficiently.
- One particular advantage is that the renderer can switch on the fly between several frames currently available in GPU memory.
- a sparse voxel octree is a simple tree structure, which makes resolution adjustments and spatial subdivision trivial: resolution can be changed simply by limiting the depth of the tree, and subdivision can be done by picking specific subtrees. This makes the data structure well-suited for parallelized encoding and adaptive streaming. Sparse voxel octrees also have the advantage of supporting variable resolution within the volume. It is also trivial to merge octrees together so that each octree contributes details to a combined octree. This is especially useful when merging the captured contents of multiple 3D cameras. When compared to triangle meshes, sparse voxel data has a simpler, recursive overall structure.
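- As an illustration of the resolution adjustment mentioned above, limiting the depth of the tree during traversal is all that is needed; a node at the depth limit is emitted using its mipmapped attributes in place of its whole subtree. This is an illustrative sketch under the assumptions of the earlier code, not the patent's renderer:

```cpp
// Traverse a sparse voxel octree at reduced resolution by capping the
// recursion depth; a node at the limit stands in for its entire subtree
// via its mipmapped attributes. Spatial subdivision is similarly trivial:
// pick one child subtree as the new root.
template <typename EmitVoxel>
void traverseLimited(const SvoNode& node, int depthLeft, EmitVoxel&& emit) {
    bool hasChildren = false;
    for (const auto& c : node.child)
        if (c) { hasChildren = true; break; }

    if (depthLeft == 0 || !hasChildren) {
        emit(node.attributes); // coarse voxel replaces the subtree below it
        return;
    }
    for (const auto& c : node.child)
        if (c) traverseLimited(*c, depthLeft - 1, emit);
}
```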
- An octree may be composed of a sequence of integers, while triangle meshes are more complicated (3D floating-point coordinates, edges, faces).
- While GPUs have been primarily designed to work with triangle meshes, modern graphics APIs provide enough programmability (shader programs) to enable rendering sparse voxel data in real time.
- Mipmapping can be achieved trivially. This is important in 3D graphics because objects in the distance should be rendered using a lower level of detail to avoid wasting processing time. Mipmapping also benefits streaming use cases (as noted above).
- a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
- a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Software Systems (AREA)
- Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
Abstract
The invention concerns a method and technical equipment for implementing the method. The method comprises receiving a video data sequence comprising volumetric frames; generating a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame among the volumetric frames: generating a frame sparse voxel octree for a frame that is currently encoded; comparing the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; assigning an identification (901, 902) for a node to the reference sparse voxel octree when a change between the compared nodes is detected; and producing a frame change set for the frame; and producing an output for the video data sequence, comprising the reference sparse voxel octree with the identifications and the produced frame change sets.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FI20165968 | 2016-12-15 | ||
FI20165968 | 2016-12-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018109265A1 (fr) | 2018-06-21 |
Family
ID=62558086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/FI2017/050846 WO2018109265A1 (fr) | 2016-12-15 | 2017-11-30 | Method and technical equipment for encoding media content |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2018109265A1 (fr) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11109066B2 (en) | 2017-08-15 | 2021-08-31 | Nokia Technologies Oy | Encoding and decoding of volumetric video |
EP3821602A4 (fr) * | 2018-07-10 | 2022-05-04 | Nokia Technologies Oy | Method, apparatus and computer program product for volumetric video coding |
US11405643B2 (en) | 2017-08-15 | 2022-08-02 | Nokia Technologies Oy | Sequential encoding and decoding of volumetric video |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150268058A1 (en) * | 2014-03-18 | 2015-09-24 | Sri International | Real-time system for multi-modal 3d geospatial mapping, object recognition, scene annotation and analytics |
Non-Patent Citations (4)
Title |
---|
KAMMERL, J. ET AL.: "Real-time Compression of Point Cloud Streams", 2012 IEEE International Conference on Robotics and Automation (ICRA), 18 May 2012 (2012-05-18), pages 778-785, XP032450378, DOI: 10.1109/ICRA.2012.6224647. Retrieved from the Internet: <URL:http://ieeexplore.ieee.org/abstract/document/6224647> [retrieved on 2018-04-04] *
KAMPE, V. ET AL.: "Exploiting Coherence in Time-Varying Voxel Data", Proceedings of the 20th ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D '16), 28 February 2016 (2016-02-28), pages 15-21, XP058079601, DOI: 10.1145/2856400.2856413. Retrieved from the Internet: <URL:https://dl.acm.org/citation.cfm?id=2856413> [retrieved on 2018-03-27] *
MA, K.-L. ET AL.: "Efficient Encoding and Rendering of Time-Varying Volume Data", ICASE Report No. 98-22, 30 June 1998 (1998-06-30), XP002287747. Retrieved from the Internet: <URL:http://www.dtic.mil/docs/citations/ADA350496> [retrieved on 2018-04-04] *
THANOU, D. ET AL.: "Graph-Based Compression of Dynamic 3D Point Cloud Sequences", IEEE Transactions on Image Processing, vol. 25, no. 4, 11 February 2016 (2016-02-11), pages 1765-1778, XP011602605. Retrieved from the Internet: <URL:https://ieeexplore.ieee.org/abstract/document/7405340> [retrieved on 2018-04-11] *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3669330B1 (fr) | Encoding and decoding of volumetric video | |
EP3669333B1 (fr) | Sequential encoding and decoding of volumetric video | |
US11599968B2 (en) | Apparatus, a method and a computer program for volumetric video | |
US10645369B2 (en) | Stereo viewing | |
JP7344988B2 (ja) | Method, apparatus and computer program product for encoding and decoding of volumetric video | |
EP3396635A2 (fr) | Method and technical equipment for encoding media content | |
EP3603056A1 (fr) | Method, apparatus and computer program product for adaptive streaming | |
WO2019162567A1 (fr) | Encoding and decoding of volumetric video | |
EP3540696B1 (fr) | Method and apparatus for volumetric video rendering | |
WO2018109265A1 (fr) | Method and technical equipment for encoding media content | |
EP3729805B1 (fr) | Method and apparatus for encoding and decoding volumetric video data | |
WO2019008222A1 (fr) | Method and apparatus for encoding media content | |
EP3698332A1 (fr) | An apparatus, a method and a computer program for volumetric video | |
WO2018109266A1 (fr) | Method and technical equipment for rendering media content | |
WO2019008233A1 (fr) | Method and apparatus for encoding media content | |
WO2019215377A1 (fr) | A method and technical equipment for encoding and decoding volumetric video | |
HK1233091B (en) | Stereo viewing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17882250; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 17882250; Country of ref document: EP; Kind code of ref document: A1 |