WO2018109265A1

WO2018109265A1 - A method and technical equipment for encoding media content

Info

Publication number: WO2018109265A1
Application number: PCT/FI2017/050846
Authority: WO
Inventors: Jaakko KERÄNEN
Original assignee: Nokia Technologies Oy
Priority date: 2016-12-15
Filing date: 2017-11-30
Publication date: 2018-06-21

Abstract

The invention relates to a method and technical equipment for implementing the method. The method comprises receiving video data sequence comprising volumetric frames; generating a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frame: generating a frame sparse voxel octree for a frame that is currently encoded; comparing the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; assigning an identification (901, 902) for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and producing a frame change set for the frame; producing an output for the video data sequence, comprising the reference sparse voxel octree with identifications and produced frame change sets.

Description

A METHOD AND TECHNICAL EQUIPMENT FOR ENCODING MEDIA CONTENT

Technical Field The present solution generally relates to video encoding. In particular, the solution relates to volumetric encoding and virtual reality (VR).

Background Since the beginning of photography and cinematography, the most common type of image and video content has been captured and displayed as a two-dimensional (2D) rectangular scene. The main reason of this is that cameras are mainly directional, i.e., they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all axes). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being "immersed" into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

Summary

Now there has been invented an improved method and technical equipment implementing the method, for real-time computer graphics and virtual reality. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.

According to a first aspect, there is provided a method comprising receiving video data sequence comprising volumetric frames; generating a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frame: generating a frame sparse voxel octree for a frame that is currently encoded; comparing the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; assigning an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and producing a frame change set for the frame; producing an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback. According to an embodiment the reference sparse voxel octree is a combined from more than one volumetric frames of the video sequence.

According to an embodiment, the method comprises selecting a reference frame from the volumetric frames, wherein the reference sparse voxel octree is generated based on said reference frame.

According to an embodiment the method further comprises receiving the video data sequence from a multicamera device.

According to an embodiment a change relates to one or more of the following: deletion of a node, addition of a node, change of a content of a node.

According to an embodiment the frame change set comprises a frame number within the video data sequence and a set of identifications associated with a sparse voxel subtree.

According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: to receive video data sequence comprising volumetric frames; to generate a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frame: to generate a frame sparse voxel octree for a frame that is currently encoded; to compare the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; to assign an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and to produce a frame change set for the frame; to produce an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback. According to an embodiment the reference sparse voxel octree is a combined from more than one volumetric frames of the video sequence. According to an embodiment, the apparatus further comprises computer program code to cause the apparatus to select a reference frame from the volumetric frames, wherein the reference sparse voxel octree is generated based on said reference frame.

According to an embodiment the apparatus further comprises computer program code to cause the apparatus to receive the video data sequence from a multicamera device.

According to a third aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive video data sequence comprising volumetric frames; generate a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frame: to generate a frame sparse voxel octree for a frame that is currently encoded; to compare the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; to assign an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and to produce a frame change set for the frame; produce an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback.

Description of the Drawings In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which shows a system and apparatuses for stereo viewing; Fig. 2a shows a camera device for stereo viewing;

Fig. 2b shows a head-mounted display for stereo viewing; Fig. 3 shows a camera according to an embodiment;

Figs. 4a, 4b show examples of a multicamera capturing device;

Figs. 5a, 5b show an encoder and a decoder according to an embodiment;

Fig. 6 illustrates an example of processing steps of manipulating volumetric video data;

Fig. 7 shows an example of a volumetric video pipeline;

Fig. 8 shows an example of comparison of volumetric frame octrees as simplified to a quadtree representation;

Fig. 9 shows an example of an output data of an encoder;

Fig. 10 is a flowchart of a method according to an embodiment; and Figs. 1 1 a, b show an apparatus according to an embodiment.

Description of Example Embodiments

The present embodiments relate to real-time computer graphics and virtual reality (VR).

Volumetric video may be captured using one or more 3D cameras. Volumetric video is to virtual reality what traditional video is to 2D/3D displays. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.

The present embodiments are discussed in relation to media content captured with one or more multicamera devices. A multicamera device comprises two or more cameras, wherein the two or more cameras may be arranged in pairs in said multicamera device. Each said camera has a respective field of view, and each said field of view covers the view direction of the multicamera device. The multicamera device may comprise cameras at locations corresponding to at least some of the eye positions of a human head at normal anatomical posture, eye positions of the human head at maximum flexion anatomical posture, eye positions of the human head at maximum extension anatomical postures, and/or eye positions of the human head at maximum left and right rotation anatomical postures. The multicamera device may comprise at least three cameras, the cameras being disposed such that their optical axes in the direction of the respective camera's field of view fall within a hemispheric field of view, the multicamera device comprising no cameras having their optical axes outside the hemispheric field of view, and the multicamera device having a total field of view covering a full sphere.

The multicamera device described here may have cameras with wide-angle lenses. The multicamera device may be suitable for creating stereo viewing image data and/or multiview video, comprising a plurality of video sequences for the plurality of cameras. The multicamera may be such that any pair of cameras of the at least two cameras has a parallax corresponding to parallax (disparity) of human eyes for creating a stereo image. At least two cameras may have overlapping fields of view such that an overlap region for which every part is captured by said at least two cameras is defined, and such overlap area can be used in forming the image for stereo viewing. Fig. 1 shows a system and apparatuses for stereo viewing, that is, for 3D video and 3D audio digital capture and playback. The task of the system is that of capturing sufficient visual and auditory information from a specific location such that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations and optionally at a time later in the future. Such reproduction requires more information that can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears. To create a pair of images with disparity, two camera sources are used. In a similar manner, for the human auditory system to be able to sense the direction of sound, at least two microphones are used (the commonly known stereo sound is created by recording two audio channels). The human auditory system can detect the cues, e.g. in timing difference of the audio signals to detect the direction of sound. The system of Fig. 1 may consist of three main parts: image sources, a server and a rendering device. A video capture device SRC1 comprises multiple cameras CAM1 , CAM2, CAMN with overlapping field of view so that regions of the view around the video capture device is captured from at least two cameras. The device SRC1 may comprise multiple microphones to capture the timing and phase differences of audio originating from different directions. The device SRC1 may comprise a high resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras can be detected and recorded. The device SRC1 comprises or is functionally connected to a computer processor PROC1 and memory MEM1 , the memory comprising computer program PROGR1 code for controlling the video capture device. The image stream captured by the video capture device may be stored on a memory device MEM2 for use in another device, e.g. a viewer, and/or transmitted to a server using a communication interface COMM1 . It needs to be understood that although an 8-camera-cubical setup is described here as part of the system, another multicamera (e.g. a stereo camera) device may be used instead as part of the system.

Alternatively or in addition to the video capture device SRC1 creating an image stream, or a plurality of such, one or more sources SRC2 of synthetic images may be present in the system. Such sources of synthetic images may use a computer model of a virtual world to compute the various image streams it transmits. For example, the source SRC2 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position. When such a synthetic set of video streams is used for viewing, the viewer may see a three-dimensional virtual world. The device SRC2 comprises or is functionally connected to a computer processor PROC2 and memory MEM2, the memory comprising computer program PROGR2 code for controlling the synthetic sources device SRC2. The image stream captured by the device may be stored on a memory device MEM5 (e.g. memory card CARD1 ) for use in another device, e.g. a viewer, or transmitted to a server or the viewer using a communication interface COMM2. There may be a storage, processing and data stream serving network in addition to the capture device SRC1 . For example, there may be a server SERVER or a plurality of servers storing the output from the capture device SRC1 or computation device SRC2. The device SERVER comprises or is functionally connected to a computer processor PROC3 and memory MEM3, the memory comprising computer program PROGR3 code for controlling the server. The device SERVER may be connected by a wired or wireless network connection, or both, to sources SRC1 and/or SRC2, as well as the viewer devices VIEWER1 and VIEWER2 over the communication interface COMM3.

For viewing the captured or created video content, there may be one or more viewer devices VIEWER1 and VIEWER2. These devices may have a rendering module and a display module, or these functionalities may be combined in a single device. The devices may comprise or be functionally connected to a computer processor PROC4 and memory MEM4, the memory comprising computer program PROG4 code for controlling the viewing devices. The viewer (playback) devices may consist of a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through communications interface COMM4, or from a memory device MEM6 like a memory card CARD2. The viewer devices may have a graphics processing unit for processing of the data to a suitable format for viewing. The viewer VIEWER1 comprises a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence. The head-mounted display may have an orientation sensor DET1 and stereo audio headphones. According to an embodiment, the viewer VIEWER2 comprises a display enabled with 3D technology (for displaying stereo video), and the rendering device may have a head-orientation detector DET2 connected to it. Alternatively, the viewer VIEWER2 may comprise a 2D display, since the volumetric video rendering can be done in 2D by rendering the viewpoint from a single eye instead of a stereo eye pair. Any of the devices (SRC1 , SRC2, SERVER, RENDERER, VIEWER1 , VIEWER2) may be a computer or a portable computing device, or be connected to such. Such rendering devices may have computer program code for carrying out methods according to various examples described in this text.

Fig. 2a shows a camera device 200 for stereo viewing. The camera comprises two or more cameras that are configured into camera pairs 201 for creating the left and right eye images, or that can be arranged to such pairs. The distances between cameras may correspond to the usual (or average) distance between the human eyes. The cameras may be arranged so that they have significant overlap in their field-of-view. For example, wide-angel lenses of 180-degrees or more may be used, and there may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, or 20 cameras. The cameras may be regularly or irregularly spaced to access the whole sphere of view, or they may cover only part of the whole sphere. For example, there may be three cameras arranged in a triangle and having different directions of view towards one side of the triangle such that all three cameras cover an overlap area in the middle of the directions of view. As another example, 8 cameras having wide-angle lenses and arranged regularly at the corners of a virtual cube and covering the whole sphere such that the whole or essentially whole sphere is covered at all directions by at least 3 or 4 cameras. In Fig. 2a three stereo camera pairs 201 are shown. Multicamera devices with other types of camera layouts may be used. For example, a camera device with all cameras in one hemisphere may be used. The number of cameras may be e.g., 2, 3, 4, 6, 8, 12, or more. The cameras may be placed to create a central field of view where stereo images can be formed from image data of two or more cameras, and a peripheral (extreme) field of view where one camera covers the scene and only a normal non-stereo image can be formed.

Fig. 2b shows a head-mounted display (HMD) for stereo viewing. The head-mounted display comprises two screen sections or two screens DISP1 and DISP2 for displaying the left and right eye images. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes' field of view. The device is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module ORDET1 for determining the head movements and direction of the head. The head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.

Fig. 3 illustrates a camera CAM1 . The camera has a camera detector CAMDET1 , comprising a plurality of sensor elements for sensing intensity of

the light hitting the sensor element. The camera has a lens OBJ1 (or a lens arrangement of a plurality of lenses), the lens being positioned so that the light hitting the sensor elements travels through the lens to the sensor elements. The camera detector CAMDET1 has a nominal center point CP1 that is a middle point of the plurality of sensor elements, for example for a rectangular sensor the crossing point of the diagonals. The lens has a nominal center point PP1 , as well, lying for example on the axis of symmetry of the lens. The direction of orientation of the camera is defined by the line passing through the center point CP1 of the camera sensor and the center point PP1 of the lens. The direction of the camera is a vector along this line pointing in the direction from the camera sensor to the lens. The optical axis of the camera is understood to be this line CP1 -PP1 .

The system described above may function as follows. Time-synchronized video, audio and orientation data is first recorded with the capture device. This can consist of multiple concurrent video and audio streams as described above. These are then transmitted immediately or later to the storage and processing network for processing and conversion into a format suitable for subsequent delivery to playback devices. The conversion can involve post-processing steps to the audio and video data in order to improve the quality and/or reduce the quantity of the data while preserving the quality at a desired level. Finally, each playback device receives a stream of the data from the network, and renders it into a stereo viewing reproduction of the original location which can be experienced by a user with the head-mounted display and headphones.

Figs. 4a and 4b show an example of a camera device for being used as a source for media content, such as images and/or video. To create a full 360 degree stereo panorama every direction of view needs to be photographed from two locations, one for the left eye and one for the right eye. In case of video panorama, these images need to be shot simultaneously to keep the eyes in sync with each other. As one camera cannot physically cover the whole 360 degree view, at least without being obscured by another camera, there need to be multiple cameras to form the whole 360 degree panorama. Additional cameras however increase the cost and size of the system and add more data streams to be processed. This problem becomes even more significant when mounting cameras on a sphere or platonic solid shaped arrangement to get more vertical field of view. However, even by arranging multiple camera pairs on for example a sphere or platonic solid such as octahedron or dodecahedron, the camera pairs will not achieve free angle parallax between the eye views. The parallax between eyes is fixed to the positions of the individual cameras in a pair, that is, in the perpendicular direction to the camera pair, no parallax can be achieved. This is problematic when the stereo content is viewed with a head mounted display that allows free rotation of the viewing angle around z-axis as well.

The requirement for multiple cameras covering every point around the capture device twice would require a very large number of cameras in the capture device. In this technique lenses are used with a field of view of 180 degree (hemisphere) or greater, and the cameras are arranged with a carefully selected arrangement around the capture device. Such an arrangement is shown in Fig. 4a, where the cameras have been positioned at the corners of a virtual cube, having orientations DIR_CAM1 , DIR_CAM2, DIR_CAMN pointing away from the center point of the cube. Naturally, other shapes, e.g. the shape of a cuboctahedron, or other arrangement, even irregular ones, can be used.

A video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. Typically encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate). An example of an encoding process is illustrated in Figure 5a. Figure 5a illustrates an image to be encoded ( ); a predicted representation of an image block (P'ⁿ); a prediction error signal (Dⁿ ); a reconstructed prediction error signal (D'ⁿ); a preliminary reconstructed image (l'ⁿ); a final reconstructed image (R'ⁿ); a transform (T) and inverse transform (T^~1); a quantization (Q) and inverse quantization (Cr¹ ); entropy encoding (E); a reference frame memory (RFM); inter prediction (P^inter); intra prediction (P^intra); mode selection (MS) and filtering (F). An example of a decoding process is illustrated in Figure 5b. Figure 5b illustrates a predicted representation of an image block (P'ⁿ); a reconstructed prediction error signal (D'ⁿ); a preliminary reconstructed image (l'ⁿ); a final reconstructed image (R'ⁿ); an inverse transform (T^~1); an inverse quantization (Cr¹); an entropy decoding (E^~1); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).

Figure 6 demonstrates an example of processing steps of manipulating volumetric video data, starting from raw camera frames (from various locations within the world) and ending with a frame rendered at a freely-selected 3D viewpoint. The starting point 610 is media content obtained from one or more camera devices. The media content may comprise raw camera frame images, depth maps, and camera 3D positions. The recorded media content, i.e. image data, is used to construct an animated 3D model 620 of the world. The viewer is then freely able to choose his/her position and orientation within the world when the volumetric video is being played back 630. "A sparse voxel octree" is a central data structure to which the present embodiments are based. "Voxel" of a three-dimensional world corresponds to a pixel of a two- dimensional world. Voxels exist in a 3D grid layout. An octree is a tree data structure used to partition a three-dimensional space. Octrees are the three-dimensional analog of quadtrees. A sparse voxel octree describes a volume of a space containing a set of solid voxels of varying sizes. Empty areas within the volume are absent from the tree, which is why it is called "sparse".

A volumetric video frame is a complete sparse voxel octree that models the world at a specific point in time in a video sequence. Voxel attributes contain information like color, opacity, surface normal vectors, and surface material properties. These are referenced in the sparse voxel octrees (e.g. color of a solid voxel), but can also be stored separately.

The present embodiments relate to real-time computer graphics and virtual reality (VR). Volumetric video is to virtual reality what traditional video is to 2D/3D displays.

In computer graphics, term "mipmap" is used. Mipmaps are pre-calculated, optimized sequences of images, each of which is a progressively lower resolution representation of the same image. The height and width of each image, or level, in the mipmap is a power of two smaller than the previous level. They are intended to increase rendering speed and reduce aliasing artifacts. In the context of voxel octrees, each level in the octree can be considered a 3D mipmap of the next lower level. When encoding a volumetric video, each frame may produce several hundred megabytes or several gigabytes of voxel data which needs to be converted to a format that can be streamed to the viewer, and rendered in real-time. The amount of data depends on the world complexity and the number of cameras. The larger impact comes in a multi-device recording setup with a number of separate locations where the cameras are recording. Such a setup produces more information than a camera at a single location.

The present embodiments are targeted to a problem of detecting and encoding changes that occur in a volumetric video sequence, by reducing the amount of data to transfer to the viewer. The encoding also needs to be done in a particular way to enable later rendering the video efficiently on a GPU (Graphics Processing Unit).

To detect changes between two volumetric video frames, comparisons are done in selected nodes of the frames' sparse voxel octrees. Due to sparseness, nodes may also be absent in either of the volumetric frames. The changed, added, or deleted node locations are given unique IDs within the video sequence, and the node subtrees are written as patches to a reference volumetric frame.

Figure 7 illustrates an example of a volumetric video pipeline. The present embodiments are targeted to a "Change Detection" and "Frame Change Sets" stages of Voxel Encoding 740 in the pipeline.

In the process, multiple cameras 710 captured video data of the world, which video data is input 720 to the pipeline. The video data comprises camera frames, positions and depth maps 730 which are transmitted to the Voxel Encoding 740.

During the "Video Sequencing" stage of the Voxel Encoding 740, the input video material has been divided into shorter sequences of volumetric frames. A volumetric reference frame may be chosen for each sequence. The reference frame can be the first frame in the sequence, or the reference frame can be any one of the other frames in the sequences. Alternatively, the reference frame can be combined from more than one volumetric frames of the video sequence. The encoder is configured to produce a sparse voxel octree for the sequence's volumetric reference frame, and the volumetric frame currently being encoded. Alternatively, the sparse voxel octree may be generated without any reference frame, but using statistics or any other available data.

At the "Change Detection" stage, the encoder processes each frame in the sequence separately. Each frame is compared against the one reference frame chose for the sequence. Figure 8 illustrates an example how the volumetric frame octrees are compared (shown as quadtrees in Fig. 8 instead of octrees). In Figure 8, there are an initial octree 810 (i.e. a reference frame) and a frame octree 820 (i.e. a current frame), which are compared. The gray nodes G are spare (blank space). The comparison results in a changed set 830. In the changed set 830, some nodes of the tree may be deleted (nodes D), some nodes in the tree may be added (node A), and/or some nodes may change their content (nodes C).

According to an embodiment, the selection of the nodes to be compared depends on the captured world contents. One way to do the comparison is to iterate through all the nodes of the tree at a given level. Locations that are far away from any camera positions can be compared on a higher level (larger nodes), because the captured voxel data also has less resolution if it is far away from cameras. This approach yields a roughly equal amount of complexity in each of the compared nodes.

If the encoder has additional knowledge about the capture world contents, this may be applied to adjust the level of compared nodes as appropriate. For example, smaller nodes can be selected around objects known to be important in the scene (such as humans), or objects that have complex moving parts. Unimportant parts of the world can be omitted from comparisons entirely.

Changes may be detected within nodes. For example, let us assume a node existing in both the volumetric reference frame 810 and the current frame 820 (e.g. nodes E1 , E2 respectively). The encoder needs to determine if the contents of the node have changed significantly enough to warrant including the new contents in the encoded output. The information available in each node is: mipmapped voxel attributes (e.g., color) representing node's entire subtree as a whole; sum of the volume of the solid voxels within the node's subtrees (may be determined when the mipmapped attributes are prepared). If the mipmapped attributes or total solid volume of the compared nodes differ more than a certain threshold, the node is considered as changed (node C in the changed set 830). These thresholds can be determined based on preset encoder parameters or by analyzing the captured world contents. Comparisons done using mipmapped values are much faster than comparing all the contained nodes and solid voxels individually. Mipmapped information also compensates noise in the captured data. It is also possible to select larger nodes for comparison, but then the comparison checks are to be performed using the child nodes (one or more levels down). The end result is that the likelihood of a false negative decreases while avoiding issues with too many changed nodes being detected in the sequence. If a node is absent in the current frame 820 but present P1 in the reference frame 810, it is considered to be deleted D and thus also changed in the changed set 830. When a node P2 is present in the current frame 820 but absent G in the reference frame 810, it is always considered as changed. When a change is detected between the compared nodes, the location of the node is given a non-zero ID number. This is called "location ID". The location ID is persistent and unique within the sequence. All changes occurring later on in the sequence in the same location will use the same ID number. The location ID is written in the volumetric reference frame's octree (for example as a node attribute). If the reference octree already has an ID for this node, the existing ID is used. Location ID zero (the default value) is used for al nodes whose contents will not change during the sequence.

When it comes to added nodes A (which are absent in the reference frame 810), the location ID is assigned to the nearest parent node that exists in both octrees. This shared parent node is copied from the reference frame into the output data, and the changed node's subtree is copied into it. Additional empty nodes may be added as padding if there are octree levels missing between the compared location and the shared parent node. As output for each frame, the encoder produces a frame change set 830. This is also illustrated in Figure 9. The change sets comprise at least frame number (e.g. within the encoded sequence of frames); and a set of location IDs 901 , 902, each associated with a sparse voxel subtree 905. A deleted subtree is encoded as a special value that identifies that no subtree exists for that location ("X" in Figure 9). If no changes were detected in the compared frames, the change set can be omitted from the output entirely. In the output data, each node may have the addresses of eight child nodes, and the address of one parent node. In the case of the frame subtrees, their root node's parent node is the parent node of the corresponding node in the reference octree. The output data for the entire sequence contains the full reference octree (that contains the location IDs) plus all the frame change sets. In the output data, attributes can be shared between the octrees/subtrees to reduce total data size. This is useful also in the case when nodes have been added and the change set thus also contains unmodified contents from the reference octree.

Let us turn to Figure 7. The outcome of the Voxel Encoding 740 is a SVOX (Sparse VOXel) file 750, which is transmitted for playback 760. The SVOX file 750 is streamed 770, which creates stream packets 780. For these stream packets 780 a voxel rendering 790 is applied which provides viewer state (e.g. current time, view frustum) 795 to the streaming 770.

Figure 10 is a flowchart illustrating a method according to an embodiment. A method comprises receiving 1010 video data sequence comprising volumetric frames; optionally selecting 1020 a reference frame from the volumetric frames; generating 1030 a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frame 1040: generating 1050 a frame sparse voxel octree for a frame that is currently encoded; comparing 1060 the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; assigning 1070 an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and producing 1080 a frame change set for the frame; producing 1090 an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback.

An apparatus according to an embodiment comprises means for receiving video data sequence comprising volumetric frames; means for generating a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frame: means for generating a frame sparse voxel octree for a frame that is currently encoded; means for comparing the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; means for assigning an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and means for producing a frame change set for the frame; means for producing an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback. The means comprises at least one processor, a memory, and a computer program code residing in the memory.

The various embodiments may provide advantages. The method of encoding changes according to present embodiments does not require any special modelling tools for setting up the scene. It can be applied to any 3DA/R footage where depth information and camera positions can be extracted. The method produces a data set that can be fed to a GPU with minimal preprocessing. This allows rendering the animated volumetric frames efficiently. One particular advantage is that the renderer can switch on the fly between several frames currently available in GPU memory.

A sparse voxel octree is a simple tree structure, which makes resolution adjustments and spatial subdivision trivial: resolution can be changed simply by limiting the depth of the tree, and subdivision can be done by picking specific subtrees. This makes the data structure well-suited for parallelized encoding and adaptive streaming. Sparse voxel octrees also have the advantage of supporting variable resolution within the volume. It is also trivial to merge octrees together so that each octree contributes details to a combined octree. This is especially useful when merging the captured contents of multiple 3D cameras. When compared to triangle meshes, sparse voxel data has a simpler, recursive overall structure. An octree may be composed of a sequence of integers, while triangle meshes are more complicated (3D floating-point coordinates, edges, faces). Although GPUs have been primarily designed to work with triangle meshes, modern graphics APIs provide enough programmability (shader programs) that enable rendering sparse voxel data in real time.

Another big advantage over triangle meshes is that mipmapping can be achieved trivially. This is important in 3D graphics because objects in the distance should be rendered using a lower level of detail to avoid wasting processing time. Mipmapping also benefits streaming use cases (as noted above).

The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with other. Furthermore, if desired, one or more of the above- described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as, defined in the appended claims.

Claims

Claims:

1 . A lossy compression method, comprising:

- receiving video data sequence comprising volumetric frames;

- generating a unique reference sparse voxel octree based on zero or more volumetric frames within the video sequence, wherein said unique reference sparse voxel octree is used as a reference for all other frames of the video sequence;

- for any other frame of the volumetric frame:

• generating a frame sparse voxel octree for a frame that is currently encoded;

• comparing the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree;

• assigning an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and

• producing a frame change set for the frame;

- producing an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback.

2. The method according to claim 1 , wherein the reference sparse voxel octree is a combined from more than one volumetric frames of the video sequence.

3. The method according to claim 1 , further comprising selecting a reference frame from the volumetric frames, wherein the reference sparse voxel octree is generated based on said reference frame.

4. The method according to any of the claims 1 to 3, further comprising receiving the video data sequence from a multicamera device.

5. The method according to any of the claims 1 to 4, wherein a change relates to one or more of the following: deletion of a node, addition of a node, change of a content of a node.

6. The method according to any of the claims 1 to 5, wherein the frame change set comprises a frame number within the video data sequence and a set of identifications associated with a sparse voxel subtree.

7. An apparatus for a lossy compression comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- to receive video data sequence comprising volumetric frames;

- to generate a unique reference sparse voxel octree based on zero or more volumetric frames within the video sequence, wherein said unique reference sparse voxel octree is used as a reference for all other frames of the video sequence;

- for any other frame of the volumetric frame:

· to generate a frame sparse voxel octree for a frame that is currently encoded;

• to compare the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree;

· to assign an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and

• to produce a frame change set for the frame;

- to produce an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback.

8. The apparatus according to claim 7, wherein the reference sparse voxel octree is a combined from more than one volumetric frames of the video sequence.

9. The apparatus according to claim 7, the apparatus further comprises computer program code to cause the apparatus to select a reference frame from the volumetric frames, wherein the reference sparse voxel octree is generated based on said reference frame.

10. The apparatus according to any of the claims 7 to 9, further comprising computer program code to cause the apparatus to receive the video data sequence from a multicamera device.

1 1 . The apparatus according to any of the claims 7 to 10, wherein a change relates to one or more of the following: deletion of a node, addition of a node, change of a content of a node.

12. The apparatus according to any of the claims 7 to 1 1 , wherein the frame change set comprises a frame number within the video data sequence and a set of identifications associated with a sparse voxel subtree.

13. A computer program product for lossy compression, the computer program product being embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive video data sequence comprising volumetric frames;

- generate a unique reference sparse voxel octree based on zero or more volumetric frames within the video sequence, wherein said unique reference sparse voxel octree is used as a reference for all other frames of the video sequence;

- for any other frame of the volumetric frame:

• to generate a frame sparse voxel octree for a frame that is currently encoded;

• to assign an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and

• to produce a frame change set for the frame;

- produce an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback.