WO2018171596A1

WO2018171596A1 - Video encoding method, video decoding method, and related device

Info

Publication number: WO2018171596A1
Application number: PCT/CN2018/079699
Authority: WO
Inventors: 吴国星; 林灿榕
Original assignee: 华为技术有限公司
Priority date: 2017-03-21
Filing date: 2018-03-21
Publication date: 2018-09-27
Also published as: CN108632625B; CN108632625A

Abstract

Embodiments of the present invention disclose a video encoding method, a video decoding method, and a related device, for improving the compression efficiency of video frames. The method in the embodiments of the present invention comprises: acquiring a plurality of video frames, wherein redundant image content data exists between each of the plurality of video frames; reconstructing the plurality of video frames to obtain scene information and a reconstruction residual of each video frame, the scene information comprising data obtained by reducing the redundancy of the redundant data, and the reconstruction residual being used to represent a difference between the video frame and the scene information; and performing predictive encoding on the scene information and reconstruction residuals respectively to obtain scene feature predictive encoding data and residual predictive encoding data. In this way, a redundancy between video frames is reduced, and a volume of data obtained after compression is reduced. Furthermore, each video frame is reconstructed into scene features and a reconstruction residual. Reconstruction residuals are encoded based on residual encoding, producing a small volume of encoded data and a high compression ratio. In this way, the method of the embodiments of the present invention can effectively improve the compression efficiency of video frames.

Description

Video coding method, video decoding method and related equipment

This application claims priority to Chinese Patent Application, filed on March 21, 2017, the Chinese Patent Office, Application No. 201710169486.5, entitled "A Video Coding Method, Video Decoding Method, and Related Equipment", the entire contents of which are incorporated by reference. Combined in this application.

Technical field

The present invention relates to the field of video frame processing, and in particular, to a video encoding method, a video decoding method, a video encoding device and a video decoding device, and a video encoding and decoding device.

Background technique

With the continuous development of technologies such as the Internet and streaming media, digital video has been widely used in various terminal devices, such as traditional PCs, smart phones, tablet computers, and Internet interactive television (IPTV). On the other hand, people's sensory needs are also increasing, and the demand for high-definition video and ultra-high-definition video is also increasing. These video formats and increasing resolution requirements are bound to bring a very large bit rate. Therefore, in the context of large video, high-quality compression of the video is required to reduce the network transmission load and increase the storage capacity.

For video frames to be independently coded, the prior art often performs independent coding on the frame, which results in large frame redundancy information that needs to be independently coded, which is not conducive to data access and transmission.

For example, HEVC (High Efficiency Video Coding) predictive coding uses both intra-frame compression and inter-frame compression. The GOP (Group of pictures) step size is set first before encoding, that is, the number of frames included in the GOP. The frame group is a group composed of a plurality of frames. To prevent motion changes, the number of frames should not be set too much. In the specific prediction coding, HEVC divides all frames into three types of frames: I, P, and B, as shown in Figure 1. The numbers above the frames in the figure indicate the number of the corresponding frame in the original video sequence. When encoding, the I frame, the P frame, and the B frame are encoded in units of GOP. Among them, an Intra-frame (Intra-frame), also known as an intra-coded frame, is an independent frame with all the information, and can be independently encoded and decoded without reference to other images. The existing IVC frame of the HEVC standard only uses the image intraframe information of the current I frame for encoding and decoding, and is selected according to the video time axis by a fixed strategy. Thus, in the HEVC standard, the amount of independently encoded I-frame compressed data is high and there is a large amount of information redundancy.

Summary of the invention

The embodiments of the present invention provide a video encoding method, a video decoding method, a video encoding device, a video decoding device, and a video encoding and decoding device, which are used to improve the compression efficiency of a video frame.

A first aspect of the embodiments of the present invention provides a video encoding method, the method comprising: acquiring a plurality of video frames, wherein each of the plurality of video frames includes redundant data on the screen content. Then, the multiple video frames are reconstructed to obtain scene information and reconstruction residuals of each video frame, the scene information includes data obtained by reducing redundancy of redundant data, and the reconstructed residual is used to represent The difference between the video frame and the scene information, such that redundant data of the plurality of video frames is reduced by reconstruction. Subsequently, the scene information is predictively coded, the scene feature prediction coded data is obtained, and the reconstructed residual is predictively coded to obtain residual prediction coded data.

In this way, by performing the process of reconstructing the plurality of video frames, the redundancy of the video frames can be reduced, so that in the encoding operation, the obtained scene features and the reconstructed residual total compressed data amount are relative to the original video. The amount of compressed data of the frame is reduced, reducing the amount of data obtained after compression. Each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual includes residual information other than the scene information, the amount of information is small and sparse, and the feature can be compared when performing predictive coding. The codewords are less predictively encoded, the amount of encoded data is small, and the compression ratio is high. Thus, the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.

In conjunction with the first aspect of the embodiments of the present application, in a first implementation manner of the first aspect of the embodiments of the present application, each of the multiple video frames includes the same picture content, and the same picture content That is, redundant data of the plurality of video frames. Reconstructing a plurality of video frames to obtain scene information and a reconstruction residual of each video frame, comprising: reconstructing a plurality of video frames to obtain scene features and reconstruction residuals of each video frame, The scene feature is used to represent the same picture content between each video frame, and the reconstructed residual is used to represent the difference between the video frame and the scene feature. The scene feature is one of the specific forms of scene information. Through the reconstruction operation, one of the plurality of identical picture contents between the plurality of video frames is saved in one scene feature, thereby reducing the repeated recording of the same picture content and reducing redundancy of redundant data. degree. Correspondingly, the scene information is predictively encoded, and the scene feature prediction encoded data is obtained, including: predicting and encoding the scene features, and obtaining scene feature prediction encoded data.

In this way, by reconstructing, the same picture content is deduplicated and represented by the scene feature, and the redundancy of the redundant information of the plurality of video frames can be reduced. Therefore, in the encoding operation, the obtained scene feature and the compressed data amount of the reconstructed residual total are reduced relative to the compressed data amount of the original video frame, and the amount of data obtained after compression is reduced. Each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual includes residual information other than the scene information, the amount of information is small and sparse, and the feature can be compared when performing predictive coding. The codewords are less predictively encoded, the amount of encoded data is small, and the compression ratio is high. Thus, the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.

With reference to the first implementation manner of the first aspect of the embodiment of the present application, in a second implementation manner of the first aspect of the embodiment of the present application, multiple video frames are reconstructed to obtain a scene feature and each video frame. Reconstruction residuals include: converting multiple video frames into an observation matrix, and the observation matrix is used to represent multiple video frames in a matrix form. Then, the observation matrix is reconstructed according to the first constraint condition to obtain a scene feature matrix and a reconstructed residual matrix. The scene feature matrix is used to represent the scene features in a matrix form, and the reconstructed residual matrix is used in a matrix form. The reconstructed residuals of the plurality of video frames are represented, the first constraint is used to define the scene feature matrix low rank and the reconstructed residual matrix is sparse. In this way, the reconstruction operation of the plurality of video frames is performed in the form of a matrix, and under the constraint of the first constraint, the reconstruction residual and the scene feature are made to meet the preset requirements, which is advantageous for reducing the coding amount and the subsequent encoding operation. Increase the compression ratio.

With reference to the second implementation manner of the first aspect of the embodiment of the present application, in a third implementation manner of the first aspect of the embodiment of the present application, the observation matrix is reconstructed according to the first constraint condition, and the scene feature matrix is obtained. Reconstructing the residual matrix includes: calculating a scene feature matrix and a reconstructed residual matrix according to a first preset formula, wherein the obtained scene feature matrix is a low rank matrix, and the reconstructed residual matrix is a sparse matrix.

Among them, the first preset formula is:

or,

s.t.D=F+E s.t.D=F+E

Both sets of formulas include two formulas: the target constraint function and the reconstruction formula. Because the former group of formulas belong to the NP problem, the slack operation is performed to obtain the latter set of formulas, and the latter set of formulas are convenient to solve.

Where D is the observation matrix, F is the scene feature matrix, E is the reconstructed residual matrix, λ is the weight parameter, and λ is used to balance the relationship between the scene feature matrix F and the reconstructed residual matrix E.

Represents seek the optimal value F and E, even though the values to obtain the objective formula rank (F) + λ || E || 1 or _{|| F || * + λ || E} || 1 minimum value F and E ,rank(·) is a matrix for the rank function, ||·|| ₁ is the matrix L1 norm, and ||·|| _* is the matrix kernel norm.

With reference to any one of the first to the third aspects of the first aspect of the present application, in a fourth implementation manner of the first aspect of the embodiments of the present application, the multiple video frames are performed on the multiple video frames. Before the reconstruction, the scene feature and the reconstruction residual of each video frame are obtained, the method of the implementation manner further includes: extracting picture feature information of each of the plurality of video frames; and then, according to the picture The feature information is calculated to obtain content metric information for measuring a difference in picture content of the plurality of video frames. Therefore, when the content metric information is not greater than the preset metric threshold, performing the step of reconstructing the plurality of video frames to obtain a scene feature and a reconstruction residual of each video frame. Through the judgment detection, the reconstruction operations of the first to third implementations of the first aspect can be performed by the plurality of video frames that meet the requirements, and the normal execution of the reconstruction operation is ensured.

With reference to the fourth implementation manner of the first aspect of the embodiments of the present application, in a fifth implementation manner of the first aspect of the application, the screen feature information is a global GIST feature, and the preset metric threshold is a preset. The variance threshold is calculated according to the picture feature information, and the content GIST feature variance is calculated according to the global GIST feature. The reconstruction of the first to third implementations of the first aspect of the present application is performed by calculating the content GIST feature variance of the plurality of video frames to measure the content consistency of the plurality of video frames. With reference to any one of the first to third implementation manners of the first aspect of the present application, in a sixth implementation manner of the first aspect of the application, the acquiring multiple video frames includes: The video stream is obtained, and the video frames of the video stream include an I frame, a B frame, and a P frame. Then, an I frame is extracted from the video stream, and the I frame is used to perform a step of reconstructing a plurality of video frames to obtain scene features and reconstruction residuals of each video frame. In a specific coding stage, the method of the implementation manner further includes: reconstructing according to the scene feature and the reconstructed residual to obtain a reference frame. Referring to the reference frame, the B frame and the P frame are inter-predictive-coded to obtain B-frame predictive coded data and P-frame predictive coded data. Then, the predictive coded data is subjected to transform coding, quantization coding, and entropy coding to obtain video compressed data; the predictive coded data includes scene feature prediction coded data, residual prediction coded data, B frame predictive coded data, and P frame predictive coded data. In this way, the I frame of the video stream can be reconstructed and encoded using the method of the present implementation, the amount of encoded data of the I frame is reduced, and the redundant data of the I frame is reduced.

With reference to the first aspect of the embodiments of the present application, in a seventh implementation manner of the first aspect of the embodiments of the present application, each of the multiple video frames includes redundant data at a local location, corresponding to The reconstruction operation is different from the foregoing implementation manner, that is, reconstructing multiple video frames to obtain scene information and reconstruction residuals of each video frame, including: splitting each video frame in multiple video frames A plurality of frame sub-blocks are obtained, and the frame sub-block obtained after the split includes redundant data, and the partial frame sub-blocks can be obtained based on other frame sub-blocks. The so-called frame sub-block is the frame content of a partial area of the video frame. Then, the plurality of frame sub-blocks are reconstructed to obtain a scene feature, a representation coefficient of each frame sub-block of the plurality of frame sub-blocks, and a reconstruction residual of each frame sub-block, wherein the scene feature includes multiple The independent scene feature base cannot be reconstructed from each other within the scene feature. The scene feature base is used to describe the picture content feature of the frame sub-block. The indicated representation coefficient represents the correspondence between the scene feature base and the frame sub-block. The relationship, the reconstructed residual represents the difference between the frame sub-block and the scene feature base. Thus, by the reconstruction operation, the redundancy of the frame sub-block including the redundant data is reduced. The scene feature of the implementation manner is one of the specific forms of the scene information, which can reduce the redundancy between the partially redundant video frames. Correspondingly, the scene information is predictively encoded, and the scene feature prediction encoded data is obtained, including: predicting and encoding the scene features, and obtaining scene feature prediction encoded data.

With reference to the sixth implementation manner of the first aspect of the embodiments of the present application, in an eighth implementation manner of the first aspect of the embodiments of the present application, multiple frame sub-blocks are reconstructed to obtain a scene feature and multiple frames. a representation coefficient of each frame sub-block in the sub-block and a reconstruction residual of each frame sub-block, including: reconstructing a plurality of frame sub-blocks to obtain each of the plurality of frame sub-blocks Represents the coefficient and the reconstructed residual of each frame sub-block. Wherein, the representation coefficient represents a correspondence between a frame sub-block and a target frame sub-block, the target frame sub-block is an independent frame sub-block among the plurality of frame sub-blocks, and the independent frame sub-block is not based on other ones of the plurality of frame sub-blocks The frame sub-block reconstructed from the frame sub-block is reconstructed to represent the difference between the target frame sub-block and the frame sub-block. Then, a plurality of target frame sub-blocks indicating the coefficient indication are combined to obtain a scene feature, and the target frame sub-block is a scene feature base. In this way, the target frame sub-blocks that can be independently represented are selected, and the target sub-blocks and the reconstructed residuals are not represented by the frame sub-blocks that are not independently represented, thereby reducing the between the sub-blocks and the target sub-blocks that are not independently represented. Redundant data, only need to encode the target frame sub-block and the reconstructed residual when encoding, reducing the amount of coding.

With reference to the eighth implementation manner of the first aspect of the embodiment of the present application, in a ninth implementation manner of the first aspect of the embodiments of the present application, the multiple frame sub-blocks are reconstructed to obtain multiple frame sub-blocks. The representation coefficient of each frame sub-block and the reconstruction residual of each frame sub-block include: converting a plurality of frame sub-blocks into an observation matrix, and the observation matrix is used to represent the plurality of frame sub-blocks in a matrix form. Then, the observation matrix is reconstructed according to the second constraint condition, and the representation coefficient matrix and the reconstructed residual matrix are obtained. Wherein, the representation coefficient matrix is a matrix including the representation coefficients of each of the plurality of frame sub-blocks, the non-zero coefficient indicating the coefficient indicates the target frame sub-block, and the reconstructed residual matrix is used for each of the matrix forms. The reconstructed residual of the frame sub-block is represented, and the second constraint is used to define the low rank and the sparsity of the represented coefficient to meet the preset requirement. And combining the plurality of target frame sub-blocks indicating the coefficient indication to obtain the scene feature, comprising: combining the target frame sub-blocks indicating the non-zero coefficient indication coefficients of the coefficient matrix to obtain the scene features. In this way, the reconstruction operation can be performed in the form of a matrix, and the reconstruction residual and the scene feature satisfying the requirement of reducing the coding amount are calculated by using the second constraint condition.

With reference to the ninth implementation manner of the first aspect of the embodiment of the present application, in the tenth implementation manner of the first aspect of the embodiment of the present application, the observation matrix is reconstructed according to the second constraint condition, and the representation coefficient matrix is obtained. Reconstructing the residual matrix includes: calculating a representation coefficient matrix and a reconstructed residual matrix according to a second preset formula, where the second preset formula is:

or,

s.t.D=DC+E s.t.D=DC+E

Where D is the observation matrix, C is the coefficient matrix, E is the reconstructed residual matrix, and λ and β are the weight parameters.

Represents the optimal value of C and E, that is, the target formula ||C|| _* +λ||E|| ₁ or ||C|| _* +λ||E|| ₁ +β||C|| _{When the} value of ₁ is the smallest, the values of C and E, ||·|| _* are the matrix kernel norm, and ||·|| ₁ is the matrix L ₁ norm.

With reference to the seventh implementation manner of the first aspect of the embodiment of the present application, in the eleventh implementation manner of the first aspect of the embodiment of the present application, multiple frame sub-blocks are reconstructed to obtain scene features and multiple Representing coefficients of each frame sub-block in the frame sub-block and reconstructing residuals of each frame sub-block includes: reconstructing a plurality of frame sub-blocks to obtain a scene feature and each of the plurality of frame sub-blocks The representation coefficient of the frame sub-block, the scene feature includes the scene feature base as an independent feature block in the feature space, and the independent feature block is a feature block that cannot be reconstructed by other feature blocks in the scene feature. Then, the reconstructed residual of each frame sub-block is calculated according to the reconstructed residual of each frame sub-block and the reconstructed data of the scene feature and each frame sub-block. In this way, a scene feature that can represent the plurality of frame sub-blocks as a whole is obtained by reconstructing, the scene feature is composed of a scene feature base, and the scene feature base is an independent feature block in the feature space, if different frames are used. If the block reconstruction obtains the same feature block, the same feature block may not be repeatedly saved in the scene feature, thereby reducing redundant data.

With reference to the eleventh implementation manner of the first aspect of the embodiments of the present application, in a twelfth implementation manner of the first aspect of the embodiments of the present application, multiple frame sub-blocks are reconstructed to obtain scene features and multiple The representation coefficients of each frame sub-block in the frame sub-block include: converting a plurality of frame sub-blocks into an observation matrix, and the observation matrix is used to represent the plurality of frame sub-blocks in a matrix form. And reconstructing the observation matrix according to the third constraint condition to obtain a representation coefficient matrix and a scene feature matrix, wherein the representation coefficient matrix is a matrix including the representation coefficients of each frame sub-block, and the non-zero coefficient indicating the coefficient indicates the scene feature Base, the scene feature matrix is used to represent the scene feature in a matrix form, and the third constraint condition is used to define the similarity between the picture representing the coefficient matrix and the scene feature matrix reconstructed picture and the frame sub-block according to a preset similarity threshold, And limiting the data matrix sparsity to meet the preset sparse threshold, and the amount of data of the scene feature matrix is less than the preset data amount threshold.

And, according to the reconstructed residual of each frame sub-block and the reconstructed data of the scene feature and each frame sub-block, the reconstructed residual of each frame sub-block is calculated, including: according to the representation coefficient matrix and the scene feature The data and the observation matrix obtained by the matrix reconstruction are used to calculate a reconstructed residual matrix, wherein the reconstructed residual matrix is used to represent the reconstructed residual in a matrix form.

In this way, the reconstruction operation can be performed in the form of a matrix, and the representation coefficients and scene features that meet the requirements for reducing the coding amount are calculated by using the third constraint condition.

With reference to the twelfth implementation manner of the first aspect of the embodiment of the present application, in the thirteenth implementation manner of the first aspect of the embodiment of the present application, the observation matrix is reconstructed according to the third constraint condition, and the representation coefficient is obtained. The matrix and the scene feature matrix include: calculating a representation coefficient matrix and a scene feature matrix according to a third preset formula, and the third preset formula is:

Where D is the observation matrix, C is the coefficient matrix, F is the scene feature, and λ and β are the weight parameters, which are used to adjust the coefficient sparsity and low rank.

Represents the optimal value of F and C, ie the formula

The value of F and C when the value is minimum.

With reference to any one of the seventh to thirteenth aspects of the first aspect of the present application, in the fourteenth implementation manner of the first aspect of the embodiments of the present application, the multiple video frames are The method of the present implementation further includes: extracting picture feature information of each of the plurality of video frames, before each video frame is split to obtain a plurality of frame sub-blocks. Then, based on the picture feature information, content metric information is calculated, where the content metric information is used to measure the difference of the picture content of the plurality of video frames. Therefore, when the content metric information is greater than the preset metric threshold, performing the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks. In this way, when the content metric information is greater than the preset metric threshold, the image representing the plurality of video frames locally has redundant data, thereby using a method of splitting the video frame and reconstructing the frame sub-block.

With reference to the fourteenth implementation manner of the first aspect of the embodiment of the present application, in the fifteenth implementation manner of the first aspect of the application, the picture feature information is a global GIST feature, and the preset metric threshold is And a preset variance threshold, wherein the content metric information is calculated according to the picture feature information, including: calculating a feature GIST feature variance according to the global GIST feature. The content consistency of the plurality of video frames is calculated by calculating the variance of the scene GIST features of the plurality of video frames, thereby determining whether the images of the plurality of video frames have locally stored redundant data, so as to split and match the video frames. A method of reconstructing a frame sub-block.

With reference to any one of the seventh to thirteenth implementations of the first aspect of the present application, in a sixteenth implementation manner of the first aspect of the embodiments of the present application, multiple video frames are acquired, including Obtaining a video stream, where the video frame of the video stream includes an I frame, a B frame, and a P frame; extracting an I frame from the video stream, where the I frame is used to perform splitting each video frame in the multiple video frames, a step of multiple frame sub-blocks;

The method of the implementation manner further includes: performing reconfiguration according to the scene feature, the representation coefficient, and the reconstruction residual to obtain a reference frame; using the reference frame as a reference, performing interframe prediction coding on the B frame and the P frame, and obtaining the B frame prediction coding. Data and P frame predictive coded data; transform coding, quantization coding, and entropy coding of the predictive coded data to obtain video compressed data; the predictive coded data includes scene feature predictive coded data, residual predictive coded data, B frame predictive coded data, and P Frame prediction encoded data.

Thus, the method of the present implementation can be applied to key frames of a video stream, reducing redundant data and coding amount of key frames.

With reference to the first aspect of the embodiment of the present application or any one of the first to the sixteenth aspects of the first aspect, in the seventeenth implementation manner of the first aspect of the embodiment of the present application, multiple After the video frame, the method of the implementation manner further includes: classifying the plurality of video frames based on the correlation of the picture content, and obtaining video frames of one or more classification clusters, where the video frames of the same classification cluster are used to execute multiple videos. The frame is reconstructed to obtain scene information and a step of reconstructing the residual of each video frame. By classifying, the redundancy of redundant data between multiple video frames belonging to the same cluster is greater, so that the subsequent video frame reconstruction stage reduces the redundancy of redundant data between video frames. .

With reference to the seventeenth implementation manner of the first aspect of the embodiment of the present application, in the eighteenth implementation manner of the first aspect of the embodiment of the present application, the multiple video frames are classified according to the correlation of the screen content, A video frame of one or more clusters includes extracting feature information of each of the plurality of video frames. Determining the clustering distance between any two video frames according to the feature information, the clustering distance is used to represent the similarity between the two video frames, and the video frames are clustered according to the clustering distance to obtain the video of one or more clustering clusters. frame. In this way, the classification operation of multiple video frames is realized by clustering.

With reference to the first aspect of the embodiments of the present application, in a nineteenth implementation manner of the first aspect of the present application, acquiring a plurality of video frames includes: acquiring a video stream, where the video stream includes multiple video frames. Then, feature information of the first video frame and the second video frame are respectively extracted, and the feature information is used to describe the picture content of the video frame, where the first video frame and the second video frame are video frames in the video stream; Calculating a lens distance between the first video frame and the second video frame; determining whether the lens distance is greater than a preset lens threshold; if the lens distance is greater than a preset lens threshold, segmenting the target lens from the video stream, starting frame of the target lens For the first video frame, the end frame of the target lens is the previous video frame of the second video frame; if the lens distance is less than the preset lens threshold, the first video frame and the second video frame are attributed to the same lens, and the target lens belongs to One of the lenses of the video stream, the lens is a time-continuous video frame; for each shot in the video stream, the key frame is extracted according to the frame distance between the video frames in the lens, and is randomly selected in each lens. The frame distance between two adjacent key frames is greater than a preset frame distance threshold, and the frame distance is used to indicate the degree of difference between the two video frames, and the key frame of each shot is used for execution. Reconstructing a plurality of video frames to obtain scene information and a reconstructed residual for each video frame. After the lens is divided, the key frames are extracted from the respective shots according to the distance. Such an extraction method uses the context information of the video stream, and the method of the present implementation can be applied to the video stream.

With reference to the nineteenth implementation manner of the first aspect of the embodiments of the present application, in the twentieth implementation manner of the first aspect of the embodiments of the present application, multiple video frames are reconstructed to obtain scene information and each Before reconstructing the residual of the video frame, the method further comprises: performing discriminant training according to each shot segmented from the video stream to obtain a plurality of classifiers corresponding to the shot; and using the target classifier to discriminate the target video frame, Determining the score, the target classifier is one of a plurality of classifiers, and the target video frame is one of the key frames, and the discriminant score is used to indicate the extent to which the target video frame belongs to the scene to which the target classifier belongs; When the threshold is greater than the preset score threshold, it is determined that the target video frame belongs to the same scene as the shot to which the target classifier belongs; and the video frames of one or more clusters are determined according to the video frames belonging to the same scene as the shot.

With reference to the first aspect of the embodiments of the present application, in a twenty-first implementation manner of the first aspect of the embodiments of the present application, acquiring a plurality of video frames includes: acquiring a compressed video stream, where the compressed video stream includes the compressed video a frame; a plurality of target video frames are determined from the compressed video stream, and the target video frame is an independently compressed and encoded video frame in the compressed video stream; the target video frame is decoded to obtain a decoded target video frame, and the decoded target is obtained. The video frame is used to perform the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks. Thus, in the compressed video stream, compressed independent compression-encoded video frames are extracted, and the video encoding method of the present embodiment can be used to further reduce the amount of encoded data of these video frames.

A second aspect of the embodiments of the present invention provides a video decoding method, which includes: acquiring scene feature prediction encoded data and residual prediction encoded data. Then, the scene feature prediction encoded data is decoded to obtain scene information, wherein the scene information includes data obtained by reducing redundancy of redundant data, and the redundant data is between each video frame of the plurality of video frames. Redundant data on the content. The residual prediction encoded data is decoded to obtain a reconstructed residual, and the reconstructed residual is used to represent the difference between the video frame and the scene information. The reconstruction is performed according to the scene information and the reconstructed residual, and multiple video frames are obtained. In this way, the scene feature prediction coded data and the residual prediction coded data obtained by the video coding method provided by the first aspect can be decoded by the video decoding method of the implementation manner.

With reference to the second aspect of the embodiments of the present application, in a first implementation manner of the second aspect of the embodiments of the present application, each of the multiple video frames includes the same picture content, and the pair of scene feature prediction codes The data is decoded to obtain scene information, including: decoding scene feature prediction encoded data to obtain scene features, and the scene features are used to represent the same screen content between each video frame. Reconstructing according to the scene information and the reconstructed residual, obtaining multiple video frames, including: reconstructing according to the scene feature and the reconstructed residual, to obtain multiple video frames. Thus, if the scene feature is used to represent the same picture content, the scene feature information can be decoded by this implementation.

With reference to the first implementation manner of the second aspect of the embodiments of the present application, in a second implementation manner of the second aspect of the embodiments of the present application, acquiring scene feature prediction coding data and residual prediction coding data, including: acquiring video Compressing data; performing entropy decoding, inverse quantization processing, and DCT inverse variation on the video compressed data to obtain predictive encoded data, and the predictive encoded data includes scene feature predictive encoded data, residual predictive encoded data, B frame predictive encoded data, and P frame predictive encoded data. .

Reconstructing according to the scene feature and the reconstructed residual, obtaining multiple video frames, including: reconstructing according to the scene feature and the reconstruction residual, and obtaining multiple I frames;

The method of the implementation manner further includes: performing inter-frame decoding on the B frame predictive encoded data and the P frame predictive encoded data by using the I frame as a reference frame to obtain a B frame and a P frame; and time consuming the I frame, the B frame, and the P frame. Arrange sequentially to get the video stream.

Thus, when the video frame coding method described above is used in the video stream, the video stream can be decoded by the present implementation.

With reference to the second aspect of the embodiments of the present application, in a third implementation manner of the second aspect of the embodiment of the present application, the method of the implementation manner further includes: acquiring a representation coefficient. Decoding the scene feature prediction encoded data to obtain scene information, comprising: decoding scene feature prediction encoded data to obtain a scene feature, where the scene feature includes multiple independent scene feature bases, and independent scene feature bases within the scene feature Cannot be reconstructed from each other, the scene feature base is used to describe the picture content feature of the frame sub-block, and the representation coefficient represents the correspondence between the scene feature base and the frame sub-block, and the reconstructed residual represents the difference between the frame sub-block and the scene feature base. value.

The reconstruction is performed according to the scene information and the reconstructed residual to obtain a plurality of video frames. The method includes: reconstructing according to a scene feature, a representation coefficient, and a reconstruction residual to obtain a plurality of frame sub-blocks; combining the plurality of frame sub-blocks to obtain a plurality of video frames.

In this way, after the frame sub-block is reconstructed and the encoded data is obtained, the video decoding method of the present implementation can be used to decode the scene feature and the reconstructed residual, and reconstructed to obtain a plurality of frame sub-blocks. A video frame can be obtained by reorganizing.

With reference to the third implementation manner of the second aspect of the embodiment of the present application, in a fourth implementation manner of the second aspect of the embodiments of the present application, acquiring scene feature prediction coding data and residual prediction coding data, including: acquiring video Compressing data; performing entropy decoding, inverse quantization processing, and DCT inverse variation on the video compressed data to obtain predictive encoded data, and the predictive encoded data includes scene feature predictive encoded data, residual predictive encoded data, B frame predictive encoded data, and P frame predictive encoded data. .

Combining the plurality of frame sub-blocks to obtain a plurality of video frames, comprising: combining a plurality of frame sub-blocks to obtain a plurality of I frames;

The method of this implementation manner further includes:

The I frame is used as a reference frame, and the B frame predictive coded data and the P frame predictive coded data are inter-frame decoded to obtain a B frame and a P frame; and the I frame, the B frame, and the P frame are arranged in chronological order to obtain a video stream.

After the I frame is split into the video stream to obtain the frame sub-block, the reconstructed residual, the scene feature and the representation coefficient are reconstructed by the frame sub-block, and then decoded and restored by the video decoding method of the implementation manner. Get the video stream.

A third aspect of the embodiments of the present invention provides a video encoding apparatus having a function of performing the above video encoding method. This function can be implemented in hardware or in hardware by executing the corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.

In a possible implementation manner, the video encoding device includes:

An acquiring module, configured to acquire multiple video frames, and each of the plurality of video frames includes redundant data on the screen content;

The reconstruction module is configured to reconstruct multiple video frames to obtain scene information and reconstruction residuals of each video frame, where the scene information includes data obtained by reducing redundancy of redundant data, and reconstructing residuals Deducing the difference between the video frame and the scene information;

a prediction encoding module, configured to predictively encode scene information, and obtain scene feature prediction encoded data;

The prediction encoding module is further configured to perform predictive coding on the reconstructed residual to obtain residual prediction encoded data.

In another possible implementation manner, the video encoding device includes:

Video encoder

The video encoder performs the following actions: acquiring a plurality of video frames, and each of the plurality of video frames includes redundant data on the screen content;

The video encoder further performs the following actions: reconstructing a plurality of video frames to obtain scene information and reconstruction residuals of each video frame, and the scene information includes data obtained by reducing redundancy of redundant data, and reconstructing The residual is used to represent the difference between the video frame and the scene information;

The video encoder further performs the following actions: performing predictive coding on the scene information to obtain scene feature prediction encoded data;

The video encoder also performs an operation of predictive coding the reconstructed residual to obtain residual prediction encoded data.

A fourth aspect of the embodiments of the present invention provides a video decoding apparatus having a function of performing the above video decoding method. This function can be implemented in hardware or in hardware by executing the corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.

In a possible implementation manner, the video decoding device includes:

An obtaining module, configured to acquire scene feature prediction encoded data and residual prediction encoded data;

a scene information decoding module, configured to decode scene feature prediction encoded data to obtain scene information, where the scene information includes data obtained by reducing redundancy of redundant data, and the redundant data is each video frame of multiple video frames. Redundant data between screen contents;

And a reconstructed residual decoding module, configured to decode the residual prediction encoded data to obtain a reconstructed residual, where the reconstructed residual is used to represent a difference between the video frame and the scene information;

The video frame reconstruction module is configured to reconstruct according to the scene information and the reconstructed residual to obtain a plurality of video frames.

In another possible implementation manner, the video decoding device includes:

Video decoder

The video decoder performs the following actions: acquiring scene feature prediction encoded data and residual prediction encoded data;

The video decoder further performs the following operations: decoding scene feature prediction encoded data to obtain scene information, the scene information including data obtained by reducing redundancy of redundant data, and the redundant data is each of a plurality of video frames Redundant data on the content of the picture between video frames;

The video decoder further performs the following operations: decoding the residual prediction encoded data to obtain a reconstructed residual, where the reconstructed residual is used to represent a difference between the video frame and the scene information;

The video decoder also performs an action of reconstructing based on the scene information and the reconstructed residual to obtain a plurality of video frames.

A fifth aspect of the embodiments of the present invention provides a video codec device, where the video codec device includes a video encoding device and a video decoding device.

The video encoding device is the video encoding device provided by the foregoing third aspect;

The video decoding device is the video decoding device provided by the fourth aspect above.

A sixth aspect of an embodiment of the present invention provides a computer storage medium storing program code for indicating execution of the method of the first aspect described above.

A seventh aspect of the embodiments of the present invention provides a computer storage medium storing program code for indicating execution of the method of the second aspect described above.

Yet another aspect of the present application provides a computer readable storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the methods described in the above aspects.

Yet another aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the methods described in the various aspects above.

It can be seen from the above technical solutions that the embodiments of the present invention have the following advantages:

Acquiring a plurality of video frames, wherein each of the plurality of video frames includes redundant data on the picture content. Then, reconstructing the plurality of video frames to obtain scene information and reconstruction residuals of each video frame, the scene information includes data obtained by reducing redundancy of redundant data, and the reconstructed residual is used to represent the video frame. The difference between the scene information and the scene information. Then, the scene information is predictively coded, the scene feature prediction coded data is obtained, and the reconstructed residual is predictively coded to obtain residual prediction coded data. In this way, by performing the process of reconstructing the plurality of video frames, the redundancy of the video frames can be reduced, so that in the encoding operation, the obtained scene features and the reconstructed residual total compressed data amount are relative to the original video. The amount of compressed data of the frame is reduced, reducing the amount of data obtained after compression. Each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual includes residual information other than the scene information, the amount of information is small and sparse, and the feature can be compared when performing predictive coding. The codewords are less predictively encoded, the amount of encoded data is small, and the compression ratio is high. Thus, the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.

DRAWINGS

1 is a schematic diagram of a conventional HEVC coding;

2 is a flowchart of a video frame encoding and decoding method according to an embodiment of the present invention;

3a is a schematic diagram of a flow of a video encoding method and a flow of an existing HEVC encoding method according to another embodiment of the present invention;

FIG. 3b is a schematic diagram of a scenario involved in a video encoding method according to another embodiment of the present invention;

4a is a schematic diagram of a flow of a video decoding method and a flow of an existing HEVC decoding method according to another embodiment of the present invention;

FIG. 4b is a schematic diagram of a scenario involved in a video decoding method according to another embodiment of the present invention;

FIG. 5 is a flowchart of a method for video encoding according to another embodiment of the present invention;

FIG. 6 is a flowchart of a method for decoding a video according to another embodiment of the present invention;

7 is a flowchart of a method of a lens segmentation method of the video encoding method shown in FIG. 5;

8 is a flowchart of a method for extracting a key frame of the video encoding method shown in FIG. 5;

9 is a flowchart of a method for scene classification of the video encoding method shown in FIG. 5;

10 is a flowchart of a method based on an SVM classification method of the video encoding method shown in FIG. 5;

11 is a flowchart of a method for reconstructing an RPCA based scene of the video encoding method shown in FIG. 5;

FIG. 12 is a flowchart of a method for a video encoding method according to another embodiment of the present invention;

13 is a schematic diagram of a scenario of the video encoding method shown in FIG. 12;

14 is a schematic diagram of a scenario of one of the specific methods of the video encoding method shown in FIG. 12;

15 is a schematic diagram of a scenario of one of the specific methods of the video encoding method shown in FIG. 12;

16 is a schematic diagram of a scenario of one of the specific methods of the video encoding method shown in FIG. 12;

FIG. 17 is a flowchart of a method for decoding a video according to another embodiment of the present invention;

FIG. 18 is a schematic structural diagram of a video encoding apparatus according to another embodiment of the present invention;

18b is a partial structural diagram of the video encoding apparatus shown in FIG. 18a;

FIG. 19 is a schematic structural diagram of a video decoding device according to another embodiment of the present invention;

FIG. 20 is a schematic structural diagram of a video codec device according to another embodiment of the present invention;

21 is a schematic block diagram of a video codec system 10 in accordance with an embodiment of the present invention;

22 is a block diagram illustrating an example video encoder 20 that is configured to implement the techniques of the present invention;

23 is a block diagram illustrating an example video decoder 30 that is configured to implement the techniques of the present invention.

detailed description

The embodiments of the present invention provide a video encoding method, a video decoding method, a video encoding device, and a video decoding device, which are used to improve the compression efficiency of a video frame, thereby reducing the network transmission load and reducing the storage load on the video frame.

For independently coded video frames, there is often a large amount of compressed data in these video frames after encoding, and there is a large amount of information redundancy between compressed video frames, resulting in increased network transmission burden and storage burden, which is not conducive to data access. .

To this end, in the video coding method of the embodiment of the present invention, after the encoding device acquires a plurality of video frames, each of the plurality of video frames includes redundant data on the screen content, and the plurality of video frames are performed. Reconstructing, obtaining scene information and reconstruction residuals of each video frame, wherein the scene information includes data obtained by reducing redundancy of redundant data, and the reconstructed residual is used to represent a difference between the video frame and the scene information. value. Then, the scene information is predictively coded, the scene feature prediction coded data is obtained, and the reconstructed residual is predictively coded to obtain residual prediction coded data. In this way, by performing the process of reconstructing the plurality of video frames, the redundancy of the video frames can be reduced, so that in the encoding operation, the obtained scene features and the reconstructed residual total compressed data amount are relative to the original video. The amount of compressed data of the frame is reduced, reducing the amount of data obtained after compression. Each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual includes residual information other than the scene information, the amount of information is small and sparse, and the feature can be compared when performing predictive coding. The codewords are less predictively encoded, the amount of encoded data is small, and the compression ratio is high. Thus, the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.

Correspondingly, the embodiment of the present invention further provides a video decoding method, which is used to decode scene feature prediction encoded data and residual prediction encoded data obtained by the video encoding device, obtain scene information, and reconstruct residuals, according to The scene information and the reconstructed residual are reconstructed to obtain a video frame.

In the HEVC standard, key frames are independently coded, wherein key frames are also referred to as I frames. After compression, the I frame has a high proportion of compressed data and a large amount of information redundancy between I frames. In this regard, if the video coding method of the embodiment of the present invention is used for the I frame at the time of encoding, the coding efficiency of the I frame can be improved.

In order to explain the video frame encoding and decoding method provided by the embodiment of the present invention, a part of the following content uses the HEVC standard scenario. In order to facilitate the understanding of the full text, the HECV standard is briefly introduced.

HEVC (H.265) is a widely used and successful video codec standard. HEVC is a block-based hybrid coding method, which includes several modules such as prediction, transform, quantization, entropy coding, and loop filtering. The prediction module is a core module of the HEVC codec method, and may be specifically classified into an intra prediction and an inter prediction module. Intra prediction, that is, using the pixels already encoded in the current image to generate prediction values. Inter prediction, that is, reconstructing a pixel using the encoded image that has been previously in the current image to generate a predicted value. Since interframe prediction uses residual coding, compression is relatively high.

The existing intra prediction module of the HEVC standard only uses the current image intraframe information for encoding and decoding, and adopts a fixed strategy according to the video time axis, and does not take into consideration the context context information of the video, so the encoding and decoding efficiency is low. The compression ratio is not high. For example:

1) Scene 1: In the movie, A and B perform dialogues. The director frequently switches between A and B to express the inner feelings of the characters. At this time, it is suitable to divide and cluster all the lenses related to A, and perform inter-frame and intra-prediction encoding uniformly.

2) Scene 2, the TV drama shooting venue is mainly divided into grass, beach and office scenes. At this time, it is suitable to identify and classify all grasses, beaches, and office scenes, extract scene feature information uniformly, and express and predict key frames.

As shown in Figure 1, it shows the HEVC predictive coding process. Referring to Figure 1, HEVC predictive coding uses both intra-frame compression and inter-frame compression. The GOP step size is set first before encoding, that is, the number of frames included in the GOP. To prevent motion changes, the number of frames should not be set too much. In the specific prediction coding, HEVC divides all frames into three types of frames: I, P, and B, as shown in Figure 1. The numbers above the frames in Figure 1 indicate the number of the corresponding frame in the original video sequence. When encoding, the I frame, the P frame, and the B frame are encoded in units of GOP.

Intra-frame (Intra-frame), also known as intra-framed frame, is an independent frame with all the information. It can be independently encoded and decoded without reference to other images. It can be simply understood as a static picture. Usually the first frame in each GOP is set to an I frame, and the length of the GOP also represents the interval between two adjacent I frames. The I frame provides the most critical information in the GOP, and the amount of information in the data is relatively large, so the compression is relatively poor, generally around 7:1.

The specific process of I frame coding is as follows:

1) performing intra prediction to determine the intra prediction mode used;

2) subtracting the predicted value from the pixel value to obtain a residual;

3) transforming and quantizing the residuals;

4) programming coding and arithmetic coding;

5) Reconstruct the image and filter it, and the obtained image is used as a reference frame for other frames.

A P-frame (Predictive frame) is also called an inter-predictive coded frame. It needs to refer to the previous I frame to encode. Indicates the difference between the current frame picture and the previous frame (the previous frame may be an I frame or a P frame). When decoding, it is necessary to superimpose the difference defined by this frame with the previously buffered picture to generate the final picture. P frames typically occupy fewer bits of data than I frames, but the disadvantage is that P frames are very sensitive to transmission errors because of the complex dependence of the previous P and I reference frames. Since the residual is used for encoding, the amount of coded information required for the P frame is greatly reduced relative to the I frame, and the compression ratio is relatively high, generally around 20:1.

A bi-directional frame is also called a bidirectional predictive coding frame, that is, a B frame records the difference between the current frame and the previous and subsequent frames. Decoding a B frame requires not only the previous buffered picture but also the P frame picture after decoding, and the final picture is obtained by superimposing the previous and subsequent pictures and superimposing the current frame data. B frame compression rate is high, but the decoding performance is high. The B frame is not a reference frame and does not cause a spread of decoding errors. In addition, B frames have the highest encoding compression ratio, and the general compression ratio is around 50:1.

The specific process of BP frame coding is as follows:

1) Perform motion estimation and calculate a rate distortion function (section) value using an interframe coding process coding mode. The P frame only refers to the previous frame, and the B frame can refer to the following frame.

2) Perform intra prediction, select an intra mode with the smallest rate distortion function value and compare the inter mode to determine which encoding mode to use.

3) Calculate the difference between the actual value and the predicted value.

4) Transform and quantize the residuals.

5) Entropy coding, if it is an inter coding mode, encodes a motion vector.

The decoding process of HEVC is the reverse process of the encoding process, and will not be described here.

The HEVC codec method relies too much on I frame coding and has the following drawbacks:

1) The amount of I frame compressed data is high. I frame coding only performs spatial compression on intraframe data without considering redundant information between adjacent frames. The amount of compressed data is large, usually about 10 times that of P frames. The GOP step size needs to be preset before encoding. The I frame ratio is determined by the setting of the GOP step size. As shown in FIG. 1, when the GOP step size is set to 13, the ratio of the I frame to the BP frame is 1:12. According to the respective compression ratios of the IBP frames, the ratio of the final I frame to the BP frame compressed data is approximately 2:5. Generally, a larger GOP step size can be set to reduce the I frame ratio to improve the overall compression ratio of the video, but this also causes a decrease in the quality of the compressed video.

2) There is a large amount of information redundancy between I frames. The I frames are sequentially extracted according to the time axis sequence, and the interval between adjacent I frames is GOP step. The selection strategy does not take into account the contextual context information of the video. For example, for two video segments that are not consecutive in time but highly correlated in the picture content, if the I frame is extracted according to the GOP step size and the individual intra coding is performed, a large amount of information redundancy is caused.

The embodiment of the present invention proposes a video encoding and decoding algorithm based on intelligent video scene classification, in view of the problem that the original HEVC relies too much on I frame coding and the compression efficiency is not high. The method identifies and classifies the video shots and scenes, performs key data analysis and reconstruction on the key frames (I frames), and encodes the scene information and the representation residuals. It effectively avoids the problem of inefficient compression in a single key frame, and introduces video context information to improve the compression ratio.

2 is a flowchart of a video frame encoding and decoding method according to an embodiment of the present invention. The video frame encoding and decoding method includes an encoding method part and a decoding method part. Referring to FIG. 2, the video frame coding and decoding method includes:

Step 201: Acquire multiple video frames.

Wherein each of the plurality of video frames includes redundant data on the picture content.

The obtaining of the multiple video frames may be obtained from the video stream according to a preset rule after the video stream is acquired, or the video codec may acquire the multiple video frames from other devices, which is used by the embodiment of the present invention. No specific limitation. Wherein, the plurality of embodiments of the present invention refer to at least two.

The redundant data is data related to the content of the screen among the plurality of video frames, and information redundancy exists. The redundant data may be redundant data on the overall picture of the video frame, such as the description of the embodiment shown in Figure 5 below. It may also be redundant data on a partial picture of a video frame, such as the description of the embodiment shown in FIG.

In some embodiments of the invention, the plurality of video frames are obtained from a video stream. Specifically, the codec device divides the video lens by the scene transition detection technology on the premise of the overall video data stream, and determines whether it is a static lens. Video frame extraction is performed for each lens according to the type of lens.

For example, in the lens segmentation step, the original video stream is segmented into a short-sized lens unit by a scene change detection technique. Among them, each shot is composed of video frames that are continuous in time, and represents a temporally and spatially continuous motion in a scene. The specific lens segmentation method can perform boundary segmentation and discrimination processing on the lens according to the change of the content of the video frame. For example, by locating the lens boundary and finding the position or time point of the boundary frame, the video can be segmented accordingly.

After the lens is segmented from the video stream, the video frame of the lens is extracted on the basis of the lens segmentation, and the extracted video frame is the video frame to be acquired in step 201. The extraction of the video frame is adaptively selected according to the length of the lens and the content change, and may be one or more frames of images capable of reflecting the main information content of the lens.

Of course, in some embodiments of the present invention, the codec device may directly extract a plurality of video frames that perform the following encoding method from the video stream, for example, extract the video frames according to a preset step size, and the like.

Step 202: Perform reconstruction on multiple video frames to obtain scene information and reconstruction residuals of each video frame.

The scene information includes data obtained by reducing redundancy of redundant data, and the reconstructed residual is used to represent a difference between the video frame and the scene information.

The redundancy of the multiple video frames can be reduced by the reconstruction. There are various specific reconstruction methods. Correspondingly, the obtained scene information can also be in various forms. For details, refer to the following description. The scene information includes data obtained by reducing redundancy between redundant data frames, and a reconstructed residual represents a difference between a video frame and a scene feature, thereby reconstructing scene information obtained by reconstructing the plurality of video frames. The reconstructed residual reduces the redundancy of redundant data compared to the original video frame, reduces the overall amount of data, and maintains a complete amount of information.

Step 202 may be referred to as a scene reconstruction operation. The scene reconstruction is to analyze the content of the video frame, and extract scene information suitable for representing the overall scene information. In some embodiments, the scene information includes scene features, and in some embodiments the scene information includes scene features and representation coefficients. The scene feature refers to feature information capable of describing the whole or partial screen content of the scene, and may be a specific frame picture or a partial image block of the original image pixel representation space, or may be a feature base of the feature representation space, such as a wavelet feature base and a sparse coding. Dictionary base and so on.

The purpose of scene reconstruction is to reduce the redundancy of key frames in the scene. The scene feature extraction principle is that the scene feature representation succinctly occupies a small amount of data, and the data reconstructed according to the scene information matches the original image as much as possible, so that the reconstructed residual amount is small. The scene reconstruction operation directly affects the compression effect of the video encoding.

In an embodiment of the present invention, before step 202, the method of the embodiment of the present invention further includes the operation of classifying the plurality of video frames, for example, classifying the plurality of video frames based on the correlation of the screen content, A video frame of one or more clusters is obtained, and step 202 is performed subsequent to the video frames of the same cluster. The redundancy of redundant data between multiple video frames belonging to the same cluster is in accordance with a preset requirement, for example, greater than a threshold.

The specific classification methods include various methods, such as a cluster-based method, a method using a classifier, and the like, for example, feature extraction and description of key frames, and clustering key frames in the feature space. The specific implementation process is described in detail in the following embodiments, which are not specifically limited in this embodiment of the present invention.

In some embodiments of the present invention, by dividing a video stream to obtain a plurality of shots, a video frame for performing the method of the embodiment of the present invention is extracted for each shot. At this time, the video frame extracted by one shot can be reflected. The characteristics of the lens, and thus the classification of the extracted video frames, can also be referred to as scene classification of the lens. The purpose of scene classification is to combine video frames extracted from the lens that are strongly related in content, so that the entire scene content can be analyzed later. The specific strategy of scene classification is realized by analyzing and clustering key frames of each lens. The principle of scene classification is that the video frames in each cluster are highly correlated on the screen content, and there is a large amount of information redundancy. This operation plays a decisive role in the subsequent scene reconstruction operation. The better the classification effect is, the intra-class information is highly aggregated, and the larger the information redundancy, the higher the coding efficiency.

Step 203: Perform predictive coding on the scene information to obtain scene feature prediction encoded data.

After the scene information is obtained, it can be predictively encoded to obtain scene feature prediction encoded data.

Step 204: Perform predictive coding on the reconstructed residual to obtain residual prediction encoded data.

After the reconstructed residual is obtained, it can be predictively coded to obtain residual prediction encoded data. Wherein, when performing specific coding, intra prediction coding or inter prediction coding may be employed.

After the reconstruction operation of step 202, the reconstruction residual has sparse characteristics because it does not include scene features. For example, when the reconstruction residuals are represented by a matrix, most of them are 0, and only a few are not 0. The value is therefore small in the amount of encoded information.

Since the scene information and the reconstructed residual are compared with the original multiple video frames, the redundancy of the redundant data is reduced, so that the amount of data to be encoded is reduced, so that the scene feature predictive coding data and the residual obtained after the encoding are obtained. The amount of data of the difference prediction coded data is reduced, and since the video frame is represented by the scene information and the reconstructed residual, respectively, and the reconstructed residual represents the difference between the video frame and the scene feature, the obtained reconstruction is performed. The residual has a sparse characteristic, so that the amount of coded information of the reconstructed residual is reduced.

The above steps 201 to 204 are video encoding methods, and the following are the steps of the video decoding method.

Step 205: Acquire scene feature prediction encoded data and residual prediction encoded data.

The video codec device acquires the predicted feature encoded data and the residual predictive encoded data with the encoded scene.

Step 206: Decode the scene feature prediction encoded data to obtain scene information.

The video codec device encodes the scene feature prediction encoded data to obtain scene information. As can be seen from the above description, the scene information includes data obtained by reducing the redundancy of redundant data, which is redundant data on the screen content between each of the plurality of video frames.

Step 207: Decode the residual prediction encoded data to obtain a reconstructed residual.

The video codec also decodes the residual prediction encoded data to obtain a reconstructed residual. According to the description of the encoding process described above, the reconstructed residual is used to represent the difference between the video frame and the scene information.

It should be understood that the execution sequence of step 206 and step 207 is not specifically limited in the embodiment of the present invention.

Step 208: Perform reconstruction according to the scene information and the reconstructed residual to obtain a plurality of video frames.

The scene feature prediction coding data and the reconstruction residual include information of the video frame, and the video information and the reconstruction residual are reconstructed to obtain a plurality of video frames.

It can be understood that the embodiments of the present invention can be used in various scenarios, for example, the video frame encoding and decoding method of the foregoing embodiment of the present invention is used in an HEVC scenario. At this time, the video frame obtained in step 201 of the foregoing embodiment is a key frame (I frame) in the HEVC scenario. After the step 202, the method in the embodiment of the present invention further includes: reconstructing the key frame (I frame) And use the traditional BP frame inter-prediction coding for the remaining frames as a reference. The method of the embodiment of the present invention further includes performing transform coding, quantization coding, and entropy coding on the predictive coded data according to the HEVC coding process to obtain video compression data. The predictive coding data includes scene feature prediction encoded data, residual predictive encoded data, B-frame predictive encoded data, and P-frame predictive encoded data. For details, refer to FIG. 3a, which is a schematic diagram of a flow of a video encoding method and a flow of an existing HEVC encoding method according to an embodiment of the present invention, and FIG. 3b is a scenario related to a video encoding method according to an embodiment of the present invention. schematic diagram.

Correspondingly, in the decoding operation, after the video codec device obtains the video compression data, the video compression data is subjected to entropy decoding, inverse quantization processing, and DCT (discrete cosine transformation) to obtain corresponding prediction according to the HEVC decoding process. Encoded data. The above-described operations of steps 205 to 208 are then performed using the scene feature prediction encoded data and the residual prediction encoded data in the predictive encoded data. The video frame reconstructed in step 208 is a key frame. Subsequently, the method in the embodiment of the present invention further includes performing BP frame decoding according to the decoded key frame data, and arranging the decoded data frames in time sequence to obtain a complete sequence of the original video. . For details, refer to the content shown in FIG. 4a. FIG. 4a is a schematic diagram of a comparison between a flow of a video decoding method and a flow of an existing HEVC decoding method according to an embodiment of the present invention. FIG. 4 is a schematic diagram of a scenario of a video decoding method according to an embodiment of the present invention.

The original HEVC is too dependent on the I frame coding and the compression efficiency is not high. The method of the embodiment of the present invention is used for the key frame. In the prior art, the I frame is independently coded, so that the I frame compression data amount is high, and I There is a large amount of information redundancy between frames. Through the execution of the method of the embodiment of the present invention, the redundant information of the I frame is reduced, and the amount of encoded data of the I frame is reduced. In particular, the method of the embodiment of the present invention identifies and classifies a video shot and a scene, performs overall data analysis and reconstruction on a key frame (I frame) in the scene, and encodes the scene feature and the representation residual. It effectively avoids the problem of inefficient compression in a single key frame, and introduces video context information to improve the compression ratio.

It can be understood that the method in the embodiment of the present invention can also be used in other video frames that need to be independently coded, and reconstructed by using a video frame that needs to be independently coded to obtain scene information and reconstructed residuals, and separately coded. Reduce the amount of compressed data that would otherwise require a separately encoded video frame.

In order to describe the method of the embodiment of the present invention, the method of the embodiment of the present invention is described in the context of the HEVC standard. It should be understood that the video frame encoding and decoding method provided by the embodiment of the present invention can also be applied to other scenarios. The specific usage scenarios are not limited in the embodiment of the present invention.

According to the specific implementation method of reconstructing the video frame to obtain the scene information and reconstructing the residual, two specific embodiments will be hereinafter described. Wherein, in one embodiment, the overall frame picture of the reconstructed video frame has redundant data, and in another embodiment, the partial frame picture of the reconstructed video frame has redundant data.

First, the overall frame picture of the video frame has redundant data

FIG. 5 is a flowchart of a method for a video encoding method according to an embodiment of the present invention. Referring to FIG. 5 and FIG. 3b, a video encoding method according to an embodiment of the present invention includes:

Step 501: Acquire a video stream.

The encoding device acquires a video stream that includes a plurality of video frames.

Step 502: Perform lens segmentation on the video stream to obtain multiple shots.

After acquiring the video stream, the lens segmentation module of the encoding device may segment the video stream into multiple shots to extract a video frame to be reconstructed according to the lens. Of course, it is also possible to draw a shot from a video stream.

The lens includes temporally consecutive video frames, and the lens represents a temporally and spatially continuous motion in a scene.

Specifically, referring to FIG. 7, step 502 can be implemented by the following steps:

Step A1: Acquire a video stream.

Step A1 is step 501, wherein the video stream includes a plurality of video frames.

Step A2: Extract feature information of the first video frame and the second video frame, respectively.

The feature information is used to describe the picture content of the video frame. In order to analyze the video frame of the video stream, it may be analyzed by feature information of the video stream, which is information for describing characteristics of the video frame, for example, image color, shape, edge contour or texture feature, and the like.

The first video frame and the second video frame are video frames in the video stream, and the first video frame and the second video frame are not currently assigned to any of the shots.

Step A3: Calculate the lens distance between the first video frame and the second video frame according to the feature information.

Wherein, the lens distance is used to indicate the degree of difference between the first video frame and the second video frame.

Step A4: Determine whether the lens distance is greater than a preset lens threshold.

The preset lens threshold can be set manually.

Step A5: If the lens distance is greater than the preset lens threshold, the target lens is segmented from the video stream, and if the lens distance is less than the preset lens threshold, the first video frame and the second video frame are attributed to the same lens.

The start frame of the target lens is the first video frame, and the end frame of the target lens is the previous video frame of the second video frame, the target lens belongs to one of the lenses of the video stream, and the lens is a segment in time. Continuous video frames.

If the lens distance between the first video frame and the second video frame is greater than the preset lens threshold, indicating that the difference between the first video frame and the second video frame reaches a preset requirement, and the first video frame and the second video frame are The difference between the video frame and the first video frame does not reach a preset requirement, that is, less than the preset lens threshold, so that in the video stream, from the first video frame to the previous video frame of the second video frame The video frame belongs to the target lens. Otherwise, when the first video frame is located before the second video frame, the lens distance is calculated by the next frame of the second video frame and the first video frame, and steps A4 and A5 are repeated. Thus, through the repeated execution of the above steps, multiple shots can be obtained from the video stream.

For example, the feature information of the video frame is first extracted, and the content is measured based on the feature. A more common method is to extract image color, shape, edge contour or texture features, or extract multiple features and normalize them. To improve the segmentation efficiency, the method of the embodiment of the present invention describes the image by using a block color histogram. The video image frame is first scaled to a fixed size (eg 320*240) and the image is downsampled to reduce the effects of noise on the image. Then, the image is 4*4 divided, and each block extracts an RGB color histogram. To reduce the impact of illumination on the image, the histogram is equalized. Then, the distance between the video frames is calculated based on the feature information of the video frame. The distance between video frames, that is, the lens distance, can be measured by a measure such as Mahalanobis distance and Euclidean distance. To eliminate the effects of illumination, this example uses the normalized histogram intersection method to measure. The preset lens threshold is preset. When the lens distance is greater than the preset lens threshold, the video frame in the front of the two video frames in which the lens distance is calculated is determined as the frame boundary start frame, and the video frames in the two video frames are located behind. The previous frame is determined to be the boundary end frame of the previous shot, otherwise the two video frames belong to the same shot. Finally, you can split a complete video into multiple sets of separate shots.

Step 503: Extract key frames from the obtained shots.

After the encoding device cuts out the lens, a key frame is extracted from each lens, and the reconstruction operation of the method of the embodiment of the present invention is performed with the key frame.

Specifically, after the above-described lens segmentation step, step 303 can be implemented by the execution of step A5.

Step A5: For each shot in the video stream, the key frame is extracted according to the frame distance between the video frames in the shot.

The frame distance between any two adjacent key frames in each shot is greater than a preset frame distance threshold, and the frame distance is used to indicate the degree of difference between the two video frames. Then, the reconstruction of the plurality of video frames is performed with key frames of each shot to obtain scene information and a reconstruction residual of each video frame.

For example, current key frame extraction algorithms mainly include sampling-based methods, color feature-based methods, content-based analysis methods, motion analysis-based methods, cluster-based methods, and compression-based methods. Since the BP frame needs to refer to the previous frame for inter prediction during encoding, the starting frame of each shot is set as a key frame. Feature description and distance metric are used for each frame by using block color histogram feature and histogram intersection method. In order to make the extraction of the key frame faster, the method of the embodiment of the present invention increases the judgment of the type of each lens, that is, first determines whether the lens is a static picture according to the adjacent frame feature space distance, if all the frames in the lens are between If the frame distance is 0, it is determined to be a static picture, and the key frame is no longer extracted, otherwise it is a dynamic picture. For the dynamic picture, the content of each frame is measured in chronological order from the previous key frame, and if the distance is greater than the set threshold, the frame is set as a key frame. Figure 8 shows the key frame extraction process.

Of course, in some embodiments of the present invention, it is not necessary to judge whether the lens belongs to a static picture or a dynamic picture.

The method of the embodiment of the present invention is described in the HEVC scenario. The lens obtained in the above steps can be used as a GOP, and one lens is a GOP. In a shot, the start frame of the lens is a key frame, and the lens is taken from the lens through step A5. The video frame extracted in is also a key frame, and other video frames of the lens can be used as a B frame and a P frame. The key frame extraction operation of the embodiment of the present invention takes into account the context context information of the video, so that when the key frames are subsequently classified, the classification effect of the key frames is better, which contributes to the improvement of the compression ratio of the subsequent coding. .

In the method of the embodiment of the present invention, the key frame sequence is quickly generated, and can respond to the user's fast forward and switch scene requirements in time. The user can preview the video scene according to the sequence of key frames, and accurately locate the video scene segments that are of interest to the user, thereby improving the user experience.

It can be understood that the key frame for performing the following reconstruction operation can be obtained, and other methods can be used in addition to the above methods. For example, a video stream is acquired, wherein the video frames of the video stream include an I frame, a B frame, and a P frame. Then, an I frame is extracted from the video stream, and step 504 or step 505 is performed with the I frame.

Through the execution of the above method, the encoding device acquires a plurality of key frames, which are video frames to be reconstructed to reduce redundancy. In order to further reduce the redundancy of the redundant data of the video frame by the method of the embodiment of the present invention, after the multiple key frames are acquired, the method of the embodiment of the present invention further includes the step of classifying the key frame, that is, step 504.

Step 504: Classify a plurality of key frames based on the correlation of the picture content to obtain key frames of one or more classification clusters.

The method of the embodiment of the present invention may perform the step 505 in the key frame of the same cluster by using the method of the embodiment of the present invention.

In the same cluster according to the correlation classification of the screen content, the screen content between the key frames is highly correlated, and there is a large amount of redundant data. If the classification effect is better, that is, the plurality of key frame information in the same cluster is highly aggregated. The greater the redundancy of multiple key frames in the same cluster, the more significant the effect of subsequent reconstruction operations on the reduction of redundancy.

For example, in the embodiment of the present invention, one or more classification clusters are obtained after the classification operation, and there are more portions of the same content content among the multiple key frames in the same classification cluster, so that redundant data redundancy between the key frames is performed. Larger.

In the classification operation, if different key frames are classified based on the lens, the classification may also be referred to as a scene classification. Of course, the classification operation may also directly classify different key frames without being based on the lens. The classification operation of the method provided by the embodiment of the present invention is referred to as a scene classification operation.

Among them, there are various specific classification methods. Two examples are given below, one is the classification method of clustering, and the other is the classification method using classifiers.

1) Classification method of clustering

In the clustering classification method, the plurality of key frames are classified based on the correlation of the screen content, and the key frames of the one or more classification clusters are obtained, including:

Step B1: Extract feature information of each key frame of the plurality of key frames.

The feature information of the key frame may be an underlying feature or a middle layer semantic feature.

Step B2: Determine the cluster distance between any two key frames according to the feature information.

Among them, the cluster distance is used to represent the similarity between two key frames. Any two key frames here include all the key frames extracted in the above steps, which may be key frames belonging to different shots, or key frames belonging to the same shot.

The difference between frames in the lens is smaller than the difference between frames in different lenses. In order to effectively divide the scene classification, different feature spaces may be selected, and different feature spaces correspond to different metrics, so that the cluster distance and the lens distance may be different.

Step B3: Cluster the video frames according to the cluster distance to obtain video frames of one or more clusters.

For example, scene classification is achieved by analyzing and clustering key frames of each lens. Scene classification is closely related to scene reconstruction. Prerequisite to the video coding task, the first principle of scene classification is that the key frames in each cluster are highly correlated at the content level of the screen, and there is a large amount of information redundancy. The existing scene classification algorithms are mainly divided into two categories: a) based on the underlying feature scene classification algorithm; b) based on the middle layer semantic feature modeling scene classification algorithm. These methods are based on feature detection and description, and reflect the description of the scene content at different levels. The underlying image features may include features such as color, edge, texture, SIFT (Scale-invariant feature transform), HOG (Histogram of Oriented Gradient), and GIST. Middle-level semantic features include Bag of Words, deep learning network features, and more. In order to improve efficiency, the embodiment of the present invention selects a relatively simple GIST global feature to describe the overall content of the key frame. The distance measure function uses the Euclidean distance to measure the similarity of the two images. The clustering algorithm can adopt traditional K-means, graph cutting, hierarchical clustering and other methods. In this embodiment, a condensed hierarchical clustering algorithm is used to cluster key frames. The method cluster number depends on the similarity threshold setting. The higher the threshold setting, the greater the amount of keyframe information redundancy within the class and the corresponding number of clusters. The specific flow chart of the scene classification is shown in the following figure.

The above clustering-based scene classification strategy is beneficial to the improvement of coding speed. The following classification mechanism based on the classifier model is beneficial to the improvement of coding precision.

The main idea of the scene classification strategy based on the classifier model is to perform discriminant training on each shot according to the shot segmentation result to obtain a plurality of discriminant classifiers. Each key frame is discriminated by the classifier, and the key frame with a high score is considered to be the same scene as the lens. The specific process is as follows:

2) Classification method using classifier

In the classification method using the classifier, the classification method of the video coding method in the embodiment of the present invention includes:

Step C1: Perform discrimination training according to each shot segmented from the video stream to obtain a plurality of classifiers corresponding to the shots.

Among them, the optional classifier models are: decision tree, Adaboost, Support Vector Machine (SVM), deep learning and other models.

Step C2: Using the target classifier to discriminate the target key frame to obtain a discriminant score.

The target classifier is one of the plurality of classifiers obtained in step C1, and the target video frame is one of the key frames, and the discriminant score is used to indicate the extent to which the target video frame belongs to the scene to which the target classifier belongs.

In this way, the type discrimination of each key frame can be realized, and it is determined whether a key frame belongs to the same scene as one shot.

Step C3: When the discriminant score is greater than the preset score threshold, it is determined that the target video frame belongs to the same scene as the shot to which the target classifier belongs.

When the discriminant score is greater than the preset score threshold, the target key frame may be considered to belong to the same scene as the shot to which the target classifier belongs. Otherwise, the target key frame and the shot to which the target classifier belongs are not considered to belong to the same scene.

Step C4: Determine video frames of one or more clusters according to video frames belonging to the same scene as the shot.

For example, take SVM as an example. As shown in Figure 10, the operation of classifying using a classifier includes two main phases, as follows:

2.1) Model training

First, discriminate training for each lens. All video frames contained in each shot are positive samples. All video frames in the two lenses adjacent to the lens are negative samples. The classifier parameters are trained according to the training samples. The training formula for each lens classifier is as follows:

Where y _i is the label corresponding to the i-th training sample, the positive sample corresponds to the label 1 and the negative sample is -1. φ(·) is the feature mapping function, n is the total number of training samples, w is the classifier parameter, and I is the training sample.

2.2) Scene classification

The keyframes are discriminated by the classifier model trained by each lens. The specific formula is as follows:

Indicates the probability that the key frame i belongs to the same scene as the shot j. In the formula, w _j and b _j are the classifier parameters corresponding to the jth lens, and the denominator is the normalization factor. When the probability is greater than the set threshold, it is considered that the key frame i and the shot j belong to one scene. Where i and j are positive integers.

In this way, by the above operation, the correspondence between multiple sets of key frames and shots can be obtained. These correspondences indicate that the key frames and the shots belong to the same scene, and then the encoding device can determine the video frames of one or more clusters according to the corresponding relationships. .

It can be understood that, in the example of using the SVM classifier, a specific scenario of the two-class classifier is used, and the embodiment of the present invention can also operate by using a multi-class classification algorithm.

It will be appreciated that in some embodiments of the invention, step 504 may not be included.

It can be understood that the above description is based on the HEVC scenario, and the coding method provided by the embodiment of the present invention is described. In other scenarios, the key frames in the foregoing steps may be directly described by video frames.

Through the above method, after multiple video frames are classified based on the correlation of the picture content, and the key frames of one or more classification clusters are obtained, the redundancy of the redundant data between the key frames in the same classification cluster is large, thereby When the subsequent reconstruction of the key frames of the same cluster is performed, the redundancy of the redundant data can be further reduced to further reduce the amount of encoded data. In addition, in the embodiment in which the key frame of the lens is classified, the video is compressed according to the scene, and the content is clipped later, and the video green mirror (that is, the essence video is generated according to the heat analysis) is facilitated.

Step 505: Perform reconstruction on multiple key frames of the same cluster to obtain scene features and reconstruction residuals of each video frame.

Each of the plurality of key frames includes the same picture content, that is, redundant data included on the picture content between each key frame. If these key frames are not reconstructed, the encoding device will repeatedly encode the same picture content between these key frames. The reconstructed scene features are used to represent the same picture content between each video frame, such that the scene information includes data resulting from reducing redundancy of redundant data. The reconstructed residual is used to represent the difference between the key frame and the scene feature. The scene feature thus obtained may represent the overall information of the frame, so that the reconstruction operation of step 505 is directed to a scene in which the entire screen of the plurality of video frames has the same picture content.

The specific implementation of step 505 is as follows:

Convert keyframes of the same taxonomy into observation matrices. The observation matrix is used to represent the plurality of key frames in a matrix form. Then, the observation matrix is reconstructed according to the first constraint condition to obtain a scene feature matrix and a reconstructed residual matrix.

The scene feature matrix is used to represent the scene features in a matrix form, and the reconstructed residual matrix is used to represent the reconstructed residuals of the plurality of key frames in a matrix form. The first constraint is used to define the scene feature matrix low rank and the reconstructed residual matrix is sparse.

In some embodiments of the present invention, the observation matrix is reconstructed according to the first constraint condition, and the scene feature matrix and the reconstructed residual matrix are obtained, including: calculating the scene feature matrix and reconstructing the residual according to the first preset formula a difference matrix, the scene feature matrix is a low rank matrix, and the reconstructed residual matrix is a sparse matrix;

The first preset formula is:

or,

s.t.D=F+E s.t.D=F+E

Represents seek the optimal value F and E, even though the values to obtain the objective formula rank (F) + λ || E || 1 or _{|| F || * + λ || E} || 1 minimum value F and E . Rank(·) is a matrix for the rank function, ||·|| ₁ is the matrix L1 norm, and ||·|| _* is the matrix kernel norm.

For example, the scene reconstruction is to perform content analysis on the scene of each cluster cluster obtained by the scene classification, and extract scene features and representation coefficients suitable for reconstructing all key frames in the scene. Models that can be used for scene reconstruction include RPCA (Robust Principle Component Analysis), LRR (low rank representation), SR (sparse representation), SC (sparse coding), and SDAE (Sparse Self-Coded Deep Learning Model). , CNN (convolution neural network) and so on. The representation coefficient of the embodiment of the present invention may be represented by a unit matrix, and the scene feature and the representation coefficient are still multiplied by the scene feature. Of course, in some embodiments of the present invention, since the representation coefficient can be ignored, it can be used as a unit matrix. The representation coefficient may or may not be used. In this case, in the decoding and reconstruction stage, only the scene feature and the reconstruction residual are required to represent the original video frame.

The video coding method in this embodiment uses RPCA to reconstruct key frames in the scene. The scene reconstruction strategy based on RPCA method reconstructs the overall content data of key frames, which can reduce the square phenomenon caused by block prediction.

Suppose a scene S contains N key frames, that is, a certain cluster includes N key frames, and N is a natural number. The pixel values of all key frame images in the same cluster are drawn into a column vector to form an observation matrix D, that is, D=[I ₁ , I ₂ , . . . , I _n ], where Ii is the ith key frame The column vector representation. Since the content between the key frames in the same scene is similar, it can be assumed that each key frame contains the same scene feature f _i , and the feature matrix F=[f ₁ , f ₂ , . . . , f _n ] composed of scene features is essentially It should be a low rank matrix; each key frame undergoes a small change on the basis of the F matrix to obtain the observation matrix D, so the reconstruction error E = [e ₁ , e ₂ , ..., e _n ] should be sparse.

Translate the scene reconstruction problem into the following optimization problem to describe:

s.t.D=F+E

Where λ is a weight parameter used to balance the relationship between the scene feature matrix F and the reconstructed residual matrix E, rank(·) is a matrix rank function, and ||·|| ₁ is a matrix L ₁ norm. The above optimization problem belongs to the NP problem, and can be relaxed to solve the following problems:

s.t.D=F+E

Where ||·|| _* is the matrix kernel norm. The above optimization problem can be solved by matrix optimization algorithms such as Accelerated Proximal Gradient (APG) and Inexact Augmented Lagrange Multiplier (IALM).

After the reconstruction, the scene feature and the reconstruction residual are obtained, and the original compression of the key frame I can be converted into the scene feature F and the reconstruction error E by reconstruction. Since the scene feature matrix F has a low rank and the reconstructed residual E is sparse, the amount of compressed data of the two is greatly reduced relative to the conventional I frame compression algorithm. Figure 11 shows an example diagram based on RPCA scene reconstruction, where key frames 1 to 3 belong to different shot segments of the same video. As can be seen from the figure, the scene feature matrix F rank is 1, so only one column of the matrix needs to be data compressed. The residual matrix E has a value of 0 in most of the regions, so only a small amount of information is needed to represent E.

The scene feature in the embodiment of the present invention is one specific implementation manner of the scenario information, and the step 505 is to reconstruct multiple video frames to obtain scene information and one of reconstructed residuals of each video frame. Specific implementation.

The method of the embodiment of the present invention may perform a reconstruction operation on a key frame having redundant data of the frame overall information. In order to efficiently reduce redundant data of the key frame by the reconstruction operation, the key frame needs to be detected first to determine the current Whether the selected key frame is suitable for the reconstruction operation of the method of the embodiment of the present invention, so that the adaptive coding can be performed according to the content of the video scene.

The method of the embodiment of the present invention further includes: extracting each of the multiple video frames, before the reconstructing the plurality of video frames to obtain the scene features and the reconstructed residuals of the video frames. The picture feature information of the frame, wherein the extracted picture feature information may be a global feature or a local feature of the video frame, and specifically includes a GIST global feature, a HOG global feature, a SIFT local feature, and the like, which are not specifically limited in this embodiment of the present invention. Then, the encoding device calculates the content metric information according to the picture feature information, where the content metric information is used to measure the difference of the picture content of the multiple video frames, that is, the content consistency measurement of the key frame, the key frame Content consistency metrics can be measured in terms of feature variance, Euclidean distance, and the like. And performing the step of reconstructing the plurality of video frames to obtain a scene feature and a reconstruction residual of each video frame, when the content metric information is not greater than a preset metric threshold.

For example, before the step 505, the method of the embodiment of the present invention further includes:

Step D1: Extract global GIST features of each of the plurality of video frames.

In the HEVC scenario described above, step D1 is to extract global GIST features for each of a plurality of key frames of the same cluster.

This global GIST feature is used to describe the characteristics of keyframes.

Step D2: Calculate the variance of the scene GIST feature according to the global GIST feature.

The scene GIST feature variance is used to measure the content consistency of multiple video frames.

In the above HEVC scenario, the scene GIST feature variance is used to measure the content consistency of multiple key frames of the same cluster.

Step D3: When the scenario GIST feature variance is not greater than the preset variance threshold, step 304 is performed.

After obtaining the scene feature and the reconstruction residual by performing the foregoing steps, the video coding and decoding device may perform intra prediction coding on the scene feature and the reconstructed residual, respectively.

The above steps D1 to D3 are specific methods for determining whether the key frame of the same cluster is applicable to step 505.

Step 506: Perform predictive coding on the scene features to obtain scene feature prediction encoded data.

Step 507: Perform predictive coding on the reconstructed residual to obtain residual prediction encoded data.

The predictive coding portion of the encoding device includes two parts of intra prediction coding and inter prediction coding. The scene features and reconstruction errors are encoded by intra prediction, and the remaining frames of the shot, that is, the non-key frames of the shot, are inter-predictive encoded. The specific process of intra prediction coding is similar to the HEVC intra coding module. Since the scene feature matrix has a low rank, only the key columns of the scene feature matrix need to be encoded. The reconstruction error belongs to residual coding, and the amount of coded data is small and the compression ratio is high.

Step 508: Perform reconstruction according to the scene feature and the reconstructed residual to obtain a reference frame.

In order to perform interframe predictive coding on B and P frames, a reference frame needs to be obtained. In the HEVC scenario, the key frame is used as the reference frame. In the above method, if the scene feature and the reconstructed residual use lossy compression, if the original key frame in step 504 is used for inter-frame prediction, the error diffusion phenomenon occurs when the BP frame is decompressed. In step 507, the reverse reconstruction scheme is adopted to prevent the error from spreading between the BP frames. The reconstruction is performed according to the scene feature and the reconstruction residual, and the following step 509 is performed with reference to the obtained reference frame.

It can be understood that if the scene feature and the reconstructed residual interframe predictive coding in

steps

506 and 507 adopt the lossless compression mode, the BP frame inter prediction can be directly performed through the key frame extracted in step 504.

Step 509: Perform inter-prediction encoding on the B frame and the P frame with reference to the reference frame, and obtain B frame predictive encoded data and P frame predictive encoded data.

Inter-frame coding first reconstructs the key frame (I frame) according to the scene features and reconstruction error, and then performs motion compensation prediction and coding on the BP frame content. The specific inter prediction encoding process is the same as HEVC.

Step 510: Perform transform coding, quantization coding, and entropy coding on the prediction encoded data to obtain video compressed data.

The predictive coded data includes scene feature predictive coded data, residual predictive coded data, B frame predictive coded data, and P frame predictive coded data.

The data is subjected to change coding, quantization coding, and entropy coding on the basis of predictive coding, which is the same as HEVC.

The video coding method in the embodiment of the present invention can improve the video compression ratio. In some embodiments, when the scene content is highly correlated, the entire scene information can be represented by a small amount of information, and the code rate is lowered, and the video quality is guaranteed. The reduction of the compression volume is more suitable for the transmission and storage of images in a low bit rate environment. Taking the digital video industry as an example, the existing on-demand (VOD), personal video recording (NPVR), and Catch-up TV video services account for 70% of the server's storage resources and network bandwidth. The technical solution of the embodiment of the invention can reduce the pressure of the storage server and improve the network transmission efficiency. In addition, the CDN edge node can store more videos, the user hit rate will be greatly increased, the return rate is reduced, the user experience is improved, and the network device consumption is reduced. Moreover, the method in the embodiment of the present invention can generate different code rate videos by performing feature extraction on different levels of the scene.

In summary, by reconstructing, the same picture content is de-duplicated and represented by scene features, which can reduce the redundancy of redundant information of the multiple video frames. Therefore, in the encoding operation, the obtained scene feature and the compressed data amount of the reconstructed residual total are reduced relative to the compressed data amount of the original video frame, and the amount of data obtained after compression is reduced. Each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual includes residual information other than the scene information, the amount of information is small and sparse, and the feature can be compared when performing predictive coding. The codewords are less predictively encoded, the amount of encoded data is small, and the compression ratio is high. Thus, the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.

After the compression encoded data is obtained by performing the above steps, the video codec device may perform a decompression operation on the compressed encoded data.

FIG. 6 is a flowchart of a method for decoding a video according to an embodiment of the present invention. Referring to FIG. 6 and FIG. 4b, a video decoding method according to an embodiment of the present invention includes:

Step 601: Acquire video compression data.

The decoding device acquires video compressed data, which may be video compressed data obtained by the video encoding method of the embodiment shown in FIG. 5.

Step 602: Perform entropy decoding, inverse quantization processing, and DCT inverse change on the video compressed data to obtain predictive encoded data.

The prediction encoded data includes scene feature prediction encoded data, residual prediction encoded data, B-frame predictive encoded data, and P-frame predictive encoded data.

In the HEVC scenario, corresponding to step 510, the video compression data needs to be entropy decoded, inverse quantized, and DCT inversely changed according to the HEVC decoding process to obtain corresponding predictive encoded data.

In this way, the acquisition of scene feature prediction encoded data and residual prediction encoded data can be achieved.

Step 603: Decode the scene feature prediction encoded data to obtain a scene feature.

Corresponding to the embodiment shown in FIG. 5, the scene feature is used to represent the same picture content between each video frame, and the scene feature obtained by decoding the scene feature prediction encoded data represents each video in the plurality of video frames. The same picture content between frames.

Step 604: Decode the residual prediction encoded data to obtain a reconstructed residual.

The reconstructed residual is used to represent the difference between the video frame and the scene information.

For example, the scene feature prediction encoded data and the key frame error prediction encoded data are respectively decoded to obtain a scene feature matrix F and a reconstructed residual e _i .

Step 605: Perform reconstruction according to the scene feature and the reconstructed residual to obtain multiple I frames.

In the embodiment shown in FIG. 5, the key frame is reconstructed to obtain the scene feature and the reconstruction residual. Therefore, in the coding method of the video frame, the scene feature and the reconstruction residual are reconstructed. The result is multiple keyframes.

Step 606: Perform inter-frame decoding on the B frame predictive coded data and the P frame predictive coded data by using the I frame as a reference frame to obtain a B frame and a P frame.

Step 607: Arranging the I frame, the B frame, and the P frame in chronological order to obtain a video stream.

After the I frame, the B frame, and the P frame are acquired, the video streams are obtained by arranging the three types of video frames in chronological order.

For example, the original data reconstruction is performed in combination with the decoded scene feature F and the key frame error e _i to obtain key frame decoded data. Finally, BP frame decoding is performed according to the decoded key frame data, and the decoded data frames are arranged in chronological order to obtain a complete sequence of the original video.

In this way, after the video compression data is obtained by the video coding method shown in FIG. 5, in some embodiments, the scene feature prediction coded data and the residual prediction coded data are obtained, and the data can be obtained by the video decoding method shown in FIG. 6. Decoding to get a video frame.

The embodiment shown in FIG. 5 is mainly applied to perform efficient compression in a redundant scenario in which overall information between key frames exists. The embodiment shown in FIG. 12 is applied to perform efficient compression in a redundant scene where local information of key frames exists, and the local information may be, for example, a texture image, a lens gradation, or the like.

FIG. 12 is a flowchart of a method for a video encoding method according to an embodiment of the present invention. Referring to FIG. 12, a video decoding method provided by an embodiment of the present invention includes:

Step 1201: Acquire a video stream.

For details of the implementation of step 1201, reference may be made to step 501.

Step 1202: Perform lens segmentation on the video stream to obtain multiple shots.

The implementation details of step 1202 can be referred to step 502.

Step 1203: Extract key frames from the obtained shots.

For details of the implementation of step 1203, refer to step 503.

Similar to the embodiment shown in FIG. 5, in the method of the embodiment shown in FIG. 12, the video frame to be reconstructed is obtained, and the video stream may also be acquired by other methods, for example, the video frame of the video stream. Includes I frames, B frames, and P frames. Then, extracting an I frame from the video stream, and performing subsequent steps of splitting each of the plurality of video frames by using the I frame to obtain a plurality of frame sub-blocks;

Step 1204: classify a plurality of key frames based on the correlation of the picture content to obtain key frames of one or more classification clusters.

The implementation details of step 1204 can be referred to step 504.

The specific classification method used in the method of the embodiment of the present invention can also refer to the related description of step 504.

The method of the embodiment of the present invention may perform a reconstruction operation on a key frame in which frame local information has redundant data. In order to efficiently reduce redundant data of a key frame through a reconstruction operation, the key frame needs to be detected first to determine the current Whether the selected key frame is suitable for the reconstruction operation of the method of the embodiment of the present invention, that is, before each video frame of the plurality of video frames is split to obtain a plurality of frame sub-blocks, the embodiment of the present invention The method further includes: extracting picture feature information of each of the plurality of video frames, wherein the extracted picture feature information may be a global feature or a local feature of the video frame, specifically a GIST global feature, a HOG global feature, The SIFT local features and the like are not specifically limited in the embodiment of the present invention. Then, the encoding device calculates the content metric information according to the picture feature information, where the content metric information is used to measure the difference of the picture content of the multiple video frames, that is, the content consistency measurement of the key frame, the key frame Content consistency metrics can be measured in terms of feature variance, Euclidean distance, and the like. When the content metric information is greater than the preset metric threshold, performing the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks.

For example, before the step 1205, the method of the embodiment of the present invention further includes:

Step E1: Extract global GIST features of each of the plurality of video frames.

In the HEVC scenario, step E1 is to extract global GIST features for each of a plurality of key frames of the same cluster. This global GIST feature is used to describe the characteristics of keyframes.

Step E2: Calculate the variance of the scene GIST feature according to the global GIST feature.

The scene GIST feature variance is used to measure the content consistency of multiple video frames;

Step E3: When the scene GIST feature variance is greater than the preset variance threshold, step 1205 is performed.

The video frames in the steps E1 to E3 are key frames in the HEVC scenario. In some embodiments of the present invention, the key frames are key frames belonging to the same cluster.

The above steps E1 to E3 are specific methods for determining whether the key frame of the same cluster is applicable to step 1205. If the variance of the scene GIST feature of the plurality of key frames is greater than the preset variance threshold, it indicates that the local part of the frame picture of the multiple key frames has redundant data, so that step 1205 or step 1206 can be performed on the multiple key frames to Reduce the redundancy of these local redundant data.

Step 1205: Split each video frame in multiple video frames to obtain multiple frame sub-blocks.

Specifically, in the embodiment for performing classification, after encoding the key frames of one or more classification clusters, the encoding device splits multiple key frames of the same cluster to obtain a plurality of frame sub-blocks.

In the multiple video frames of step 1205, each of the plurality of video frames includes redundant data at a local location, that is, redundant data exists between different video frames and within a video frame, and these redundancy The data is in a local location of the frame. For example, of the two video frames, one video frame has a window image in the lower part of the frame, and the other video frame has the same window image in the upper part of the frame. In these two video frames, the window image constitutes redundant data.

By splitting these video frames, a plurality of frame sub-blocks are obtained. Since the original video frames or video frames have redundant data inside, after the split, the redundant data is carried by the split frame sub-blocks. Because the redundant data is located at the local position of the frame of the video frame, it is inconvenient to extract the scene features representing the overall picture of the frame from the video frames, or such scene features have little effect on reducing the redundancy of the redundant data. Therefore, the video frames can be split first, and the frame picture at this time is a frame sub-block picture, and the granularity of the redundant data relative to the frame picture is reduced, thereby facilitating the acquisition of the scene feature base, and the scene feature base See the description of step 1206 for details.

It can be understood that the plurality of frame sub-blocks obtained by the splitting may be equal in size or unequal. After the splitting, the frame sub-blocks may be pre-processed, such as zooming in or out.

Step 1206: Perform reconstruction on multiple frame sub-blocks to obtain a scene feature, a representation coefficient of each frame sub-block in the plurality of frame sub-blocks, and a reconstruction residual of each frame sub-block.

The scene feature includes multiple independent scene feature bases, and the independent scene feature bases in the scene feature cannot be reconstructed from each other. The scene feature base is used to describe the screen content features of the frame sub-block, and the representation coefficients represent the scene features. The correspondence between the base and the frame sub-block, the reconstructed residual represents the difference between the frame sub-block and the scene feature base. The reconstructed residual may be a specific value or zero.

In an embodiment of the invention, the representation coefficients may be stored in separate fields, passed by encoding the ancillary information, such as by adding corresponding fields within the image header, strip header or macroblock information.

The scene feature base can be configured in various forms, for example, it can be a certain frame sub-block, or a feature block in a specific space. For details, refer to the following two examples. Multiple scene feature bases may constitute scene features. In the same scene feature, different scene feature bases cannot be reconstructed from each other, and thus these scene feature bases constitute a basic image unit. The basic image unit and the corresponding reconstructed residual combination can obtain a certain frame sub-block. Since there are multiple basic image units, it is necessary to represent the coefficients to match the scene feature base and the reconstructed residual corresponding to the same frame sub-block. Contact. It can be understood that one frame sub-block may correspond to one scene feature base, or may correspond to multiple scene feature bases. When multiple scene feature bases correspond to one frame sub-block, the scene feature bases are superimposed on each other and reconstructed residuals are performed. The reconstructed frame sub-block is obtained.

The scene features are composed of scene feature bases, and the scene feature bases in one scene feature cannot be reconstructed from each other, and the additional parameter reconstruction residuals represent the difference between the frame sub-block and the scene feature base, thereby being composed of multiple frames. When the block obtains multiple identical scene feature bases, the scene feature may record only one of the same scene feature bases, and the scene information includes data obtained by reducing redundancy of redundant data. Thus, after the reconstruction of step 1206, the data of the frame sub-block is converted into data composed of the reconstructed residual and the scene feature, and the redundancy of the redundant data is reduced.

The video coding method of the embodiment of the present invention may refer to FIG. 3b. However, after the scene reconstruction, the method further includes the representation coefficient C. For example, after performing scene reconstruction on the key frame of the scene 1, the weight is obtained. Construct residual matrices E1, E2, E3, and scene features F1*[C1, C3, C5] ^T . C1, C3, and C5 are the representation coefficients of the key frames I1, I3, and I5, respectively.

Step 1205 and step 1206 described above are one of the specific forms of the steps of reconstructing a plurality of video frames to obtain scene information and reconstruction residuals of each video frame.

There are a plurality of ways to perform the step 1206. Two examples are as follows:

Example 1:

First, the encoding apparatus reconstructs a plurality of frame sub-blocks to obtain a representation coefficient of each frame sub-block of the plurality of frame sub-blocks and a reconstruction residual of each frame sub-block.

Wherein, the representation coefficient represents a correspondence between a frame sub-block and a target frame sub-block, the target frame sub-block is an independent frame sub-block among the plurality of frame sub-blocks, and the independent frame sub-block is not based on other ones of the plurality of frame sub-blocks The frame sub-block reconstructed from the frame sub-block is reconstructed to represent the difference between the target frame sub-block and the frame sub-block.

Then, the encoding device combines a plurality of target frame sub-blocks indicating the coefficient indication to obtain a scene feature, and the target frame sub-block is a scene feature base.

That is, in this embodiment, after obtaining a plurality of frame sub-blocks from a plurality of video frames, the frame sub-blocks that are independently represented are determined by the reconstruction operation, and the independently represented frame sub-blocks are now referred to as target frames. Piece. The target frame sub-block and the non-target frame sub-block are included in the obtained multiple frame sub-blocks, the target frame sub-block cannot be reconstructed based on other target frame sub-blocks, and the non-target frame sub-block can be obtained based on other target frame sub-blocks. . In this way, the scene features are composed of target frame sub-blocks, which can reduce the redundancy of redundant data. Because the scene feature base is the original frame sub-block, the scene feature base constituting the scene feature can be determined according to the indication of the representation coefficient.

For example, as shown in FIG. 13, one of the two frame sub-blocks includes a window pattern 1301, and the frame sub-block plus the gate image 1303 can obtain another frame sub-block, so that the previous frame sub-block is targeted. Frame sub-block 1302, the next frame sub-block is a non-target frame sub-block 1304. The target frame sub-block and the reconstruction residual of the gate pattern are reconstructed to obtain the target frame sub-block, so that in the scene including the two frame sub-blocks, the window pattern of the two-frame sub-block is redundant data. After the reconstruction operation of the embodiment of the present invention, the reconstructed residual of the target frame sub-block and the gate is obtained, two representation coefficients, one representation coefficient indicates the target frame sub-block itself, and the other representation coefficient indicates the target frame sub-block and the gate. Reconstructing the correspondence of residuals, the target frame sub-block is a scene feature base. At the decoding device, a frame sub-block is obtained as a target frame sub-block according to a representation coefficient indicating the target frame sub-block itself, and the target frame sub-block is represented according to a representation coefficient indicating a correspondence relationship between the target frame sub-block and the reconstructed residual of the gate. The reconstruction residual of the AND gate is reconstructed to obtain another frame sub-block. Thus, at the time of encoding, the redundancy of the redundant data is reduced and the amount of encoding is reduced by the above-described reconstruction operation.

Specifically, reconstructing a plurality of frame sub-blocks to obtain a representation coefficient of each frame sub-block of the plurality of frame sub-blocks and a reconstruction residual of each frame sub-block, including:

Converting a plurality of frame sub-blocks into an observation matrix, the observation matrix being used to represent a plurality of frame sub-blocks in a matrix form;

Reconstructing the observation matrix according to the second constraint condition to obtain a representation coefficient matrix and a reconstruction residual matrix, wherein the coefficient matrix is a matrix including the representation coefficients of each frame sub-block of the plurality of frame sub-blocks, indicating the non-coefficient of the coefficients The zero coefficient indicates a target frame sub-block, and the reconstructed residual matrix is used to represent the reconstructed residual of each frame sub-block in a matrix form, and the second constraint is used to define a low rank and sparsity of the representation coefficient. Set requirements;

Combining a plurality of target frame sub-blocks indicating the coefficient indication to obtain scene features, including:

The target frame sub-block indicated by the non-zero coefficient indicating the coefficient in the coefficient matrix is combined to obtain a scene feature.

Optionally, reconstructing the observation matrix according to the second constraint condition to obtain a representation coefficient matrix and a reconstruction residual matrix, including:

According to the second preset formula, the representation coefficient matrix and the reconstructed residual matrix are calculated, and the second preset formula is:

or,

s.t.D=DC+E s.t.D=DC+E

For example, suppose a scene S contains N key frames, that is, the same cluster includes N key frames, and N is a natural number. Split each keyframe evenly into M equal-sized sub-blocks. Pull each sub-block into a column vector to form an observation matrix D, ie

Since there is a large amount of redundancy in the information content between the key frame and the key frame, the matrix can be regarded as a union of a plurality of subspaces. The goal of scene reconstruction is to find these independent subspaces and solve the representation coefficients of the observation matrix D in these independent subspaces. Space refers to a collection with some specific properties. The observation matrix D contains a plurality of image feature vectors, and the representation space formed by these vectors is a full space. A subspace is a partial space that represents a dimension that is smaller than the full space. This subspace is the space formed by independent frame sub-blocks.

The scene reconstruction problem can be transformed into the following optimization problem to describe:

s.t.D=DC+E

Where C is the coefficient of representation. According to the representation coefficient C, the scene features corresponding to each subspace can be obtained. The non-zero number of coefficients C corresponds one-to-one with the number of scene feature bases. The representation coefficient of this embodiment refers to a coefficient matrix (or vector) represented by each scene feature base in the scene feature in the key frame reconstruction process, that is, a correspondence relationship between the frame sub-block and the scene feature base. The representation coefficient between different independent frame sub-blocks is usually 0. For example, the grass image does not contain the lake scene feature, so the coefficient of the image block represented by the lake scene feature is usually zero.

In this way, the self-representation of the observation matrix D can be realized, that is, each frame sub-block in the observation matrix D can be represented by observing other frame sub-blocks in the matrix D, independent frame sub-blocks. It is itself represented by itself. Each column in the representation coefficient matrix C is a representation coefficient of a frame sub-block, and each column of E in the residual matrix is a reconstruction residual of the corresponding frame sub-block. So the formula can use D=DC+E.

The target constraint function indicates that, under the premise of self-representation, since the observation matrix is composed of multiple scene feature bases, the represented coefficients should be low rank matrices (ie, multiple representation coefficients are strongly correlated), given the low rank constraint. Under the premise, the solution can be avoided to obtain a trivial solution (ie, C=I, E=0). At the same time, the reconstruction error sparse constraint is given so that the representation is as close as possible to the original image.

In order to reduce the amount of data represented by the scene feature, it is necessary to perform coefficient sparse constraint on the representation coefficient, namely:

s.t.D=DC+E

Among them, λ and β are weight parameters, and the coefficient sparsity and low rank are adjusted. The above optimization problem can be solved by matrix optimization algorithms such as APG and IALM. The final scene feature consists of a feature base corresponding to a non-zero coefficient C.

In order to reduce the number of feature bases, the representation coefficients need to be sparsely constrained, that is, the representation coefficients of the frame sub-blocks belonging to the same type of scene (for example, both are grassland) are not only strongly correlated but also indicate that the coefficients are mostly 0, and a small portion is not 0. The image sub-block corresponding to the representation coefficient is the scene feature that needs to be encoded eventually.

For example, suppose that the representation coefficient matrix C and the observation matrix D are matrices arranged by the column vectors c, d, that is, C = [c1, c2, c3, ...], D = [d1, d2, d3, ...], where c1 =[c1_1,c1_2,c1_3] is the representation coefficient corresponding to the observation sample d1, and d1 is the matrix representation of one frame sub-block. DC indicates that the matrix is multiplied, then d1=D*c1, that is, d1=d1*c1_1+d2*c1_2+d3*c1_3. After solving, only a small part of the dimension in the c1 vector is not 0, for example, c1_2 is not 0. The scene feature base is d2, that is, the frame sub-block d1 can be represented based on the frame sub-block d2, the frame sub-block d2 is an independent frame sub-block, and the reconstructed residuals of the frame sub-block d2 and the frame sub-block d1 are heavy. The frame sub-block d1 is obtained, and the coefficient c1=[0, c1_2, 0] indicates the correspondence between the frame sub-block d1 and the independent frame sub-block d2.

Thus, the embodiment of the present invention converts the information amount of the I frame into the scene feature base information and the residual matrix information. The redundancy of the I frame information amount is reflected in the scene feature base and the residual matrix information. Multiple I frames have the same scene feature base, so only the scene feature base needs to be encoded once to greatly reduce the amount of encoded data.

In addition to coding the scene features and reconstruction errors, it is also necessary to record the representation coefficients and the number of the sub-blocks. In the decoding process, each sub-block is first reconstructed according to the decoded scene features, representation coefficients and reconstruction errors, and then the sub-blocks are combined by number to obtain the final key frame content. Figure 14 shows an example of scene reconstruction based on local information representation.

Certainly, in some embodiments of the present invention, the frame sub-blocks may be arranged in a preset order without using the number of the sub-blocks, and the reconstructed process is performed according to the preset rule. The video sub-blocks can be combined to obtain a video frame.

This implementation can mine the texture structure existing in the key frame. If there are a large number of texture features in the scene, the representation coefficient C obtained by the above formula will be low rank and sparse. The feature base corresponding to the sparse coefficient is the basic unit of the scene texture structure. Figure 15 shows an example diagram of local feature reconstruction under a texture scene.

In the compression scheme given by the above implementation, the scene content is represented and reconstructed according to the underlying data features of the image. The following implementations will use higher-level semantic features to describe and reconstruct the content of the scene to achieve data compression. Specific models include Sparse Coding (SC), Deep Neural Network (DNN), Convolutional Neural Network (CNN), Stacked Auto Encoder (SAE), and so on.

Case 2

First, the decoding device reconstructs a plurality of frame sub-blocks to obtain a scene feature and a representation coefficient of each frame sub-block of the plurality of frame sub-blocks. The scene feature includes a scene feature set as an independent feature block in the feature space, and the independent feature block is a feature block that cannot be reconstructed by other feature blocks in the scene feature.

Then, the decoding device calculates the reconstructed residual of each frame sub-block according to the reconstructed residual of each frame sub-block and the reconstructed data of the scene feature and each frame sub-block.

The scene feature base is an independent feature block in the feature space. For example, the feature space may be an RGB color space, a HIS color space, a YUV color space, etc., and different frame sub-blocks do not seem to have the same picture, but After the high-level mapping, the same feature blocks are formed. These same feature blocks constitute redundant data, and the scene features record the same feature blocks only one by one, thereby reducing the redundancy between the frame sub-blocks. Such scene features are similar to a dictionary consisting of feature blocks, which are the feature blocks needed to select a frame sub-block from the dictionary and map the corresponding reconstructed residuals.

It can be understood that one frame sub-block can correspond to multiple feature blocks, and the multiple feature blocks are superimposed and reconstructed by the reconstructed residual to obtain a frame sub-block.

Specifically, the plurality of frame sub-blocks are reconstructed to obtain a scene feature and a representation coefficient of each frame sub-block of the plurality of frame sub-blocks, including:

Reconstructing the observation matrix according to the third constraint condition to obtain a representation coefficient matrix and a scene feature matrix, wherein the coefficient matrix is a matrix including the representation coefficients of each frame sub-block, and the non-zero coefficient indicating the coefficient indicates the scene feature base, as shown The scene feature matrix is used to represent the scene features in a matrix form, and the third constraint condition is used to define the similarity between the picture representing the coefficient matrix and the scene feature matrix reconstructed picture and the frame sub-block according to the preset similarity threshold, and the limitation The data matrix indicating that the coefficient matrix sparsity conforms to the preset sparse threshold and the scene feature matrix is smaller than the preset data volume threshold;

The reconstructed residuals of each frame sub-block are calculated according to the reconstructed residual of each frame sub-block and the reconstructed data of the scene feature and each frame sub-block, including:

The reconstructed residual matrix is calculated according to the data and the observation matrix reconstructed by the coefficient matrix and the scene feature matrix, and the reconstructed residual matrix is used to represent the reconstructed residual in a matrix form.

For example, the observation matrix is reconstructed according to the third constraint condition, and the representation coefficient matrix and the scene feature matrix are obtained, including:

According to the third preset formula, the representation coefficient matrix and the scene feature matrix are calculated, and the third preset formula is:

For example, a sparse coding model is used for modeling and analysis for description. Suppose a scene S contains N key frames, and each key frame is evenly split into M equal-sized frame sub-blocks. Pull each frame sub-block into a column vector to form an observation matrix D, ie

The scene reconstruction problem can be transformed into the following problem to describe:

Where λ and β are weight parameters, and the matrix optimization parameters are scene features F and representation coefficients C.

Represents the optimal value of F and C, which makes the formula:

The value of F and C when the value is minimum.

The first item in the objective function is to constrain the reconstruction error, so that the picture reconstructed by the scene feature and the representation coefficient is as similar as possible to the original picture. The second term is the sparse constraint on the coefficient C, which means that each picture can be reconstructed with a small number of feature bases. The last item is to constrain the scene feature F to prevent the F data from being too large, that is, the first item of the formula is the error term, and the last two items are regular items, and the representation coefficients are constrained. The specific optimization algorithm can adopt the methods of conjugate gradient method, OMP (Orthogonal Matching Pursuit), LASSO and the like. The scene features obtained by the final solution are shown in Fig. 16. Then, the reconstructed residual is solved according to the formula E=D-FC. The dimension and number of the F matrix are consistent with the dimensions of the frame sub-block.

Referring to FIG. 16, each small frame of FIG. 16 is a scene feature base, and the scene feature matrix F is a matrix composed of small frames (scene feature bases), FC=F[c1, c2, c3, ...], and Fc1 represents The representation coefficient c1 combines the scene feature bases to obtain a linear representation of the feature space, and the reconstructed residual e1 is added to restore the original frame sub-block image I1.

In the first example, the scene feature base is directly determined by the observation sample D. That is, the scene feature base is selected from the observation sample D. The scene features in this example are learned according to the algorithm. In the optimization process of the parameter F, the iterative solution is performed according to the objective function, and the optimization result can minimize the reconstruction error. The amount of coded information is concentrated on F, E. The dimension of F is consistent with the dimension of the frame sub-block, and the number of Fs can be set in advance. The less the setting, the less the coding information, but the reconstruction residual E is larger. The more the F setting is, the larger the coding information is, but the reconstruction residual E is smaller, so the number of Fs needs to be weighed by the weight parameter.

Step 1207: Perform predictive coding on the feature of the scene to obtain scene feature prediction encoded data.

Step 1208: Perform predictive coding on the reconstructed residual to obtain residual prediction encoded data.

Step 1209: Perform reconstruction according to the scene feature, the representation coefficient, and the reconstruction residual to obtain a reference frame.

For the specific implementation of step 1209, reference may be made to step 508.

Step 1210: Perform reference frame prediction on the B frame and the P frame by using the reference frame as a reference, and obtain B frame predictive coded data and P frame predictive coded data.

For the specific implementation of step 1210, reference may be made to step 509.

Step 1211: Perform transform coding, quantization coding, and entropy coding on the predictive coded data to obtain video compressed data.

For the specific implementation of step 1211, reference may be made to step 510.

Similar to the embodiment shown in FIG. 5, the embodiment shown in FIG. 12 is described based on the HEVC scenario, but the video encoding method shown in FIG. 12 can also be applied to other scenarios.

In summary, the encoding device acquires a plurality of video frames, and each of the plurality of video frames includes redundant data on the picture content, and in particular, each of the plurality of video frames is in a mutual Local locations include redundant data. In this regard, the encoding device splits each video frame in the plurality of video frames to obtain a plurality of frame sub-blocks, and then reconstructs the plurality of frame sub-blocks to obtain scene features and multiple frame sub-blocks. The representation coefficient of each frame sub-block and the reconstruction residual of each frame sub-block. The scene feature includes multiple independent scene feature bases, and the independent scene feature bases in the scene feature cannot be reconstructed from each other. The scene feature base is used to describe the screen content features of the frame sub-block, and the representation coefficients represent the scene features. The correspondence between the base and the frame sub-block, the reconstructed residual represents the difference between the frame sub-block and the scene feature base. Subsequently, the scene features are predictively coded, the scene feature prediction coded data is obtained, and the reconstructed residual is predictively coded to obtain residual prediction coded data.

Thus, by reconstruction, the redundancy of redundant data included in the local location is reduced. Therefore, in the encoding operation, the obtained scene feature and the compressed data amount of the reconstructed residual total are reduced relative to the compressed data amount of the original video frame, and the amount of data obtained after compression is reduced. Each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual includes residual information other than the scene information, the amount of information is small and sparse, and the feature can be compared when performing predictive coding. The codewords are less predictively encoded, the amount of encoded data is small, and the compression ratio is high. Thus, the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.

Corresponding to the video encoding method shown in FIG. 12, FIG. 17 shows a video decoding method. Referring to FIG. 17, the video decoding method in the embodiment of the present invention includes:

Step 1701: Acquire scene feature prediction encoded data, residual prediction encoded data, and representation coefficients.

The decoding device acquires video compressed data, which may be video compressed data obtained by the video encoding method of the embodiment shown in FIG.

Specifically, in the HEVC scenario, acquiring the scene feature prediction encoded data and the residual prediction encoded data includes: acquiring video compressed data, and then performing entropy decoding, inverse quantization processing, and DCT inverse change on the video compressed data to obtain predictive encoded data. . The prediction encoded data includes scene feature prediction encoded data, residual prediction encoded data, B frame predictive encoded data, and P frame predictive encoded data;

Step 1702: Decode the scene feature prediction encoded data to obtain a scene feature.

The scene feature includes multiple independent scene feature bases, and the independent scene feature bases in the scene feature cannot be reconstructed from each other. The scene feature base is used to describe the picture content features of the frame sub-block, and the represented coefficients represent the scene feature base and The correspondence between the frame sub-blocks, and the reconstructed residual represents the difference between the frame sub-block and the scene feature base.

Step 1703: Decode the residual prediction encoded data to obtain a reconstructed residual.

Step 1704: Perform reconstruction according to the scene feature, the representation coefficient, and the reconstruction residual to obtain a plurality of frame sub-blocks.

Corresponding to the scene feature, the representation coefficient, and the reconstruction residual obtained by the video coding method shown in FIG. 12, the video decoding method in the embodiment of the present invention may perform multiple reconstructions according to the scene feature, the representation coefficient, and the reconstruction residual. Frame sub-block. The method of the embodiment of the present invention may refer to FIG. 4b, but after decoding the scene feature, the representation feature is used to determine the required scene feature base in the scene feature, for example, using the scene feature F1*[C1, C3, C5 After ^T , the reconstructed residuals E1, E3, and E5 are reconstructed respectively to obtain key frames I1, I3, and I5. C1, C3, and C5 are the representation coefficients of the key frames I1, I3, and I5, respectively.

Step 1705: Combining a plurality of frame sub-blocks to obtain a plurality of video frames.

Step 1704 and step 1705 are specific implementations of the steps of reconstructing the video information according to the scene information and the reconstructed residual.

For example, in an HEVC scenario, a plurality of frame sub-blocks are combined to obtain a plurality of video frames, and a plurality of frame sub-blocks are combined to obtain a plurality of I frames. For example, in the decoding process, each sub-block is first reconstructed according to the decoded scene features, representation coefficients, and reconstruction errors, and then the sub-blocks are combined by number to obtain the final key frame content. The method of the embodiment of the present invention further includes: inter-frame decoding the B frame predictive encoded data and the P frame predictive encoded data by using the I frame as a reference frame to obtain a B frame and a P frame. Then, the decoding device arranges the I frame, the B frame, and the P frame in chronological order to obtain a video stream.

Thus, after the compressed video data is obtained by the video encoding method of the embodiment shown in FIG. 12, the video frame can be decoded by the video decoding method of the embodiment shown in FIG.

In the foregoing embodiments, the video frame that performs the reconfiguration operation is obtained, and the embodiment that the video frame is extracted from the acquired video stream and the video frame is directly obtained is obtained. In some embodiments of the present invention, the video frame is obtained. This can be obtained by taking a compressed video frame and then decompressing it.

Specifically, step 201 can be implemented by the following steps:

Step F1: Acquire a compressed video stream.

Wherein, the compressed video stream includes a compressed video frame. The compressed video stream can be, for example, a HEVC compressed video stream.

Step F2: Determine a plurality of target video frames from the compressed video stream.

The target video frame is an independently compression-encoded video frame in the compressed video stream.

Step F3: Decoding the target video frame to obtain a decoded target video frame.

The decoded target video frame is used to perform step 202.

In the embodiment of the present invention, in order to further reduce the redundancy of the decoded video frames, the video frames may be classified. For details, refer to step 504.

By performing the video encoding method of the embodiment of the present invention on video frames that are independently compressed in the compressed video stream, the compression efficiency of these video frames can be improved, and the compressed data amount of these video frames can be reduced.

For example, the embodiment of the present invention may perform secondary compression on the HEVC compressed video stream. Specifically, after compressed video discrimination, I frame extraction, and intra-frame decoding, an I frame to be used to perform the method of the embodiment of the present invention is obtained. For example, a method of the embodiment of the present invention may be implemented by adding a compressed video discriminating, an I frame decimation, and an intra decoding module based on the original video encoding device.

First, it is determined whether the video stream is compressed according to whether the video stream contains compressed video terminal information.

Then, an I frame extraction operation is performed. Since the HEVC compressed video adopts a hierarchical code stream structure, independent GOP data is extracted according to the image group header in the image group layer according to the code stream hierarchy. Then, each frame of the GOP is extracted according to the image header, and the first frame image of the GOP group is an I frame, and the I frame can be extracted.

Subsequently, since the I frame has been subjected to independent compression operations in the HEVC compressed video, as described above in detail in the HECV standard, the decoding device performs intra-frame decoding on the extracted I-frame encoded data to obtain decoding. The subsequent I frame, residual coding and decoding steps can be referred to the encoding and decoding operations above. In this way, it is possible to perform secondary encoding and decoding of the compressed video on the basis of the original video encoded data.

Because the method of the invention can perform secondary encoding and decoding on the existing compressed video data, and is consistent with the traditional HEVC method in the process of transform coding, quantization coding, entropy coding, etc., therefore, when performing the function module deployment of the present invention, Can be compatible with legacy video compression devices.

It can be understood that the method of the embodiment of the present invention can also be applied to other encoded data, and the steps of extracting and decoding the compressed video frame according to the above steps, and then performing the steps of the video encoding method of FIG. 2, FIG. 5 and FIG. 12 described above. . Wherein, for non-HEVC video encoded data, the I frame can be determined according to the size of the compressed image data, and usually the I frame encoded data is much larger than the P frame and the B frame encoded data.

FIG. 18 is a schematic structural diagram of a video encoding apparatus according to an embodiment of the present invention. FIG. 18b is a schematic diagram showing a partial structure of a video encoding apparatus according to the embodiment shown in FIG. 18a. The video encoding apparatus can be used to perform the video encoding method in the foregoing embodiments. Referring to FIG. 18a and FIG. 18b, the video encoding apparatus includes: acquiring Module 1801, reconstruction module 1802, and prediction encoding module 1803. The obtaining module 1801 is configured to perform a process of acquiring a video frame in an embodiment of each of the foregoing video encoding methods. The reconstruction module 1802 is configured to perform a process related to the reconfiguration operation to reduce the redundancy of the redundant data in the embodiments of the foregoing video coding methods, for example, step 202, step 505, and step 1206. The prediction encoding module 1803 is configured to perform steps of predictive encoding, such as step 203 and step 204, in an embodiment of each of the above video encoding methods. The reconstruction module 1802 obtains the scene information and the reconstruction residual after performing the reconstruction operation on the plurality of video frames acquired by the obtaining module 1801, so that the prediction encoding module 1803 predictively encodes the scene information and the reconstructed residual.

Optionally, the video encoding device further includes a metric feature extraction module 1804 and a metric information calculation module 1805 between the obtaining module 1801 and the reconstruction module 1802.

The feature extraction module 1804 is configured to perform a process of extracting picture feature information of a video frame in an embodiment of each of the above video coding methods, for example, steps D1 and E1.

The metric information calculation module 1805 is configured to perform a process of calculating the content metric information in the embodiment of each of the above video coding methods, for example, steps D2 and E2.

Optionally, the video encoding device further includes:

a reference frame reconstruction module 1806, configured to perform a process of reconstructing a reference frame in an embodiment of each of the foregoing video coding methods;

The inter prediction encoding module 1807 is configured to perform a process related to inter prediction encoding in the embodiments of the foregoing video encoding methods.

The encoding module 1808 is configured to perform a process of transform coding, quantization coding, and entropy coding in the embodiments of the foregoing video coding methods.

Optionally, the reconstruction module 1802 further includes a splitting unit 1809 and a reconstruction unit 1810. The reconstruction unit 1810 may reconstruct the frame sub-block obtained by splitting the unit 1809.

The splitting unit 1809 is configured to perform a process of splitting a video frame in an embodiment of each of the video encoding methods described above, for example, step 1206. The reconstruction unit 1810 is configured to perform a process of reconstructing a frame sub-block in an embodiment of each of the foregoing video coding methods, for example, step 1206;

The reconstruction unit 1810 includes a reconstruction subunit 1811 and a combination subunit 1812.

The reconstruction sub-unit 1811 is configured to perform a process of reconstructing a frame sub-block to obtain a representation coefficient and a reconstruction residual in an embodiment of each of the above video coding methods.

The combining sub-unit 1812 is configured to perform a process of combining the target frame sub-blocks in the embodiment of each of the video encoding methods described above.

Optionally, the reconstruction unit 1810 may further include a sub-block reconstruction sub-unit 1813 and a sub-block calculation sub-unit 1814.

The sub-block reconstruction sub-unit 1813 is configured to perform a process of reconstructing a frame sub-block to obtain a scene feature and a representation coefficient in an embodiment of each of the foregoing video coding methods, where the scene feature includes a scene feature base that is independent in the feature space. Feature block.

The sub-block calculation sub-unit 1814 is for performing a computational reconstruction residual processing procedure in an embodiment for performing the above-described respective video coding methods.

Optionally, the video encoding device further includes a classification module 1815 for performing a process involving classification in an embodiment of each of the video encoding methods described above.

Optionally, the classification module 1815 includes a feature extraction unit 1816, a distance calculation unit 1817, and a clustering unit 1818.

The feature extraction unit 1816 is configured to extract feature information of each of the plurality of video frames, and the distance calculation unit 1817 is configured to perform a process of processing the cluster distance in the embodiment of each of the video coding methods. The class unit 1818 is configured to perform a process involving clustering in an embodiment of each of the above video coding methods.

Optionally, the obtaining module 1801 includes the following units:

a video stream obtaining unit 1819, configured to acquire a video stream;

a frame feature extraction unit 1820, configured to perform a process of extracting feature information of the first video frame and the second video frame in an embodiment of each of the foregoing video encoding methods;

a lens distance calculation unit 1821, configured to perform a process related to lens distance calculation in an embodiment of each of the above video coding methods;

The lens distance determining unit 1822 is configured to determine whether the lens distance is greater than a preset lens threshold;

a lens dividing unit 1823, configured to perform a process of dividing a target lens in an embodiment of each of the above video encoding methods;

The key frame extracting unit 1824 is configured to perform a process of extracting a key frame according to a frame distance in an embodiment of each of the above video encoding methods.

Optionally, the video encoding device further includes:

The training module 1825 is configured to perform discriminant training according to each shot segmented from the video stream, to obtain a plurality of classifiers corresponding to the shots;

a discriminating module 1826, configured to determine a target video frame by using a target classifier to obtain a discriminant score,

The scene determining module 1827 is configured to: when the discriminant score is greater than the preset score threshold, determine that the target video frame belongs to the same scene as the shot to which the target classifier belongs;

The cluster determination module 1828 is configured to determine video frames of one or more clusters according to video frames belonging to the same scene as the shot.

Optionally, the obtaining module 1801 includes:

a compressed video obtaining unit 1829, configured to acquire a compressed video stream, where the compressed video stream includes a compressed video frame;

a frame determining unit 1830, configured to determine, from the compressed video stream, a target video frame, where the target video frame is an independently compressed encoded video frame;

The decoding unit 1831 is configured to decode the target video frame to obtain a decoded target video frame, where the decoded target video frame is used to perform splitting of each of the plurality of video frames to obtain multiple frames. The steps of the block.

In summary, the obtaining module 1801 acquires a plurality of video frames, and each of the plurality of video frames includes redundant data on the screen content. Then, the reconstruction module 1802 reconstructs the plurality of video frames to obtain scene information and a reconstruction residual of each video frame, where the scene information includes data obtained by reducing redundancy of redundant data, and reconstructing the residual The difference is used to represent the difference between the video frame and the scene information. Next, the prediction encoding module 1803 performs predictive coding on the scene information to obtain scene feature prediction encoded data. The prediction encoding module 1803 performs predictive coding on the reconstructed residual to obtain residual prediction encoded data. In this way, by performing the process of reconstructing the plurality of video frames, the redundancy of the video frames can be reduced, so that in the encoding operation, the obtained scene features and the reconstructed residual total compressed data amount are relative to the original video. The amount of compressed data of the frame is reduced, reducing the amount of data obtained after compression. Each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual includes residual information other than the scene information, the amount of information is small and sparse, and the feature can be compared when performing predictive coding. The codewords are less predictively encoded, the amount of encoded data is small, and the compression ratio is high. Thus, the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.

FIG. 19 is a schematic structural diagram of a video decoding device according to an embodiment of the present invention. The video decoding device can be used to perform the video decoding method in the foregoing embodiments. Referring to FIG. 19, the video decoding device includes: an obtaining module 1901, a scene information decoding module 1902, a reconstructed residual decoding module 1903, and a video frame reconstruction module. 1904. The scene information decoding module 1902 and the reconstructed residual decoding module 1903 respectively perform the decoding operation on the scene feature prediction encoded data and the residual prediction encoded data acquired by the obtaining module 1901, so that the video frame reconstruction module 1904 can reconstruct the data obtained by using the decoding. Get the video frame.

The obtaining module 1901 is configured to perform a process of acquiring encoded data in an embodiment of each of the foregoing video decoding methods, for example, step 205;

The scene information decoding module 1902 is configured to perform a process related to decoding scene information in the embodiments of the foregoing video decoding methods, for example, step 206, step 603;

The reconstructed residual decoding module 1903 is configured to perform a process of decoding the reconstructed residual in the embodiment of each of the foregoing video decoding methods, for example, step 207;

The video frame reconstruction module 1904 is configured to perform a process of reconstructing a plurality of video frames in an embodiment of each of the video decoding methods, for example, step 208 and step 604.

Optionally, the obtaining module 1901 includes an obtaining unit 1905 and a decoding unit 1906.

The obtaining unit 1905 is configured to perform a process of acquiring video compression data in an embodiment of each of the foregoing video decoding methods, for example, step 601.

The decoding unit 1906 is configured to perform a process related to obtaining the predicted encoded data in the embodiment of each of the video decoding methods described above, for example, step 602.

The video decoding apparatus further includes: an inter-frame decoding module 1907, configured to perform a process related to inter-frame decoding in an embodiment of each of the above video decoding methods, for example, step 606;

The arranging module 1908 is configured to perform a process involving frame alignment in the embodiment of each of the video decoding methods described above, for example, step 607.

Optionally, the obtaining module 1901 is further configured to acquire a representation coefficient.

The video frame reconstruction module 1904 includes a reconstruction unit 1909 and a combination unit 1910.

The reconstruction unit 1909 is configured to perform a process of reconstructing a plurality of frame sub-blocks in an embodiment of each of the video decoding methods, for example, step 1704.

The combining unit 1910 is configured to perform a process of combining frame sub-blocks in an embodiment of each of the above video decoding methods, for example, step 1705.

In summary, after the acquisition module 1901 acquires the scene feature prediction coded data and the residual prediction coded data, the scene information decoding module 1902 decodes the scene feature prediction coded data to obtain scene information, where the scene information includes reducing redundant data. The redundancy obtained data, the redundant data is redundant data on the picture content between each of the plurality of video frames. Then, the reconstructed residual decoding module 1903 decodes the residual prediction encoded data to obtain a reconstructed residual, and the reconstructed residual is used to represent the difference between the video frame and the scene information. And a video frame reconstruction module 1904, configured to perform reconstruction according to the scene information and the reconstructed residual to obtain a plurality of video frames. In this way, the scene feature predictive coded data and the residual predictive coded data obtained by encoding the video coding device in the foregoing embodiment can complete the decoding operation by using the video decoding device of the embodiment of the present invention.

FIG. 20 is a schematic structural diagram of a video codec device according to an embodiment of the present invention. The video encoding and decoding device can be used to perform the video encoding method and the video decoding method in the foregoing embodiments. Referring to FIG. 20, the video encoding and decoding device 2000 includes a video encoding device 2001 and a video decoding device 2002.

The video encoding device 2001 is the video encoding device of the embodiment shown in FIG. 18a and FIG. 18b above;

The video decoding device 2002 is the video decoding device of the embodiment shown in Fig. 19 described above.

The video encoding method and the video decoding method provided by the embodiments of the present invention are described below in the hardware architecture. In the following embodiments, a video encoding and decoding system is provided. The video frame encoding and decoding system includes a video encoder and a video decoder. .

·system structure

21 is a schematic block diagram of a video codec system 10 in accordance with an embodiment of the present invention. As shown in FIG. 21, video codec system 10 includes source device 12 and destination device 14. Source device 12 produces encoded video data. Thus, source device 12 may be referred to as a video encoding device or a video encoding device. Destination device 14 may decode the encoded video data produced by source device 12. Thus, destination device 14 may be referred to as a video decoding device or a video decoding device. Source device 12 and destination device 14 may be examples of video codec devices or video codec devices. Source device 12 and destination device 14 may include a wide range of devices including desktop computers, mobile computing devices, notebook (eg, laptop) computers, tablet computers, set top boxes, smart phones, etc., televisions, cameras, display devices , digital media player, video game console, on-board computer, or the like.

Destination device 14 may receive the encoded video data from source device 12 via channel 16. Channel 16 may include one or more media and/or devices capable of moving encoded video data from source device 12 to destination device 14. In one example, channel 16 may include one or more communication media that enable source device 12 to transmit encoded video data directly to destination device 14 in real time. In this example, source device 12 may modulate the encoded video data in accordance with a communication standard (eg, a wireless communication protocol) and may transmit the modulated video data to destination device 14. The one or more communication media may include wireless and/or wired communication media, such as a radio frequency (RF) spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network (eg, a local area network, a wide area network, or a global network (eg, the Internet). The one or more communication media may include routers, switches, base stations, or promotions Other devices that communicate from source device 12 to destination device 14.

In another example, channel 16 can include a storage medium that stores encoded video data generated by source device 12. In this example, destination device 14 can access the storage medium via disk access or card access. The storage medium may include a variety of locally accessible data storage media, such as Blu-ray Disc, DVD, CD-ROM, flash memory, or other suitable digital storage medium for storing encoded video data.

In another example, channel 16 can include a file server or another intermediate storage device that stores encoded video data generated by source device 12. In this example, destination device 14 may access the encoded video data stored at a file server or other intermediate storage device via streaming or download. The file server may be a server type capable of storing encoded video data and transmitting the encoded video data to the destination device 14. The instance file server includes a web server (eg, for a website), a file transfer protocol (FTP) server, a network attached storage (NAS) device, and a local disk drive.

Destination device 14 can access the encoded video data via a standard data connection (e.g., an internet connection). An instance type of a data connection includes a wireless channel (eg, a Wi-Fi connection), a wired connection (eg, DSL, cable modem, etc.), or both, suitable for accessing encoded video data stored on a file server. combination. The transmission of the encoded video data from the file server may be streaming, downloading, or a combination of both.

The technology of the present invention is not limited to a wireless application scenario. Illustratively, the technology can be applied to video codecs supporting multiple multimedia applications such as aerial television broadcasting, cable television transmission, satellite television transmission, and streaming video. Transmission (eg, via the Internet), encoding of video data stored on a data storage medium, decoding of video data stored on a data storage medium, or other application. In some examples, video codec system 10 may be configured to support one-way or two-way video transmission to support applications such as video streaming, video playback, video broadcasting, and/or video telephony.

In the example of FIG. 21, source device 12 includes video source 18, video encoder 20, and output interface 22. In some examples, output interface 22 can include a modulator/demodulator (modem) and/or a transmitter. Video source 18 may include a video capture device (eg, a video camera), a video archive containing previously captured video data, a video input interface to receive video data from a video content provider, and/or a computer for generating video data. A graphics system, or a combination of the above video data sources.

Video encoder 20 may encode video data from video source 18. In some examples, source device 12 transmits the encoded video data directly to destination device 14 via output interface 22. The encoded video data may also be stored on a storage medium or file server for later access by the destination device 14 for decoding and/or playback.

In the example of FIG. 21, destination device 14 includes an input interface 28, a video decoder 30, and a display device 32. In some examples, input interface 28 includes a receiver and/or a modem. Input interface 28 can receive the encoded video data via channel 16. Display device 32 may be integral with destination device 14 or may be external to destination device 14. In general, display device 32 displays the decoded video data. Display device 32 may include a variety of display devices such as liquid crystal displays (LCDs), plasma displays, organic light emitting diode (OLED) displays, or other types of display devices.

Video encoder 20 and video decoder 30 may operate in accordance with a video compression standard (eg, the High Efficiency Video Codec H.265 standard) and may conform to the HEVC Test Model (HM). A textual description of the H.265 standard is published on April 29, 2015, ITU-T.265(V3) (04/2015), available for download from http://handle.itu.int/11.1002/1000/12455 The entire contents of the document are incorporated herein by reference.

Alternatively, video encoder 20 and video decoder 30 may operate in accordance with other proprietary or industry standards including ITU-TH.261, ISO/IEC MPEG-1 Visual, ITU-TH.262, or ISO/IEC MPEG-2 Visual, ITU. -TH.263, ISO/IECMPEG-4 Visual, ITU-TH.264 (also known as ISO/IEC MPEG-4 AVC), including scalable video codec (SVC) and multiview video codec (MVC) extensions. It should be understood that the techniques of the present invention are not limited to any particular codec standard or technique.

Moreover, FIG. 21 is merely an example and the techniques of the present invention are applicable to video codec applications (eg, single-sided video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device. In other examples, data is retrieved from local memory, data is streamed over a network, or manipulated in a similar manner. The encoding device may encode the data and store the data to a memory, and/or the decoding device may retrieve the data from the memory and decode the data. In many instances, encoding and decoding are performed by a plurality of devices that only encode data to and/or retrieve data from the memory and decode the data by not communicating with each other.

Video encoder 20 and video decoder 30 may each be implemented as any of a variety of suitable circuits, such as one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable Gate array (FPGA), discrete logic, hardware, or any combination thereof. If the technology is implemented partially or wholly in software, the device may store the instructions of the software in a suitable non-transitory computer readable storage medium, and the instructions in the hardware may be executed using one or more processors to perform the techniques of the present invention. . Any of the foregoing (including hardware, software, a combination of hardware and software, etc.) can be considered as one or more processors. Each of video encoder 20 and video decoder 30 may be included in one or more encoders or decoders, any of which may be integrated into a combined encoder/decoder (codec) in other devices Part of the (CODEC).

The invention may generally refer to video encoder 20 "signaling" certain information to another device (e.g., video decoder 30). The term "signaling" may generally refer to a syntax element and/or to convey the communication of encoded video data. This communication can occur in real time or near real time. Alternatively, this communication may occur over a time span, such as may occur when encoding the encoded element to a computer readable storage medium at the time of encoding, the syntax element being subsequently decodable after being stored in the medium The device is retrieved at any time.

· Block mode

The video encoder 20 encodes video data. Video data may include one or more pictures. Video encoder 20 may generate a code stream that contains encoded information for the video data in the form of a bitstream. The encoded information may include encoded picture data and associated data. Associated data can include sequence parameter sets (SPS), picture parameter sets (PPS), and other syntax structures. An SPS can contain parameters that are applied to zero or more sequences. The PPS can contain parameters that are applied to zero or more pictures. A grammatical structure refers to a collection of zero or more syntax elements arranged in a specified order in a code stream.

To generate the encoded information for the picture, video encoder 20 may partition the picture into a raster of coded tree blocks (CTBs). In some examples, a CTB may be referred to as a "tree block," a "maximum coding unit" (LCU), or a "coding tree unit." The CTB is not limited to a particular size and may include one or more coding units (CUs). Each CTB can be associated with a block of pixels of equal size within the picture. Each pixel can correspond to one luminance (luminance or luma) sample and two chrominance or chroma samples. Thus, each CTB can be associated with one luma sample block and two chroma sample blocks. The CTB of a picture can be divided into one or more stripes. In some examples, each stripe contains an integer number of CTBs. As part of the encoded picture, video encoder 20 may generate encoded information for each strip of the picture, i.e., encode the CTB within the strip. To encode the CTB, video encoder 20 may recursively perform quadtree partitioning on the block of pixels associated with the CTB to partition the block of pixels into decreasing blocks of pixels. The smaller block of pixels can be associated with a CU.

·prediction

Video encoder 20 may generate one or more prediction units (PUs) that each no longer partition the CU. Each PU of a CU may be associated with a different block of pixels within a block of pixels of the CU. Video encoder 20 may generate predictive pixel blocks for each PU of the CU. Video encoder 20 may use intra prediction or inter prediction to generate predictive pixel blocks for the PU. If video encoder 20 uses intra prediction to generate a predictive pixel block for a PU, video encoder 20 may generate a predictive pixel block for the PU based on the decoded pixels of the picture associated with the PU. If video encoder 20 uses inter prediction to generate predictive pixel blocks for a PU, video encoder 20 may generate predictiveness of the PU based on decoded pixels of one or more pictures that are different from pictures associated with the PU. Pixel block. Video encoder 20 may generate residual pixel blocks of the CU based on the predictive pixel blocks of the PU of the CU. The residual pixel block of the CU may indicate the difference between the sampled value in the predictive pixel block of the PU of the CU and the corresponding sampled value in the initial pixel block of the CU.

·Transformation quantization

Video encoder 20 may perform recursive quadtree partitioning on the residual pixel blocks of the CU to partition the residual pixel blocks of the CU into one or more smaller residual pixel blocks associated with the transform units (TUs) of the CU. Because the pixels in the pixel block associated with the TU each correspond to one luma sample and two chroma samples, each TU can be associated with one luma residual sample block and two chroma residual sample blocks. Video encoder 20 may apply one or more transforms to the residual sample block associated with the TU to generate a coefficient block (ie, a block of coefficients). The transform can be a DCT transform or a variant thereof. Using the transform matrix of the DCT, the coefficient block is obtained by applying a one-dimensional transform in the horizontal and vertical directions to calculate a two-dimensional transform. Video encoder 20 may perform a quantization procedure for each of the coefficients in the coefficient block. Quantization generally refers to the process by which the coefficients are quantized to reduce the amount of data used to represent the coefficients, thereby providing further compression.

Entropy coding

Video encoder 20 may generate a set of syntax elements that represent coefficients in the quantized coefficient block. Video encoder 20 may apply an entropy encoding operation (eg, a context adaptive binary arithmetic coding (CABAC) operation) to some or all of the above syntax elements. To apply CABAC encoding to syntax elements, video encoder 20 may binarize the syntax elements to form a binary sequence that includes one or more bits (referred to as "binary"). Video encoder 20 may encode a portion of the binary using regular encoding, and may use bypass encoding to encode other portions of the binary.

·Coded end reconstructed image

In addition to the syntax elements of the entropy coding coefficient block, video encoder 20 may apply an inverse quantization and an inverse transform to the transformed coefficient block to reconstruct the residual sample block from the transformed coefficient block. Video encoder 20 may add the reconstructed residual sample block to a corresponding sample block of one or more predictive sample blocks to produce a reconstructed sample block. By reconstructing the sample block for each color component, video encoder 20 may reconstruct the block of pixels associated with the TU. The pixel block of each TU of the CU is reconstructed in this way until the entire pixel block reconstruction of the CU is completed.

·Encoding end filtering

After video encoder 20 reconstructs the block of pixels of the CU, video encoder 20 may perform a deblocking filtering operation to reduce the blockiness of the block of pixels associated with the CU. After video encoder 20 performs the deblocking filtering operation, video encoder 20 may use sample adaptive offset (SAO) to modify the reconstructed block of pixels of the CTB of the picture. After performing these operations, video encoder 20 may store the reconstructed blocks of pixels of the CU in a decoded picture buffer for use in generating predictive blocks of pixels for other CUs.

Entropy decoding

Video decoder 30 can receive the code stream. The code stream contains encoded information of video data encoded by video encoder 20 in the form of a bitstream. Video decoder 30 may parse the code stream to extract syntax elements from the code stream. When video decoder 30 performs CABAC decoding, video decoder 30 may perform regular decoding on partial bins and may perform bypass decoding on bins of other portions, and the bins in the code stream have mapping relationships with syntax elements, through parsing The binary gets the syntax element.

·Decoding end reconstruction image

Video decoder 30 may reconstruct a picture of the video data based on the syntax elements extracted from the code stream. The process of reconstructing video data based on syntax elements is generally reciprocal to the process performed by video encoder 20 to generate syntax elements. For example, video decoder 30 may generate a predictive pixel block of a PU of a CU based on syntax elements associated with the CU. Additionally, video decoder 30 may inverse quantize the coefficient blocks associated with the TUs of the CU. Video decoder 30 may perform an inverse transform on the inverse quantized coefficient block to reconstruct a residual pixel block associated with the TU of the CU. Video decoder 30 may reconstruct a block of pixels of the CU based on the predictive pixel block and the residual pixel block.

·Decoder filtering

After video decoder 30 reconstructs the block of pixels of the CU, video decoder 30 may perform a deblocking filtering operation to reduce the blockiness of the block of pixels associated with the CU. Additionally, video decoder 30 may perform the same SAO operations as video encoder 20 based on one or more SAO syntax elements. After video decoder 30 performs these operations, video decoder 30 may store the block of pixels of the CU in a decoded picture buffer. The decoded picture buffer can provide reference pictures for subsequent motion compensation, intra prediction, and presentation by the display device.

·Encoding module

22 is a block diagram illustrating an example video encoder 20 that is configured to implement the techniques of the present invention. It should be understood that FIG. 22 is exemplary and should not be considered as limiting the techniques as broadly exemplified and described herein. As shown in FIG. 22, video encoder 20 includes prediction processing unit 100, residual generation unit 102, transform processing unit 104, quantization unit 106, inverse quantization unit 108, inverse transform processing unit 110, reconstruction unit 112, filter unit 113, The picture buffer 114 and the entropy encoding unit 116 are decoded. Entropy encoding unit 116 includes a regular CABAC codec engine 118 and a bypass codec engine 120. The prediction processing unit 100 includes an inter prediction processing unit 121 and an intra prediction processing unit 126. The inter prediction processing unit 121 includes a motion estimation unit 122 and a motion compensation unit 124. In other examples, video encoder 20 may include more, fewer, or different functional components.

· Prediction module

Video encoder 20 receives the video data. To encode the video data, video encoder 20 may encode each strip of each picture of the video data. As part of the encoded strip, video encoder 20 may encode each CTB in the strip. As part of encoding the CTB, prediction processing unit 100 may perform quadtree partitioning on the pixel blocks associated with the CTB to divide the block of pixels into decreasing blocks of pixels. For example, prediction processing unit 100 may partition a block of pixels of a CTB into four equally sized sub-blocks, split one or more of the sub-blocks into four equally sized sub-sub-blocks, and the like.

Video encoder 20 may encode the CU of the CTB in the picture to generate coded information for the CU. Video encoder 20 may encode the CU of the CTB according to the fold scan order. In other words, video encoder 20 may encode the CU by the upper left CU, the upper right CU, the lower left CU, and then the lower right CU. When video encoder 20 encodes the partitioned CU, video encoder 20 may encode the CU associated with the sub-block of the pixel block of the partitioned CU according to the fold scan order.

Moreover, prediction processing unit 100 can partition the pixel blocks of the CU in one or more PUs of the CU. Video encoder 20 and video decoder 30 can support a variety of PU sizes. Assuming that the size of a particular CU is 2N×2N, video encoder 20 and video decoder 30 may support a PU size of 2N×2N or N×N for intra prediction, and support 2N×2N, 2N×N, N× 2N, N x N or similarly sized symmetric PUs for inter prediction. Video encoder 20 and video decoder 30 may also support asymmetric PUs of 2N x nU, 2N x nD, nL x 2N, and nR x 2N for inter prediction.

The inter prediction processing unit 121 may generate predictive data of the PU by performing inter prediction on each PU of the CU. The predictive data of the PU may include motion information corresponding to the predictive pixel block of the PU and the PU. The strip can be an I strip, a P strip or a B strip. The inter prediction unit 121 may perform different operations on the PU of the CU depending on whether the PU is in an I slice, a P slice, or a B slice. In the I slice, all PUs perform intra prediction.

If the PU is in a P-strip, motion estimation unit 122 may search for a reference picture in a list of reference pictures (eg, "List 0") to find a reference block for the PU. The reference block of the PU may be the pixel block that most closely corresponds to the pixel block of the PU. Motion estimation unit 122 may generate a reference picture index that indicates a reference picture of the PU-containing reference block in list 0, and a motion vector that indicates a spatial displacement between the pixel block of the PU and the reference block. The motion estimation unit 122 may output the reference picture index and the motion vector as motion information of the PU. Motion compensation unit 124 may generate a predictive pixel block of the PU based on the reference block indicated by the motion information of the PU.

If the PU is in B-strip, motion estimation unit 122 may perform uni-directional inter prediction or bi-directional inter prediction on the PU. To perform uni-directional inter prediction on the PU, motion estimation unit 122 may search for a reference picture of a first reference picture list ("List 0") or a second reference picture list ("List 1") to find a reference block for the PU. The motion estimation unit 122 may output the following as the motion information of the PU: a reference picture index indicating a position in the list 0 or the list 1 of the reference picture containing the reference block, a space between the pixel block indicating the PU and the reference block The motion vector of the displacement, and the prediction direction indicator indicating whether the reference picture is in list 0 or in list 1. To perform bi-directional inter prediction on the PU, motion estimation unit 122 may search for reference pictures in list 0 to find reference blocks for the PU, and may also search for reference pictures in list 1 to find another reference block for the PU. Motion estimation unit 122 may generate a reference picture index indicating the list 0 of the reference picture containing the reference block and the location in list 1. Additionally, motion estimation unit 122 may generate a motion vector that indicates a spatial displacement between the reference block and the pixel block of the PU. The motion information of the PU may include a reference picture index of the PU and a motion vector. Motion compensation unit 124 may generate a predictive pixel block of the PU based on the reference block indicated by the motion information of the PU.

Intra prediction processing unit 126 may generate predictive data for the PU by performing intra prediction on the PU. The predictive data of the PU may include predictive pixel blocks of the PU and various syntax elements. Intra prediction processing unit 126 may perform intra prediction on I slices, P slices, and PUs within B slices.

To perform intra prediction on a PU, intra-prediction processing unit 126 may use multiple intra-prediction modes to generate multiple sets of predictive data for the PU. To generate a set of predictive data for a PU using intra prediction mode, intra-prediction processing unit 126 may spread samples of sample blocks from neighboring PUs across sample blocks of the PU in a direction associated with the intra-prediction mode. It is assumed that the coding order from left to right and from top to bottom is used for PU, CU and CTB, and the adjacent PU may be above the PU, at the upper right of the PU, at the upper left of the PU or to the left of the PU. Intra prediction processing unit 126 may use a different number of intra prediction modes, for example, 33 directional intra prediction modes. In some examples, the number of intra prediction modes may depend on the size of the pixel block of the PU.

The prediction processing unit 100 may select the predictive data of the PU of the CU from among the predictive data generated by the inter prediction processing unit 121 for the PU or the predictive data generated by the intra prediction processing unit 126 for the PU. In some examples, prediction processing unit 100 selects predictive data for the PU of the CU based on the rate/distortion metric of the set of predictive data. For example, a Lagrangian cost function is used to select between an encoding mode and its parameter values, such as motion vectors, reference indices, and intra prediction directions. This kind of cost function uses the weighting factor lambda to relate the actual or estimated image distortion due to the lossy coding method to the actual or estimated amount of information needed to represent the pixel values in the image region: C=D+lambda×R, where C is the Lagrangian cost to be minimized, D is the image distortion with the mode and its parameters (eg mean square error), R is for reconstructing the image block in the decoder The number of bits required (for example including the amount of data used to represent the candidate motion vectors). In general, the least expensive coding mode is selected as the actual coding mode. A predictive pixel block that selects predictive data may be referred to herein as a selected predictive pixel block.

Residual generation unit 102 may generate a residual pixel block of the CU based on the pixel block of the CU and the selected predictive pixel block of the PU of the CU. For example, the residual generation unit 102 may generate a residual pixel block of the CU such that each sample in the residual pixel block has a value equal to a difference between: a sample in a pixel block of the CU, and a PU of the CU Corresponding samples in the predictive pixel block are selected.

The prediction processing unit 100 may perform quadtree partitioning to partition the residual pixel block of the CU into sub-blocks. Each residual pixel block that is no longer partitioned may be associated with a different TU of the CU. The size and location of the residual pixel block associated with the TU of the CU is not necessarily related to the size and location of the pixel block of the CU-based PU.

·Transformation module

Since the pixels of the residual pixel block of the TU can correspond to one luma sample and two chroma samples, each TU can be associated with one luma sample block and two chroma sample blocks. Transform processing unit 104 may generate a coefficient block for each TU of the CU by applying one or more transforms to the residual sample block associated with the TU. For example, transform processing unit 104 may apply a discrete cosine transform (DCT), a directional transform, or a conceptually similar transform to the residual sample block.

·Quantization module

Quantization unit 106 may quantize the coefficients in the coefficient block. For example, an n-bit coefficient can be truncated to an m-bit coefficient during quantization, where n is greater than m. Quantization unit 106 may quantize the coefficient block associated with the TU of the CU based on a quantization parameter (QP) value associated with the CU. Video encoder 20 may adjust the degree of quantization applied to the coefficient block associated with the CU by adjusting the QP value associated with the CU.

· Code reconstruction module (inverse transform quantization)

Inverse quantization unit 108 and inverse transform processing unit 110 may apply inverse quantization and inverse transform, respectively, to the transformed coefficient block to reconstruct the residual sample block from the coefficient block. Reconstruction unit 112 may add samples of the reconstructed residual sample block to corresponding samples of one or more predictive sample blocks generated by prediction processing unit 100 to generate a reconstructed sample block associated with the TU. By reconstructing the sample block of each TU of the CU in this manner, video encoder 20 may reconstruct the block of pixels of the CU.

·Filter module

Filter unit 113 may perform a deblocking filtering operation to reduce blockiness of pixel blocks associated with the CU. Further, the filter unit 113 may apply the SAO offset determined by the prediction processing unit 100 to the reconstructed sample block to restore the pixel block. Filter unit 113 may generate encoding information for the SAO syntax elements of the CTB.

·Reference image module

The decoded picture buffer 114 can store the reconstructed block of pixels. Inter prediction unit 121 may perform inter prediction on PUs of other pictures using reference pictures containing the reconstructed pixel blocks. Additionally, intra-prediction processing unit 126 can use the reconstructed block of pixels in decoded picture buffer 114 to perform intra-prediction on other PUs in the same picture as the CU.

Entropy coding module

Entropy encoding unit 116 may receive data from other functional components of video encoder 20. For example, entropy encoding unit 116 may receive a coefficient block from quantization unit 106 and may receive syntax elements from prediction processing unit 100. Entropy encoding unit 116 may perform one or more entropy encoding operations on the data to generate entropy encoded data. For example, entropy encoding unit 116 may perform context adaptive variable length codec (CAVLC) operations, CABAC operations, variable to variable (V2V) length codec operations, grammar-based context adaptive binary arithmetic coding on data. Decoding (SBAC) operations, probability interval partition entropy (PIPE) codec operations, or other types of entropy coding operations. In a particular example, entropy encoding unit 116 may encode regular CABAC codec bins of syntax elements using regular CABAC engine 118, and may encode pass-through codec bins using bypass codec engine 120.

·Decoding module

23 is a block diagram illustrating an example video decoder 30 that is configured to implement the techniques of the present invention. It should be understood that FIG. 23 is exemplary and should not be considered as limiting the techniques as broadly exemplified and described herein. As shown in FIG. 23, video decoder 30 includes an entropy decoding unit 150, a prediction processing unit 152, an inverse quantization unit 154, an inverse transform processing unit 156, a reconstruction unit 158, a filter unit 159, and a decoded picture buffer 160. The prediction processing unit 152 includes a motion compensation unit 162 and an intra prediction processing unit 164. Entropy decoding unit 150 includes a regular CABAC codec engine 166 and a bypass codec engine 168. In other examples, video decoder 30 may include more, fewer, or different functional components.

Video decoder 30 can receive the code stream. Entropy decoding unit 150 may parse the code stream to extract syntax elements from the code stream. As part of parsing the code stream, entropy decoding unit 150 may parse the entropy encoded syntax elements in the code stream. The prediction processing unit 152, the inverse quantization unit 154, the inverse transform processing unit 156, the reconstruction unit 158, and the filter unit 159 may decode the video data according to the syntax elements extracted from the code stream, that is, generate the decoded video data.

Entropy decoding module

The syntax elements may include a regular CABAC codec binary and a bypass codec binary. Entropy decoding unit 150 may use a regular CABAC codec engine 166 to decode the regular CABAC codec bins, and may use the bypass codec engine 168 to decode the bypass codec bins.

· Prediction module

If the PU uses intra prediction encoding, intra prediction processing unit 164 may perform intra prediction to generate a predictive sample block for the PU. Intra-prediction processing unit 164 may use an intra-prediction mode to generate a predictive pixel block of a PU based on a block of pixels of a spatially neighboring PU. Intra prediction processing unit 164 may determine an intra prediction mode for the PU based on one or more syntax elements parsed from the code stream.

Motion compensation unit 162 may construct a first reference picture list (List 0) and a second reference picture list (List 1) based on syntax elements parsed from the code stream. Furthermore, if the PU uses inter prediction coding, the entropy decoding unit 150 may parse the motion information of the PU. Motion compensation unit 162 can determine one or more reference blocks of the PU based on the motion information of the PU. Motion compensation unit 162 can generate a predictive pixel block of the PU from one or more reference blocks of the PU.

· Decoding reconstruction module (inverse transform quantization)

Additionally, video decoder 30 may perform a reconstruction operation on a CU that is no longer split. To perform a reconstruction operation on a CU that is no longer split, video decoder 30 may perform a reconstruction operation on each TU of the CU. By performing a reconstruction operation on each TU of the CU, video decoder 30 may reconstruct the residual pixel blocks associated with the CU.

As part of performing a reconstruction operation on the TU of the CU, inverse quantization unit 154 may inverse quantize (ie, dequantize) the coefficient block associated with the TU. Inverse quantization unit 154 may determine the degree of quantization using the QP value associated with the CU of the TU, and determine the degree of inverse quantization that the inverse quantization unit 154 will apply.

After inverse quantization unit 154 inverse quantizes the coefficient block, inverse transform processing unit 156 may apply one or more inverse transforms to the coefficient block to generate a residual sample block associated with the TU. For example, inverse transform processing unit 156 may map inverse DCT, inverse integer transform, Karhunen-Loeve transform (KLT), inverse rotation transform, inverse directional transform, or other transform to the encoding end. The inverse transform is applied to the coefficient block.

Reconstruction unit 158 may use the residual pixel block associated with the TU of the CU and the predictive pixel block of the PU of the CU (ie, intra-prediction data or inter-prediction data) to reconstruct the block of pixels of the CU, where applicable. In particular, reconstruction unit 158 can add samples of the residual pixel block to corresponding samples of the predictive pixel block to reconstruct the pixel block of the CU.

·Filter module

Filter unit 159 may perform a deblocking filtering operation to reduce the blockiness of the block of pixels associated with the CU of the CTB. Additionally, filter unit 159 can modify the pixel values of the CTB based on the SAO syntax elements parsed from the code stream. For example, filter unit 159 can determine the correction value based on the SAO syntax element of the CTB and add the determined correction value to the sample value in the reconstructed pixel block of the CTB. By modifying some or all of the pixel values of the CTB of the picture, the filter unit 159 can modify the reconstructed picture of the video data according to the SAO syntax element.

Reference image module

Video decoder 30 may store the block of pixels of the CU in decoded picture buffer 160. The decoded picture buffer 160 may provide reference pictures for subsequent motion compensation, intra prediction, and presentation by a display device (eg, display device 32 of FIG. 21). For example, video decoder 30 may perform intra-prediction operations or inter-prediction operations on PUs of other CUs according to the blocks of pixels in decoded picture buffer 160.

The video encoder of the embodiment of the present invention may be used to perform the video encoding method of the foregoing embodiments, and the functional modules of the video encoding apparatus shown in FIG. 18a and FIG. 18b may be integrated into the video encoder 20 of the embodiment of the present invention. . For example, the video encoder can be used to perform the video encoding method of the embodiment shown in FIG. 2, FIG. 5 or FIG. 12 described above.

Thus, video encoder 20 acquires a plurality of video frames, each of which includes redundant data on the picture content. Then, the video encoder 20 reconstructs the plurality of video frames to obtain scene information and reconstruction residuals of each video frame, where the scene information includes redundant data to reduce redundancy, and the residual is used for reconstruction. Indicates the difference between the video frame and the scene information. Next, the video encoder 20 predictively encodes the scene information to obtain scene feature prediction encoded data. The video encoder 20 predictively encodes the reconstructed residual to obtain residual prediction encoded data. In this way, by performing the process of reconstructing the plurality of video frames, the redundancy of the video frames can be reduced, so that in the encoding operation, the obtained scene features and the reconstructed residual total compressed data amount are relative to the original video. The amount of compressed data of the frame is reduced, reducing the amount of data obtained after compression. Each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual includes residual information other than the scene information, the amount of information is small and sparse, and the feature can be compared when performing predictive coding. The codewords are less predictively encoded, the amount of encoded data is small, and the compression ratio is high. Thus, the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.

In an embodiment of the present invention, a video decoder is further provided, where the video decoder can be used to perform the video decoding method of the foregoing embodiments, and the functional modules of the video decoding device shown in FIG. 19 can also be integrated. On video decoder 30 of an embodiment of the invention. For example, the video decoder 30 can be used to perform the video decoding method of the embodiment shown in FIG. 2, FIG. 6, or FIG.

In this way, after the video decoder 30 obtains the scene feature prediction encoded data and the residual prediction encoded data, the video decoder 30 decodes the scene feature prediction encoded data to obtain scene information, where the scene information includes redundant data and reduces redundancy. The data, the redundant data is redundant data on the picture content between each of the plurality of video frames. Next, video decoder 30 decodes the residual prediction encoded data to obtain a reconstructed residual, which is used to represent the difference between the video frame and the scene information. And a video decoder 30, configured to perform reconstruction according to the scene information and the reconstructed residual to obtain a plurality of video frames. In this way, the scene feature predictive coded data and the residual predictive coded data obtained by encoding the video coding device in the foregoing embodiment can complete the decoding operation by using the video decoding device of the embodiment of the present invention.

In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted as one or more instructions or code via a computer-readable medium and executed by a hardware-based processing unit. The computer readable medium can comprise a computer readable storage medium (which corresponds to a tangible medium such as a data storage medium) or a communication medium comprising, for example, any medium that facilitates transfer of the computer program from one place to another in accordance with a communication protocol. . In this manner, computer readable media generally may correspond to (1) a non-transitory tangible computer readable storage medium, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for use in carrying out the techniques described herein. The computer program product can comprise a computer readable medium.

Some computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, disk storage or other magnetic storage device, flash memory, or may be used to store instructions or data structures, by way of example and not limitation. Any other medium in the form of the desired program code and accessible by the computer. Also, any connection is properly termed a computer-readable medium. For example, if you use coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology (eg, infrared, radio, and microwave) to send commands from a website, server, or other remote source, coaxial cable , fiber optic cable, twisted pair, DSL, or wireless technologies (eg, infrared, radio, and microwave) are included in the definition of the media. However, it should be understood that computer readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but rather non-transitory tangible storage media. As used herein, a magnetic disk and an optical disk include a compact disk (CD), a laser disk, an optical disk, a digital video disk (DVD), a flexible disk, and a Blu-ray disk, wherein the disk usually reproduces data magnetically, and the disk passes the laser Optically copy data. Combinations of the above should also be included within the scope of computer readable media.

One or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuits To execute the instructions. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structures or any other structure suitable for implementing the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Moreover, the techniques can be fully implemented in one or more circuits or logic elements.

The techniques of the present invention can be broadly implemented by a variety of devices or devices, including a wireless handset, an integrated circuit (IC), or a collection of ICs (eg, a chipset). Various components, modules or units are described in this disclosure to emphasize functional aspects of the apparatus configured to perform the disclosed techniques, but are not necessarily required to be implemented by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or combined with suitable software and/or by a collection of interoperable hardware units (including one or more processors as described above). Or firmware to provide.

It is to be understood that the phrase "one embodiment" or "an embodiment" or "an" Thus, "in one embodiment" or "in an embodiment" or "an" In addition, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the various embodiments of the present invention, it should be understood that the size of the sequence numbers of the above processes does not mean the order of execution, and the order of execution of each process should be determined by its function and internal logic, and should not be taken to the embodiments of the present invention. The implementation process constitutes any limitation.

Additionally, the terms "system" and "network" are used interchangeably herein. It should be understood that the term "and/or" herein is merely an association relationship describing an associated object, indicating that there may be three relationships, for example, A and/or B, which may indicate that A exists separately, and A and B exist simultaneously. There are three cases of B alone. In addition, the character "/" in this article generally indicates that the contextual object is an "or" relationship.

In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A, and B can be determined from A. However, it should also be understood that determining B from A does not mean that B is only determined based on A, and that B can also be determined based on A and/or other information.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the various examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both, for clarity of hardware and software. Interchangeability, the composition and steps of the various examples have been generally described in terms of function in the above description. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the solution. A person skilled in the art can use different methods for implementing the described functions for each particular application, but such implementation should not be considered to be beyond the scope of the present invention.

A person skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the system, the device and the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of the unit is only a logical function division. In actual implementation, there may be another division manner, for example, multiple units or components may be combined or Can be integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

In the several embodiments provided by the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of the unit is only a logical function division. In actual implementation, there may be another division manner, for example, multiple units or components may be combined or Can be integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, which is essential or contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium. A number of instructions are included to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention. The foregoing storage medium includes: a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, and the like. .

In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with embodiments of the present invention are generated in whole or in part. The computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions can be from a website site, computer, server or data center Transfer to another website site, computer, server, or data center by wire (eg, coaxial cable, fiber optic, digital subscriber line (DSL), or wireless (eg, infrared, wireless, microwave, etc.). The computer readable storage medium can be any available media that can be stored by a computer or a data storage device such as a server, data center, or the like that includes one or more available media. The usable medium may be a magnetic medium (eg, a floppy disk, a hard disk, a magnetic tape), an optical medium (eg, a DVD), or a semiconductor medium (such as a solid state disk (SSD)).

The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to be limiting; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that The technical solutions described in the embodiments are modified, or the equivalents of the technical features are replaced by the equivalents of the technical solutions of the embodiments of the present invention.

Claims

A video encoding method, the method comprising:

Acquiring a plurality of video frames, each of the plurality of video frames including redundant data on the picture content;

Reconstructing the plurality of video frames to obtain scene information and a reconstruction residual of each video frame, where the scene information includes data obtained by reducing redundancy of the redundant data, the weight Constructing a residual is used to represent a difference between the video frame and the scene information;

Performing predictive coding on the scene information to obtain scene feature prediction encoded data;

The reconstructed residual is predictively encoded to obtain residual prediction encoded data.
The method of claim 1 wherein

Reconstructing the plurality of video frames to obtain scene information and reconstruction residuals of each video frame, including:

Reconstructing the plurality of video frames to obtain a scene feature and a reconstructed residual of each video frame, where the scene feature is used to represent the same picture content between each video frame, the weight Constructing a residual is used to represent a difference between the video frame and the scene feature;

Performing predictive coding on the scene information to obtain scene feature prediction encoded data, including:

Performing predictive coding on the scene features to obtain scene feature prediction encoded data.
The method of claim 2 wherein:

Before the reconstructing the plurality of video frames to obtain the scene features and the reconstructed residuals of the video frames, the method further includes:

Extracting picture feature information of each of the plurality of video frames;

And calculating, according to the picture feature information, content metric information, where the content metric information is used to measure the difference of the picture content of the multiple video frames;

And performing the step of reconstructing the plurality of video frames to obtain a scene feature and a reconstruction residual of each video frame, when the content metric information is not greater than a preset metric threshold.
The method of claim 2 wherein:

The acquiring multiple video frames includes:

Obtaining a video stream, where the video frame of the video stream includes an I frame and a B frame, and a P frame;

Extracting the I frame from the video stream, where the I frame is used to perform the step of performing reconfiguration on the multiple video frames to obtain a scene feature and a reconstructed residual of each video frame;

The method further includes:

Reconstructing according to the scene feature and the reconstructed residual to obtain a reference frame;

Referring to the reference frame, performing inter prediction encoding on the B frame and the P frame, to obtain B frame predictive encoded data and P frame predictive encoded data;

Performing transform coding, quantization coding, and entropy coding on the predictive coded data to obtain video compressed data; the predictive coded data includes the scene feature predictive coded data, the residual predictive coded data, the B frame predictive coded data, and the The P frame predictive coded data.
The method of claim 1 wherein

Each of the plurality of video frames includes redundant data at a local location with respect to each other;

Reconstructing the plurality of video frames to obtain scene information and reconstruction residuals of each video frame, including:

Splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks;

Reconstructing the plurality of frame sub-blocks to obtain a scene feature, a representation coefficient of each frame sub-block of the plurality of frame sub-blocks, and a reconstruction residual of each frame sub-block, the scene The feature includes a plurality of independent scene feature bases, wherein the independent scene feature bases cannot be reconstructed from each other within the scene feature, and the scene feature base is used to describe a screen content feature of the frame sub-block, The representation coefficient represents a correspondence between the scene feature base and the frame sub-block, and the reconstruction residual represents a difference between the frame sub-block and the scene feature base;

Performing predictive coding on the scene information to obtain scene feature prediction encoded data, including:

Performing predictive coding on the scene features to obtain scene feature prediction encoded data.
The method of claim 5 wherein:

Reconstructing the plurality of frame sub-blocks to obtain a scene feature, a representation coefficient of each frame sub-block of the plurality of frame sub-blocks, and a reconstruction residual of each frame sub-block, including :

Reconstructing the plurality of frame sub-blocks to obtain a representation coefficient of each frame sub-block of the plurality of frame sub-blocks and a reconstruction residual of each of the frame sub-blocks, where the representation coefficient represents Corresponding relationship between the frame sub-block and the target frame sub-block, where the target frame sub-block is an independent frame sub-block in the multiple frame sub-blocks, and the independent frame sub-block is not based on the multiple frame sub-blocks And reconstructing, by the other frame sub-blocks in the block, the obtained frame sub-block, where the reconstructed residual is used to represent a difference between the target frame sub-block and the frame sub-block;

And combining the plurality of target frame sub-blocks indicated by the coefficient to obtain a scene feature, where the target frame sub-block is a scene feature base.
The method of claim 5 wherein:

Before the splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks, the method further includes:

Extracting picture feature information of each of the plurality of video frames;

And calculating, according to the picture feature information, content metric information, where the content metric information is used to measure the difference of the picture content of the multiple video frames;

When the content metric information is greater than the preset metric threshold, performing the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks.
The method of claim 5 wherein:

The acquiring multiple video frames includes:

Obtaining a video stream, where the video frame of the video stream includes an I frame and a B frame, and a P frame;

Extracting the I frame from the video stream, where the I frame is used to perform the step of splitting each of the multiple video frames to obtain a plurality of frame sub-blocks;

The method further includes:

Performing reconstruction according to the scene feature, the representation coefficient, and the reconstruction residual to obtain a reference frame;

Referring to the reference frame, performing inter prediction encoding on the B frame and the P frame, to obtain B frame predictive encoded data and P frame predictive encoded data;

Performing transform coding, quantization coding, and entropy coding on the predictive coded data to obtain video compressed data; the predictive coded data includes the scene feature predictive coded data, the residual predictive coded data, the B frame predictive coded data, and the The P frame predictive coded data.
A method according to any one of claims 1 to 8, wherein

After the acquiring multiple video frames, the method further includes:

And categorizing the plurality of video frames according to the correlation of the content of the picture, to obtain a video frame of one or more classification clusters, where the video frame of the same classification cluster is used to perform the reconstruction on the multiple video frames to obtain Scene information and the step of reconstructing the residual of each video frame.
The method of claim 1 wherein

The acquiring multiple video frames includes:

Acquiring a video stream, the video stream comprising a plurality of video frames;

Extracting feature information of the first video frame and the second video frame, where the feature information is used to describe a picture content of the video frame, where the first video frame and the second video frame are in the video stream Video frame

Calculating a lens distance between the first video frame and the second video frame according to the feature information;

Determining whether the lens distance is greater than a preset lens threshold;

If the lens distance is greater than the preset lens threshold, the target lens is segmented from the video stream, the start frame of the target lens is the first video frame, and the end frame of the target lens is a previous video frame of the second video frame; if the lens distance is less than the preset lens threshold, the first video frame and the second video frame are attributed to the same lens, and the target lens belongs to the same One of the shots of the video stream, the shot being a video frame that is continuous in time;

For each shot in the video stream, a key frame is extracted according to a frame distance between video frames in the shot, and a frame distance between any two adjacent key frames in each shot is greater than a preset frame distance threshold. The frame distance is used to indicate the degree of difference between the two video frames, and the key frame of each shot is used to perform the reconstructing of the multiple video frames to obtain scene information and each video frame. The step of reconstructing the residual.
A video decoding method, the method comprising:

Obtaining scene feature prediction encoded data and residual prediction encoded data;

Decoding the scene feature prediction encoded data to obtain scene information, where the scene information includes data obtained by reducing redundancy of the redundant data, the redundant data being each video of multiple video frames Redundant data on the content of the picture between frames;

Decoding the residual prediction encoded data to obtain a reconstructed residual, where the reconstructed residual is used to represent a difference between the video frame and the scene information;

Performing reconstruction according to the scene information and the reconstructed residual to obtain the plurality of video frames.
The method of claim 11 wherein

Decoding the scene feature prediction encoded data to obtain scenario information, including:

Decoding the scene feature prediction encoded data to obtain a scene feature, where the scene feature is used to represent the same picture content between each video frame;

Reconstructing according to the scenario information and the reconstructed residual to obtain the multiple video frames, including:

Performing reconstruction according to the scene feature and the reconstructed residual to obtain the plurality of video frames.
The method of claim 12 wherein:

And acquiring the scene feature prediction encoded data and the residual prediction encoded data, including:

Obtain video compression data;

Performing entropy decoding, inverse quantization processing, and DCT inverse variation on the video compressed data to obtain predictive encoded data, where the predictive encoded data includes the scene feature predictive encoded data, the residual predictive encoded data, B frame predictive encoded data, and P frame prediction encoded data;

Reconstructing according to the scene feature and the reconstructed residual to obtain the multiple video frames, including:

Reconstructing according to the scene feature and the reconstructed residual, to obtain a plurality of I frames;

The method further includes:

Performing inter-frame decoding on the B frame predictive coded data and the P frame predictive coded data by using the I frame as a reference frame to obtain a B frame and a P frame;

The I frame, the B frame, and the P frame are arranged in chronological order to obtain a video stream.
The method of claim 11 wherein

The method further includes:

Obtain a representation coefficient;

Decoding the scene feature prediction encoded data to obtain scenario information, including:

Decoding the scene feature prediction encoded data to obtain a scene feature, where the scene feature includes a plurality of independent scene feature bases, wherein the independent scene feature bases cannot be reconstructed from each other in the scene feature, The scene feature base is used to describe a picture content feature of the frame sub-block, the representation coefficient represents a correspondence between the scene feature base and the frame sub-block, and the reconstruction residual represents the frame sub-block and the scene The difference between the characteristic bases;

Reconstructing according to the scenario information and the reconstructed residual to obtain the multiple video frames, including:

Reconstructing according to the scene feature, the representation coefficient, and the reconstruction residual, to obtain a plurality of frame sub-blocks;

Combining the plurality of frame sub-blocks to obtain a plurality of video frames.
The method of claim 14 wherein:

And acquiring the scene feature prediction encoded data and the residual prediction encoded data, including:

Obtain video compression data;

Performing entropy decoding, inverse quantization processing, and DCT inverse variation on the video compressed data to obtain predictive encoded data, where the predictive encoded data includes the scene feature predictive encoded data, the residual predictive encoded data, B frame predictive encoded data, and P frame prediction encoded data;

Combining the plurality of frame sub-blocks to obtain a plurality of video frames, including:

Combining the plurality of frame sub-blocks to obtain a plurality of I frames;

The method further includes:

Performing inter-frame decoding on the B frame predictive coded data and the P frame predictive coded data by using the I frame as a reference frame to obtain a B frame and a P frame;

The I frame, the B frame, and the P frame are arranged in chronological order to obtain a video stream.
A video encoding device, characterized in that the device comprises:

An acquiring module, configured to acquire a plurality of video frames, where each of the plurality of video frames includes redundant data on the screen content;

a reconstruction module, configured to reconstruct the multiple video frames to obtain scene information and a reconstruction residual of each video frame, where the scenario information is obtained by reducing redundancy of the redundant data Data, the reconstruction residual is used to represent a difference between the video frame and the scene information;

a prediction encoding module, configured to perform predictive coding on the scene information to obtain scene feature prediction encoded data;

The predictive coding module is further configured to perform predictive coding on the reconstructed residual to obtain residual predictive encoded data.
The device of claim 16 wherein:

The reconstruction module is further configured to reconstruct the multiple video frames to obtain a scene feature and a reconstructed residual of each video frame, where the scene feature is used to represent the each video frame The same picture content, the reconstruction residual is used to represent a difference between the video frame and the scene feature;

The predictive coding module is further configured to perform predictive coding on the scene feature to obtain scene feature predictive encoded data.
The device according to claim 17, wherein

The device further includes:

a feature extraction module, configured to extract picture feature information of each of the plurality of video frames;

a metric information calculation module, configured to calculate, according to the picture feature information, content metric information, where the content metric information is used to measure a difference of picture content of the multiple video frames;

When the content metric information is not greater than the preset metric threshold, the re-implementing module performs the step of reconstructing the multiple video frames to obtain a scene feature and a reconstructed residual of each video frame .
The device according to claim 17, wherein

The acquiring module is further configured to acquire a video stream, where a video frame of the video stream includes an I frame and a B frame, and a P frame, and extract the I frame from the video stream, where the I frame is used to perform the Reconstructing the plurality of video frames to obtain a scene feature and a reconstruction residual of each video frame;

The device further includes:

a reference frame reconstruction module, configured to perform reconstruction according to the scene feature and the reconstructed residual, to obtain a reference frame;

An inter prediction encoding module, configured to reference the reference frame, perform inter prediction encoding on the B frame and the P frame, to obtain B frame predictive encoded data and P frame predictive encoded data;

An encoding module, configured to perform transform coding, quantization coding, and entropy coding on the predictive coded data to obtain video compressed data; the predictive coded data includes the scene feature predictive coded data, the residual predictive coded data, and the B frame The encoded data and the P frame predictive encoded data are predicted.
The device of claim 16 wherein:

Each of the plurality of video frames includes redundant data at a local location with respect to each other;

The reconstruction module includes:

a splitting unit, configured to split each of the plurality of video frames to obtain a plurality of frame sub-blocks;

a reconstruction unit, configured to reconstruct the plurality of frame sub-blocks, to obtain a scene feature, a representation coefficient of each frame sub-block of the plurality of frame sub-blocks, and a reconstruction of each frame sub-block a residual feature, the scene feature includes a plurality of independent scene feature bases, wherein the independent scene feature bases cannot be reconstructed from each other in the scene feature, and the scene feature base is used to describe the frame sub-block a picture content feature, the representation coefficient represents a correspondence between the scene feature base and the frame sub-block, and the reconstruction residual represents a difference between the frame sub-block and the scene feature base;

The predictive coding module is further configured to perform predictive coding on the scene feature to obtain scene feature predictive coding data.
The device according to claim 20, wherein

The reconstruction unit includes:

a reconstructing sub-unit, configured to reconstruct the plurality of frame sub-blocks, to obtain a representation coefficient of each of the plurality of frame sub-blocks and a reconstructed residual of each of the sub-blocks of the frame And the representation coefficient represents a correspondence between the frame sub-block and the target frame sub-block, where the target frame sub-block is an independent frame sub-block in the multiple frame sub-blocks, and the independent frame sub-block is not And reconstructing a frame sub-block obtained by using another frame sub-block in the plurality of frame sub-blocks, where the reconstructed residual is used to represent a difference between the target frame sub-block and the frame sub-block;

a combining subunit, configured to combine the plurality of target frame sub-blocks indicated by the representation coefficients to obtain a scene feature, where the target frame sub-block is a scene feature base.
The device according to claim 20, wherein

The device further includes:

a feature extraction module, configured to extract picture feature information of each of the plurality of video frames;

a metric information calculation module, configured to calculate, according to the picture feature information, content metric information, where the content metric information is used to measure a difference of picture content of the multiple video frames;

When the content metric information is greater than the preset metric threshold, the splitting unit performs the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks.
The device according to claim 20, wherein

The acquiring module is further configured to acquire a video stream, where a video frame of the video stream includes an I frame and a B frame, and a P frame, and extract the I frame from the video stream, where the I frame is used to perform the Decomposing each of the plurality of video frames to obtain a plurality of frame sub-blocks;

The device further includes:

a reference frame reconstruction module, configured to perform reconfiguration according to the scene feature, the representation coefficient, and the reconstruction residual to obtain a reference frame;

An inter prediction encoding module, configured to reference the reference frame, perform inter prediction encoding on the B frame and the P frame, to obtain B frame predictive encoded data and P frame predictive encoded data;

An encoding module, configured to perform transform coding, quantization coding, and entropy coding on the predictive coded data to obtain video compressed data; the predictive coded data includes the scene feature predictive coded data, the residual predictive coded data, and the B frame The encoded data and the P frame predictive encoded data are predicted.
Apparatus according to any one of claims 16 to 23, wherein

The device further includes:

a classifying module, configured to classify the plurality of video frames based on a correlation of the content of the picture, to obtain a video frame of one or more classification clusters, where the video frames of the same classification cluster are used to perform the pair of the multiple video frames Performing a reconstruction to obtain scene information and a step of reconstructing a residual of each video frame.
The device of claim 16 wherein:

The obtaining module includes:

a video stream obtaining unit, configured to acquire a video stream, where the video stream includes multiple video frames;

a frame feature extraction unit, configured to respectively extract feature information of the first video frame and the second video frame, where the feature information is used to describe a picture content of the video frame, the first video frame and the second video frame a video frame in the video stream;

a lens distance calculation unit, configured to calculate a lens distance between the first video frame and the second video frame according to the feature information;

a lens distance determining unit, configured to determine whether the lens distance is greater than a preset lens threshold;

a lens splitting unit, configured to: when the lens distance is greater than the preset lens threshold, segment a target lens from the video stream, where a starting frame of the target lens is the first video frame, and the target The end frame of the lens is the previous video frame of the second video frame; if the lens distance is less than the preset lens threshold, the first video frame and the second video frame are attributed to the same lens. The target lens belongs to one of the lenses of the video stream, and the lens is a video frame that is continuous in time;

a key frame extracting unit, configured to extract a key frame according to a frame distance between video frames in the lens for each shot in the video stream, and a frame distance between any two adjacent key frames in each shot The frame distance is greater than a preset frame distance threshold, and the frame distance is used to indicate a degree of difference between the two video frames, where the key frame of each shot is used to perform the reconstructing of the multiple video frames to obtain scene information and The step of reconstructing the residual of each video frame.
A video decoding device, the device comprising:

An obtaining module, configured to acquire scene feature prediction encoded data and residual prediction encoded data;

a scene information decoding module, configured to decode the scene feature prediction encoded data, to obtain scene information, where the scene information includes data obtained by reducing redundancy of the redundant data, where the redundant data is multiple Redundant data on the content of the picture between each video frame in the video frame;

a reconstructed residual decoding module, configured to decode the residual prediction encoded data to obtain a reconstructed residual, where the reconstructed residual is used to represent a difference between the video frame and the scene information;

And a video frame reconstruction module, configured to perform reconstruction according to the scene information and the reconstruction residual, to obtain the multiple video frames.
The device according to claim 26, wherein

The scene information decoding module is further configured to: decode the scene feature prediction encoded data to obtain a scene feature, where the scene feature is used to represent the same screen content between each video frame;

The video frame reconstruction module is further configured to perform reconfiguration according to the scene feature and the reconstruction residual to obtain the multiple video frames.
The device according to claim 27, wherein

The obtaining module includes an acquiring unit and a decoding unit,

The acquiring unit is configured to acquire video compression data;

The decoding unit is configured to perform entropy decoding, inverse quantization processing, and DCT inverse change on the video compressed data to obtain predictive encoded data, where the predictive encoded data includes the scene feature predictive encoded data and the residual predictive encoded data. , B frame predictive coded data and P frame predictive coded data;

The video frame reconstruction module is further configured to perform reconfiguration according to the scene feature and the reconstructed residual to obtain multiple I frames.

The device further includes:

An inter-frame decoding module, configured to perform inter-frame decoding on the B-frame predictive coded data and the P-frame predictive coded data by using the I frame as a reference frame to obtain a B frame and a P frame;

And an arranging module, configured to align the I frame, the B frame, and the P frame in chronological order to obtain a video stream.
The device according to claim 26, wherein

The obtaining module is further configured to acquire a representation coefficient;

The scene information decoding module is further configured to decode the scene feature prediction encoded data to obtain a scene feature, where the scene feature includes multiple independent scene feature bases, and the independent scene features in the scene feature The base feature base is used to describe the picture content feature of the frame sub-block, and the representation coefficient indicates the correspondence between the scene feature base and the frame sub-block, the reconstruction residual Determining a difference between the frame sub-block and the scene feature base;

The video frame reconstruction module includes:

a reconstruction unit, configured to perform reconfiguration according to the scene feature, the representation coefficient, and the reconstruction residual to obtain a plurality of frame sub-blocks;

And a combining unit, configured to combine the plurality of frame sub-blocks to obtain a plurality of video frames.
The device according to claim 29, characterized in that

The acquiring module includes an acquiring unit and a decoding unit,

The acquiring unit is configured to acquire video compression data;

The decoding unit is configured to perform entropy decoding, inverse quantization processing, and DCT inverse change on the video compressed data to obtain predictive encoded data, where the predictive encoded data includes the scene feature predictive encoded data and the residual predictive encoded data. , B frame predictive coded data and P frame predictive coded data;

The combining unit is further configured to combine the plurality of frame sub-blocks to obtain a plurality of I frames;

The device further includes:

An inter-frame decoding module is configured to perform inter-frame decoding on the B-frame predictive coded data and the P-frame predictive-coded data by using the I frame as a reference frame to obtain a B frame and a P frame;

And an arranging module, configured to align the I frame, the B frame, and the P frame in chronological order to obtain a video stream.
A video codec device, characterized in that the video codec device comprises a video encoding device and a video decoding device

The video encoding device is the video encoding device according to any one of claims 16 to 25;

The video decoding device is the video decoding device according to any one of claims 26 to 30.