WO2018171596A1 - Video encoding method, video decoding method, and related device - Google Patents
Video encoding method, video decoding method, and related device
- Publication number
- WO2018171596A1 (PCT/CN2018/079699)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- video
- scene
- feature
- residual
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 372
- 238000007906 compression Methods 0.000 claims abstract description 89
- 230000006835 compression Effects 0.000 claims abstract description 89
- 238000012545 processing Methods 0.000 claims description 55
- 238000013139 quantization Methods 0.000 claims description 47
- 238000000605 extraction Methods 0.000 claims description 20
- 239000000284 extract Substances 0.000 claims description 16
- 238000004364 calculation method Methods 0.000 claims description 11
- 230000008859 change Effects 0.000 claims description 9
- 230000002829 reductive effect Effects 0.000 abstract description 43
- 239000011159 matrix material Substances 0.000 description 204
- 230000008569 process Effects 0.000 description 68
- 230000000875 corresponding effect Effects 0.000 description 39
- 238000003860 storage Methods 0.000 description 32
- 238000010586 diagram Methods 0.000 description 29
- 230000006870 function Effects 0.000 description 29
- 238000004891 communication Methods 0.000 description 18
- 239000013598 vector Substances 0.000 description 17
- 230000005540 biological transmission Effects 0.000 description 13
- 238000012549 training Methods 0.000 description 12
- 238000005457 optimization Methods 0.000 description 11
- 230000011218 segmentation Effects 0.000 description 11
- 238000004422 calculation algorithm Methods 0.000 description 10
- 230000036961 partial effect Effects 0.000 description 10
- 230000000694 effects Effects 0.000 description 8
- 238000001914 filtration Methods 0.000 description 8
- 238000005192 partition Methods 0.000 description 8
- 230000002596 correlated effect Effects 0.000 description 7
- 238000013500 data storage Methods 0.000 description 7
- 238000005516 engineering process Methods 0.000 description 7
- 238000004458 analytical method Methods 0.000 description 6
- 238000004590 computer program Methods 0.000 description 6
- 230000008878 coupling Effects 0.000 description 6
- 238000010168 coupling process Methods 0.000 description 6
- 238000005859 coupling reaction Methods 0.000 description 6
- 230000009471 action Effects 0.000 description 5
- 230000003044 adaptive effect Effects 0.000 description 5
- 230000003287 optical effect Effects 0.000 description 5
- 230000003068 static effect Effects 0.000 description 5
- 238000012706 support-vector machine Methods 0.000 description 5
- 238000007635 classification algorithm Methods 0.000 description 4
- 238000001514 detection method Methods 0.000 description 4
- 230000000670 limiting effect Effects 0.000 description 4
- 238000000638 solvent extraction Methods 0.000 description 4
- 238000006073 displacement reaction Methods 0.000 description 3
- 239000000835 fiber Substances 0.000 description 3
- 230000006872 improvement Effects 0.000 description 3
- 238000013507 mapping Methods 0.000 description 3
- 238000012546 transfer Methods 0.000 description 3
- 230000009466 transformation Effects 0.000 description 3
- 230000000007 visual effect Effects 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000012937 correction Methods 0.000 description 2
- 238000007405 data analysis Methods 0.000 description 2
- 230000003247 decreasing effect Effects 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000005286 illumination Methods 0.000 description 2
- 238000012432 intermediate storage Methods 0.000 description 2
- 238000005259 measurement Methods 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 230000002441 reversible effect Effects 0.000 description 2
- 230000011664 signaling Effects 0.000 description 2
- 238000003491 array Methods 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 238000002939 conjugate gradient method Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000005520 cutting process Methods 0.000 description 1
- 238000013144 data compression Methods 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 230000006837 decompression Effects 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000009792 diffusion process Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000001953 sensory effect Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 230000007480 spreading Effects 0.000 description 1
- 238000003892 spreading Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
- 230000001052 transient effect Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/124—Quantisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/179—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a scene or a shot
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/593—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/61—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/625—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
Definitions
- the present invention relates to the field of video frame processing, and in particular, to a video encoding method, a video decoding method, a video encoding device and a video decoding device, and a video encoding and decoding device.
- HEVC (High Efficiency Video Coding) predictive coding uses both intra-frame compression and inter-frame compression.
- the GOP (Group of Pictures) is a group composed of a plurality of frames. To limit the effect of motion changes, the number of frames in a GOP should not be set too large.
- HEVC divides all frames into three types of frames: I, P, and B, as shown in Figure 1. The numbers above the frames in the figure indicate the number of the corresponding frame in the original video sequence.
- the I frame, the P frame, and the B frame are encoded in units of GOP.
- an I frame (intra frame), also known as an intra-coded frame, is an independent frame carrying all of its own information; it can be encoded and decoded independently without reference to other images.
- the existing I-frame coding of the HEVC standard only uses the intra-frame image information of the current I frame for encoding and decoding, and I frames are selected along the video time axis by a fixed strategy.
- the amount of compressed data for independently encoded I frames is high, and there is a large amount of information redundancy between I frames.
- the embodiments of the present invention provide a video encoding method, a video decoding method, a video encoding device, a video decoding device, and a video encoding and decoding device, which are used to improve the compression efficiency of a video frame.
- a first aspect of the embodiments of the present invention provides a video encoding method, the method comprising: acquiring a plurality of video frames, where each of the plurality of video frames includes redundant data on the picture content. Then, the multiple video frames are reconstructed to obtain scene information and the reconstruction residual of each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data, and the reconstruction residual is used to represent the difference between a video frame and the scene information, such that the redundant data of the plurality of video frames is reduced by reconstruction. Subsequently, the scene information is predictively coded to obtain scene feature prediction coded data, and the reconstructed residual is predictively coded to obtain residual prediction coded data.
- in this way, the redundancy of the video frames is reduced, so that in the encoding operation the total compressed data amount of the obtained scene features and reconstructed residuals is smaller than the compressed data amount of the original video frames, reducing the amount of data obtained after compression.
- each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual contains only the residual information beyond the scene information, its information content is small and sparse, so it can be predictively encoded with fewer codewords; the amount of encoded data is small and the compression ratio is high.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- in one implementation, each of the multiple video frames includes the same picture content, and this same picture content is the redundant data of the plurality of video frames.
- Reconstructing a plurality of video frames to obtain scene information and a reconstruction residual of each video frame comprising: reconstructing a plurality of video frames to obtain scene features and reconstruction residuals of each video frame, The scene feature is used to represent the same picture content between each video frame, and the reconstructed residual is used to represent the difference between the video frame and the scene feature.
- the scene feature is one of the specific forms of scene information.
- predictively encoding the scene information to obtain the scene feature prediction encoded data includes: predictively encoding the scene features to obtain the scene feature prediction encoded data.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- reconstructing multiple video frames to obtain a scene feature and the reconstruction residual of each video frame includes: converting the multiple video frames into an observation matrix, which represents the multiple video frames in matrix form. Then, the observation matrix is reconstructed according to a first constraint condition to obtain a scene feature matrix and a reconstructed residual matrix.
- the scene feature matrix represents the scene features in matrix form, the reconstructed residual matrix represents the reconstructed residuals of the plurality of video frames in matrix form, and the first constraint requires the scene feature matrix to be low rank and the reconstructed residual matrix to be sparse.
- in this way, the reconstruction of the plurality of video frames is performed in matrix form, and under the first constraint the reconstructed residual and the scene feature meet the preset requirements, which reduces the coding amount in the subsequent encoding operation and increases the compression ratio.
- reconstructing the observation matrix according to the first constraint condition to obtain the scene feature matrix and the reconstructed residual matrix includes: calculating the scene feature matrix and the reconstructed residual matrix according to a first preset formula, where the obtained scene feature matrix is a low-rank matrix and the reconstructed residual matrix is a sparse matrix.
- the first preset formula is (target constraint function, followed by its convex relaxation):

  \min_{F,E}\ \operatorname{rank}(F) + \lambda\|E\|_0 \quad \text{s.t.}\quad D = F + E

  (F^*, E^*) = \arg\min_{F,E}\ \|F\|_* + \lambda\|E\|_1 \quad \text{s.t.}\quad D = F + E

- both groups include two formulas: the target constraint function and the reconstruction formula. Because the former group is an NP-hard problem, a relaxation is performed to obtain the latter group, which is convenient to solve.
- where D is the observation matrix, F is the scene feature matrix, E is the reconstructed residual matrix, λ is a weight parameter used to balance the scene feature matrix F and the reconstructed residual matrix E, ‖·‖₁ is the matrix L1 norm, and ‖·‖_* is the matrix nuclear norm.
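The following is a minimal numerical sketch of this first preset formula, assuming frames are equal-size grayscale arrays: each frame is stacked as a column of the observation matrix D, and the relaxed convex problem min ‖F‖_* + λ‖E‖₁ s.t. D = F + E is solved with a basic inexact augmented Lagrange multiplier (ALM) iteration. The solver choice, λ default, and μ schedule follow common robust-PCA practice and are assumptions, not details from the patent.

```python
import numpy as np

def svd_shrink(X, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_shrink(X, tau):
    # Entrywise soft thresholding: proximal operator of the L1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(D, lam=None, max_iter=200, tol=1e-7):
    """Sketch of  min ||F||_* + lam*||E||_1  s.t.  D = F + E  (inexact ALM)."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    Y = np.zeros_like(D)                        # Lagrange multiplier
    mu = 1.25 / (np.linalg.norm(D, 2) + 1e-12)  # penalty parameter
    F, E = np.zeros_like(D), np.zeros_like(D)
    for _ in range(max_iter):
        F = svd_shrink(D - E + Y / mu, 1.0 / mu)   # low-rank scene feature matrix
        E = soft_shrink(D - F + Y / mu, lam / mu)  # sparse reconstructed residual
        R = D - F - E
        Y += mu * R
        mu = min(mu * 1.5, 1e10)
        if np.linalg.norm(R) <= tol * (np.linalg.norm(D) + 1e-12):
            break
    return F, E

# Stack video frames (each an h x w grayscale array) as columns of D.
frames = [np.random.rand(32, 32) for _ in range(8)]   # placeholder frames
D = np.stack([f.ravel() for f in frames], axis=1)
F, E = rpca(D)   # F: scene feature matrix, E: reconstructed residual matrix
```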
- in another implementation, before the multiple video frames are reconstructed, the method further includes: extracting picture feature information of each of the plurality of video frames; then calculating content metric information according to the picture feature information, where the content metric information measures the difference in picture content of the plurality of video frames. When the content metric information is not greater than a preset metric threshold, the step of reconstructing the plurality of video frames to obtain a scene feature and the reconstruction residual of each video frame is performed.
- in this way, the reconstruction operations of the first to third implementations of the first aspect are performed only on video frames that meet the requirements, ensuring the normal execution of the reconstruction operation.
- the picture feature information is a global GIST feature, the preset metric threshold is a preset variance threshold, and calculating the content metric information according to the picture feature information includes: calculating the GIST feature variance according to the global GIST features.
- in this way, the content consistency of the plurality of video frames is measured by calculating the GIST feature variance of the plurality of video frames, which determines whether to perform the reconstruction of the first to third implementations of the first aspect.
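A sketch of this gating step is below, assuming a helper gist_descriptor(frame) as the per-frame global feature. Real GIST pools Gabor filter responses over orientations, scales, and a spatial grid; the coarse grid-average stand-in and the threshold value here are illustrative assumptions only.

```python
import numpy as np

def gist_descriptor(frame):
    # Hypothetical stand-in for a real GIST extractor: mean intensity over
    # a coarse 4x4 spatial grid (real GIST also pools Gabor responses).
    h, w = frame.shape
    cropped = frame[: h // 4 * 4, : w // 4 * 4]
    grid = cropped.reshape(4, h // 4, 4, w // 4)
    return grid.mean(axis=(1, 3)).ravel()

def content_metric(frames):
    # Mean per-dimension variance of the global features across frames:
    # a small value means consistent picture content.
    feats = np.stack([gist_descriptor(f) for f in frames])
    return float(feats.var(axis=0).mean())

frames = [np.random.rand(64, 64) for _ in range(8)]
PRESET_VARIANCE_THRESHOLD = 0.01  # illustrative value, not from the patent
if content_metric(frames) <= PRESET_VARIANCE_THRESHOLD:
    pass  # globally consistent: reconstruct whole frames (first implementation)
else:
    pass  # only local redundancy: split into sub-blocks and reconstruct those
```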
- in another implementation, acquiring multiple video frames includes: obtaining a video stream, where the video frames of the video stream include I frames, B frames, and P frames. Then, I frames are extracted from the video stream, and the I frames are used to perform the step of reconstructing a plurality of video frames to obtain scene features and the reconstruction residual of each video frame.
- the method of the implementation manner further includes: reconstructing according to the scene feature and the reconstructed residual to obtain a reference frame.
- then, using the reference frame as a reference, the B frames and P frames are inter-predictive coded to obtain B-frame predictive coded data and P-frame predictive coded data.
- the predictive coded data is subjected to transform coding, quantization coding, and entropy coding to obtain video compressed data;
- the predictive coded data includes scene feature prediction coded data, residual prediction coded data, B frame predictive coded data, and P frame predictive coded data.
- in this way, the I frames of the video stream can be reconstructed and encoded using the method of this implementation, reducing both the amount of encoded data and the redundant data of the I frames.
- in another implementation, each of the multiple video frames includes redundant data at local locations, and correspondingly the reconstruction operation differs from the foregoing implementations. That is, reconstructing multiple video frames to obtain scene information and the reconstruction residual of each video frame includes: splitting each of the multiple video frames to obtain a plurality of frame sub-blocks, where the frame sub-blocks obtained after splitting include redundant data, and some frame sub-blocks can be reconstructed from other frame sub-blocks.
- the so-called frame sub-block is the frame content of a partial area of the video frame.
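A minimal sketch of the splitting step, cutting each frame into fixed-size non-overlapping sub-blocks and stacking them as observation-matrix columns. The 16x16 block size and the zero padding are illustrative assumptions; the patent does not fix these details here.

```python
import numpy as np

def split_into_subblocks(frame, block=16):
    # Pad the frame to multiples of the block size, then cut it into
    # non-overlapping block x block sub-blocks (partial areas of the frame).
    h, w = frame.shape
    padded = np.pad(frame, ((0, -h % block), (0, -w % block)))
    H, W = padded.shape
    return (padded.reshape(H // block, block, W // block, block)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, block, block))

frames = [np.random.rand(64, 48) for _ in range(4)]
subblocks = [b for f in frames for b in split_into_subblocks(f)]
# Observation matrix for the sub-block case: one vectorized sub-block per column.
D = np.stack([b.ravel() for b in subblocks], axis=1)
```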
- then, the plurality of frame sub-blocks are reconstructed to obtain a scene feature, a representation coefficient for each of the plurality of frame sub-blocks, and a reconstruction residual for each frame sub-block, where the scene feature includes multiple independent scene feature bases that cannot be reconstructed from one another within the scene feature.
- the scene feature base is used to describe the picture content feature of the frame sub-block.
- the representation coefficient represents the correspondence between the scene feature bases and the frame sub-blocks.
- the reconstructed residual represents the difference between the frame sub-block and the scene feature base.
- the scene feature of the implementation manner is one of the specific forms of the scene information, which can reduce the redundancy between the partially redundant video frames.
- predictively encoding the scene information to obtain the scene feature prediction encoded data includes: predictively encoding the scene features to obtain the scene feature prediction encoded data.
- in one implementation, reconstructing multiple frame sub-blocks to obtain a scene feature, the representation coefficient of each of the multiple frame sub-blocks, and the reconstruction residual of each frame sub-block includes: reconstructing the plurality of frame sub-blocks to obtain the representation coefficient of each of the plurality of frame sub-blocks and the reconstruction residual of each frame sub-block.
- the representation coefficient represents a correspondence between a frame sub-block and a target frame sub-block
- the target frame sub-block is an independent frame sub-block among the plurality of frame sub-blocks
- an independent frame sub-block is one that cannot be reconstructed from the other frame sub-blocks among the plurality of frame sub-blocks, and the reconstruction residual represents the difference between the frame sub-block and the target frame sub-block.
- the plurality of target frame sub-blocks indicated by the representation coefficients are combined to obtain the scene feature
- the target frame sub-block is a scene feature base.
- in this way, the target frame sub-blocks that can be independently represented are selected, and the frame sub-blocks that cannot be independently represented are expressed by target sub-blocks plus reconstructed residuals, thereby reducing the redundant data between the non-independent sub-blocks and the target sub-blocks; only the target frame sub-blocks and the reconstructed residuals need to be encoded, reducing the amount of coding.
- reconstructing the multiple frame sub-blocks to obtain the representation coefficient of each frame sub-block and the reconstruction residual of each frame sub-block includes: converting the plurality of frame sub-blocks into an observation matrix, which represents the plurality of frame sub-blocks in matrix form. Then, the observation matrix is reconstructed according to a second constraint condition to obtain a representation coefficient matrix and a reconstructed residual matrix.
- the representation coefficient matrix contains the representation coefficients of each of the plurality of frame sub-blocks, where the non-zero coefficients indicate the target frame sub-blocks; the reconstructed residual matrix represents the reconstruction residual of each frame sub-block in matrix form; and the second constraint requires the low rank of the representation coefficient matrix and the sparsity of the reconstructed residual matrix to meet the preset requirements.
- combining the plurality of target frame sub-blocks indicated by the representation coefficients to obtain the scene feature includes: combining the target frame sub-blocks indicated by the non-zero coefficients of the representation coefficient matrix to obtain the scene feature.
- reconstructing the observation matrix according to the second constraint condition to obtain the representation coefficient matrix and the reconstructed residual matrix includes: calculating the representation coefficient matrix and the reconstructed residual matrix according to a second preset formula, where the second preset formula is:

  (C^*, E^*) = \arg\min_{C,E}\ \|C\|_* + \lambda\|E\|_1 \quad \text{s.t.}\quad D = DC + E

  where D is the observation matrix, C is the representation coefficient matrix, E is the reconstructed residual matrix, and λ is a weight parameter.
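A rough solver sketch for the second preset formula as reconstructed above, min ‖C‖_* + λ‖E‖₁ s.t. D = DC + E, following the standard low-rank representation (LRR) inexact-ALM recipe (the data matrix D serves as its own dictionary, so non-zero entries of C point at target sub-blocks). The solver, λ value, and update schedule are assumptions; it reuses svd_shrink and soft_shrink from the RPCA sketch above.

```python
import numpy as np

def lrr(D, lam=0.1, max_iter=300, tol=1e-6):
    """Sketch of  min ||C||_* + lam*||E||_1  s.t.  D = D@C + E  (inexact ALM)."""
    n = D.shape[1]
    J = np.zeros((n, n)); C = np.zeros((n, n)); E = np.zeros_like(D)
    Y1 = np.zeros_like(D); Y2 = np.zeros((n, n))   # Lagrange multipliers
    mu, mu_max, rho = 1e-2, 1e6, 1.5
    DtD = D.T @ D
    for _ in range(max_iter):
        # Nuclear-norm step on the auxiliary variable J (J ~ C).
        J = svd_shrink(C + Y2 / mu, 1.0 / mu)
        # Least-squares step for the representation coefficient matrix C.
        C = np.linalg.solve(np.eye(n) + DtD,
                            D.T @ (D - E) + J + (D.T @ Y1 - Y2) / mu)
        # Sparse step for the reconstructed residual matrix E.
        E = soft_shrink(D - D @ C + Y1 / mu, lam / mu)
        R1, R2 = D - D @ C - E, C - J
        Y1 += mu * R1; Y2 += mu * R2
        mu = min(rho * mu, mu_max)
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
    return C, E   # columns of D whose coefficient rows are non-zero act as targets
```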
- in another implementation, reconstructing multiple frame sub-blocks to obtain scene features, the representation coefficient of each frame sub-block, and the reconstruction residual of each frame sub-block includes: reconstructing the plurality of frame sub-blocks to obtain a scene feature and the representation coefficient of each of the plurality of frame sub-blocks, where the scene feature includes scene feature bases that are independent feature blocks in the feature space, and an independent feature block is a feature block that cannot be reconstructed from the other feature blocks in the scene feature.
- then, the reconstruction residual of each frame sub-block is calculated according to the data reconstructed from the scene feature and the representation coefficient of each frame sub-block, and the data of each frame sub-block.
- in this way, a scene feature that represents the plurality of frame sub-blocks as a whole is obtained by reconstruction. The scene feature is composed of scene feature bases, each of which is an independent feature block in the feature space; if different frame sub-blocks reconstruct to the same feature block, that feature block need not be saved repeatedly in the scene feature, thereby reducing redundant data.
- reconstructing multiple frame sub-blocks to obtain the scene feature and the representation coefficient of each frame sub-block includes: converting the plurality of frame sub-blocks into an observation matrix, which represents the plurality of frame sub-blocks in matrix form; the observation matrix is then reconstructed according to a third constraint condition to obtain a representation coefficient matrix and a scene feature matrix.
- the representation coefficient matrix contains the representation coefficient of each frame sub-block, where the non-zero coefficients indicate the scene feature bases; the scene feature matrix represents the scene feature in matrix form; the third constraint condition requires that the similarity between the frame sub-blocks and the pictures reconstructed from the representation coefficient matrix and the scene feature matrix meet a preset similarity threshold, that the sparsity of the representation coefficient matrix meet a preset sparsity threshold, and that the data amount of the scene feature matrix be less than a preset data-amount threshold.
- calculating the reconstruction residual of each frame sub-block includes: calculating a reconstructed residual matrix according to the observation matrix and the data reconstructed from the representation coefficient matrix and the scene feature matrix, where the reconstructed residual matrix represents the reconstruction residuals in matrix form.
- in this way, the reconstruction can be performed in matrix form, and representation coefficients and scene features that meet the requirement of reducing the coding amount are calculated under the third constraint condition.
- reconstructing the observation matrix according to the third constraint condition to obtain the representation coefficient matrix and the scene feature matrix includes: calculating the representation coefficient matrix and the scene feature matrix according to a third preset formula, where the third preset formula is:

  (F^*, C^*) = \arg\min_{F,C}\ \|D - FC\|_F^2 + \lambda\|C\|_1 + \gamma\|F\|_*

  where D is the observation matrix, C is the representation coefficient matrix, F is the scene feature matrix, and λ and γ are weight parameters used to adjust the sparsity of the coefficients and the low rank of the scene features; (F^*, C^*) denotes the optimal values of F and C, i.e., the values of F and C at which the formula attains its minimum.
- in another implementation, before each video frame is split to obtain a plurality of frame sub-blocks, the method further includes: extracting picture feature information of each of the plurality of video frames. Then, content metric information is calculated based on the picture feature information, where the content metric information measures the difference in picture content of the plurality of video frames. When the content metric information is greater than the preset metric threshold, the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks is performed. In this way, content metric information greater than the preset metric threshold indicates that the pictures of the plurality of video frames have locally redundant data, so the method of splitting the video frames and reconstructing the frame sub-blocks is used.
- the picture feature information is a global GIST feature, the preset metric threshold is a preset variance threshold, and calculating the content metric information according to the picture feature information includes: calculating the GIST feature variance according to the global GIST features.
- in this way, the content consistency of the plurality of video frames is measured by calculating the variance of the GIST features of the plurality of video frames, thereby determining whether the pictures of the plurality of video frames have locally redundant data, so that the method of splitting the video frames and reconstructing the frame sub-blocks can be applied.
- in another implementation, acquiring multiple video frames includes: obtaining a video stream, where the video frames of the video stream include I frames, B frames, and P frames; and extracting I frames from the video stream, where the I frames are used to perform the step of splitting each of the multiple video frames to obtain a plurality of frame sub-blocks;
- the method of this implementation further includes: performing reconstruction according to the scene feature, the representation coefficients, and the reconstruction residual to obtain a reference frame; and, using the reference frame as a reference, performing inter-frame predictive coding on the B frames and P frames to obtain B-frame predictive coded data and P-frame predictive coded data.
- the predictive coded data includes scene feature predictive coded data, residual predictive coded data, B-frame predictive coded data, and P-frame predictive coded data.
- the method of the present implementation can be applied to key frames of a video stream, reducing redundant data and coding amount of key frames.
- in another implementation, the method further includes: classifying the plurality of video frames based on the correlation of their picture content to obtain video frames of one or more classification clusters, where the video frames of the same classification cluster are used to perform the step of reconstructing multiple video frames to obtain scene information and the reconstruction residual of each video frame.
- classifying the multiple video frames according to the correlation of the picture content to obtain video frames of one or more clusters includes: extracting feature information of each of the plurality of video frames; determining the clustering distance between any two video frames according to the feature information, where the clustering distance represents the similarity between the two video frames; and clustering the video frames according to the clustering distance to obtain the video frames of one or more clusters. In this way, the classification of multiple video frames is realized by clustering; a minimal sketch follows.
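A minimal sketch of this clustering step, reusing the gist_descriptor helper from the earlier sketch as the per-frame feature and greedily merging frames whose feature distance to an existing cluster centroid is below a threshold; the distance metric and threshold are illustrative assumptions.

```python
import numpy as np

def cluster_frames(frames, dist_threshold=0.5):
    # Greedy clustering: assign each frame to the nearest existing cluster
    # if it lies within the threshold, otherwise open a new cluster.
    feats = [gist_descriptor(f) for f in frames]
    centroids, labels = [], []
    for x in feats:
        d = [np.linalg.norm(x - c) for c in centroids]
        if d and min(d) <= dist_threshold:
            labels.append(int(np.argmin(d)))
        else:
            centroids.append(x)
            labels.append(len(centroids) - 1)
    return labels  # frames sharing a label form one classification cluster

frames = [np.random.rand(64, 64) for _ in range(10)]
labels = cluster_frames(frames)
```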
- in another implementation, acquiring a plurality of video frames includes: acquiring a video stream, where the video stream includes multiple video frames. Then, feature information of a first video frame and a second video frame is extracted respectively, where the feature information describes the picture content of a video frame, and the first and second video frames are video frames in the video stream. A shot distance between the first video frame and the second video frame is calculated and compared against a preset shot threshold. If the shot distance is greater than the preset shot threshold, a target shot is segmented from the video stream, where the start frame of the target shot is the first video frame and the end frame of the target shot is the video frame preceding the second video frame; if the shot distance is less than the preset shot threshold, the first video frame and the second video frame are attributed to the same shot. The target shot is one of the shots of the video stream, and a shot is a sequence of temporally continuous video frames.
- then, key frames are extracted from each shot such that the frame distance between two adjacent key frames is greater than a preset frame-distance threshold, where the frame distance indicates the degree of difference between two video frames; the key frames of each shot are used to perform the step of reconstructing a plurality of video frames to obtain scene information and the reconstruction residual of each video frame. After shot segmentation, the key frames are extracted from the respective shots according to this distance. Such an extraction method uses the context information of the video stream, so the method of this implementation can be applied to a video stream.
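A sketch of shot segmentation and per-shot key-frame extraction as described above, using a normalized grayscale-histogram L1 distance for both the shot distance and the frame distance; the histogram metric and both thresholds are illustrative assumptions rather than the patent's exact measures.

```python
import numpy as np

def hist_distance(a, b, bins=32):
    # L1 distance between normalized grayscale histograms of two frames.
    ha, _ = np.histogram(a, bins=bins, range=(0.0, 1.0), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(0.0, 1.0), density=True)
    return float(np.abs(ha - hb).sum() / bins)

def segment_shots(frames, shot_threshold=0.5):
    # Start a new shot wherever consecutive frames differ too much.
    bounds = [0]
    for i in range(1, len(frames)):
        if hist_distance(frames[i - 1], frames[i]) > shot_threshold:
            bounds.append(i)
    bounds.append(len(frames))
    return [frames[s:e] for s, e in zip(bounds, bounds[1:])]

def key_frames(shot, frame_threshold=0.3):
    # Keep a frame only if it differs enough from the last kept key frame,
    # so adjacent key frames stay farther apart than the threshold.
    keys = [shot[0]]
    for f in shot[1:]:
        if hist_distance(keys[-1], f) > frame_threshold:
            keys.append(f)
    return keys

frames = [np.random.rand(64, 64) for _ in range(30)]
keys = [kf for shot in segment_shots(frames) for kf in key_frames(shot)]
```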
- in another implementation, the method further includes: performing discriminative training on each shot segmented from the video stream to obtain a plurality of classifiers, one per shot; using a target classifier to discriminate a target video frame and obtain a discriminant score, where the target classifier is one of the plurality of classifiers, the target video frame is one of the key frames, and the discriminant score indicates the extent to which the target video frame belongs to the scene of the shot to which the target classifier corresponds; when the discriminant score is greater than a preset score threshold, determining that the target video frame belongs to the same scene as the shot to which the target classifier corresponds; and determining the video frames of one or more clusters according to the video frames that belong to the same scene as each shot.
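A sketch of the per-shot discriminative training, assuming scikit-learn's LinearSVC as the classifier and the gist_descriptor helper from the earlier sketch as the frame feature. The one-vs-rest setup, feature choice, and score threshold are assumptions; the patent describes the classifiers only generically.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_shot_classifiers(shots):
    # One one-vs-rest linear SVM per shot: that shot's frames are the
    # positive class, the frames of all other shots the negative class.
    feats = [[gist_descriptor(f) for f in shot] for shot in shots]
    classifiers = []
    for i, pos in enumerate(feats):
        neg = [x for j, other in enumerate(feats) if j != i for x in other]
        X = np.vstack([pos, neg])
        y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
        classifiers.append(LinearSVC().fit(X, y))
    return classifiers

def same_scene(classifier, frame, score_threshold=0.0):
    # decision_function returns the discriminant score: the larger it is,
    # the more the frame belongs to the classifier's shot/scene.
    score = classifier.decision_function([gist_descriptor(frame)])[0]
    return score > score_threshold
```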
- in another implementation, acquiring a plurality of video frames includes: acquiring a compressed video stream, where the compressed video stream includes compressed video frames; determining a plurality of target video frames from the compressed video stream, where a target video frame is an independently compressed and encoded video frame in the compressed video stream; and decoding the target video frames to obtain decoded target video frames, which are used to perform the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks.
- a second aspect of the embodiments of the present invention provides a video decoding method, which includes: acquiring scene feature prediction encoded data and residual prediction encoded data. Then, the scene feature prediction encoded data is decoded to obtain scene information, where the scene information includes data obtained by reducing the redundancy of redundant data, and the redundant data is the redundant picture-content data between the video frames of a plurality of video frames.
- the residual prediction encoded data is decoded to obtain a reconstructed residual, and the reconstructed residual is used to represent the difference between the video frame and the scene information.
- the reconstruction is performed according to the scene information and the reconstructed residual, and multiple video frames are obtained. In this way, the scene feature prediction coded data and the residual prediction coded data obtained by the video coding method provided by the first aspect can be decoded by the video decoding method of the implementation manner.
- in one implementation, each of the multiple video frames includes the same picture content, and decoding the scene feature prediction encoded data to obtain scene information includes: decoding the scene feature prediction encoded data to obtain a scene feature, where the scene feature represents the same picture content shared between the video frames.
- Reconstructing according to the scene information and the reconstructed residual obtaining multiple video frames, including: reconstructing according to the scene feature and the reconstructed residual, to obtain multiple video frames.
- the scene feature information can be decoded by this implementation.
- acquiring scene feature prediction encoded data and residual prediction encoded data includes: acquiring video compressed data; and performing entropy decoding, inverse quantization, and inverse DCT transformation on the video compressed data to obtain predictive encoded data, where the predictive encoded data includes scene feature predictive encoded data, residual predictive encoded data, B-frame predictive encoded data, and P-frame predictive encoded data.
- Reconstructing according to the scene feature and the reconstructed residual, obtaining multiple video frames including: reconstructing according to the scene feature and the reconstruction residual, and obtaining multiple I frames;
- the method of this implementation further includes: performing inter-frame decoding on the B-frame predictive encoded data and the P-frame predictive encoded data using the I frames as reference frames to obtain B frames and P frames; and arranging the I frames, B frames, and P frames in chronological order to obtain the video stream.
- the video stream can be decoded by the present implementation.
- the method of the implementation manner further includes: acquiring a representation coefficient.
- decoding the scene feature prediction encoded data to obtain scene information includes: decoding the scene feature prediction encoded data to obtain a scene feature, where the scene feature includes multiple independent scene feature bases that cannot be reconstructed from one another; the scene feature base describes the picture content feature of a frame sub-block, the representation coefficient represents the correspondence between scene feature bases and frame sub-blocks, and the reconstruction residual represents the difference between a frame sub-block and the scene feature bases.
- reconstructing according to the scene information and the reconstructed residual to obtain a plurality of video frames includes: reconstructing according to the scene feature, the representation coefficients, and the reconstruction residual to obtain a plurality of frame sub-blocks, and combining the plurality of frame sub-blocks to obtain the plurality of video frames.
- in this way, the video decoding method of this implementation can decode the scene feature and the reconstructed residual, reconstruct the plurality of frame sub-blocks, and obtain the video frames by recombination.
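A sketch of this decoder-side path for the sub-block case: every sub-block column is rebuilt as F·C + E and the sub-blocks are then tiled back into frames. The reassembly below simply inverts the split_into_subblocks sketch from the encoding side, so the block size and frame geometry are assumptions that must match the encoder.

```python
import numpy as np

def reassemble_frame(blocks, frame_shape, block=16):
    # Inverse of split_into_subblocks: tile sub-blocks back into a frame
    # and crop away the padding.
    h, w = frame_shape
    H, W = h + (-h % block), w + (-w % block)
    grid = np.asarray(blocks).reshape(H // block, W // block, block, block)
    return grid.transpose(0, 2, 1, 3).reshape(H, W)[:h, :w]

def decode_frames(F, C, E, n_blocks_per_frame, frame_shape, block=16):
    # Rebuild every sub-block from scene features, representation
    # coefficients, and reconstruction residuals, then regroup per frame.
    D = F @ C + E                                  # one sub-block per column
    blocks = [D[:, i].reshape(block, block) for i in range(D.shape[1])]
    return [reassemble_frame(blocks[i:i + n_blocks_per_frame], frame_shape, block)
            for i in range(0, len(blocks), n_blocks_per_frame)]
```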
- acquiring scene feature prediction encoded data and residual prediction encoded data includes: acquiring video compressed data; and performing entropy decoding, inverse quantization, and inverse DCT transformation on the video compressed data to obtain predictive encoded data, where the predictive encoded data includes scene feature predictive encoded data, residual predictive encoded data, B-frame predictive encoded data, and P-frame predictive encoded data.
- the method of this implementation manner further includes:
- the I frame is used as a reference frame, and the B frame predictive coded data and the P frame predictive coded data are inter-frame decoded to obtain a B frame and a P frame; and the I frame, the B frame, and the P frame are arranged in chronological order to obtain a video stream.
- in this way, video frames that were reconstructed at the sub-block level into reconstruction residuals, scene features, and representation coefficients can be decoded and restored into a video stream by the video decoding method of this implementation.
- a third aspect of the embodiments of the present invention provides a video encoding apparatus having a function of performing the above video encoding method.
- this function can be implemented in hardware, or in hardware executing the corresponding software.
- the hardware or software includes one or more modules corresponding to the functions described above.
- the video encoding device includes:
- An acquiring module configured to acquire multiple video frames, and each of the plurality of video frames includes redundant data on the screen content
- the reconstruction module is configured to reconstruct multiple video frames to obtain scene information and reconstruction residuals of each video frame, where the scene information includes data obtained by reducing redundancy of redundant data, and reconstructing residuals Deducing the difference between the video frame and the scene information;
- a prediction encoding module configured to predictively encode scene information, and obtain scene feature prediction encoded data
- the prediction encoding module is further configured to perform predictive coding on the reconstructed residual to obtain residual prediction encoded data.
- the video encoding device includes:
- the video encoder performs the following actions: acquiring a plurality of video frames, and each of the plurality of video frames includes redundant data on the screen content;
- the video encoder further performs the following actions: reconstructing a plurality of video frames to obtain scene information and reconstruction residuals of each video frame, and the scene information includes data obtained by reducing redundancy of redundant data, and reconstructing The residual is used to represent the difference between the video frame and the scene information;
- the video encoder further performs the following actions: performing predictive coding on the scene information to obtain scene feature prediction encoded data;
- the video encoder also performs an operation of predictive coding the reconstructed residual to obtain residual prediction encoded data.
- a fourth aspect of the embodiments of the present invention provides a video decoding apparatus having a function of performing the above video decoding method.
- this function can be implemented in hardware, or in hardware executing the corresponding software.
- the hardware or software includes one or more modules corresponding to the functions described above.
- the video decoding device includes:
- An obtaining module configured to acquire scene feature prediction encoded data and residual prediction encoded data
- a scene information decoding module configured to decode scene feature prediction encoded data to obtain scene information, where the scene information includes data obtained by reducing redundancy of redundant data, and the redundant data is each video frame of multiple video frames. Redundant data between screen contents;
- the video frame reconstruction module is configured to reconstruct according to the scene information and the reconstructed residual to obtain a plurality of video frames.
- the video decoding device includes:
- the video decoder performs the following actions: acquiring scene feature prediction encoded data and residual prediction encoded data;
- the video decoder further performs the following operations: decoding scene feature prediction encoded data to obtain scene information, the scene information including data obtained by reducing redundancy of redundant data, and the redundant data is each of a plurality of video frames Redundant data on the content of the picture between video frames;
- the video decoder further performs the following operations: decoding the residual prediction encoded data to obtain a reconstructed residual, where the reconstructed residual is used to represent a difference between the video frame and the scene information;
- the video decoder also performs an action of reconstructing based on the scene information and the reconstructed residual to obtain a plurality of video frames.
- a fifth aspect of the embodiments of the present invention provides a video codec device, where the video codec device includes a video encoding device and a video decoding device.
- the video encoding device is the video encoding device provided by the foregoing third aspect
- the video decoding device is the video decoding device provided by the fourth aspect above.
- a seventh aspect of the embodiments of the present invention provides a computer storage medium storing program code for performing the method of the second aspect described above.
- Yet another aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the methods described in the various aspects above.
- in the video encoding method provided by the embodiments of the present invention, each of the plurality of video frames includes redundant data on the picture content. The plurality of video frames are reconstructed to obtain scene information and the reconstruction residual of each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data, and the reconstruction residual is used to represent the difference between a video frame and the scene information. Then, the scene information is predictively coded to obtain scene feature prediction coded data, and the reconstructed residual is predictively coded to obtain residual prediction coded data.
- in this way, the redundancy of the video frames is reduced, so that in the encoding operation the total compressed data amount of the obtained scene features and reconstructed residuals is smaller than the compressed data amount of the original video frames, reducing the amount of data obtained after compression.
- each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual contains only the residual information beyond the scene information, its information content is small and sparse, so it can be predictively encoded with fewer codewords; the amount of encoded data is small and the compression ratio is high.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- FIG. 1 is a schematic diagram of a conventional HEVC coding
- FIG. 2 is a flowchart of a video frame encoding and decoding method according to an embodiment of the present invention
- FIG. 3a is a schematic diagram of a flow of a video encoding method and a flow of an existing HEVC encoding method according to another embodiment of the present invention
- FIG. 4a is a schematic diagram of a flow of a video decoding method and a flow of an existing HEVC decoding method according to another embodiment of the present invention
- FIG. 4b is a schematic diagram of a scenario involved in a video decoding method according to another embodiment of the present invention.
- FIG. 5 is a flowchart of a method for video encoding according to another embodiment of the present invention.
- FIG. 6 is a flowchart of a method for decoding a video according to another embodiment of the present invention.
- FIG. 7 is a flowchart of a method of a lens segmentation method of the video encoding method shown in FIG. 5;
- FIG. 8 is a flowchart of a method for extracting a key frame of the video encoding method shown in FIG. 5;
- FIG. 9 is a flowchart of a method for scene classification of the video encoding method shown in FIG. 5;
- FIG. 10 is a flowchart of a method based on an SVM classification method of the video encoding method shown in FIG. 5;
- FIG. 11 is a flowchart of a method for reconstructing an RPCA based scene of the video encoding method shown in FIG. 5;
- FIG. 12 is a flowchart of a method for a video encoding method according to another embodiment of the present invention.
- FIG. 13 is a schematic diagram of a scenario of the video encoding method shown in FIG. 12;
- FIG. 14 is a schematic diagram of a scenario of one of the specific methods of the video encoding method shown in FIG. 12;
- FIG. 15 is a schematic diagram of a scenario of one of the specific methods of the video encoding method shown in FIG. 12;
- FIG. 16 is a schematic diagram of a scenario of one of the specific methods of the video encoding method shown in FIG. 12;
- FIG. 17 is a flowchart of a method for decoding a video according to another embodiment of the present invention.
- FIG. 18a is a schematic structural diagram of a video encoding apparatus according to another embodiment of the present invention;
- FIG. 18b is a partial structural diagram of the video encoding apparatus shown in FIG. 18a;
- FIG. 19 is a schematic structural diagram of a video decoding device according to another embodiment of the present invention.
- FIG. 20 is a schematic structural diagram of a video codec device according to another embodiment of the present invention.
- FIG. 21 is a schematic block diagram of a video codec system 10 according to an embodiment of the present invention;
- FIG. 22 is a block diagram illustrating an example video encoder 20 configured to implement the techniques of the present invention;
- FIG. 23 is a block diagram illustrating an example video decoder 30 configured to implement the techniques of the present invention.
- in the video encoding method of the embodiments of the present invention, each of the plurality of video frames includes redundant data on the picture content, and the plurality of video frames are reconstructed to obtain scene information and the reconstruction residual of each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data, and the reconstruction residual is used to represent the difference between a video frame and the scene information.
- the scene information is predictively coded
- the scene feature prediction coded data is obtained
- the reconstructed residual is predictively coded to obtain residual prediction coded data.
- in this way, the redundancy of the video frames is reduced, so that in the encoding operation the total compressed data amount of the obtained scene features and reconstructed residuals is smaller than the compressed data amount of the original video frames, reducing the amount of data obtained after compression.
- each video frame is reconstructed into a scene feature and a reconstructed residual. Since the reconstructed residual contains only the residual information beyond the scene information, its information content is small and sparse, so it can be predictively encoded with fewer codewords; the amount of encoded data is small and the compression ratio is high.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- the embodiment of the present invention further provides a video decoding method, which is used to decode scene feature prediction encoded data and residual prediction encoded data obtained by the video encoding device, obtain scene information, and reconstruct residuals, according to The scene information and the reconstructed residual are reconstructed to obtain a video frame.
- key frames are independently coded, wherein key frames are also referred to as I frames.
- after compression, I frames account for a high proportion of the compressed data, and there is a large amount of information redundancy between I frames.
- if the video coding method of the embodiments of the present invention is used for the I frames at encoding time, the coding efficiency of the I frames can be improved.
- HEVC (High Efficiency Video Coding) is a widely used and successful video codec standard.
- HEVC is a block-based hybrid coding method, which includes several modules such as prediction, transform, quantization, entropy coding, and loop filtering.
- the prediction module is a core module of the HEVC codec method, and may be specifically classified into an intra prediction and an inter prediction module.
- intra prediction generates prediction values using the pixels already encoded in the current image.
- inter prediction generates prediction values using previously encoded and reconstructed images (reference frames). Since inter prediction encodes residuals, its compression is relatively high.
- the existing intra prediction module of the HEVC standard only uses the intra-frame information of the current image for encoding and decoding, adopts a fixed strategy along the video time axis, and does not take the context information of the video into consideration, so the encoding and decoding efficiency is low and the compression ratio is not high. For example:
- Scene 1: in a movie, characters A and B hold a dialogue, and the director frequently switches between A and B to express the characters' inner feelings. In this case, it is suitable to segment and cluster all the shots related to A, and perform inter-frame and intra-frame predictive coding on them uniformly.
- Scene 2: a TV drama's shooting venues mainly comprise grassland, beach, and office scenes. In this case, it is suitable to identify and classify all grassland, beach, and office scenes, extract scene feature information uniformly, and represent and predict the key frames accordingly.
- HEVC predictive coding uses both intra-frame compression and inter-frame compression.
- the GOP step size, that is, the number of frames included in a GOP, is set before encoding. To limit the effect of motion changes, the number of frames should not be set too large.
- HEVC divides all frames into three types of frames: I, P, and B, as shown in Figure 1.
- the numbers above the frames in Figure 1 indicate the number of the corresponding frame in the original video sequence.
- the I frame, the P frame, and the B frame are encoded in units of GOP.
- an I frame (intra frame), also known as an intra-coded frame, is an independent frame carrying all of its own information; it can be encoded and decoded independently without reference to other images, and can be simply understood as a static picture.
- the first frame in each GOP is set to an I frame, and the length of the GOP also represents the interval between two adjacent I frames.
- the I frame provides the most critical information in the GOP, and the amount of information in the data is relatively large, so the compression is relatively poor, generally around 7:1.
- a P frame (predictive frame), also called an inter-predictive coded frame, needs to reference a previous frame for encoding and indicates the difference between the current frame's picture and the previous frame (which may be an I frame or a P frame). When decoding, the difference defined by this frame is superimposed on the previously buffered picture to generate the final picture.
- P frames typically occupy fewer data bits than I frames, but P frames are sensitive to transmission errors because of their complex dependence on previous P and I reference frames. Since residuals are used for encoding, the amount of coded information required for a P frame is greatly reduced relative to an I frame, and the compression ratio is relatively high, generally around 20:1.
- a B frame (bi-directional frame), also called a bidirectional predictive coded frame, records the difference between the current frame and both the previous and subsequent frames.
- decoding a B frame requires not only the previously buffered picture but also the decoded subsequent picture; the final picture is obtained by superimposing the current frame's data on the previous and subsequent pictures.
- B frames have a high compression rate, but demand high decoding performance.
- the B frame is not a reference frame and does not cause a spread of decoding errors.
- B frames have the highest encoding compression ratio, and the general compression ratio is around 50:1.
- in entropy coding, if the inter coding mode is used, the motion vector is also encoded.
- the decoding process of HEVC is the reverse process of the encoding process, and will not be described here.
- the HEVC codec method relies too much on I frame coding and has the following drawbacks:
- the amount of I frame compressed data is large. I frame coding only performs spatial compression on the intra-frame data without considering the redundant information between adjacent frames, so the amount of compressed data is large, usually about 10 times that of a P frame.
- the GOP step size needs to be preset before encoding.
- the I frame ratio is determined by the setting of the GOP step size. As shown in FIG. 1, when the GOP step size is set to 13, the ratio of the I frame to the BP frame is 1:12. According to the respective compression ratios of the IBP frames, the ratio of the final I frame to the BP frame compressed data is approximately 2:5. Generally, a larger GOP step size can be set to reduce the I frame ratio to improve the overall compression ratio of the video, but this also causes a decrease in the quality of the compressed video.
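- as a worked illustration of the ratio above (a sketch, not part of the standard): assuming a 13-frame GOP containing 1 I frame, 4 P frames, and 8 B frames, raw frames of equal size, and the approximate compression ratios quoted in this document (7:1 for I, 20:1 for P, 50:1 for B), the compressed-data ratio of the I frame to the B/P frames comes out near 2:5:

```python
# Illustrative only: per-frame-type compression ratios quoted in this document.
i_size = 1 / 7                 # compressed size of the single I frame
p_size = 4 * (1 / 20)          # 4 P frames at ~20:1
b_size = 8 * (1 / 50)          # 8 B frames at ~50:1

bp_size = p_size + b_size
print(f"I : BP = 1 : {bp_size / i_size:.2f}")   # ~1 : 2.52, i.e. roughly 2 : 5
```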
- the I frames are extracted sequentially along the time axis, and the interval between adjacent I frames is the GOP step size.
- this selection strategy does not take the contextual information of the video into account. For example, for two video segments that are not consecutive in time but whose picture content is highly correlated, extracting I frames according to the GOP step size and intra-coding them individually causes a large amount of information redundancy.
- the embodiment of the present invention proposes a video encoding and decoding algorithm based on intelligent video scene classification, in view of the problem that the original HEVC relies too much on I frame coding and the compression efficiency is not high.
- the method identifies and classifies the video shots and scenes, performs key data analysis and reconstruction on the key frames (I frames), and encodes the scene information and the representation residuals. It effectively avoids the problem of inefficient compression in a single key frame, and introduces video context information to improve the compression ratio.
- the video frame encoding and decoding method includes an encoding method part and a decoding method part.
- the video frame coding and decoding method includes:
- Step 201 Acquire a plurality of video frames, where each of the plurality of video frames includes redundant data on the picture content.
- the multiple video frames may be obtained from the video stream according to a preset rule after the video stream is acquired, or the video codec may acquire the multiple video frames from another device; this is not specifically limited in the embodiment of the present invention. In the embodiments of the present invention, "a plurality of" means at least two.
- the redundant data is data, related in picture content, among the plurality of video frames, for which information redundancy exists.
- the redundant data may be redundant data on the overall picture of the video frames, as in the embodiment shown in FIG. 5 below, or redundant data on a partial picture of the video frames, as in the embodiment shown in FIG. 12 below.
- the plurality of video frames are obtained from a video stream.
- the codec device first segments the overall video data stream into shots using scene-change detection, and determines whether each shot is static. Video frames are then extracted from each shot according to the shot type.
- the original video stream is segmented into short shot units by a scene-change detection technique.
- each shot is composed of temporally continuous video frames and represents a temporally and spatially continuous motion in one scene.
- a specific shot segmentation method may perform boundary segmentation and discrimination on the shots according to changes in the content of the video frames; for example, by locating the shot boundary and finding the position or time point of the boundary frame, the video can be segmented accordingly.
- video frames are extracted from each shot on the basis of the shot segmentation; the extracted video frames are the video frames to be acquired in step 201.
- the extraction of video frames is adaptively selected according to the length of the shot and the change of its content, and may be one or more frames of images capable of reflecting the main information content of the shot.
- alternatively, the codec device may directly extract from the video stream the plurality of video frames on which the following encoding method is performed, for example, extracting video frames according to a preset step size.
- Step 202 Perform reconstruction on multiple video frames to obtain scene information and reconstruction residuals of each video frame.
- the scene information includes data obtained by reducing redundancy of redundant data, and the reconstructed residual is used to represent a difference between the video frame and the scene information.
- the redundancy of the multiple video frames can be reduced by the reconstruction.
- the obtained scene information can also be in various forms.
- that is, the scene information includes data obtained by reducing the redundancy of the redundant data between frames, and a reconstructed residual represents the difference between a video frame and the scene feature; together they reconstruct the plurality of video frames.
- compared with the original video frames, the scene information and the reconstructed residuals reduce the redundancy of the redundant data and the overall amount of data while retaining the complete information.
- the purpose of scene reconstruction is to reduce the redundancy of the key frames in the scene.
- the principle of scene feature extraction is that the scene feature representation should be compact and occupy a small amount of data, and the data reconstructed from the scene information should match the original images as closely as possible, so that the reconstructed residuals are small.
- the scene reconstruction operation directly affects the compression effect of the video encoding.
- before step 202, the method of the embodiment of the present invention further includes classifying the plurality of video frames, for example, classifying the plurality of video frames based on the correlation of the picture content to obtain one or more clusters of video frames; step 202 is then performed on the video frames of the same cluster.
- the redundancy of redundant data between multiple video frames belonging to the same cluster is in accordance with a preset requirement, for example, greater than a threshold.
- the specific classification methods are various, such as cluster-based methods and classifier-based methods; for example, feature extraction and description are performed on the key frames, and the key frames are clustered in the feature space.
- the specific implementation process is described in detail in the following embodiments, which are not specifically limited in this embodiment of the present invention.
- a video frame for performing the method of the embodiment of the present invention is extracted for each shot. The video frames extracted from one shot can reflect the characteristics of that shot, and thus the classification of the extracted video frames can also be referred to as scene classification of the shots.
- the purpose of scene classification is to group together video frames, extracted from the shots, that are strongly related in content, so that the scene content as a whole can be analyzed later.
- the specific strategy of scene classification is realized by analyzing and clustering the key frames of each shot.
- the principle of scene classification is that the video frames in each cluster are highly correlated in picture content, so a large amount of information redundancy exists. This operation plays a decisive role in the subsequent scene reconstruction: the better the classification, the more highly aggregated the intra-class information, the larger the information redundancy, and the higher the coding efficiency.
- Step 203 Perform predictive coding on the scene information to obtain scene feature prediction encoded data.
- after the scene information is obtained, it can be predictively encoded to obtain the scene feature prediction encoded data.
- Step 204 Perform predictive coding on the reconstructed residual to obtain residual prediction encoded data.
- after the reconstructed residual is obtained, it can be predictively encoded to obtain the residual prediction encoded data.
- intra prediction coding or inter prediction coding may be employed.
- because the reconstructed residual does not include the scene features, it has sparse characteristics. For example, when the reconstructed residuals are represented by a matrix, most entries are 0 and only a few are nonzero, so the amount of encoded information is small.
- the redundancy of the redundant data is reduced, so the amount of data to be encoded is reduced, and the scene feature prediction encoded data and the residual prediction encoded data obtained after encoding are correspondingly smaller. Moreover, since each video frame is represented by the scene information and a reconstructed residual, and the reconstructed residual represents only the difference between the video frame and the scene feature, the reconstructed residual is sparse and its amount of coded information is reduced.
- the above steps 201 to 204 are video encoding methods, and the following are the steps of the video decoding method.
- Step 205 Acquire scene feature prediction encoded data and residual prediction encoded data.
- the video codec device acquires the encoded scene feature prediction encoded data and residual prediction encoded data.
- Step 206 Decode the scene feature prediction encoded data to obtain scene information.
- the video codec device decodes the scene feature prediction encoded data to obtain the scene information.
- the scene information includes data obtained by reducing the redundancy of redundant data, where the redundant data is redundant data on the picture content between each of the plurality of video frames.
- Step 207 Decode the residual prediction encoded data to obtain a reconstructed residual.
- the video codec also decodes the residual prediction encoded data to obtain a reconstructed residual.
- the reconstructed residual is used to represent the difference between the video frame and the scene information.
- the execution order of step 206 and step 207 is not specifically limited in the embodiment of the present invention.
- Step 208 Perform reconstruction according to the scene information and the reconstructed residual to obtain a plurality of video frames.
- the scene feature prediction encoded data and the reconstructed residual contain the information of the video frames; the scene information and the reconstructed residual are therefore combined in reconstruction to obtain the plurality of video frames.
- the redundancy of the video frames can be reduced, so that in the encoding operation the total compressed data amount of the obtained scene features and reconstructed residuals is reduced relative to the compressed data amount of the original video frames, reducing the amount of data obtained after compression.
- each video frame is decomposed into a scene feature and a reconstructed residual. Since the reconstructed residual contains only the residual information other than the scene information, its amount of information is small and sparse, so fewer codewords are needed for predictive coding, the amount of encoded data is small, and the compression ratio is high.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- the embodiments of the present invention can be used in various scenarios, for example, the video frame encoding and decoding method of the foregoing embodiment of the present invention is used in an HEVC scenario.
- the video frame obtained in step 201 of the foregoing embodiment is a key frame (I frame) in the HEVC scenario.
- in the HEVC scenario, the method in the embodiment of the present invention further includes: reconstructing the key frames (I frames) and, using them as references, performing conventional B/P-frame inter-prediction coding on the remaining frames.
- the method of the embodiment of the present invention further includes performing transform coding, quantization coding, and entropy coding on the predictive coded data according to the HEVC coding process to obtain video compression data.
- the predictive coding data includes scene feature prediction encoded data, residual predictive encoded data, B-frame predictive encoded data, and P-frame predictive encoded data.
- FIG. 3a is a schematic diagram comparing the flow of a video encoding method according to an embodiment of the present invention with the flow of the existing HEVC encoding method.
- FIG. 3b is a schematic diagram of a scenario related to a video encoding method according to an embodiment of the present invention.
- the video compression data is subjected to entropy decoding, inverse quantization processing, and inverse DCT (discrete cosine transform) processing according to the HEVC decoding process to obtain the corresponding prediction encoded data.
- the above-described operations of steps 205 to 208 are then performed using the scene feature prediction encoded data and the residual prediction encoded data in the prediction encoded data.
- the video frame reconstructed in step 208 is a key frame.
- the method in the embodiment of the present invention further includes performing B/P frame decoding according to the decoded key frame data, and arranging the decoded data frames in time order to obtain the complete sequence of the original video.
- FIG. 4a is a schematic diagram of a comparison between a flow of a video decoding method and a flow of an existing HEVC decoding method according to an embodiment of the present invention.
- FIG. 4b is a schematic diagram of a scenario of a video decoding method according to an embodiment of the present invention.
- in the original HEVC, which is too dependent on I frame coding and whose compression efficiency is not high, each I frame is independently coded, so the amount of I frame compressed data is high and a large amount of information redundancy exists between I frames.
- by applying the method of the embodiment of the present invention to the key frames, the redundant information of the I frames is reduced, and the amount of encoded data of the I frames is reduced.
- the method of the embodiment of the present invention identifies and classifies a video shot and a scene, performs overall data analysis and reconstruction on a key frame (I frame) in the scene, and encodes the scene feature and the representation residual. It effectively avoids the problem of inefficient compression in a single key frame, and introduces video context information to improve the compression ratio.
- the method in the embodiment of the present invention can also be applied to other video frames that need to be independently coded: the video frames that would otherwise be independently coded are reconstructed to obtain scene information and reconstructed residuals, which are then coded separately, reducing the amount of compressed data of those video frames.
- the method of the embodiment of the present invention is described in the context of the HEVC standard. It should be understood that the video frame encoding and decoding method provided by the embodiment of the present invention can also be applied to other scenarios. The specific usage scenarios are not limited in the embodiment of the present invention.
- the overall frame picture of the reconstructed video frame has redundant data
- the partial frame picture of the reconstructed video frame has redundant data
- the overall frame picture of the video frame has redundant data
- FIG. 5 is a flowchart of a method for a video encoding method according to an embodiment of the present invention.
- a video encoding method according to an embodiment of the present invention includes:
- Step 501 Acquire a video stream.
- the encoding device acquires a video stream that includes a plurality of video frames.
- Step 502 Perform shot segmentation on the video stream to obtain multiple shots.
- the shot segmentation module of the encoding device may segment the video stream into multiple shots, so that the video frames to be reconstructed can be extracted according to the shots.
- a shot includes temporally consecutive video frames and represents a temporally and spatially continuous motion in one scene.
- step 502 can be implemented by the following steps:
- Step A1 Acquire a video stream.
- Step A1 is step 501, wherein the video stream includes a plurality of video frames.
- Step A2 Extract feature information of the first video frame and the second video frame, respectively.
- the feature information is used to describe the picture content of the video frame.
- the video stream may be analyzed by means of feature information, which describes characteristics of a video frame, for example image color, shape, edge contour, or texture features.
- the first video frame and the second video frame are video frames in the video stream, and the first video frame and the second video frame are not currently assigned to any of the shots.
- Step A3 Calculate the shot distance between the first video frame and the second video frame according to the feature information.
- the shot distance is used to indicate the degree of difference between the first video frame and the second video frame.
- Step A4 Determine whether the shot distance is greater than a preset shot threshold.
- the preset shot threshold can be set manually.
- Step A5 If the shot distance is greater than the preset shot threshold, a target shot is segmented from the video stream; if the shot distance is less than the preset shot threshold, the first video frame and the second video frame are attributed to the same shot.
- the start frame of the target shot is the first video frame, the end frame of the target shot is the video frame preceding the second video frame, the target shot is one of the shots of the video stream, and a shot is a segment of temporally continuous video frames.
- the shot distance between the first video frame and the second video frame being greater than the preset shot threshold indicates that the difference between the two frames reaches the preset requirement, while the difference between the first video frame and each frame before the second video frame does not reach the preset requirement, that is, their distance is less than the preset shot threshold; therefore, in the video stream, the video frames from the first video frame to the frame preceding the second video frame belong to the target shot. Otherwise, when the first video frame is located before the second video frame, the shot distance is calculated between the frame following the second video frame and the first video frame, and steps A4 and A5 are repeated. Through repeated execution of the above steps, multiple shots can be obtained from the video stream.
- the feature information of the video frames is first extracted, and the content difference is measured based on these features.
- a more common method is to extract image color, shape, edge contour or texture features, or extract multiple features and normalize them.
- the method of the embodiment of the present invention describes the image by using a block color histogram.
- each video image frame is first scaled to a fixed size (e.g. 320*240) and downsampled to reduce the effect of noise on the image. The image is then divided into 4*4 blocks, and an RGB color histogram is extracted for each block. To reduce the impact of illumination on the image, histogram equalization is applied. Finally, the distance between video frames is calculated based on their feature information.
- the distance between video frames can be measured by a measure such as Mahalanobis distance and Euclidean distance.
- this example uses the normalized histogram intersection method for the measurement.
- when the shot distance is greater than the preset shot threshold, the earlier of the two video frames between which the distance is calculated is determined as the start frame of the shot, and the frame preceding the later video frame is determined as the end frame of that shot; otherwise the two video frames belong to the same shot. Finally, a complete video can be split into multiple sets of separate shots.
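- the following sketch illustrates the block-histogram shot-boundary test described above in Python with OpenCV and NumPy. The fixed 320*240 size, 4*4 grid, 8-bin-per-channel histograms, function names, and threshold value are illustrative choices, not prescribed by this document:

```python
import cv2
import numpy as np

def frame_feature(frame):
    """Block color histogram: scale to 320x240, equalize each channel to
    reduce illumination effects, split into a 4x4 grid, and concatenate
    per-block normalized RGB histograms."""
    img = cv2.resize(frame, (320, 240))
    img = cv2.merge([cv2.equalizeHist(c) for c in cv2.split(img)])
    bh, bw = 240 // 4, 320 // 4
    feats = []
    for by in range(4):
        for bx in range(4):
            block = img[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            hist = cv2.calcHist([block], [0, 1, 2], None,
                                [8, 8, 8], [0, 256] * 3).ravel()
            feats.append(hist / (hist.sum() + 1e-9))
    return np.concatenate(feats)

def shot_distance(f1, f2, n_blocks=16):
    """Normalized histogram intersection, converted to a distance in [0, 1]."""
    return 1.0 - np.minimum(f1, f2).sum() / n_blocks

def segment_shots(frames, threshold=0.35):
    """Steps A2-A5: compare each frame against the current shot's start
    frame; a distance above the preset threshold opens a new shot."""
    feats = [frame_feature(f) for f in frames]
    shots, start = [], 0
    for i in range(1, len(frames)):
        if shot_distance(feats[start], feats[i]) > threshold:
            shots.append((start, i - 1))   # shot = [start, frame before boundary]
            start = i
    shots.append((start, len(frames) - 1))
    return shots
```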
- Step 503 Extract key frames from the obtained shots.
- a key frame is extracted from each lens, and the reconstruction operation of the method of the embodiment of the present invention is performed with the key frame.
- step 503 can be implemented by executing the following step A5.
- Step A5 For each shot in the video stream, the key frame is extracted according to the frame distance between the video frames in the shot.
- the frame distance between any two adjacent key frames in each shot is greater than a preset frame distance threshold, and the frame distance is used to indicate the degree of difference between the two video frames. Then, the reconstruction of the plurality of video frames is performed with key frames of each shot to obtain scene information and a reconstruction residual of each video frame.
- current key frame extraction algorithms mainly include sampling-based methods, color-feature-based methods, content-analysis-based methods, motion-analysis-based methods, cluster-based methods, and compression-based methods.
- the starting frame of each shot is set as a key frame.
- each frame is described and measured using the block color histogram feature and the histogram intersection method.
- in addition, the method of the embodiment of the present invention adds a judgment of the type of each shot: it first determines whether the shot is a static picture according to the feature-space distance between adjacent frames. If the inter-frame distance between all frames in the shot is 0, the shot is determined to be a static picture and no further key frames are extracted; otherwise it is a dynamic picture.
- for a dynamic picture, the content of each frame is measured in chronological order against the previous key frame, and if the distance is greater than the set threshold, the frame is set as a key frame.
- Figure 8 shows the key frame extraction process.
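- a minimal key-frame extraction sketch following the strategy above, reusing the frame_feature and shot_distance helpers from the shot-segmentation sketch; the threshold value is illustrative:

```python
def extract_key_frames(frames, shot, threshold=0.3):
    """The start frame of the shot is always a key frame; walking forward
    in time, a frame becomes a new key frame when its distance to the
    previous key frame exceeds the threshold. A static shot (all adjacent
    inter-frame distances zero) keeps only its start frame."""
    start, end = shot
    feats = [frame_feature(frames[i]) for i in range(start, end + 1)]
    if all(shot_distance(feats[i], feats[i + 1]) == 0
           for i in range(len(feats) - 1)):
        return [start]                      # static picture: no extra key frames
    keys, last = [start], 0
    for i in range(1, len(feats)):
        if shot_distance(feats[last], feats[i]) > threshold:
            keys.append(start + i)
            last = i
    return keys
```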
- the method of the embodiment of the present invention is described in the HEVC scenario.
- a shot obtained in the above steps can be used as a GOP; one shot is one GOP.
- the start frame of a shot is a key frame, the video frames extracted from the shot through step A5 are also key frames, and the other video frames of the shot can be used as B frames and P frames.
- the key frame extraction operation of the embodiment of the present invention takes the contextual information of the video into account, so that when the key frames are subsequently classified, the classification effect is better, which contributes to improving the compression ratio of the subsequent coding.
- the key frame sequence is generated quickly and can respond in time to the user's fast-forward and scene-switching requirements.
- the user can preview the video scene according to the sequence of key frames, and accurately locate the video scene segments that are of interest to the user, thereby improving the user experience.
- a video stream is acquired, wherein the video frames of the video stream include an I frame, a B frame, and a P frame. Then, an I frame is extracted from the video stream, and step 504 or step 505 is performed with the I frame.
- the encoding device acquires a plurality of key frames, which are video frames to be reconstructed to reduce redundancy.
- the method of the embodiment of the present invention further includes the step of classifying the key frame, that is, step 504.
- Step 504 Classify a plurality of key frames based on the correlation of the picture content to obtain key frames of one or more classification clusters.
- in the method of the embodiment of the present invention, step 505 may then be performed on the key frames of the same classification cluster.
- the picture content between the key frames is highly correlated, and there is a large amount of redundant data. The better the classification effect, that is, the more highly aggregated the information of the multiple key frames in the same cluster and the greater their redundancy, the more significant the reduction of redundancy achieved by the subsequent reconstruction operation.
- after the classification operation, one or more classification clusters are obtained; the multiple key frames in the same classification cluster share more of the same picture content, so the redundancy of the redundant data between these key frames is larger.
- if the different key frames are classified based on the shots, the classification may also be referred to as scene classification. Of course, the classification operation may also directly classify the different key frames without being based on the shots.
- the classification operation of the method provided by the embodiment of the present invention is referred to as a scene classification operation.
- in the clustering classification method, the plurality of key frames are classified based on the correlation of the picture content to obtain key frames of one or more classification clusters, including:
- Step B1 Extract feature information of each key frame of the plurality of key frames.
- the feature information of the key frame may be an underlying feature or a middle layer semantic feature.
- Step B2 Determine the cluster distance between any two key frames according to the feature information.
- the cluster distance is used to represent the similarity between two key frames.
- Any two key frames here include all the key frames extracted in the above steps, which may be key frames belonging to different shots, or key frames belonging to the same shot.
- the difference between frames within a shot is smaller than the difference between frames of different shots.
- different feature spaces may be selected, and different feature spaces correspond to different metrics, so the cluster distance and the shot distance may be different.
- Step B3 Cluster the video frames according to the cluster distance to obtain video frames of one or more clusters.
- scene classification is achieved by analyzing and clustering the key frames of each shot.
- Scene classification is closely related to scene reconstruction.
- the first principle of scene classification is that the key frames in each cluster are highly correlated at the content level of the screen, and there is a large amount of information redundancy.
- the existing scene classification algorithms are mainly divided into two categories: a) based on the underlying feature scene classification algorithm; b) based on the middle layer semantic feature modeling scene classification algorithm. These methods are based on feature detection and description, and reflect the description of the scene content at different levels.
- the underlying image features may include features such as color, edge, texture, SIFT (Scale-invariant feature transform), HOG (Histogram of Oriented Gradient), and GIST.
- Middle-level semantic features include Bag of Words, deep learning network features, and more.
- the embodiment of the present invention selects a relatively simple GIST global feature to describe the overall content of the key frame.
- the distance measure function uses the Euclidean distance to measure the similarity of the two images.
- the clustering algorithm can adopt traditional K-means, graph cutting, hierarchical clustering and other methods.
- a condensed (agglomerative) hierarchical clustering algorithm is used to cluster the key frames. The number of clusters depends on the similarity threshold setting: the higher the threshold, the greater the key frame information redundancy within each class and the larger the number of clusters.
- the specific flow chart of the scene classification is shown in the following figure.
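- a sketch of the clustering-based scene classification above, assuming scikit-learn's agglomerative clustering; gist_descriptor is a crude stand-in for a real GIST implementation, and the distance threshold is illustrative:

```python
import cv2
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def gist_descriptor(frame, size=(32, 32)):
    """Stand-in for a real GIST descriptor: a downscaled grayscale
    thumbnail that captures the coarse global layout of the scene."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, size).astype(np.float32).ravel() / 255.0

def cluster_key_frames(key_frames, distance_threshold=1.2):
    """Agglomerative (condensed hierarchical) clustering of key frames on
    GIST-like features under a Euclidean metric; the threshold controls
    the number of clusters and the intra-cluster redundancy."""
    X = np.stack([gist_descriptor(f) for f in key_frames])
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=distance_threshold,
                                    linkage="average")
    return model.fit_predict(X)            # one cluster label per key frame
```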
- the above clustering-based scene classification strategy is beneficial to the improvement of coding speed.
- the following classification mechanism based on the classifier model is beneficial to the improvement of coding precision.
- the main idea of the scene classification strategy based on the classifier model is to perform discriminant training on each shot according to the shot segmentation result to obtain a plurality of discriminant classifiers.
- each key frame is discriminated by the classifiers, and a key frame with a high score is considered to belong to the same scene as the corresponding shot.
- the specific process is as follows:
- the classification method of the video coding method in the embodiment of the present invention includes:
- Step C1 Perform discrimination training according to each shot segmented from the video stream to obtain a plurality of classifiers corresponding to the shots.
- the optional classifier models are: decision tree, Adaboost, Support Vector Machine (SVM), deep learning and other models.
- Step C2 Using the target classifier to discriminate the target key frame to obtain a discriminant score.
- the target classifier is one of the plurality of classifiers obtained in step C1, the target key frame is one of the key frames, and the discriminant score is used to indicate the extent to which the target key frame belongs to the scene of the shot to which the target classifier corresponds.
- Step C3 When the discriminant score is greater than a preset score threshold, it is determined that the target key frame belongs to the same scene as the shot to which the target classifier belongs.
- that is, when the discriminant score is greater than the preset score threshold, the target key frame may be considered to belong to the same scene as the shot to which the target classifier belongs; otherwise, the target key frame and that shot are not considered to belong to the same scene.
- Step C4 Determine key frames of one or more clusters according to the key frames belonging to the same scene as each shot.
- the operation of classifying using a classifier includes two main phases: training and discrimination.
- in the training phase, a classifier is trained for each shot; taking an SVM model as an example, the classifier parameter w can be obtained by solving

  min_w (1/2)‖w‖² + λ Σ_{i=1}^{n} max(0, 1 − y_i·wᵀφ(I_i)),

- where y_i is the label corresponding to the i-th training sample (a positive sample corresponds to the label 1 and a negative sample to −1), φ(·) is the feature mapping function, n is the total number of training samples, w is the classifier parameter, and I_i is the i-th training sample.
- each key frame is then discriminated by the classifier model trained for each shot; the specific formula is:

  p_ij = exp(w_jᵀφ(I_i) + b_j) / Σ_k exp(w_kᵀφ(I_i) + b_k),

- where w_j and b_j are the classifier parameters corresponding to the j-th shot, and the denominator is the normalization factor.
- if the probability p_ij is greater than the set threshold, it is considered that key frame i and shot j belong to one scene; i and j are positive integers.
- in this way, the correspondences between multiple sets of key frames and shots can be obtained. These correspondences indicate which key frames and shots belong to the same scene, and the encoding device can then determine the key frames of one or more clusters according to these correspondences.
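- a sketch of the classifier-model strategy in steps C1-C4, using scikit-learn linear SVMs (one per shot, with that shot's frame features as positive samples) and a softmax over the decision scores standing in for the normalized discriminant above; function names and the score threshold are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def classify_by_shot_models(shot_features, key_frame_features, score_threshold=0.5):
    """One discriminative classifier per shot (frames of the shot are the
    positive samples, frames of all other shots the negatives); each key
    frame is assigned to every shot whose normalized score exceeds the
    threshold. shot_features: list of arrays, one (n_frames, dim) per shot."""
    classifiers = []
    for j, pos in enumerate(shot_features):
        neg = np.vstack([f for k, f in enumerate(shot_features) if k != j])
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
        classifiers.append(LinearSVC().fit(X, y))

    assignments = []
    for x in key_frame_features:
        scores = np.array([clf.decision_function(x[None])[0]
                           for clf in classifiers])
        p = np.exp(scores) / np.exp(scores).sum()   # softmax normalization
        assignments.append(set(np.flatnonzero(p > score_threshold)))
    return assignments
```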
- in some embodiments, step 504 may not be included.
- after the classification operation, the redundancy of the redundant data between key frames in the same classification cluster is large, so that when the key frames of the same cluster are subsequently reconstructed, the redundancy of the redundant data can be further reduced to further reduce the amount of encoded data.
- in addition, the video is compressed according to scene, which facilitates later content clipping and the video "green mirror" function (that is, generating a highlights video according to heat analysis).
- Step 505 Perform reconstruction on multiple key frames of the same cluster to obtain scene features and reconstruction residuals of each video frame.
- Each of the plurality of key frames includes the same picture content, that is, redundant data included on the picture content between each key frame. If these key frames are not reconstructed, the encoding device will repeatedly encode the same picture content between these key frames.
- the reconstructed scene features are used to represent the same picture content between each video frame, such that the scene information includes data resulting from reducing redundancy of redundant data.
- the reconstructed residual is used to represent the difference between the key frame and the scene feature.
- the scene feature thus obtained may represent the overall information of the frame, so that the reconstruction operation of step 505 is directed to a scene in which the entire screen of the plurality of video frames has the same picture content.
- the specific implementation of step 505 is as follows:
- first, the key frames of the same classification cluster are converted into an observation matrix.
- the observation matrix is used to represent the plurality of key frames in a matrix form. Then, the observation matrix is reconstructed according to the first constraint condition to obtain a scene feature matrix and a reconstructed residual matrix.
- the scene feature matrix is used to represent the scene features in a matrix form
- the reconstructed residual matrix is used to represent the reconstructed residuals of the plurality of key frames in a matrix form.
- the first constraint condition is used to constrain the scene feature matrix to be low rank and the reconstructed residual matrix to be sparse.
- the observation matrix is reconstructed according to the first constraint condition to obtain the scene feature matrix and the reconstructed residual matrix, including: calculating the scene feature matrix and the reconstructed residual matrix according to a first preset formula, where the scene feature matrix is a low rank matrix and the reconstructed residual matrix is a sparse matrix;
- the first preset formula can be written as

  min_{F,E} Rank(F) + λ‖E‖₁  subject to  D = F + E,

  solved in practice through the convex relaxation

  min_{F,E} ‖F‖_* + λ‖E‖₁  subject to  D = F + E,

- where D is the observation matrix, F is the scene feature matrix, E is the reconstructed residual matrix, λ is a weight parameter used to balance the relationship between the scene feature matrix F and the reconstructed residual matrix E, Rank(·) is the matrix rank function, ‖·‖₁ is the matrix L1 norm, and ‖·‖_* is the matrix kernel (nuclear) norm.
- the scene reconstruction performs content analysis on each classification cluster obtained by the scene classification, and extracts scene features and representation coefficients suitable for reconstructing all key frames in the scene.
- models that can be used for scene reconstruction include RPCA (robust principal component analysis), LRR (low rank representation), SR (sparse representation), SC (sparse coding), SDAE (sparse autoencoder deep learning models), CNN (convolutional neural networks), and so on.
- in the embodiment of the present invention, the representation coefficient may be an identity matrix; multiplying the scene feature by such a representation coefficient still yields the scene feature, so in some embodiments of the present invention the representation coefficient can be ignored.
- that is, the representation coefficient may or may not be used; when it is not used, in the decoding and reconstruction stage only the scene feature and the reconstruction residual are required to represent the original video frame.
- the video coding method in this embodiment uses RPCA to reconstruct key frames in the scene.
- the RPCA-based scene reconstruction strategy reconstructs the overall content data of the key frames, which can reduce the blocking artifacts caused by block-based prediction.
- assume a scene S contains N key frames, that is, a certain classification cluster includes N key frames, where N is a natural number. The key frames are vectorized and stacked column by column into the observation matrix D = [vec(I_1), vec(I_2), …, vec(I_N)], where I_i is the i-th key frame.
- the scene feature matrix F and the reconstructed residual matrix E are then obtained by solving the first preset formula above, where λ is a weight parameter used to balance the relationship between F and E, Rank(·) is the matrix rank function, and ‖·‖₁ is the matrix L1 norm.
- Figure 11 shows an example diagram of RPCA-based scene reconstruction, where key frames 1 to 3 belong to different shot segments of the same video.
- the scene feature matrix F has rank 1, so only one column of the matrix needs to be compressed.
- the residual matrix E is 0 in most regions, so only a small amount of information is needed to represent E.
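- a compact RPCA sketch via the inexact augmented Lagrange multiplier method, assuming NumPy; the λ and μ defaults follow common practice (λ = 1/√max(m, n)) and are not prescribed by this document:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def shrink(M, tau):
    """Soft thresholding: proximal operator of the L1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def rpca(D, lam=None, mu=None, tol=1e-6, max_iter=200):
    """Inexact-ALM RPCA: split D into a low-rank scene feature matrix F
    and a sparse reconstructed residual matrix E with D ≈ F + E."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or 0.25 * m * n / (np.abs(D).sum() + 1e-12)
    F = np.zeros_like(D)
    E = np.zeros_like(D)
    Y = np.zeros_like(D)                 # Lagrange multipliers
    for _ in range(max_iter):
        F = svt(D - E + Y / mu, 1.0 / mu)
        E = shrink(D - F + Y / mu, lam / mu)
        residual = D - F - E
        Y += mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(D):
            break
    return F, E

# Usage sketch: the columns of D are the vectorized key frames of one cluster.
# key_frames: list of HxW grayscale arrays (illustrative)
# D = np.stack([k.ravel() for k in key_frames], axis=1).astype(np.float64)
# F, E = rpca(D)   # F low rank (shared scene), E sparse (per-frame residuals)
```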
- the scene feature in the embodiment of the present invention is one specific implementation of the scene information, and step 505 is one specific implementation of the step of reconstructing multiple video frames to obtain the scene information and the reconstructed residuals of each video frame.
- the method of the embodiment of the present invention may perform the reconstruction operation on key frames whose overall frame information has redundant data.
- therefore, the key frames need to be detected first to determine whether the currently selected key frames are suitable for the reconstruction operation of the method of the embodiment of the present invention, so that adaptive coding can be performed according to the content of the video scene.
- before the plurality of video frames are reconstructed to obtain the scene features and reconstructed residuals, the method of the embodiment of the present invention further includes: extracting picture feature information of each of the multiple video frames, where the extracted picture feature information may be a global feature or a local feature of the video frame, and specifically includes a GIST global feature, a HOG global feature, a SIFT local feature, and the like, which are not specifically limited in this embodiment of the present invention.
- the encoding device then calculates content metric information according to the picture feature information, where the content metric information is used to measure the difference in picture content of the multiple video frames, that is, the content consistency of the key frames; the content consistency of the key frames can be measured in terms of feature variance, Euclidean distance, and the like. When the content metric information is not greater than a preset metric threshold, the step of reconstructing the plurality of video frames to obtain the scene features and reconstruction residuals of each video frame is performed.
- the method of the embodiment of the present invention further includes:
- Step D1 Extract global GIST features of each of the plurality of video frames.
- This global GIST feature is used to describe the characteristics of keyframes.
- Step D2 Calculate the variance of the scene GIST feature according to the global GIST feature.
- the scene GIST feature variance is used to measure the content consistency of multiple video frames.
- the scene GIST feature variance is used to measure the content consistency of multiple key frames of the same cluster.
- Step D3 When the scene GIST feature variance is not greater than a preset variance threshold, step 505 is performed.
- the video coding and decoding device may perform intra prediction coding on the scene feature and the reconstructed residual, respectively.
- steps D1 to D3 are specific methods for determining whether the key frame of the same cluster is applicable to step 505.
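- a minimal sketch of the adaptive gate in steps D1 to D3 (and of its mirror image, steps E1 to E3 of the sub-block embodiment below), reusing the gist_descriptor stand-in from the clustering sketch; the variance threshold is illustrative:

```python
def scene_gist_variance(key_frames):
    """Mean per-dimension variance of the cluster's GIST features: a small
    value means the key frames agree on their overall picture content."""
    X = np.stack([gist_descriptor(f) for f in key_frames])
    return float(X.var(axis=0).mean())

def choose_reconstruction(key_frames, var_threshold=0.01):
    if scene_gist_variance(key_frames) <= var_threshold:
        return "whole-frame"   # step 505: RPCA over whole key frames
    return "sub-block"         # step 1205: split frames into sub-blocks first
```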
- Step 506 Perform predictive coding on the scene features to obtain scene feature prediction encoded data.
- Step 507 Perform predictive coding on the reconstructed residual to obtain residual prediction encoded data.
- the predictive coding portion of the encoding device includes two parts of intra prediction coding and inter prediction coding.
- the scene features and reconstruction errors are encoded by intra prediction, and the remaining frames of the shot, that is, the non-key frames of the shot, are inter-predictive encoded.
- the specific process of intra prediction coding is similar to the HEVC intra coding module. Since the scene feature matrix has a low rank, only the key columns of the scene feature matrix need to be encoded.
- the reconstruction error belongs to residual coding, and the amount of coded data is small and the compression ratio is high.
- Step 508 Perform reconstruction according to the scene feature and the reconstructed residual to obtain a reference frame.
- in order to perform inter-frame predictive coding on the B and P frames, a reference frame needs to be obtained.
- the key frame is used as the reference frame.
- the reverse reconstruction scheme is adopted to prevent the error from spreading between the BP frames. The reconstruction is performed according to the scene feature and the reconstruction residual, and the following step 509 is performed with reference to the obtained reference frame.
- alternatively, the B/P-frame inter prediction can be directly performed using the key frames extracted in step 503.
- Step 509 Perform inter-prediction encoding on the B frame and the P frame with reference to the reference frame, and obtain B frame predictive encoded data and P frame predictive encoded data.
- Inter-frame coding first reconstructs the key frame (I frame) according to the scene features and reconstruction error, and then performs motion compensation prediction and coding on the BP frame content.
- the specific inter prediction encoding process is the same as HEVC.
- Step 510 Perform transform coding, quantization coding, and entropy coding on the prediction encoded data to obtain video compressed data.
- the predictive coded data includes scene feature predictive coded data, residual predictive coded data, B frame predictive coded data, and P frame predictive coded data.
- on the basis of the predictive coding, the data is subjected to transform coding, quantization coding, and entropy coding, which are the same as in HEVC.
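- for orientation only, a toy sketch of the transform and quantization steps on one 8x8 block using an orthonormal 2-D DCT (SciPy); real HEVC uses integer transforms of several sizes and QP-driven quantization, so this is a simplification, and the step size is illustrative:

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    """Orthonormal 2-D type-II DCT."""
    return dct(dct(block, norm="ortho", axis=0), norm="ortho", axis=1)

def idct2(coefs):
    return idct(idct(coefs, norm="ortho", axis=0), norm="ortho", axis=1)

block = np.arange(64, dtype=np.float64).reshape(8, 8)   # toy 8x8 block
q = 16.0                                  # illustrative quantization step
levels = np.round(dct2(block) / q)        # transform coding + quantization
recon = idct2(levels * q)                 # decoder: dequantize + inverse DCT
```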
- the video coding method in the embodiment of the present invention can improve the video compression ratio.
- the entire scene information can be represented by a small amount of information, and the code rate is lowered, and the video quality is guaranteed.
- the reduced compressed data volume makes the solution more suitable for the transmission and storage of images in a low bit rate environment.
- the existing video on demand (VOD), network personal video recording (NPVR), and catch-up TV video services account for 70% of the server's storage resources and network bandwidth.
- the technical solution of the embodiment of the invention can reduce the pressure on the storage server and improve network transmission efficiency.
- the CDN edge nodes can store more videos, so the user hit rate is greatly increased, the back-to-source rate is reduced, the user experience is improved, and network device consumption is reduced.
- the method in the embodiment of the present invention can generate different code rate videos by performing feature extraction on different levels of the scene.
- the same picture content is de-duplicated and represented by scene features, which can reduce the redundancy of redundant information of the multiple video frames. Therefore, in the encoding operation, the obtained scene feature and the compressed data amount of the reconstructed residual total are reduced relative to the compressed data amount of the original video frame, and the amount of data obtained after compression is reduced.
- each video frame is decomposed into a scene feature and a reconstructed residual. Since the reconstructed residual contains only the residual information other than the scene information, its amount of information is small and sparse, so fewer codewords are needed for predictive coding, the amount of encoded data is small, and the compression ratio is high.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- the video codec device may perform a decompression operation on the compressed encoded data.
- FIG. 6 is a flowchart of a method for decoding a video according to an embodiment of the present invention.
- a video decoding method according to an embodiment of the present invention includes:
- Step 601 Acquire video compression data.
- the decoding device acquires video compressed data, which may be video compressed data obtained by the video encoding method of the embodiment shown in FIG. 5.
- Step 602 Perform entropy decoding, inverse quantization processing, and inverse DCT on the video compressed data to obtain prediction encoded data.
- the prediction encoded data includes scene feature prediction encoded data, residual prediction encoded data, B-frame predictive encoded data, and P-frame predictive encoded data.
- the video compression data needs to be entropy decoded, inverse quantized, and inverse DCT transformed according to the HEVC decoding process to obtain the corresponding prediction encoded data.
- Step 603 Decode the scene feature prediction encoded data to obtain a scene feature.
- the scene feature is used to represent the same picture content between the video frames; that is, the scene feature obtained by decoding the scene feature prediction encoded data represents the picture content shared by each of the plurality of video frames.
- Step 604 Decode the residual prediction encoded data to obtain a reconstructed residual.
- the reconstructed residual is used to represent the difference between the video frame and the scene information.
- the scene feature prediction encoded data and the key frame error prediction encoded data are respectively decoded to obtain a scene feature matrix F and a reconstructed residual e i .
- Step 605 Perform reconstruction according to the scene feature and the reconstructed residual to obtain multiple I frames.
- in the encoding method of the video frames, the key frames were reconstructed to obtain the scene feature and the reconstruction residuals; therefore, reconstructing from the scene feature and the reconstruction residuals yields the multiple key frames.
- Step 606 Perform inter-frame decoding on the B frame predictive coded data and the P frame predictive coded data by using the I frame as a reference frame to obtain a B frame and a P frame.
- Step 607 Arranging the I frame, the B frame, and the P frame in chronological order to obtain a video stream.
- the video streams are obtained by arranging the three types of video frames in chronological order.
- the original data reconstruction is performed in combination with the decoded scene feature F and the key frame error e i to obtain key frame decoded data.
- BP frame decoding is performed according to the decoded key frame data, and the decoded data frames are arranged in chronological order to obtain a complete sequence of the original video.
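- a minimal decoder-side sketch of this reconstruction, assuming the RPCA decomposition above with the representation coefficients treated as the identity: each key frame is its column of F plus its residual column:

```python
import numpy as np

def reconstruct_key_frames(F, E, frame_shape):
    """Key frame i = F[:, i] + E[:, i], reshaped back to an image. Since F
    is low rank, only its key columns need to be transmitted; the full
    matrix can be re-expanded at the decoder before this step."""
    return [(F[:, i] + E[:, i]).reshape(frame_shape) for i in range(F.shape[1])]
```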
- after the scene feature prediction encoded data and the residual prediction encoded data are obtained, they can be decoded by the video decoding method shown in FIG. 6 to obtain the video frames.
- the embodiment shown in FIG. 5 is mainly applied to perform efficient compression in a redundant scenario in which overall information between key frames exists.
- the embodiment shown in FIG. 12 is applied to perform efficient compression in scenes where the local information of the key frames is redundant; the local information may be, for example, a texture image, a gradual shot transition, or the like.
- FIG. 12 is a flowchart of a method for a video encoding method according to an embodiment of the present invention.
- a video encoding method provided by an embodiment of the present invention includes:
- Step 1201 Acquire a video stream.
- for details of the implementation of step 1201, reference may be made to step 501.
- Step 1202 Perform lens segmentation on the video stream to obtain multiple shots.
- for details of the implementation of step 1202, reference may be made to step 502.
- Step 1203 Extract key frames from the obtained shots.
- for details of the implementation of step 1203, reference may be made to step 503.
- the video frames to be reconstructed may also be acquired in other ways; for example, the video frames of the video stream include I frames, B frames, and P frames, the I frames are extracted from the video stream, and the subsequent step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks is performed with the I frames.
- Step 1204 classify a plurality of key frames based on the correlation of the picture content to obtain key frames of one or more classification clusters.
- for details of the implementation of step 1204, reference may be made to step 504.
- the method of the embodiment of the present invention may perform the reconstruction operation on key frames whose local frame information has redundant data.
- therefore, the key frames need to be detected first to determine whether the currently selected key frames are suitable for the reconstruction operation of the method of the embodiment of the present invention. That is, before each of the plurality of video frames is split to obtain a plurality of frame sub-blocks, the method of the embodiment of the present invention further includes: extracting picture feature information of each of the plurality of video frames, where the extracted picture feature information may be a global feature or a local feature of the video frame, specifically a GIST global feature, a HOG global feature, a SIFT local feature, and the like, which are not specifically limited in the embodiment of the present invention.
- the encoding device then calculates content metric information according to the picture feature information, where the content metric information is used to measure the difference in picture content of the multiple video frames, that is, the content consistency of the key frames; the content consistency of the key frames can be measured in terms of feature variance, Euclidean distance, and the like.
- when the content metric information is greater than the preset metric threshold, the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks is performed.
- the method of the embodiment of the present invention further includes:
- Step E1 Extract global GIST features of each of the plurality of video frames.
- step E1 is to extract global GIST features for each of a plurality of key frames of the same cluster. This global GIST feature is used to describe the characteristics of keyframes.
- Step E2 Calculate the variance of the scene GIST feature according to the global GIST feature.
- the scene GIST feature variance is used to measure the content consistency of multiple video frames
- the scene GIST feature variance is used to measure the content consistency of multiple key frames of the same cluster.
- Step E3 When the scene GIST feature variance is greater than the preset variance threshold, step 1205 is performed.
- the video frames in the steps E1 to E3 are key frames in the HEVC scenario.
- the key frames are key frames belonging to the same cluster.
- steps E1 to E3 are specific methods for determining whether the key frame of the same cluster is applicable to step 1205. If the variance of the scene GIST feature of the plurality of key frames is greater than the preset variance threshold, it indicates that the local part of the frame picture of the multiple key frames has redundant data, so that step 1205 or step 1206 can be performed on the multiple key frames to Reduce the redundancy of these local redundant data.
- Step 1205 Split each video frame in multiple video frames to obtain multiple frame sub-blocks.
- the encoding device splits multiple key frames of the same cluster to obtain a plurality of frame sub-blocks.
- each of the plurality of video frames includes redundant data at local locations; that is, redundant data exists both between different video frames and within a video frame, and this redundant data is located in local regions of the frame pictures.
- for example, one video frame has a window image in the lower part of its picture, and another video frame has the same window image in the upper part of its picture; the window image constitutes redundant data.
- therefore, the video frames can be split first, so that the frame picture becomes frame sub-block pictures and the granularity of the redundant data relative to the picture is reduced, which facilitates the acquisition of scene feature bases; see the description of step 1206 for details of the scene feature base.
- the plurality of frame sub-blocks obtained by the splitting may be equal in size or unequal.
- the frame sub-blocks may be pre-processed, such as zooming in or out.
- Step 1206 Perform reconstruction on multiple frame sub-blocks to obtain a scene feature, a representation coefficient of each frame sub-block in the plurality of frame sub-blocks, and a reconstruction residual of each frame sub-block.
- the scene feature includes multiple independent scene feature bases, and the independent scene feature bases in the scene feature cannot be reconstructed from each other.
- the scene feature base is used to describe the picture content features of a frame sub-block, the representation coefficient represents the correspondence between the scene feature base and the frame sub-block, and the reconstructed residual represents the difference between the frame sub-block and the scene feature base.
- the reconstructed residual may be a specific value or zero.
- the representation coefficients may be stored in separate fields and transmitted by encoding them as auxiliary information, for example by adding corresponding fields in the image header, slice header, or macroblock information.
- the scene feature base can be configured in various forms, for example, it can be a certain frame sub-block, or a feature block in a specific space.
- Multiple scene feature bases may constitute scene features.
- different scene feature bases cannot be reconstructed from each other, and thus these scene feature bases constitute a basic image unit.
- the basic image unit and the corresponding reconstructed residual combination can obtain a certain frame sub-block. Since there are multiple basic image units, it is necessary to represent the coefficients to match the scene feature base and the reconstructed residual corresponding to the same frame sub-block.
- one frame sub-block may correspond to one scene feature base or to multiple scene feature bases; when multiple scene feature bases correspond to one frame sub-block, the scene feature bases are superimposed and the reconstructed residual is added to obtain the reconstructed frame sub-block.
- the scene features are composed of scene feature bases that cannot be reconstructed from one another, and the additional reconstructed residual represents the difference between a frame sub-block and a scene feature base; thus, among the same scene feature bases arising from multiple frames, the scene feature may record only one, so the scene information includes data obtained by reducing the redundancy of the redundant data.
- in this way, the data of the frame sub-blocks is converted into data composed of the reconstructed residuals and the scene features, and the redundancy of the redundant data is reduced.
- The video encoding method of this embodiment may refer to FIG. 3b, except that the method further includes the representation coefficients C. For example, after scene reconstruction is performed on the key frames of scene 1, the representation coefficients C1, C3, and C5 of the key frames I1, I3, and I5, respectively, are obtained.
- Steps 1205 and 1206 described above are one specific implementation of the step of reconstructing a plurality of video frames to obtain scene information and the reconstruction residual of each video frame.
- the encoding apparatus reconstructs a plurality of frame sub-blocks to obtain a representation coefficient of each frame sub-block of the plurality of frame sub-blocks and a reconstruction residual of each frame sub-block.
- The representation coefficient represents a correspondence between a frame sub-block and a target frame sub-block, where the target frame sub-block is an independent frame sub-block among the plurality of frame sub-blocks, that is, a frame sub-block that cannot be reconstructed from the other frame sub-blocks; the reconstruction residual represents the difference between the target frame sub-block and the frame sub-block.
- The encoding device combines the plurality of target frame sub-blocks indicated by the representation coefficients to obtain the scene feature; each target frame sub-block is a scene feature base.
- In other words, the frame sub-blocks that are independently represented are determined by the reconstruction operation, and these independently represented frame sub-blocks are referred to as target frame sub-blocks.
- The obtained multiple frame sub-blocks include target frame sub-blocks and non-target frame sub-blocks: a target frame sub-block cannot be reconstructed from other target frame sub-blocks, while a non-target frame sub-block can be obtained from the target frame sub-blocks.
- Composing the scene feature from target frame sub-blocks reduces the redundancy of the redundant data, and because each scene feature base is an original frame sub-block, the scene feature bases constituting the scene feature can be determined according to the indication of the representation coefficients.
- For example, one of two frame sub-blocks includes a window pattern 1301, and adding the gate image 1303 to this sub-block yields the other frame sub-block; the former is therefore the target frame sub-block 1302 and the latter is the non-target frame sub-block 1304.
- Reconstructing the target frame sub-block with the reconstruction residual of the gate pattern yields the non-target frame sub-block, so in the scene consisting of these two frame sub-blocks, the window pattern shared by the two sub-blocks is the redundant data.
- After reconstruction, the target frame sub-block and the reconstruction residual of the gate are obtained, together with two representation coefficients: one indicates the target frame sub-block itself, and the other indicates the correspondence between the target frame sub-block and the reconstruction residual of the gate.
- the target frame sub-block is a scene feature base.
- During decoding, one frame sub-block, namely the target frame sub-block, is obtained according to the representation coefficient indicating the target frame sub-block itself; the other frame sub-block is obtained by reconstructing the target frame sub-block with the reconstruction residual of the gate, according to the representation coefficient indicating the correspondence between the target frame sub-block and that residual.
- Reconstructing the plurality of frame sub-blocks to obtain the representation coefficient of each of the plurality of frame sub-blocks and the reconstruction residual of each frame sub-block includes: reconstructing the observation matrix according to a second constraint condition to obtain a representation coefficient matrix and a reconstruction residual matrix.
- The representation coefficient matrix is a matrix containing the representation coefficient of each of the plurality of frame sub-blocks; a non-zero coefficient among the representation coefficients indicates a target frame sub-block.
- The reconstruction residual matrix is used to represent the reconstruction residual of each frame sub-block in matrix form.
- The second constraint condition is used to enforce the low rank and sparsity of the representation coefficients.
- The target frame sub-blocks indicated by the non-zero coefficients of the representation coefficient matrix are combined to obtain the scene feature.
- Reconstructing the observation matrix according to the second constraint condition to obtain the representation coefficient matrix and the reconstruction residual matrix includes: calculating the representation coefficient matrix and the reconstruction residual matrix according to a second preset formula, which is detailed below.
- a scene S contains N key frames, that is, the same cluster includes N key frames, and N is a natural number.
- Each frame sub-block is pulled into a column vector to form an observation matrix D, i.e., D = [d1, d2, …], with one column per sub-block. Since there is a large amount of redundant information content between key frames, the matrix can be regarded as a union of several subspaces.
- the goal of scene reconstruction is to find these independent subspaces and solve the representation coefficients of the observation matrix D in these independent subspaces.
- Space refers to a collection with some specific properties.
- the observation matrix D contains a plurality of image feature vectors, and the representation space formed by these vectors is a full space.
- A subspace is a partial space whose representation dimension is smaller than that of the full space; here, a subspace is the space formed by the independent frame sub-blocks.
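- A sketch of forming the observation matrix D described above, with each frame sub-block pulled into a column vector (numpy is assumed):

```python
import numpy as np

def build_observation_matrix(subblocks):
    """Vectorize each frame sub-block and stack the vectors as the
    columns of the observation matrix D."""
    return np.column_stack([b.reshape(-1) for b in subblocks])
```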
- The scene reconstruction problem can therefore be transformed into an optimization problem, for example of the following low-rank representation form, which is consistent with the sparsity and low-rank constraints described below:
- min_{C,E} ||C||_* + λ||C||_1 + γ||E||_{2,1}  s.t.  D = DC + E
- where C is the representation coefficient matrix, E is the reconstruction residual matrix, ||·||_* is the nuclear norm (promoting low rank), ||·||_1 promotes sparsity, and ||·||_{2,1} sums the l2 norms of the residual columns.
- the scene features corresponding to each subspace can be obtained.
- The number of non-zero coefficients in C corresponds one-to-one with the number of scene feature bases.
- In this embodiment, the representation coefficient refers to the coefficient matrix (or vector) by which the scene feature bases in the scene feature represent each key frame during reconstruction, that is, the correspondence between the frame sub-blocks and the scene feature bases.
- the representation coefficient between different independent frame sub-blocks is usually 0.
- For example, a grassland image does not contain lake scene features, so the coefficient by which a lake scene feature base represents a grassland image block is usually zero.
- Each frame sub-block in the observation matrix D can be represented by the other frame sub-blocks in the matrix D, while an independent frame sub-block is represented by itself.
- Each column in the representation coefficient matrix C is a representation coefficient of a frame sub-block
- λ and γ are weight parameters that adjust the sparsity and low rank of the coefficients.
- the above optimization problem can be solved by matrix optimization algorithms such as APG and IALM.
- The final scene feature consists of the feature bases corresponding to the non-zero coefficients of C.
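- A toy illustration of solving an optimization of the form given above, using the general-purpose solver cvxpy for clarity instead of APG or IALM (which would be used at realistic scale); the exact norms follow the reconstructed formula above and are an assumption:

```python
import cvxpy as cp
import numpy as np

def low_rank_reconstruction(D, lam=0.1, gamma=1.0):
    """Solve min ||C||_* + lam*||C||_1 + gamma*||E||_{2,1}
    subject to D = D @ C + E, for a small observation matrix D.

    Returns the representation coefficient matrix C and the
    reconstruction residual matrix E; the scene feature bases are the
    columns of D indicated by the non-zero entries of C."""
    n = D.shape[1]
    C = cp.Variable((n, n))
    E = cp.Variable(D.shape)
    objective = cp.Minimize(cp.normNuc(C) + lam * cp.norm1(C)
                            + gamma * cp.sum(cp.norm(E, 2, axis=0)))
    cp.Problem(objective, [D == D @ C + E]).solve()
    return C.value, E.value
```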
- The representation coefficients need to be sparsely constrained; that is, the representation coefficients of frame sub-blocks belonging to the same type of scene (for example, both grassland) are strongly correlated, and the coefficients are mostly 0 with only a small portion non-zero.
- The image sub-blocks corresponding to the representation coefficients are the scene features that ultimately need to be encoded.
- For example, if the coefficient c1_2 is not 0, the scene feature base is d2; that is, the frame sub-block d1 can be represented based on the frame sub-block d2, the frame sub-block d2 is an independent frame sub-block, and the frame sub-block d1 is reconstructed from the frame sub-block d2 together with the corresponding reconstruction residual.
- the embodiment of the present invention converts the information amount of the I frame into the scene feature base information and the residual matrix information.
- The redundancy of the I-frame information is concentrated in the scene feature bases and the residual matrix information.
- Because multiple I frames share the same scene feature bases, each scene feature base needs to be encoded only once, which greatly reduces the amount of encoded data.
- each sub-block is first reconstructed according to the decoded scene features, representation coefficients and reconstruction errors, and then the sub-blocks are combined by number to obtain the final key frame content.
- Figure 14 shows an example of scene reconstruction based on local information representation.
- Alternatively, the frame sub-blocks may be arranged in a preset order without using sub-block numbers, and the reconstruction process then recombines the frame sub-blocks into a video frame according to that preset rule.
- This implementation can mine the texture structure existing in the key frame. If there are a large number of texture features in the scene, the representation coefficient C obtained by the above formula will be low rank and sparse.
- the feature base corresponding to the sparse coefficient is the basic unit of the scene texture structure.
- Figure 15 shows an example diagram of local feature reconstruction under a texture scene.
- the scene content is represented and reconstructed according to the underlying data features of the image.
- the following implementations will use higher-level semantic features to describe and reconstruct the content of the scene to achieve data compression.
- Specific models include Sparse Coding (SC), Deep Neural Network (DNN), Convolutional Neural Network (CNN), Stacked Auto Encoder (SAE), and so on.
- In another implementation, the encoding device reconstructs the plurality of frame sub-blocks to obtain a scene feature and the representation coefficient of each of the plurality of frame sub-blocks.
- The scene feature includes scene feature bases that are independent feature blocks in the feature space; an independent feature block is a feature block that cannot be reconstructed from the other feature blocks in the scene feature.
- The encoding device then calculates the reconstruction residual of each frame sub-block according to the data of each frame sub-block and the data reconstructed from the scene feature and the representation coefficient of that frame sub-block.
- the scene feature base is an independent feature block in the feature space.
- The feature space may be an RGB color space, an HSI color space, a YUV color space, and so on. Different frame sub-blocks may not appear to share the same picture, but after high-level mapping they yield the same feature blocks. These identical feature blocks constitute redundant data, and the scene feature records each of them only once, thereby reducing the redundancy between the frame sub-blocks.
- Such a scene feature is similar to a dictionary composed of feature blocks: reconstructing a frame sub-block amounts to selecting the needed feature blocks from the dictionary and adding the corresponding reconstruction residual.
- One frame sub-block can correspond to multiple feature blocks; the multiple feature blocks are superimposed and combined with the reconstruction residual to obtain the frame sub-block.
- Reconstructing the plurality of frame sub-blocks to obtain the scene feature and the representation coefficient of each of the plurality of frame sub-blocks includes: reconstructing the observation matrix according to a third constraint condition to obtain a representation coefficient matrix and a scene feature matrix.
- The representation coefficient matrix is a matrix containing the representation coefficient of each frame sub-block; a non-zero coefficient among the representation coefficients indicates a scene feature base.
- The scene feature matrix is used to represent the scene feature in matrix form.
- The third constraint condition requires that the similarity between the pictures reconstructed from the representation coefficient matrix and the scene feature matrix and the original frame sub-blocks conform to a preset similarity threshold, that the sparsity of the representation coefficient matrix conform to a preset sparsity threshold, and that the data amount of the scene feature matrix be smaller than a preset data-amount threshold.
- Calculating the reconstruction residual of each frame sub-block according to the data of each frame sub-block and the data reconstructed from the scene feature and the representation coefficient of each frame sub-block includes: calculating a reconstruction residual matrix from the observation matrix and the data reconstructed from the representation coefficient matrix and the scene feature matrix, where the reconstruction residual matrix is used to represent the reconstruction residuals in matrix form.
- Reconstructing the observation matrix according to the third constraint condition to obtain the representation coefficient matrix and the scene feature matrix includes: calculating the representation coefficient matrix and the scene feature matrix according to a third preset formula, for example of the following sparse coding form, which is consistent with the three terms (reconstruction error, sparsity of C, and limitation on F) described below:
- min_{F,C} ||D - FC||_F^2 + λ||C||_1 + γ||F||_F^2
- where D is the observation matrix, C is the representation coefficient matrix, F is the scene feature matrix, and λ and γ are weight parameters used to adjust the coefficient sparsity and the size of the scene feature.
- For description, a sparse coding model is used here for modeling and analysis.
- Suppose a scene S contains N key frames, and each key frame is evenly split into M equal-sized frame sub-blocks. Each frame sub-block is pulled into a column vector to form the observation matrix D, i.e., one column per frame sub-block.
- In the formula, D is the observation matrix and λ and γ are the weight parameters.
- the matrix optimization parameters are scene features F and representation coefficients C.
- The first term in the objective function constrains the reconstruction error, so that the picture reconstructed from the scene feature and the representation coefficients is as similar as possible to the original picture.
- The second term is the sparsity constraint on the coefficients C, meaning that each picture can be reconstructed from a small number of feature bases.
- The last term constrains the scene feature F to prevent its data amount from becoming too large. That is, the first term of the formula is the error term, and the last two terms are regularization terms that constrain the representation coefficients and the scene feature.
- The specific optimization algorithm can be the conjugate gradient method, OMP (Orthogonal Matching Pursuit), LASSO, or the like.
- the scene features obtained by the final solution are shown in Fig. 16.
- The dimension of each scene feature base in the F matrix is consistent with the dimension of a frame sub-block, and the number of bases can be set in advance.
- each small frame of FIG. 16 is a scene feature base
- the scene feature matrix F is a matrix composed of small frames (scene feature bases)
- FC = F[c1, c2, c3, …], where F·c1 denotes the linear combination of the scene feature bases weighted by the representation coefficient c1, giving the linear representation of a frame sub-block in the feature space; adding the reconstruction residual e1 restores the original frame sub-block image I1.
- the scene feature base is directly determined by the observation sample D. That is, the scene feature base is selected from the observation sample D.
- the scene features in this example are learned according to the algorithm.
- In the optimization process of the parameter F, an iterative solution is performed according to the objective function, and the optimization result minimizes the reconstruction error.
- The amount of coded information is concentrated in F and E.
- The dimension of F is consistent with the dimension of a frame sub-block, and the number of bases in F can be set in advance: the fewer bases, the less coding information but the larger the reconstruction residual E; the more bases, the more coding information but the smaller E. The number of bases is therefore traded off via the weight parameters.
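- A sketch of learning F and C under this sparse coding model, using scikit-learn's DictionaryLearning as a stand-in solver (an assumption; the text names conjugate gradient, OMP, and LASSO as candidate algorithms):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_scene_feature(D, n_bases=64, lam=1.0):
    """Fit D ~= F @ C with sparse C. D holds one vectorized frame
    sub-block per column; F holds one scene feature base per column.
    scikit-learn works row-wise, hence the transposes."""
    model = DictionaryLearning(n_components=n_bases, alpha=lam,
                               transform_algorithm="lasso_lars")
    C = model.fit_transform(D.T).T      # (n_bases, n_subblocks)
    F = model.components_.T             # (pixels, n_bases)
    E = D - F @ C                       # reconstruction residual matrix
    return F, C, E
```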
- Step 1207 Perform predictive coding on the feature of the scene to obtain scene feature prediction encoded data.
- Step 1208 Perform predictive coding on the reconstructed residual to obtain residual prediction encoded data.
- The predictive coding portion of the encoding device includes two parts: intra prediction coding and inter prediction coding.
- the scene features and reconstruction errors are encoded by intra prediction, and the remaining frames of the shot, that is, the non-key frames of the shot, are inter-predictive encoded.
- the specific process of intra prediction coding is similar to the HEVC intra coding module. Since the scene feature matrix has a low rank, only the key columns of the scene feature matrix need to be encoded.
- the reconstruction error belongs to residual coding, and the amount of coded data is small and the compression ratio is high.
- Step 1209 Perform reconstruction according to the scene feature, the representation coefficient, and the reconstruction residual to obtain a reference frame.
- For the specific implementation of step 1209, reference may be made to step 508.
- Step 1210 Perform reference frame prediction on the B frame and the P frame by using the reference frame as a reference, and obtain B frame predictive coded data and P frame predictive coded data.
- For the specific implementation of step 1210, reference may be made to step 509.
- Step 1211 Perform transform coding, quantization coding, and entropy coding on the predictive coded data to obtain video compressed data.
- the predictive coded data includes scene feature predictive coded data, residual predictive coded data, B frame predictive coded data, and P frame predictive coded data.
- For the specific implementation of step 1211, reference may be made to step 510.
- the embodiment shown in FIG. 12 is described based on the HEVC scenario, but the video encoding method shown in FIG. 12 can also be applied to other scenarios.
- In summary, the encoding device acquires a plurality of video frames, each of which includes redundant data in the picture content; in particular, the video frames include redundant data at local locations relative to one another.
- The encoding device splits each of the plurality of video frames to obtain multiple frame sub-blocks, and then reconstructs the multiple frame sub-blocks to obtain the scene feature, the representation coefficient of each frame sub-block, and the reconstruction residual of each frame sub-block.
- the scene feature includes multiple independent scene feature bases, and the independent scene feature bases in the scene feature cannot be reconstructed from each other.
- The scene feature base is used to describe the picture content features of a frame sub-block; the representation coefficient represents the correspondence between the scene feature base and the frame sub-block; and the reconstruction residual represents the difference between the frame sub-block and the scene feature base.
- The scene feature is predictively encoded to obtain scene feature prediction encoded data, and the reconstruction residual is predictively encoded to obtain residual prediction encoded data.
- In this way, the redundancy of the redundant data at local locations is reduced. Therefore, in the encoding operation, the total compressed data amount of the obtained scene feature and reconstruction residuals is reduced relative to the compressed data amount of the original video frames, reducing the amount of data obtained after compression.
- Each video frame is reconstructed into a scene feature and a reconstruction residual. Because the reconstruction residual contains only the residual information beyond the scene information, its information amount is small and sparse, so predictive coding can use fewer codewords; the amount of encoded data is small and the compression ratio is high.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- FIG. 17 shows a video decoding method.
- the video decoding method in the embodiment of the present invention includes:
- Step 1701 Acquire scene feature prediction encoded data, residual prediction encoded data, and representation coefficients.
- The decoding device acquires video compressed data, which may be video compressed data obtained by the video encoding method of the embodiment shown in FIG. 12.
- Acquiring the scene feature prediction encoded data and the residual prediction encoded data includes: acquiring the video compressed data, and then performing entropy decoding, inverse quantization, and an inverse DCT transform on the video compressed data to obtain the prediction encoded data.
- the prediction encoded data includes scene feature prediction encoded data, residual prediction encoded data, B frame predictive encoded data, and P frame predictive encoded data;
- Step 1702 Decode the scene feature prediction encoded data to obtain a scene feature.
- the scene feature includes multiple independent scene feature bases, and the independent scene feature bases in the scene feature cannot be reconstructed from each other.
- The scene feature base is used to describe the picture content features of a frame sub-block; the representation coefficient represents the correspondence between the scene feature base and the frame sub-block; and the reconstruction residual represents the difference between the frame sub-block and the scene feature base.
- Step 1703 Decode the residual prediction encoded data to obtain a reconstructed residual.
- the reconstructed residual is used to represent the difference between the video frame and the scene information.
- Step 1704 Perform reconstruction according to the scene feature, the representation coefficient, and the reconstruction residual to obtain a plurality of frame sub-blocks.
- That is, the video decoding method of this embodiment reconstructs the multiple frame sub-blocks according to the scene feature, the representation coefficients, and the reconstruction residuals.
- The method of this embodiment may refer to FIG. 4b, except that after the scene feature is decoded, the representation coefficients are used to determine the required scene feature bases in the scene feature; for example, F1·[C1, C3, C5]^T is computed, and the reconstruction residuals E1, E3, and E5 are then added to obtain the key frames I1, I3, and I5, respectively.
- C1, C3, and C5 are the representation coefficients of the key frames I1, I3, and I5, respectively.
- Step 1705 Combining a plurality of frame sub-blocks to obtain a plurality of video frames.
- Step 1704 and step 1705 are specific implementations of the steps of reconstructing the video information according to the scene information and the reconstructed residual.
- That is, the multiple frame sub-blocks are combined to obtain multiple video frames; in this embodiment, combining the frame sub-blocks yields multiple I frames.
- each sub-block is first reconstructed according to the decoded scene features, representation coefficients, and reconstruction errors, and then the sub-blocks are combined by number to obtain the final key frame content.
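- A decoder-side sketch of steps 1704 and 1705, assuming the sparse-coding representation above (F, per-block coefficients, per-block residuals) and raster-numbered sub-blocks:

```python
import numpy as np

def rebuild_key_frame(F, C, E, frame_shape, block_shape):
    """Reconstruct each sub-block as F @ c + e, then tile the
    sub-blocks back into the key frame by their raster number."""
    (H, W), (bh, bw) = frame_shape, block_shape
    frame = np.empty(frame_shape)
    idx = 0
    for y in range(0, H, bh):
        for x in range(0, W, bw):
            block = F @ C[:, idx] + E[:, idx]
            frame[y:y + bh, x:x + bw] = block.reshape(bh, bw)
            idx += 1
    return frame
```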
- the method of the embodiment of the present invention further includes: inter-frame decoding the B frame predictive encoded data and the P frame predictive encoded data by using the I frame as a reference frame to obtain a B frame and a P frame. Then, the decoding device arranges the I frame, the B frame, and the P frame in chronological order to obtain a video stream.
- The resulting video compressed data can be decoded by the video decoding method of the embodiment shown in FIG. 17.
- The foregoing embodiments extract the video frames on which the reconstruction operation is performed from an acquired video stream, or obtain the video frames directly.
- The video frames may also be obtained by acquiring compressed video frames and then decompressing them.
- step 201 can be implemented by the following steps:
- Step F1 Acquire a compressed video stream.
- the compressed video stream includes a compressed video frame.
- the compressed video stream can be, for example, a HEVC compressed video stream.
- Step F2 Determine a plurality of target video frames from the compressed video stream.
- the target video frame is an independently compression-encoded video frame in the compressed video stream.
- Step F3 Decoding the target video frame to obtain a decoded target video frame.
- the decoded target video frame is used to perform step 202.
- the video frames may be classified. For details, refer to step 504.
- the compression efficiency of these video frames can be improved, and the compressed data amount of these video frames can be reduced.
- the embodiment of the present invention may perform secondary compression on the HEVC compressed video stream. Specifically, after compressed video discrimination, I frame extraction, and intra-frame decoding, an I frame to be used to perform the method of the embodiment of the present invention is obtained.
- The method of this embodiment may be implemented by adding compressed-video discrimination, I-frame extraction, and intra-frame decoding modules to the original video encoding device.
- In the I-frame extraction operation, since HEVC compressed video adopts a hierarchical code stream structure, independent GOP data is extracted according to the group-of-pictures header in the GOP layer of the code stream hierarchy. Each frame of the GOP is then located according to the picture header; the first frame of a GOP is an I frame, so the I frame can be extracted.
- The device performs intra-frame decoding on the extracted I-frame encoded data to obtain the decoded I frames.
- The subsequent reconstruction, residual encoding, and decoding steps can refer to the encoding and decoding operations above. In this way, secondary encoding and decoding of compressed video can be performed on the basis of the original video encoded data.
- The method of the present invention can thus perform secondary encoding and decoding on existing compressed video data, and it is consistent with the traditional HEVC method in transform coding, quantization coding, entropy coding, and so on; therefore, the functional modules of the present invention can be deployed compatibly with legacy video compression devices.
- The method of this embodiment can also be applied to other encoded data: the compressed video frames are extracted and decoded according to the above steps, and the video encoding methods of FIG. 2, FIG. 5, and FIG. 12 described above are then performed.
- The I frame can also be determined according to the size of the compressed image data: the I-frame encoded data is usually much larger than the P-frame and B-frame encoded data.
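- One way to sketch steps F1 to F3 is with the PyAV library (an assumption; any HEVC-capable demuxer and decoder would do), using the decoder's picture-type flag rather than the packet-size heuristic mentioned above:

```python
import av  # PyAV

def extract_decoded_i_frames(path):
    """Demux a compressed stream, keep only independently coded
    (I/key) frames, and return them decoded as numpy arrays."""
    i_frames = []
    with av.open(path) as container:
        stream = container.streams.video[0]
        for frame in container.decode(stream):
            if frame.pict_type.name == "I":
                i_frames.append(frame.to_ndarray(format="gray"))
    return i_frames
```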
- FIG. 18 is a schematic structural diagram of a video encoding apparatus according to an embodiment of the present invention.
- FIG. 18b is a schematic diagram showing a partial structure of a video encoding apparatus according to the embodiment shown in FIG. 18a.
- the video encoding apparatus can be used to perform the video encoding method in the foregoing embodiments.
- The video encoding apparatus includes: an obtaining module 1801, a reconstruction module 1802, and a prediction encoding module 1803.
- the obtaining module 1801 is configured to perform a process of acquiring a video frame in an embodiment of each of the foregoing video encoding methods.
- the reconstruction module 1802 is configured to perform a process related to the reconfiguration operation to reduce the redundancy of the redundant data in the embodiments of the foregoing video coding methods, for example, step 202, step 505, and step 1206.
- the prediction encoding module 1803 is configured to perform steps of predictive encoding, such as step 203 and step 204, in an embodiment of each of the above video encoding methods.
- the reconstruction module 1802 obtains the scene information and the reconstruction residual after performing the reconstruction operation on the plurality of video frames acquired by the obtaining module 1801, so that the prediction encoding module 1803 predictively encodes the scene information and the reconstructed residual.
- The video encoding device further includes a feature extraction module 1804 and a metric information calculation module 1805 between the obtaining module 1801 and the reconstruction module 1802.
- The feature extraction module 1804 is configured to perform the process of extracting picture feature information of a video frame in the embodiments of the above video encoding methods, for example, steps D1 and E1.
- the metric information calculation module 1805 is configured to perform a process of calculating the content metric information in the embodiment of each of the above video coding methods, for example, steps D2 and E2.
- the video encoding device further includes:
- a reference frame reconstruction module 1806, configured to perform a process of reconstructing a reference frame in an embodiment of each of the foregoing video coding methods
- the inter prediction encoding module 1807 is configured to perform a process related to inter prediction encoding in the embodiments of the foregoing video encoding methods.
- the encoding module 1808 is configured to perform a process of transform coding, quantization coding, and entropy coding in the embodiments of the foregoing video coding methods.
- the reconstruction module 1802 further includes a splitting unit 1809 and a reconstruction unit 1810.
- The reconstruction unit 1810 may reconstruct the frame sub-blocks obtained by the splitting unit 1809.
- The splitting unit 1809 is configured to perform the process of splitting a video frame in the embodiments of the above video encoding methods, for example, step 1205.
- the reconstruction unit 1810 is configured to perform a process of reconstructing a frame sub-block in an embodiment of each of the foregoing video coding methods, for example, step 1206;
- the reconstruction unit 1810 includes a reconstruction subunit 1811 and a combination subunit 1812.
- the reconstruction sub-unit 1811 is configured to perform a process of reconstructing a frame sub-block to obtain a representation coefficient and a reconstruction residual in an embodiment of each of the above video coding methods.
- the combining sub-unit 1812 is configured to perform a process of combining the target frame sub-blocks in the embodiment of each of the video encoding methods described above.
- the reconstruction unit 1810 may further include a sub-block reconstruction sub-unit 1813 and a sub-block calculation sub-unit 1814.
- the sub-block reconstruction sub-unit 1813 is configured to perform a process of reconstructing a frame sub-block to obtain a scene feature and a representation coefficient in an embodiment of each of the foregoing video coding methods, where the scene feature includes a scene feature base that is independent in the feature space. Feature block.
- the sub-block calculation sub-unit 1814 is for performing a computational reconstruction residual processing procedure in an embodiment for performing the above-described respective video coding methods.
- the video encoding device further includes a classification module 1815 for performing a process involving classification in an embodiment of each of the video encoding methods described above.
- the classification module 1815 includes a feature extraction unit 1816, a distance calculation unit 1817, and a clustering unit 1818.
- the feature extraction unit 1816 is configured to extract feature information of each of the plurality of video frames, and the distance calculation unit 1817 is configured to perform a process of processing the cluster distance in the embodiment of each of the video coding methods.
- The clustering unit 1818 is configured to perform the processes involving clustering in the embodiments of the above video encoding methods.
- the obtaining module 1801 includes the following units:
- a video stream obtaining unit 1819 configured to acquire a video stream
- a frame feature extraction unit 1820 configured to perform a process of extracting feature information of the first video frame and the second video frame in an embodiment of each of the foregoing video encoding methods
- a lens distance calculation unit 1821 configured to perform a process related to lens distance calculation in an embodiment of each of the above video coding methods
- the lens distance determining unit 1822 is configured to determine whether the lens distance is greater than a preset lens threshold
- a lens dividing unit 1823 configured to perform a process of dividing a target lens in an embodiment of each of the above video encoding methods
- the key frame extracting unit 1824 is configured to perform a process of extracting a key frame according to a frame distance in an embodiment of each of the above video encoding methods.
- the video encoding device further includes:
- the training module 1825 is configured to perform discriminant training according to each shot segmented from the video stream, to obtain a plurality of classifiers corresponding to the shots;
- a discriminating module 1826 configured to determine a target video frame by using a target classifier to obtain a discriminant score
- the scene determining module 1827 is configured to: when the discriminant score is greater than the preset score threshold, determine that the target video frame belongs to the same scene as the shot to which the target classifier belongs;
- the cluster determination module 1828 is configured to determine video frames of one or more clusters according to video frames belonging to the same scene as the shot.
- the obtaining module 1801 includes:
- a compressed video obtaining unit 1829 configured to acquire a compressed video stream, where the compressed video stream includes a compressed video frame
- a frame determining unit 1830 configured to determine, from the compressed video stream, a target video frame, where the target video frame is an independently compressed encoded video frame;
- The decoding unit 1831 is configured to decode the target video frames to obtain decoded target video frames, where the decoded target video frames are used to perform the step of splitting each of the plurality of video frames to obtain multiple frame sub-blocks.
- the obtaining module 1801 acquires a plurality of video frames, and each of the plurality of video frames includes redundant data on the screen content. Then, the reconstruction module 1802 reconstructs the plurality of video frames to obtain scene information and a reconstruction residual of each video frame, where the scene information includes data obtained by reducing redundancy of redundant data, and reconstructing the residual The difference is used to represent the difference between the video frame and the scene information.
- the prediction encoding module 1803 performs predictive coding on the scene information to obtain scene feature prediction encoded data. The prediction encoding module 1803 performs predictive coding on the reconstructed residual to obtain residual prediction encoded data.
- In this way, the redundancy of the video frames is reduced, so that in the encoding operation the total compressed data amount of the obtained scene features and reconstruction residuals is reduced relative to the compressed data amount of the original video frames, reducing the amount of data obtained after compression.
- Each video frame is reconstructed into a scene feature and a reconstruction residual. Because the reconstruction residual contains only the residual information beyond the scene information, its information amount is small and sparse, so predictive coding can use fewer codewords; the amount of encoded data is small and the compression ratio is high.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- FIG. 19 is a schematic structural diagram of a video decoding device according to an embodiment of the present invention.
- the video decoding device can be used to perform the video decoding method in the foregoing embodiments.
- The video decoding device includes: an obtaining module 1901, a scene information decoding module 1902, a reconstructed residual decoding module 1903, and a video frame reconstruction module 1904.
- the scene information decoding module 1902 and the reconstructed residual decoding module 1903 respectively perform the decoding operation on the scene feature prediction encoded data and the residual prediction encoded data acquired by the obtaining module 1901, so that the video frame reconstruction module 1904 can reconstruct the data obtained by using the decoding. Get the video frame.
- the obtaining module 1901 is configured to perform a process of acquiring encoded data in an embodiment of each of the foregoing video decoding methods, for example, step 205;
- the scene information decoding module 1902 is configured to perform a process related to decoding scene information in the embodiments of the foregoing video decoding methods, for example, step 206, step 603;
- the reconstructed residual decoding module 1903 is configured to perform a process of decoding the reconstructed residual in the embodiment of each of the foregoing video decoding methods, for example, step 207;
- the video frame reconstruction module 1904 is configured to perform a process of reconstructing a plurality of video frames in an embodiment of each of the video decoding methods, for example, step 208 and step 604.
- the obtaining module 1901 includes an obtaining unit 1905 and a decoding unit 1906.
- the obtaining unit 1905 is configured to perform a process of acquiring video compression data in an embodiment of each of the foregoing video decoding methods, for example, step 601.
- the decoding unit 1906 is configured to perform a process related to obtaining the predicted encoded data in the embodiment of each of the video decoding methods described above, for example, step 602.
- the video decoding apparatus further includes: an inter-frame decoding module 1907, configured to perform a process related to inter-frame decoding in an embodiment of each of the above video decoding methods, for example, step 606;
- the arranging module 1908 is configured to perform a process involving frame alignment in the embodiment of each of the video decoding methods described above, for example, step 607.
- the obtaining module 1901 is further configured to acquire a representation coefficient.
- the video frame reconstruction module 1904 includes a reconstruction unit 1909 and a combination unit 1910.
- the reconstruction unit 1909 is configured to perform a process of reconstructing a plurality of frame sub-blocks in an embodiment of each of the video decoding methods, for example, step 1704.
- the combining unit 1910 is configured to perform a process of combining frame sub-blocks in an embodiment of each of the above video decoding methods, for example, step 1705.
- The scene information decoding module 1902 decodes the scene feature prediction encoded data to obtain scene information, where the scene information includes data obtained by reducing the redundancy of redundant data, and the redundant data is redundant data in the picture content among the plurality of video frames.
- the reconstructed residual decoding module 1903 decodes the residual prediction encoded data to obtain a reconstructed residual, and the reconstructed residual is used to represent the difference between the video frame and the scene information.
- a video frame reconstruction module 1904 configured to perform reconstruction according to the scene information and the reconstructed residual to obtain a plurality of video frames.
- FIG. 20 is a schematic structural diagram of a video codec device according to an embodiment of the present invention.
- the video encoding and decoding device can be used to perform the video encoding method and the video decoding method in the foregoing embodiments.
- the video encoding and decoding device 2000 includes a video encoding device 2001 and a video decoding device 2002.
- the video encoding device 2001 is the video encoding device of the embodiment shown in FIG. 18a and FIG. 18b above;
- the video decoding device 2002 is the video decoding device of the embodiment shown in Fig. 19 described above.
- the video encoding method and the video decoding method provided by the embodiments of the present invention are described below in the hardware architecture.
- a video encoding and decoding system is provided.
- The video frame encoding and decoding system includes a video encoder and a video decoder.
- video codec system 10 includes source device 12 and destination device 14.
- Source device 12 produces encoded video data.
- Source device 12 may be referred to as a video encoding device or a video encoding apparatus.
- Destination device 14 may decode the encoded video data produced by source device 12.
- Destination device 14 may be referred to as a video decoding device or a video decoding apparatus.
- Source device 12 and destination device 14 may be examples of video codec devices or video codec devices.
- Source device 12 and destination device 14 may include a wide range of devices, including desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, handsets such as smart phones, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, or the like.
- Channel 16 may include one or more media and/or devices capable of moving encoded video data from source device 12 to destination device 14.
- channel 16 may include one or more communication media that enable source device 12 to transmit encoded video data directly to destination device 14 in real time.
- source device 12 may modulate the encoded video data in accordance with a communication standard (eg, a wireless communication protocol) and may transmit the modulated video data to destination device 14.
- the one or more communication media may include wireless and/or wired communication media, such as a radio frequency (RF) spectrum or one or more physical transmission lines.
- RF radio frequency
- The one or more communication media may form part of a packet-based network (e.g., a local area network, a wide area network, or a global network such as the Internet).
- The one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from source device 12 to destination device 14.
- channel 16 can include a storage medium that stores encoded video data generated by source device 12.
- destination device 14 can access the storage medium via disk access or card access.
- the storage medium may include a variety of locally accessible data storage media, such as Blu-ray Disc, DVD, CD-ROM, flash memory, or other suitable digital storage medium for storing encoded video data.
- channel 16 can include a file server or another intermediate storage device that stores encoded video data generated by source device 12.
- destination device 14 may access the encoded video data stored at a file server or other intermediate storage device via streaming or download.
- The file server may be of a server type capable of storing encoded video data and transmitting the encoded video data to destination device 14.
- the instance file server includes a web server (eg, for a website), a file transfer protocol (FTP) server, a network attached storage (NAS) device, and a local disk drive.
- FTP file transfer protocol
- NAS network attached storage
- Destination device 14 can access the encoded video data via a standard data connection (e.g., an internet connection).
- Example types of data connections include wireless channels (e.g., a Wi-Fi connection), wired connections (e.g., DSL, cable modem, etc.), or a combination of both that is suitable for accessing encoded video data stored on a file server.
- the transmission of the encoded video data from the file server may be streaming, downloading, or a combination of both.
- the technology of the present invention is not limited to a wireless application scenario.
- The technology can be applied to video codecs supporting a variety of multimedia applications, such as over-the-air television broadcasting, cable television transmission, satellite television transmission, streaming video transmission (e.g., via the Internet), encoding of video data stored on a data storage medium, decoding of video data stored on a data storage medium, or other applications.
- video codec system 10 may be configured to support one-way or two-way video transmission to support applications such as video streaming, video playback, video broadcasting, and/or video telephony.
- source device 12 includes video source 18, video encoder 20, and output interface 22.
- output interface 22 can include a modulator/demodulator (modem) and/or a transmitter.
- Video source 18 may include a video capture device (eg, a video camera), a video archive containing previously captured video data, a video input interface to receive video data from a video content provider, and/or a computer for generating video data.
- Video encoder 20 may encode video data from video source 18.
- source device 12 transmits the encoded video data directly to destination device 14 via output interface 22.
- the encoded video data may also be stored on a storage medium or file server for later access by the destination device 14 for decoding and/or playback.
- destination device 14 includes an input interface 28, a video decoder 30, and a display device 32.
- input interface 28 includes a receiver and/or a modem.
- Input interface 28 can receive the encoded video data via channel 16.
- Display device 32 may be integral with destination device 14 or may be external to destination device 14. In general, display device 32 displays the decoded video data.
- Display device 32 may include a variety of display devices such as liquid crystal displays (LCDs), plasma displays, organic light emitting diode (OLED) displays, or other types of display devices.
- LCDs liquid crystal displays
- OLED organic light emitting diode
- Video encoder 20 and video decoder 30 may operate in accordance with a video compression standard (eg, the High Efficiency Video Codec H.265 standard) and may conform to the HEVC Test Model (HM).
- a video compression standard eg, the High Efficiency Video Codec H.265 standard
- HM HEVC Test Model
- A textual description of the H.265 standard, ITU-T H.265 (V3) (04/2015), was published on April 29, 2015 and is available for download from http://handle.itu.int/11.1002/1000/12455; the entire contents of that document are incorporated herein by reference.
- Video encoder 20 and video decoder 30 may also operate in accordance with other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its scalable video codec (SVC) and multiview video codec (MVC) extensions.
- SVC scalable video codec
- MVC multiview video codec
- FIG. 21 is merely an example and the techniques of the present invention are applicable to video codec applications (eg, single-sided video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device.
- In other examples, data is retrieved from local memory, streamed over a network, or manipulated in a similar manner.
- the encoding device may encode the data and store the data to a memory, and/or the decoding device may retrieve the data from the memory and decode the data.
- Encoding and decoding may be performed by devices that do not communicate with each other but simply encode data to memory and/or retrieve data from memory and decode it.
- Video encoder 20 and video decoder 30 may each be implemented as any of a variety of suitable circuits, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof. If the technology is implemented partially or wholly in software, the device may store the instructions of the software in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors to perform the techniques of the present invention. Any of the foregoing (including hardware, software, a combination of hardware and software, and so on) can be considered one or more processors. Each of video encoder 20 and video decoder 30 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in another device.
- codec combined encoder/decoder
- the invention may generally refer to video encoder 20 "signaling" certain information to another device (e.g., video decoder 30).
- The term "signaling" may generally refer to the communication of syntax elements and/or other data used to decode the encoded video data. This communication can occur in real time or near real time. Alternatively, this communication may occur over a span of time, such as when syntax elements are stored to a computer-readable storage medium at encoding time and subsequently retrieved by the decoding device at any time after being stored.
- the video encoder 20 encodes video data.
- Video data may include one or more pictures.
- Video encoder 20 may generate a code stream that contains encoded information for the video data in the form of a bitstream.
- the encoded information may include encoded picture data and associated data.
- Associated data can include sequence parameter sets (SPS), picture parameter sets (PPS), and other syntax structures.
- SPS sequence parameter sets
- PPS picture parameter sets
- An SPS can contain parameters that are applied to zero or more sequences.
- the PPS can contain parameters that are applied to zero or more pictures.
- A syntax structure refers to a collection of zero or more syntax elements arranged in a specified order in a code stream.
- Video encoder 20 may partition the picture into a grid of coding tree blocks (CTBs).
- CTB coding tree block
- A CTB may be referred to as a "tree block", a "largest coding unit" (LCU), or a "coding tree unit".
- the CTB is not limited to a particular size and may include one or more coding units (CUs).
- Each CTB can be associated with a block of pixels of equal size within the picture.
- Each pixel can correspond to one luminance (luma) sample and two chrominance (chroma) samples.
- each CTB can be associated with one luma sample block and two chroma sample blocks.
- The CTBs of a picture can be divided into one or more slices.
- Each slice contains an integer number of CTBs.
- Video encoder 20 may generate encoded information for each slice of the picture, i.e., encode the CTBs within the slice.
- Video encoder 20 may recursively perform quadtree partitioning on the block of pixels associated with the CTB to partition the block of pixels into successively smaller blocks of pixels. The smaller blocks of pixels can be associated with CUs.
- Video encoder 20 may generate one or more prediction units (PUs) for each CU that is not partitioned further. Each PU of a CU may be associated with a different block of pixels within the pixel block of the CU. Video encoder 20 may generate a predictive pixel block for each PU of the CU, using intra prediction or inter prediction. If video encoder 20 uses intra prediction to generate the predictive pixel block of a PU, video encoder 20 may generate the predictive pixel block of the PU based on decoded pixels of the picture associated with the PU.
- PUs prediction units
- If video encoder 20 uses inter prediction to generate the predictive pixel block of a PU, video encoder 20 may generate the predictive pixel block of the PU based on decoded pixels of one or more pictures different from the picture associated with the PU. Video encoder 20 may generate the residual pixel block of the CU based on the predictive pixel blocks of the PUs of the CU; the residual pixel block of the CU indicates the difference between the sample values in the predictive pixel blocks of the PUs of the CU and the corresponding sample values in the initial pixel block of the CU.
- Video encoder 20 may perform recursive quadtree partitioning on the residual pixel blocks of the CU to partition the residual pixel blocks of the CU into one or more smaller residual pixel blocks associated with the transform units (TUs) of the CU. Because the pixels in the pixel block associated with the TU each correspond to one luma sample and two chroma samples, each TU can be associated with one luma residual sample block and two chroma residual sample blocks. Video encoder 20 may apply one or more transforms to the residual sample block associated with the TU to generate a coefficient block (ie, a block of coefficients). The transform can be a DCT transform or a variant thereof.
- The coefficient block may be obtained by applying a one-dimensional transform in each of the horizontal and vertical directions to compute a two-dimensional transform.
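- For example, a separable 2-D DCT computed as two passes of a 1-D transform; a SciPy sketch is shown, while HEVC itself uses integer approximations of the DCT:

```python
from scipy.fft import dct, idct

def forward_transform(block):
    """Apply the 1-D DCT along the vertical axis and then the
    horizontal axis to obtain the 2-D coefficient block."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def inverse_transform(coeffs):
    """Inverse 2-D transform used when reconstructing residual samples."""
    return idct(idct(coeffs, axis=1, norm="ortho"), axis=0, norm="ortho")
```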
- Video encoder 20 may perform a quantization procedure for each of the coefficients in the coefficient block. Quantization generally refers to the process by which the coefficients are quantized to reduce the amount of data used to represent the coefficients, thereby providing further compression.
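- A sketch of uniform scalar quantization and the matching inverse quantization; HEVC's actual quantizer derives the step size from a QP and uses integer arithmetic, so the plain rounding quantizer here is an illustrative assumption:

```python
import numpy as np

def quantize(coeffs, qstep):
    """Map each transform coefficient to an integer level; a coarser
    qstep means fewer distinct levels and further compression."""
    return np.round(coeffs / qstep).astype(np.int32)

def dequantize(levels, qstep):
    """Approximate the original coefficients from the levels."""
    return levels.astype(np.float64) * qstep
```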
- Video encoder 20 may generate a set of syntax elements that represent the coefficients in the quantized coefficient block. Video encoder 20 may apply an entropy encoding operation (e.g., a context-adaptive binary arithmetic coding (CABAC) operation) to some or all of these syntax elements. To apply CABAC encoding to a syntax element, video encoder 20 may binarize the syntax element to form a binary sequence comprising one or more bits (referred to as "bins"). Video encoder 20 may encode some of the bins using regular encoding and the other bins using bypass encoding.
- CABAC context adaptive binary arithmetic coding
- video encoder 20 may apply an inverse quantization and an inverse transform to the transformed coefficient block to reconstruct the residual sample block from the transformed coefficient block.
- Video encoder 20 may add the reconstructed residual sample block to a corresponding sample block of one or more predictive sample blocks to produce a reconstructed sample block.
- video encoder 20 may reconstruct the block of pixels associated with the TU. The pixel block of each TU of the CU is reconstructed in this way until the entire pixel block reconstruction of the CU is completed.
- video encoder 20 may perform a deblocking filtering operation to reduce the blockiness of the block of pixels associated with the CU.
- video encoder 20 may use sample adaptive offset (SAO) to modify the reconstructed block of pixels of the CTB of the picture.
- SAO sample adaptive offset
- video encoder 20 may store the reconstructed blocks of pixels of the CU in a decoded picture buffer for use in generating predictive blocks of pixels for other CUs.
- Video decoder 30 can receive the code stream.
- the code stream contains encoded information of video data encoded by video encoder 20 in the form of a bitstream.
- Video decoder 30 may parse the code stream to extract syntax elements from the code stream.
- Video decoder 30 may perform regular decoding on some bins and bypass decoding on other bins; the bins in the code stream have mapping relationships with the syntax elements, so the syntax elements are obtained by parsing the bins.
- Video decoder 30 may reconstruct a picture of the video data based on the syntax elements extracted from the code stream.
- the process of reconstructing video data based on syntax elements is generally reciprocal to the process performed by video encoder 20 to generate syntax elements.
- video decoder 30 may generate a predictive pixel block of a PU of a CU based on syntax elements associated with the CU.
- video decoder 30 may inverse quantize the coefficient blocks associated with the TUs of the CU.
- Video decoder 30 may perform an inverse transform on the inverse quantized coefficient block to reconstruct a residual pixel block associated with the TU of the CU.
- Video decoder 30 may reconstruct a block of pixels of the CU based on the predictive pixel block and the residual pixel block.
- video decoder 30 may perform a deblocking filtering operation to reduce the blockiness of the block of pixels associated with the CU. Additionally, video decoder 30 may perform the same SAO operations as video encoder 20 based on one or more SAO syntax elements. After video decoder 30 performs these operations, video decoder 30 may store the block of pixels of the CU in a decoded picture buffer.
- the decoded picture buffer can provide reference pictures for subsequent motion compensation, intra prediction, and presentation by the display device.
- Video encoder 20 includes prediction processing unit 100, residual generation unit 102, transform processing unit 104, quantization unit 106, inverse quantization unit 108, inverse transform processing unit 110, reconstruction unit 112, filter unit 113, decoded picture buffer 114, and entropy encoding unit 116.
- Entropy encoding unit 116 includes a regular CABAC codec engine 118 and a bypass codec engine 120.
- the prediction processing unit 100 includes an inter prediction processing unit 121 and an intra prediction processing unit 126.
- the inter prediction processing unit 121 includes a motion estimation unit 122 and a motion compensation unit 124.
- video encoder 20 may include more, fewer, or different functional components.
- Video encoder 20 receives the video data.
- Video encoder 20 may encode each slice of each picture of the video data.
- As part of encoding a slice, video encoder 20 may encode each CTB in the slice.
- Prediction processing unit 100 may perform quadtree partitioning on the pixel block associated with the CTB to divide it into progressively smaller blocks of pixels. For example, prediction processing unit 100 may partition the pixel block of a CTB into four equally sized sub-blocks, split one or more of those sub-blocks into four equally sized sub-sub-blocks, and so on.
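- A minimal sketch of such a recursive split, assuming a rate-distortion decision callback `should_split` (a hypothetical helper) decides whether a block is divided further:

```python
def quadtree_split(x: int, y: int, size: int, min_size: int, should_split):
    # Recursively split a CTB pixel block into four equal quadrants until
    # 'should_split' declines or the minimum block size is reached.
    # Yields (x, y, size) for each leaf block (a CU pixel block).
    if size > min_size and should_split(x, y, size):
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                yield from quadtree_split(x + dx, y + dy, half,
                                          min_size, should_split)
    else:
        yield (x, y, size)
```

For example, `list(quadtree_split(0, 0, 64, 8, lambda x, y, s: s > 32))` yields four 32×32 leaf blocks.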
- Video encoder 20 may encode the CU of the CTB in the picture to generate coded information for the CU.
- Video encoder 20 may encode the CUs of a CTB in z-scan order. In other words, video encoder 20 may encode the upper-left CU, the upper-right CU, the lower-left CU, and then the lower-right CU.
- When video encoder 20 encodes a partitioned CU, it may likewise encode the CUs associated with the sub-blocks of the pixel block of the partitioned CU in z-scan order.
- Prediction processing unit 100 can partition the pixel block of the CU among one or more PUs of the CU.
- Video encoder 20 and video decoder 30 can support a variety of PU sizes. Assuming that the size of a particular CU is 2N×2N, video encoder 20 and video decoder 30 may support a PU size of 2N×2N or N×N for intra prediction, and symmetric PU sizes of 2N×2N, 2N×N, N×2N, N×N, or similar for inter prediction. Video encoder 20 and video decoder 30 may also support asymmetric PUs of 2N×nU, 2N×nD, nL×2N, and nR×2N for inter prediction.
- the inter prediction processing unit 121 may generate predictive data of the PU by performing inter prediction on each PU of the CU.
- the predictive data of the PU may include motion information corresponding to the predictive pixel block of the PU and the PU.
- A slice can be an I slice, a P slice, or a B slice.
- The inter prediction processing unit 121 may perform different operations on a PU of the CU depending on whether the PU is in an I slice, a P slice, or a B slice. In an I slice, all PUs are intra predicted.
- motion estimation unit 122 may search for a reference picture in a list of reference pictures (eg, "List 0") to find a reference block for the PU.
- the reference block of the PU may be the pixel block that most closely corresponds to the pixel block of the PU.
- Motion estimation unit 122 may generate a reference picture index that indicates the position in list 0 of the reference picture containing the reference block of the PU, and a motion vector that indicates the spatial displacement between the pixel block of the PU and the reference block.
- the motion estimation unit 122 may output the reference picture index and the motion vector as motion information of the PU.
- Motion compensation unit 124 may generate a predictive pixel block of the PU based on the reference block indicated by the motion information of the PU.
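- A minimal sketch of the reference-block search, using exhaustive block matching with a sum-of-absolute-differences (SAD) criterion over a small window; practical encoders use fast search patterns instead:

```python
import numpy as np

def full_search(block, ref, bx, by, search_range=8):
    # Scan a +/-search_range window around position (bx, by) in the
    # reference picture and return the motion vector (dx, dy) whose
    # candidate reference block has the smallest SAD.
    h, w = block.shape
    blk = block.astype(np.int64)
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate block falls outside the picture
            sad = np.abs(blk - ref[y:y + h, x:x + w].astype(np.int64)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv
```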
- motion estimation unit 122 may perform uni-directional inter prediction or bi-directional inter prediction on the PU.
- motion estimation unit 122 may search for a reference picture of a first reference picture list ("List 0") or a second reference picture list ("List 1") to find a reference block for the PU.
- The motion estimation unit 122 may output the following as the motion information of the PU: a reference picture index indicating the position in list 0 or list 1 of the reference picture containing the reference block, a motion vector indicating the spatial displacement between the pixel block of the PU and the reference block, and a prediction direction indicator indicating whether the reference picture is in list 0 or list 1.
- motion estimation unit 122 may search for reference pictures in list 0 to find reference blocks for the PU, and may also search for reference pictures in list 1 to find another reference block for the PU.
- Motion estimation unit 122 may generate reference picture indices indicating the positions in list 0 and list 1 of the reference pictures containing the two reference blocks. Additionally, motion estimation unit 122 may generate motion vectors that indicate the spatial displacements between those reference blocks and the pixel block of the PU.
- the motion information of the PU may include a reference picture index of the PU and a motion vector.
- Motion compensation unit 124 may generate a predictive pixel block of the PU based on the reference block indicated by the motion information of the PU.
- Intra prediction processing unit 126 may generate predictive data for the PU by performing intra prediction on the PU.
- the predictive data of the PU may include predictive pixel blocks of the PU and various syntax elements.
- Intra prediction processing unit 126 may perform intra prediction on PUs within I slices, P slices, and B slices.
- intra-prediction processing unit 126 may use multiple intra-prediction modes to generate multiple sets of predictive data for the PU.
- To generate a set of predictive data for the PU using an intra-prediction mode, intra-prediction processing unit 126 may extend samples from the sample blocks of neighboring PUs across the sample block of the PU in a direction associated with the intra-prediction mode. Assuming a left-to-right, top-to-bottom coding order for PUs, CUs, and CTBs, the neighboring PU may be above the PU, above and to the right of the PU, above and to the left of the PU, or to the left of the PU.
- Intra prediction processing unit 126 may use a different number of intra prediction modes, for example, 33 directional intra prediction modes. In some examples, the number of intra prediction modes may depend on the size of the pixel block of the PU.
- The prediction processing unit 100 may select the predictive data for the PUs of the CU from among the predictive data generated by inter prediction processing unit 121 or the predictive data generated by intra prediction processing unit 126 for the PUs. In some examples, prediction processing unit 100 selects the predictive data for the PUs of the CU based on rate-distortion metrics of the sets of predictive data. For example, a Lagrangian cost function is used to select between an encoding mode and its parameter values, such as motion vectors, reference indices, and intra prediction directions.
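- A minimal sketch of such a selection, assuming candidate modes have already been evaluated for distortion and bit cost; the candidate tuples and λ value are illustrative assumptions:

```python
def rd_cost(distortion: float, bits: float, lmbda: float) -> float:
    # Lagrangian rate-distortion cost: J = D + lambda * R.
    return distortion + lmbda * bits

def choose_mode(candidates, lmbda):
    # 'candidates' is an assumed list of (mode, distortion, bits) tuples;
    # the mode with the smallest Lagrangian cost is selected.
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lmbda))[0]

# Example: intra vs. inter candidates for one PU.
best = choose_mode([("intra_dc", 1200.0, 40), ("inter_2Nx2N", 900.0, 95)],
                   lmbda=4.0)
```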
- The predictive pixel blocks of the selected predictive data may be referred to herein as the selected predictive pixel blocks.
- Residual generation unit 102 may generate the residual pixel block of the CU based on the pixel block of the CU and the selected predictive pixel blocks of the PUs of the CU. For example, residual generation unit 102 may generate the residual pixel block of the CU such that each sample in it has a value equal to the difference between a sample in the pixel block of the CU and the corresponding sample in a selected predictive pixel block of a PU of the CU.
- The prediction processing unit 100 may perform quadtree partitioning to partition the residual pixel block of the CU into sub-blocks. Each residual pixel block that is no longer partitioned may be associated with a different TU of the CU. The sizes and positions of the residual pixel blocks associated with the TUs of the CU may or may not be based on the sizes and positions of the pixel blocks of the PUs of the CU.
- Transform processing unit 104 may generate a coefficient block for each TU of the CU by applying one or more transforms to the residual sample block associated with the TU. For example, transform processing unit 104 may apply a discrete cosine transform (DCT), a directional transform, or a conceptually similar transform to the residual sample block.
- DCT discrete cosine transform
- Quantization unit 106 may quantize the coefficients in the coefficient block. For example, an n-bit coefficient can be truncated to an m-bit coefficient during quantization, where n is greater than m. Quantization unit 106 may quantize the coefficient block associated with the TU of the CU based on a quantization parameter (QP) value associated with the CU. Video encoder 20 may adjust the degree of quantization applied to the coefficient block associated with the CU by adjusting the QP value associated with the CU.
- QP quantization parameter
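- A floating-point sketch of the QP adjustment, assuming the HEVC-style rule that the quantization step size roughly doubles for every increase of 6 in QP; real codecs implement this with integer arithmetic and scaling tables:

```python
import numpy as np

def qstep(qp: int) -> float:
    # Step size grows by a factor of 2 per +6 QP (approximate relationship).
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(coeffs: np.ndarray, qp: int) -> np.ndarray:
    # Scalar quantization: divide by the step size and round to levels.
    return np.round(coeffs / qstep(qp)).astype(np.int32)

def dequantize(levels: np.ndarray, qp: int) -> np.ndarray:
    # Inverse quantization: scale levels back by the step size.
    return levels * qstep(qp)
```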
- Inverse quantization unit 108 and inverse transform processing unit 110 may apply inverse quantization and inverse transform, respectively, to the transformed coefficient block to reconstruct the residual sample block from the coefficient block.
- Reconstruction unit 112 may add samples of the reconstructed residual sample block to corresponding samples of one or more predictive sample blocks generated by prediction processing unit 100 to generate a reconstructed sample block associated with the TU. By reconstructing the sample block of each TU of the CU in this manner, video encoder 20 may reconstruct the block of pixels of the CU.
- Filter unit 113 may perform a deblocking filtering operation to reduce blockiness of pixel blocks associated with the CU. Further, the filter unit 113 may apply the SAO offset determined by the prediction processing unit 100 to the reconstructed sample block to restore the pixel block. Filter unit 113 may generate encoding information for the SAO syntax elements of the CTB.
- the decoded picture buffer 114 can store the reconstructed block of pixels.
- Inter prediction unit 121 may perform inter prediction on PUs of other pictures using reference pictures containing the reconstructed pixel blocks.
- intra-prediction processing unit 126 can use the reconstructed block of pixels in decoded picture buffer 114 to perform intra-prediction on other PUs in the same picture as the CU.
- Entropy encoding unit 116 may receive data from other functional components of video encoder 20. For example, entropy encoding unit 116 may receive coefficient blocks from quantization unit 106 and may receive syntax elements from prediction processing unit 100. Entropy encoding unit 116 may perform one or more entropy encoding operations on the data to generate entropy encoded data. For example, entropy encoding unit 116 may perform context adaptive variable length coding (CAVLC) operations, CABAC operations, variable-to-variable (V2V) length coding operations, syntax-based context adaptive binary arithmetic coding (SBAC) operations, probability interval partitioning entropy (PIPE) coding operations, or other types of entropy coding operations on the data. In a particular example, entropy encoding unit 116 may encode regular CABAC-coded bins of the syntax elements using regular CABAC engine 118, and may encode bypass-coded bins using bypass codec engine 120.
- CAVLC context adaptive variable length codec
- video decoder 30 includes an entropy decoding unit 150, a prediction processing unit 152, an inverse quantization unit 154, an inverse transform processing unit 156, a reconstruction unit 158, a filter unit 159, and a decoded picture buffer 160.
- the prediction processing unit 152 includes a motion compensation unit 162 and an intra prediction processing unit 164.
- Entropy decoding unit 150 includes a regular CABAC codec engine 166 and a bypass codec engine 168. In other examples, video decoder 30 may include more, fewer, or different functional components.
- Video decoder 30 can receive the code stream.
- Entropy decoding unit 150 may parse the code stream to extract syntax elements from the code stream. As part of parsing the code stream, entropy decoding unit 150 may parse the entropy encoded syntax elements in the code stream.
- the prediction processing unit 152, the inverse quantization unit 154, the inverse transform processing unit 156, the reconstruction unit 158, and the filter unit 159 may decode the video data according to the syntax elements extracted from the code stream, that is, generate the decoded video data.
- the syntax elements may include a regular CABAC codec binary and a bypass codec binary.
- Entropy decoding unit 150 may use a regular CABAC codec engine 166 to decode the regular CABAC codec bins, and may use the bypass codec engine 168 to decode the bypass codec bins.
- intra prediction processing unit 164 may perform intra prediction to generate a predictive sample block for the PU.
- Intra-prediction processing unit 164 may use an intra-prediction mode to generate a predictive pixel block of a PU based on a block of pixels of a spatially neighboring PU.
- Intra prediction processing unit 164 may determine an intra prediction mode for the PU based on one or more syntax elements parsed from the code stream.
- Motion compensation unit 162 may construct a first reference picture list (List 0) and a second reference picture list (List 1) based on syntax elements parsed from the code stream. Furthermore, if the PU uses inter prediction coding, the entropy decoding unit 150 may parse the motion information of the PU. Motion compensation unit 162 can determine one or more reference blocks of the PU based on the motion information of the PU. Motion compensation unit 162 can generate a predictive pixel block of the PU from one or more reference blocks of the PU.
- video decoder 30 may perform a reconstruction operation on a CU that is no longer split. To perform a reconstruction operation on a CU that is no longer split, video decoder 30 may perform a reconstruction operation on each TU of the CU. By performing a reconstruction operation on each TU of the CU, video decoder 30 may reconstruct the residual pixel blocks associated with the CU.
- Inverse quantization unit 154 may inverse quantize (i.e., dequantize) the coefficient blocks associated with the TUs. Inverse quantization unit 154 may use the QP value associated with the CU of the TU to determine the degree of quantization and, likewise, the degree of inverse quantization to apply.
- inverse transform processing unit 156 may apply one or more inverse transforms to the coefficient block to generate a residual sample block associated with the TU.
- Inverse transform processing unit 156 may apply to the coefficient block an inverse DCT, an inverse integer transform, an inverse Karhunen-Loève transform (KLT), an inverse rotational transform, an inverse directional transform, or an inverse transform corresponding to the transform used at the encoding end.
- Reconstruction unit 158 may use the residual pixel blocks associated with the TUs of the CU and the predictive pixel blocks of the PUs of the CU (i.e., intra-prediction data or inter-prediction data, as applicable) to reconstruct the pixel block of the CU.
- reconstruction unit 158 can add samples of the residual pixel block to corresponding samples of the predictive pixel block to reconstruct the pixel block of the CU.
- Filter unit 159 may perform a deblocking filtering operation to reduce the blockiness of the block of pixels associated with the CU of the CTB. Additionally, filter unit 159 can modify the pixel values of the CTB based on the SAO syntax elements parsed from the code stream. For example, filter unit 159 can determine the correction value based on the SAO syntax element of the CTB and add the determined correction value to the sample value in the reconstructed pixel block of the CTB. By modifying some or all of the pixel values of the CTB of the picture, the filter unit 159 can modify the reconstructed picture of the video data according to the SAO syntax element.
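- A minimal sketch of the band-offset variant of SAO, assuming `samples` is an integer array and that the band start position and four offsets have already been parsed from the code stream:

```python
import numpy as np

def apply_sao_band_offsets(samples, band_start, offsets, bit_depth=8):
    # Band offset: the sample value range is divided into 32 equal bands;
    # the parsed offsets are added to samples falling into four signaled
    # consecutive bands, and the result is clipped to the valid range.
    shift = bit_depth - 5                  # 32 bands for the given bit depth
    band = samples >> shift                # band index of each sample
    out = samples.astype(np.int32)
    for i, off in enumerate(offsets):      # offsets for 4 consecutive bands
        out[band == ((band_start + i) % 32)] += off
    return np.clip(out, 0, (1 << bit_depth) - 1)
```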
- Video decoder 30 may store the block of pixels of the CU in decoded picture buffer 160.
- the decoded picture buffer 160 may provide reference pictures for subsequent motion compensation, intra prediction, and presentation by a display device (eg, display device 32 of FIG. 21).
- video decoder 30 may perform intra-prediction operations or inter-prediction operations on PUs of other CUs according to the blocks of pixels in decoded picture buffer 160.
- the video encoder of the embodiment of the present invention may be used to perform the video encoding method of the foregoing embodiments, and the functional modules of the video encoding apparatus shown in FIG. 18a and FIG. 18b may be integrated into the video encoder 20 of the embodiment of the present invention.
- the video encoder can be used to perform the video encoding method of the embodiment shown in FIG. 2, FIG. 5 or FIG. 12 described above.
- Video encoder 20 acquires a plurality of video frames, where every two of the video frames include redundant data in picture content. Video encoder 20 then reconstructs the plurality of video frames to obtain scene information and a reconstruction residual of each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data, and the reconstruction residual is used to represent the difference between the video frame and the scene information. Next, video encoder 20 predictively encodes the scene information to obtain scene feature prediction encoded data, and predictively encodes the reconstruction residuals to obtain residual prediction encoded data.
- In this way, the redundancy of the video frames is reduced, so that in the encoding operation the total amount of compressed data for the obtained scene features and reconstruction residuals is reduced relative to the amount of compressed data for the original video frames, reducing the amount of data obtained after compression.
- Each video frame is reconstructed into scene features and a reconstruction residual. Because the reconstruction residual contains only the residual information beyond the scene information, its information content is small and sparse; this property allows it to be predictively encoded with fewer codewords, so the amount of encoded data is small and the compression ratio is high.
- the method of the embodiment of the present invention can effectively improve the compression efficiency of a video frame.
- A video decoder is further provided, where the video decoder can be used to perform the video decoding methods of the foregoing embodiments, and the functional modules of the video decoding device shown in FIG. 19 can also be integrated into it.
- the video decoder 30 of an embodiment of the invention can be used to perform the video decoding method of the embodiment shown in FIG. 2, FIG. 6, or FIG.
- Video decoder 30 decodes the scene feature prediction encoded data to obtain the scene information, where the scene information includes data obtained by reducing the redundancy of the redundant data, and the redundant data is redundant picture-content data between each of the plurality of video frames.
- video decoder 30 decodes the residual prediction encoded data to obtain a reconstructed residual, which is used to represent the difference between the video frame and the scene information.
- Video decoder 30 is further configured to perform reconstruction according to the scene information and the reconstruction residuals to obtain the plurality of video frames.
- the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted as one or more instructions or code via a computer-readable medium and executed by a hardware-based processing unit.
- The computer readable medium can comprise a computer readable storage medium (which corresponds to a tangible medium such as a data storage medium) or a communication medium including, for example, any medium that facilitates transfer of a computer program from one place to another in accordance with a communication protocol.
- computer readable media generally may correspond to (1) a non-transitory tangible computer readable storage medium, or (2) a communication medium such as a signal or carrier wave.
- Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for use in carrying out the techniques described herein.
- the computer program product can comprise a computer readable medium.
- By way of example and not limitation, some computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are sent from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies (e.g., infrared, radio, and microwave), then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies (e.g., infrared, radio, and microwave) are included in the definition of medium.
- coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology eg, infrared, radio, and microwave
- As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital video disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer readable media.
- The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
- DSPs digital signal processors
- ASICs application specific integrated circuits
- FPGAs field programmable logic arrays
- Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or any other structure suitable for implementing the techniques described herein.
- the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec.
- the techniques can be fully implemented in one or more circuits or logic elements.
- The techniques of the present invention may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset).
- IC integrated circuit
- Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily need to be implemented by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit, or provided by a collection of interoperable hardware units (including one or more processors as described above) in conjunction with suitable software and/or firmware.
- The terms "system" and "network" are used interchangeably herein. It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A exists alone, both A and B exist, and B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
- B corresponding to A means that B is associated with A, and B can be determined from A.
- However, determining B from A does not mean that B is determined based only on A; B can also be determined based on A and/or other information.
- the disclosed systems, devices, and methods may be implemented in other manners.
- the device embodiments described above are merely illustrative.
- The division of units is only a logical function division; in actual implementation there may be another division manner. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
- the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
- the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
- The technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
- The software product includes a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
- The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
- the computer program product includes one or more computer instructions.
- the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
- The computer instructions can be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, the computer instructions can be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic cable, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
- wire eg, coaxial cable, fiber optic, digital subscriber line (DSL), or wireless (eg, infrared, wireless, microwave, etc.).
- The computer readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media.
- The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Discrete Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Embodiments of the present invention disclose a video encoding method, a video decoding method, and a related device, for improving the compression efficiency of video frames. The method in the embodiments of the present invention comprises: acquiring a plurality of video frames, wherein redundant image content data exists between each of the plurality of video frames; reconstructing the plurality of video frames to obtain scene information and a reconstruction residual of each video frame, the scene information comprising data obtained by reducing the redundancy of the redundant data, and the reconstruction residual being used to represent a difference between the video frame and the scene information; and performing predictive encoding on the scene information and reconstruction residuals respectively to obtain scene feature predictive encoding data and residual predictive encoding data. In this way, a redundancy between video frames is reduced, and a volume of data obtained after compression is reduced. Furthermore, each video frame is reconstructed into scene features and a reconstruction residual. Reconstruction residuals are encoded based on residual encoding, producing a small volume of encoded data and a high compression ratio. In this way, the method of the embodiments of the present invention can effectively improve the compression efficiency of video frames.
Description
This application claims priority to Chinese Patent Application No. 201710169486.5, filed with the Chinese Patent Office on March 21, 2017 and entitled "Video Encoding Method, Video Decoding Method, and Related Device", the entire contents of which are incorporated herein by reference.
The present invention relates to the field of video frame processing, and in particular to a video encoding method, a video decoding method, a video encoding device and a video decoding device, and a video encoding and decoding device.
With the continuous development of technologies such as the Internet and streaming media, digital video has been widely used in various terminal devices, such as traditional PCs, smartphones, tablet computers, and Internet interactive television (IPTV). Meanwhile, people's sensory demands keep rising, and the demand for high-definition and ultra-high-definition video keeps increasing. These ever-higher video format and resolution requirements inevitably bring a very large transmission bit rate. Therefore, in the context of large-scale video, high-quality compression of video is required to reduce the network transmission load and increase the effective storage capacity.
For a video frame that is to be independently encoded, the prior art often encodes the frame independently, so the frame that needs independent encoding carries a large amount of redundant information, which is not conducive to data storage and transmission.
For example, HEVC (High Efficiency Video Coding) predictive coding uses both intra-frame compression and inter-frame compression. Before encoding, the GOP (group of pictures) step size is set first, that is, the number of frames included in a GOP, where a frame group is a group composed of a plurality of frames. To guard against motion changes, the number of frames should not be set too large. In the specific predictive coding stage, HEVC divides all frames into three types, I, P, and B, as shown in FIG. 1, where the number above each frame indicates the number of the corresponding frame in the original video sequence. Encoding proceeds in units of GOPs, successively encoding the I frames, P frames, and B frames. An I frame (intra-frame), also known as an intra-coded frame, is an independent frame that carries all of its own information and can be encoded and decoded independently without reference to other pictures. For the I frame, the existing HEVC standard uses only the intra-picture information of the current I frame for encoding and decoding, and I frames are selected by a fixed strategy along the video time axis. As a result, in the HEVC standard, independently encoded I frames account for a high proportion of the compressed data and contain a large amount of information redundancy.
Summary of the Invention
Embodiments of the present invention provide a video encoding method, a video decoding method, a video encoding device, a video decoding device, and a video encoding and decoding device, for improving the compression efficiency of video frames.
A first aspect of the embodiments of the present invention provides a video encoding method. The method includes: acquiring a plurality of video frames, where every two of the plurality of video frames include redundant data in picture content; reconstructing the plurality of video frames to obtain scene information and a reconstruction residual of each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data and the reconstruction residual is used to represent the difference between a video frame and the scene information, so that the redundant data of the plurality of video frames is reduced by the reconstruction; and subsequently, predictively encoding the scene information to obtain scene feature prediction encoded data, and predictively encoding the reconstruction residuals to obtain residual prediction encoded data.
In this way, by reconstructing the plurality of video frames, the redundancy of the video frames can be reduced, so that in the encoding operation the total amount of compressed data for the obtained scene features and reconstruction residuals is reduced relative to the amount of compressed data for the original video frames, reducing the amount of data obtained after compression. Each video frame is reconstructed into scene features and a reconstruction residual; because the reconstruction residual contains only the residual information beyond the scene information, its information content is small and sparse, and this property allows it to be predictively encoded with fewer codewords, so the amount of encoded data is small and the compression ratio is high. Thus, the method of the embodiments of the present invention can effectively improve the compression efficiency of video frames.
With reference to the first aspect of the embodiments of the present application, in a first implementation of the first aspect, every two of the plurality of video frames include the same picture content, and this same picture content is the redundant data of the plurality of video frames. The step of reconstructing the plurality of video frames to obtain the scene information and the reconstruction residual of each video frame includes: reconstructing the plurality of video frames to obtain a scene feature and the reconstruction residual of each video frame, where the scene feature is used to represent the picture content shared by the video frames, and the reconstruction residual is used to represent the difference between a video frame and the scene feature. The scene feature is one concrete form of the scene information. Through the reconstruction operation, one instance of the picture content repeated across the plurality of video frames is saved in a single scene feature, which reduces the repeated recording of the same picture content and reduces the redundancy of the redundant data. Correspondingly, predictively encoding the scene information to obtain the scene feature prediction encoded data includes: predictively encoding the scene feature to obtain the scene feature prediction encoded data.
In this way, through the reconstruction, the same picture content is deduplicated and represented by the scene feature, which reduces the redundancy of the redundant information of the plurality of video frames. Thus, in the encoding operation, the total amount of compressed data for the obtained scene feature and reconstruction residuals is reduced relative to the amount of compressed data for the original video frames, reducing the amount of data obtained after compression. Because the reconstruction residual of each video frame contains only the residual information beyond the scene information, its information content is small and sparse; this property allows it to be predictively encoded with fewer codewords, so the amount of encoded data is small and the compression ratio is high. Thus, the method of the embodiments of the present invention can effectively improve the compression efficiency of video frames.
With reference to the first implementation of the first aspect of the embodiments of the present application, in a second implementation of the first aspect, reconstructing the plurality of video frames to obtain the scene feature and the reconstruction residual of each video frame includes: converting the plurality of video frames into an observation matrix, where the observation matrix is used to represent the plurality of video frames in matrix form; and then reconstructing the observation matrix according to a first constraint condition to obtain a scene feature matrix and a reconstruction residual matrix, where the scene feature matrix is used to represent the scene feature in matrix form, the reconstruction residual matrix is used to represent the reconstruction residuals of the plurality of video frames in matrix form, and the first constraint condition requires the scene feature matrix to be low-rank and the reconstruction residual matrix to be sparse. In this way, the reconstruction of the plurality of video frames is performed in matrix form, and under the first constraint condition the reconstruction residuals and scene feature meet the preset requirements, which helps reduce the amount of coding and improve the compression ratio in the subsequent encoding operation.
With reference to the second implementation of the first aspect of the embodiments of the present application, in a third implementation of the first aspect, reconstructing the observation matrix according to the first constraint condition to obtain the scene feature matrix and the reconstruction residual matrix includes: calculating the scene feature matrix and the reconstruction residual matrix according to a first preset formula, where the obtained scene feature matrix is a low-rank matrix and the reconstruction residual matrix is a sparse matrix.
The first preset formula is:

min_{F,E} rank(F) + λ||E||_1,  s.t. D = F + E

which, after relaxation, becomes:

min_{F,E} ||F||_* + λ||E||_1,  s.t. D = F + E

Each group of formulas includes two parts: the target constraint function and the reconstruction formula. Because the former group is an NP-hard problem, a relaxation is applied to it to obtain the latter group, which is convenient to solve.

Here, D is the observation matrix, F is the scene feature matrix, E is the reconstruction residual matrix, and λ is a weight parameter used to balance the relationship between the scene feature matrix F and the reconstruction residual matrix E. min_{F,E} denotes finding the optimal values of F and E, that is, the values of F and E that minimize the target formula rank(F) + λ||E||_1 or ||F||_* + λ||E||_1; rank(·) is the matrix rank function, ||·||_1 is the matrix L1 norm, and ||·||_* is the matrix nuclear norm.
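A minimal numerical sketch of solving the relaxed (nuclear norm plus L1) program with an inexact augmented Lagrange multiplier iteration, where each video frame is vectorized into one column of D; the parameter defaults are common heuristics and are assumptions, not values specified by this application:

```python
import numpy as np

def shrink(X, tau):
    # Soft thresholding: proximal operator of the L1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def decompose(D, lam=None, mu=None, tol=1e-7, max_iter=500):
    # Solve min ||F||_* + lam*||E||_1  s.t.  D = F + E,
    # returning the low-rank scene part F and the sparse residual E.
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))          # common default weight
    mu = mu or 1.25 / np.linalg.norm(D, 2)         # penalty parameter
    Y = D / max(np.linalg.norm(D, 2), np.abs(D).max() / lam)  # dual init
    F = np.zeros_like(D); E = np.zeros_like(D)
    for _ in range(max_iter):
        F = svd_shrink(D - E + Y / mu, 1.0 / mu)   # low-rank update
        E = shrink(D - F + Y / mu, lam / mu)       # sparse update
        R = D - F - E                              # constraint violation
        Y += mu * R                                # dual ascent step
        if np.linalg.norm(R, 'fro') <= tol * np.linalg.norm(D, 'fro'):
            break
    return F, E
```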
With reference to any one of the first to third implementations of the first aspect of the embodiments of the present application, in a fourth implementation of the first aspect, before the plurality of video frames are reconstructed to obtain the scene feature and the reconstruction residual of each video frame, the method of this implementation further includes: extracting picture feature information of each of the plurality of video frames; and then calculating content metric information according to the picture feature information, where the content metric information is used to measure the difference in picture content among the plurality of video frames. Then, when the content metric information is not greater than a preset metric threshold, the step of reconstructing the plurality of video frames to obtain the scene feature and the reconstruction residual of each video frame is performed. Through this check, only sets of video frames that meet the requirement use the reconstruction operations of the first to third implementations of the first aspect, which ensures the normal execution of the reconstruction operation.
With reference to the fourth implementation of the first aspect of the embodiments of the present application, in a fifth implementation of the first aspect, the picture feature information is a global GIST feature, the preset metric threshold is a preset variance threshold, and calculating the content metric information according to the picture feature information includes: calculating a scene GIST feature variance according to the global GIST features. By calculating the scene GIST feature variance of the plurality of video frames, the content consistency of the plurality of video frames is measured, so as to decide whether to perform the reconstruction operations of the first to third implementations of the first aspect of this application. With reference to any one of the first to third implementations of the first aspect of the embodiments of the present application, in a sixth implementation of the first aspect, acquiring the plurality of video frames includes: acquiring a video stream whose video frames include I frames, B frames, and P frames, and then extracting the I frames from the video stream, where the I frames are used to perform the step of reconstructing the plurality of video frames to obtain the scene feature and the reconstruction residual of each video frame. In the specific encoding stage, the method of this implementation further includes: performing reconstruction according to the scene feature and the reconstruction residuals to obtain reference frames; performing inter-frame predictive encoding on the B frames and P frames with the reference frames as references, to obtain B-frame prediction encoded data and P-frame prediction encoded data; and then performing transform coding, quantization coding, and entropy coding on the prediction encoded data to obtain video compressed data, where the prediction encoded data includes the scene feature prediction encoded data, the residual prediction encoded data, the B-frame prediction encoded data, and the P-frame prediction encoded data. In this way, the I frames of the video stream can be reconstructed and encoded using the method of this implementation, which reduces the amount of encoded data of the I frames and reduces the redundant data of the I frames.
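A minimal sketch of this consistency check, assuming a hypothetical helper `extract_gist` that returns a global GIST feature vector for a frame (GIST extraction itself, based on oriented filter responses over a spatial grid, is omitted here):

```python
import numpy as np

def content_metric(frames, extract_gist):
    # Stack one global GIST feature vector per frame, then take the mean
    # per-dimension variance as the content metric: low variance means the
    # frames share consistent picture content.
    feats = np.stack([extract_gist(f) for f in frames])
    return float(feats.var(axis=0).mean())

def should_reconstruct(frames, extract_gist, threshold):
    # The joint reconstruction step is performed only when the metric does
    # not exceed the preset variance threshold.
    return content_metric(frames, extract_gist) <= threshold
```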
With reference to the first aspect of the embodiments of the present application, in a seventh implementation of the first aspect, the plurality of video frames include redundant data at local positions with respect to one another, and the corresponding reconstruction operation differs from the implementations above. That is, reconstructing the plurality of video frames to obtain the scene information and the reconstruction residual of each video frame includes: splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks, where the frame sub-blocks obtained by the splitting include the redundant data, some frame sub-blocks can be obtained based on other frame sub-blocks, and a frame sub-block is the frame content of a partial region of a video frame; and then reconstructing the plurality of frame sub-blocks to obtain a scene feature, a representation coefficient of each of the plurality of frame sub-blocks, and a reconstruction residual of each frame sub-block, where the scene feature includes a plurality of independent scene feature bases that cannot be reconstructed from one another within the scene feature, a scene feature basis is used to describe the picture content features of a frame sub-block, the representation coefficient represents the correspondence between scene feature bases and frame sub-blocks, and the reconstruction residual represents the difference between a frame sub-block and the scene feature bases. In this way, the reconstruction operation reduces the redundancy of the frame sub-blocks that include redundant data. The scene feature of this implementation is one concrete form of the scene information and can reduce the redundancy between locally redundant video frames. Correspondingly, predictively encoding the scene information to obtain the scene feature prediction encoded data includes: predictively encoding the scene feature to obtain the scene feature prediction encoded data.
With reference to the sixth implementation of the first aspect of the embodiments of the present application, in an eighth implementation of the first aspect, reconstructing the plurality of frame sub-blocks to obtain the scene feature, the representation coefficient of each of the plurality of frame sub-blocks, and the reconstruction residual of each frame sub-block includes: reconstructing the plurality of frame sub-blocks to obtain the representation coefficient of each of the plurality of frame sub-blocks and the reconstruction residual of each frame sub-block, where a representation coefficient represents the correspondence between a frame sub-block and a target frame sub-block, a target frame sub-block is an independent frame sub-block among the plurality of frame sub-blocks, an independent frame sub-block is a frame sub-block that cannot be reconstructed from the other frame sub-blocks, and the reconstruction residual is used to represent the difference between a target frame sub-block and a frame sub-block; and then combining the target frame sub-blocks indicated by the plurality of representation coefficients to obtain the scene feature, where the target frame sub-blocks are the scene feature bases. In this way, the target frame sub-blocks that can be represented independently are selected, and the frame sub-blocks that need not be represented independently are represented by the target frame sub-blocks and the reconstruction residuals, which reduces the redundant data between them; during encoding, only the target frame sub-blocks and the reconstruction residuals need to be encoded, reducing the amount of coding.
With reference to the eighth implementation of the first aspect of the embodiments of the present application, in a ninth implementation of the first aspect, reconstructing the plurality of frame sub-blocks to obtain the representation coefficient of each of the plurality of frame sub-blocks and the reconstruction residual of each frame sub-block includes: converting the plurality of frame sub-blocks into an observation matrix, where the observation matrix is used to represent the plurality of frame sub-blocks in matrix form; and then reconstructing the observation matrix according to a second constraint condition to obtain a representation coefficient matrix and a reconstruction residual matrix, where the representation coefficient matrix is a matrix including the representation coefficients of each of the plurality of frame sub-blocks, the non-zero coefficients of the representation coefficients indicate the target frame sub-blocks, the reconstruction residual matrix is used to represent the reconstruction residual of each frame sub-block in matrix form, and the second constraint condition requires the low-rankness and sparsity of the representation coefficients to meet preset requirements. Combining the target frame sub-blocks indicated by the plurality of representation coefficients to obtain the scene feature includes: combining the target frame sub-blocks indicated by the non-zero coefficients of the representation coefficient matrix to obtain the scene feature. In this way, the reconstruction operation can be performed in matrix form, and the second constraint condition yields reconstruction residuals and a scene feature that meet the requirement of reducing the amount of coding.
With reference to the ninth implementation of the first aspect of the embodiments of the present application, in a tenth implementation of the first aspect, reconstructing the observation matrix according to the second constraint condition to obtain the representation coefficient matrix and the reconstruction residual matrix includes: calculating the representation coefficient matrix and the reconstruction residual matrix according to a second preset formula, where the second preset formula is:
min_{C,E} ||C||_* + λ||E||_1,  s.t. D = DC + E

or, with an additional sparsity term on the coefficients,

min_{C,E} ||C||_* + λ||E||_1 + β||C||_1,  s.t. D = DC + E

Here, D is the observation matrix, C is the representation coefficient matrix, E is the reconstruction residual matrix, and λ and β are weight parameters. min_{C,E} denotes finding the optimal values of C and E, that is, the values of C and E that minimize the target formula ||C||_* + λ||E||_1 or ||C||_* + λ||E||_1 + β||C||_1; ||·||_* is the matrix nuclear norm, and ||·||_1 is the matrix L1 norm.
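Assuming the program above has been solved for C (for example by an ADMM-style solver, omitted here), a minimal sketch of how the non-zero rows of C pick out the target frame sub-blocks that form the scene feature:

```python
import numpy as np

def select_bases(D, C, tol=1e-6):
    # Under D ≈ D C + E, column j of D C reconstructs sub-block j from the
    # other sub-blocks; rows of C with non-negligible energy therefore mark
    # the columns of D used as scene feature bases.
    row_energy = np.abs(C).sum(axis=1)
    base_idx = np.where(row_energy > tol)[0]
    return D[:, base_idx]          # target frame sub-blocks, one per column
```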
With reference to the seventh implementation of the first aspect of the embodiments of the present application, in an eleventh implementation of the first aspect, reconstructing the plurality of frame sub-blocks to obtain the scene feature, the representation coefficient of each of the plurality of frame sub-blocks, and the reconstruction residual of each frame sub-block includes: reconstructing the plurality of frame sub-blocks to obtain the scene feature and the representation coefficient of each of the plurality of frame sub-blocks, where the scene feature bases included in the scene feature are independent feature blocks in the feature space, an independent feature block being a feature block that cannot be reconstructed from the other feature blocks in the scene feature; and then calculating the reconstruction residual of each frame sub-block according to the data reconstructed from the representation coefficients and the scene feature and according to each frame sub-block. In this way, a scene feature that can represent the plurality of frame sub-blocks as a whole is obtained by the reconstruction; the scene feature is composed of scene feature bases, which are independent feature blocks in the feature space, so if different frame sub-blocks are reconstructed from the same feature block, that feature block need not be stored repeatedly in the scene feature, thereby reducing redundant data.
With reference to the eleventh implementation manner of the first aspect of the embodiments of the present application, in a twelfth implementation manner of the first aspect of the embodiments of the present application, reconstructing the multiple frame sub-blocks to obtain the scene feature and the representation coefficient of each frame sub-block includes: converting the multiple frame sub-blocks into an observation matrix, where the observation matrix represents the multiple frame sub-blocks in matrix form; and reconstructing the observation matrix according to a third constraint condition to obtain a representation coefficient matrix and a scene feature matrix. The representation coefficient matrix is a matrix that includes the representation coefficient of each frame sub-block, the non-zero entries of a representation coefficient indicate scene feature bases, and the scene feature matrix represents the scene feature in matrix form. The third constraint condition requires that the similarity between the pictures reconstructed from the representation coefficient matrix and the scene feature matrix and the pictures of the frame sub-blocks meets a preset similarity threshold, that the sparsity of the representation coefficient matrix meets a preset sparsity threshold, and that the data amount of the scene feature matrix is less than a preset data amount threshold.

Calculating the reconstruction residual of each frame sub-block from the data reconstructed from the representation coefficient of each frame sub-block and the scene feature, together with each frame sub-block, includes: calculating a reconstruction residual matrix from the data reconstructed from the representation coefficient matrix and the scene feature matrix, and the observation matrix, where the reconstruction residual matrix represents the reconstruction residuals in matrix form.

In this way, the reconstruction operation can be performed in matrix form, and the third constraint condition is used to compute representation coefficients and a scene feature that meet the requirement of reducing the coding amount.
With reference to the twelfth implementation manner of the first aspect of the embodiments of the present application, in a thirteenth implementation manner of the first aspect of the embodiments of the present application, reconstructing the observation matrix according to the third constraint condition to obtain the representation coefficient matrix and the scene feature matrix includes: calculating the representation coefficient matrix and the scene feature matrix according to a third preset formula, where the third preset formula is:

$$\min_{F,C}\ \|D - FC\|_{F}^{2} + \lambda\|C\|_{1} + \beta\|C\|_{*}$$

where D is the observation matrix, C is the representation coefficient matrix, F is the scene feature matrix, and λ and β are weight parameters used to adjust the sparsity and low-rankness of the coefficients. The minimization denotes solving for the optimal values of F and C, that is, the values of F and C that minimize the formula.
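Because F and C multiply each other, an objective of this shape is not jointly convex, but it is convex in C for fixed F, and the F update is ordinary least squares. The sketch below approximates a solution by alternating minimization with cvxpy and NumPy. It assumes the regularized form reconstructed above (itself an interpretation of the elided formula), and all sizes and weights are illustrative.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
D = rng.standard_normal((64, 20))   # columns: vectorized frame sub-blocks
k = 8                               # number of scene feature bases
lam, beta = 0.1, 0.05

F = rng.standard_normal((64, k))    # scene feature matrix (initial guess)
for _ in range(5):                  # a few alternating rounds
    # Fix F: solve the convex sparse + low-rank coding problem for C.
    C_var = cp.Variable((k, D.shape[1]))
    cp.Problem(cp.Minimize(
        cp.sum_squares(D - F @ C_var)
        + lam * cp.sum(cp.abs(C_var))
        + beta * cp.normNuc(C_var)
    )).solve()
    C = C_var.value
    # Fix C: the F update is plain least squares, min_F ||D - F C||_F^2.
    F = np.linalg.lstsq(C.T, D.T, rcond=None)[0].T

E = D - F @ C                       # reconstruction residual matrix
print("residual energy:", float(np.square(E).mean()))
```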
With reference to any one of the seventh to thirteenth implementation manners of the first aspect of the embodiments of the present application, in a fourteenth implementation manner of the first aspect of the embodiments of the present application, before each of the multiple video frames is split to obtain multiple frame sub-blocks, the method of this implementation manner further includes: extracting picture feature information of each of the multiple video frames; and then calculating content metric information from the picture feature information, where the content metric information measures the difference in picture content among the multiple video frames. When the content metric information is greater than a preset metric threshold, the step of splitting each of the multiple video frames to obtain multiple frame sub-blocks is performed. Content metric information greater than the preset metric threshold indicates that the images of the multiple video frames contain locally redundant data, so the method of splitting the video frames and reconstructing the frame sub-blocks is used.
With reference to the fourteenth implementation manner of the first aspect of the embodiments of the present application, in a fifteenth implementation manner of the first aspect of the embodiments of the present application, the picture feature information is a global GIST feature and the preset metric threshold is a preset variance threshold, and calculating the content metric information from the picture feature information includes: calculating the scene GIST feature variance from the global GIST features. By calculating the scene GIST feature variance of the multiple video frames, the content consistency of the multiple video frames is measured, so as to determine whether the images of the multiple video frames contain locally redundant data and hence whether to use the method of splitting the video frames and reconstructing the frame sub-blocks.
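As a rough illustration of this fifteenth implementation manner, the sketch below computes per-frame global descriptors and their variance and compares the result with a preset variance threshold. A real GIST descriptor pools oriented Gabor-filter responses over a spatial grid; the grid-average feature here is only a simplified stand-in so the example stays self-contained, and the threshold value is invented for illustration.

```python
import numpy as np

def gist_like(frame: np.ndarray, grid: int = 4) -> np.ndarray:
    """Crude stand-in for a global GIST descriptor: average intensity over
    a grid x grid partition of the frame (a real GIST feature would pool
    oriented Gabor-filter responses over the same grid)."""
    h, w = frame.shape
    cells = frame[: h // grid * grid, : w // grid * grid]
    cells = cells.reshape(grid, h // grid, grid, w // grid)
    return cells.mean(axis=(1, 3)).ravel()

def content_metric(frames: list[np.ndarray]) -> float:
    """Scene GIST feature variance: variance of the per-frame descriptors."""
    feats = np.stack([gist_like(f) for f in frames])
    return float(feats.var(axis=0).mean())

frames = [np.random.default_rng(i).random((64, 64)) for i in range(6)]
PRESET_VARIANCE_THRESHOLD = 0.01  # illustrative value
if content_metric(frames) > PRESET_VARIANCE_THRESHOLD:
    print("split the frames into sub-blocks and reconstruct")
```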
With reference to any one of the seventh to thirteenth implementation manners of the first aspect of the embodiments of the present application, in a sixteenth implementation manner of the first aspect of the embodiments of the present application, acquiring the multiple video frames includes: acquiring a video stream whose video frames include I frames, B frames, and P frames; and extracting the I frames from the video stream, where the I frames are used to perform the step of splitting each of the multiple video frames to obtain multiple frame sub-blocks.

The method of this implementation manner further includes: performing reconstruction according to the scene feature, the representation coefficients, and the reconstruction residuals to obtain reference frames; performing inter-frame predictive coding on the B frames and P frames with the reference frames as references to obtain B frame prediction coded data and P frame prediction coded data; and performing transform coding, quantization coding, and entropy coding on the prediction coded data to obtain video compressed data, where the prediction coded data includes the scene feature prediction coded data, the residual prediction coded data, the B frame prediction coded data, and the P frame prediction coded data.

In this way, the method of this implementation manner can be applied to the key frames of a video stream, reducing the redundant data and coding amount of the key frames.
With reference to the first aspect of the embodiments of the present application or any one of the first to sixteenth implementation manners of the first aspect, in a seventeenth implementation manner of the first aspect of the embodiments of the present application, after the multiple video frames are acquired, the method of this implementation manner further includes: classifying the multiple video frames based on the correlation of their picture content to obtain video frames of one or more classification clusters, where the video frames of the same classification cluster are used to perform the step of reconstructing the multiple video frames to obtain the scene information and the reconstruction residual of each video frame. Through classification, the redundancy of the redundant data among the video frames belonging to the same classification cluster is greater, so that the subsequent video frame reconstruction stage removes more of the redundancy among the video frames.
With reference to the seventeenth implementation manner of the first aspect of the embodiments of the present application, in an eighteenth implementation manner of the first aspect of the embodiments of the present application, classifying the multiple video frames based on the correlation of their picture content to obtain video frames of one or more classification clusters includes: extracting feature information of each of the multiple video frames; determining a clustering distance between any two video frames from the feature information, where the clustering distance represents the similarity between the two video frames; and clustering the video frames according to the clustering distances to obtain the video frames of one or more classification clusters. In this way, the classification of the multiple video frames is realized by clustering.
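A minimal sketch of such clustering follows, assuming intensity histograms as the extracted feature information and Euclidean distance as the clustering distance (the text does not fix a particular feature or distance); scikit-learn's agglomerative clustering then groups frames whose distance stays below the threshold.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def frame_feature(frame: np.ndarray, bins: int = 32) -> np.ndarray:
    """Simple picture-content feature: normalized intensity histogram."""
    hist, _ = np.histogram(frame, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

frames = [np.random.default_rng(i).random((64, 64)) for i in range(10)]
X = np.stack([frame_feature(f) for f in frames])

# Pairwise Euclidean distance between the features acts as the clustering
# distance; frames closer than the threshold end up in the same cluster.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.05, linkage="average"
).fit_predict(X)
print("classification cluster of each video frame:", labels)
```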
With reference to the first aspect of the embodiments of the present application, in a nineteenth implementation manner of the first aspect of the embodiments of the present application, acquiring the multiple video frames includes: acquiring a video stream that includes multiple video frames; extracting feature information of a first video frame and a second video frame respectively, where the feature information describes the picture content of a video frame and the first and second video frames are video frames in the video stream; calculating the shot distance between the first video frame and the second video frame from the feature information; and determining whether the shot distance is greater than a preset shot threshold. If the shot distance is greater than the preset shot threshold, a target shot is segmented from the video stream, where the start frame of the target shot is the first video frame and the end frame of the target shot is the video frame preceding the second video frame; if the shot distance is less than the preset shot threshold, the first video frame and the second video frame are assigned to the same shot. The target shot is one of the shots of the video stream, and a shot is a temporally continuous sequence of video frames. For each shot in the video stream, key frames are extracted according to the frame distances between the video frames within the shot, where within each shot the frame distance between any two adjacent key frames is greater than a preset frame distance threshold, the frame distance represents the degree of difference between two video frames, and the key frames of each shot are used to perform the step of reconstructing the multiple video frames to obtain the scene information and the reconstruction residual of each video frame. After shot segmentation, key frames are extracted from each shot according to distance; this extraction method uses the context information of the video stream and allows the method of this implementation manner to be applied to a video stream, as sketched below.
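The following sketch mirrors this shot segmentation and key frame extraction logic, with normalized intensity histograms and an L1 distance standing in for whatever feature information and distance measure an implementation would actually use; the thresholds and the toy frames are illustrative.

```python
import numpy as np

def feature(frame: np.ndarray, bins: int = 32) -> np.ndarray:
    hist, _ = np.histogram(frame, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

def distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.abs(feature(a) - feature(b)).sum())

def split_shots(frames, shot_threshold: float):
    """Cut the stream whenever the distance from the shot's start frame
    to the current frame exceeds the shot threshold; the end frame of
    each shot is the frame preceding the cut."""
    shots, start = [], 0
    for i in range(1, len(frames)):
        if distance(frames[start], frames[i]) > shot_threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(frames) - 1))
    return shots

def key_frames(frames, shot, frame_threshold: float):
    """Within one shot, keep a frame as a key frame only if it differs
    from the last kept key frame by more than the frame threshold."""
    s, e = shot
    keys = [s]
    for i in range(s + 1, e + 1):
        if distance(frames[keys[-1]], frames[i]) > frame_threshold:
            keys.append(i)
    return keys

frames = [np.full((8, 8), v) for v in (0.20, 0.21, 0.80, 0.805)]
shots = split_shots(frames, shot_threshold=0.5)
print(shots, [key_frames(frames, s, frame_threshold=0.5) for s in shots])
```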
With reference to the nineteenth implementation manner of the first aspect of the embodiments of the present application, in a twentieth implementation manner of the first aspect of the embodiments of the present application, before the multiple video frames are reconstructed to obtain the scene information and the reconstruction residual of each video frame, the method further includes: performing discriminant training on each shot segmented from the video stream to obtain multiple classifiers, each corresponding to a shot; discriminating a target video frame with a target classifier to obtain a discriminant score, where the target classifier is one of the multiple classifiers, the target video frame is one of the key frames, and the discriminant score indicates the degree to which the target video frame belongs to the scene of the shot to which the target classifier belongs; when the discriminant score is greater than a preset score threshold, determining that the target video frame and the shot to which the target classifier belongs belong to the same scene; and determining the video frames of one or more classification clusters from the video frames that belong to the same scene as each shot.
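Since the document elsewhere refers to an SVM-based classification method (see FIG. 10), the sketch below trains one scikit-learn SVC per shot on toy features, using the frames of that shot as positive samples and all other frames as negatives, and uses the decision-function value as the discriminant score; the feature layout and the score threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Toy per-frame features for two shots (each row is one frame).
shot_feats = {
    "shot_a": rng.normal(0.0, 1.0, (30, 16)),
    "shot_b": rng.normal(3.0, 1.0, (30, 16)),
}

# One discriminant classifier per shot: the shot's frames are positive
# samples and the frames of all other shots are negative samples.
classifiers = {}
for name, pos in shot_feats.items():
    neg = np.vstack([f for n, f in shot_feats.items() if n != name])
    X = np.vstack([pos, neg])
    y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    classifiers[name] = SVC(kernel="linear").fit(X, y)

SCORE_THRESHOLD = 0.0  # illustrative preset score threshold
key_frame = rng.normal(0.0, 1.0, (1, 16))
for name, clf in classifiers.items():
    score = clf.decision_function(key_frame)[0]
    if score > SCORE_THRESHOLD:
        print(f"key frame assigned to the scene of {name} (score={score:.2f})")
```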
With reference to the first aspect of the embodiments of the present application, in a twenty-first implementation manner of the first aspect of the embodiments of the present application, acquiring the multiple video frames includes: acquiring a compressed video stream that includes compressed video frames; determining multiple target video frames from the compressed video stream, where a target video frame is a video frame that is independently compression-coded in the compressed video stream; and decoding the target video frames to obtain decoded target video frames, where the decoded target video frames are used to perform the step of splitting each of the multiple video frames to obtain multiple frame sub-blocks. In this way, the independently compression-coded video frames are extracted from the compressed video stream, and applying the video encoding method of this implementation manner to these video frames can further reduce their amount of encoded data.
A second aspect of the embodiments of the present invention provides a video decoding method, which includes: acquiring scene feature prediction coded data and residual prediction coded data; decoding the scene feature prediction coded data to obtain scene information, where the scene information includes data obtained by reducing the redundancy of redundant data and the redundant data is the redundant picture-content data among multiple video frames; decoding the residual prediction coded data to obtain reconstruction residuals, where a reconstruction residual represents the difference between a video frame and the scene information; and performing reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames. In this way, the scene feature prediction coded data and residual prediction coded data produced by the video encoding method of the first aspect can be decoded by the video decoding method of this implementation manner.
With reference to the second aspect of the embodiments of the present application, in a first implementation manner of the second aspect of the embodiments of the present application, the multiple video frames share the same picture content among them, and decoding the scene feature prediction coded data to obtain the scene information includes: decoding the scene feature prediction coded data to obtain a scene feature, where the scene feature represents the picture content shared by the video frames. Performing reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames includes: performing reconstruction according to the scene feature and the reconstruction residuals to obtain the multiple video frames. Thus, when the scene feature represents shared picture content, the scene feature information can be decoded by this implementation manner.
With reference to the first implementation manner of the second aspect of the embodiments of the present application, in a second implementation manner of the second aspect of the embodiments of the present application, acquiring the scene feature prediction coded data and the residual prediction coded data includes: acquiring video compressed data; and performing entropy decoding, inverse quantization, and inverse DCT on the video compressed data to obtain prediction coded data, where the prediction coded data includes the scene feature prediction coded data, the residual prediction coded data, the B frame prediction coded data, and the P frame prediction coded data.

Performing reconstruction according to the scene feature and the reconstruction residuals to obtain the multiple video frames includes: performing reconstruction according to the scene feature and the reconstruction residuals to obtain multiple I frames.

The method of this implementation manner further includes: performing inter-frame decoding on the B frame prediction coded data and the P frame prediction coded data with the I frames as reference frames to obtain the B frames and P frames; and arranging the I frames, B frames, and P frames in chronological order to obtain the video stream.

In this way, when the above video frame encoding scheme is used on a video stream, the video stream can be decoded by this implementation manner.
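A minimal sketch of the tail of this decoding pipeline, the inverse quantization and inverse DCT stages (entropy decoding is omitted), using SciPy's DCT routines; the flat 8x8 quantization matrix is an illustrative placeholder, not a matrix prescribed by any standard.

```python
import numpy as np
from scipy.fftpack import idct

def idct2(block: np.ndarray) -> np.ndarray:
    """2-D inverse type-II DCT with orthonormal scaling."""
    return idct(idct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

Q = np.full((8, 8), 16.0)  # illustrative flat quantization matrix

def decode_block(qcoeff: np.ndarray) -> np.ndarray:
    """Inverse quantization followed by inverse DCT, mirroring the
    entropy-decode -> dequantize -> inverse-DCT order in the text."""
    return idct2(qcoeff * Q)

qcoeff = np.zeros((8, 8)); qcoeff[0, 0] = 4  # one quantized DC coefficient
print(decode_block(qcoeff)[:2, :2])
```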
With reference to the second aspect of the embodiments of the present application, in a third implementation manner of the second aspect of the embodiments of the present application, the method of this implementation manner further includes: acquiring representation coefficients. Decoding the scene feature prediction coded data to obtain the scene information includes: decoding the scene feature prediction coded data to obtain a scene feature, where the scene feature includes multiple independent scene feature bases, the independent scene feature bases within the scene feature cannot be reconstructed from one another, a scene feature basis describes the picture-content features of a frame sub-block, a representation coefficient represents the correspondence between scene feature bases and a frame sub-block, and a reconstruction residual represents the difference between a frame sub-block and the scene feature bases.

Performing reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames includes: performing reconstruction according to the scene feature, the representation coefficients, and the reconstruction residuals to obtain multiple frame sub-blocks; and combining the multiple frame sub-blocks to obtain the multiple video frames.

In this way, after the frame sub-blocks have been reconstructed and encoded into coded data, the video decoding method of this implementation manner can decode the scene feature and the reconstruction residuals, reconstruct the multiple frame sub-blocks from them, and reassemble the sub-blocks to obtain the video frames, as sketched below.
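A toy sketch of this reconstruction path: each frame sub-block is rebuilt as its scene-feature reconstruction plus its residual, and the sub-blocks are then reassembled in raster order into one frame. The linear model D = F C + E, the raster-order layout, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
k, block, grid = 8, 16, 4        # bases, sub-block edge, blocks per edge
n_blocks = grid * grid

F = rng.standard_normal((block * block, k))  # decoded scene feature bases
C = rng.standard_normal((k, n_blocks))       # decoded representation coefficients
E = rng.standard_normal((block * block, n_blocks)) * 0.01  # decoded residuals

# Each frame sub-block is its scene-feature reconstruction plus residual.
D = F @ C + E                                # columns: vectorized sub-blocks

# Reassemble the sub-blocks into one video frame in raster order.
frame = np.zeros((grid * block, grid * block))
for idx in range(n_blocks):
    r, c = divmod(idx, grid)
    frame[r * block:(r + 1) * block,
          c * block:(c + 1) * block] = D[:, idx].reshape(block, block)
print("reconstructed frame shape:", frame.shape)
```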
With reference to the third implementation manner of the second aspect of the embodiments of the present application, in a fourth implementation manner of the second aspect of the embodiments of the present application, acquiring the scene feature prediction coded data and the residual prediction coded data includes: acquiring video compressed data; and performing entropy decoding, inverse quantization, and inverse DCT on the video compressed data to obtain prediction coded data, where the prediction coded data includes the scene feature prediction coded data, the residual prediction coded data, the B frame prediction coded data, and the P frame prediction coded data.

Combining the multiple frame sub-blocks to obtain the multiple video frames includes: combining the multiple frame sub-blocks to obtain multiple I frames.

The method of this implementation manner further includes: performing inter-frame decoding on the B frame prediction coded data and the P frame prediction coded data with the I frames as reference frames to obtain the B frames and P frames; and arranging the I frames, B frames, and P frames in chronological order to obtain the video stream.

After the I frames in a video stream have been split into frame sub-blocks and the sub-blocks have been reconstructed into reconstruction residuals, a scene feature, and representation coefficients, the video decoding method of this implementation manner can decode them and restore the video stream.
A third aspect of the embodiments of the present invention provides a video encoding device that has the function of performing the above video encoding method. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.

In a possible implementation manner, the video encoding device includes:

an acquisition module, configured to acquire multiple video frames, where the video frames include redundant picture-content data among them;

a reconstruction module, configured to reconstruct the multiple video frames to obtain scene information and a reconstruction residual for each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data and a reconstruction residual represents the difference between a video frame and the scene information; and

a prediction encoding module, configured to predictively encode the scene information to obtain scene feature prediction coded data,

the prediction encoding module being further configured to predictively encode the reconstruction residuals to obtain residual prediction coded data.
In another possible implementation manner, the video encoding device includes a video encoder.

The video encoder acquires multiple video frames, where the video frames include redundant picture-content data among them.

The video encoder also reconstructs the multiple video frames to obtain scene information and a reconstruction residual for each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data and a reconstruction residual represents the difference between a video frame and the scene information.

The video encoder also predictively encodes the scene information to obtain scene feature prediction coded data.

The video encoder also predictively encodes the reconstruction residuals to obtain residual prediction coded data.
A fourth aspect of the embodiments of the present invention provides a video decoding device that has the function of performing the above video decoding method. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.

In a possible implementation manner, the video decoding device includes:

an acquisition module, configured to acquire scene feature prediction coded data and residual prediction coded data;

a scene information decoding module, configured to decode the scene feature prediction coded data to obtain scene information, where the scene information includes data obtained by reducing the redundancy of redundant data and the redundant data is the redundant picture-content data among the multiple video frames;

a reconstruction residual decoding module, configured to decode the residual prediction coded data to obtain reconstruction residuals, where a reconstruction residual represents the difference between a video frame and the scene information; and

a video frame reconstruction module, configured to perform reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames.
In another possible implementation manner, the video decoding device includes a video decoder.

The video decoder acquires scene feature prediction coded data and residual prediction coded data.

The video decoder also decodes the scene feature prediction coded data to obtain scene information, where the scene information includes data obtained by reducing the redundancy of redundant data and the redundant data is the redundant picture-content data among the multiple video frames.

The video decoder also decodes the residual prediction coded data to obtain reconstruction residuals, where a reconstruction residual represents the difference between a video frame and the scene information.

The video decoder also performs reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames.
A fifth aspect of the embodiments of the present invention provides a video codec device that includes a video encoding device and a video decoding device, where the video encoding device is the video encoding device provided in the third aspect above and the video decoding device is the video decoding device provided in the fourth aspect above.
A sixth aspect of the embodiments of the present invention provides a computer storage medium that stores program code, where the program code is used to instruct execution of the method of the first aspect above.

A seventh aspect of the embodiments of the present invention provides a computer storage medium that stores program code, where the program code is used to instruct execution of the method of the second aspect above.

Yet another aspect of the present application provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to perform the methods of the above aspects.

Yet another aspect of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the methods of the above aspects.
It can be seen from the above technical solutions that the embodiments of the present invention have the following advantages:

Multiple video frames are acquired, where the video frames include redundant picture-content data among them. The multiple video frames are then reconstructed to obtain scene information and a reconstruction residual for each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data and a reconstruction residual represents the difference between a video frame and the scene information. The scene information is then predictively encoded to obtain scene feature prediction coded data, and the reconstruction residuals are predictively encoded to obtain residual prediction coded data. Reconstructing the multiple video frames in this way reduces their redundancy, so that in the encoding operation the total amount of compressed data for the scene feature and the reconstruction residuals is smaller than the amount of compressed data for the original video frames. Because each video frame is reconstructed into the scene feature and a reconstruction residual, and the reconstruction residual contains only the residual information beyond the scene information, the residual carries little information and is sparse; during predictive coding it can therefore be coded with fewer codewords, yielding a small amount of encoded data and a high compression ratio. Thus, the method of the embodiments of the present invention can effectively improve the compression efficiency of video frames.
FIG. 1 is a schematic diagram of existing HEVC coding;
FIG. 2 is a flowchart of a video frame encoding and decoding method according to an embodiment of the present invention;
FIG. 3a is a schematic comparison of the flow of a video encoding method according to another embodiment of the present invention and the flow of the existing HEVC encoding method;
FIG. 3b is a schematic diagram of a scenario involved in a video encoding method according to another embodiment of the present invention;
FIG. 4a is a schematic comparison of the flow of a video decoding method according to another embodiment of the present invention and the flow of the existing HEVC decoding method;
FIG. 4b is a schematic diagram of a scenario involved in a video decoding method according to another embodiment of the present invention;
FIG. 5 is a flowchart of a video encoding method according to another embodiment of the present invention;
FIG. 6 is a flowchart of a video decoding method according to another embodiment of the present invention;
FIG. 7 is a flowchart of the shot segmentation method of the video encoding method shown in FIG. 5;
FIG. 8 is a flowchart of the key frame extraction method of the video encoding method shown in FIG. 5;
FIG. 9 is a flowchart of the scene classification method of the video encoding method shown in FIG. 5;
FIG. 10 is a flowchart of the SVM-based classification method of the video encoding method shown in FIG. 5;
FIG. 11 is a flowchart of the RPCA-based scene reconstruction method of the video encoding method shown in FIG. 5;
FIG. 12 is a flowchart of a video encoding method according to another embodiment of the present invention;
FIG. 13 is a schematic diagram of a scenario of the video encoding method shown in FIG. 12;
FIG. 14 is a schematic diagram of a scenario of one specific method of the video encoding method shown in FIG. 12;
FIG. 15 is a schematic diagram of a scenario of another specific method of the video encoding method shown in FIG. 12;
FIG. 16 is a schematic diagram of a scenario of yet another specific method of the video encoding method shown in FIG. 12;
FIG. 17 is a flowchart of a video decoding method according to another embodiment of the present invention;
FIG. 18a is a schematic structural diagram of a video encoding device according to another embodiment of the present invention;
FIG. 18b is a schematic diagram of a partial structure of the video encoding device shown in FIG. 18a;
FIG. 19 is a schematic structural diagram of a video decoding device according to another embodiment of the present invention;
FIG. 20 is a schematic structural diagram of a video codec device according to another embodiment of the present invention;
FIG. 21 is a schematic block diagram of a video codec system 10 according to an embodiment of the present invention;
FIG. 22 is a block diagram illustrating an example video encoder 20 configured to implement the techniques of the present invention;
FIG. 23 is a block diagram illustrating an example video decoder 30 configured to implement the techniques of the present invention.
The embodiments of the present invention provide a video encoding method, a video decoding method, a video encoding device, and a video decoding device, which improve the compression efficiency of video frames so as to reduce the network transmission burden and the storage burden of video frames.
Independently coded video frames often produce a large amount of compressed data after encoding, and a large amount of information redundancy remains between the compressed frames, which increases the network transmission and storage burden and is unfavorable for data access.

To this end, in the video encoding method of the embodiments of the present invention, the encoding device acquires multiple video frames that include redundant picture-content data among them, and reconstructs the multiple video frames to obtain scene information and a reconstruction residual for each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data and a reconstruction residual represents the difference between a video frame and the scene information. The scene information is then predictively encoded to obtain scene feature prediction coded data, and the reconstruction residuals are predictively encoded to obtain residual prediction coded data. Reconstructing the multiple video frames in this way reduces their redundancy, so the total amount of compressed data for the scene feature and the reconstruction residuals is smaller than that of the original video frames; and because a reconstruction residual contains only the residual information beyond the scene information, it carries little information and is sparse, so it can be predictively coded with fewer codewords, yielding a small amount of encoded data and a high compression ratio. The method of the embodiments of the present invention can therefore effectively improve the compression efficiency of video frames.

Correspondingly, the embodiments of the present invention further provide a video decoding method for decoding the scene feature prediction coded data and residual prediction coded data produced by the above video encoding device: after the scene information and the reconstruction residuals are obtained, the video frames are reconstructed from the scene information and the reconstruction residuals.
In the HEVC standard, key frames, also called I frames, are coded independently; after compression, I frames account for a high proportion of the compressed data, and a large amount of information redundancy exists between I frames. Applying the video encoding method of the embodiments of the present invention to I frames during encoding can therefore improve the coding efficiency of I frames.

To describe the video frame encoding and decoding methods provided by the embodiments of the present invention more intuitively, some of the following content uses HEVC-standard scenarios; to facilitate understanding of the full text, the HEVC standard is first briefly introduced.
HEVC (H.265) is a widely used and successful video coding standard. HEVC is a block-based hybrid coding method that includes modules such as prediction, transform, quantization, entropy coding, and loop filtering. The prediction module is the core module of the HEVC codec and can be divided into an intra prediction (Intra Prediction) module and an inter prediction (Inter Prediction) module. Intra prediction generates prediction values from pixels already coded within the current image. Inter prediction generates prediction values from reconstructed pixels of images coded before the current image. Because inter prediction codes residuals, its compression ratio is relatively high.
The intra prediction module of the existing HEVC standard uses only information within the current image frame for encoding and decoding, and frames are selected by a fixed strategy along the video time axis without considering the contextual information of the video; consequently, the coding efficiency is low and the compression ratio is not high. For example:

1) Scenario one: in a film, characters A and B hold a conversation, and the director frequently cuts between A and B to convey their inner emotions. Here, it is appropriate to segment and cluster all shots related to A and perform inter-frame and intra-frame predictive coding on them together.

2) Scenario two: a television series is mainly shot on grassland, beach, and office locations. Here, it is appropriate to recognize and classify all grassland, beach, and office scenes, extract the scene feature information jointly, and express and predict the key frames.
FIG. 1 illustrates the HEVC predictive coding process. Referring to FIG. 1, HEVC predictive coding uses both intra-frame and inter-frame compression. Before encoding, the GOP step size is set, that is, the number of frames contained in a GOP. To limit the effect of motion changes, the number of frames should not be set too high. In the predictive coding stage, HEVC divides all frames into three types, I, P, and B, as shown in FIG. 1. The number above each frame in FIG. 1 indicates the frame's position in the original video sequence. Encoding proceeds in units of GOPs, coding the I frames, P frames, and B frames in turn.
An I frame (intra-frame), also called an intra-coded frame, is an independent frame that carries all of its own information; it can be encoded and decoded independently without reference to other images, and can be understood simply as a static picture. Usually the first frame of each GOP is set as the I frame, so the GOP length also represents the interval between two adjacent I frames. The I frame provides the most critical information in the GOP and carries a relatively large amount of data, so its compression ratio is relatively poor, generally around 7:1.
The specific I frame coding process is as follows:

1) Perform intra prediction and decide which intra prediction mode to use;

2) subtract the prediction values from the pixel values to obtain the residual;

3) transform and quantize the residual (a sketch follows this list);

4) perform variable-length coding and arithmetic coding;

5) reconstruct the image and filter it; the resulting image serves as a reference frame for other frames.
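As a concrete illustration of step 3, the sketch below applies a 2-D DCT and uniform quantization to one 8x8 residual block using SciPy; the flat quantization matrix is an illustrative placeholder rather than an HEVC-specified matrix (HEVC actually uses integer transforms rather than a floating-point DCT).

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block: np.ndarray) -> np.ndarray:
    """2-D type-II DCT with orthonormal scaling."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

Q = np.full((8, 8), 16.0)  # illustrative flat quantization matrix

def transform_and_quantize(residual_block: np.ndarray) -> np.ndarray:
    """Step 3 on one 8x8 residual block: DCT then uniform quantization.
    The quantized coefficients would then go to entropy coding (step 4)."""
    return np.round(dct2(residual_block) / Q).astype(np.int32)

residual = np.random.default_rng(4).integers(-32, 32, (8, 8)).astype(float)
print(transform_and_quantize(residual))
```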
A P frame (predictive frame), also called an inter-frame predictive coded frame, must refer to a preceding I frame to be coded. It represents the difference between the current frame and a preceding frame (which may be an I frame or a P frame). During decoding, the difference defined by this frame is added to the previously buffered picture to generate the final picture. Compared with I frames, P frames usually occupy fewer data bits, but because a P frame has complex dependencies on the preceding P and I reference frames, it is very sensitive to transmission errors. Since residual coding is used, the amount of coding information needed for a P frame is greatly reduced relative to an I frame, and the compression ratio is relatively high, generally around 20:1.

A B frame (bi-directional frame), also called a bidirectional predictive coded frame, records the differences between the current frame and both the preceding and following frames. To decode a B frame, both the previously buffered picture and the decoded following P frame picture are needed, and the final picture is obtained by interpolating between the preceding and following pictures and adding the data of this frame. B frames have a high compression rate but place higher demands on decoding performance. A B frame is not a reference frame, so it does not propagate decoding errors. B frames have the highest coding compression ratio, generally around 50:1.
The specific B/P frame coding process is as follows:

1) Perform motion estimation and compute the rate-distortion cost of the inter-frame coding modes (see the sketch after this list). P frames refer only to preceding frames; B frames may also refer to following frames.

2) Perform intra prediction, compare the intra mode with the smallest rate-distortion cost against the inter mode, and determine which coding mode to use.

3) Compute the difference between the actual values and the predicted values.

4) Transform and quantize the residual.

5) Perform entropy coding and, in inter coding mode, encode the motion vectors.
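Step 1's motion estimation can be illustrated with the classic full-search block-matching scheme below, which scans a small displacement window and keeps the motion vector with the lowest sum of absolute differences (SAD); the block size, search range, and toy frames are illustrative, and real encoders use much faster search strategies.

```python
import numpy as np

def motion_estimate(ref: np.ndarray, cur: np.ndarray,
                    block: int = 8, search: int = 4):
    """Full-search block matching: for each block of the current frame,
    find the displacement (within +/- search pixels) into the reference
    frame that minimizes the SAD."""
    h, w = cur.shape
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            target = cur[by:by + block, bx:bx + block]
            best, best_sad = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - block and 0 <= x <= w - block:
                        sad = np.abs(ref[y:y + block, x:x + block] - target).sum()
                        if sad < best_sad:
                            best, best_sad = (dy, dx), sad
            vectors[(by, bx)] = best
    return vectors

rng = np.random.default_rng(5)
ref = rng.random((32, 32))
cur = np.roll(ref, shift=(2, 1), axis=(0, 1))  # content shifted by (2, 1)
print(motion_estimate(ref, cur)[(8, 8)])        # expect (-2, -1)
```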
The HEVC decoding process is the inverse of the encoding process and is not described again here.

The HEVC codec relies too heavily on I frame coding and has the following drawbacks:
1) I frame compressed data accounts for a high proportion of the total. I frame coding spatially compresses only intra-frame data without considering the redundant information between adjacent frames, so the amount of compressed data is large, usually about ten times that of a P frame. The GOP step size must be preset before encoding, and the proportion of I frames is determined by this setting. As shown in FIG. 1, when the GOP step size is set to 13, the ratio of I frames to B/P frames is 1:12. Computed from the respective compression ratios of the I, B, and P frames, the final ratio of I frame to B/P frame compressed data is roughly 2:5. A larger GOP step size can usually be set to lower the I frame proportion and improve the overall compression ratio of the video, but this also degrades the quality of the compressed video.

2) A large amount of information redundancy exists between I frames. I frames are extracted sequentially along the time axis, with adjacent I frames separated by the GOP step size. This selection strategy does not take the contextual information of the video into account. For example, for two video segments that are not temporally contiguous but whose picture content is highly correlated, extracting I frames by GOP step size and intra-coding them separately causes a large amount of information redundancy.
To address the problems that the original HEVC relies too heavily on I frame coding and that its compression efficiency is not high, the embodiments of the present invention propose a video codec algorithm based on intelligent video scene classification. By recognizing and classifying video shots and scenes, the method performs holistic data analysis and reconstruction on the key frames (I frames) and encodes the scene information and the representation residuals. This effectively avoids the low intra-frame compression efficiency of individual key frames and, by introducing video context information, improves the compression ratio.
FIG. 2 is a flowchart of a video frame encoding and decoding method according to an embodiment of the present invention; the method includes an encoding part and a decoding part. Referring to FIG. 2, the video frame encoding and decoding method includes:

Step 201: Acquire multiple video frames.

The multiple video frames include redundant picture-content data among them.
The multiple video frames may be extracted from a video stream according to preset rules after the video stream is acquired, or the video codec may acquire them from another device; this is not specifically limited in the embodiments of the present invention. In the embodiments of the present invention, "multiple" means at least two.

The redundant data is data that is correlated in picture content among the multiple video frames, that is, information redundancy exists. The redundant data may be redundancy over the whole picture of the video frames, as described in the embodiment shown in FIG. 5 below, or redundancy over local parts of the picture, as described in the embodiment shown in FIG. 12.
In some embodiments of the present invention, the multiple video frames are obtained from a video stream. Specifically, given the overall video data stream, the codec device segments the video into shots using scene transition detection and determines whether each shot is static, and then extracts video frames from each shot according to its type.

For example, in the shot segmentation step, the original video stream is segmented into shot units of varying length by scene transition detection, where each shot consists of temporally continuous video frames and represents a temporally and spatially continuous action within one scene. A specific shot segmentation method may perform boundary segmentation and discrimination according to changes in the content of the video frames; for example, by locating the shot boundaries and finding the positions or time points of the boundary frames, the video can be segmented accordingly.

After the shots are segmented from the video stream, video frames are extracted from each shot on the basis of the segmentation; the extracted video frames are the video frames to be acquired in step 201. The extraction is adapted to the shot length and content changes and may yield one or more frames that reflect the main information content of the shot.

Of course, in some embodiments of the present invention, the codec device may extract the multiple video frames that undergo the following encoding method directly from the video stream, for example by extracting video frames at a preset step size.
Step 202: Reconstruct the multiple video frames to obtain scene information and a reconstruction residual for each video frame.

The scene information includes data obtained by reducing the redundancy of the redundant data, and a reconstruction residual represents the difference between a video frame and the scene information.

Reconstruction reduces the redundancy of the multiple video frames. There are several specific reconstruction methods and, correspondingly, the resulting scene information takes several forms, as detailed in the description below. The scene information includes the data obtained after the redundancy of the inter-frame redundant data has been reduced, and each reconstruction residual represents the difference between one video frame and the scene feature. Consequently, the scene information and reconstruction residuals obtained by reconstructing the multiple video frames have less redundancy than the original video frames, reduce the overall amount of data, and preserve the complete information, as illustrated by the sketch below.
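As a deliberately minimal illustration of this idea (not the reconstruction method of the embodiments), the sketch below takes the per-pixel mean of a group of frames as the scene information and the per-frame differences as the reconstruction residuals; each frame is then recoverable losslessly as scene information plus residual, while the residuals themselves are small and compress well.

```python
import numpy as np

rng = np.random.default_rng(6)
scene = rng.random((64, 64))
# Frames sharing one scene plus small per-frame variation (redundant data).
frames = [scene + 0.02 * rng.standard_normal((64, 64)) for _ in range(8)]

# Simplest possible "scene information": the per-pixel mean of the frames.
scene_info = np.mean(frames, axis=0)

# Each reconstruction residual is the difference between a frame and the
# scene information; the residuals are small and sparse in magnitude.
residuals = [f - scene_info for f in frames]

# Lossless round trip: frame = scene information + its residual.
assert np.allclose(frames[0], scene_info + residuals[0])
print("mean |residual|:", np.mean([np.abs(r).mean() for r in residuals]))
```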
Step 202 may be called the scene reconstruction operation. Scene reconstruction analyzes the content of the video frames and extracts scene information suitable for representing the overall scene. In some embodiments the scene information includes a scene feature; in others it includes a scene feature and representation coefficients. A scene feature is feature information capable of describing the whole or part of the scene's picture content; it may be a specific frame picture or local image block in the original pixel space, or a feature basis in a feature representation space, such as a wavelet basis or a sparse coding dictionary basis.

The purpose of scene reconstruction is to reduce the content redundancy of the key frames within a scene. The principle of scene feature extraction is that the scene feature should be concise and occupy a small amount of data, while the data reconstructed from the scene information should match the original image as closely as possible so that the reconstruction residual is small. The scene reconstruction operation directly affects the compression performance of the video coding.
在本发明有的实施例中,在步骤202之前,本发明实施例的方法还包括对该多个视频帧进行分类的操作,例如,基于画面内容的相关性对该多个视频帧进行分类,得到一个或多个分类簇的视频帧,后续以同一分类簇的视频帧执行步骤202。其中属于同一分类簇的多个视频帧间的冗余数据的冗余度符合预设要求,例如大于一阈值。In an embodiment of the present invention, before step 202, the method of the embodiment of the present invention further includes the operation of classifying the plurality of video frames, for example, classifying the plurality of video frames based on the correlation of the screen content, A video frame of one or more clusters is obtained, and step 202 is performed subsequent to the video frames of the same cluster. The redundancy of redundant data between multiple video frames belonging to the same cluster is in accordance with a preset requirement, for example, greater than a threshold.
具体的分类方法包括多种,例如基于聚类的方法、使用分类器的方法等,例如,对关键帧进行特征提取和描述,在特征空间对关键帧进行聚类。具体的实现过程详见下述实施例的描述,本发明实施例对此不作具体限定。The specific classification methods include various methods, such as a cluster-based method, a method using a classifier, and the like, for example, feature extraction and description of key frames, and clustering key frames in the feature space. The specific implementation process is described in detail in the following embodiments, which are not specifically limited in this embodiment of the present invention.
在本发明的一些实施例中,通过对视频流进行分割后得到多个镜头,对每一镜头提取出执行本发明实施例的方法的视频帧,此时,因一镜头提取的视频帧可反映该镜头的特征,从而,对提取的视频帧进行分类,也可称之为对镜头进行场景分类。场景分类的目的是将内容上强相关的从镜头提取的视频帧结合起来,以便之后对整个场景内容进行分析。场景分类具体策略是通过对各镜头关键帧进行分析和聚类实现。场景分类的原则是每个分类簇中的视频帧在画面内容上高度相关,存在大量信息冗余。该操作对后续场景重构操作起到 了决定性作用,分类效果越好,类内信息高度聚合,信息冗余量越大,则编码效率越高。In some embodiments of the present invention, by dividing a video stream to obtain a plurality of shots, a video frame for performing the method of the embodiment of the present invention is extracted for each shot. At this time, the video frame extracted by one shot can be reflected. The characteristics of the lens, and thus the classification of the extracted video frames, can also be referred to as scene classification of the lens. The purpose of scene classification is to combine video frames extracted from the lens that are strongly related in content, so that the entire scene content can be analyzed later. The specific strategy of scene classification is realized by analyzing and clustering key frames of each lens. The principle of scene classification is that the video frames in each cluster are highly correlated on the screen content, and there is a large amount of information redundancy. This operation plays a decisive role in the subsequent scene reconstruction operation. The better the classification effect is, the intra-class information is highly aggregated, and the larger the information redundancy, the higher the coding efficiency.
Step 203: Perform predictive coding on the scene information to obtain scene feature prediction coded data.
After the scene information is obtained, it can be predictively coded to obtain the scene feature prediction coded data.
Step 204: Perform predictive coding on the reconstruction residuals to obtain residual prediction coded data.
After the reconstruction residuals are obtained, they can be predictively coded to obtain the residual prediction coded data. In the concrete coding, intra prediction coding or inter prediction coding may be employed.
After the reconstruction operation of step 202, the reconstruction residual no longer includes the scene feature and is therefore sparse; for example, when a reconstruction residual is represented as a matrix, the vast majority of its entries are 0 and only a few values are non-zero, so the amount of coded information is small.
Because the scene information and the reconstruction residuals reduce the redundancy of the redundant data compared with the original plurality of video frames, the amount of data to be coded is reduced, and hence so is the amount of the scene feature prediction coded data and residual prediction coded data obtained after coding. Moreover, since each video frame is represented by the scene information and a reconstruction residual, and the reconstruction residual represents the difference between the video frame and the scene feature, the reconstruction residual is sparse, which reduces the amount of information needed to code it.
The above steps 201 to 204 constitute the video encoding method; the following are the steps of the video decoding method.
Step 205: Acquire the scene feature prediction coded data and the residual prediction coded data.
The video codec device acquires the already-encoded scene feature prediction coded data and residual prediction coded data.
Step 206: Decode the scene feature prediction coded data to obtain the scene information.
The video codec device decodes the scene feature prediction coded data to obtain the scene information. As described above, the scene information includes data obtained by reducing the redundancy of redundant data, the redundant data being the data redundant in picture content between each of the plurality of video frames.
Step 207: Decode the residual prediction coded data to obtain the reconstruction residuals.
The video codec device also decodes the residual prediction coded data to obtain the reconstruction residuals. As described for the encoding process above, a reconstruction residual represents the difference between a video frame and the scene information.
It can be understood that the execution order of step 206 and step 207 is not specifically limited in the embodiments of the present invention.
Step 208: Perform reconstruction according to the scene information and the reconstruction residuals to obtain the plurality of video frames.
The scene feature prediction coded data and the reconstruction residuals include the information of the video frames; by performing reconstruction on the scene information and the reconstruction residuals, the plurality of video frames can be obtained.
In this way, the process of reconstructing the plurality of video frames reduces the redundancy of these video frames, so that in the encoding operation, the overall amount of compressed data of the obtained scene features and reconstruction residuals is reduced relative to the amount of compressed data of the original video frames, reducing the amount of data obtained after compression. Because each video frame is reconstructed into a scene feature and a reconstruction residual, and the reconstruction residual contains only the residual information beyond the scene information, the amount of information is small and sparse; this property allows the residual to be predictively coded with fewer codewords, yielding a small amount of coded data and a high compression ratio. Thus, the method of the embodiments of the present invention can effectively improve the compression efficiency of video frames.
It can be understood that the embodiments of the present invention can be used in a variety of scenarios, for example, using the video frame codec method of the above embodiments in an HEVC scenario. In that case, the video frames acquired in step 201 of the above embodiment are the key frames (I frames) of the HEVC scenario. After step 202, the method of the embodiments further includes: reconstructing the key frames (I frames) and, with these as reference, applying conventional B/P inter-frame prediction coding to the remaining frames. Subsequently, the method further includes performing transform coding, quantization coding, and entropy coding on the prediction coded data according to the HEVC coding process to obtain the video compressed data. The prediction coded data includes the scene feature prediction coded data, the residual prediction coded data, the B-frame prediction coded data, and the P-frame prediction coded data. For details, refer to FIG. 3a, a comparison between the flow of the video encoding method of an embodiment of the present invention and the flow of the existing HEVC encoding method, and FIG. 3b, a schematic diagram of a scenario involved in a video encoding method according to an embodiment of the present invention.
Correspondingly, in the decoding operation, after the video codec device acquires the video compressed data, it performs entropy decoding, inverse quantization, and inverse DCT (discrete cosine transform) on the video compressed data according to the HEVC decoding process to obtain the corresponding prediction coded data. The operations of steps 205 to 208 above are then performed using the scene feature prediction coded data and the residual prediction coded data in the prediction coded data. The video frames reconstructed in step 208 are the key frames; subsequently, the method of the embodiments further includes decoding the B and P frames from the decoded key frame data, and arranging the decoded frames in temporal order to obtain the complete original video sequence. For details, refer to FIG. 4a, a comparison between the flow of the video decoding method of an embodiment of the present invention and the flow of the existing HEVC decoding method, and FIG. 4b, a schematic diagram of a scenario of a video decoding method according to an embodiment of the present invention.
The existing HEVC relies too heavily on I-frame coding, whose compression efficiency is not high; the method of the embodiments of the present invention is applied to the key frames. In the prior art, each I frame is coded independently, so I-frame compressed data accounts for a high proportion of the total, and there is a large amount of information redundancy between I frames. Through the execution of the method of the embodiments, the redundant information of the I frames is reduced, along with the amount of I-frame coded data. In particular, the method of the embodiments identifies and classifies video shots and scenes, performs overall data analysis and reconstruction on the key frames (I frames) within a scene, and codes the scene features and the representation residuals. This effectively avoids the low intra-frame compression efficiency of coding single key frames, while introducing video context information and improving the compression ratio.
It can be understood that the method of the embodiments of the present invention can also be applied to other video frames that would otherwise need to be coded independently; reconstructing such frames to obtain scene information and reconstruction residuals, and coding these separately, can reduce the amount of compressed data of those frames.
To describe the method of the embodiments of the present invention intuitively, the method is explained below in the context of the HEVC standard. It should be understood that the video frame encoding and decoding methods provided by the embodiments can also be applied to other scenarios; the embodiments do not limit the concrete usage scenario.
Depending on the concrete implementation of reconstructing the video frames to obtain the scene information and the reconstruction residuals, two specific embodiments are given below. In one embodiment, the whole frame picture of the reconstructed video frames contains redundant data; in the other embodiment, a local part of the frame picture of the reconstructed video frames contains redundant data.
I. The whole frame picture of the video frames contains redundant data
FIG. 5 is a flowchart of a video encoding method according to an embodiment of the present invention. Referring to FIG. 5 and FIG. 3b, the video encoding method of the embodiment includes:
Step 501: Acquire a video stream.
The encoding device acquires a video stream that includes a plurality of video frames.
Step 502: Perform shot segmentation on the video stream to obtain a plurality of shots.
After the video stream is acquired, the shot segmentation module of the encoding device may segment the video stream into a plurality of shots, so that the video frames to be reconstructed can be extracted from the shots. Of course, a single shot may also be obtained from a video stream.
A shot includes temporally consecutive video frames and represents a temporally and spatially continuous action in one scene.
Specifically, referring to FIG. 7, step 502 can be implemented by the following steps:
Step A1: Acquire a video stream.
Step A1 is step 501, where the video stream includes a plurality of video frames.
Step A2: Extract feature information of a first video frame and a second video frame, respectively.
The feature information is used to describe the picture content of a video frame. To analyze the video frames of the video stream, the analysis can be performed on their feature information, that is, information describing the characteristics of a video frame, for example, image color, shape, edge contour, or texture features.
The first video frame and the second video frame are video frames in the video stream that are currently not assigned to any shot.
Step A3: Calculate the shot distance between the first video frame and the second video frame according to the feature information.
The shot distance represents the degree of difference between the first video frame and the second video frame.
Step A4: Determine whether the shot distance is greater than a preset shot threshold.
The preset shot threshold can be set manually.
Step A5: If the shot distance is greater than the preset shot threshold, segment a target shot out of the video stream; if the shot distance is less than the preset shot threshold, assign the first video frame and the second video frame to the same shot.
The start frame of the target shot is the first video frame, and the end frame of the target shot is the video frame immediately preceding the second video frame; the target shot is one of the shots of the video stream, a shot being a segment of temporally consecutive video frames.
If the shot distance between the first video frame and the second video frame is greater than the preset shot threshold, the degree of difference between them has reached the preset requirement, while the difference between the first video frame and the video frames lying between the two has not, i.e., it is less than the preset shot threshold; hence, in the video stream, the video frames from the first video frame up to the frame immediately preceding the second video frame belong to the target shot. Otherwise, when the first video frame precedes the second video frame, the shot distance is calculated between the frame following the second video frame and the first video frame, and steps A4 and A5 are repeated. By repeating the above steps, a plurality of shots can be obtained from the video stream.
For example, the feature information of the video frames is first extracted, and content change is measured on the basis of the features. Common approaches extract image color, shape, edge contour, or texture features, or extract several features and normalize them. To improve segmentation efficiency, the method of this embodiment describes each image with a block color histogram (see the sketch below). The video frame is first scaled to a fixed size (for example 320x240), and the image is downsampled to reduce the influence of noise. The image is then divided into 4x4 blocks, and an RGB color histogram is extracted for each block. To reduce the influence of illumination on the image, the histograms are equalized. Next, the distance between video frames is calculated from the feature information of the video frames. The distance between video frames, i.e., the shot distance, can be measured with metrics such as the Mahalanobis distance or the Euclidean distance; to eliminate the influence of illumination, this example uses the normalized histogram intersection method. A preset shot threshold is set in advance. When the shot distance exceeds the preset shot threshold, the earlier of the two frames used to compute that distance is taken as the start frame of a shot boundary, and the frame immediately preceding the later of the two frames is taken as the end frame of the previous shot; otherwise, the two frames belong to the same shot. In the end, a complete video can be divided into multiple separate shots.
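As an illustrative sketch only (the function names are hypothetical, and the availability of OpenCV and NumPy is an assumption of this example), the block color histogram feature and the normalized histogram intersection distance described above could be implemented along these lines:

```python
# Minimal sketch: block color histogram feature and normalized histogram
# intersection distance for shot-boundary detection.
import cv2
import numpy as np

def block_color_histogram(frame, grid=4, bins=16):
    """Scale to a fixed size, equalize each channel against illumination,
    split into grid x grid blocks, and concatenate per-block RGB histograms."""
    frame = cv2.resize(frame, (320, 240))
    frame = np.stack([cv2.equalizeHist(frame[:, :, c]) for c in range(3)], axis=-1)
    h, w = frame.shape[:2]
    feats = []
    for by in range(grid):
        for bx in range(grid):
            block = frame[by * h // grid:(by + 1) * h // grid,
                          bx * w // grid:(bx + 1) * w // grid]
            feats += [np.histogram(block[:, :, c], bins=bins, range=(0, 256))[0]
                      for c in range(3)]
    f = np.concatenate(feats).astype(np.float64)
    return f / f.sum()                        # normalize to unit mass

def shot_distance(f1, f2):
    """Normalized histogram intersection turned into a distance in [0, 1]."""
    return 1.0 - np.minimum(f1, f2).sum()
```

Frames whose shot_distance to the current shot's start frame exceeds the preset shot threshold would mark a new shot boundary, matching steps A4 and A5 above.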
Step 503: Extract key frames from the obtained shots.
After segmenting out the shots, the encoding device extracts key frames from each shot, and the reconstruction operation of the method of the embodiments is performed on these key frames.
Specifically, after the above shot segmentation step, step 503 can be implemented by performing step A6.
Step A6: For each shot in the video stream, extract key frames according to the frame distances between the video frames within the shot.
The frame distance between any two adjacent key frames within each shot is greater than a preset frame distance threshold, the frame distance representing the degree of difference between two video frames. The step of reconstructing the plurality of video frames to obtain the scene information and the reconstruction residual of each video frame is then performed on the key frames of each shot.
For example, current key frame extraction algorithms mainly include sampling-based methods, color-feature-based methods, content-analysis-based methods, motion-analysis-based methods, clustering-based methods, compression-based methods, and so on. Since the B and P frames need to reference preceding frames for inter prediction during encoding, the start frame of each shot is set as a key frame. The block color histogram feature and the histogram intersection method are used for the feature description and distance measurement of each frame. To make key frame extraction faster, the method of the embodiments additionally judges the type of each shot: it first determines, from the feature-space distances of adjacent frames, whether the shot is a static picture. If the frame distances between all frames within the shot are 0, the shot is judged to be a static picture and no further key frames are extracted; otherwise it is a dynamic picture. For a dynamic picture, the content of each frame is measured, in temporal order, against the previous key frame, and if the distance is greater than the set threshold, that frame is set as a key frame. FIG. 8 shows the key frame extraction flow; a sketch of it follows below.
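The key frame extraction flow of FIG. 8 can be sketched as follows; this is an assumption-laden illustration that reuses the hypothetical feature and distance helpers from the sketch above:

```python
# Minimal sketch of the key-frame extraction flow of FIG. 8.
def extract_key_frames(shot_frames, threshold):
    feats = [block_color_histogram(f) for f in shot_frames]
    # Static shot: all adjacent frame distances are 0 -> keep only the
    # start frame, which is always a key frame (B/P frames reference it).
    if all(shot_distance(a, b) == 0 for a, b in zip(feats, feats[1:])):
        return [0]
    keys, last = [0], feats[0]
    for i in range(1, len(feats)):            # dynamic shot: scan in time order
        if shot_distance(last, feats[i]) > threshold:
            keys.append(i)                    # far enough from the last key frame
            last = feats[i]
    return keys                               # indices of key frames in the shot
```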
Of course, in some embodiments of the present invention, it is also possible not to judge whether a shot is a static picture or a dynamic picture.
The method of the embodiments of the present invention is described in an HEVC scenario: a shot obtained in the above steps can serve as a GOP, with one shot per GOP. Within a shot, the start frame of the shot is a key frame, the video frames extracted from the shot through step A6 are also key frames, and the other video frames of the shot can serve as B frames and P frames. The key frame extraction of the embodiments takes the contextual information of the video into account, so that when these key frames are subsequently classified, a better classification is obtained, which helps improve the compression ratio of the subsequent coding.
In the method of the embodiments, the key frame sequence is generated quickly and can respond in time to the user's fast-forward and scene-switching needs. The user can preview the video scenes from the key frame sequence and precisely locate the video scene segments of interest, improving the user experience.
It can be understood that, apart from the above method, the key frames on which the following reconstruction is performed can also be obtained in other ways. For example, a video stream is acquired whose video frames include I frames, B frames, and P frames; the I frames are then extracted from the video stream, and step 504 or step 505 is performed on these I frames.
Through the above method, the encoding device obtains a plurality of key frames, which are the video frames to be reconstructed to reduce redundancy. To further reduce the redundancy of the redundant data of the video frames through the method of the embodiments, after acquiring the plurality of key frames, the method further includes a step of classifying the key frames, namely step 504.
Step 504: Classify the plurality of key frames based on the correlation of their picture content to obtain one or more classification clusters of key frames.
After the classification, each of the key frames in the same cluster includes the same picture content, and the method of the embodiments can subsequently perform step 505 on the key frames of the same cluster.
Within a cluster obtained by classification based on the correlation of picture content, the picture content of the key frames is highly correlated and there is a large amount of redundant data. The better the classification, i.e., the more highly aggregated the information of the key frames within a cluster, the larger the redundancy of those key frames, and the more pronounced the effect of the subsequent reconstruction in reducing that redundancy.
For example, in the embodiments of the present invention, one or more clusters are obtained after the classification; the key frames within the same cluster share a large amount of identical picture content, so the redundancy of the redundant data between these key frames is large.
In the classification operation, if the key frames are classified on the basis of shots, the classification may also be called scene classification; of course, the classification may also be performed directly on the key frames without reference to shots. For brevity, the classification operation of the method provided by the embodiments is hereinafter called the scene classification operation.
There are several concrete classification methods; two examples are given below: one is a clustering-based classification method, the other is a classification method using classifiers.
1) Clustering-based classification method
In the clustering-based classification method, classifying the plurality of key frames based on the correlation of their picture content to obtain one or more clusters of key frames includes:
Step B1: Extract feature information of each of the plurality of key frames.
The feature information of a key frame may be low-level features, mid-level semantic features, or the like.
Step B2: Determine the clustering distance between any two key frames according to the feature information.
The clustering distance represents the similarity between two key frames. Any two key frames here covers all the key frames extracted in the above steps, whether they belong to different shots or to the same shot.
The differences between frames within a shot are smaller than those between frames of different shots. To partition the scene classes effectively, different feature spaces can be chosen, and different feature spaces correspond to different metrics, so the clustering distance and the shot distance may differ.
Step B3: Cluster the video frames according to the clustering distance to obtain one or more clusters of video frames.
For example, scene classification is achieved by analyzing and clustering the key frames of each shot. Scene classification is closely tied to scene reconstruction. With video coding as the task, the first principle of scene classification is that the key frames in each cluster are highly correlated at the level of picture content, with a large amount of information redundancy. Existing scene classification algorithms fall mainly into two categories: a) algorithms based on low-level features; and b) algorithms based on mid-level semantic feature modeling. Both are built on feature detection and description and reflect descriptions of the scene content at different levels. The low-level image features can be color, edge, texture, SIFT (scale-invariant feature transform), HOG (histogram of oriented gradients), GIST, and similar features; the mid-level semantic features include bag-of-words visual features, deep learning network features, and so on. To improve efficiency, the embodiments of the present invention use the relatively simple global GIST feature to describe the overall content of a key frame, and the distance function uses the Euclidean distance to measure the similarity of two images. The clustering algorithm can be a traditional method such as K-means, graph cuts, or hierarchical clustering; this embodiment uses an agglomerative hierarchical clustering algorithm to cluster the key frames (see the sketch below). The number of clusters produced by this method depends on the similarity threshold: the higher the threshold, the larger the redundancy of the key frame information within each class, and the larger the corresponding number of clusters. FIG. 9 shows the concrete scene classification flow.
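As a non-limiting sketch (the gist() feature extractor is assumed to be supplied by a third-party GIST implementation, and SciPy is assumed available), the agglomerative hierarchical clustering of key frames could look as follows:

```python
# Minimal sketch: cluster key frames by GIST feature with agglomerative
# hierarchical clustering, cut at a distance threshold.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_key_frames(key_frames, gist, distance_threshold):
    feats = np.stack([gist(f) for f in key_frames])   # one GIST vector per frame
    tree = linkage(feats, method='average', metric='euclidean')
    labels = fcluster(tree, t=distance_threshold, criterion='distance')
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    return list(clusters.values())            # lists of key-frame indices
```

Note that this sketch cuts the dendrogram at a distance threshold, the inverse of the similarity threshold in the text: a smaller distance cut yields more, tighter clusters with higher intra-class redundancy.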
The clustering-based scene classification strategy above benefits encoding speed; the classifier-model-based scene classification strategy below benefits encoding accuracy.
The main idea of the classifier-model-based scene classification strategy is to perform discriminative training for each shot according to the shot segmentation result, yielding a plurality of discriminative classifiers. Each key frame is scored by the classifiers, and a key frame with a high score is considered to belong to the same scene as that shot. The concrete process is as follows:
2) Classification method using classifiers
In the classification method using classifiers, the classification method of the video encoding method of the embodiments includes:
Step C1: Perform discriminative training for each shot segmented from the video stream to obtain a plurality of classifiers, one corresponding to each shot.
Optional classifier models include decision trees, Adaboost, support vector machines (SVM), and deep learning models.
Step C2: Use a target classifier to score a target key frame and obtain a discrimination score.
The target classifier is one of the plurality of classifiers obtained in step C1, the target key frame is one of the key frames, and the discrimination score indicates the extent to which the target key frame belongs to the scene of the shot corresponding to the target classifier.
In this way, the type of each key frame can be determined, i.e., whether a key frame belongs to the same scene as a shot.
Step C3: When the discrimination score is greater than a preset score threshold, determine that the target key frame belongs to the same scene as the shot corresponding to the target classifier.
When the discrimination score is greater than the preset score threshold, the target key frame can be considered to belong to the same scene as the shot of the target classifier; otherwise, the two are considered not to belong to the same scene.
Step C4: Determine one or more clusters of video frames according to the video frames that belong to the same scene as each shot.
For example, take SVM as the classifier. As shown in FIG. 10, classification with classifiers includes two main stages, as follows:
2.1) Model training
First, discriminative training is performed for each shot. All video frames contained in a shot are positive samples, and all video frames in the two shots adjacent to that shot are negative samples. The classifier parameters are trained on these samples. The training formula for each shot classifier is as follows:
$$\min_{w,b}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\max\bigl(0,\ 1 - y_i\,(w^{\mathsf T}\varphi(I_i) + b)\bigr)$$
where y_i is the label corresponding to the i-th training sample (1 for positive samples, -1 for negative samples), φ(·) is the feature mapping function, n is the total number of training samples, w is the classifier parameter, and I_i is the i-th training sample.
2.2) Scene classification
The key frames are scored with the classifier model trained for each shot, using the following formula:
$$p_{ij} = \frac{\exp\bigl(w_j^{\mathsf T}\varphi(I_i) + b_j\bigr)}{Z}$$
p_{ij} denotes the probability that key frame i and shot j belong to the same scene. In the formula, w_j and b_j are the classifier parameters corresponding to the j-th shot, and the denominator Z is a normalization factor. When the probability is greater than the set threshold, key frame i and shot j are considered to belong to the same scene. Here, i and j are positive integers.
In this way, through the above operations, multiple correspondences between key frames and shots are obtained, each indicating that a key frame and a shot belong to the same scene; the encoding device can then determine one or more clusters of video frames from these correspondences (a sketch of both stages follows below).
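A compact sketch of the two stages, with scikit-learn's LinearSVC standing in for the SVM and phi denoting the feature mapping (all names are illustrative assumptions, and the softmax stands in for the unspecified normalization factor):

```python
# Minimal sketch: per-shot discriminative training and key-frame scoring.
import numpy as np
from sklearn.svm import LinearSVC

def train_shot_classifiers(shots, phi):
    """One classifier per shot: its own frames are positive samples,
    all frames of the two neighbouring shots are negative samples."""
    classifiers = []
    for j in range(len(shots)):
        pos = [phi(f) for f in shots[j]]
        neg = [phi(f) for k in (j - 1, j + 1) if 0 <= k < len(shots)
               for f in shots[k]]
        X = np.stack(pos + neg)
        y = np.array([1] * len(pos) + [-1] * len(neg))
        classifiers.append(LinearSVC().fit(X, y))
    return classifiers

def same_scene(key_frame, j, classifiers, phi, threshold):
    """Score a key frame against shot j's classifier, normalized over all
    shot classifiers, and compare against the preset score threshold."""
    x = phi(key_frame).reshape(1, -1)
    scores = np.array([c.decision_function(x)[0] for c in classifiers])
    p = np.exp(scores[j]) / np.exp(scores).sum()
    return p > threshold
```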
It can be understood that the SVM classifier example uses a binary classifier in a concrete scenario; the embodiments of the present invention can also operate with multi-class classification algorithms.
It can be understood that some embodiments of the present invention may not include step 504.
It can be understood that the above describes the encoding method provided by the embodiments in an HEVC scenario; in other scenarios, the key frames in the above steps can be described directly as video frames.
With the above method, the plurality of video frames is classified based on the correlation of their picture content. After one or more clusters of key frames are obtained, the redundancy of the redundant data between key frames within the same cluster is large, so the subsequent reconstruction of the key frames of the same cluster can remove more of that redundancy, further reducing the amount of coded data. In addition, in the embodiments that classify the key frames of shots, the video is compressed per scene, which facilitates later content editing and the production of "green mirror" highlight videos (i.e., condensed videos generated from popularity analysis).
Step 505: Reconstruct the plurality of key frames of the same cluster to obtain the scene feature and the reconstruction residual of each key frame.
Each of the plurality of key frames includes the same picture content, which is precisely the redundant data that the key frames share in picture content. If these key frames were not reconstructed, the encoding device would repeatedly code this identical picture content. The scene feature obtained by reconstruction represents the picture content shared between the key frames, so the scene information includes data obtained by reducing the redundancy of the redundant data. The reconstruction residual represents the difference between a key frame and the scene feature. The scene feature obtained in this way can represent the overall information of a frame, so the reconstruction of step 505 targets the case where the whole pictures of the plurality of video frames share the same picture content.
Step 505 can be concretely implemented as follows:
Convert the key frames of the same cluster into an observation matrix, where the observation matrix represents the plurality of key frames in matrix form; then reconstruct the observation matrix according to a first constraint to obtain a scene feature matrix and a reconstruction residual matrix.
The scene feature matrix represents the scene feature in matrix form, and the reconstruction residual matrix represents the reconstruction residuals of the plurality of key frames in matrix form. The first constraint requires the scene feature matrix to be low-rank and the reconstruction residual matrix to be sparse.
In some embodiments of the present invention, reconstructing the observation matrix according to the first constraint to obtain the scene feature matrix and the reconstruction residual matrix includes: calculating the scene feature matrix and the reconstruction residual matrix according to a first preset formula, the scene feature matrix being a low-rank matrix and the reconstruction residual matrix being a sparse matrix;
The first preset formula is:
$$(F^*, E^*) = \arg\min_{F,E}\ \operatorname{rank}(F) + \lambda\lVert E\rVert_1 \qquad \text{s.t.}\quad D = F + E$$
or, in its relaxed form,
$$(F^*, E^*) = \arg\min_{F,E}\ \lVert F\rVert_* + \lambda\lVert E\rVert_1 \qquad \text{s.t.}\quad D = F + E$$
where D is the observation matrix, F is the scene feature matrix, E is the reconstruction residual matrix, and λ is a weight parameter used to balance the relationship between the scene feature matrix F and the reconstruction residual matrix E. The argmin denotes finding the optimal values of F and E, i.e., the values of F and E that minimize the objective rank(F) + λ||E||_1 or ||F||_* + λ||E||_1. rank(·) is the matrix rank function, ||·||_1 is the matrix L1 norm, and ||·||_* is the matrix nuclear norm.
For example, scene reconstruction performs content analysis on the scene of each cluster obtained by scene classification, extracting scene features and representation coefficients suitable for reconstructing all key frames in the scene. Models usable for scene reconstruction include RPCA (robust principal component analysis), LRR (low-rank representation), SR (sparse representation), SC (sparse coding), SDAE (sparse autoencoder deep learning model), CNN (convolutional neural network), and so on. The representation coefficients of the embodiments can be represented by an identity matrix, so multiplying the scene feature by the representation coefficients still yields the scene feature. Of course, in some embodiments of the present invention, since the representation coefficients can be neglected, either an identity matrix may be used as the representation coefficients or they may be omitted altogether; in the latter case, in the decoding and reconstruction stage, only the scene feature and the reconstruction residuals are needed to represent the original video frames.
The video encoding method of this embodiment uses RPCA to reconstruct the key frames within a scene. The RPCA-based scene reconstruction strategy reconstructs the overall content data of the key frames, which can reduce the blocking artifacts produced by block prediction.
Suppose a scene S contains N key frames, i.e., a certain cluster includes N key frames, N being a natural number. The image pixel values of every key frame in the same cluster are each stretched into a column vector, and together these form an observation matrix D, i.e., D = [I_1, I_2, ..., I_N], where I_i is the column-vector representation of the i-th key frame. Since the content of the key frames within the same scene is similar, each key frame can be assumed to contain the same scene feature f_i, and the feature matrix F = [f_1, f_2, ..., f_N] composed of the scene features should in essence be a low-rank matrix; each key frame differs only slightly from the F matrix to yield the observation matrix D, so the reconstruction error E = [e_1, e_2, ..., e_N] should be sparse.
The scene reconstruction problem is converted into the following optimization problem:
$$\min_{F,E}\ \operatorname{rank}(F) + \lambda\lVert E\rVert_1 \qquad \text{s.t.}\quad D = F + E$$
where λ is a weight parameter used to balance the relationship between the scene feature matrix F and the reconstruction residual matrix E, rank(·) is the matrix rank function, and ||·||_1 is the matrix L1 norm. This optimization problem is NP-hard and can be relaxed to the following problem:
$$\min_{F,E}\ \lVert F\rVert_* + \lambda\lVert E\rVert_1 \qquad \text{s.t.}\quad D = F + E$$
where ||·||_* is the matrix nuclear norm. The relaxed problem can be solved by matrix optimization algorithms such as the accelerated proximal gradient (APG) method and the inexact augmented Lagrange multiplier (IALM) method.
After this reconstruction, the scene feature and the reconstruction residuals are obtained, and the original compression of the key frames I can be converted into compression of the scene feature F and the reconstruction error E. Since the scene feature matrix F is low-rank and the reconstruction residual E is sparse, the amount of compressed data of the two is greatly reduced relative to traditional I-frame compression algorithms. FIG. 11 gives an example of RPCA-based scene reconstruction, in which key frames 1 to 3 belong to different shot segments of the same video. As the figure shows, the scene feature matrix F has rank 1, so only a single column of the matrix needs data compression; the residual matrix E is 0 over most regions, so only a small amount of information is needed to represent E. A sketch of the IALM solver follows below.
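As a minimal sketch of the IALM scheme named above, the following decomposes the observation matrix D (one vectorized key frame per column) into the low-rank scene part F and the sparse residual E; the parameter defaults are common choices from the RPCA literature, not values prescribed by this embodiment:

```python
# Minimal sketch: RPCA via the inexact augmented Lagrange multiplier method.
import numpy as np

def rpca_ialm(D, lam=None, tol=1e-7, max_iter=500):
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm2 = np.linalg.norm(D, 2)              # spectral norm of D
    Y = D / max(norm2, np.abs(D).max() / lam) # dual variable initialization
    mu, rho = 1.25 / norm2, 1.5
    F, E = np.zeros_like(D), np.zeros_like(D)
    for _ in range(max_iter):
        # F-step: singular value thresholding (nuclear-norm proximal step).
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        F = U @ np.diag(np.maximum(s - 1.0 / mu, 0)) @ Vt
        # E-step: entrywise soft thresholding (L1-norm proximal step).
        T = D - F + Y / mu
        E = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        Y = Y + mu * (D - F - E)              # dual ascent on D = F + E
        mu *= rho
        if np.linalg.norm(D - F - E) / np.linalg.norm(D) < tol:
            break
    return F, E                               # low-rank scene part, sparse residual
```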
The scene feature of the embodiments of the present invention is one concrete implementation of the scene information, and step 505 is one concrete implementation of reconstructing the plurality of video frames to obtain the scene information and the reconstruction residual of each video frame.
The method of the embodiments of the present invention performs the reconstruction on key frames whose overall frame information contains redundant data. To reduce the redundant data of the key frames efficiently through the reconstruction, the key frames must first be examined to judge whether the currently selected key frames are suitable for the reconstruction of the method of the embodiments, so that adaptive coding can be performed according to the content of the video scene.
That is, before reconstructing the plurality of video frames to obtain the scene feature and the reconstruction residual of each video frame, the method of the embodiments further includes: extracting picture feature information of each of the plurality of video frames, where the extracted picture feature information may be a global feature or a local feature of the video frame, for example the GIST global feature, the HOG global feature, or SIFT local features, which is not specifically limited in the embodiments of the present invention. The encoding device then calculates content measurement information from the picture feature information, where the content measurement information is used to measure the difference in picture content between the plurality of video frames, i.e., a content consistency measurement of the key frames is performed; the consistency criterion can be measured by feature variance, Euclidean distance, and the like. When the content measurement information is not greater than a preset measurement threshold, the step of reconstructing the plurality of video frames to obtain the scene feature and the reconstruction residual of each video frame is performed.
For example, before step 505, the method of the embodiments further includes:
Step D1: Extract the global GIST feature of each of the plurality of video frames.
In the HEVC scenario above, step D1 extracts the global GIST feature of each of the key frames of the same cluster.
The global GIST feature is used to describe the characteristics of a key frame.
Step D2: Calculate the scene GIST feature variance from the global GIST features.
The scene GIST feature variance is used to measure the content consistency of the plurality of video frames.
In the HEVC scenario above, the scene GIST feature variance is used to measure the content consistency of the key frames of the same cluster.
Step D3: When the scene GIST feature variance is not greater than the preset variance threshold, perform step 505.
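Steps D1 to D3 can be sketched as follows; gist() is again an assumed external GIST extractor, and the variance definition used here is one plausible instantiation of the feature-variance criterion rather than the only one:

```python
# Minimal sketch of the content-consistency gate of steps D1 to D3.
import numpy as np

def scene_is_reconstructable(key_frames, gist, variance_threshold):
    feats = np.stack([gist(f) for f in key_frames])   # step D1: global GIST
    # Step D2: variance of the GIST vectors around their mean, summed
    # over feature dimensions, as the scene GIST feature variance.
    variance = ((feats - feats.mean(axis=0)) ** 2).mean(axis=0).sum()
    return variance <= variance_threshold     # step D3: gate for step 505
```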
After the scene feature and the reconstruction residuals are obtained through the above steps, the video codec device can perform intra prediction coding on the scene feature and on the reconstruction residuals respectively.
The above steps D1 to D3 are a concrete method of judging whether the key frames of the same cluster are suitable for step 505.
Step 506: Perform predictive coding on the scene feature to obtain the scene feature prediction coded data.
Step 507: Perform predictive coding on the reconstruction residuals to obtain the residual prediction coded data.
The predictive coding part of the encoding device includes intra prediction coding and inter prediction coding. The scene feature and the reconstruction error are coded with intra prediction, while the remaining frames of a shot, i.e., its non-key frames, are coded with inter prediction. The concrete intra prediction coding flow is similar to the HEVC intra coding module. Since the scene feature matrix is low-rank, only the key columns of the scene feature matrix need to be coded (one possible realization is sketched below). Coding the reconstruction error is residual coding, so the amount of coded data is small and the compression ratio is high.
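One way to exploit the low rank when coding F, sketched under the assumption that an SVD-based column basis is acceptable (the embodiment itself only requires that the key columns be coded, so this factorization is an illustrative substitute):

```python
# Minimal sketch: represent the low-rank scene feature matrix F by a few
# basis columns plus per-frame mixing coefficients instead of all columns.
import numpy as np

def key_columns(F, energy=0.999):
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    r = int(np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), energy)) + 1
    basis = U[:, :r] * s[:r]                  # r columns to intra-code
    coeffs = Vt[:r, :]                        # r coefficients per key frame
    return basis, coeffs                      # F is approximately basis @ coeffs
```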
Step 508: Perform reconstruction according to the scene feature and the reconstruction residuals to obtain reference frames.
To perform inter prediction coding on the B and P frames, reference frames need to be obtained; in the HEVC scenario, the key frames serve as the reference frames. In the above method, if the scene feature and the reconstruction residuals are compressed lossily and the original key frames extracted in step 503 are used for inter prediction, error diffusion will appear in the B and P frames at decompression. To prevent errors from diffusing across the B and P frames, step 508 adopts this reverse reconstruction scheme: reconstruction is performed from the scene feature and the reconstruction residuals, and the resulting reference frames are used as the reference when performing step 509 below. A sketch of the idea follows.
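A sketch of the reverse reconstruction, where encode and decode stand for the lossy intra codec of steps 506 and 507 and all names are illustrative:

```python
# Minimal sketch: build each reference frame from the *decoded* scene
# feature and residual, so encoder and decoder predict B/P frames from
# identical pixels, preventing error diffusion under lossy compression.
def reference_frame(f_col, e_col, encode, decode):
    f_hat = decode(encode(f_col))             # lossy round trip of scene feature
    e_hat = decode(encode(e_col))             # lossy round trip of residual
    return f_hat + e_hat                      # matches what the decoder will see
```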
It can be understood that if the predictive coding of the scene feature and the reconstruction residuals in steps 506 and 507 is lossless, B/P inter prediction can be performed directly with the key frames extracted in step 503.
Step 509: With the reference frames as reference, perform inter prediction coding on the B frames and P frames to obtain the B-frame prediction coded data and the P-frame prediction coded data.
Inter coding first reconstructs the key frames (I frames) from the scene feature and the reconstruction error, then performs motion-compensated prediction and coding on the B/P frame content. The concrete inter prediction coding process is the same as in HEVC.
Step 510: Perform transform coding, quantization coding, and entropy coding on the prediction coded data to obtain the video compressed data.
The prediction coded data includes the scene feature prediction coded data, the residual prediction coded data, the B-frame prediction coded data, and the P-frame prediction coded data.
On the basis of the predictive coding, the data undergoes transform coding, quantization coding, and entropy coding; this process is the same as in HEVC.
The video encoding method of the embodiments of the present invention can raise the video compression ratio. In some embodiments, when the scene content is highly correlated, the entire scene information can be represented by a very small amount of information, lowering the bit rate and reducing the compressed volume while preserving video quality, which makes it well suited to the transmission and storage of images in low-bit-rate environments. Taking the digital video industry as an example, the existing video-on-demand (VOD), network personal video recording (NPVR), and catch-up TV video services occupy 70% of servers' storage resources and network bandwidth. The technical solution of the embodiments can reduce the load on storage servers and improve network transmission efficiency. In addition, CDN edge nodes can store more videos, the user hit rate increases substantially, the back-to-origin rate decreases, the user experience improves, and network equipment consumption drops. Moreover, the method of the embodiments can generate videos at different bit rates by extracting scene features at different levels.
In summary, through reconstruction, the identical picture content is deduplicated and represented by the scene feature, which reduces the redundancy of the redundant information of the plurality of video frames. Hence, in the encoding operation, the overall amount of compressed data of the obtained scene feature and reconstruction residuals is reduced relative to that of the original video frames, reducing the amount of data obtained after compression. Because each video frame is reconstructed into a scene feature and a reconstruction residual, and the reconstruction residual contains only the residual information beyond the scene information, the amount of information is small and sparse; this property allows predictive coding with fewer codewords, a small amount of coded data, and a high compression ratio. Thus, the method of the embodiments can effectively improve the compression efficiency of video frames.
通过上述步骤的执行得到压缩编码数据后,视频编解码设备可对该压缩编码数据进行解压操作。After the compression encoded data is obtained by performing the above steps, the video codec device may perform a decompression operation on the compressed encoded data.
FIG. 6 is a flowchart of a video decoding method according to an embodiment of the present invention. Referring to FIG. 6 and FIG. 4b, the video decoding method of the embodiment of the present invention includes:

Step 601: Acquire video compressed data.

The decoding device acquires video compressed data, which may be the video compressed data obtained by the video encoding method of the embodiment shown in FIG. 5.

Step 602: Perform entropy decoding, dequantization, and inverse DCT transform on the video compressed data to obtain the prediction encoded data.

The prediction encoded data includes the scene feature prediction encoded data, the residual prediction encoded data, the B-frame prediction encoded data, and the P-frame prediction encoded data.

In the HEVC scenario, corresponding to step 510, the video compressed data needs to be entropy decoded, dequantized, and inverse DCT transformed according to the HEVC decoding process to obtain the corresponding prediction encoded data.

In this way, the scene feature prediction encoded data and the residual prediction encoded data can be acquired.

Step 603: Decode the scene feature prediction encoded data to obtain the scene feature.

Corresponding to the embodiment shown in FIG. 5, the scene feature represents the picture content that is identical across the video frames, so the scene feature obtained by decoding the scene feature prediction encoded data represents the picture content shared by every video frame among the multiple video frames.

Step 604: Decode the residual prediction encoded data to obtain the reconstructed residuals.

A reconstructed residual represents the difference between a video frame and the scene information.

For example, the scene feature prediction encoded data and the key-frame error prediction encoded data are decoded to obtain the scene feature matrix F and the reconstructed residuals e_i, respectively.

Step 605: Perform reconstruction according to the scene feature and the reconstructed residuals to obtain multiple I frames.

In the embodiment shown in FIG. 5 of the present invention, the key frames were reconstructed into the scene feature and the reconstructed residuals; therefore, in the video frame decoding method, reconstructing from the scene feature and the reconstructed residuals yields the multiple key frames.

Step 606: Using the I frames as reference frames, perform inter-frame decoding on the B-frame prediction encoded data and the P-frame prediction encoded data to obtain the B frames and P frames.

Step 607: Arrange the I frames, B frames, and P frames in chronological order to obtain the video stream.

After the I frames, B frames, and P frames are acquired, the video stream is obtained by arranging these three types of video frames in chronological order.
For example, the original data is reconstructed by combining the decoded scene feature F with the key-frame errors e_i to obtain the decoded key-frame data. Finally, B/P-frame decoding is performed according to the decoded key-frame data, and the decoded data frames are arranged in chronological order to obtain the complete sequence of the original video.
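A minimal decoder-side sketch of steps 605 to 607 (assuming, as in the FIG. 5 embodiment, that each key frame is simply the scene feature plus its residual; the frame-assembly helper is illustrative):

```python
import numpy as np

def decode_key_frames(scene_feature, residuals):
    """Key frame i = scene feature F + reconstructed residual e_i (step 605)."""
    return [scene_feature + e for e in residuals]

def assemble_stream(frames_with_timestamps):
    """Arrange decoded I/B/P frames in chronological order (step 607)."""
    return [f for _, f in sorted(frames_with_timestamps, key=lambda p: p[0])]
```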
In this way, after the video compressed data is obtained by the video encoding method shown in FIG. 5, the scene feature prediction encoded data and residual prediction encoded data obtained in some embodiments can be decoded into video frames by the video decoding method shown in FIG. 6.

The embodiment shown in FIG. 5 is mainly applied to efficient compression in scenarios where the overall information between key frames is redundant. The embodiment shown in FIG. 12 below is applied to efficient compression in scenarios where local information between key frames is redundant; such local information may be, for example, texture images or gradual shot transitions.

FIG. 12 is a flowchart of a video encoding method according to an embodiment of the present invention. Referring to FIG. 12, the video encoding method provided by the embodiment of the present invention includes:

Step 1201: Acquire a video stream.

For implementation details of step 1201, refer to step 501.

Step 1202: Perform shot segmentation on the video stream to obtain multiple shots.

For implementation details of step 1202, refer to step 502.

Step 1203: Extract key frames from the obtained shots.

For implementation details of step 1203, refer to step 503.

Similar to the embodiment shown in FIG. 5 above, in the method of the embodiment shown in FIG. 12 the video frames to be reconstructed may also be acquired in other ways. For example, a video stream is acquired whose video frames include I frames, B frames, and P frames; the I frames are then extracted from the video stream, and the subsequent step of splitting each of the multiple video frames to obtain multiple frame sub-blocks is performed on the I frames.

Step 1204: Classify the multiple key frames based on the correlation of their picture content to obtain key frames of one or more classification clusters.

For implementation details of step 1204, refer to step 504.

For the specific classification method used by the method of the embodiment of the present invention, refer also to the related description of step 504.

The method of the embodiment of the present invention may perform the reconstruction operation on key frames whose local frame information contains redundant data. In order to efficiently reduce the redundant data of key frames through the reconstruction operation, the key frames first need to be examined to determine whether the currently selected key frames are suitable for the reconstruction operation of the method of the embodiment of the present invention. That is, before each of the multiple video frames is split to obtain multiple frame sub-blocks, the method of the embodiment of the present invention further includes: extracting picture feature information of each of the multiple video frames, where the extracted picture feature information may be a global or local feature of the video frame, such as a GIST global feature, an HOG global feature, or SIFT local features, which is not specifically limited in the embodiment of the present invention. The encoding device then calculates content metric information from the picture feature information; the content metric information measures the difference in picture content among the multiple video frames, that is, it measures the content consistency of the key frames, and the key-frame content consistency criterion may be measured by feature variance, Euclidean distance, and the like. When the content metric information is greater than a preset metric threshold, the step of splitting each of the multiple video frames to obtain multiple frame sub-blocks is performed.
For example, before step 1205, the method of the embodiment of the present invention further includes:

Step E1: Extract the global GIST feature of each of the multiple video frames.

In the HEVC scenario, step E1 extracts the global GIST feature of each key frame among the multiple key frames of the same classification cluster. The global GIST feature is used to describe the characteristics of a key frame.

Step E2: Calculate the scene GIST feature variance according to the global GIST features.

The scene GIST feature variance is used to measure the content consistency of the multiple video frames.

In the above HEVC scenario, the scene GIST feature variance measures the content consistency of the multiple key frames of the same classification cluster.

Step E3: When the scene GIST feature variance is greater than a preset variance threshold, perform step 1205.

In the HEVC scenario, the video frames in steps E1 to E3 are key frames; in some embodiments of the present invention, these key frames belong to the same classification cluster.

Steps E1 to E3 above are a specific method for determining whether the key frames of the same classification cluster are suitable for step 1205. If the scene GIST feature variance of the multiple key frames is greater than the preset variance threshold, it indicates that local regions of the frame pictures of these key frames contain redundant data, so step 1205 or step 1206 can be performed on them to reduce the redundancy of this local redundant data.
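A sketch of the consistency check in steps E1 to E3 (assuming `extract_gist` is some global descriptor extractor supplied elsewhere, since GIST itself involves a Gabor filter bank not reproduced here; the threshold value is illustrative):

```python
import numpy as np

def scene_feature_variance(key_frames, extract_gist):
    """Mean per-dimension variance of the global descriptors of one cluster's
    key frames, used as the content-consistency metric of steps E1/E2."""
    feats = np.stack([extract_gist(f) for f in key_frames])  # (N, dim)
    return float(feats.var(axis=0).mean())

def should_split(key_frames, extract_gist, threshold=0.05):
    """Step E3: split into sub-blocks only when the variance exceeds the
    preset threshold, i.e. the redundancy is local rather than global."""
    return scene_feature_variance(key_frames, extract_gist) > threshold
```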
Step 1205: Split each of the multiple video frames to obtain multiple frame sub-blocks.

Specifically, in the embodiment that performs classification, after classifying the key frames into one or more classification clusters, the encoding device splits the multiple key frames of the same classification cluster to obtain multiple frame sub-blocks.

Among the multiple video frames of step 1205, the video frames contain redundant data at local positions relative to one another; that is, redundant data exists both between different video frames and within a single video frame, and this redundant data sits at local positions of the frames. For example, of two video frames, one has a window image in the lower part of the frame and the other has the same window image in the upper part; in these two video frames, the window image constitutes redundant data.

By splitting these video frames, multiple frame sub-blocks are obtained. Since the original video frames contain redundant data between and within frames, after splitting this redundant data is carried by the resulting frame sub-blocks. Because the redundant data is located at local positions of the video frames, it is inconvenient to extract from these frames a scene feature that represents the whole frame picture, or such a scene feature would do little to reduce the redundancy of the redundant data. Therefore the video frames can be split first; the frame picture then becomes the picture of a frame sub-block, and the granularity of the redundant data relative to the frame picture is reduced, which facilitates obtaining the scene feature bases. For details on obtaining the scene feature bases, see the description of step 1206.

It can be understood that the frame sub-blocks obtained by splitting may be of equal or unequal size, and after splitting these frame sub-blocks may be preprocessed, for example enlarged or reduced.
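A minimal sketch of the split and of building the observation matrix used later in step 1206 (assuming frames whose height and width divide evenly into the chosen block size; the block size is illustrative):

```python
import numpy as np

def split_into_subblocks(frame, block=32):
    """Split one frame (H, W) into equal-sized sub-blocks (step 1205)."""
    h, w = frame.shape
    return [frame[r:r + block, c:c + block]
            for r in range(0, h, block)
            for c in range(0, w, block)]

def observation_matrix(key_frames, block=32):
    """Vectorize every sub-block of every key frame into one column of D,
    so D = [d_1, d_2, ...] with one column per sub-block."""
    cols = [b.reshape(-1)
            for f in key_frames
            for b in split_into_subblocks(f, block)]
    return np.stack(cols, axis=1).astype(np.float64)
```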
Step 1206: Reconstruct the multiple frame sub-blocks to obtain the scene feature, the representation coefficient of each of the multiple frame sub-blocks, and the reconstructed residual of each frame sub-block.

The scene feature includes multiple independent scene feature bases; within the scene feature, the independent scene feature bases cannot be reconstructed from one another. A scene feature base describes the picture content characteristics of a frame sub-block, the representation coefficient expresses the correspondence between scene feature bases and frame sub-blocks, and the reconstructed residual expresses the difference between a frame sub-block and its scene feature bases. The reconstructed residual may be a specific value or zero.

In the embodiment of the present invention, the representation coefficients may be stored in separate fields and conveyed as coded side information, for example by adding corresponding fields to the picture header, slice header, or macroblock information.

Scene feature bases can take several forms: for example, they may be certain frame sub-blocks, or feature blocks in a particular space; see the two examples below. Multiple scene feature bases constitute a scene feature; within the same scene feature, different scene feature bases cannot be reconstructed from one another, so these scene feature bases form the basic picture units. Combining a basic picture unit with the corresponding reconstructed residual yields a particular frame sub-block; since there are multiple basic picture units, representation coefficients are needed to associate the scene feature bases and the reconstructed residual that correspond to the same frame sub-block. It can be understood that one frame sub-block may correspond to one scene feature base or to multiple scene feature bases; when multiple scene feature bases correspond to one frame sub-block, these bases are superimposed on one another and then combined with the reconstructed residual to reconstruct the frame sub-block.

A scene feature is composed of scene feature bases, and the scene feature bases within one scene feature cannot be reconstructed from one another, while the additional reconstructed-residual parameter expresses the difference between a frame sub-block and the scene feature bases. Thus, when multiple frame sub-blocks yield multiple identical scene feature bases, the scene feature may record only one of them, so that the scene information includes data obtained by reducing the redundancy of the redundant data. In this way, after the reconstruction of step 1206, the data of the frame sub-blocks is converted into data composed of the reconstructed residuals and the scene feature, and the redundancy of the redundant data is reduced.
The video encoding method of the embodiment of the present invention may refer to FIG. 3b, except that, based on FIG. 3b, the representation coefficients C are also included after the scene reconstruction. For example, after scene reconstruction is performed on the key frames of scene 1, the reconstructed residual matrices E1, E2, E3 and the scene feature F1*[C1, C3, C5]^T are obtained, where C1, C3, and C5 are the representation coefficients of the key frames I1, I3, and I5, respectively.
Steps 1205 and 1206 above are one specific form of the step of reconstructing multiple video frames to obtain the scene information and the reconstructed residual of each video frame.

Step 1206 can be performed in several ways; two examples are detailed below.

Example 1:

First, the encoding device reconstructs the multiple frame sub-blocks to obtain the representation coefficient of each of the multiple frame sub-blocks and the reconstructed residual of each frame sub-block.

Here, a representation coefficient expresses the correspondence between a frame sub-block and a target frame sub-block. A target frame sub-block is an independent frame sub-block among the multiple frame sub-blocks, an independent frame sub-block being one that cannot be reconstructed from the other frame sub-blocks; the reconstructed residual represents the difference between the target frame sub-block and the frame sub-block.

Then, the encoding device combines the target frame sub-blocks indicated by the multiple representation coefficients to obtain the scene feature; the target frame sub-blocks are the scene feature bases.

That is, in this implementation, after the multiple frame sub-blocks are obtained from the multiple video frames, the independently represented frame sub-blocks are determined through the reconstruction operation; these independently represented frame sub-blocks are here called target frame sub-blocks. The obtained frame sub-blocks include target frame sub-blocks and non-target frame sub-blocks: a target frame sub-block cannot be reconstructed from other target frame sub-blocks, while a non-target frame sub-block can be obtained from the target frame sub-blocks. In this way the scene feature is composed of the target frame sub-blocks, which reduces the redundancy of the redundant data. Since the scene feature bases are the original frame sub-blocks themselves, the scene feature bases constituting the scene feature can be determined from the indication of the representation coefficients.

For example, as shown in FIG. 13, of two frame sub-blocks, one contains a window pattern 1301, and adding a door image 1303 to it yields the other frame sub-block; the former is therefore the target frame sub-block 1302 and the latter a non-target frame sub-block 1304. Reconstructing the target frame sub-block with the reconstructed residual of the door pattern yields the non-target frame sub-block. Thus, in a scene containing these two frame sub-blocks, the window pattern shared by both is redundant data. After the reconstruction operation of the embodiment of the present invention, what is obtained is the target frame sub-block, the reconstructed residual of the door, and two representation coefficients: one indicating the target frame sub-block itself, the other indicating the correspondence between the target frame sub-block and the door's reconstructed residual. The target frame sub-block is the scene feature base. At the decoding device, one frame sub-block is obtained as the target frame sub-block according to the representation coefficient that indicates the target frame sub-block itself, and the other frame sub-block is reconstructed from the target frame sub-block and the door's reconstructed residual according to the representation coefficient that indicates their correspondence. In this way, during encoding, the reconstruction operation described above reduces the redundancy of the redundant data and reduces the amount of coding.
Specifically, reconstructing the multiple frame sub-blocks to obtain the representation coefficient of each of the multiple frame sub-blocks and the reconstructed residual of each frame sub-block includes:

converting the multiple frame sub-blocks into an observation matrix, where the observation matrix represents the multiple frame sub-blocks in matrix form;

reconstructing the observation matrix according to a second constraint condition to obtain a representation coefficient matrix and a reconstructed residual matrix, where the representation coefficient matrix is a matrix including the representation coefficient of each of the multiple frame sub-blocks, the non-zero entries of a representation coefficient indicate the target frame sub-blocks, the reconstructed residual matrix represents the reconstructed residual of each frame sub-block in matrix form, and the second constraint condition requires the low-rankness and sparsity of the representation coefficients to meet preset requirements.

Combining the target frame sub-blocks indicated by the multiple representation coefficients to obtain the scene feature includes:

combining the target frame sub-blocks indicated by the non-zero entries of the representation coefficients in the representation coefficient matrix to obtain the scene feature.
Optionally, reconstructing the observation matrix according to the second constraint condition to obtain the representation coefficient matrix and the reconstructed residual matrix includes:

calculating the representation coefficient matrix and the reconstructed residual matrix according to a second preset formula, the second preset formula being

$$\min_{C,E} \; \|C\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad D = DC + E$$

or, with the additional coefficient-sparsity term,

$$\min_{C,E} \; \|C\|_* + \lambda \|E\|_1 + \beta \|C\|_1 \quad \text{s.t.} \quad D = DC + E$$

where D is the observation matrix, C is the representation coefficient matrix, E is the reconstructed residual matrix, and λ and β are weight parameters. The minimization seeks the optimal values of C and E, that is, the values of C and E that minimize the objective $\|C\|_* + \lambda\|E\|_1$ or $\|C\|_* + \lambda\|E\|_1 + \beta\|C\|_1$, where $\|\cdot\|_*$ is the matrix nuclear norm and $\|\cdot\|_1$ is the matrix $L_1$ norm.
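A hedged sketch of solving this second preset formula with an off-the-shelf convex solver (the text names APG and IALM as solvers; CVXPY is used here only as an illustrative substitute, and the weight values are assumptions):

```python
import cvxpy as cp
import numpy as np

def low_rank_sparse_reconstruction(D, lam=0.1, beta=0.1):
    """min ||C||_* + lam*||E||_1 + beta*||C||_1  s.t.  D = D C + E."""
    n = D.shape[1]
    C = cp.Variable((n, n))
    E = cp.Variable(D.shape)
    objective = cp.Minimize(cp.normNuc(C)
                            + lam * cp.sum(cp.abs(E))
                            + beta * cp.sum(cp.abs(C)))
    problem = cp.Problem(objective, [D == D @ C + E])
    problem.solve(solver=cp.SCS)
    return C.value, E.value  # representation coefficients and residuals
```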
For example, suppose a scene S contains N key frames, that is, the same classification cluster includes N key frames, N being a natural number. Each key frame is split evenly into M equal-sized sub-blocks, and each sub-block is pulled into a column vector; together these form an observation matrix

$$D = [d_1, d_2, \ldots, d_{N \times M}],$$

where each column d_j is one vectorized sub-block. Since there is a large amount of redundancy within and between the key frames, this matrix can be regarded as the union of multiple subspaces. The goal of scene reconstruction is to find these independent subspaces and solve for the representation coefficients of the observation matrix D within them. A space is a set with certain specific properties; the observation matrix D contains multiple image feature vectors, and the representation space these vectors form is the full space. A subspace is a partial space whose dimension is smaller than the full space. Here, the subspaces are the spaces formed by the independent frame sub-blocks.
The scene reconstruction problem can be converted into the following optimization problem:

$$\min_{C,E} \; \|C\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad D = DC + E$$

where C is the representation coefficient matrix. From the representation coefficients C, the scene features corresponding to each subspace can be obtained; the non-zero entries of C correspond one-to-one to the scene feature bases. In this embodiment, a representation coefficient is the coefficient matrix (or vector) by which an original frame sub-block is expressed in terms of the scene feature bases during key-frame reconstruction, that is, the correspondence between a frame sub-block and the scene feature bases. The representation coefficient between different independent frame sub-blocks is usually 0; for example, a grass image does not contain lake scene features, so the coefficients expressing that image block in terms of lake scene features are usually 0.

In this way, the observation matrix D achieves self-representation: every frame sub-block in D can be expressed through the other frame sub-blocks in D, while an independent frame sub-block is expressed by itself. Each column of the representation coefficient matrix C is the representation coefficient of one frame sub-block, and each column of the residual matrix E is the reconstructed residual of the corresponding frame sub-block; hence the formula can use D = DC + E.

This target constraint function expresses the following: under the self-representation premise, since the observation matrix is composed of multiple scene feature bases, the representation coefficients should form a low-rank matrix (that is, the representation coefficients are strongly correlated); imposing the low-rank constraint avoids solving to the trivial solution (the case C = I, E = 0). At the same time, a sparsity constraint is placed on the reconstruction error so that the representation stays as close as possible to the original image.
To reduce the data volume of the scene feature representation, a sparsity constraint is further imposed on the representation coefficients:

$$\min_{C,E} \; \|C\|_* + \lambda \|E\|_1 + \beta \|C\|_1 \quad \text{s.t.} \quad D = DC + E$$

where λ and β are weight parameters that adjust the coefficient sparsity and low-rankness. This optimization problem can be solved by matrix optimization algorithms such as APG or IALM. The final scene feature is composed of the feature bases corresponding to the non-zero coefficients of C.

Reducing the number of feature bases requires this sparsity constraint on the representation coefficients: the representation coefficients of frame sub-blocks belonging to the same class of scene (for example, all grass) are not only strongly correlated but also mostly 0, and the image sub-blocks corresponding to the few non-zero representation coefficients are the scene features that ultimately need to be encoded.
For example, suppose the representation coefficient matrix C and the observation matrix D are matrices arranged from column vectors c and d, that is, C = [c1, c2, c3, ...] and D = [d1, d2, d3, ...], where c1 = [c1_1, c1_2, c1_3] is the representation coefficient corresponding to observation sample d1, and d1 is the matrix representation of one frame sub-block. DC denotes matrix multiplication, so d1 = D*c1, that is, d1 = d1*c1_1 + d2*c1_2 + d3*c1_3. After solving, only a few dimensions of the vector c1 are non-zero; say c1_2 is non-zero. Then the scene feature base is d2, that is, frame sub-block d1 can be expressed in terms of frame sub-block d2, where d2 is an independent frame sub-block; reconstructing d2 with the reconstructed residual of d1 yields d1, and the representation coefficient c1 = [0, c1_2, 0] expresses the correspondence between frame sub-block d1 and the independent frame sub-block d2.
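A tiny numeric illustration of this worked example (all values are invented purely for illustration):

```python
import numpy as np

d2 = np.array([1.0, 2.0, 3.0])          # independent (target) sub-block
e1 = np.array([0.1, -0.2, 0.0])         # reconstruction residual of d1
c1 = np.array([0.0, 0.5, 0.0])          # only c1_2 is non-zero

# d1 is representable as d2*c1_2 + e1; a third, unrelated column pads D.
D = np.stack([d2 * 0.5 + e1, d2, np.array([4.0, 4.0, 4.0])], axis=1)
d1_reconstructed = D @ c1 + e1          # d1 = d2*c1_2 + e1
assert np.allclose(d1_reconstructed, D[:, 0])
```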
In this way, the embodiment of the present invention converts the information content of the I frames into scene-feature-base information and residual-matrix information; the redundancy of the I-frame information is thus concentrated in the scene feature bases and the residual matrix. Multiple I frames share the same scene feature bases, so the scene feature bases need to be encoded only once, greatly reducing the amount of encoded data.

During encoding, in addition to encoding the scene features and the reconstruction errors, the representation coefficients and the sub-block numbers also need to be recorded. During decoding, each sub-block is first reconstructed from the decoded scene features, representation coefficients, and reconstruction errors, and the sub-blocks are then combined by number to obtain the final key-frame content. FIG. 14 shows an example of scene reconstruction based on local information representation.

Of course, in some embodiments of the present invention the sub-block numbers may be omitted; instead, the frame sub-blocks are arranged in a preset order, and during decoding and restoration the reconstructed frame sub-blocks are combined according to that preset rule, which also yields the video frames.

This implementation can mine the texture structures present in the key frames. If the scene contains a large number of texture features, the representation coefficients C obtained by solving the above formula will be low-rank and sparse, and the feature bases corresponding to the sparse coefficients are the basic units of the scene's texture structure. FIG. 15 shows an example of local feature reconstruction in a texture scene.

In the compression scheme given by the above implementation, the scene content is represented and reconstructed according to the low-level data features of the images. The implementation below uses higher-level semantic features to describe and reconstruct the scene content to achieve data compression. Specific models include Sparse Coding (SC), Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Stacked Auto-Encoders (SAE), and so on.
Example 2:

First, the encoding device reconstructs the multiple frame sub-blocks to obtain the scene feature and the representation coefficient of each of the multiple frame sub-blocks. Here, the scene feature bases included in the scene feature are independent feature blocks in a feature space, an independent feature block being one that cannot be reconstructed from the other feature blocks in the scene feature.

Then, the encoding device calculates the reconstructed residual of each frame sub-block from the data reconstructed from the scene feature and the representation coefficient of that frame sub-block, together with the frame sub-block itself.

Here, the scene feature bases are independent feature blocks in a feature space; such feature spaces may be, for example, the RGB color space, the HIS color space, or the YUV color space. Different frame sub-blocks may not appear to share the same picture, yet after high-level mapping they share identical feature blocks, and these identical feature blocks constitute redundant data. The scene feature records each such identical feature block only once, thereby reducing the redundancy between frame sub-blocks. A scene feature of this kind is similar to a dictionary composed of feature blocks; the representation coefficients are used to select from this dictionary the feature blocks needed for a frame sub-block and to associate them with the corresponding reconstructed residual.

It can be understood that one frame sub-block may correspond to multiple feature blocks; these feature blocks are superimposed and then combined with the reconstructed residual to recover the frame sub-block.
Specifically, reconstructing the multiple frame sub-blocks to obtain the scene feature and the representation coefficient of each of the multiple frame sub-blocks includes:

converting the multiple frame sub-blocks into an observation matrix, where the observation matrix represents the multiple frame sub-blocks in matrix form;

reconstructing the observation matrix according to a third constraint condition to obtain a representation coefficient matrix and a scene feature matrix, where the representation coefficient matrix is a matrix including the representation coefficient of each frame sub-block, the non-zero entries of a representation coefficient indicate the scene feature bases, and the scene feature matrix represents the scene feature in matrix form; the third constraint condition requires that the similarity between the picture reconstructed from the representation coefficient matrix and the scene feature matrix and the picture of the frame sub-block meets a preset similarity threshold, that the sparsity of the representation coefficient matrix meets a preset sparsity threshold, and that the data volume of the scene feature matrix is smaller than a preset data-volume threshold.

Calculating the reconstructed residual of each frame sub-block from the data reconstructed from the representation coefficients and the scene feature, together with each frame sub-block, includes:

calculating the reconstructed residual matrix from the observation matrix and the data reconstructed from the representation coefficient matrix and the scene feature matrix, where the reconstructed residual matrix represents the reconstructed residuals in matrix form.
For example, reconstructing the observation matrix according to the third constraint condition to obtain the representation coefficient matrix and the scene feature matrix includes:

calculating the representation coefficient matrix and the scene feature matrix according to a third preset formula, the third preset formula being

$$\min_{F,C} \; \|D - FC\|_F^2 + \lambda \|C\|_1 + \beta \|F\|_F^2$$

where D is the observation matrix, C is the representation coefficient matrix, F is the scene feature, and λ and β are weight parameters used to adjust the coefficient sparsity and low-rankness.
For example, a sparse coding model is selected for modeling and analysis. Suppose a scene S contains N key frames, and each key frame is split evenly into M equal-sized frame sub-blocks. Each frame sub-block is pulled into a column vector, forming an observation matrix

$$D = [d_1, d_2, \ldots, d_{N \times M}].$$

The scene reconstruction problem can then be converted into the following problem:

$$\min_{F,C} \; \|D - FC\|_F^2 + \lambda \|C\|_1 + \beta \|F\|_F^2$$

where λ and β are weight parameters, and the optimization variables are the scene feature F and the representation coefficients C.
The first term of the objective function constrains the reconstruction error so that the picture reconstructed from the scene feature and the representation coefficients is as similar as possible to the original picture. The second term is the sparsity constraint on the coefficients C, meaning that each picture can be reconstructed from a small number of feature bases. The last term constrains the scene feature F to prevent its data volume from becoming too large; in other words, the first term of the formula is the error term and the latter two are regularization terms that constrain the representation. Specific optimization algorithms include the conjugate gradient method, OMP (Orthogonal Matching Pursuit), LASSO, and the like. The scene feature obtained from the final solution is shown in FIG. 16. The reconstructed residual is then solved from the formula E = D - FC, where the dimension of the columns of F matches the dimension of the frame sub-blocks.
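A hedged sketch of this sparse-coding formulation using scikit-learn's dictionary learner as an off-the-shelf stand-in (its internal objective constrains the dictionary atoms rather than penalizing $\|F\|_F^2$, so it approximates, not reproduces, the third preset formula; the component count and alpha are assumptions):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def sparse_scene_features(D, n_bases=64, alpha=1.0):
    """D: observation matrix with one vectorized sub-block per column.
    Returns scene feature F (one basis per column), coefficients C,
    and reconstruction residuals E = D - F C."""
    X = D.T                                   # sklearn wants samples as rows
    learner = DictionaryLearning(n_components=n_bases, alpha=alpha,
                                 transform_algorithm="omp")
    C_rows = learner.fit_transform(X)         # (n_samples, n_bases)
    F = learner.components_.T                 # (dim, n_bases)
    C = C_rows.T                              # (n_bases, n_samples)
    E = D - F @ C                             # per-sub-block residuals
    return F, C, E
```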
Referring to FIG. 16, each small box in FIG. 16 is a scene feature base, and the scene feature matrix F is the matrix composed of these small boxes (scene feature bases). FC = F[c1, c2, c3, ...], and Fc1 denotes combining the scene feature bases according to the representation coefficient c1 to obtain a linear representation in the feature space; adding the reconstructed residual e1 restores the original frame sub-block image I1.

In Example 1, the scene feature bases are determined directly from the observation samples D, that is, they are selected from the observation samples D. In this example, the scene feature is instead learned by the algorithm: during the optimization of the parameter F, an iterative solution is computed against the objective function, and the optimal result minimizes the reconstruction error. The encoded information is concentrated in F and E. The dimension of F matches the dimension of the frame sub-blocks, and the number of bases in F can be set in advance: the fewer bases, the less encoded information but the larger the reconstructed residual E; the more bases, the more encoded information but the smaller the reconstructed residual E. Hence the number of bases in F needs to be balanced through the weight parameters.
Step 1207: Perform predictive coding on the scene feature to obtain the scene feature prediction encoded data.

Step 1208: Perform predictive coding on the reconstructed residuals to obtain the residual prediction encoded data.

The predictive coding part of the encoding device includes intra-frame predictive coding and inter-frame predictive coding. The scene features and reconstruction errors use intra-frame predictive coding, while the remaining frames of a shot, that is, the non-key frames, use inter-frame predictive coding. The specific intra-frame predictive coding flow is similar to the HEVC intra coding module. Since the scene feature matrix is low-rank, only the key columns of the scene feature matrix need to be encoded. The reconstruction error belongs to residual coding, so the amount of encoded data is small and the compression ratio is high.

Step 1209: Perform reconstruction according to the scene feature, the representation coefficients, and the reconstructed residuals to obtain the reference frames.

For the specific implementation of step 1209, refer to step 508.

Step 1210: Using the reference frames as reference, perform inter-frame predictive coding on the B frames and P frames to obtain the B-frame prediction encoded data and P-frame prediction encoded data.

For the specific implementation of step 1210, refer to step 509.

Step 1211: Perform transform coding, quantization coding, and entropy coding on the prediction encoded data to obtain the video compressed data.

The prediction encoded data includes the scene feature prediction encoded data, the residual prediction encoded data, the B-frame prediction encoded data, and the P-frame prediction encoded data.

For the specific implementation of step 1211, refer to step 510.
Similar to the embodiment shown in FIG. 5, the embodiment shown in FIG. 12 is described based on the HEVC scenario, but the video encoding method shown in FIG. 12 can also be applied to other scenarios.

In summary, the encoding device acquires multiple video frames that contain redundant data in their picture content relative to one another, in particular redundant data at local positions. The encoding device therefore splits each of the multiple video frames to obtain multiple frame sub-blocks, and then reconstructs the frame sub-blocks to obtain the scene feature, the representation coefficient of each frame sub-block, and the reconstructed residual of each frame sub-block. The scene feature includes multiple independent scene feature bases that cannot be reconstructed from one another within the scene feature; a scene feature base describes the picture content characteristics of a frame sub-block, the representation coefficient expresses the correspondence between scene feature bases and frame sub-blocks, and the reconstructed residual expresses the difference between a frame sub-block and the scene feature bases. Subsequently, the scene feature is predictively encoded to obtain the scene feature prediction encoded data, and the reconstructed residuals are predictively encoded to obtain the residual prediction encoded data.

In this way, the reconstruction reduces the redundancy of the redundant data contained at local positions. Thus, in the encoding operation, the total amount of compressed data for the scene feature and the reconstructed residuals is reduced relative to the compressed data amount of the original video frames, shrinking the data volume obtained after compression. Moreover, since the reconstructed residual contains only the residual information beyond the scene information, its information content is small and sparse; during predictive coding this allows it to be encoded with fewer codewords, yielding a small amount of encoded data and a high compression ratio. In this way, the method of the embodiment of the present invention can effectively improve the compression efficiency of video frames.

Corresponding to the video encoding method shown in FIG. 12, FIG. 17 shows a video decoding method. Referring to FIG. 17, the video decoding method of the embodiment of the present invention includes:
Step 1701: Acquire the scene feature prediction encoded data, the residual prediction encoded data, and the representation coefficients.

The decoding device acquires video compressed data, which may be the video compressed data obtained by the video encoding method of the embodiment shown in FIG. 12.

Specifically, in the HEVC scenario, acquiring the scene feature prediction encoded data and the residual prediction encoded data includes: acquiring the video compressed data, and then performing entropy decoding, dequantization, and inverse DCT transform on it to obtain the prediction encoded data, where the prediction encoded data includes the scene feature prediction encoded data, the residual prediction encoded data, the B-frame prediction encoded data, and the P-frame prediction encoded data.

Step 1702: Decode the scene feature prediction encoded data to obtain the scene feature.

The scene feature includes multiple independent scene feature bases that cannot be reconstructed from one another within the scene feature; a scene feature base describes the picture content characteristics of a frame sub-block, the representation coefficient expresses the correspondence between scene feature bases and frame sub-blocks, and the reconstructed residual expresses the difference between a frame sub-block and the scene feature bases.

Step 1703: Decode the residual prediction encoded data to obtain the reconstructed residuals.

Here, a reconstructed residual represents the difference between a video frame and the scene information.
Step 1704: Perform reconstruction according to the scene feature, the representation coefficients, and the reconstructed residuals to obtain the multiple frame sub-blocks.

Corresponding to the scene feature, representation coefficients, and reconstructed residuals obtained by the video encoding method shown in FIG. 12, the video decoding method of the embodiment of the present invention reconstructs the multiple frame sub-blocks from the scene feature, the representation coefficients, and the reconstructed residuals. The method of the embodiment of the present invention may refer to FIG. 4b, except that after the scene feature is decoded, the representation coefficients are used to determine the required scene feature bases within the scene feature; for example, using the scene feature F1*[C1, C3, C5]^T and then reconstructing with the reconstructed residuals E1, E3, E5 respectively yields the key frames I1, I3, I5, where C1, C3, and C5 are the representation coefficients of the key frames I1, I3, and I5.
Step 1705: Combine the multiple frame sub-blocks to obtain the multiple video frames.

Steps 1704 and 1705 are a specific implementation of the step of reconstructing the multiple video frames according to the scene information and the reconstructed residuals.

For example, in the HEVC scenario, combining the multiple frame sub-blocks to obtain the multiple video frames amounts to combining the frame sub-blocks to obtain the multiple I frames. For example, during decoding, each sub-block is first reconstructed from the decoded scene features, representation coefficients, and reconstruction errors, and the sub-blocks are then combined by number to obtain the final key-frame content. Furthermore, the method of the embodiment of the present invention further includes: using the I frames as reference frames, performing inter-frame decoding on the B-frame prediction encoded data and the P-frame prediction encoded data to obtain the B frames and P frames. The decoding device then arranges the I frames, B frames, and P frames in chronological order to obtain the video stream.
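A minimal decoder-side sketch of steps 1704 and 1705 (assuming square frames that divide evenly into the block size, and sub-blocks numbered in raster order; both are illustrative conventions):

```python
import numpy as np

def reconstruct_subblocks(F, C, E):
    """Sub-block j = F @ c_j + e_j, computed for all columns at once."""
    return F @ C + E            # columns are the reconstructed sub-blocks

def assemble_frame(subblock_columns, frame_shape, block=32):
    """Place raster-ordered sub-blocks back into a full key frame."""
    h, w = frame_shape
    frame = np.empty(frame_shape)
    idx = 0
    for r in range(0, h, block):
        for c in range(0, w, block):
            frame[r:r + block, c:c + block] = \
                subblock_columns[:, idx].reshape(block, block)
            idx += 1
    return frame
```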
In this way, after the compressed video data is obtained by the video encoding method of the embodiment shown in FIG. 12, the video frames can be decoded by the video decoding method of the embodiment shown in FIG. 17.

In the above embodiments, acquiring the video frames on which the reconstruction operation is performed has been illustrated by extracting video frames from an acquired video stream and by acquiring video frames directly; in some embodiments of the present invention, the video frames may also be acquired by obtaining compressed video frames and then decompressing them.

Specifically, step 201 may be implemented by the following steps:

Step F1: Acquire a compressed video stream.

Here, the compressed video stream includes compressed video frames; it may be, for example, an HEVC-compressed video stream.

Step F2: Determine multiple target video frames from the compressed video stream.

Here, a target video frame is a video frame that is independently compression-encoded in the compressed video stream.

Step F3: Decode the target video frames to obtain the decoded target video frames.

The decoded target video frames are used to perform step 202.
In some embodiments of the present invention, in order to further reduce the redundancy of the decoded video frames, a classification operation may be performed on these video frames; for details, refer to step 504.

By performing the video encoding method of the embodiment of the present invention on the video frames that were independently compressed in the compressed video stream, the compression efficiency of these video frames can be improved and the amount of their compressed data reduced.

For example, the embodiment of the present invention can perform secondary compression on an HEVC-compressed video stream. Specifically, after compressed-video discrimination, I-frame extraction, and intra-frame decoding, the I frames to be used for performing the method of the embodiment of the present invention are obtained. For example, the method of the embodiment of the present invention may be implemented by adding compressed-video discrimination, I-frame extraction, and intra-frame decoding modules on top of the original video encoding device.

First, whether the video stream is compressed is determined according to whether the video stream contains compressed-video stream header information.

Then, the I-frame extraction operation is performed. Since HEVC-compressed video uses a hierarchical bitstream structure, independent GOP data is extracted at the group-of-pictures layer according to the GOP header. Each frame within the GOP is then extracted according to its picture header; the first frame of the GOP is the I frame, which can thus be extracted.

Next, because the I frame has already been independently compressed in the HEVC-compressed video, as outlined in the brief introduction to the HEVC standard above, the decoding device performs intra-frame decoding on the extracted I-frame encoded data to obtain the decoded I frames; for the remaining encoding and decoding steps, refer to the encoding and decoding operations above. In this way, secondary encoding and decoding of compressed video can be performed on top of the original video encoded data.
由于本文发明所提方法可以对已有压缩视频数据进行二次编码及解码,且与传统HEVC方法在变换编码、量化编码、熵编码等环节保持一致,因此,在进行本发明功能模块部署时,可以与原有视频压缩设备保持兼容。Because the method of the invention can perform secondary encoding and decoding on the existing compressed video data, and is consistent with the traditional HEVC method in the process of transform coding, quantization coding, entropy coding, etc., therefore, when performing the function module deployment of the present invention, Can be compatible with legacy video compression devices.
可以理解,本发明实施例的方法也可以应用于其它的编码数据,按照上述的步骤提取并解码已被压缩的视频帧,然后再执行上述图2、图5和图12的视频编码方法的步骤。其中,针对非HEVC视频编码数据,可以根据压缩过后的图像数据量大小进行I帧的判断,通常I帧编码数据要远远大于P帧及B帧编码数据。It can be understood that the method of the embodiment of the present invention can also be applied to other encoded data, and the steps of extracting and decoding the compressed video frame according to the above steps, and then performing the steps of the video encoding method of FIG. 2, FIG. 5 and FIG. 12 described above. . Wherein, for non-HEVC video encoded data, the I frame can be determined according to the size of the compressed image data, and usually the I frame encoded data is much larger than the P frame and the B frame encoded data.
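A minimal sketch of the size heuristic just described, assuming a per-frame list of compressed payload sizes is available; the threshold factor is a hypothetical tuning parameter, not a value specified by the text.

```python
def find_i_frames(frame_sizes, factor=3.0):
    """Heuristically flag I frames in a non-HEVC stream: an I frame's
    compressed payload is typically much larger than that of P/B frames.
    frame_sizes: per-frame compressed sizes in bytes.
    factor: hypothetical tuning parameter (not fixed by the text)."""
    avg = sum(frame_sizes) / len(frame_sizes)
    return [i for i, size in enumerate(frame_sizes) if size > factor * avg]

# Frames 0 and 5 carry far more data than the rest, so they are flagged.
sizes = [90_000, 8_000, 7_500, 9_200, 8_100, 95_000, 7_900, 8_300]
print(find_i_frames(sizes))  # -> [0, 5]
```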
FIG. 18a is a schematic structural diagram of a video encoding device according to an embodiment of the present invention. FIG. 18b is a schematic diagram of part of the structure of the video encoding device of the embodiment shown in FIG. 18a. The video encoding device can be used to perform the video encoding methods of the foregoing embodiments. Referring to FIG. 18a and FIG. 18b, the video encoding device includes: an acquisition module 1801, a reconstruction module 1802, and a prediction encoding module 1803. The acquisition module 1801 is configured to perform the processing related to acquiring video frames in the embodiments of the foregoing video encoding methods. The reconstruction module 1802 is configured to perform the processing related to the reconstruction operation that reduces the redundancy of redundant data, for example step 202, step 505, and step 1206. The prediction encoding module 1803 is configured to perform the predictive-coding steps, for example step 203 and step 204. After the reconstruction module 1802 performs the reconstruction operation on the plurality of video frames acquired by the acquisition module 1801, scene information and reconstruction residuals are obtained, so that the prediction encoding module 1803 predictively encodes the scene information and the reconstruction residuals.
Optionally, between the acquisition module 1801 and the reconstruction module 1802, the video encoding device further includes a feature extraction module 1804 and a metric information calculation module 1805.
The feature extraction module 1804 is configured to perform the processing related to extracting picture feature information of video frames in the embodiments of the foregoing video encoding methods, for example steps D1 and E1.
The metric information calculation module 1805 is configured to perform the processing related to calculating content metric information, for example steps D2 and E2.
Optionally, the video encoding device further includes:
a reference frame reconstruction module 1806, configured to perform the processing related to reconstructing reference frames in the embodiments of the foregoing video encoding methods;
an inter-frame prediction encoding module 1807, configured to perform the processing related to inter-frame predictive coding;
an encoding module 1808, configured to perform the processing related to transform coding, quantization coding, and entropy coding.
Optionally, the reconstruction module 1802 further includes a splitting unit 1809 and a reconstruction unit 1810; the reconstruction unit 1810 may reconstruct the frame sub-blocks obtained by the splitting unit 1809.
The splitting unit 1809 is configured to perform the processing related to splitting video frames, for example step 1206. The reconstruction unit 1810 is configured to perform the processing related to reconstructing frame sub-blocks, for example step 1206.
The reconstruction unit 1810 includes a reconstruction subunit 1811 and a combination subunit 1812.
The reconstruction subunit 1811 is configured to perform the processing related to reconstructing frame sub-blocks to obtain representation coefficients and reconstruction residuals.
The combination subunit 1812 is configured to perform the processing related to combining target frame sub-blocks.
Optionally, the reconstruction unit 1810 may further include a sub-block reconstruction subunit 1813 and a sub-block calculation subunit 1814.
The sub-block reconstruction subunit 1813 is configured to perform the processing related to reconstructing frame sub-blocks to obtain scene features and representation coefficients, where the scene feature bases included in the scene features are independent feature blocks in the feature space.
The sub-block calculation subunit 1814 is configured to perform the processing related to calculating reconstruction residuals.
Optionally, the video encoding device further includes a classification module 1815, configured to perform the processing related to classification in the embodiments of the foregoing video encoding methods.
Optionally, the classification module 1815 includes a feature extraction unit 1816, a distance calculation unit 1817, and a clustering unit 1818.
The feature extraction unit 1816 is configured to extract the feature information of each of the plurality of video frames; the distance calculation unit 1817 is configured to perform the processing related to cluster distances; and the clustering unit 1818 is configured to perform the processing related to clustering.
Optionally, the acquisition module 1801 includes the following units:
a video stream acquisition unit 1819, configured to acquire a video stream;
a frame feature extraction unit 1820, configured to perform the processing related to extracting the feature information of the first video frame and the second video frame;
a shot distance calculation unit 1821, configured to perform the processing related to shot distance calculation;
a shot distance judgment unit 1822, configured to judge whether the shot distance is greater than a preset shot threshold;
a shot segmentation unit 1823, configured to perform the processing related to segmenting a target shot;
a key frame extraction unit 1824, configured to perform the processing related to extracting key frames according to frame distances.
Optionally, the video encoding device further includes:
a training module 1825, configured to perform discriminative training on each shot segmented from the video stream, to obtain a plurality of classifiers corresponding to the shots;
a discrimination module 1826, configured to discriminate a target video frame using a target classifier, to obtain a discrimination score;
a scene determination module 1827, configured to determine, when the discrimination score is greater than a preset score threshold, that the target video frame and the shot to which the target classifier belongs are of the same scene;
a cluster determination module 1828, configured to determine the video frames of one or more clusters according to the video frames belonging to the same scene as the shot.
Optionally, the acquisition module 1801 includes:
a compressed video acquisition unit 1829, configured to acquire a compressed video stream, where the compressed video stream includes compressed video frames;
a frame determination unit 1830, configured to determine target video frames from the compressed video stream, where a target video frame is an independently compression-encoded video frame;
a decoding unit 1831, configured to decode the target video frames to obtain decoded target video frames, where the decoded target video frames are used to perform the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks.
In summary, the acquisition module 1801 acquires a plurality of video frames whose picture content contains redundant data across frames. The reconstruction module 1802 then reconstructs the plurality of video frames to obtain scene information and the reconstruction residual of each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data, and the reconstruction residual represents the difference between a video frame and the scene information. Next, the prediction encoding module 1803 predictively encodes the scene information to obtain scene-feature predictive coded data, and predictively encodes the reconstruction residuals to obtain residual predictive coded data. By reconstructing the plurality of video frames in this way, their redundancy is reduced, so that the total amount of compressed data of the scene features and reconstruction residuals obtained by the encoding operation is smaller than the amount of compressed data of the original video frames. Because each video frame is reconstructed into scene features and a reconstruction residual, and the reconstruction residual contains only the residual information beyond the scene information, its information content is small and sparse; during predictive coding it can therefore be encoded with fewer codewords, yielding a small amount of encoded data and a high compression ratio. Thus, the method of the embodiments of the present invention can effectively improve the compression efficiency of video frames.
FIG. 19 is a schematic structural diagram of a video decoding device according to an embodiment of the present invention. The video decoding device can be used to perform the video decoding methods of the foregoing embodiments. Referring to FIG. 19, the video decoding device includes: an acquisition module 1901, a scene information decoding module 1902, a reconstruction residual decoding module 1903, and a video frame reconstruction module 1904. The scene information decoding module 1902 and the reconstruction residual decoding module 1903 respectively decode the scene-feature predictive coded data and the residual predictive coded data acquired by the acquisition module 1901, so that the video frame reconstruction module 1904 can reconstruct video frames using the decoded data.
The acquisition module 1901 is configured to perform the processing related to acquiring encoded data in the embodiments of the foregoing video decoding methods, for example step 205;
the scene information decoding module 1902 is configured to perform the processing related to decoding scene information, for example step 206 and step 603;
the reconstruction residual decoding module 1903 is configured to perform the processing related to decoding reconstruction residuals, for example step 207;
the video frame reconstruction module 1904 is configured to perform the processing related to reconstructing a plurality of video frames, for example step 208 and step 604.
Optionally, the acquisition module 1901 includes an acquisition unit 1905 and a decoding unit 1906.
The acquisition unit 1905 is configured to perform the processing related to acquiring compressed video data, for example step 601.
The decoding unit 1906 is configured to perform the processing related to obtaining predictive coded data, for example step 602.
The video decoding device further includes: an inter-frame decoding module 1907, configured to perform the processing related to inter-frame decoding, for example step 606;
an arrangement module 1908, configured to perform the processing related to frame arrangement, for example step 607.
Optionally, the acquisition module 1901 is further configured to acquire representation coefficients.
The video frame reconstruction module 1904 includes a reconstruction unit 1909 and a combination unit 1910.
The reconstruction unit 1909 is configured to perform the processing related to reconstructing a plurality of frame sub-blocks, for example step 1704.
The combination unit 1910 is configured to perform the processing related to combining frame sub-blocks, for example step 1705.
In summary, after the acquisition module 1901 acquires the scene-feature predictive coded data and the residual predictive coded data, the scene information decoding module 1902 decodes the scene-feature predictive coded data to obtain the scene information, where the scene information includes data obtained by reducing the redundancy of redundant data, the redundant data being the redundancy in picture content among the plurality of video frames. The reconstruction residual decoding module 1903 then decodes the residual predictive coded data to obtain the reconstruction residuals, which represent the differences between the video frames and the scene information. The video frame reconstruction module 1904 performs reconstruction according to the scene information and the reconstruction residuals, to obtain the plurality of video frames. In this way, the decoding operation on the scene-feature predictive coded data and residual predictive coded data produced by the video encoding device of the foregoing embodiments can be completed by the video decoding device of this embodiment of the present invention.
FIG. 20 is a schematic structural diagram of a video codec device according to an embodiment of the present invention. The video codec device can be used to perform the video encoding methods and video decoding methods of the foregoing embodiments. Referring to FIG. 20, the video codec device 2000 includes a video encoding device 2001 and a video decoding device 2002.
The video encoding device 2001 is the video encoding device of the embodiment shown in FIG. 18a and FIG. 18b;
the video decoding device 2002 is the video decoding device of the embodiment shown in FIG. 19.
The video encoding methods and video decoding methods provided by the embodiments of the present invention are described below in terms of a hardware architecture. That is, the following embodiments provide a video codec system, which includes a video encoder and a video decoder.
· System architecture
FIG. 21 is a schematic block diagram of a video codec system 10 according to an embodiment of the present invention. As shown in FIG. 21, the video codec system 10 includes a source device 12 and a destination device 14. The source device 12 generates encoded video data, and may therefore be referred to as a video encoding apparatus or video encoding device. The destination device 14 can decode the encoded video data generated by the source device 12, and may therefore be referred to as a video decoding apparatus or video decoding device. The source device 12 and the destination device 14 may be instances of video codec apparatuses or video codec devices, and may include a wide range of devices, including desktop computers, mobile computing devices, notebook (for example, laptop) computers, tablet computers, set-top boxes, handsets such as smartphones, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, or the like.
The destination device 14 may receive the encoded video data from the source device 12 via a channel 16. The channel 16 may include one or more media and/or devices capable of moving the encoded video data from the source device 12 to the destination device 14. In one example, the channel 16 may include one or more communication media that enable the source device 12 to transmit the encoded video data directly to the destination device 14 in real time. In this example, the source device 12 may modulate the encoded video data according to a communication standard (for example, a wireless communication protocol), and may transmit the modulated video data to the destination device 14. The one or more communication media may include wireless and/or wired communication media, for example a radio frequency (RF) spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network, for example a local area network, a wide area network, or a global network (for example, the Internet), and may include routers, switches, base stations, or other equipment that facilitates communication from the source device 12 to the destination device 14.
In another example, the channel 16 may include a storage medium that stores the encoded video data generated by the source device 12. In this example, the destination device 14 may access the storage medium via disk access or card access. The storage medium may include a variety of locally accessible data storage media, such as Blu-ray discs, DVDs, CD-ROMs, flash memory, or other suitable digital storage media for storing encoded video data.
In another example, the channel 16 may include a file server or another intermediate storage device that stores the encoded video data generated by the source device 12. In this example, the destination device 14 may access the encoded video data stored at the file server or other intermediate storage device via streaming or download. The file server may be of a server type capable of storing the encoded video data and transmitting it to the destination device 14. Example file servers include web servers (for example, for a website), file transfer protocol (FTP) servers, network attached storage (NAS) devices, and local disk drives.
The destination device 14 may access the encoded video data via a standard data connection (for example, an Internet connection). Example types of data connections include wireless channels (for example, Wi-Fi connections), wired connections (for example, DSL or cable modem), or combinations of both, suitable for accessing encoded video data stored on a file server. The transmission of the encoded video data from the file server may be streaming, download, or a combination of both.
The techniques of the present invention are not limited to wireless application scenarios. As an example, the techniques may be applied to video codecs supporting a variety of multimedia applications such as over-the-air television broadcasting, cable television transmission, satellite television transmission, streaming video transmission (for example, via the Internet), encoding of video data stored on a data storage medium, decoding of video data stored on a data storage medium, or other applications. In some examples, the video codec system 10 may be configured to support one-way or two-way video transmission, to support applications such as video streaming, video playback, video broadcasting, and/or video telephony.
In the example of FIG. 21, the source device 12 includes a video source 18, a video encoder 20, and an output interface 22. In some examples, the output interface 22 may include a modulator/demodulator (modem) and/or a transmitter. The video source 18 may include a video capture device (for example, a video camera), a video archive containing previously captured video data, a video input interface to receive video data from a video content provider, and/or a computer graphics system for generating video data, or a combination of the foregoing video data sources.
The video encoder 20 may encode the video data from the video source 18. In some examples, the source device 12 transmits the encoded video data directly to the destination device 14 via the output interface 22. The encoded video data may also be stored on a storage medium or file server for later access by the destination device 14 for decoding and/or playback.
In the example of FIG. 21, the destination device 14 includes an input interface 28, a video decoder 30, and a display device 32. In some examples, the input interface 28 includes a receiver and/or a modem. The input interface 28 may receive the encoded video data via the channel 16. The display device 32 may be integrated with the destination device 14 or may be external to it. In general, the display device 32 displays the decoded video data, and may comprise a variety of display devices, for example a liquid crystal display (LCD), a plasma display, an organic light-emitting diode (OLED) display, or another type of display device.
The video encoder 20 and the video decoder 30 may operate according to a video compression standard (for example, the high efficiency video coding standard H.265), and may conform to the HEVC test model (HM). The text description of the H.265 standard, ITU-T H.265 (V3) (04/2015), was published on 29 April 2015 and can be downloaded from http://handle.itu.int/11.1002/1000/12455; the entire content of that document is incorporated herein by reference.
Alternatively, the video encoder 20 and the video decoder 30 may operate according to other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its scalable video coding (SVC) and multiview video coding (MVC) extensions. It should be understood that the techniques of the present invention are not limited to any particular coding standard or technique.
Moreover, FIG. 21 is merely an example, and the techniques of the present invention are applicable to video coding applications that do not necessarily include any data communication between an encoding device and a decoding device (for example, one-sided video encoding or video decoding). In other examples, data is retrieved from local memory, streamed over a network, or handled in a similar manner. An encoding device may encode data and store it to a memory, and/or a decoding device may retrieve data from a memory and decode it. In many examples, encoding and decoding are performed by multiple devices that do not communicate with each other but merely encode data to memory and/or retrieve data from memory and decode it.
The video encoder 20 and the video decoder 30 may each be implemented as any of a variety of suitable circuits, for example one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof. If the techniques are implemented partially or wholly in software, the device may store the software instructions in a suitable non-transitory computer-readable storage medium, and may execute the instructions in hardware using one or more processors to perform the techniques of the present invention. Any of the foregoing (including hardware, software, a combination of hardware and software, and so on) may be regarded as one or more processors. Each of the video encoder 20 and the video decoder 30 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (codec) in another device.
The present invention may generally refer to the video encoder 20 "signaling" certain information to another device (for example, the video decoder 30). The term "signaling" may generally refer to the communication of syntax elements and/or other data representing encoded video data. This communication may occur in real time or near real time. Alternatively, this communication may occur over a span of time, for example when, at encoding time, syntax elements are stored to a computer-readable storage medium as the binary data obtained after encoding; after being stored to this medium, the syntax elements may then be retrieved by a decoding device at any time.
· Block partitioning mode
The video encoder 20 encodes video data, which may include one or more pictures. The video encoder 20 may generate a code stream that contains the encoding information of the video data in the form of a bitstream. The encoding information may include encoded picture data and associated data. The associated data may include sequence parameter sets (SPS), picture parameter sets (PPS), and other syntax structures. An SPS may contain parameters applied to zero or more sequences. A PPS may contain parameters applied to zero or more pictures. A syntax structure is a set of zero or more syntax elements arranged in a specified order in the code stream.
To generate the encoding information of a picture, the video encoder 20 may partition the picture into a grid of coding tree blocks (CTBs). In some examples, a CTB may be referred to as a "tree block", a "largest coding unit" (LCU), or a "coding tree unit". A CTB is not limited to a particular size and may include one or more coding units (CUs). Each CTB may be associated with a pixel block of equal size within the picture. Each pixel may correspond to one luminance (luma) sample and two chrominance (chroma) samples; thus, each CTB may be associated with one luma sample block and two chroma sample blocks. The CTBs of a picture may be divided into one or more slices. In some examples, each slice contains an integer number of CTBs. As part of encoding a picture, the video encoder 20 may generate encoding information for each slice of the picture, that is, encode the CTBs within the slice. To encode a CTB, the video encoder 20 may recursively perform quadtree partitioning on the pixel block associated with the CTB, to partition it into successively smaller pixel blocks. The smaller pixel blocks may be associated with CUs.
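A hedged sketch of the recursive quadtree split just described. The split decision used here (a variance threshold on the pixel values) and the minimum block size are illustrative assumptions, since the text does not specify the encoder's actual split criterion.

```python
import numpy as np

def quadtree_split(block, top, left, min_size=8, var_thresh=100.0, leaves=None):
    """Recursively split a square pixel block into four equal sub-blocks
    until it is flat enough or reaches min_size, mimicking how a CTB's
    pixel block is partitioned into CU-sized blocks. Returns a list of
    (top, left, size) leaves."""
    if leaves is None:
        leaves = []
    size = block.shape[0]
    if size <= min_size or block.var() <= var_thresh:
        leaves.append((top, left, size))
        return leaves
    h = size // 2
    quadtree_split(block[:h, :h], top,     left,     min_size, var_thresh, leaves)
    quadtree_split(block[:h, h:], top,     left + h, min_size, var_thresh, leaves)
    quadtree_split(block[h:, :h], top + h, left,     min_size, var_thresh, leaves)
    quadtree_split(block[h:, h:], top + h, left + h, min_size, var_thresh, leaves)
    return leaves

ctb = np.random.randint(0, 256, (64, 64)).astype(float)  # one 64x64 CTB
print(quadtree_split(ctb, 0, 0)[:4])  # first few leaf blocks of the split
```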
· Prediction
The video encoder 20 may generate one or more prediction units (PUs) for each CU that is not split further. Each PU of a CU may be associated with a different pixel block within the pixel block of the CU. The video encoder 20 may generate a predictive pixel block for each PU of the CU, using intra prediction or inter prediction. If the video encoder 20 uses intra prediction to generate the predictive pixel block of a PU, it may do so based on decoded pixels of the picture associated with the PU. If the video encoder 20 uses inter prediction, it may generate the predictive pixel block of the PU based on decoded pixels of one or more pictures other than the picture associated with the PU. The video encoder 20 may generate a residual pixel block of the CU based on the predictive pixel blocks of the PUs of the CU; the residual pixel block may indicate the differences between the sample values in the predictive pixel blocks of the PUs and the corresponding sample values in the initial pixel block of the CU.
· Transform and quantization
The video encoder 20 may perform recursive quadtree partitioning on the residual pixel block of the CU to partition it into one or more smaller residual pixel blocks associated with the transform units (TUs) of the CU. Because each pixel in a pixel block associated with a TU corresponds to one luma sample and two chroma samples, each TU may be associated with one luma residual sample block and two chroma residual sample blocks. The video encoder 20 may apply one or more transforms to the residual sample blocks associated with the TUs to generate coefficient blocks (that is, blocks of coefficients). The transform may be a DCT or a variant thereof. Using the DCT transform matrix, the coefficient block is obtained by computing the two-dimensional transform as one-dimensional transforms applied in the horizontal and vertical directions. The video encoder 20 may perform a quantization process on each coefficient in a coefficient block. Quantization generally refers to the process by which coefficients are quantized to reduce the amount of data used to represent them, thereby providing further compression.
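The separable computation just mentioned (a 2-D transform obtained from 1-D transforms along each axis) can be sketched as follows. SciPy's floating-point type-II DCT is used here as a stand-in for the integer transforms an actual HEVC encoder would apply.

```python
import numpy as np
from scipy.fft import dct

def dct2(residual_block):
    """2-D DCT of a residual sample block, computed separably: a 1-D DCT
    along each column (vertical), then a 1-D DCT along each row (horizontal)."""
    vertical = dct(residual_block, type=2, norm='ortho', axis=0)
    return dct(vertical, type=2, norm='ortho', axis=1)

block = np.random.randn(8, 8)   # one 8x8 residual sample block
coeffs = dct2(block)            # the resulting coefficient block
print(coeffs.shape)             # (8, 8); energy compacts toward coeffs[0, 0]
```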
· Entropy coding
The video encoder 20 may generate a set of syntax elements representing the coefficients in the quantized coefficient block, and may apply entropy encoding operations (for example, context-adaptive binary arithmetic coding (CABAC) operations) to some or all of these syntax elements. To apply CABAC coding to a syntax element, the video encoder 20 may binarize the syntax element to form a binary sequence comprising one or more bits (called "bins"). The video encoder 20 may encode some of the bins using regular coding, and may encode the other bins using bypass coding.
· Encoder-side picture reconstruction
In addition to entropy encoding the syntax elements of a coefficient block, the video encoder 20 may apply inverse quantization and an inverse transform to the transformed coefficient block, to reconstruct a residual sample block from it. The video encoder 20 may add the reconstructed residual sample block to the corresponding sample blocks of one or more predictive sample blocks, to produce a reconstructed sample block. By reconstructing the sample block of each colour component, the video encoder 20 may reconstruct the pixel block associated with a TU. The pixel block of each TU of the CU is reconstructed in this way until the entire pixel block of the CU has been reconstructed.
· Encoder-side filtering
After the video encoder 20 reconstructs the pixel block of a CU, it may perform a deblocking filtering operation to reduce the blocking artifacts of the pixel block associated with the CU. After performing the deblocking filtering operation, the video encoder 20 may use sample adaptive offset (SAO) to modify the reconstructed pixel blocks of the CTBs of the picture. After performing these operations, the video encoder 20 may store the reconstructed pixel block of the CU in a decoded picture buffer, for use in generating the predictive pixel blocks of other CUs.
· Entropy decoding
The video decoder 30 may receive a code stream, which contains, in the form of a bitstream, the encoding information of the video data encoded by the video encoder 20. The video decoder 30 may parse the code stream to extract syntax elements from it. When the video decoder 30 performs CABAC decoding, it may perform regular decoding on some bins and bypass decoding on others; the bins in the code stream have a mapping relationship to the syntax elements, and the syntax elements are obtained by parsing the bins.
· Decoder-side picture reconstruction
The video decoder 30 may reconstruct the pictures of the video data based on the syntax elements extracted from the code stream. The process of reconstructing the video data based on the syntax elements is generally the inverse of the process performed by the video encoder 20 to generate those syntax elements. For example, the video decoder 30 may generate the predictive pixel blocks of the PUs of a CU based on the syntax elements associated with the CU. In addition, the video decoder 30 may inverse-quantize the coefficient blocks associated with the TUs of the CU, and may perform an inverse transform on the inverse-quantized coefficient blocks to reconstruct the residual pixel blocks associated with the TUs. The video decoder 30 may reconstruct the pixel block of the CU based on the predictive pixel blocks and the residual pixel blocks.
· Decoder-side filtering
After the video decoder 30 reconstructs the pixel block of a CU, it may perform a deblocking filtering operation to reduce the blocking artifacts of the pixel block associated with the CU. In addition, based on one or more SAO syntax elements, the video decoder 30 may perform the same SAO operations as the video encoder 20. After performing these operations, the video decoder 30 may store the pixel block of the CU in a decoded picture buffer, which may provide reference pictures for subsequent motion compensation, intra prediction, and presentation by a display device.
· Encoding module
FIG. 22 is a block diagram illustrating an example video encoder 20 configured to implement the techniques of the present invention. It should be understood that FIG. 22 is exemplary and should not be considered as limiting the techniques as broadly exemplified and described in the present invention. As shown in FIG. 22, the video encoder 20 includes a prediction processing unit 100, a residual generation unit 102, a transform processing unit 104, a quantization unit 106, an inverse quantization unit 108, an inverse transform processing unit 110, a reconstruction unit 112, a filter unit 113, a decoded picture buffer 114, and an entropy encoding unit 116. The entropy encoding unit 116 includes a regular CABAC codec engine 118 and a bypass codec engine 120. The prediction processing unit 100 includes an inter prediction processing unit 121 and an intra prediction processing unit 126. The inter prediction processing unit 121 includes a motion estimation unit 122 and a motion compensation unit 124. In other examples, the video encoder 20 may include more, fewer, or different functional components.
· Prediction module
The video encoder 20 receives video data. To encode the video data, the video encoder 20 may encode each slice of each picture of the video data. As part of encoding a slice, the video encoder 20 may encode each CTB in the slice. As part of encoding a CTB, the prediction processing unit 100 may perform quadtree partitioning on the pixel block associated with the CTB, to divide it into successively smaller pixel blocks. For example, the prediction processing unit 100 may partition the pixel block of the CTB into four equally sized sub-blocks, partition one or more of those sub-blocks into four equally sized sub-sub-blocks, and so on.
The video encoder 20 may encode the CUs of a CTB in a picture to generate the encoding information of the CUs, encoding the CUs of the CTB in a z-scan (zigzag) order. In other words, the video encoder 20 may encode the CUs in the order upper-left CU, upper-right CU, lower-left CU, and then lower-right CU. When the video encoder 20 encodes a partitioned CU, it may encode the CUs associated with the sub-blocks of the pixel block of the partitioned CU in the same z-scan order.
Furthermore, the prediction processing unit 100 may partition the pixel block of a CU among one or more PUs of the CU. The video encoder 20 and the video decoder 30 may support various PU sizes. Assuming that the size of a particular CU is 2N×2N, the video encoder 20 and the video decoder 30 may support PU sizes of 2N×2N or N×N for intra prediction, and symmetric PU sizes of 2N×2N, 2N×N, N×2N, N×N, or similar for inter prediction. The video encoder 20 and the video decoder 30 may also support asymmetric PUs of 2N×nU, 2N×nD, nL×2N, and nR×2N for inter prediction.
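A minimal sketch enumerating the PU geometries listed above for a 2N×2N CU. The 1/4–3/4 split used for the asymmetric modes follows the usual HEVC asymmetric-motion-partition convention, which this text does not spell out, so treat it as an assumption.

```python
def pu_partitions(size):
    """Return the PU rectangles (top, left, height, width) inside a
    size x size CU for each partition mode named in the text."""
    n, q = size // 2, size // 4
    return {
        "2Nx2N": [(0, 0, size, size)],
        "NxN":   [(0, 0, n, n), (0, n, n, n), (n, 0, n, n), (n, n, n, n)],
        "2NxN":  [(0, 0, n, size), (n, 0, n, size)],
        "Nx2N":  [(0, 0, size, n), (0, n, size, n)],
        "2NxnU": [(0, 0, q, size), (q, 0, size - q, size)],        # assumed 1/4 top
        "2NxnD": [(0, 0, size - q, size), (size - q, 0, q, size)], # assumed 1/4 bottom
        "nLx2N": [(0, 0, size, q), (0, q, size, size - q)],        # assumed 1/4 left
        "nRx2N": [(0, 0, size, size - q), (0, size - q, size, q)], # assumed 1/4 right
    }

print(pu_partitions(64)["2NxnU"])  # -> [(0, 0, 16, 64), (16, 0, 48, 64)]
```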
The inter prediction processing unit 121 may generate predictive data for each PU of a CU by performing inter prediction on it. The predictive data of a PU may include a predictive pixel block corresponding to the PU and the motion information of the PU. A slice may be an I slice, a P slice, or a B slice. The inter prediction unit 121 may perform different operations on a PU of a CU depending on whether the PU is in an I slice, a P slice, or a B slice. In an I slice, all PUs are intra-predicted.
If a PU is in a P slice, the motion estimation unit 122 may search the reference pictures in a list of reference pictures (for example, "list 0") for a reference block of the PU. The reference block of the PU may be the pixel block that most closely corresponds to the pixel block of the PU. The motion estimation unit 122 may generate a reference picture index indicating the reference picture in list 0 that contains the reference block, and a motion vector indicating the spatial displacement between the pixel block of the PU and the reference block. The motion estimation unit 122 may output the reference picture index and the motion vector as the motion information of the PU, and the motion compensation unit 124 may generate the predictive pixel block of the PU based on the reference block indicated by the motion information of the PU.
If a PU is in a B slice, the motion estimation unit 122 may perform unidirectional or bidirectional inter prediction on the PU. To perform unidirectional inter prediction on the PU, the motion estimation unit 122 may search the reference pictures of a first reference picture list ("list 0") or a second reference picture list ("list 1") for a reference block of the PU, and may output the following as the motion information of the PU: a reference picture index indicating the position in list 0 or list 1 of the reference picture containing the reference block, a motion vector indicating the spatial displacement between the pixel block of the PU and the reference block, and a prediction direction indicator indicating whether the reference picture is in list 0 or list 1. To perform bidirectional inter prediction on the PU, the motion estimation unit 122 may search the reference pictures in list 0 for a reference block of the PU, and may also search the reference pictures in list 1 for another reference block of the PU. The motion estimation unit 122 may generate reference picture indices indicating the positions in list 0 and list 1 of the reference pictures containing the reference blocks, and may also generate motion vectors indicating the spatial displacements between the reference blocks and the pixel block of the PU. The motion information of the PU may include these reference picture indices and motion vectors. The motion compensation unit 124 may generate the predictive pixel block of the PU based on the reference blocks indicated by the motion information of the PU.
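The reference-block search performed by motion estimation can be sketched as an exhaustive block-matching search over a small window, scoring candidates by the sum of absolute differences (SAD). The window size and the SAD metric are illustrative choices; a real encoder would use faster search strategies and sub-pixel refinement.

```python
import numpy as np

def motion_search(cur_block, ref_pic, top, left, search_range=8):
    """Find the block in ref_pic that best matches cur_block, which sits at
    (top, left) in the current picture. Returns the motion vector (dy, dx)
    and the best SAD cost."""
    h, w = cur_block.shape
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref_pic.shape[0] or x + w > ref_pic.shape[1]:
                continue  # candidate falls outside the reference picture
            sad = np.abs(cur_block - ref_pic[y:y+h, x:x+w]).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

ref = np.zeros((32, 32)); ref[10:18, 12:20] = 1.0  # bright patch in the reference
cur = np.ones((8, 8))                              # current block is that patch
print(motion_search(cur, ref, top=8, left=8))      # motion vector (2, 4), SAD 0
```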
The intra prediction processing unit 126 may generate predictive data for a PU by performing intra prediction on it. The predictive data of the PU may include the predictive pixel block of the PU and various syntax elements. The intra prediction processing unit 126 may perform intra prediction on PUs in I slices, P slices, and B slices.
To perform intra prediction on a PU, the intra prediction processing unit 126 may use multiple intra prediction modes to generate multiple sets of predictive data for the PU. To generate a set of predictive data for the PU using an intra prediction mode, the intra prediction processing unit 126 may extend samples from the sample blocks of neighbouring PUs across the sample block of the PU, in the direction associated with the intra prediction mode. Assuming a left-to-right, top-to-bottom coding order for PUs, CUs, and CTBs, the neighbouring PU may be above the PU, above and to the right of the PU, above and to the left of the PU, or to the left of the PU. The intra prediction processing unit 126 may use different numbers of intra prediction modes, for example 33 directional intra prediction modes. In some examples, the number of intra prediction modes may depend on the size of the pixel block of the PU.
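As a hedged illustration of extending neighbouring samples across a PU, the sketch below implements three of the simplest modes (vertical, horizontal, and DC). The 33 directional modes mentioned in the text generalise the same idea with fractional-sample interpolation, which is omitted here.

```python
import numpy as np

def intra_predict(above, left, size, mode):
    """Predict a size x size PU from the reconstructed row `above` and
    column `left` of neighbouring samples."""
    if mode == "vertical":      # copy the row above straight down
        return np.tile(above[:size], (size, 1))
    if mode == "horizontal":    # copy the left column straight across
        return np.tile(left[:size, None], (1, size))
    if mode == "dc":            # flat block at the mean of the neighbours
        return np.full((size, size), (above[:size].mean() + left[:size].mean()) / 2)
    raise ValueError(mode)

above = np.arange(8, dtype=float)
left = np.arange(8, dtype=float)
print(intra_predict(above, left, 8, "vertical")[0])  # first row equals `above`
```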
The prediction processing unit 100 may select the predictive data of the PUs of a CU from among the predictive data generated for the PUs by the inter prediction processing unit 121 or by the intra prediction processing unit 126. In some examples, the prediction processing unit 100 selects the predictive data of the PUs of the CU based on rate/distortion measures of the sets of predictive data. For example, a Lagrangian cost function is used to choose between coding modes and their parameter values (such as motion vectors, reference indices, and intra prediction directions). This kind of cost function uses a weighting factor lambda to tie the actual or estimated image distortion caused by the lossy coding method to the actual or estimated amount of information needed to represent the pixel values in the image region: C = D + lambda × R, where C is the Lagrangian cost to be minimized, D is the image distortion (for example, mean squared error) of a mode and its parameters, and R is the number of bits needed to reconstruct the image block in the decoder (including, for example, the amount of data used to represent the candidate motion vectors). In general, the coding mode with the lowest cost is selected as the actual coding mode. The predictive pixel block of the selected predictive data may be referred to herein as the selected predictive pixel block.
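A minimal sketch of the Lagrangian selection C = D + lambda × R described above. The candidate list and the per-mode distortion/rate figures are placeholder inputs, with mean squared error standing in for D as the text suggests.

```python
def select_mode(candidates, lam):
    """candidates: iterable of (mode_name, distortion_D, rate_R_bits).
    Returns the candidate minimising the Lagrangian cost C = D + lam * R."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Hypothetical (distortion, rate) figures for one block:
modes = [("intra_dc", 410.0, 96), ("inter_2Nx2N", 350.0, 150), ("skip", 520.0, 8)]
print(select_mode(modes, lam=1.0))  # -> ('inter_2Nx2N', 350.0, 150)
```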
The residual generation unit 102 may generate the residual pixel block of a CU based on the pixel block of the CU and the selected predictive pixel blocks of the PUs of the CU. For example, the residual generation unit 102 may generate the residual pixel block of the CU such that each sample in it has a value equal to the difference between the corresponding sample in the pixel block of the CU and the corresponding sample in the selected predictive pixel block of the PU of the CU.
The prediction processing unit 100 may perform quadtree partitioning to partition the residual pixel block of the CU into sub-blocks. Each residual pixel block that is not divided further may be associated with a different TU of the CU. The size and position of the residual pixel block associated with a TU of the CU are not necessarily related to the size and position of the pixel blocks of the PUs of the CU.
· Transform module
Because each pixel of the residual pixel block of a TU corresponds to one luma sample and two chroma samples, each TU may be associated with one luma sample block and two chroma sample blocks. The transform processing unit 104 may generate a coefficient block for each TU of the CU by applying one or more transforms to the residual sample block associated with the TU. For example, the transform processing unit 104 may apply a discrete cosine transform (DCT), a directional transform, or a conceptually similar transform to the residual sample block.
· Quantization module
The quantization unit 106 may quantize the coefficients in a coefficient block. For example, an n-bit coefficient may be truncated to an m-bit coefficient during quantization, where n is greater than m. The quantization unit 106 may quantize the coefficient block associated with a TU of the CU based on a quantization parameter (QP) value associated with the CU. The video encoder 20 may adjust the degree of quantization applied to the coefficient blocks associated with the CU by adjusting the QP value associated with the CU.
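As a hedged sketch of QP-controlled quantization, the code below uses the well-known HEVC-style relation Qstep ≈ 2^((QP−4)/6), under which the step size doubles every six QP values; both the rounding rule and this relation are illustrative, not quoted from this text.

```python
import numpy as np

def quantize(coeffs, qp):
    """Uniformly quantize a coefficient block with a step size derived
    from the quantization parameter: Qstep = 2 ** ((qp - 4) / 6)."""
    qstep = 2.0 ** ((qp - 4) / 6.0)
    return np.round(coeffs / qstep).astype(int), qstep

def dequantize(levels, qstep):
    """Inverse quantization: scale the quantized levels back by the step size."""
    return levels * qstep

coeffs = np.array([52.0, -7.3, 3.1, 0.4])
levels, qstep = quantize(coeffs, qp=28)           # qstep = 16 at qp 28
print(levels, dequantize(levels, qstep))          # coarser as qp grows
```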
·Coding reconstruction module (inverse quantization and inverse transform)
Inverse quantization unit 108 and inverse transform processing unit 110 may apply inverse quantization and an inverse transform, respectively, to the transformed coefficient block to reconstruct the residual sample block from the coefficient block. Reconstruction unit 112 may add the samples of the reconstructed residual sample block to the corresponding samples of one or more predictive sample blocks generated by prediction processing unit 100 to produce the reconstructed sample block associated with the TU. By reconstructing the sample block of every TU of the CU in this way, video encoder 20 may reconstruct the pixel block of the CU.
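Put together, the reconstruction path looks roughly like the following sketch; the step size and block contents are illustrative assumptions.

```python
# Encoder-side reconstruction sketch: dequantize the levels, inverse-
# transform them back to a residual, add the prediction, and clip to the
# 8-bit sample range.
import numpy as np

def dct_matrix(n):
    k, i = np.arange(n)[:, None], np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)
    return m

t = dct_matrix(8)
prediction = np.full((8, 8), 128.0)
residual = np.random.randn(8, 8) * 10
step = 8.0                                        # illustrative step size
levels = np.round((t @ residual @ t.T) / step)    # forward transform+quantize
recon_residual = t.T @ (levels * step) @ t        # dequantize + inverse DCT
reconstructed = np.clip(prediction + recon_residual, 0, 255)
# reconstructed approximates prediction + residual up to quantization error
```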
·Filter module
Filter unit 113 may perform a deblocking filtering operation to reduce blocking artifacts in the pixel blocks associated with the CU. In addition, filter unit 113 may apply the SAO offsets determined by prediction processing unit 100 to the reconstructed sample blocks to restore the pixel blocks. Filter unit 113 may generate the coded information for the SAO syntax elements of a CTB.
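As an illustration of such offsets, the sketch below applies a band-offset correction in the style of HEVC SAO (32 intensity bands, offsets for four consecutive bands); the band layout, starting band, and offset values are assumptions for the example, not details taken from this document.

```python
# SAO band-offset sketch: classify 8-bit samples into 32 bands of width 8
# and add a signalled offset to samples in four consecutive bands.
import numpy as np

def sao_band_offset(samples, band_start, offsets):
    out = samples.astype(np.int32)
    bands = out // 8                              # 256 values / 32 bands
    for k, off in enumerate(offsets):             # four consecutive bands
        out = np.where(bands == band_start + k, out + off, out)
    return np.clip(out, 0, 255).astype(np.uint8)

recon = np.array([[60, 66, 90], [61, 70, 200]], dtype=np.uint8)
restored = sao_band_offset(recon, band_start=7, offsets=[2, -1, 0, 1])
# samples in bands 7..10 (values 56..87) are nudged toward the original
```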
·Reference picture module
Decoded picture buffer 114 may store the reconstructed pixel blocks. Inter-prediction unit 121 may use a reference picture containing the reconstructed pixel blocks to perform inter prediction on PUs of other pictures. In addition, intra-prediction processing unit 126 may use the reconstructed pixel blocks in decoded picture buffer 114 to perform intra prediction on other PUs in the same picture as the CU.
·Entropy coding module
Entropy encoding unit 116 may receive data from other functional components of video encoder 20. For example, entropy encoding unit 116 may receive coefficient blocks from quantization unit 106 and syntax elements from prediction processing unit 100. Entropy encoding unit 116 may perform one or more entropy encoding operations on the data to generate entropy-encoded data. For example, entropy encoding unit 116 may perform a context-adaptive variable-length coding (CAVLC) operation, a CABAC operation, a variable-to-variable (V2V) length coding operation, a syntax-based context-adaptive binary arithmetic coding (SBAC) operation, a probability interval partitioning entropy (PIPE) coding operation, or another type of entropy encoding operation on the data. In a particular example, entropy encoding unit 116 may use regular CABAC engine 118 to encode the regular CABAC-coded bins of the syntax elements, and may use bypass coding engine 120 to encode the bypass-coded bins.
·Decoding module
FIG. 23 is a block diagram illustrating an example video decoder 30 configured to implement the techniques of the present invention. It should be understood that FIG. 23 is exemplary and should not be regarded as limiting the techniques as broadly exemplified and described in the present invention. As shown in FIG. 23, video decoder 30 includes an entropy decoding unit 150, a prediction processing unit 152, an inverse quantization unit 154, an inverse transform processing unit 156, a reconstruction unit 158, a filter unit 159, and a decoded picture buffer 160. Prediction processing unit 152 includes a motion compensation unit 162 and an intra-prediction processing unit 164. Entropy decoding unit 150 includes a regular CABAC coding engine 166 and a bypass coding engine 168. In other examples, video decoder 30 may include more, fewer, or different functional components.
Video decoder 30 may receive a bitstream. Entropy decoding unit 150 may parse the bitstream to extract syntax elements from it. As part of parsing the bitstream, entropy decoding unit 150 may parse the entropy-encoded syntax elements in the bitstream. Prediction processing unit 152, inverse quantization unit 154, inverse transform processing unit 156, reconstruction unit 158, and filter unit 159 may decode the video data, that is, generate decoded video data, according to the syntax elements extracted from the bitstream.
·Entropy decoding module
The syntax elements may include regular CABAC-coded bins and bypass-coded bins. Entropy decoding unit 150 may use regular CABAC coding engine 166 to decode the regular CABAC-coded bins, and may use bypass coding engine 168 to decode the bypass-coded bins.
·Prediction module
If a PU is encoded using intra prediction, intra-prediction processing unit 164 may perform intra prediction to generate a predictive sample block for the PU. Intra-prediction processing unit 164 may use an intra-prediction mode to generate the predictive pixel block of the PU based on the pixel blocks of spatially neighboring PUs. Intra-prediction processing unit 164 may determine the intra-prediction mode of the PU according to one or more syntax elements parsed from the bitstream.
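One simple mode of this kind, a DC prediction built from neighboring reconstructed samples, can be sketched as follows; it is one of several directional modes a decoder might select from parsed syntax, and the sample values are illustrative.

```python
# DC intra-prediction sketch: fill the predictive block with the mean of
# the reconstructed samples bordering the PU above and to the left.
import numpy as np

def intra_dc_predict(above, left, size):
    """above/left: 1-D arrays of neighbouring reconstructed samples."""
    dc = int(round((np.sum(above) + np.sum(left)) / (len(above) + len(left))))
    return np.full((size, size), dc, dtype=np.int32)

pred = intra_dc_predict(np.full(4, 120), np.full(4, 130), size=4)
# every sample of the 4x4 predictive block equals 125
```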
Motion compensation unit 162 may construct a first reference picture list (list 0) and a second reference picture list (list 1) according to the syntax elements parsed from the bitstream. In addition, if a PU is encoded using inter prediction, entropy decoding unit 150 may parse the motion information of the PU. Motion compensation unit 162 may determine one or more reference blocks of the PU according to the motion information of the PU, and may generate the predictive pixel block of the PU from the one or more reference blocks.
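The core of such motion compensation, at integer-sample precision, reduces to a displaced copy from the reference picture, as in the sketch below; real codecs additionally perform sub-pel interpolation and reference list selection, and the boundary handling here is a simplifying assumption.

```python
# Integer-pel motion compensation sketch: copy the predictive block from
# the reference picture at the position displaced by the motion vector.
import numpy as np

def motion_compensate(ref_picture, y, x, h, w, mv):
    """mv = (dy, dx) in whole samples; assumes the displaced block stays
    inside the reference picture (no edge padding in this toy version)."""
    dy, dx = mv
    return ref_picture[y + dy : y + dy + h, x + dx : x + dx + w].copy()

ref = np.arange(64).reshape(8, 8)
pred = motion_compensate(ref, y=2, x=2, h=2, w=2, mv=(1, -1))
# pred is the 2x2 block located at (3, 1) in the reference picture
```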
·Decoding reconstruction module (inverse quantization and inverse transform)
In addition, video decoder 30 may perform a reconstruction operation on a CU that is not further partitioned. To do so, video decoder 30 may perform a reconstruction operation on each TU of the CU. By performing the reconstruction operation on each TU of the CU, video decoder 30 may reconstruct the residual pixel block associated with the CU.
As part of performing the reconstruction operation on a TU of the CU, inverse quantization unit 154 may inverse-quantize (i.e., dequantize) the coefficient block associated with the TU. Inverse quantization unit 154 may use the QP value associated with the CU of the TU to determine the degree of quantization and, likewise, the degree of inverse quantization that inverse quantization unit 154 is to apply.
After inverse quantization unit 154 inverse-quantizes a coefficient block, inverse transform processing unit 156 may apply one or more inverse transforms to the coefficient block in order to produce the residual sample block associated with the TU. For example, inverse transform processing unit 156 may apply an inverse DCT, an inverse integer transform, an inverse Karhunen-Loeve transform (KLT), an inverse rotational transform, an inverse directional transform, or another inverse transform corresponding to the transform used at the encoder side to the coefficient block.
Reconstruction unit 158 may use, where applicable, the residual pixel blocks associated with the TUs of the CU and the predictive pixel blocks of the PUs of the CU (i.e., intra-prediction data or inter-prediction data) to reconstruct the pixel block of the CU. In particular, reconstruction unit 158 may add the samples of the residual pixel block to the corresponding samples of the predictive pixel block to reconstruct the pixel block of the CU.
·Filter module
Filter unit 159 may perform a deblocking filtering operation to reduce blocking artifacts in the pixel blocks associated with the CUs of a CTB. In addition, filter unit 159 may modify the pixel values of the CTB according to the SAO syntax elements parsed from the bitstream. For example, filter unit 159 may determine correction values according to the SAO syntax elements of the CTB and add the determined correction values to the sample values in the reconstructed pixel blocks of the CTB. By modifying some or all of the pixel values of the CTBs of a picture, filter unit 159 may correct the reconstructed picture of the video data according to the SAO syntax elements.
·Reference picture module
Video decoder 30 may store the pixel blocks of the CUs in decoded picture buffer 160. Decoded picture buffer 160 may provide reference pictures for subsequent motion compensation, intra prediction, and presentation by a display device (for example, display device 32 of FIG. 21). For example, video decoder 30 may perform intra-prediction or inter-prediction operations on the PUs of other CUs according to the pixel blocks in decoded picture buffer 160.
The video encoder of the embodiments of the present invention may be used to perform the video encoding methods of the foregoing embodiments, and the functional modules of the video encoding device shown in FIG. 18a and FIG. 18b may be integrated into video encoder 20 of the embodiments of the present invention. For example, the video encoder may be used to perform the video encoding method of the embodiment shown in FIG. 2, FIG. 5, or FIG. 12 described above.
In this way, video encoder 20 acquires multiple video frames, where the video frames include redundant data with respect to one another in their picture content. Video encoder 20 then reconstructs the multiple video frames to obtain scene information and a reconstruction residual for each video frame, where the scene information includes data obtained by reducing the redundancy of the redundant data, and the reconstruction residual represents the difference between a video frame and the scene information. Next, video encoder 20 performs predictive encoding on the scene information to obtain scene-feature predictive encoded data, and performs predictive encoding on the reconstruction residuals to obtain residual predictive encoded data. By reconstructing the multiple video frames in this way, the redundancy among the frames is reduced, so that in the encoding operation the total amount of compressed data for the scene features and the reconstruction residuals is smaller than the amount of compressed data for the original video frames. Moreover, because each video frame is reconstructed into a scene feature and a reconstruction residual, and the reconstruction residual contains only the residual information beyond the scene information, the residual carries little information and is sparse; during predictive encoding it can therefore be encoded with fewer codewords, yielding a small amount of encoded data and a high compression ratio. Thus, the method of the embodiments of the present invention can effectively improve the compression efficiency of video frames.
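For illustration only, the following sketch shows one simple way such a scene/residual decomposition could be realized, assuming the scene information is approximated by the per-pixel median over the frames; the embodiments also describe richer reconstructions (for example, scene feature bases with representation coefficients), so this is not the claimed implementation.

```python
# Scene/residual decomposition sketch: a shared scene approximation plus
# per-frame residuals that are sparse wherever content is redundant.
import numpy as np

rng = np.random.default_rng(0)
background = rng.integers(0, 256, size=(4, 4))          # shared picture content
frames = np.stack([background] * 3).astype(np.int32)
frames[1, 0, 0] += 40                                   # small per-frame changes
frames[2, 2, 3] -= 25

scene = np.median(frames, axis=0).astype(np.int32)      # scene information
residuals = frames - scene                              # reconstruction residuals
assert np.array_equal(scene + residuals, frames)        # decoder-side rebuild
# the residuals are mostly zero, so they predict-encode with few codewords
```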
In some embodiments of the present invention, a video decoder is further provided. The video decoder may be used to perform the video decoding methods of the foregoing embodiments, and the functional modules of the video decoding device shown in FIG. 19 may be integrated into video decoder 30 of the embodiments of the present invention. For example, video decoder 30 may be used to perform the video decoding method of the embodiment shown in FIG. 2, FIG. 6, or FIG. 17 described above.
In this way, after video decoder 30 obtains the scene-feature predictive encoded data and the residual predictive encoded data, video decoder 30 decodes the scene-feature predictive encoded data to obtain the scene information, where the scene information includes data obtained by reducing the redundancy of the redundant data, the redundant data being the redundant picture content shared among the multiple video frames. Video decoder 30 then decodes the residual predictive encoded data to obtain the reconstruction residuals, which represent the differences between the video frames and the scene information. Video decoder 30 then performs reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames. In this way, the decoding of the scene-feature predictive encoded data and the residual predictive encoded data produced by the video encoding device of the foregoing embodiments can be completed by the video decoding device of the embodiments of the present invention.
In one or more examples, the described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media (which correspond to tangible media such as data storage media) or communication media, the latter including any medium that facilitates transfer of a computer program from one place to another, for example, according to a communication protocol. In this manner, computer-readable media generally may correspond to (1) non-transitory tangible computer-readable storage media, or (2) communication media such as signals or carrier waves. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementation of the techniques described in the present invention. A computer program product may include a computer-readable medium.
By way of example and not limitation, such computer-readable storage media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor", as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
The techniques of the present invention may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (for example, a chipset). Various components, modules, or units are described in the present invention to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
It should be understood that references throughout this specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present invention. Therefore, the appearances of "in one embodiment" or "in an embodiment" throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures, or characteristics may be combined in one or more embodiments in any suitable manner.
In the various embodiments of the present invention, it should be understood that the sequence numbers of the foregoing processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In addition, the terms "system" and "network" are often used interchangeably herein. It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A exists alone, both A and B exist, and B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects before and after it.
In the embodiments provided in this application, it should be understood that "B corresponding to A" indicates that B is associated with A and that B may be determined from A. It should also be understood, however, that determining B from A does not mean that B is determined from A alone; B may also be determined from A and/or other information.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the compositions and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
It may be clearly understood by a person skilled in the art that, for convenience and brevity of description, for the specific working processes of the system, device, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described herein again.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of the units is merely a logical function division; in actual implementation there may be other manners of division, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The integrated unit described above may be implemented in the form of hardware, or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing embodiments may be implemented completely or partially by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented completely or partially in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are generated completely or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or a wireless manner (for example, by infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).
The foregoing embodiments are merely intended to describe the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features, without making the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (31)
- 1. A video encoding method, characterized in that the method comprises: acquiring multiple video frames, wherein the video frames include redundant data with respect to one another in their picture content; reconstructing the multiple video frames to obtain scene information and a reconstruction residual for each video frame, wherein the scene information includes data obtained by reducing the redundancy of the redundant data, and the reconstruction residual is used to represent the difference between the video frame and the scene information; performing predictive encoding on the scene information to obtain scene-feature predictive encoded data; and performing predictive encoding on the reconstruction residuals to obtain residual predictive encoded data.
- 2. The method according to claim 1, characterized in that reconstructing the multiple video frames to obtain the scene information and the reconstruction residual of each video frame comprises: reconstructing the multiple video frames to obtain a scene feature and the reconstruction residual of each video frame, wherein the scene feature is used to represent the picture content that the video frames have in common, and the reconstruction residual is used to represent the difference between the video frame and the scene feature; and performing predictive encoding on the scene information to obtain the scene-feature predictive encoded data comprises: performing predictive encoding on the scene feature to obtain the scene-feature predictive encoded data.
- 3. The method according to claim 2, characterized in that before reconstructing the multiple video frames to obtain the scene feature and the reconstruction residual of each video frame, the method further comprises: extracting picture feature information of each of the multiple video frames; calculating content metric information according to the picture feature information, wherein the content metric information is used to measure the difference in picture content among the multiple video frames; and when the content metric information is not greater than a preset metric threshold, performing the step of reconstructing the multiple video frames to obtain the scene feature and the reconstruction residual of each video frame.
- 4. The method according to claim 2, characterized in that acquiring the multiple video frames comprises: acquiring a video stream, wherein the video frames of the video stream include I-frames, B-frames, and P-frames; and extracting the I-frames from the video stream, wherein the I-frames are used to perform the step of reconstructing the multiple video frames to obtain the scene feature and the reconstruction residual of each video frame; and the method further comprises: performing reconstruction according to the scene feature and the reconstruction residuals to obtain reference frames; performing, with the reference frames as references, inter-frame predictive encoding on the B-frames and the P-frames to obtain B-frame predictive encoded data and P-frame predictive encoded data; and performing transform coding, quantization coding, and entropy coding on predictive encoded data to obtain video compressed data, wherein the predictive encoded data includes the scene-feature predictive encoded data, the residual predictive encoded data, the B-frame predictive encoded data, and the P-frame predictive encoded data.
- 5. The method according to claim 1, characterized in that the multiple video frames include redundant data with respect to one another at local positions; reconstructing the multiple video frames to obtain the scene information and the reconstruction residual of each video frame comprises: splitting each of the multiple video frames to obtain multiple frame sub-blocks; and reconstructing the multiple frame sub-blocks to obtain a scene feature, a representation coefficient of each of the frame sub-blocks, and a reconstruction residual of each of the frame sub-blocks, wherein the scene feature includes multiple independent scene feature bases that cannot be reconstructed from one another within the scene feature, a scene feature basis is used to describe the picture content features of a frame sub-block, the representation coefficient represents the correspondence between the scene feature basis and the frame sub-block, and the reconstruction residual represents the difference between the frame sub-block and the scene feature basis; and performing predictive encoding on the scene information to obtain the scene-feature predictive encoded data comprises: performing predictive encoding on the scene feature to obtain the scene-feature predictive encoded data.
- 6. The method according to claim 5, characterized in that reconstructing the multiple frame sub-blocks to obtain the scene feature, the representation coefficient of each frame sub-block, and the reconstruction residual of each frame sub-block comprises: reconstructing the multiple frame sub-blocks to obtain the representation coefficient of each frame sub-block and the reconstruction residual of each frame sub-block, wherein the representation coefficient represents the correspondence between the frame sub-block and a target frame sub-block, the target frame sub-block is an independent frame sub-block among the multiple frame sub-blocks, an independent frame sub-block is a frame sub-block that cannot be reconstructed from the other frame sub-blocks among the multiple frame sub-blocks, and the reconstruction residual is used to represent the difference between the target frame sub-block and the frame sub-block; and combining the target frame sub-blocks indicated by the representation coefficients to obtain the scene feature, wherein the target frame sub-blocks are the scene feature bases.
- 7. The method according to claim 5, characterized in that before splitting each of the multiple video frames to obtain the multiple frame sub-blocks, the method further comprises: extracting picture feature information of each of the multiple video frames; calculating content metric information according to the picture feature information, wherein the content metric information is used to measure the difference in picture content among the multiple video frames; and when the content metric information is greater than a preset metric threshold, performing the step of splitting each of the multiple video frames to obtain the multiple frame sub-blocks.
- 8. The method according to claim 5, characterized in that acquiring the multiple video frames comprises: acquiring a video stream, wherein the video frames of the video stream include I-frames, B-frames, and P-frames; and extracting the I-frames from the video stream, wherein the I-frames are used to perform the step of splitting each of the multiple video frames to obtain the multiple frame sub-blocks; and the method further comprises: performing reconstruction according to the scene feature, the representation coefficients, and the reconstruction residuals to obtain reference frames; performing, with the reference frames as references, inter-frame predictive encoding on the B-frames and the P-frames to obtain B-frame predictive encoded data and P-frame predictive encoded data; and performing transform coding, quantization coding, and entropy coding on predictive encoded data to obtain video compressed data, wherein the predictive encoded data includes the scene-feature predictive encoded data, the residual predictive encoded data, the B-frame predictive encoded data, and the P-frame predictive encoded data.
- 9. The method according to any one of claims 1 to 8, characterized in that after acquiring the multiple video frames, the method further comprises: classifying the multiple video frames based on the correlation of their picture content to obtain video frames of one or more classification clusters, wherein the video frames of a same classification cluster are used to perform the step of reconstructing the multiple video frames to obtain the scene information and the reconstruction residual of each video frame.
- 10. The method according to claim 1, characterized in that acquiring the multiple video frames comprises: acquiring a video stream that includes multiple video frames; extracting feature information of a first video frame and of a second video frame, respectively, wherein the feature information is used to describe the picture content of a video frame, and the first video frame and the second video frame are video frames in the video stream; calculating a shot distance between the first video frame and the second video frame according to the feature information; determining whether the shot distance is greater than a preset shot threshold; if the shot distance is greater than the preset shot threshold, segmenting a target shot out of the video stream, wherein the starting frame of the target shot is the first video frame and the ending frame of the target shot is the video frame preceding the second video frame; if the shot distance is less than the preset shot threshold, assigning the first video frame and the second video frame to the same shot, wherein the target shot is one of the shots of the video stream and a shot is a segment of temporally consecutive video frames; and extracting, for each shot in the video stream, key frames according to the frame distances between the video frames within the shot, wherein within each shot the frame distance between any two adjacent key frames is greater than a preset frame distance threshold, the frame distance is used to represent the degree of difference between two video frames, and the key frames of each shot are used to perform the step of reconstructing the multiple video frames to obtain the scene information and the reconstruction residual of each video frame.
- 11. A video decoding method, characterized in that the method comprises: acquiring scene-feature predictive encoded data and residual predictive encoded data; decoding the scene-feature predictive encoded data to obtain scene information, wherein the scene information includes data obtained by reducing the redundancy of redundant data, the redundant data being the redundant picture content shared among multiple video frames; decoding the residual predictive encoded data to obtain reconstruction residuals, wherein a reconstruction residual is used to represent the difference between a video frame and the scene information; and performing reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames.
- 12. The method according to claim 11, characterized in that decoding the scene-feature predictive encoded data to obtain the scene information comprises: decoding the scene-feature predictive encoded data to obtain a scene feature, wherein the scene feature is used to represent the picture content that the video frames have in common; and performing reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames comprises: performing reconstruction according to the scene feature and the reconstruction residuals to obtain the multiple video frames.
- 13. The method according to claim 12, characterized in that acquiring the scene-feature predictive encoded data and the residual predictive encoded data comprises: acquiring video compressed data; and performing entropy decoding, inverse quantization, and an inverse DCT on the video compressed data to obtain predictive encoded data, wherein the predictive encoded data includes the scene-feature predictive encoded data, the residual predictive encoded data, B-frame predictive encoded data, and P-frame predictive encoded data; performing reconstruction according to the scene feature and the reconstruction residuals to obtain the multiple video frames comprises: performing reconstruction according to the scene feature and the reconstruction residuals to obtain multiple I-frames; and the method further comprises: performing, with the I-frames as reference frames, inter-frame decoding on the B-frame predictive encoded data and the P-frame predictive encoded data to obtain B-frames and P-frames; and arranging the I-frames, the B-frames, and the P-frames in chronological order to obtain a video stream.
- 14. The method according to claim 11, characterized in that the method further comprises: acquiring representation coefficients; decoding the scene-feature predictive encoded data to obtain the scene information comprises: decoding the scene-feature predictive encoded data to obtain a scene feature, wherein the scene feature includes multiple independent scene feature bases that cannot be reconstructed from one another within the scene feature, a scene feature basis is used to describe the picture content features of a frame sub-block, a representation coefficient represents the correspondence between the scene feature basis and the frame sub-block, and the reconstruction residual represents the difference between the frame sub-block and the scene feature basis; and performing reconstruction according to the scene information and the reconstruction residuals to obtain the multiple video frames comprises: performing reconstruction according to the scene feature, the representation coefficients, and the reconstruction residuals to obtain multiple frame sub-blocks; and combining the multiple frame sub-blocks to obtain the multiple video frames.
- 15. The method according to claim 14, characterized in that acquiring the scene-feature predictive encoded data and the residual predictive encoded data comprises: acquiring video compressed data; and performing entropy decoding, inverse quantization, and an inverse DCT on the video compressed data to obtain predictive encoded data, wherein the predictive encoded data includes the scene-feature predictive encoded data, the residual predictive encoded data, B-frame predictive encoded data, and P-frame predictive encoded data; combining the multiple frame sub-blocks to obtain the multiple video frames comprises: combining the multiple frame sub-blocks to obtain multiple I-frames; and the method further comprises: performing, with the I-frames as reference frames, inter-frame decoding on the B-frame predictive encoded data and the P-frame predictive encoded data to obtain B-frames and P-frames; and arranging the I-frames, the B-frames, and the P-frames in chronological order to obtain a video stream.
- 16. A video encoding device, characterized in that the device comprises: an acquisition module, configured to acquire multiple video frames, wherein the video frames include redundant data with respect to one another in their picture content; a reconstruction module, configured to reconstruct the multiple video frames to obtain scene information and a reconstruction residual for each video frame, wherein the scene information includes data obtained by reducing the redundancy of the redundant data, and the reconstruction residual is used to represent the difference between the video frame and the scene information; and a predictive encoding module, configured to perform predictive encoding on the scene information to obtain scene-feature predictive encoded data, and further configured to perform predictive encoding on the reconstruction residuals to obtain residual predictive encoded data.
- 17. The device according to claim 16, characterized in that the reconstruction module is further configured to reconstruct the multiple video frames to obtain a scene feature and the reconstruction residual of each video frame, wherein the scene feature is used to represent the picture content that the video frames have in common, and the reconstruction residual is used to represent the difference between the video frame and the scene feature; and the predictive encoding module is further configured to perform predictive encoding on the scene feature to obtain the scene-feature predictive encoded data.
- 18. The device according to claim 17, characterized in that the device further comprises: a feature extraction module, configured to extract picture feature information of each of the multiple video frames; and a metric information calculation module, configured to calculate content metric information according to the picture feature information, wherein the content metric information is used to measure the difference in picture content among the multiple video frames; and when the content metric information is not greater than a preset metric threshold, the reconstruction module performs the step of reconstructing the multiple video frames to obtain the scene feature and the reconstruction residual of each video frame.
- 19. The device according to claim 17, characterized in that the acquisition module is further configured to acquire a video stream, wherein the video frames of the video stream include I-frames, B-frames, and P-frames, and to extract the I-frames from the video stream, the I-frames being used to perform the step of reconstructing the multiple video frames to obtain the scene feature and the reconstruction residual of each video frame; and the device further comprises: a reference frame reconstruction module, configured to perform reconstruction according to the scene feature and the reconstruction residuals to obtain reference frames; an inter-frame predictive encoding module, configured to perform, with the reference frames as references, inter-frame predictive encoding on the B-frames and the P-frames to obtain B-frame predictive encoded data and P-frame predictive encoded data; and an encoding module, configured to perform transform coding, quantization coding, and entropy coding on predictive encoded data to obtain video compressed data, wherein the predictive encoded data includes the scene-feature predictive encoded data, the residual predictive encoded data, the B-frame predictive encoded data, and the P-frame predictive encoded data.
- 20. The device according to claim 16, characterized in that the multiple video frames include redundant data with respect to one another at local positions; the reconstruction module comprises: a splitting unit, configured to split each of the multiple video frames to obtain multiple frame sub-blocks; and a reconstruction unit, configured to reconstruct the multiple frame sub-blocks to obtain a scene feature, a representation coefficient of each of the frame sub-blocks, and a reconstruction residual of each of the frame sub-blocks, wherein the scene feature includes multiple independent scene feature bases that cannot be reconstructed from one another within the scene feature, a scene feature basis is used to describe the picture content features of a frame sub-block, the representation coefficient represents the correspondence between the scene feature basis and the frame sub-block, and the reconstruction residual represents the difference between the frame sub-block and the scene feature basis; and the predictive encoding module is further configured to perform predictive encoding on the scene feature to obtain the scene-feature predictive encoded data.
- 21. The device according to claim 20, characterized in that the reconstruction unit comprises: a reconstruction subunit, configured to reconstruct the multiple frame sub-blocks to obtain the representation coefficient of each frame sub-block and the reconstruction residual of each frame sub-block, wherein the representation coefficient represents the correspondence between the frame sub-block and a target frame sub-block, the target frame sub-block is an independent frame sub-block among the multiple frame sub-blocks, an independent frame sub-block is a frame sub-block that cannot be reconstructed from the other frame sub-blocks among the multiple frame sub-blocks, and the reconstruction residual is used to represent the difference between the target frame sub-block and the frame sub-block; and a combination subunit, configured to combine the target frame sub-blocks indicated by the representation coefficients to obtain the scene feature, wherein the target frame sub-blocks are the scene feature bases.
- 22. The device according to claim 20, characterized in that the device further comprises: a feature extraction module, configured to extract picture feature information of each of the multiple video frames; and a metric information calculation module, configured to calculate content metric information according to the picture feature information, wherein the content metric information is used to measure the difference in picture content among the multiple video frames; and when the content metric information is greater than a preset metric threshold, the splitting unit performs the step of splitting each of the multiple video frames to obtain the multiple frame sub-blocks.
- The device according to claim 20, wherein the acquiring module is further configured to acquire a video stream whose video frames include I frames, B frames, and P frames, and to extract the I frames from the video stream, the I frames being used to perform the step of splitting each of the plurality of video frames to obtain a plurality of frame sub-blocks. The device further includes: a reference frame reconstruction module, configured to perform reconstruction according to the scene feature, the representation coefficients, and the reconstruction residuals to obtain reference frames; an inter-prediction encoding module, configured to perform inter-prediction encoding on the B frames and the P frames with the reference frames as reference, to obtain B-frame predictive coded data and P-frame predictive coded data; and an encoding module, configured to perform transform coding, quantization coding, and entropy coding on the predictive coded data to obtain video compressed data, the predictive coded data including the scene feature predictive coded data, the residual predictive coded data, the B-frame predictive coded data, and the P-frame predictive coded data.
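The transform, quantization, and entropy stages of the encoding module can be pictured on a single data plane as below: a sketch using a 2-D DCT, uniform quantization with a hypothetical step, and zlib standing in for the entropy coder, since the claim does not name a specific entropy code.

```python
import zlib
import numpy as np
from scipy.fft import dctn

def compress_plane(plane, qstep=16):
    """Transform coding -> quantization coding -> entropy coding on one
    plane (e.g. a reconstruction residual). `qstep` is a hypothetical
    quantization step; zlib is only a stand-in entropy coder."""
    coeffs = dctn(plane.astype(np.float64), norm="ortho")   # transform coding
    quantized = np.round(coeffs / qstep).astype(np.int32)   # quantization coding
    return zlib.compress(quantized.tobytes())               # entropy coding
```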
- The device according to any one of claims 16 to 23, wherein the device further includes: a classification module, configured to classify the plurality of video frames based on the correlation of their picture content to obtain video frames of one or more classification clusters, the video frames of the same classification cluster being used to perform the step of reconstructing the plurality of video frames to obtain the scene information and the reconstruction residual of each video frame.
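A minimal sketch of such a classification module, assuming normalized pixel correlation as the content feature and a made-up threshold; the claim leaves both the feature and the clustering rule open.

```python
import numpy as np

def cluster_frames(frames, threshold=0.9):
    """Illustrative classification by picture-content correlation: a
    frame joins the first cluster whose representative it correlates
    with above `threshold` (normalized correlation on flattened
    pixels); otherwise it starts a new cluster."""
    clusters = []                                # list of (representative, members)
    for f in frames:
        v = f.ravel().astype(np.float64)
        v = (v - v.mean()) / (v.std() + 1e-12)   # zero-mean, unit-variance
        for rep, members in clusters:
            if float(rep @ v) / v.size > threshold:
                members.append(f)
                break
        else:
            clusters.append((v, [f]))
    return [members for _, members in clusters]
```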
- The device according to claim 16, wherein the acquiring module includes: a video stream acquiring unit, configured to acquire a video stream including a plurality of video frames; a frame feature extraction unit, configured to extract feature information of a first video frame and of a second video frame, the feature information describing the picture content of a video frame, the first video frame and the second video frame being video frames in the video stream; a shot distance calculation unit, configured to calculate the shot distance between the first video frame and the second video frame according to the feature information; a shot distance judging unit, configured to judge whether the shot distance is greater than a preset shot threshold; a shot segmentation unit, configured to segment a target shot from the video stream if the shot distance is greater than the preset shot threshold, the starting frame of the target shot being the first video frame and the ending frame being the video frame preceding the second video frame, and to assign the first video frame and the second video frame to the same shot if the shot distance is less than the preset shot threshold, the target shot being one of the shots of the video stream, a shot being a temporally continuous sequence of video frames; and a key frame extraction unit, configured to extract, for each shot in the video stream, key frames according to the frame distances between the video frames within the shot, the frame distance between any two adjacent key frames within each shot being greater than a preset frame distance threshold, the frame distance indicating the degree of difference between two video frames, the key frames of each shot being used to perform the step of reconstructing the plurality of video frames to obtain the scene information and the reconstruction residual of each video frame.
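Read procedurally, this acquiring module performs shot boundary detection followed by per-shot key frame selection. A compact sketch, with one feature vector per frame and both thresholds assumed rather than taken from the claim:

```python
import numpy as np

def segment_shots(features, shot_threshold):
    """Start a new shot wherever the distance between consecutive
    frames' features exceeds the preset shot threshold.
    Returns (start, end) index pairs, end exclusive."""
    shots, start = [], 0
    for i in range(1, len(features)):
        if np.linalg.norm(features[i] - features[i - 1]) > shot_threshold:
            shots.append((start, i))   # the previous frame ends the shot
            start = i                  # this frame starts the next one
    shots.append((start, len(features)))
    return shots

def extract_keyframes(features, shot, frame_threshold):
    """Within one shot, keep a frame as a key frame only when its
    distance from the last kept key frame exceeds the preset frame
    distance threshold."""
    start, end = shot
    keys = [start]
    for i in range(start + 1, end):
        if np.linalg.norm(features[i] - features[keys[-1]]) > frame_threshold:
            keys.append(i)
    return keys
```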
- A video decoding device, wherein the device includes: an acquiring module, configured to acquire scene feature predictive coded data and residual predictive coded data; a scene information decoding module, configured to decode the scene feature predictive coded data to obtain scene information, the scene information including data obtained by reducing the redundancy of the redundant data, the redundant data being the picture content redundant among each of a plurality of video frames; a reconstruction residual decoding module, configured to decode the residual predictive coded data to obtain reconstruction residuals, a reconstruction residual representing the difference between a video frame and the scene information; and a video frame reconstruction module, configured to perform reconstruction according to the scene information and the reconstruction residuals to obtain the plurality of video frames.
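On the decoder side the relationship is simply frame = scene information + that frame's reconstruction residual; a one-line sketch, assuming the scene information and the residuals are already decoded into arrays of matching shape:

```python
import numpy as np

def reconstruct_frames(scene_info, residuals):
    """Each video frame is recovered as the shared scene information
    plus that frame's decoded reconstruction residual."""
    return [scene_info + r for r in residuals]
```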
- The device according to claim 26, wherein the scene information decoding module is further configured to decode the scene feature predictive coded data to obtain a scene feature, the scene feature representing the picture content common to each of the video frames; and the video frame reconstruction module is further configured to perform reconstruction according to the scene feature and the reconstruction residuals to obtain the plurality of video frames.
- The device according to claim 27, wherein the acquiring module includes an acquiring unit and a decoding unit: the acquiring unit is configured to acquire video compressed data; the decoding unit is configured to perform entropy decoding, inverse quantization, and inverse DCT transform on the video compressed data to obtain predictive coded data, the predictive coded data including the scene feature predictive coded data, the residual predictive coded data, B-frame predictive coded data, and P-frame predictive coded data; and the video frame reconstruction module is further configured to perform reconstruction according to the scene feature and the reconstruction residuals to obtain a plurality of I frames. The device further includes: an inter-frame decoding module, configured to perform inter-frame decoding on the B-frame predictive coded data and the P-frame predictive coded data with the I frames as reference frames, to obtain B frames and P frames; and an arranging module, configured to arrange the I frames, the B frames, and the P frames in chronological order to obtain a video stream.
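Mirroring the encoder-side sketch above, the decoding unit's entropy decoding, inverse quantization, and inverse DCT order can be pictured as follows; zlib and `qstep` remain stand-in assumptions carried over from that sketch.

```python
import zlib
import numpy as np
from scipy.fft import idctn

def decompress_plane(data, shape, qstep=16):
    """Entropy decoding -> inverse quantization -> inverse DCT on one
    plane, the inverse of compress_plane in the earlier sketch."""
    quantized = np.frombuffer(zlib.decompress(data), dtype=np.int32).reshape(shape)
    return idctn(quantized.astype(np.float64) * qstep, norm="ortho")
```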
- The device according to claim 26, wherein the acquiring module is further configured to acquire representation coefficients; the scene information decoding module is further configured to decode the scene feature predictive coded data to obtain a scene feature, the scene feature including a plurality of independent scene feature bases, the independent scene feature bases cannot be reconstructed from one another within the scene feature, a scene feature base describes the picture content features of a frame sub-block, a representation coefficient represents the correspondence between the scene feature base and the frame sub-block, and the reconstruction residual represents the difference between the frame sub-block and the scene feature base; and the video frame reconstruction module includes: a reconstruction unit, configured to perform reconstruction according to the scene feature, the representation coefficients, and the reconstruction residuals to obtain a plurality of frame sub-blocks; and a combining unit, configured to combine the plurality of frame sub-blocks to obtain a plurality of video frames.
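On this reading, each sub-block is rebuilt as scene feature bases times coefficients plus residual, and the sub-blocks are tiled back into frames. The fixed block grid and the coefficient padding for a dictionary that grew during encoding are assumptions carried over from the encoder-side sketches.

```python
import numpy as np

def rebuild_frame(scene_feature, coeffs, residuals, shape, block=16):
    """Reconstruction unit + combining unit, sketched: rebuild each
    sub-block from the scene feature bases, its representation
    coefficients and its reconstruction residual, then tile the blocks
    back into an (H, W) frame."""
    h, w = shape
    frame = np.zeros(shape)
    idx = 0
    for i in range(0, h, block):
        for j in range(0, w, block):
            c = np.zeros(scene_feature.shape[1])
            c[:len(coeffs[idx])] = coeffs[idx]   # pad coefficients from a grown dictionary
            sub = scene_feature @ c + residuals[idx]
            frame[i:i + block, j:j + block] = sub.reshape(block, block)
            idx += 1
    return frame
```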
- The device according to claim 29, wherein the acquiring module includes an acquiring unit and a decoding unit: the acquiring unit is configured to acquire video compressed data; the decoding unit is configured to perform entropy decoding, inverse quantization, and inverse DCT transform on the video compressed data to obtain predictive coded data, the predictive coded data including the scene feature predictive coded data, the residual predictive coded data, B-frame predictive coded data, and P-frame predictive coded data; and the combining unit is further configured to combine the plurality of frame sub-blocks to obtain a plurality of I frames. The device further includes: an inter-frame decoding module, configured to perform inter-frame decoding on the B-frame predictive coded data and the P-frame predictive coded data with the I frames as reference frames, to obtain B frames and P frames; and an arranging module, configured to arrange the I frames, the B frames, and the P frames in chronological order to obtain a video stream.
- A video codec device, wherein the video codec device includes a video encoding device and a video decoding device, the video encoding device being the video encoding device according to any one of claims 16 to 25, and the video decoding device being the video decoding device according to any one of claims 26 to 30.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710169486.5A CN108632625B (en) | 2017-03-21 | 2017-03-21 | Video encoding method, video decoding method, and related device |
CN201710169486.5 | 2017-03-21 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018171596A1 true WO2018171596A1 (en) | 2018-09-27 |
Family
ID=63584112
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/079699 WO2018171596A1 (en) | 2017-03-21 | 2018-03-21 | Video encoding method, video decoding method, and related device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108632625B (en) |
WO (1) | WO2018171596A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11711449B2 | 2021-12-07 | 2023-07-25 | Capital One Services, LLC | Compressing websites for fast data transfers |
Families Citing this family (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111383245B (en) * | 2018-12-29 | 2023-09-22 | 北京地平线机器人技术研发有限公司 | Video detection method, video detection device and electronic equipment |
CN109714602B (en) * | 2018-12-29 | 2022-11-01 | 武汉大学 | Unmanned aerial vehicle video compression method based on background template and sparse coding |
WO2020188004A1 (en) * | 2019-03-18 | 2020-09-24 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Methods and apparatuses for compressing parameters of neural networks |
CN110263650B (en) * | 2019-05-22 | 2022-02-22 | 北京奇艺世纪科技有限公司 | Behavior class detection method and device, electronic equipment and computer readable medium |
CN110427517B (en) * | 2019-07-18 | 2023-04-25 | 华戎信息产业有限公司 | Picture searching video method and device based on scene dictionary tree and computer readable storage medium |
CN110554405B (en) * | 2019-08-27 | 2021-07-30 | 华中科技大学 | A normal scan registration method and system based on combinatorial clustering |
CN110572675B (en) * | 2019-09-27 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Video decoding and encoding methods and devices, storage medium, decoder and encoder |
WO2021068175A1 (en) * | 2019-10-10 | 2021-04-15 | Suzhou Aqueti Technology Co., Ltd. | Method and apparatus for video clip compression |
CN111083498B (en) * | 2019-12-18 | 2021-12-21 | 杭州师范大学 | Model training method and using method for video coding inter-frame loop filtering |
CN111083499A (en) * | 2019-12-31 | 2020-04-28 | 合肥图鸭信息科技有限公司 | Video frame reconstruction method and device and terminal equipment |
CN111212288B (en) * | 2020-01-09 | 2022-10-04 | 广州虎牙科技有限公司 | Video data encoding and decoding method and device, computer equipment and storage medium |
CN111181568A (en) * | 2020-01-10 | 2020-05-19 | 深圳花果公社商业服务有限公司 | Data compression device and method, data decompression device and method |
CN111223438B (en) * | 2020-03-11 | 2022-11-04 | Tcl华星光电技术有限公司 | Compression method and device of pixel compensation table |
CN111654724B (en) * | 2020-06-08 | 2021-04-06 | 上海纽菲斯信息科技有限公司 | Low-bit-rate coding transmission method of video conference system |
CN112004085B (en) * | 2020-08-14 | 2023-07-07 | 北京航空航天大学 | Video coding method under guidance of scene semantic segmentation result |
CN111953973B (en) * | 2020-08-31 | 2022-10-28 | 中国科学技术大学 | A Universal Video Compression Coding Method Supporting Machine Intelligence |
CN112084949B (en) * | 2020-09-10 | 2022-07-19 | 上海交通大学 | Video real-time recognition, segmentation and detection method and device |
US11494700B2 (en) * | 2020-09-16 | 2022-11-08 | International Business Machines Corporation | Semantic learning in a federated learning system |
CN114257818B (en) * | 2020-09-22 | 2024-09-24 | 阿里巴巴达摩院(杭州)科技有限公司 | Video encoding and decoding methods, devices, equipment and storage medium |
CN112184843B (en) * | 2020-11-09 | 2021-06-29 | 新相微电子(上海)有限公司 | Redundant data removing system and method for image data compression |
CN113852850B (en) * | 2020-11-24 | 2024-01-09 | 广东朝歌智慧互联科技有限公司 | Audio/video stream playing device |
CN112770116B (en) * | 2020-12-31 | 2021-12-07 | 西安邮电大学 | Method for extracting video key frame by using video compression coding information |
CN112802485B (en) * | 2021-04-12 | 2021-07-02 | 腾讯科技(深圳)有限公司 | Voice data processing method and device, computer equipment and storage medium |
CN113784108B (en) * | 2021-08-25 | 2022-04-15 | 盐城香农智能科技有限公司 | A VR tourism and sightseeing method and system based on 5G transmission technology |
CN114222133B (en) * | 2021-12-10 | 2024-08-20 | 上海大学 | Content self-adaptive VVC intra-frame coding rapid dividing method based on classification |
CN114374845B (en) * | 2021-12-21 | 2022-08-02 | 北京中科智易科技有限公司 | Storage system and device for automatic compression encryption |
CN114390314B (en) * | 2021-12-30 | 2024-06-18 | 咪咕文化科技有限公司 | Variable frame rate audio and video processing method, device and storage medium |
CN114449241B (en) * | 2022-02-18 | 2024-04-02 | 复旦大学 | A color space conversion algorithm suitable for image compression |
CN114422803B (en) * | 2022-03-30 | 2022-08-05 | 浙江智慧视频安防创新中心有限公司 | Video processing method, device and equipment |
CN116527912A (en) * | 2023-03-28 | 2023-08-01 | 阿里巴巴(中国)有限公司 | Encoded video data processing method and video encoding processor |
CN116437102B (en) * | 2023-06-14 | 2023-10-20 | 中国科学技术大学 | Can learn general video coding methods, systems, equipment and storage media |
CN117651148B (en) * | 2023-11-01 | 2024-07-19 | 广东联通通信建设有限公司 | A method for controlling terminal of Internet of Things |
CN118368423B (en) * | 2024-06-19 | 2024-10-15 | 摩尔线程智能科技(北京)有限责任公司 | Video encoding method, video encoder, electronic device and storage medium |
CN118972590B (en) * | 2024-10-15 | 2024-12-17 | 中科方寸知微(南京)科技有限公司 | Scene self-adaptive video compression method and system based on natural language guidance |
CN120281905B (en) * | 2025-06-06 | 2025-09-16 | 深圳金三立视频科技股份有限公司 | Video encoding method, video encoding device, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10542265B2 (en) * | 2014-09-09 | 2020-01-21 | Dolby Laboratories Licensing Corporation | Self-adaptive prediction method for multi-layer codec |
- 2017-03-21: CN application CN201710169486.5A granted as patent CN108632625B (status: Active)
- 2018-03-21: WO application PCT/CN2018/079699 filed as WO2018171596A1 (status: Application Filing)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101742319A (en) * | 2010-01-15 | 2010-06-16 | 北京大学 | Method and system for static camera video compression based on background modeling |
Non-Patent Citations (1)
Title |
---|
ZHANG, XIAOYUN ET AL.: "Research on HEVC Coding Based on Alternating Background Model", COMPUTER APPLICATIONS AND SOFTWARE, vol. 34, no. 3, 15 March 2017 (2017-03-15), pages 131 - 135, ISSN: 1000-386X *
Also Published As
Publication number | Publication date |
---|---|
CN108632625B (en) | 2020-02-21 |
CN108632625A (en) | 2018-10-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018171596A1 (en) | Video encoding method, video decoding method, and related device | |
US11638007B2 (en) | Codebook generation for cloud-based video applications | |
Wang et al. | Towards analysis-friendly face representation with scalable feature and texture compression | |
WO2017071480A1 (en) | Reference frame decoding method | |
Makar et al. | Interframe coding of feature descriptors for mobile augmented reality | |
CN103202017A | Video decoding using example-based data pruning | |
US9787985B2 (en) | Reduction of spatial predictors in video compression | |
JP2024056596A | System and method for end-to-end feature compression in multidimensional data encoding | |
Megala et al. | State-of-the-art in video processing: compression, optimization and retrieval | |
WO2024217530A1 (en) | Method and apparatus for image encoding and decoding | |
US20240291995A1 (en) | Video processing method and related apparatus | |
Baroffio et al. | Hybrid coding of visual content and local image features | |
WO2023279968A1 (en) | Method and apparatus for encoding and decoding video image | |
WO2024005659A1 (en) | Adaptive selection of entropy coding parameters | |
KR102072576B1 (en) | Apparatus and method for encoding and decoding of data | |
Rabie et al. | PixoComp: a novel video compression scheme utilizing temporal pixograms | |
Anandan et al. | Nonsubsampled contourlet transform based video compression using Huffman and run length encoding for multimedia applications | |
Kufa et al. | Quality comparison of 360° 8K images compressed by conventional and deep learning algorithms | |
Adami et al. | Embedded indexing in scalable video coding | |
KR102804777B1 (en) | Action Recognition Method by Deep Learning from Video Compression Domain and device therefor | |
Ye et al. | A novel image compression framework at edges | |
US11831887B1 (en) | Scalable video coding for machine | |
US20240364890A1 (en) | Compression of bitstream indexes for wide scale parallel entropy coding in neural-based video codecs | |
WO2025148762A1 (en) | Method and compression framework with post-processing for machine vision | |
TW202503681A (en) | Encoding and decoding method and apparatus |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18772429; Country of ref document: EP; Kind code of ref document: A1
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 18772429; Country of ref document: EP; Kind code of ref document: A1