
HK40075484A - Method, computer apparatus, device and storage medium for video processing - Google Patents

Method, computer apparatus, device and storage medium for video processing Download PDF

Info

Publication number
HK40075484A
Authority
HK
Hong Kong
Prior art keywords
picture
format
video
subsampled
prediction
Prior art date
Application number
HK62022064558.9A
Other languages
Chinese (zh)
Other versions
HK40075484B (en)
Inventor
欧阳祥
李翔
刘杉
Original Assignee
腾讯美国有限责任公司 (Tencent America LLC)
Priority date
Filing date
Publication date
Application filed by 腾讯美国有限责任公司 (Tencent America LLC)
Publication of HK40075484A publication Critical patent/HK40075484A/en
Publication of HK40075484B publication Critical patent/HK40075484B/en


Description

Method and apparatus for video encoding
Incorporation by Reference
This application claims the benefit of priority from U.S. Patent Application No. 17/463,352, "METHOD AND APPARATUS FOR VIDEO CODING", filed on August 31, 2021, which claims the benefit of priority from U.S. Provisional Application No. 63/131,656, "APPLICATION OF CLIPPING TO IMPROVE PRE-PROCESSING IN A NEURAL NETWORK BASED IN-LOOP FILTER IN A VIDEO CODEC", filed on December 29, 2020. The entire disclosures of the prior applications are hereby incorporated by reference.
Technical Field
The present disclosure describes embodiments that relate generally to video encoding. More specifically, the present disclosure provides techniques for improving neural network-based in-loop filters.
Background
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Video encoding and decoding may be performed using inter-picture prediction with motion compensation. Uncompressed digital video may comprise a series of pictures, each picture having a spatial dimension of, for example, 1920 x 1080 luma samples and associated chroma samples. The series of pictures may have a fixed or variable picture rate (informally also referred to as frame rate) of, for example, 60 pictures per second or 60 Hz. Uncompressed video has significant bit rate requirements. For example, 8-bit per sample 1080p60 4:2:0 video (1920 x 1080 luma sample resolution at a 60 Hz frame rate) requires close to 1.5 Gbit/s of bandwidth. One hour of such video requires more than 600 gigabytes (GBytes) of storage space.
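The figures above follow from simple arithmetic. As a rough, non-normative illustration (a sketch added here for clarity, not part of the original disclosure), the raw bit rate and per-hour storage can be computed as follows:

```python
# Back-of-the-envelope estimate for 8-bit 1080p60 4:2:0 video.
# All values are illustrative; nothing here is taken from a standard.
width, height = 1920, 1080      # luma resolution
frame_rate = 60                 # pictures per second
bit_depth = 8                   # bits per sample

# In 4:2:0 subsampling each chroma plane holds a quarter of the luma
# samples, so the two chroma planes add 0.5x the luma sample count.
samples_per_picture = width * height * 1.5
bits_per_second = samples_per_picture * bit_depth * frame_rate

print(f"{bits_per_second / 1e9:.2f} Gbit/s")                   # ~1.49 Gbit/s
print(f"{bits_per_second * 3600 / 8 / 1e9:.0f} GBytes/hour")   # ~672 GBytes
```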
One purpose of video encoding and decoding may be to reduce redundancy in the input video signal by compression. Compression may help reduce the aforementioned bandwidth or storage space requirements, by two orders of magnitude or more in some cases. Both lossless and lossy compression, and combinations thereof, may be employed. Lossless compression refers to a technique by which an exact copy of an original signal can be reconstructed from a compressed original signal. When lossy compression is used, the reconstructed signal may not be identical to the original signal, but the distortion between the original signal and the reconstructed signal is small enough to make the reconstructed signal useful for the intended application. In the case of video, lossy compression is widely adopted. The amount of distortion tolerated depends on the application; for example, users of certain consumer streaming applications may tolerate higher distortion than users of television distribution applications. The achievable compression ratio may reflect: higher allowable/tolerable distortion may result in higher compression ratios.
Video encoders and decoders may utilize techniques from several broad categories, including, for example, motion compensation, transformation, quantization, and entropy coding.
Video coding techniques may include a technique known as intra coding. In intra coding, sample values are represented without reference to samples or other data from previously reconstructed reference pictures. In some video codecs, a picture is spatially subdivided into blocks of samples. When all sample blocks are encoded in intra mode, the picture may be an intra picture. Intra pictures and derivatives thereof (e.g., independent decoder refresh pictures) can be used to reset decoder state and thus can be used as the first picture in an encoded video bitstream and video session, or as still pictures. Samples of an intra block may be subjected to a transform, and the transform coefficients may be quantized prior to entropy encoding. Intra prediction may be a technique that minimizes sample values in the pre-transform domain. In some cases, the smaller the DC value after transformation and the smaller the AC coefficients, the fewer bits are needed to represent the block after entropy encoding at a given quantization step size.
Conventional intra coding, such as that known from, for example, MPEG-2 generation coding techniques, does not use intra prediction. However, some newer video compression techniques include techniques that attempt to use, for example, surrounding sample data and/or metadata obtained during the encoding/decoding of spatially adjacent blocks of data that precede the current block in decoding order. Such techniques are hereinafter referred to as "intra prediction" techniques. Note that, in at least some cases, intra prediction uses only reference data from the current picture under reconstruction, not from a reference picture.
There may be many different forms of intra prediction. When more than one such technique may be used in a given video encoding technique, the techniques used may be encoded in intra prediction mode. In some cases, a mode may have sub-modes and/or parameters, and these sub-modes and/or parameters may be encoded separately or included in a mode codeword. Which codewords to use for a given mode/sub-mode/parameter combination may have an effect on the coding efficiency gain through intra-prediction, and so does the entropy coding technique used to convert the codewords into a bitstream.
Certain modes of intra prediction were introduced in H.264, improved in H.265, and further improved in newer coding techniques such as the Joint Exploration Model (JEM), Versatile Video Coding (VVC), and the Benchmark Set (BMS). A predictor block may be formed using neighboring sample values belonging to already available samples. Sample values of neighboring samples are copied into the predictor block according to a direction. A reference to the direction in use may be encoded in the bitstream or may itself be predicted.
Referring to FIG. 1A, depicted at the bottom right is a subset of nine predictor directions known from the 33 possible predictor directions of H.265 (corresponding to 33 angular modes of the 35 intra modes). The point (101) where the arrows converge represents the sample being predicted. The arrows indicate the directions from which the sample is predicted. For example, the arrow (102) indicates that the sample (101) is predicted from one or more samples at the upper right, at an angle of 45 degrees to the horizontal. Similarly, the arrow (103) indicates that the sample (101) is predicted from one or more samples to the lower left of the sample (101), at an angle of 22.5 degrees to the horizontal.
Still referring to FIG. 1A, at the top left, a square block (104) of 4 × 4 samples is depicted (indicated by a dashed, bold line). The block (104) includes 16 samples, each labeled with "S", its position in the Y dimension (e.g., row index), and its position in the X dimension (e.g., column index). For example, sample S21 is the second sample in the Y dimension (from the top) and the first sample in the X dimension (from the left). Similarly, sample S44 is the fourth sample in block (104) in both the Y and X dimensions. Since the block size is 4 × 4 samples, S44 is at the bottom right. Reference samples that follow a similar numbering scheme are further shown. A reference sample is labeled with "R", its Y position (e.g., row index), and its X position (column index) relative to block (104). In both H.264 and H.265, the prediction samples are adjacent to the block under reconstruction; therefore, negative values need not be used.
Intra picture prediction can be performed by copying reference sample values from neighboring samples delimited by the signaled prediction direction. For example, assume that the encoded video bitstream includes the following signaling: for this block, the signaling indicates a prediction direction that coincides with the arrow (102), that is, samples are predicted from one or more prediction samples at the upper right, at an angle of 45 degrees to the horizontal. In this case, the sample S41, the sample S32, the sample S23, and the sample S14 are predicted from the same reference sample R05. The sample S44 is then predicted from the reference sample R08.
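To make the copy operation concrete, the following is a minimal sketch of the 45-degree case described above; the function name, array layout, and zero-based indexing are assumptions introduced here for illustration and are not the notation of the figure or of any codec specification.

```python
import numpy as np

def predict_45deg_up_right(top_refs: np.ndarray, block_size: int = 4) -> np.ndarray:
    """Copy top-row reference samples along a 45-degree (up-right) direction.

    top_refs[k] plays the role of reference sample R0k in the figure
    (k = 0 is the top-left corner sample). The block sample at zero-based
    (row, col) then takes its value from top_refs[row + col + 2], so that
    S41, S32, S23, and S14 all come from R05 and S44 comes from R08.
    """
    pred = np.empty((block_size, block_size), dtype=top_refs.dtype)
    for row in range(block_size):
        for col in range(block_size):
            pred[row, col] = top_refs[row + col + 2]
    return pred

# R00..R08: one reconstructed row above the block, extending to the upper right.
top_refs = np.arange(0, 90, 10)
print(predict_45deg_up_right(top_refs))
```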
In some cases, the values of multiple reference samples may be combined, for example by interpolation, to calculate a reference sample; especially when the direction cannot be evenly divided by 45 degrees.
As video coding techniques have evolved, the number of possible directions has increased. In H.264 (2003), nine different directions could be represented. This increased to 33 in H.265 (2013), and JEM/VVC/BMS could support up to 65 directions at the time of publication. Experiments have been performed to identify the most likely directions, and certain techniques in entropy coding are used to represent those likely directions with a small number of bits, accepting some penalty for less likely directions. Furthermore, the directions themselves can sometimes be predicted from neighboring directions used in neighboring, already decoded blocks.
Fig. 1B shows a schematic diagram (180) depicting 65 intra prediction directions according to JEM to show the increase in the number of prediction directions over time.
The mapping of intra prediction direction bits representing directions in the encoded video bitstream may vary from video coding technique to video coding technique; and may range, for example, from simple direct mapping of prediction direction to intra-prediction mode, to codewords, to complex adaptation schemes involving the most probable mode, and similar techniques. In all cases, however, there may be certain directions that are statistically less likely to occur in the video content than certain other directions. Since the goal of video compression is to reduce redundancy, in well-working video coding techniques, those less likely directions will be represented by more bits than more likely directions.
Motion compensation may be a lossy compression technique and may involve techniques in which a block of sample data from a previously reconstructed picture or part thereof (a reference picture), after being spatially shifted in a direction indicated by a motion vector (hereafter MV), is used to predict a newly reconstructed picture or picture part. In some cases, the reference picture may be the same as the picture currently being reconstructed. An MV may have two dimensions, X and Y, or three dimensions, the third being an indication of the reference picture in use (the latter may indirectly be a temporal dimension).
In some video compression techniques, an MV applicable to a certain region of sample data may be predicted from other MVs, e.g., from MVs related to another region of sample data spatially adjacent to the region under reconstruction and preceding that MV in decoding order. Doing so can significantly reduce the amount of data required to encode the MV, thereby eliminating redundancy and increasing compression. MV prediction can work efficiently, for example, because when encoding an input video signal derived from a camera (referred to as natural video), there is a statistical likelihood that regions larger than the area to which a single MV applies move in a similar direction, and therefore a similar motion vector derived from the MVs of adjacent regions can in some cases be used for prediction. As a result, the MV found for a given region is similar or identical to the MV predicted from surrounding MVs and can in turn be represented, after entropy encoding, by fewer bits than would be used to encode the MV directly. In some cases, MV prediction may be an example of lossless compression of a signal (i.e., the MVs) derived from an original signal (i.e., the sample stream). In other cases, MV prediction itself may be lossy, for example because of rounding errors when a predictor is calculated from several surrounding MVs.
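As a rough sketch of the idea (not the rule of any particular standard), a predictor can be formed from neighboring MVs, e.g., by a component-wise median, so that only a small difference remains to be entropy coded; the names below are illustrative assumptions:

```python
# Sketch of MV prediction from spatially neighboring MVs. The component-wise
# median is one common predictor choice; it is used here only for illustration.
def predict_mv(neighbor_mvs):
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    mid = len(neighbor_mvs) // 2
    return xs[mid], ys[mid]

current_mv = (13, -5)
predicted = predict_mv([(12, -4), (14, -5), (12, -6)])   # MVs of adjacent regions
mvd = (current_mv[0] - predicted[0], current_mv[1] - predicted[1])
print(predicted, mvd)   # a small difference costs fewer bits after entropy coding
```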
Various MV prediction mechanisms are described in H.265/HEVC (ITU-T Rec. H.265, "High Efficiency Video Coding", December 2016). Among the many MV prediction mechanisms provided by H.265, described herein is a technique referred to hereinafter as "spatial merging".
Referring to fig. 2, a current block (201) includes samples that are found by an encoder during a motion search process to be predictable from previous blocks of the same size that have been spatially shifted. Instead of directly encoding MVs, MVs may be derived from metadata associated with one or more reference pictures, e.g., from the nearest (in decoding order) reference picture, using MVs associated with any of the five surrounding samples denoted as A0, A1 and B0, B1, B2 (202 to 206, respectively). In h.265, MV prediction may use predictors from the same reference picture that neighboring blocks are using.
Disclosure of Invention
Aspects of the present disclosure provide methods and apparatus for video processing. In some examples, an apparatus for video processing includes a processing circuit. The processing circuit converts a picture in a subsampled format in a color space into a non-subsampled format in the color space. The processing circuit then clips the values of the color components of the picture in the non-subsampled format before providing the picture in the non-subsampled format as input to a neural network-based filter.
In some examples, the processing circuitry clips values of color components of the picture in the non-subsampled format into valid ranges of the color components. In an example, the processing circuit clips values of color components of the picture in the non-subsampled format into a range determined based on the bit depth. In another example, the processing circuitry clips values of color components of the picture in a non-subsampled format into a predetermined range.
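As an illustration of the bit-depth-based case (a hypothetical sketch; the function name and the specific range [0, 2^bit_depth - 1] are assumptions made for this example):

```python
import numpy as np

def clip_color_components(plane: np.ndarray, bit_depth: int) -> np.ndarray:
    """Clip sample values of one color component into the range implied by the bit depth."""
    max_val = (1 << bit_depth) - 1
    return np.clip(plane, 0, max_val)

# Example: 10-bit samples are forced into [0, 1023] before the NN-based filter.
y_plane = np.array([[-3, 100], [1030, 512]])
print(clip_color_components(y_plane, bit_depth=10))
```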
In some examples, the processing circuitry determines a range for clipping the values based on decoded information from a bitstream carrying the picture, and then clips the values of the color components of the picture in the non-subsampled format into the determined range. In an example, the processing circuitry decodes a signal indicative of the range from at least one of a sequence parameter set, a picture parameter set, a slice header, and a picture header in the bitstream.
In some examples, the processing circuit reconstructs a picture in the subsampled format based on decoded information from the bitstream and applies a deblocking filter to the picture in the subsampled format. In some examples, the processing circuit applies the neural network-based filter to the picture in the non-subsampled format having the clipped values to generate a filtered picture in the non-subsampled format, and converts the filtered picture in the non-subsampled format into a filtered picture in the subsampled format.
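Putting these steps together, the following is a minimal end-to-end sketch of the processing order described in this summary (reconstruct, deblock, convert 4:2:0 to 4:4:4, clip, filter with the neural network, convert back). All function and variable names are hypothetical, and nearest-neighbor chroma up/down-sampling is used only to keep the example short; it is not asserted to be the conversion used in any embodiment.

```python
import numpy as np

def chroma_420_to_444(c: np.ndarray) -> np.ndarray:
    # Nearest-neighbor upsampling of a chroma plane (illustrative only).
    return np.repeat(np.repeat(c, 2, axis=0), 2, axis=1)

def chroma_444_to_420(c: np.ndarray) -> np.ndarray:
    # Nearest-neighbor downsampling back to the subsampled format.
    return c[::2, ::2]

def filter_reconstructed_picture(y, cb, cr, bit_depth, deblock, nn_filter):
    """Deblock in 4:2:0, convert to 4:4:4, clip, apply the NN-based filter, convert back."""
    y, cb, cr = deblock(y, cb, cr)                          # deblocking on the subsampled picture
    picture_444 = np.stack([y,
                            chroma_420_to_444(cb),
                            chroma_420_to_444(cr)])         # non-subsampled format
    picture_444 = np.clip(picture_444, 0, (1 << bit_depth) - 1)   # clipping step
    filtered_444 = nn_filter(picture_444)                   # neural network-based filter
    return (filtered_444[0],
            chroma_444_to_420(filtered_444[1]),
            chroma_444_to_420(filtered_444[2]))
```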
In some examples, a picture in the non-subsampled format having the clipped values is stored in a storage device. The stored pictures in the non-subsampled format with the clipped values may then be provided as training inputs to train the neural network in the neural network-based filter.
Aspects of the present disclosure also provide a non-transitory computer-readable medium storing instructions that, when executed by a computer for video decoding, cause the computer to perform a method for video processing.
Drawings
Other features, properties, and various advantages of the disclosed subject matter will become more apparent from the following detailed description and the accompanying drawings, in which:
fig. 1A is a schematic illustration of an exemplary subset of intra prediction modes;
FIG. 1B is a diagram of exemplary intra prediction directions;
FIG. 2 is a schematic illustration of a current block and its surrounding spatial merge candidates in one example;
fig. 3 is a schematic illustration of a simplified block diagram of a communication system (300) according to an embodiment;
fig. 4 is a schematic illustration of a simplified block diagram of a communication system (400) according to an embodiment;
fig. 5 is a schematic illustration of a simplified block diagram of a decoder according to an embodiment;
fig. 6 is a schematic illustration of a simplified block diagram of an encoder according to an embodiment;
FIG. 7 shows a block diagram of an encoder according to another embodiment;
FIG. 8 shows a block diagram of a decoder according to another embodiment;
fig. 9 illustrates a block diagram of a loop filter unit in some examples.
FIG. 10 illustrates a block diagram of another loop filter unit in some examples.
Fig. 11 illustrates a block diagram of a neural network-based filter in some examples.
Fig. 12 illustrates a block diagram of a pre-processing module in some examples.
Fig. 13 illustrates a block diagram of a neural network structure in some examples.
Fig. 14 shows a block diagram of a dense residual unit.
Fig. 15 illustrates a block diagram of a post-processing module in some examples.
Fig. 16 illustrates a block diagram of a pre-processing module in some examples.
FIG. 17 shows a flowchart outlining an example of a process.
Fig. 18 is a schematic illustration of a computer system, according to an embodiment.
Detailed Description
Fig. 3 shows a simplified block diagram of a communication system (300) according to an embodiment of the present disclosure. The communication system (300) comprises a plurality of terminal devices, which may communicate with each other via, for example, a network (350). For example, a communication system (300) includes a first pair of terminal devices (310) and (320) interconnected via a network (350). In the example of fig. 3, the first pair of terminal devices (310) and (320) performs a unidirectional transmission of data. For example, the terminal device (310) may encode video data (e.g., a stream of video pictures captured by the terminal device (310)) for transmission to another terminal device (320) via the network (350). The encoded video data may be transmitted in the form of one or more encoded video bitstreams. The terminal device (320) may receive encoded video data from the network (350), decode the encoded video data to recover video pictures, and display the video pictures according to the recovered video data. Unidirectional data transmission may be common in media serving applications and the like.
In another example, a communication system (300) includes a second pair of terminal devices (330) and (340), which performs bi-directional transmission of encoded video data that may occur, for example, during a video conference. For bi-directional transmission of data, in an example, each of the terminal devices (330) and (340) may encode video data (e.g., a stream of video pictures captured by the terminal device) for transmission to the other of the terminal devices (330) and (340) via the network (350). Each of the terminal devices (330) and (340) may also receive encoded video data transmitted by the other of the terminal devices (330) and (340), may decode the encoded video data to recover the video pictures, and may display the video pictures on an accessible display device according to the recovered video data.
In the example of fig. 3, the terminal devices (310), (320), (330), and (340) may be illustrated as a server, a personal computer, and a smartphone, but the principles of the present disclosure may not be limited thereto. Embodiments of the present disclosure may be applied to laptop computers, tablet computers, media players, and/or dedicated video conferencing equipment. The network (350) represents any number of networks that transport encoded video data between the terminal devices (310), (320), (330), and (340), including, for example, wired and/or wireless communication networks. The communication network (350) may exchange data in circuit-switched and/or packet-switched channels. Representative networks include telecommunications networks, local area networks, wide area networks, and/or the internet. For purposes of the present discussion, the architecture and topology of the network (350) may be unimportant to the operation of the present disclosure, unless described herein below.
Fig. 4 shows an arrangement of a video encoder and a video decoder in a streaming environment as an example of an application of the disclosed subject matter. The disclosed subject matter may be equally applicable to other video-enabled applications including, for example, video conferencing, digital TV, storing compressed video on digital media including CDs, DVDs, memory sticks, and the like.
The streaming system may include a capture subsystem (413), which may include a video source (401), such as a digital video camera, that creates, for example, an uncompressed video picture stream (402). In an example, the video picture stream (402) includes samples taken by a digital camera. The video picture stream (402), depicted as a bold line to emphasize a high data volume when compared to encoded video data (404) (or an encoded video bitstream), may be processed by an electronic device (420) comprising a video encoder (403) coupled to the video source (401). The video encoder (403) may include hardware, software, or a combination thereof to implement or realize aspects of the disclosed subject matter as described in more detail below. The encoded video data (404) (or encoded video bitstream (404)), depicted as a thin line to emphasize a lower amount of data when compared to the video picture stream (402), may be stored on a streaming server (405) for future use. One or more streaming client subsystems, such as client subsystems (406) and (408) in fig. 4, may access the streaming server (405) to retrieve copies (407) and (409) of the encoded video data (404). The client subsystem (406) may include, for example, a video decoder (410) in an electronic device (430). The video decoder (410) decodes the incoming copy (407) of the encoded video data and creates an outgoing video picture stream (411) that can be rendered on a display (412) (e.g., a display screen) or other rendering device (not depicted). In some streaming systems, the encoded video data (404), (407), and (409) (e.g., video bitstreams) may be encoded according to certain video encoding/compression standards. Examples of such standards include ITU-T Recommendation H.265. In an example, the video coding standard under development is informally known as Versatile Video Coding (VVC). The disclosed subject matter may be used in the context of VVC.
Note that electronic devices (420) and (430) may include other components (not shown). For example, the electronic device (420) may include a video decoder (not shown), and the electronic device (430) may also include a video encoder (not shown).
Fig. 5 shows a block diagram of a video decoder (510) according to an embodiment of the present disclosure. The video decoder (510) may be included in an electronic device (530). The electronic device (530) may include a receiver (531) (e.g., a receive circuit). The video decoder (510) may be used in place of the video decoder (410) in the example of fig. 4.
The receiver (531) may receive one or more encoded video sequences to be decoded by the video decoder (510); in the same or another embodiment, one encoded video sequence is decoded at a time, and each encoded video sequence is decoded independently of other encoded video sequences. An encoded video sequence may be received from a channel (501), which may be a hardware/software link to a storage device that stores the encoded video data. The receiver (531) may receive the encoded video data together with other data, e.g., encoded audio data and/or auxiliary data streams, which may be forwarded to their respective usage entities (not depicted). The receiver (531) may separate the encoded video sequence from the other data. To combat network jitter, a buffer memory (515) may be coupled between the receiver (531) and the entropy decoder/parser (520) (hereinafter "parser (520)"). In some applications, the buffer memory (515) is part of the video decoder (510). In other embodiments, the buffer memory may be external to the video decoder (510) (not depicted). In still other embodiments, there may be a buffer memory (not depicted) external to the video decoder (510), for example to combat network jitter, and additionally another buffer memory (515) internal to the video decoder (510), for example to handle playback timing. When the receiver (531) receives data from a store/forward device with sufficient bandwidth and controllability, or from an isochronous network, the buffer memory (515) may not be needed or may be small. For use over a best-effort packet network such as the internet, a buffer memory (515) may be required, which may be relatively large, may advantageously be of adaptive size, and may be implemented at least partially in an operating system or similar element (not depicted) external to the video decoder (510).
The video decoder (510) may include a parser (520) to reconstruct symbols (521) from the encoded video sequence. Categories of these symbols include information used to manage the operation of the video decoder (510), and potentially information to control a rendering device, such as a rendering device (512) (e.g., a display screen) that is not an integral part of the electronic device (530) but may be coupled to the electronic device (530), as shown in fig. 5. The control information for the rendering device(s) may be in the form of Supplemental Enhancement Information (SEI) messages or Video Usability Information (VUI) parameter set fragments (not depicted). The parser (520) may parse/entropy decode the received encoded video sequence. The encoding of the encoded video sequence may be in accordance with a video coding technique or standard, and may follow various principles, including variable length coding, Huffman coding, arithmetic coding with or without context sensitivity, and so forth. The parser (520) may extract from the encoded video sequence a set of subgroup parameters for at least one of the subgroups of pixels in the video decoder, based on at least one parameter corresponding to the group. The subgroups may include groups of pictures (GOPs), pictures, tiles, slices, macroblocks, coding units (CUs), blocks, transform units (TUs), prediction units (PUs), and so forth. The parser (520) may also extract information such as transform coefficients, quantizer parameter values, motion vectors, and the like from the encoded video sequence.
The parser (520) may perform entropy decoding/parsing operations on the video sequence received from the buffer memory (515) to create symbols (521).
The reconstruction of the symbols (521) may involve a number of different units depending on the type of encoded video picture or portion thereof (e.g., inter and intra pictures, inter and intra blocks), and other factors. Which units are involved and how these units are involved can be controlled by subgroup control information parsed from the encoded video sequence by a parser (520). For clarity, the flow of such subgroup control information between parser (520) and the following units is not depicted.
In addition to the functional blocks already mentioned, the video decoder (510) may be conceptually subdivided into a plurality of functional units, as described below. In a practical implementation operating under business constraints, many of these units interact tightly with each other and may be at least partially integrated into each other. However, for purposes of describing the disclosed subject matter, the conceptual subdivision into the functional units below is appropriate.
The first unit is a scaler/inverse transform unit (551). The scaler/inverse transform unit (551) receives the quantized transform coefficients from the parser (520) as symbols (521) along with control information including what transform to use, block size, quantization factor, quantization scaling matrix, etc. The scaler/inverse transform unit (551) may output a block including sample values that may be input into the aggregator (555).
In some cases, the output samples of the scaler/inverse transform unit (551) may relate to intra-coded blocks; that is, blocks that do not use predictive information from previously reconstructed pictures, but can use predictive information from previously reconstructed portions of the current picture. Such predictive information may be provided by an intra picture prediction unit (552). In some cases, the intra picture prediction unit (552) uses surrounding, already reconstructed information obtained from the current picture buffer (558) to generate a block of the same size and shape as the block being reconstructed. The current picture buffer (558) buffers, for example, a partially reconstructed current picture and/or a fully reconstructed current picture. In some cases, the aggregator (555) adds, on a per-sample basis, the prediction information that the intra prediction unit (552) has generated to the output sample information provided by the scaler/inverse transform unit (551).
In other cases, the output samples of the scaler/inverse transform unit (551) may relate to inter-coded and possibly motion-compensated blocks. In such a case, the motion compensated prediction unit (553) may access a reference picture memory (557) to obtain samples used for prediction. After motion compensation of the obtained samples according to the symbols (521) related to the block, these samples may be added by the aggregator (555) to the output of the scaler/inverse transform unit (551), in this case referred to as residual samples or residual signal, in order to generate output sample information. The addresses within the reference picture memory (557) from which the motion compensated prediction unit (553) obtains the prediction samples may be controlled by motion vectors, which are available to the motion compensated prediction unit (553) in the form of symbols (521) that may have, for example, X, Y, and reference picture components. Motion compensation may also include interpolation of sample values obtained from the reference picture memory (557) when sub-sample-precision motion vectors are in use, motion vector prediction mechanisms, and so on.
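As a rough, simplified sketch of this step (integer-precision MVs only, no sub-sample interpolation; all names are illustrative assumptions rather than elements of the decoder described above):

```python
import numpy as np

def motion_compensate(reference: np.ndarray, x: int, y: int, w: int, h: int, mv):
    """Fetch the prediction block addressed by an integer motion vector (mvx, mvy)."""
    mvx, mvy = mv
    return reference[y + mvy : y + mvy + h, x + mvx : x + mvx + w]

def reconstruct_inter_block(reference, x, y, residual, mv):
    """Aggregator step: prediction samples plus residual samples."""
    h, w = residual.shape
    pred = motion_compensate(reference, x, y, w, h, mv)
    return pred + residual
```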
The output samples of the aggregator (555) may be subjected to various loop filtering techniques in a loop filter unit (556). The video compression techniques may include in-loop filter techniques controlled by parameters included in the encoded video sequence (also referred to as the encoded video bitstream) and available to the loop filter unit (556) as symbols (521) from the parser (520), but may also be responsive to meta-information obtained during decoding of previous (in decoding order) portions of the encoded picture or encoded video sequence, and to sample values previously reconstructed and loop filtered.
The output of the loop filter unit (556) may be a sample stream, which may be output to a rendering device (512) and stored in a reference picture memory (557) for use in future inter picture prediction.
Some coded pictures, once fully reconstructed, may be used as reference pictures for future prediction. For example, once the encoded picture corresponding to the current picture is fully reconstructed and the encoded picture has been identified as a reference picture (e.g., by parser (520)), current picture buffer (558) may become part of reference picture memory (557) and a new current picture buffer may be reallocated before starting reconstruction of a subsequent encoded picture.
The video decoder (510) may perform the decoding operation according to a predetermined video compression technique in a standard, such as ITU-T Recommendation H.265. An encoded video sequence may conform to the syntax specified by the video compression technique or standard used, in the sense that the encoded video sequence adheres both to the syntax of the video compression technique or standard and to the profiles set forth in the video compression technique or standard. In particular, a profile may select certain tools from all tools available in the video compression technique or standard as the only tools available under that profile. Also necessary for compliance may be that the complexity of the encoded video sequence is within bounds defined by the level of the video compression technique or standard. In some cases, the level limits the maximum picture size, the maximum frame rate, the maximum reconstruction sampling rate (e.g., measured in megasamples per second), the maximum reference picture size, and so on. In some cases, the limits set by the level may be further restricted by Hypothetical Reference Decoder (HRD) specifications and metadata for HRD buffer management signaled in the encoded video sequence.
In an embodiment, the receiver (531) may receive additional (redundant) data with the encoded video. The additional data may be included as part of the encoded video sequence. The additional data may be used by the video decoder (510) to properly decode the data and/or more accurately reconstruct the original video data. The additional data may be in the form of, for example, a temporal, spatial, or signal-to-noise ratio (SNR) enhancement layer, a redundant slice, a redundant picture, a forward error correction code, etc.
Fig. 6 shows a block diagram of a video encoder (603) according to an embodiment of the present disclosure. The video encoder (603) is comprised in an electronic device (620). The electronic device (620) includes a transmitter (640) (e.g., transmit circuitry). The video encoder (603) may be used in place of the video encoder (403) in the example of fig. 4.
The video encoder (603) may receive video samples from a video source (601) (which is not part of the electronic device (620) in the example of fig. 6) that may capture video images to be encoded by the video encoder (603). In another example, the video source (601) is part of an electronic device (620).
The video source (601) may provide a source video sequence to be encoded by the video encoder (603) in the form of a stream of digital video samples, which may have any suitable bit depth (e.g., 8 bit, 10 bit, 12 bit, ...), any color space (e.g., BT.601 Y CrCb, RGB, ...), and any suitable sampling structure (e.g., Y CrCb 4:2:0, Y CrCb 4:4:4). In a media serving system, the video source (601) may be a storage device storing previously prepared video. In a video conferencing system, the video source (601) may be a camera that captures local image information as a video sequence. The video data may be provided as a plurality of individual pictures that impart motion when viewed in sequence. The pictures themselves may be organized as a spatial array of pixels, where each pixel may comprise one or more samples, depending on the sampling structure, color space, etc. in use. The relationship between pixels and samples can be readily understood by those skilled in the art. The description below focuses on samples.
According to an embodiment, the video encoder (603) may encode and compress pictures of a source video sequence into an encoded video sequence (643) in real time or under any other time constraints required by the application. Enforcing the appropriate encoding speed is one function of the controller (650). In some embodiments, the controller (650) controls and is functionally coupled to other functional units as described below. For clarity of presentation, the coupling is not depicted. The parameters set by the controller (650) may include rate control related parameters (picture skip, quantizer, lambda value of rate-distortion optimization techniques, ...), picture size, group of pictures (GOP) layout, maximum motion vector search range, and so forth. The controller (650) may be configured with other suitable functions with respect to the video encoder (603) optimized for a particular system design.
In some implementations, the video encoder (603) is configured to operate in an encoding loop. As an oversimplified description, in an example, the encoding loop may include a source encoder (630) (e.g., responsible for creating symbols such as a symbol stream based on input pictures and reference pictures to be encoded) and a (local) decoder (633) embedded in the video encoder (603). The decoder (633) reconstructs the symbols to create sample data in a similar manner as would be created by a (remote) decoder (since any compression between the symbols and the encoded video bitstream is lossless in the video compression techniques considered in the disclosed subject matter). The reconstructed sample stream (sample data) is input to a reference picture memory (634). Since the decoding of the symbol stream results in bit-exact (bit-exact) results that are independent of the decoder location (local or remote), the content in the reference picture memory (634) is also bit-exact between the local encoder and the remote encoder. In other words, the prediction portion of the encoder "sees" the exact same sample values that the decoder would "see" as reference picture samples when using prediction during decoding. This basic principle of reference picture synchronicity (and drift if synchronicity cannot be maintained due to channel errors, for example) is also used in some related technologies.
The operation of the "local" decoder (633) may be the same as a "remote" decoder, which is, for example, the video decoder (510) that has been described in detail above in connection with fig. 5. However, referring briefly to fig. 5, the entropy decoding portion of the video decoder (510), including the buffer memory (515) and the parser (520), may not be fully implemented in the local decoder (633) since symbols are available and the encoding/decoding of the symbols into an encoded video sequence by the entropy encoder (645) and the parser (520) may be lossless.
It can be observed at this point that any decoder technique, other than the parsing/entropy decoding present in the decoder, must also be present, in substantially identical functional form, in the corresponding encoder. For this reason, the disclosed subject matter focuses on decoder operation. The description of the encoder techniques can be abbreviated because they are the inverse of the comprehensively described decoder techniques. A more detailed description is needed, and is provided below, only in certain areas.
During operation, in some examples, the source encoder (630) may perform motion compensated predictive encoding that predictively encodes an input picture with reference to one or more previously encoded pictures from the video sequence that are designated as "reference pictures". In this manner, the encoding engine (632) encodes differences between pixel blocks of the input picture and pixel blocks of a reference picture that may be selected as a prediction reference for the input picture.
The local video decoder (633) may decode encoded video data for a picture that may be designated as a reference picture based on the symbols created by the source encoder (630). The operation of the encoding engine (632) may advantageously be a lossy process. When the encoded video data can be decoded at a video decoder (not shown in figure 6), the reconstructed video sequence may typically be a copy of the source video sequence with some errors. The local video decoder (633) replicates a decoding process that may be performed on reference pictures by the video decoder, and may cause reconstructed reference pictures to be stored in a reference picture cache (634). In this way, the video encoder (603) can locally store a copy of the reconstructed reference picture that has common content (no transmission errors) with the reconstructed reference picture that will be obtained by the far-end video decoder.
The predictor (635) may perform prediction searches for the coding engine (632). That is, for a new picture to be encoded, the predictor (635) may search the reference picture memory (634) for sample data (as candidate reference pixel blocks) or certain metadata such as reference picture motion vectors, block shapes, and so forth, that may serve as a suitable prediction reference for the new picture. The predictor (635) may operate on a sample-block-by-pixel-block basis to find suitable prediction references. In some cases, as determined by search results obtained by the predictor (635), the input picture may have prediction references drawn from multiple reference pictures stored in the reference picture memory (634).
The controller (650) may manage the encoding operations of the source encoder (630), including, for example, the setting of parameters and sub-group parameters for encoding video data.
The outputs of all the foregoing functional units may undergo entropy encoding in an entropy encoder (645). The entropy encoder (645) converts the symbols generated by the various functional units into an encoded video sequence by lossless compression of the symbols according to techniques such as huffman coding, variable length coding, arithmetic coding, and the like.
The transmitter (640) may buffer the encoded video sequence created by the entropy encoder (645) in preparation for transmission via a communication channel (660), which may be a hardware/software link to a storage device that is to store the encoded video data. The transmitter (640) may merge the encoded video data from the video encoder (603) with other data to be transmitted, e.g., encoded audio data and/or an auxiliary data stream (source not shown).
The controller (650) can manage the operation of the video encoder (603). During encoding, the controller (650) may assign to each encoded picture a certain encoded picture type, which may affect the encoding techniques that may be applied to the respective picture. For example, a picture may often be assigned one of the following picture types:
an intra picture (I picture) may be a picture that can be encoded and decoded without using any other picture in the sequence as a prediction source. Some video codecs allow different types of intra pictures, including, for example, independent Decoder Refresh ("IDR") pictures. Those skilled in the art are aware of those variations of picture I and their respective applications and features.
A predictive picture (P-picture) may be a picture that is encoded and decoded using intra prediction or inter prediction, where the intra prediction or inter prediction uses at most one motion vector and reference index to predict sample values of each block.
A bi-predictive picture (B-picture) may be a picture encoded and decoded using intra prediction or inter prediction, which predicts sample values of each block using at most two motion vectors and reference indices. Similarly, multiple predictive pictures may use more than two reference pictures and associated metadata for reconstructing a single block.
A source picture may typically be spatially subdivided into blocks of samples (e.g., blocks of 4 × 4, 8 × 8, 4 × 8, or 16 × 16 samples each) and encoded block-wise. Blocks may be predictively encoded with reference to other (already encoded) blocks determined by the encoding allocation applied to the respective pictures of the block. For example, blocks of an I picture may be non-predictively encoded, or they may be predictively encoded (spatial prediction or intra prediction) with reference to already encoded blocks of the same picture. A block of pixels of a P picture may be predictively encoded via spatial prediction or via temporal prediction with reference to one previously encoded reference picture. A block of a B picture may be predictively encoded via spatial prediction or via temporal prediction with reference to one or two previously encoded reference pictures.
The video encoder (603) may perform encoding operations according to a predetermined video encoding technique or standard, such as ITU-T recommendation h.265. In its operation, the video encoder (603) may perform various compression operations, including predictive encoding operations that exploit temporal and spatial redundancies in the input video sequence. Thus, the encoded video data may conform to syntax specified by the video coding technique or standard used.
In an embodiment, the transmitter (640) may transmit additional data with the encoded video. The source encoder (630) may include such data as part of an encoded video sequence. The additional data may include temporal/spatial/SNR enhancement layers, other forms of redundant data such as redundant pictures and slices, SEI messages, VUI parameter set fragments, etc.
Video may be captured as a plurality of source pictures (video pictures) in a temporal sequence. Intra picture prediction (often abbreviated as intra prediction) exploits spatial correlation in a given picture, and inter picture prediction exploits (temporal or other) correlation between pictures. In an example, a particular picture being encoded/decoded, referred to as a current picture, is divided into blocks. When a block in a current picture is similar to a reference block in a previously encoded and still buffered reference picture in video, the block in the current picture may be encoded by a vector called a motion vector. The motion vector points to a reference block in a reference picture, and in the case where multiple reference pictures are used, may have a third dimension that identifies the reference picture.
In some embodiments, a bi-prediction technique may be used for inter-picture prediction. According to the bi-prediction technique, two reference pictures are used, e.g., a first reference picture and a second reference picture, both preceding the current picture in the video in decoding order (but possibly in the past and future, respectively, in display order). A block in a current picture may be encoded by a first motion vector pointing to a first reference block in the first reference picture and a second motion vector pointing to a second reference block in the second reference picture. The block may be predicted by a combination of the first reference block and the second reference block.
Furthermore, a merge mode technique may be used in inter picture prediction to improve coding efficiency.
According to some embodiments of the present disclosure, predictions such as inter-picture prediction and intra-picture prediction are performed in units of blocks. For example, according to the HEVC standard, pictures in a sequence of video pictures are divided into coding tree units (CTUs) for compression, the CTUs in a picture having the same size, e.g., 64 × 64 pixels, 32 × 32 pixels, or 16 × 16 pixels. In general, a CTU includes three coding tree blocks (CTBs): one luma CTB and two chroma CTBs. Each CTU may be recursively partitioned into one or more coding units (CUs) in a quadtree. For example, a CTU of 64 × 64 pixels may be divided into one CU of 64 × 64 pixels, or 4 CUs of 32 × 32 pixels, or 16 CUs of 16 × 16 pixels. In an example, each CU is analyzed to determine a prediction type for the CU, e.g., an inter prediction type or an intra prediction type. Depending on temporal and/or spatial predictability, a CU is partitioned into one or more prediction units (PUs). In general, each PU includes a luma prediction block (PB) and two chroma PBs. In an embodiment, a prediction operation in coding (encoding/decoding) is performed in units of prediction blocks. Using a luma prediction block as an example of a prediction block, the prediction block includes a matrix of values (e.g., luma values) of pixels, such as 8 × 8 pixels, 16 × 16 pixels, 8 × 16 pixels, 16 × 8 pixels, and so on.
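As a small illustration of the recursive quadtree partitioning described above (a sketch only; the split criterion is passed in as a callable because the actual decision, e.g., rate-distortion based, is outside the scope of this example):

```python
def split_ctu(x, y, size, should_split, min_cu_size=8):
    """Return a list of (x, y, size) coding units covering a CTU anchored at (x, y)."""
    if size > min_cu_size and should_split(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):
            for dx in (0, half):
                cus.extend(split_ctu(x + dx, y + dy, half, should_split, min_cu_size))
        return cus
    return [(x, y, size)]

# Example: split a 64x64 CTU down to 32x32 CUs everywhere.
print(split_ctu(0, 0, 64, lambda x, y, s: s > 32))
```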
Fig. 7 shows a diagram of a video encoder (703) according to another embodiment of the present disclosure. A video encoder (703) is configured to receive a processing block (e.g., a prediction block) of sample values within a current video picture in a sequence of video pictures, and encode the processing block as an encoded picture that is part of an encoded video sequence. In an example, a video encoder (703) is used instead of the video encoder (403) in the example of fig. 4.
In the HEVC example, a video encoder (703) receives a matrix of sample values of a processing block (e.g., a prediction block of 8 × 8 samples, etc.). The video encoder (703) determines whether the processing block is best encoded using intra mode, inter mode, or bi-prediction mode, e.g., using rate-distortion optimization. When the processing block is to be encoded in intra mode, the video encoder (703) may encode the processing block into an encoded picture using intra prediction techniques; and when the processing block is to be encoded in inter mode or bi-prediction mode, the video encoder (703) may encode the processing block into an encoded picture using inter prediction or bi-prediction techniques, respectively. In some video coding techniques, the merge mode may be an inter picture prediction sub-mode, in which motion vectors are derived from one or more motion vector predictors without resorting to coded motion vector components outside of the predictors. In some other video coding techniques, there may be motion vector components that may be applicable to the subject block. In an example, the video encoder (703) includes other components, such as a mode decision module (not shown) for determining a mode of the processing block.
In the example of fig. 7, the video encoder (703) includes an inter encoder (730), an intra encoder (722), a residual calculator (723), a switch (726), a residual encoder (724), a general purpose controller (721), and an entropy encoder (725) coupled together as shown in fig. 7.
The inter encoder (730) is configured to receive samples of a current block (e.g., a processed block), compare the block to one or more reference blocks in a reference picture (e.g., blocks in previous and subsequent pictures), generate inter prediction information (e.g., a description of redundant information according to an inter coding technique, a motion vector, merge mode information), and calculate an inter prediction result (e.g., a predicted block) based on the inter prediction information using any suitable technique. In some examples, the reference picture is a decoded reference picture that is decoded based on the encoded video information.
An intra encoder (722) is configured to receive samples of a current block (e.g., a processing block), in some cases compare the block to a block already encoded in the same picture, generate quantized coefficients after transformation, and in some cases also generate intra prediction information (e.g., intra prediction direction information according to one or more intra coding techniques). In an example, the intra encoder (722) also computes an intra-prediction result (e.g., a predicted block) based on a reference block and intra-prediction information in the same picture.
The general controller (721) is configured to determine the general-purpose control data, and controls other components of the video encoder (703) based on the general control data. In an example, a general purpose controller (721) determines a mode of a block and provides a control signal to a switch (726) based on the mode. For example, when the mode is intra mode, the general purpose controller (721) controls the switch (726) to select an intra mode result for use by the residual calculator (723), and controls the entropy encoder (725) to select and include intra prediction information in the bitstream; and when the mode is an inter mode, the general controller (721) controls the switch (726) to select an inter prediction result for use by the residual calculator (723), and controls the entropy encoder (725) to select and include inter prediction information in the bitstream.
The residual calculator (723) is configured to calculate a difference (residual data) between the received block and the prediction result selected from the intra encoder (722) or the inter encoder (730). The residual encoder (724) is configured to operate on the residual data to encode the residual data to generate transform coefficients. In an example, the residual encoder (724) is configured to convert the residual data from the spatial domain to the frequency domain and generate transform coefficients. The transform coefficients are then subjected to quantization processing to obtain quantized transform coefficients. In various embodiments, the video encoder (703) also includes a residual decoder (728). The residual decoder (728) is configured to perform the inverse transform and generate decoded residual data. The decoded residual data may be suitably used by the intra encoder (722) and the inter encoder (730). For example, the inter encoder (730) may generate decoded blocks based on the decoded residual data and the inter prediction information, and the intra encoder (722) may generate decoded blocks based on the decoded residual data and the intra prediction information. In some examples, the decoded blocks are suitably processed to generate decoded pictures, and the decoded pictures may be buffered in a memory circuit (not shown) and used as reference pictures.
The entropy encoder (725) is configured to format the bitstream to include encoded blocks. The entropy encoder (725) is configured to include various information according to a suitable standard, such as the HEVC standard. In an example, the entropy encoder (725) is configured to include general control data, selected prediction information (e.g., intra prediction information or inter prediction information), residual information, and other suitable information in the bitstream. Note that, according to the disclosed subject matter, there is no residual information when a block is encoded in a merge sub-mode of an inter mode or a bi-prediction mode.
Fig. 8 shows a diagram of a video decoder (810) according to another embodiment of the present disclosure. A video decoder (810) is configured to receive an encoded picture that is part of an encoded video sequence and decode the encoded picture to generate a reconstructed picture. In an example, a video decoder (810) is used in place of the video decoder (410) in the example of fig. 4.
In the example of fig. 8, the video decoder (810) includes an entropy decoder (871), an inter-frame decoder (880), a residual decoder (873), a reconstruction module (874), and an intra-frame decoder (872) coupled together as shown in fig. 8.
The entropy decoder (871) can be configured to reconstruct from the encoded picture certain symbols representing syntax elements constituting the encoded picture. Such symbols may include, for example, a mode in which the block is encoded (such as intra mode, inter mode, bi-prediction mode, a merge sub-mode of the latter two, or another sub-mode), prediction information (such as intra prediction information or inter prediction information) that may identify certain samples or metadata used for prediction by an intra decoder (872) or an inter decoder (880), residual information, e.g., in the form of quantized transform coefficients, etc. In an example, when the prediction mode is an inter mode or a bi-directional prediction mode, inter prediction information is provided to an inter decoder (880); and when the prediction type is an intra prediction type, the intra prediction information is provided to an intra decoder (872). The residual information may be inverse quantized and provided to a residual decoder (873).
An inter-frame decoder (880) is configured to receive the inter-frame prediction information and generate an inter-frame prediction result based on the inter-frame prediction information.
An intra-frame decoder (872) is configured to receive the intra-frame prediction information and generate a prediction result based on the intra-frame prediction information.
A residual decoder (873) is configured to perform inverse quantization to extract dequantized transform coefficients, and to process the dequantized transform coefficients to convert the residual from the frequency domain to the spatial domain. The residual decoder (873) may also need certain control information (e.g., the Quantizer Parameter (QP)), and this information may be provided by the entropy decoder (871) (data path not depicted, as this is only low-level control information).
The reconstruction module (874) is configured to combine, in the spatial domain, the residual output by the residual decoder (873) and the prediction result (output by the inter prediction module or the intra prediction module, as the case may be) to form a reconstructed block, which may be part of a reconstructed picture, which may in turn be part of a reconstructed video. It should be noted that other suitable operations, such as deblocking operations, etc., may be performed to improve visual quality.
Note that video encoders (403), (603), and (703) and video decoders (410), (510), and (810) may be implemented using any suitable technique. In an embodiment, the video encoders (403), (603), and (703) and the video decoders (410), (510), and (810) may be implemented using one or more integrated circuits. In another embodiment, the video encoders (403), (603), and (703) and the video decoders (410), (510), and (810) may be implemented using one or more processors executing software instructions.
Neural network technology may be used with video coding technology, and video coding technology that utilizes neural networks may be referred to as hybrid video coding technology. For example, a loop filter unit, such as loop filter unit (556), may apply various loop filters for sample filtering. The one or more loop filters may be implemented by a neural network. Aspects of the present disclosure provide in-loop filtering techniques in hybrid video coding techniques for improving picture quality using neural networks. In particular, according to an aspect of the present disclosure, techniques to crop data may be used prior to feeding the data to the kernel of the neural network-based in-loop filter.
According to an aspect of the disclosure, the loop filter is a filter that affects the reference data. For example, the image filtered by the loop filter unit (556) is stored in a buffer, e.g. in a reference picture memory (557), as a reference for further prediction. In-loop filters may improve video quality in video codecs.
Fig. 9 illustrates a block diagram of a loop filter unit (900) in some examples. In an example, a loop filter unit (900) may be used instead of the loop filter unit (556). In the example of fig. 9, the loop filter unit (900) includes a deblocking filter (901), a Sample Adaptive Offset (SAO) filter (902), and an Adaptive Loop Filter (ALF) filter (903). In some examples, the ALF filter (903) may include a Cross Component Adaptive Loop Filter (CCALF).
During operation, in an example, the loop filter unit (900) receives a reconstructed picture, applies various filters to the reconstructed picture, and generates an output picture in response to the reconstructed picture.
In some examples, the deblocking filter (901) and the SAO filter (902) are configured to remove blocking artifacts introduced when block coding techniques are used. The deblocking filter (901) may smooth the sharp edges that can form at block boundaries when block coding techniques are used. The SAO filter (902) may apply a particular offset to samples to reduce distortion relative to other samples in the video frame. The ALF filter (903) may apply a classification to, for example, a block of samples, and then apply a filter associated with the classification to the block of samples. In some examples, the filter coefficients of the filter may be determined by the encoder and signaled to the decoder.
In some examples (e.g., JVET-T0057), an additional filter, called a dense residual convolutional neural network based in-loop filter (DRNLF), may be inserted between the deblocking filter (901) and the SAO filter (902). The DRNLF can further improve picture quality.
Fig. 10 shows a block diagram of a loop filter unit (1000) in some examples. In an example, a loop filter unit (1000) may be used instead of the loop filter unit (556). In the example of fig. 10, the loop filter unit (1000) includes a deblocking filter (1001), an SAO filter (1002), an ALF filter (1003), and a DRNLF filter (1010) located between the deblocking filter (1001) and the SAO filter (1002).
The deblocking filter (1001) is configured similarly to the deblocking filter (901), the SAO filter (1002) is configured similarly to the SAO filter (902), and the ALF filter (1003) is configured similarly to the ALF filter (903).
The DRNLF filter (1010) receives the output of the deblocking filter (1001), shown as the deblocked picture (1011), and also receives a Quantization Parameter (QP) map of the reconstructed picture. The QP map includes the quantization parameters of the blocks in the reconstructed picture. The DRNLF filter (1010) may output a picture with improved quality, shown as the filtered picture (1019), and the filtered picture (1019) is fed to the SAO filter (1002) for further filtering processing.
According to an aspect of the present invention, a neural network for video processing may include a plurality of channels for processing color components in a color space. In an example, a YCbCr model may be used to define the color space. In the YCbCr model, Y denotes a luminance component (luma), and Cb and Cr denote chrominance components. It should be noted that in the following description, YUV is used to describe a format encoded using the YCbCr model.
According to an aspect of the disclosure, a plurality of channels in a neural network are configured to operate on color components of the same size. In some examples, a picture may be represented by color components of different sizes. For example, the human visual system is much more sensitive to changes in lightness than to changes in color, so a video system can compress the chrominance components to reduce file size and save transmission time without creating large visual differences as perceived by the human eye. In some examples, chroma subsampling techniques exploit the human visual system's lower acuity for color differences than for luminance, and encode the chroma information at a lower resolution than the luminance information.
In some examples, the sub-sampling can be expressed as a three-part ratio, e.g., 4:4:4, 4:2:2, or 4:2:0. For example, 4:4:4 indicates that the chroma components are not subsampled, 4:2:2 indicates that the chroma components are subsampled by a factor of two horizontally, and 4:2:0 indicates that the chroma components are subsampled by a factor of two both horizontally and vertically. It should be noted that the techniques disclosed in this disclosure are illustrated in the following description using YUV420 as an example of a sub-sampling format. The disclosed techniques may be used for other sub-sampling formats. For ease of description, a format whose color components have the same sampling rate without subsampling (e.g., YUV444) is referred to as a non-subsampled format; and a format having at least one color component that is sub-sampled (e.g., YUV420, YUV422, YUV411, etc.) is referred to as a sub-sampling format.
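As a concrete illustration (not part of the original disclosure), the following Python sketch shows how the subsampling format determines the sizes of the Y, U, and V planes; the function name and the format table are illustrative assumptions.

```python
# Illustrative sketch: plane sizes implied by a few chroma formats, using
# the (H, W) size convention from the description above.
def plane_shapes(height, width, fmt="YUV420"):
    # (vertical, horizontal) chroma subsampling factors; illustrative table.
    factors = {"YUV444": (1, 1), "YUV422": (1, 2), "YUV420": (2, 2), "YUV411": (1, 4)}
    fv, fh = factors[fmt]
    return (height, width), (height // fv, width // fh), (height // fv, width // fh)

# YUV420: chroma planes are half-size in both dimensions.
print(plane_shapes(1080, 1920, "YUV420"))  # ((1080, 1920), (540, 960), (540, 960))
```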
In general, a neural network may operate on pictures in a non-subsampled format (e.g., YUV 444). Thus, for a picture in a sub-sampled format, the picture is converted to a non-sub-sampled format before being provided as input to the neural network.
Fig. 11 illustrates a block diagram of a DRNLF filter (1100) in some examples. In an example, a DRNLF filter (1100) can be used in place of the DRNLF filter (1010). The DRNLF filter (1100) includes a QP map quantizer (1110), a pre-processing module (1120), a main processing module (1130), and a post-processing module (1140) coupled together as shown in fig. 11. The main processing module (1130) includes a patch grabber (1131), a patch-based DRNLF kernel processing module (1132), and a patch reassembler (1133) coupled together as shown in fig. 11.
In some examples, the QP map includes a mapping of the QP values applied to reconstruct the various blocks in the currently reconstructed picture. The QP map quantizer (1110) may quantize these values to a set of predetermined values. In an example (e.g., JVET-T0057), a QP value may be quantized to one of 22, 27, 32, and 37 by the QP map quantizer (1110).
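A minimal sketch of this quantization step is shown below; snapping each value to the nearest entry of the predetermined set is an assumption, as the text only names the target values 22, 27, 32, and 37.

```python
import numpy as np

QP_LEVELS = np.array([22, 27, 32, 37])  # predetermined values from the example

def quantize_qp_map(qp_map):
    qp_map = np.asarray(qp_map)
    # Snap every QP entry to the nearest predetermined level (assumption).
    idx = np.abs(qp_map[..., None] - QP_LEVELS).argmin(axis=-1)
    return QP_LEVELS[idx]

print(quantize_qp_map([[18, 25], [30, 45]]))  # [[22 27] [32 37]]
```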
The pre-processing module (1120) may receive the deblocked picture in a first format and convert it to a second format used by the main processing module (1130). For example, the main processing module (1130) is configured to process pictures having YUV444 format. When the pre-processing module (1120) receives a deblocked picture in a format different from the YUV444 format, the pre-processing module (1120) may process the deblocked picture in the different format and output the deblocked picture in the YUV444 format. For example, the pre-processing module (1120) receives the deblocked picture in YUV420 format and then interpolates the U and V chroma channels horizontally and vertically by a factor of 2 to generate the deblocked picture in YUV444 format.
The main processing module (1130) may receive as input the deblocked picture in YUV444 format and the quantized QP map. A patch grabber (1131) splits the input into patches. The patch-based DRNLF kernel processing module (1132) may process each patch separately based on the DRNLF kernel. The patch reassembler (1133) may assemble the patches processed by the DRNLF kernel processing module (1132) into a filtered picture in YUV444 format.
The post-processing module (1140) converts the filtered picture in the second format back to the first format. For example, the post-processing module (1140) receives the filtered picture in YUV444 format (output from the main processing module (1130)), and outputs the filtered picture in YUV420 format.
Fig. 12 illustrates a block diagram of the pre-processing module (1220) in some examples. In an example, a pre-processing module (1220) is used in place of the pre-processing module (1120).
The pre-processing module (1220) may receive the deblocked picture in YUV420 format, convert the deblocked picture to YUV444 format, and output the deblocked picture in YUV444 format. In particular, the pre-processing module (1220) receives the deblocked picture through three input channels, including a luma input channel for the Y component and two chroma input channels for the U (Cb) component and the V (Cr) component, respectively. The pre-processing module (1220) outputs the deblocked picture through three output channels, including a luma output channel for the Y component and two chroma output channels for the U (Cb) component and the V (Cr) component, respectively.
In an example, when the deblocked picture has YUV420 format, the Y component has a size (H, W), the U component has a size (H/2, W/2) and the V component has a size (H/2, W/2), where H represents the height of the deblocked picture (e.g., in samples) and W represents the width of the deblocked picture (e.g., in samples).
In the example of fig. 12, the pre-processing module (1220) does not resize the Y component. The pre-processing module (1220) receives a Y component of size (H, W) from the luminance input channel and outputs the Y component of size (H, W) to the luminance output channel.
The pre-processing module (1220) adjusts the size of the U and V components, respectively. The pre-processing module (1220) includes a first resizing unit (1221) and a second resizing unit (1222) that process the U-component and the V-component, respectively. For example, the first resizing unit (1221) receives the U component of size (H/2, W/2), resizes the U component to size (H, W), and outputs the U component of size (H, W) to the chroma output channel for the U component. A second resizing unit (1222) receives the V component of size (H/2, W/2), resizes the V component to size (H, W), and outputs the V component of size (H, W) to a chroma output channel for the V component. In some examples, a first resizing unit (1221) resizes the U component based on interpolation, e.g., using a Lanczos interpolation filter. Similarly, in some examples, a second resizing unit (1222) resizes the V component based on the interpolation, e.g., using a Lanczos interpolation filter.
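The following sketch illustrates the 2x chroma upsampling performed by the resizing units. Nearest-neighbour repetition is used only to keep the example dependency-free; it is not the Lanczos interpolation filter mentioned above, which a real implementation would substitute here.

```python
import numpy as np

# Stand-in for the resizing units (1221)/(1222): upsample a chroma plane
# from (H/2, W/2) to (H, W).  Nearest-neighbour repetition is used for
# brevity; the description above uses a Lanczos interpolation filter.
def upsample_chroma_2x(plane):
    plane = np.asarray(plane)
    return plane.repeat(2, axis=0).repeat(2, axis=1)

u_420 = np.arange(6, dtype=np.float32).reshape(2, 3)  # (H/2, W/2)
print(upsample_chroma_2x(u_420).shape)                # (4, 6), i.e., (H, W)
```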
In some examples, interpolation operations, such as using Lanczos interpolation filters, may not guarantee that the output of the interpolation operation is a meaningful value, e.g., non-negative for meaningful U (Cb) and V (Cr) components. In some examples, the pre-processed YUV444 formatted deblocked pictures may be stored, and then the stored YUV444 formatted pictures may be used in a training process for a neural network. Negative values of the U (Cb) component and the V (Cr) component may adversely affect the results of the training process of the neural network.
Fig. 13 shows a block diagram of a neural network structure (1300). In some examples, the neural network structure (1300) is used for a dense residual convolutional neural network based in-loop filter (DRNLF) and may be used in place of the patch-based DRNLF kernel processing module (1132). The neural network structure (1300) includes a series of Dense Residual Units (DRUs), e.g., DRU (1301) to DRU (1304), and the number of DRUs is denoted by N. In fig. 13, the number of convolution kernels is denoted by M, and M is also the number of output channels of each convolution. For example, "CONV 3 × 3 × M" indicates a standard convolution with M convolution kernels having a kernel size of 3 × 3, and "DSC 3 × 3 × M" indicates a depthwise separable convolution with M convolution kernels having a kernel size of 3 × 3. N and M may be set for a trade-off between computational efficiency and performance. In an example (e.g., JVET-T0057), N is set to 4 and M is set to 32.
During operation, the neural network structure (1300) processes the deblocked picture patch by patch. For each patch of the deblocked picture in YUV444 format, the patch is normalized (e.g., divided by 1023 in the example of fig. 13), and the mean of the deblocked picture is removed from the normalized patch to obtain a first portion (1311) of the internal input (1313). The second portion of the internal input (1313) comes from the QP map. For example, the block of the QP map corresponding to the patch forming the first portion (1311) (referred to as a QP map block) is obtained from the QP map. The QP map block is normalized (e.g., divided by 51 in fig. 13). The normalized QP map block is the second portion (1312) of the internal input (1313). The second portion (1312) is concatenated with the first portion (1311) to obtain the internal input (1313). The internal input (1313) is provided to a first regular convolution block (1351) (denoted by CONV 3 × 3 × M). The output of the first regular convolution block (1351) is then processed by the N DRUs.
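A hedged sketch of forming the internal input (1313) for one patch is given below; the array layout, the per-picture mean handling, and the helper name are assumptions, while the constants 1023 and 51 follow fig. 13.

```python
import numpy as np

def build_internal_input(yuv444_patch, qp_map_block, picture_mean):
    # yuv444_patch: (3, h, w) 10-bit patch of the deblocked picture
    # qp_map_block: (1, h, w) co-located block of the quantized QP map
    first_portion = yuv444_patch / 1023.0 - picture_mean    # normalized patch (1311)
    second_portion = qp_map_block / 51.0                     # normalized QP map block (1312)
    return np.concatenate([first_portion, second_portion], axis=0)  # internal input (1313)

patch = np.random.randint(0, 1024, size=(3, 64, 64)).astype(np.float32)
qp_block = np.full((1, 64, 64), 32.0, dtype=np.float32)
print(build_internal_input(patch, qp_block, picture_mean=0.5).shape)  # (4, 64, 64)
```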
For each DRU, intermediate inputs are received and processed. The output of the DRU is concatenated with the intermediate input to form an intermediate input for the following DRU. Using the DRU (1302) as an example, the DRU (1302) receives the intermediate input (1321), processes the intermediate input (1321), and generates an output (1322). The output (1322) is concatenated with the intermediate input (1321) to form an intermediate input (1323) for the DRU (1303).
Note that because the intermediate input (1321) has more than M channels, a convolution operation of "CONV1 × 1 × M" may be applied to the intermediate input (1321) to generate M channels for further processing by the DRU (1302). Also note that the output of the first regular convolution block (1351) includes M channels, so the output can be processed by the DRU (1301) without the need for a convolution operation using "CONV1 × 1 × M".
The output of the last DRU is provided to a last regular convolution block (1359). The output of the last regular convolution block (1359) is converted back to regular picture patch values, for example by adding the mean of the deblocked picture and multiplying by 1023, as shown in fig. 13.
Fig. 14 shows a block diagram of a Dense Residual Unit (DRU) (1400). In some examples, DRUs (1400) may be used in place of each DRU in fig. 13, e.g., DRU (1301), DRU (1302), DRU (1303), and DRU (1304).
In the fig. 14 example, the DRU (1400) receives the intermediate input x and propagates the intermediate input x directly to the subsequent DRU through a shortcut (1401). The DRU (1400) also includes a regular processing path (1402). In some examples, the regular processing path (1402) includes a regular convolution layer (1411), depthwise separable convolution (DSC) layers (1412) and (1414), and a rectified linear unit (ReLU) layer (1413). For example, the intermediate input x is concatenated with the output of the regular processing path (1402) to form an intermediate input for the subsequent DRU.
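A PyTorch sketch of a Dense Residual Unit in the spirit of fig. 14 is shown below. It is not the patent's implementation: the class names are invented, the 1x1 convolution is applied unconditionally here (the text applies it only when the input has more than M channels), and hyperparameters beyond those stated are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DSC 3x3xM: depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class DenseResidualUnit(nn.Module):
    """Sketch of a DRU: 1x1 channel reduction, DSC, ReLU, DSC, then concatenation."""
    def __init__(self, in_channels, m=32):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, m, kernel_size=1)   # CONV 1x1xM
        self.path = nn.Sequential(DepthwiseSeparableConv(m),     # DSC 3x3xM
                                  nn.ReLU(inplace=True),
                                  DepthwiseSeparableConv(m))     # DSC 3x3xM

    def forward(self, x):
        y = self.path(self.reduce(x))
        return torch.cat([x, y], dim=1)   # shortcut concatenated with the processed path

dru = DenseResidualUnit(in_channels=36, m=32)
print(dru(torch.randn(1, 36, 64, 64)).shape)  # torch.Size([1, 68, 64, 64])
```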
In some examples, DSC layers (1412) and (1414) are used to reduce computational cost.
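A quick back-of-the-envelope comparison (not from the patent) shows why the depthwise-separable layers are cheaper than standard 3x3 convolutions at M = 32; bias terms are ignored.

```python
M, K = 32, 3
standard = K * K * M * M          # standard 3x3 conv, M in -> M out: 9216 weights
separable = K * K * M + M * M     # depthwise 3x3 + pointwise 1x1: 1312 weights
print(standard, separable, round(standard / separable, 1))  # 9216 1312 7.0
```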
According to an aspect of the disclosure, the neural network structure (1300) includes three channels corresponding to Y, U (Cb), V (Cr) components, respectively. In some examples, these three channels may be referred to as Y channels, U channels, and V channels. The DRNLF filter (1100) may be applied to intra pictures and inter pictures. In some examples, an additional flag is signaled to indicate on/off of the DRNLF filter (1100) at the picture level and CTU level.
Fig. 15 illustrates a block diagram of a post-processing module (1540) in some examples. In an example, the post-processing module (1540) may be used instead of the post-processing module (1140). The post-processing module (1540) includes clipping units (1541) to (1543) that clip the values of the Y component, U component, and V component into predetermined non-negative ranges [ a, b ], respectively. In an example, the lower limit a and the upper limit b of the non-negative range may be set to a =16 × 4 and b =234 × 4. Further, the post-processing module (1540) includes a resizing unit (1545) and a resizing unit (1546) that resize the cropped U component and V component, respectively, from size (H, W) to size (H/2, W/2), where H is the height of the original picture (e.g., the deblocked picture) and W is the width of the original picture.
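A minimal sketch of this post-processing follows; the clipping bounds reuse the example values a = 16 × 4 and b = 234 × 4 from the text, while the 2×2 averaging used for chroma downsampling is an assumption, since the text does not specify the downsampling filter.

```python
import numpy as np

A, B = 16 * 4, 234 * 4   # example clipping range from the description (10-bit values)

def postprocess_yuv444(y, u, v):
    y, u, v = (np.clip(p, A, B) for p in (y, u, v))      # clipping units (1541)-(1543)
    def down2(p):                                         # (H, W) -> (H/2, W/2)
        h, w = p.shape
        return p.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, down2(u), down2(v)                          # resizing units (1545)/(1546)

y = np.full((4, 4), 512.0); u = np.full((4, 4), -8.0); v = np.full((4, 4), 1020.0)
print(postprocess_yuv444(y, u, v)[1])  # negative chroma values clipped up to A = 64
```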
Aspects of the present disclosure provide techniques for pre-processing. The pre-processed data can be stored and used for training of the neural network, leading to better training and inference results.
Fig. 16 illustrates a block diagram of the pre-processing module (1620) in some examples. In an example, a pre-processing module (1620) is used in place of the pre-processing module (1120).
The pre-processing module (1620) may receive the deblocked pictures in YUV420 format, convert the deblocked pictures to YUV444 format, and output the deblocked pictures in YUV444 format. In particular, the pre-processing module (1620) receives the deblocked picture in three input channels, including a luma input channel for the Y component and two chroma input channels for the U (Cb) component and the V (Cr) component, respectively. The pre-processing module (1620) outputs the deblocked picture through three output channels, including a luma output channel for the Y component and two chroma output channels for the U and V components, respectively.
In an example, when the deblocked picture has YUV420 format, the Y component has a size (H, W), the U component has a size (H/2, W/2) and the V component has a size (H/2, W/2), where H represents the height of the deblocked picture (e.g., in samples) and W represents the width of the deblocked picture (e.g., in samples).
In the example of fig. 16, the pre-processing module (1620) does not resize the Y component. The pre-processing module (1620) receives the Y component of size (H, W) from the luminance input channel and outputs the Y component of size (H, W) to the luminance output channel.
The pre-processing module (1620) adjusts the size of the U and V components, respectively. The pre-processing module (1620) includes a first resizing unit (1621) and a second resizing unit (1622) that process the U component and the V component, respectively. For example, the first resizing unit (1621) receives the U component of size (H/2, W/2), resizes the U component to size (H, W), and outputs the U component of size (H, W) to the chroma output channel for the U component. A second resizing unit (1622) receives the V component of size (H/2, W/2), resizes the V component to size (H, W), and outputs the V component of size (H, W) to the chroma output channel for the V component. In some examples, the first resizing unit (1621) resizes the U component based on interpolation, e.g., using a Lanczos interpolation filter. Similarly, in some examples, the second resizing unit (1622) resizes the V component based on interpolation, e.g., using a Lanczos interpolation filter.
In some examples, interpolation operations, such as using Lanczos interpolation filters, may not guarantee that the output of the interpolation operation is a meaningful value, such as non-negative for meaningful U (Cb) and V (Cr) components.
In the example of fig. 16, the pre-processing module (1620) includes a clipping unit (1625) and a clipping unit (1626) to clip the values of the interpolated U and V components, respectively, into the range of [c, d]. In some examples, the values of the Y, U, and V components used for pre-processing have a bit depth (bitdepth) of 10; then c and d may be set to c = 0 and d = 2^bitdepth − 1 = 1023.
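The clipping added by units (1625)/(1626) can be sketched as follows; c = 0 and d = 2^10 − 1 = 1023 follow the 10-bit example above, and the helper name is illustrative.

```python
import numpy as np

BITDEPTH = 10
C, D = 0, 2 ** BITDEPTH - 1   # [c, d] = [0, 1023] for 10-bit data

def clip_interpolated_chroma(u_interp, v_interp):
    # Clip interpolated chroma planes into the valid range before further use.
    return np.clip(u_interp, C, D), np.clip(v_interp, C, D)

u = np.array([[-3.2, 512.0], [1030.7, 700.0]])
v = np.array([[5.0, -0.5], [1024.0, 255.0]])
print(clip_interpolated_chroma(u, v))  # negatives -> 0, overshoots -> 1023
```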
In an example, the c and d values are predefined and used. In another example, pairs of c-and d-values are predefined and the index of the pairs of c-and d-values for clipping may be signaled in the bitstream (e.g., in a Sequence Parameter Set (SPS), picture Parameter Set (PPS), slice, or tile header).
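When pairs of c and d values are predefined and an index is signaled, the decoder-side selection can be as simple as the following sketch; the pair values and the idea of receiving the index from a parsed parameter set are illustrative assumptions.

```python
# Hypothetical predefined (c, d) pairs; only the index is carried in the
# bitstream (e.g., in the SPS, PPS, slice header, or tile header).
CLIP_RANGE_PAIRS = [(0, 1023), (64, 940), (64, 960)]

def clipping_range_from_index(signaled_index):
    c, d = CLIP_RANGE_PAIRS[signaled_index]
    return c, d

print(clipping_range_from_index(0))  # (0, 1023)
```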
In some examples, the clipped values of the U and V components and the value of the Y component may be stored as a deblocked picture in YUV444 format. In some implementations, the stored pictures in YUV444 format can be used as input in a training process for a neural network, such as a neural network in the main processing module (1130). In some examples, the values of the U and V components are clipped to a range that will not adversely affect the training process of the neural network. In an example, the values of the U and V components are clipped to be non-negative.
In some examples, the stored pictures in YUV444 format with clipped values (e.g., clipped U (Cb) and V (Cr) components) provide meaningful inputs for the training process of the neural network. In addition, the neural network may be trained with better model parameters that can improve compression efficiency and/or picture quality.
In some examples, adding a cropping unit (1625) and a cropping unit (1626) in the pre-processing module (1620) may improve compression efficiency and/or quality, for example, at a lower Bjontegaard delta rate (BD-rate).
Fig. 17 shows a flowchart outlining a process (1700) according to an embodiment of the present disclosure. The process (1700) may be used for video processing. In various embodiments, the process (1700) is performed by processing circuitry, such as processing circuitry in the terminal devices (310), (320), (330), and (340), processing circuitry that performs the functions of the video encoder (403), processing circuitry that performs the functions of the video decoder (410), processing circuitry that performs the functions of the video decoder (510), processing circuitry that performs the functions of the video encoder (603), and so forth. In some implementations, the process (1700) is implemented in software instructions such that when the software instructions are executed by the processing circuitry, the processing circuitry performs the process (1700). The process starts at (S1701) and proceeds to (S1710).
At (S1710), the picture in the sub-sampled format in the color space is converted to a non-sub-sampled format in the color space. In some examples, the conversion is performed based on interpolation, and may result in invalid values. In an example, the transition may produce negative values that are not valid for the YCbCr model.
At (S1720), values of one or more color components of the picture in the non-subsampled format are clipped before the picture in the non-subsampled format is provided as input to the neural network-based filter. In some examples, the one or more color components may be chroma components. Then, the process proceeds to (S1799).
In an example, values of color components of a picture in a non-subsampled format are clipped into a valid range of the color components. In an example, values of color components of a picture in a non-subsampled format are clipped to be non-negative. In another example, the range is determined based on a bit depth. For example, the lower limit of the range is 0, and the upper limit of the range is set to 2^bitdepth − 1.
In some examples, the range is predetermined. In some examples, the range is determined based on decoded information from a bitstream carrying the picture. In some examples, the signal indicating the range is decoded from at least one of a sequence parameter set, a picture parameter set, a slice header, and a picture header in the bitstream.
In an example, multiple ranges may be predetermined. An index indicating one of a plurality of ranges may then be carried in one of a sequence parameter set, a picture parameter set, a slice header, and a tile header in the bitstream.
In some examples, the process (1700) is used in a decoder. For example, a picture in a sub-sampling format is reconstructed based on decoded information from a bitstream, and a deblocking filter is applied to the picture in the sub-sampling format prior to converting the picture in the sub-sampling format from the sub-sampling format to a non-sub-sampling format. In another example, a neural network-based filter is applied to a picture in a non-subsampled format having a cropped value to generate a filtered picture in the non-subsampled format, and then the filtered picture in the non-subsampled format is converted to a filtered picture in a subsampled format.
In some examples, a picture in a non-subsampled format having a cropped value is stored in a storage device. The picture in non-subsampled format with the cropped values and other pictures may then be provided as training inputs to train the neural network in the neural network based filter.
It should be noted that the various units, blocks, and modules in the above description may be implemented by various techniques, such as processing circuitry, a processor executing software instructions, a combination of hardware and software, etc.
The techniques described above may be implemented as computer software using computer readable instructions and physically stored in one or more computer readable media. For example, fig. 18 illustrates a computer system (1800) suitable for implementing certain embodiments of the disclosed subject matter.
Computer software may be encoded using any suitable machine code or computer language, which may be subject to assembly, compilation, linking, etc. mechanisms to create code that includes instructions that may be executed directly by one or more computer Central Processing Units (CPUs), graphics Processing Units (GPUs), etc., or by interpretation, microcode execution, etc.
The instructions may be executed on various types of computers or components thereof, including, for example, personal computers, tablet computers, servers, smart phones, gaming devices, internet of things devices, and so forth.
The components of the computer system (1800) shown in fig. 18 are exemplary in nature and are not intended to suggest any limitation as to the scope of use or functionality of the computer software for implementing embodiments of the present disclosure. Neither should the configuration of components be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary embodiments of the computer system (1800).
The computer system (1800) may include some human interface input devices. Such human interface input devices may be responsive to input by one or more human users through, for example, tactile input (such as keystrokes, slides, data glove movements), audio input (such as speech, clapping hands), visual input (such as gestures), olfactory input (not depicted). The human interface device may also be used to capture certain media that are not necessarily directly related to human conscious input, such as audio (such as speech, music, ambient sounds), images (such as scanned images, photographic images obtained from still-image cameras), video (such as two-dimensional video, three-dimensional video including stereoscopic video).
The input human interface device may include one or more of the following (only one depicted each): keyboard (1801), mouse (1802), track pad (1803), touch screen (1810), data gloves (not shown), joystick (1805), microphone (1806), scanner (1807), video camera (1808).
The computer system (1800) may also include certain human interface output devices. Such human interface output devices may stimulate the senses of one or more human users through, for example, tactile output, sound, light, and smell/taste. Such human interface output devices may include tactile output devices (e.g., tactile feedback through a touch screen (1810), data glove (not shown), or joystick (1805), but there may also be tactile feedback devices that do not act as input devices), audio output devices (such as speakers (1809), headphones (not depicted)), visual output devices (such as a screen (1810), including CRT screens, LCD screens, plasma screens, and OLED screens, each with or without touch screen input capability, each with or without tactile feedback capability, some of which may output two-dimensional visual output or more than three-dimensional output through means such as stereoscopic graphical output; virtual reality glasses (not depicted), holographic displays, and smoke boxes (not depicted)), and printers (not depicted).
The computer system (1800) may also include human-accessible storage devices and their associated media, such as optical media including CD/DVD ROM/RW (1820) with media (1821) such as CD/DVD, thumb drive (1822), removable hard or solid state drive (1823), conventional magnetic media (not depicted) such as tape and floppy disk, dedicated ROM/ASIC/PLD based devices (not depicted) such as a secure dongle, and so forth.
Those skilled in the art will also appreciate that the term "computer-readable medium" used in connection with the presently disclosed subject matter does not include a transmission medium, carrier wave, or other transitory signal.
The computer system (1800) may also include an interface (1854) to one or more communication networks (1855). The network may be, for example, wireless, wired, optical. The network may also be local, wide area, urban, vehicular and industrial, real-time, delay tolerant, etc. Examples of networks include: a local area network such as ethernet; a wireless LAN; cellular networks including GSM (Global System for Mobile Communication), 3G (Third Generation), 4G (Fourth Generation), 5G (Fifth Generation), LTE (Long Term Evolution), and the like; a cable connection or wireless wide area digital network of televisions including cable, satellite, and terrestrial broadcast; vehicle and industrial networks including CAN buses, etc. Some networks typically require external network interface adapters attached to some general purpose data port or peripheral bus (1849) (e.g., a USB port of computer system (1800)); other networks are typically integrated into the core of the computer system (1800) by attaching to a system bus as described below (e.g., an ethernet interface integrated into a PC computer system or a cellular network interface integrated into a smart phone computer system). Using any of these networks, computer system (1800) may communicate with other entities. Such communications may be unidirectional receive-only (e.g., broadcast television), unidirectional transmit-only (e.g., CAN bus to certain CAN bus devices), or bidirectional, e.g., to other computer systems using local or wide area digital networks. Certain protocols and protocol stacks may be used on each of these networks and network interfaces as described above.
The human interface device, human accessible storage device, and network interface described above may be attached to the core (1840) of the computer system (1800).
The core (1840) may include one or more Central Processing Units (CPUs) (1841), graphics Processing Units (GPUs) (1842), special purpose Programmable Processing units (1843) in the form of Field Programmable Gate Arrays (FPGAs), hardware accelerators (1844) for certain tasks, graphics adapters (1850), and so forth. These devices, along with Read-only memory (ROM) (1845), random access memory (1846), internal mass storage such as internal non-user accessible hard drives, SSDs (1847), and the like, may be connected by a system bus (1848). In some computer systems, the system bus (1848) may be accessed in the form of one or more physical plugs to enable expansion by additional CPUs, GPUs, and the like. The peripheral devices may be attached to the system bus (1848) of the core either directly or through a peripheral bus (1849). In an example, screen (1810) may be connected to graphics adapter (1850). The architecture of the peripheral bus includes PCI, USB, etc.
The CPU (1841), GPU (1842), FPGA (1843), and accelerator (1844) may execute certain instructions, which in combination may constitute the computer code described above. The computer code may be stored in ROM (1845) or RAM (1846). Transitional data may also be stored in RAM (1846), while persistent data may be stored in, for example, an internal mass storage device (1847). Fast storage and retrieval of any memory device may be achieved through the use of cache memory, which may be closely associated with one or more CPUs (1841), GPUs (1842), mass storage (1847), ROM (1845), RAM (1846), and the like.
Computer code may be present on the computer readable medium for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind well known and available to those having skill in the computer software arts.
By way of example, and not limitation, a computer system (1800) having an architecture, and in particular a core (1840), may provide functionality as a result of a processor (including a CPU, GPU, FPGA, accelerator, etc.) executing software embodied in one or more tangible computer-readable media. Such computer-readable media may be media associated with user-accessible mass storage as described above, as well as particular storage of cores (1840) with non-transitory nature, such as core internal mass storage (1847) or ROM (1845). Software implementing various embodiments of the present disclosure may be stored in such devices and executed by the core (1840). The computer readable medium may include one or more memory devices or chips, according to particular needs. The software may cause the cores (1840), and in particular the processors therein (including CPUs, GPUs, FPGAs, etc.), to perform certain processes or certain portions of certain processes described herein, including defining data structures stored in RAM (1846) and modifying such data structures according to processes defined by the software. Additionally or alternatively, the computer system may provide functionality as a result of logic, either hardwired or otherwise embodied in circuitry (e.g., accelerator (1844)), which may operate in place of or in conjunction with software to perform certain processes or certain portions of certain processes described herein. Where appropriate, reference to software may encompass logic and vice versa. Reference to a computer-readable medium may include, where appropriate, circuitry (e.g., an integrated circuit) that stores software for execution, circuitry that contains logic for execution, or both circuitry that stores software for execution and circuitry that contains logic for execution. The present disclosure encompasses any suitable combination of hardware and software.
Appendix A: acronym
JEM: joint exploration model
VVC: universal video coding
BMS: reference set
MV: motion vector
HEVC: efficient video coding
SEI: supplemental enhancement information
VUI: video usability information
GOP: picture group
TU: conversion unit
PU (polyurethane): prediction unit
And (3) CTU: coding tree unit
CTB: coding tree block
PB: prediction block
HRD: hypothetical reference decoder
SNR: signal to noise ratio
CPU: Central processing unit
GPU: graphics processing unit
CRT: cathode ray tube having a shadow mask with a plurality of apertures
LCD: liquid crystal display device with a light guide plate
An OLED: organic light emitting diode
CD: compact disc
DVD: digital video CD
ROM: read-only memory
RAM: random access memory
ASIC: application specific integrated circuit
PLD: programmable logic device
LAN: local area network
GSM: global mobile communication system
LTE: long term evolution
CANBus: controller area network bus
USB: universal serial bus
PCI: peripheral component interconnect
FPGA: field programmable gate area
SSD: <xnotran> </xnotran>
IC: integrated circuit with a plurality of transistors
CU: coding unit
While this disclosure has described several exemplary embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of this disclosure. It will thus be appreciated that those skilled in the art will be able to devise numerous systems and methods which, although not explicitly shown or described herein, embody the principles of the disclosure and are thus within the spirit and scope of the disclosure.

Claims (20)

1. A method of video processing, comprising:
converting, by a processing circuit, a picture in a sub-sampled format in a color space to a non-sub-sampled format in the color space; and
clipping, by the processing circuit, values of color components of the non-subsampled format picture before providing the non-subsampled format picture as input to a neural network-based filter.
2. The method of claim 1, further comprising:
clipping values of color components of the picture in the non-subsampled format to be within a valid range of the color components.
3. The method of claim 1, further comprising:
clipping values of color components of the non-subsampled format picture to within a range determined based on a bit depth.
4. The method of claim 1, further comprising:
clipping values of color components of the non-subsampled format picture to within a predetermined range.
5. The method of claim 1, further comprising:
determining a range for clipping the value based on decoded information from a bitstream carrying the picture; and
clipping values of color components of the picture in the non-subsampled format to be within a determined range.
6. The method of claim 5, further comprising:
decoding a signal indicating the range from at least one of a sequence parameter set, a picture parameter set, a slice header, and a block header in the bitstream.
7. The method of claim 1, further comprising:
reconstructing the sub-sampled format picture based on decoded information from a bitstream.
8. The method of claim 1, further comprising:
applying a neural network-based filter to the non-subsampled format picture with the cropped value to generate a filtered picture of the non-subsampled format; and
converting the non-subsampled format filtered picture to the subsampled format filtered picture.
9. The method of claim 1, further comprising:
storing a picture of the non-subsampled format having a cropped value.
10. The method of claim 9, further comprising:
providing the stored picture in the non-subsampled format with the cropped values as a training input to train a neural network in the neural network-based filter.
11. An apparatus for video processing, comprising processing circuitry configured to:
converting a picture in a sub-sampled format in a color space to a non-sub-sampled format in the color space; and
clipping values of color components of the non-subsampled format picture before providing the non-subsampled format picture as input to a neural network-based filter.
12. The device of claim 11, wherein the processing circuit is configured to:
clipping values of color components of the non-subsampled format picture to be within a valid range of the color components.
13. The device of claim 11, wherein the processing circuit is configured to:
clipping values of color components of the non-subsampled format picture to within a range determined based on a bit depth.
14. The device of claim 11, wherein the processing circuit is configured to:
clipping values of color components of the non-subsampled format picture to within a predetermined range.
15. The device of claim 11, wherein the processing circuit is configured to:
determining a range for clipping the value based on decoded information from a bitstream carrying the picture; and
clipping values of color components of the non-subsampled format picture to within a determined range.
16. The device of claim 15, wherein the processing circuit is configured to:
decoding a signal indicating the range from at least one of a sequence parameter set, a picture parameter set, a slice header, and a block header in the bitstream.
17. The device of claim 11, wherein the processing circuit is configured to:
reconstructing the sub-sampled format picture based on decoded information from a bitstream.
18. The device of claim 11, wherein the processing circuit is configured to:
applying a neural network-based filter to the non-subsampled format picture with the cropped value to generate a filtered picture of the non-subsampled format; and
converting the non-subsampled format filtered picture to the subsampled format filtered picture.
19. The apparatus of claim 11, further comprising:
a storage configured to store the non-subsampled format of pictures with cropped values.
20. The device of claim 19, wherein the processing circuit is configured to:
providing the stored picture in the non-subsampled format with the cropped values as a training input to train a neural network in the neural network-based filter.
HK62022064558.9A 2020-12-29 2021-09-08 Method, computer apparatus, device and storage medium for video processing HK40075484B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US63/131,656 2020-12-29
US17/463,352 2021-08-31

Publications (2)

Publication Number Publication Date
HK40075484A true HK40075484A (en) 2023-01-20
HK40075484B HK40075484B (en) 2025-09-26

Family

ID=

Similar Documents

Publication Publication Date Title
CN112470478B (en) Video decoding method and device and computer equipment
CN116389735B (en) A method and related device for video encoding
CN110708555B (en) Video decoding method, video decoding device, computer equipment and storage medium
CN111050178B (en) Video decoding method and device, electronic equipment and storage medium
US12206855B2 (en) Superresolution-based coding
CN112789850A (en) Method and apparatus for intra-frame copy in intra-inter hybrid mode and trigonometric prediction unit mode
CN119277058A (en) Video encoding and decoding method, device and storage medium
CN115151941B (en) Video processing method, computer device, equipment and storage medium
CN112753223B (en) Method, apparatus and computer readable medium for video decoding
CN110798686B (en) Video decoding method and device, computer equipment and computer readable storage medium
CN115428461A (en) Method and apparatus for video encoding and decoding
CN113597763A (en) Video coding and decoding method and device
CN113557527A (en) Method and device for color transformation in general video codec
CN113574894A (en) Method and apparatus for interaction between intra prediction mode and block differential pulse code modulation mode
CN116897533B (en) Method and related device for video decoding
CN112235573B (en) Video coding and decoding method and device, electronic equipment and storage medium
CN115552898B (en) Video processing method and video processing device
CN115997383B (en) Video processing method, computer device, equipment and storage medium
HK40075484A (en) Method, computer apparatus, device and storage medium for video processing
HK40080398A (en) Video encoding and decoding method and apparatus
HK40079074A (en) Method and related device for video encoding
HK40080397A (en) Video encoding method and apparatus
HK40070669A (en) Method and apparatus for video decoding, and computer-readable storage medium
HK40084469A (en) Video processing method, computer apparatus, device and storage medium
HK40083148A (en) Video encoding and decoding method, device, computer readable medium and equipment