ITU - Telecommunication Standardization Sector
STUDY GROUP 16 Question 6
Video Coding Experts Group (VCEG)
44th Meeting: San Jose, CA, USA, 03-10 February 2012

Document: VCEG-AR13
Filename: VCEG-AR13.doc
Question: Q.6/SG16 (VCEG)
Source: Christian Bartnik, Sebastian Bosse, Heribert Brust, Tobias Hinz, Haricharan Lakshman, Detlev Marpe, Philipp Merkle, Karsten Müller, Hunn Rhee, Heiko Schwarz, Gerhard Tech, Thomas Wiegand, Martin Winken (Fraunhofer HHI, Einsteinufer 37, 10587 Berlin, Germany)
Email: (firstName).(lastName)@hhi.fraunhofer.de
Title: HEVC Extension for Multiview Video Coding and Multiview Video plus Depth Coding
Purpose: Proposal

In this document, an HEVC extension for Multiview Video Coding and Multiview Video plus Depth Coding is proposed. Besides the known concept of disparity-compensated prediction, the proposed HEVC extension includes additional inter-view prediction techniques and depth coding tools. The extensions for multiview video coding and multiview video plus depth coding were integrated into the HM-3.0 software. Experimental results for four stereo test sequences show an average overall bit rate reduction of about 28 % relative to HEVC simulcast for multiview video coding, and of about 33 % relative to HEVC simulcast for multiview video plus depth coding.

For the standardization of Multiview Video Coding and Multiview Video plus Depth Coding, four 1080p25/30 stereo test sequences with automatically generated depth maps are proposed.

1 Introduction

This document describes a proposal for a data format suitable for delivering 3D video in future applications and a coding scheme for representing the data format. The 3D video is transmitted in the Multiview Video plus Depth (MVD) format, which contains two or more captured views as well as associated depth maps. Based on the coded videos and depth maps, additional views suitable for displaying the 3D video on autostereoscopic displays can be generated using depth-image-based rendering (DIBR) techniques. The video sequences as well as the sequences of depth maps are coded using an extension of HEVC. The proposed coding format is backward compatible with HEVC in the sense that a sub-bitstream representing a single view can be extracted from the 3D video bitstream and independently decoded with an HEVC-conforming decoder. The data format also provides view scalability. Optionally, the data format provides independent decodability of the video sequences, so that, for example, a sub-bitstream representing conventional stereo video can be extracted from a 3D bitstream and decoded. The proposed coding format can also be used for conventional multiview video coding (without the coding of depth data).

2 Data Format and System Description

In the proposed HEVC extension, 3D video is in general represented using the Multiview Video plus Depth (MVD) format, in which a small number of captured views as well as associated depth maps are coded and the resulting bitstream packets are multiplexed into a 3D video bitstream. After decoding the video and depth data, additional intermediate views suitable for displaying the 3D content on an autostereoscopic display can be synthesized using depth-image-based rendering (DIBR) techniques. For the purpose of view synthesis, camera parameters, or more accurately, parameters specifying a conversion of the depth data into disparity vectors, are additionally included in the bitstream. A minimal sketch of such a DIBR warping step is given below.
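To make the use of the decoded depth maps concrete, the following sketch illustrates the core of a DIBR warping step under simplifying assumptions: a rectified, linear camera setup, so that warping reduces to a purely horizontal sample shift, with the per-sample disparities already derived from the depth data (see Section 3 for the conversion). All function and parameter names are hypothetical; the proposal does not prescribe a particular renderer, and occlusion handling and hole filling, which any practical renderer requires, are omitted.

    #include <cstdint>
    #include <vector>

    // Hypothetical DIBR warping core for a rectified camera setup: every
    // sample of a decoded view is shifted horizontally by its disparity to
    // form an intermediate view. Occlusion handling and hole filling are
    // intentionally omitted from this sketch.
    void warpView(const std::vector<uint8_t>& src,       // decoded luma plane
                  const std::vector<int>&     disparity, // per-sample horizontal disparity
                  std::vector<uint8_t>&       dst,       // synthesized luma plane
                  int width, int height)
    {
        dst.assign(src.size(), 0);                       // unfilled samples stay 0 (holes)
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                int xDst = x + disparity[y * width + x]; // horizontal shift only:
                if (0 <= xDst && xDst < width)           // vertical disparity is 0
                    dst[y * width + xDst] = src[y * width + x];
            }
        }
    }

The sketch shows why, besides the views and depth maps, only a small set of conversion parameters must be transmitted: once the depth samples have been mapped to disparities, intermediate-view synthesis is a purely local operation.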
The bitstream packets include header information, which signals, in connection with the transmitted parameter sets, a view identifier and an indication of whether the packet contains video or depth data. Sub-bitstreams containing only some of the coded components can be easily extracted by discarding bitstream packets that contain non-required data. One of the views, which is also referred to as the base view or the independent view, is coded independently of the other views and the depth data using a conventional 2D video coder; HEVC is used as the 2D video codec. The sub-bitstream containing the independent view can be decoded by an unmodified 2D HEVC decoder and displayed on a conventional 2D display. Optionally, the encoder can be configured in a way that a sub-bitstream representing two views without depth data can be extracted and independently decoded for displaying the 3D video on a conventional stereo display. The codec can also be used for coding multiview video signals without depth data, and, when depth data are used, it can be configured in a way that the video pictures can be decoded independently of the depth data.

Figure 1: Overview of the system structure and the data format for the transmission of 3D video.

The basic concept of the proposed system and data format is illustrated in Figure 1. In general, the input signal for the encoder consists of multiple views, associated depth maps, and corresponding camera parameters. However, as described above, the codec can also be operated without depth data. The input component signals are coded using a 3D video encoder, which represents an extension of HEVC; the base view is coded using an unmodified HEVC encoder. The 3D video encoder generates a bitstream, which represents the input videos and depth data in a coded format. If the bitstream is decoded using a 3D video decoder, the input videos, the associated depth data, and the camera parameters are reconstructed at the given fidelity. For displaying the 3D video on an autostereoscopic display, additional intermediate views are generated by a DIBR algorithm using the reconstructed views and depth data. If the 3D video decoder is connected to a conventional stereo display instead of an autostereoscopic display, the view synthesizer can also generate a pair of stereo views, in case such a pair is not actually present in the bitstream; here, the rendered stereo views can be adjusted to the stereo geometry of the viewing conditions. One of the decoded views or an intermediate view at an arbitrary virtual camera position can also be used for displaying a single view on a conventional 2D display.

The 3D video bitstream is constructed in a way that the sub-bitstream representing the coded representation of the base view can be extracted by simple means. The bitstream packets representing the base view can be identified by inspecting the transmitted parameter sets and the packet headers. The sub-bitstream for the base view can be extracted by discarding all packets that contain depth data or data for the dependent views; the extracted sub-bitstream can then be directly decoded with an unmodified HEVC decoder and displayed on a conventional 2D video display.

Besides the option of rendering a stereo pair based on the output of a 3D video decoder, the encoder can also be configured in a way that a sub-bitstream containing only the two stereo views can be extracted and directly decoded using a stereo decoder. Furthermore, the encoder can be configured in a way that the views can generally be decoded independently of the depth data. This extraction process is sketched below.
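The extraction rule described above can be summarized in a short sketch. The packet abstraction (a NalUnit with viewId and depthFlag fields) is hypothetical; a real extractor would derive this information by parsing the transmitted parameter sets and NAL unit headers. The sketch also assumes that a view depends only on views with smaller viewId, which matches the coding order described in Section 3.

    #include <vector>

    // Hypothetical packet abstraction: in a real extractor, viewId and
    // depthFlag would be obtained by parsing parameter sets and the NAL
    // unit headers of the bitstream packets.
    struct NalUnit {
        int  viewId;     // camera position identifier (0 = base view)
        bool depthFlag;  // true if the packet carries depth data
        std::vector<unsigned char> payload;
    };

    // Keep only the packets needed for the requested number of views,
    // optionally discarding all depth packets. With numViews = 1 and
    // keepDepth = false, the result is the HEVC-conforming base view
    // sub-bitstream; with numViews = 2 it is a stereo sub-bitstream.
    std::vector<NalUnit> extractSubBitstream(const std::vector<NalUnit>& bitstream,
                                             int numViews, bool keepDepth)
    {
        std::vector<NalUnit> sub;
        for (const NalUnit& nal : bitstream) {
            if (nal.viewId >= numViews)      continue; // dependent view not requested
            if (nal.depthFlag && !keepDepth) continue; // depth data not requested
            sub.push_back(nal);
        }
        return sub;
    }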
3 Coding Algorithm

In the following, we describe the coding algorithm based on the MVD format, in which each video picture is associated with a depth map. As mentioned in Sec. 2, the coding algorithm can also be used for a multiview format without depth maps. The video pictures and, when present, the depth maps are coded access unit by access unit, as illustrated in Figure 2. An access unit includes all video pictures and depth maps that correspond to the same time instant. It should be noted that the coding order of access units does not need to be identical to the capture or display order. In general, the reconstructed data of already coded access units can be used for an efficient coding of the current access unit. Random access is enabled by so-called random access units or instantaneous decoding refresh (IDR) access units, in which the video pictures and depth maps are coded without referring to previously coded access units. Furthermore, an access unit does not reference any access unit that precedes the previous random access unit in coding order.

Figure 2: Access unit structure and coding order of view components.

The video pictures and depth maps corresponding to a particular camera position are indicated by a view identifier (viewId). All video pictures and depth maps that belong to the same camera position are associated with the same value of viewId. The view identifiers are used for specifying the coding order inside the access units and for detecting missing views in error-prone environments. Inside an access unit, the video picture and, when present, the associated depth map with viewId equal to 0 are coded first, followed by the video picture and depth map with viewId equal to 1, etc. A video picture and depth map with a particular value of viewId are transmitted after all video pictures and depth maps with smaller values of viewId. The video picture is always coded before the associated depth map (i.e., the depth map with the same value of viewId). It should be noted that the value of viewId does not necessarily represent the arrangement of the cameras in the camera array. For ordering the reconstructed video pictures and depth maps after decoding, each value of viewId is associated with another identifier called the view order index (VOI). The view order index is a signed integer value, which specifies the ordering of the coded views from left to right: if a view A has a smaller value of VOI than a view B, the camera for view A is located to the left of the camera for view B.

In addition, camera parameters required for converting depth values into disparity vectors are included in the bitstream. For a linear camera setup, these conversion parameters consist of a scale factor and an offset. The vertical component of a disparity vector is always equal to 0, and the horizontal component is derived according to

    dv = ( s * v + o ) >> n,

where v is the depth sample value, s is the transmitted scale factor, o is the transmitted offset, and n is a shift parameter that depends on the required accuracy of the disparity vectors. A purely illustrative sketch of how such scale and offset values can relate to physical camera parameters follows below.
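The proposal does not prescribe how the scale and offset are derived, but for the common case of depth maps that quantize inverse depth they can be computed from physical camera parameters. The following sketch is purely illustrative: the inverse-depth quantization rule, the 8-bit depth assumption, and all camera values are assumptions made here for the example, not part of the proposal.

    #include <cmath>
    #include <cstdio>

    // Illustrative derivation, assuming the common inverse-depth quantization
    // of 8-bit depth maps, v = 255 * (1/z - 1/zFar) / (1/zNear - 1/zFar), and
    // a rectified setup with focal length f and baseline b, so that the true
    // disparity is d = f * b / z. Under these assumptions the disparity is
    // linear in the depth sample v, which is what the transmitted scale s and
    // offset o express:
    //   d(v) = f*b * ( (1/zNear - 1/zFar) * v / 255 + 1/zFar )
    //        ~ ( s * v + o ) >> n   with s, o in fixed-point precision n.
    void computeScaleAndOffset(double f, double b, double zNear, double zFar,
                               int n, int& s, int& o)
    {
        double scale  = f * b * (1.0 / zNear - 1.0 / zFar) / 255.0;
        double offset = f * b / zFar;
        s = static_cast<int>(std::lround(scale  * (1 << n))); // fixed-point scale
        o = static_cast<int>(std::lround(offset * (1 << n))); // fixed-point offset
    }

    int main() {
        int s, o;
        const int n = 8; // example precision; made-up camera parameters follow
        computeScaleAndOffset(/*f*/ 1000.0, /*b*/ 0.1, /*zNear*/ 2.0, /*zFar*/ 10.0,
                              n, s, o);
        int v  = 128;               // example depth sample
        int dv = (s * v + o) >> n;  // horizontal disparity in samples
        std::printf("s=%d o=%d dv(v=128)=%d\n", s, o, dv); // prints s=40 o=2560 dv=30
        return 0;
    }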
Each video sequence and depth sequence is associated with a separate sequence parameter set and a separate picture parameter set. The picture parameter set syntax, the NAL unit header syntax, and the slice header syntax for the coded slices have not been modified for including a mechanism by which the content of a coded slice NAL unit can be associated with a component signal. Instead, the sequence parameter set syntax for all component sequences except the base view has been extended. These sequence parameter sets contain the following additional parameters (see the data structure sketch after this list):

· the view identifier (indicates the coding order of a view);
· the depth flag (indicates whether video data or depth data are present);
· the view order index (indicates the location of the view relative to other coded views);
· an indicator specifying whether camera parameters are present in the sequence parameter set or in the slice headers;
· when camera parameters are present in a sequence parameter set, for each viewId value smaller than the current view identifier, a scale and an offset specifying the conversion of a depth sample of the current view to a horizontal disparity between the current view and the view with viewId;
· when camera parameters are present in a sequence parameter set, for each viewId value smaller than the current view identifier, a scale and an offset specifying the conversion of a depth sample of the view with viewId to a horizontal disparity between the current view and the view with viewId.

The sequence parameter set for the base view does not contain the additional parameters. Here, the view identifier is inferred to be equal to 0, the depth flag is inferred to be equal to 0, and the view order index is inferred to be equal to 0.

The sequence parameter sets for dependent views include a flag, which specifies whether the camera parameters are constant for a coded video sequence or whether they can change on a picture-by-picture basis. If this flag indicates that the camera parameters are constant for a coded video sequence, the camera parameters (i.e., the scale and offset values described above) are present in the sequence parameter set. Otherwise, the camera parameters are not present in the sequence parameter set, but are instead coded in the slice headers that reference the corresponding sequence parameter set.
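For reference, the additional parameters listed above can be gathered into a simple data structure. The following sketch is a hypothetical in-memory representation; it does not reproduce the actual syntax element names, their order, or their entropy coding.

    #include <vector>

    // Hypothetical in-memory representation of the additional sequence
    // parameter set fields for non-base component sequences.
    struct CameraParams {
        int scale;   // s in dv = ( s * v + o ) >> n
        int offset;  // o in dv = ( s * v + o ) >> n
    };

    struct SpsExtension3D {
        int  viewId;                 // coding order of the view inside an access unit
        bool depthFlag;              // true for a depth sequence, false for video
        int  viewOrderIndex;         // signed left-to-right ordering of the views
        bool camParamsInSliceHeader; // if true, scale/offset are sent per slice

        // One entry per viewId smaller than the current one:
        // conversion of a depth sample of the current view to a disparity
        // between the current view and that view ...
        std::vector<CameraParams> fromCurrentView;
        // ... and conversion of a depth sample of that view to a disparity
        // between the current view and that view.
        std::vector<CameraParams> fromReferenceView;
    };
    // For the base view no extension is sent; viewId, depthFlag, and
    // viewOrderIndex are all inferred to be equal to 0.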
Figure 3: Basic codec structure with inter-component prediction (red arrows).

The basic structure of the 3D video codec is shown in the block diagram of Figure 3. In principle, each component signal is coded using an HEVC-based codec. The resulting bitstream packets, or more accurately, the resulting Network Abstraction Layer (NAL) units, are multiplexed to form the 3D video bitstream. The base or independent view is coded using an unmodified HEVC codec. Given the 3D video bitstream, the NAL units containing data for the base view can be identified by parsing the parameter sets and the NAL unit headers of coded slice NAL units (up to the picture parameter set identifier). Based on these data, the sub-bitstream for the base view can be extracted and directly decoded using a conventional HEVC decoder.

For coding the dependent views and the depth data, modified HEVC codecs are used, which are extended by including additional coding tools and inter-component prediction techniques that employ already coded data inside the same access unit, as indicated by the red arrows in Figure 3. For enabling an optional discarding of depth data from the bitstream, e.g., for supporting the decoding of a stereo video suitable for conventional stereo displays, the inter-component prediction can be configured in a way that video pictures can be decoded independently of the depth data. For improving the coding efficiency for dependent views and depth data, the following modifications have been integrated:

· disparity-compensated prediction: A technique for using already coded and reconstructed pictures (or depth maps) inside an access unit as additional reference pictures for inter prediction. The same concept is found in MVC.
· inter-view motion prediction: A technique for employing the motion parameters of already coded video pictures of other views (inside an access unit) for predicting the motion parameters of a current video picture.
· inter-view residual prediction: A technique for employing the coded residuals of already coded video pictures of other views (inside an access unit) for predicting the residuals of a current video picture.
· reduced motion vector accuracy for depth data: A technique for increasing the coding efficiency of depth data (and decreasing the decoding complexity) by reducing the motion vector accuracy.
· disabling of in-loop filters for depth data: An encoding technique for i