Depth estimation, processing and rendering for dynamic six-degrees-of-freedom (6DoF) virtual reality

Room GH


Freedom of head and body movement is one of the reasons why virtual reality headsets provide a high sense of realism. Computer-generated content, being inherently 3D, can provide this motion freedom. However, captured content is stereoscopic at best, which means that the observer perceives stereoscopic 3D but not motion parallax. Although the captured content in itself is more realistic than computer-generated content, the absence of motion freedom severely limits the sense of realism. To make video with motion freedom possible, depth must be estimated for multiple views and new views must be synthesized from them. In this talk, I will discuss how a real-time depth-based processing chain can be built using our experience in stereo-to-depth conversion for autostereoscopic displays.

Real-time depth estimation and view synthesis

Our current autostereoscopic 3D display hardware takes stereoscopic video as input from playback devices such as a Blu-ray player, and outputs a variable number of synthesized views using depth estimation and view synthesis. I will explain the workings of the 3D display and the essential algorithmic steps that we use in our current implementation to ensure temporally stable depth estimation and view synthesis.
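As a rough illustration of depth-based view synthesis, the sketch below forward-warps a reference image to a virtual viewpoint by shifting each pixel horizontally in proportion to its disparity. It is a minimal sketch, not the implementation used in the display hardware: the function name `synthesize_view`, the parameter `alpha` (fractional position of the virtual camera relative to the reference), and the sign convention (larger disparity means closer to the camera) are all assumptions made for this example. Hole filling and temporal stabilization are deliberately left out.

```python
import numpy as np

def synthesize_view(image, disparity, alpha):
    """Forward-warp `image` to a virtual view (hypothetical helper).

    image:     (H, W) or (H, W, C) array, the reference view.
    disparity: (H, W) array; assumed convention: larger value = nearer.
    alpha:     scale factor for the horizontal shift (0 = reference view).
    Returns the warped image and a boolean mask of pixels that were filled.
    """
    h, w = disparity.shape
    out = np.zeros_like(image)
    best = np.full((h, w), -np.inf)       # disparity of the pixel currently stored
    filled = np.zeros((h, w), dtype=bool)

    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            xt = int(round(x + alpha * d))  # target column in the virtual view
            # Keep the nearest surface when several pixels land on one target.
            if 0 <= xt < w and d > best[y, xt]:
                best[y, xt] = d
                out[y, xt] = image[y, x]
                filled[y, xt] = True
    return out, filled
```

Pixels where `filled` is false are disocclusions, regions that become visible only from the new viewpoint. A complete chain would have to fill them, for example from neighboring background pixels or from another reference view.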

Dynamic 6DoF scene capture configurations carrying depth throughout the processing chain

Several 360° capture rigs already use depth estimation, mainly for image stitching. Accurate depth estimation and stitching are usually performed offline. A dedicated hardware setup using FPGAs and/or graphics cards can help to achieve real-time depth processing and 6DoF streaming, which will be necessary for sports and other live events such as concerts.

6DoF playback

Once multiple views of a dynamic scene, including depth data, have been streamed to a server, they need to be prepared for 6DoF playback. I will discuss how streaming and playback can be achieved using commonly available graphics cards. Our GPU-based processing allows for real-time depth map decoding and conversion to a 3D mesh, so that standard texture mapping can be used for playback. During playback, left- and right-eye images are synthesized while switching between the multiple reference anchor views to follow the user as they move around within the captured 6DoF sweet spot.
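The conversion from a depth map to a texture-mappable mesh can be sketched as follows. This is a CPU-side NumPy illustration under stated assumptions, not the GPU implementation described above: it assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, treats the depth value as distance along the optical axis, and connects every 2x2 block of neighboring pixels with two triangles. The function name `depth_to_mesh` is likewise invented for this example.

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy):
    """Unproject a depth map into a regular-grid triangle mesh (sketch).

    depth: (H, W) array of depths along the optical axis.
    fx, fy, cx, cy: assumed pinhole intrinsics in pixel units.
    Returns:
      vertices  (H*W, 3) camera-space positions,
      uv        (H*W, 2) texture coordinates in [0, 1],
      triangles (2*(H-1)*(W-1), 3) vertex indices.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row grids

    # Pinhole unprojection: one vertex per pixel.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Texture coordinates sample the color image at pixel centers.
    uv = np.stack([(u + 0.5) / w, (v + 0.5) / h], axis=-1).reshape(-1, 2)

    # Two triangles per grid cell, indexed into the flattened vertex array.
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    triangles = np.concatenate(
        [np.stack([tl, bl, tr], axis=1), np.stack([tr, bl, br], axis=1)], axis=0
    )
    return vertices, uv, triangles
```

With vertices, texture coordinates, and triangle indices in this form, the color image of an anchor view can be applied with ordinary texture mapping. A practical renderer would typically also discard or treat specially the triangles that span large depth discontinuities, since those stretch across object boundaries.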

Conclusions and future work

The technology that will enable freedom of head and body movement in captured dynamic scenes is still in its infancy. We have shown that a real-time depth-based chain can in principle be defined for live events and that an important step, namely high-quality depth estimation, can be achieved in hardware. However, more work is needed to define capture rigs whose cameras are placed in such a way that high-quality playback is possible. At the same time, further improvements in depth-based view synthesis algorithms are important in order to meet the high image quality requirements of today’s visual experiences.

Develop Track