In this technical deep-dive, D ‘eVRydayVR’ Coetzee explores in detail the technical challenges preventing us from sharing our favourite VR experiences with others and the solutions available to overcome them.
Sharing your experiences of virtual worlds is a fundamental part of Internet culture. Through screenshots and videos on sites like Imgur and YouTube, captured with tools like Fraps, OBS, and ShadowPlay, we let others see our creations, our reactions, our strategies, and our shared moments with other players, in precisely the same manner that we saw them on our monitors during play. Videos of content can be viewed without installing anything, even on machines that can’t run the original application.
Virtual reality complicates this: if you record a video of the content displayed on your VR headset and simply play it back on someone else’s headset, a host of problems arise. First, differences between headsets (or even user settings) can result in distortion, artifacts, and problems with convergence. Second, because the original player, not the person viewing the content, is in control of head movement, the viewer feels as though their head is being forcibly turned to look around the scene, a sensation that often induces VR sickness. The problem can be circumvented by simply watching the recorded content on a monitor instead, but that’s no fun – what you really want is for others to be able to experience those virtual worlds in VR, the same way that you did during play.
However, there are a number of technical challenges to solve before capturing and sharing high-quality 360 content with other VR users is as easy as capturing and sharing video. This article describes what solutions are available today, how they work and what their limitations are, and the problems that are still under investigation.
What we can do today
If you are a VR developer creating an application in Unity, the most popular engine for VR applications, there is a free script I created called 360 Panorama Capture, available on the Unity Asset Store. By simply dropping this script onto an object in your application, you activate a hotkey that will take a 360 degree snapshot from the player’s perspective and save it to an image file on disk, like the equirectangular image shown below. This image can then be viewed using 360 degree panorama viewing tools on a variety of platforms, and can be uploaded to 360 degree panorama sharing websites like VRCHIVE – visit the image below on VRCHIVE and use your mouse or mobile device to explore it in 360 degrees.
The 360 Panorama Capture script can capture both monoscopic panoramas (where both eyes see the same image) and stereoscopic panoramas, where you get an effect like that of 3D films. (If you have a WebVR-capable browser, you can view a stereo version of the above panorama.) Certain VR applications like the social VR world VRChat have already made this feature available to all users.
It can also be used to produce 360 videos which can be uploaded to YouTube, like the one below. Most Android phones can use the YouTube for Android app to view them on Google Cardboard, and on the PC, Virtual Desktop can download and view 360 YouTube videos on the Rift.
How does it work?
The basic concept of 360 capture in Unity is similar to 360 capture in the real world: we point the (virtual) camera in many different directions to capture different parts of the scene, and then combine the results into a single 360 degree image. Monoscopic capture is particularly simple: we just point the camera in six directions – up, down, left, right, forward, and backward (cubemap format). The camera is located in exactly the same position for all six. The image below shows capturing a checkered box using a camera located inside the box.
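To make the geometry concrete, here is a minimal sketch of the six face orientations (in Python, purely illustrative – the actual script is a Unity/C# asset, and axis conventions vary between engines):

```python
# Illustrative sketch: the six cubemap views share one position and
# differ only in orientation. Each face is a (forward, up) pair of
# orthogonal unit vectors; the capture step renders one 90-degree-FOV
# view per face.
CUBE_FACES = {
    "+x": ((1, 0, 0), (0, 1, 0)),   # right
    "-x": ((-1, 0, 0), (0, 1, 0)),  # left
    "+y": ((0, 1, 0), (0, 0, -1)),  # up
    "-y": ((0, -1, 0), (0, 0, 1)),  # down
    "+z": ((0, 0, 1), (0, 1, 0)),   # forward
    "-z": ((0, 0, -1), (0, 1, 0)),  # backward
}

def is_orthonormal(forward, up):
    """Each face's forward and up vectors must be unit-length and
    perpendicular for the six views to tile the sphere cleanly."""
    dot = sum(f * u for f, u in zip(forward, up))
    sq = lambda v: sum(c * c for c in v)
    return dot == 0 and sq(forward) == 1 and sq(up) == 1
```

Together, six 90-degree views oriented this way cover the full sphere with no gaps and no overlap.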
Representing a complete 360 degree environment in a single image faces the same challenge as trying to represent the surface of the Earth on a flat map. The projection used by most tools today is the equirectangular projection. In an equirectangular projection, vertical (y) position gives latitude, while horizontal (x) position gives longitude. This projection is generally avoided by mapmakers because it distorts areas near the poles, making them appear much larger than they actually are; but it is convenient for 360 degree panoramas because it is easy to create an efficient viewer for this format by simply projecting the image onto the inside of a large sphere and placing the viewer inside the sphere.
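The pixel-to-angle mapping is simple enough to write down directly. The sketch below (illustrative Python; the convention that x = 0 corresponds to longitude -180° and y = 0 to the north pole is an assumption, as layouts vary) converts an equirectangular pixel into a unit direction vector:

```python
import math

def pixel_to_direction(x, y, width, height):
    """Map an equirectangular pixel to a unit direction vector.
    Horizontal position gives longitude (yaw); vertical position gives
    latitude (pitch). Assumed convention: x=0 is longitude -pi,
    y=0 is the north pole, +z is forward."""
    lon = (x / width) * 2.0 * math.pi - math.pi    # [-pi, pi)
    lat = math.pi / 2.0 - (y / height) * math.pi   # [+pi/2, -pi/2)
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))
```

Note how an entire row of pixels at y = 0 collapses to the single north-pole direction – this is exactly the polar distortion that mapmakers avoid.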
Converting from the captured camera views into the final equirectangular image requires a re-projection operation, which copies pixels from the source images to their correct locations in the target image. This can be done very quickly on GPU using a compute shader; once it’s done, the final image is transferred to the CPU for saving.
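The core of the re-projection is deciding, for each output direction, which cube face to sample and where. A CPU reference sketch follows (illustrative Python with a simplified face convention – real cubemap layouts flip some axes per face, and the production path runs this per-pixel in a compute shader):

```python
def sample_cubemap(direction):
    """Pick the cube face and (u, v) texel coordinates for a view
    direction. The face is the axis with the largest absolute
    component; dividing the other two components by it gives
    coordinates in [-1, 1] on that face."""
    x, y, z = direction
    ax, ay, az = abs(x), abs(y), abs(z)
    if az >= ax and az >= ay:          # forward/backward faces
        face = "+z" if z > 0 else "-z"
        u, v = x / az, y / az
    elif ax >= ay:                     # left/right faces
        face = "+x" if x > 0 else "-x"
        u, v = z / ax, y / ax
    else:                              # up/down faces
        face = "+y" if y > 0 else "-y"
        u, v = x / ay, z / ay
    # Map (u, v) from [-1, 1] to texture coordinates in [0, 1].
    return face, (u + 1) / 2, (v + 1) / 2
```

Combining this with the pixel-to-direction mapping gives the full re-projection: for each equirectangular pixel, compute its direction, then copy the corresponding cube-face texel.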
Note that in some cases, this simple approach can produce visual artifacts. For example, if a vignette filter is being applied to dim the edges of the view, the edge of each of the six views will be dimmed, making the edges of the cube visually evident, as shown below. To reproduce such an effect correctly, it would be necessary to remove it during capture and then add it back while viewing the panorama. Screen-space effects that cause issues like this include vignettes and some sky and water effects. Other screen-space effects, like bloom or antialiasing, affect the entire view uniformly, and so produce no noticeable artifacts.
Capturing stereoscopic panoramas
Monoscopic panoramas (where both eyes see the same view) are straightforward to capture, and generally reproduce the original view reliably with no noticeable artifacts. Other than not supporting any kind of parallax or positional tracking, they are correct. Stereoscopic panoramas, in which the left and right eye see different images, are a different beast: there is no known way of reproducing them that is both correct and practical.
One strategy for stereoscopic panoramas is to place a large number of cameras on the surface of a sphere – perhaps a few hundred cameras on a sphere the size of a beach ball – and store and compress all of these captured camera views. During viewing, light field techniques can be used to synthesize any camera view at any orientation and any position located inside the sphere. This allows us to generate correct stereoscopic results regardless of viewing angle and even support limited positional tracking, but at a tremendous cost: even with compression, a typical full-resolution light field image may be over 500 MB in size. And that’s just a single image; video is out of the question. More effective compression techniques exist but remain difficult to decompress in real-time.
So, rather than trying to be totally correct, we use an alternate approach based on the equirectangular projection used for monoscopic images above. Instead of having one image, we have two: one for the left eye (on top) and one for the right eye (on bottom), as shown below. The viewer remains efficient and simple to implement: both views are applied as textures to a large sphere surrounding the viewer, with the left eye seeing only the left view texture, and the right eye seeing only the right view texture (using camera layers).
The simplest possible way to generate this image would be to put down two cameras at fixed positions, separated by the average distance between a person’s eyes, and then capture a monoscopic 360 degree panorama for each one. This will result in roughly correct results for the area directly in front of the viewer, but when looking behind them, left and right will be reversed, and when looking in any other direction results will also be incorrect, as the diagram below shows.
The simplest way to fix this, and the one most commonly used, is to divide the equirectangular image into thin vertical strips, and render each of them separately. Each strip is rendered with both eyes looking directly toward that strip, greatly improving the stereoscopic effect.
Provided the strips are thin and numerous enough, the seams between them will generally not be visible. Because the strips have limited horizontal field of view, they can also be rendered efficiently.
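The per-strip eye placement can be sketched as follows (illustrative Python; the (x, z) plane convention and strip count are assumptions, not the exact implementation):

```python
import math

def strip_eye_positions(strip_index, n_strips, ipd):
    """For the vertical-strip method: each strip corresponds to one
    yaw angle, and both eyes are rotated so they look directly toward
    that strip. Returns (left_eye, right_eye) offsets as (x, z) pairs
    in the horizontal plane, with +z forward at yaw = 0."""
    yaw = (strip_index / n_strips) * 2.0 * math.pi
    # The eye baseline stays perpendicular to the strip's view direction.
    half = ipd / 2.0
    left = (-half * math.cos(yaw), half * math.sin(yaw))
    right = (half * math.cos(yaw), -half * math.sin(yaw))
    return left, right
```

For strip 0 (looking forward) the eyes sit at ±IPD/2 along x, exactly as in normal stereo rendering; for the strip directly behind, the baseline is mirrored, which is what fixes the left/right reversal of the naive two-camera approach.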
However, while this scheme works great for just turning your head left and right, it produces strange problems when looking up and down. If you look straight up, you are seeing all the strips simultaneously, coming together in a point at the poles, and only one of them is correct. Some are completely reversed. This will typically manifest as an inability to converge properly when looking toward the top and bottom poles.
Engines based on ray tracing, such as OTOY’s Octane, successfully mitigate this problem by adjusting the distance between the eyes (the interpupillary distance, or IPD) based on how far you look up or down. At the equator (no looking up or down), the IPD is at its maximum value. At the poles, the IPD is reduced to zero, eliminating any stereo effect. In between, it takes intermediate values, with a correspondingly reduced stereo effect. Because the stereo effect is minimal when looking straight up or down, the errors in the stereo, although still present, are much less visually jarring.
Can a similar technique be used in a real-time engine like Unity? The answer is yes, and it works by bringing in ideas from light-field rendering and the Google Jump camera. We create a small circle with diameter equal to the distance between the eyes, and around the perimeter place a large number of virtual cameras (at least 8, typically 100). Each virtual camera has a massive field of view: over 180 degrees both horizontally and vertically. Because Unity cannot render with such a large field of view, each virtual camera is represented by a sub-array of four cameras, each with about a 90 degree field of view. The four cameras are turned 45 degrees left, 45 degrees right, 45 degrees up, and 45 degrees down, and together cover all the field of view that we require.
Next: each pixel of the output is associated with two angles, the yaw or longitude (how far we must turn left/right to see that point), and the pitch or latitude (how far we must look up/down to see that point). The yaw is used to rotate the eyes to both look directly toward the point, just as with the vertical slices method. But now, the pitch is also used to scale the distance between the eyes. As a result, the eyes may lie on the perimeter of the circle, or may lie anywhere inside the circle.
To render a pixel, the two eyes each cast rays toward the circle. When they hit the circle, they hit it between two of the virtual cameras. We continue casting the rays out from those two virtual cameras to determine the correct color for the pixel from each of the two viewpoints. Finally, we blend the two resulting colors together based on the distance to each camera.
In the example above, we have 10 virtual cameras, each with a 216 degree field of view (each camera view is actually captured using four cameras with a 108 degree field of view). Three cases are shown:
- When pitch is zero (not looking up or down), the eyes lie on the perimeter of the circle, and their location on the circle determines which cameras to use. In the diagram, the left eye uses camera 8 and the right eye uses camera 3; at different yaw/longitude values different cameras would be used. The rays are sideways relative to the camera, which is okay because the cameras still have a wide enough angle of view to capture them.
- When pitch is intermediate, the eyes lie inside the circle. In the diagram, the pixel seen by the left eye is formed by combining a pixel from camera 9’s view and a pixel from camera 10’s view (camera 10’s pixel value has more influence because the ray from the left eye strikes the circle closer to camera 10 than to camera 9). The right eye is handled similarly.
- When looking straight up or down, both eyes are at the center of the circle. They both fire the same ray and receive the same pixel color.
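The lookup just described can be condensed into a toy model (illustrative Python; the camera-angle convention and the cosine-based pitch scaling are assumptions, not the exact Unity implementation):

```python
import math

def blend_cameras(eye_angle, pitch, ray_angle, radius, n_cams):
    """Toy model of the camera-circle lookup. The eye sits at angle
    `eye_angle` on a circle of `radius`, pulled toward the center as
    |pitch| grows; its ray (direction `ray_angle` in the horizontal
    plane) is intersected with the circle, and the two virtual cameras
    bracketing the hit point are returned with blend weights."""
    # Eye position: on the perimeter at pitch 0, at the center at the poles.
    scale = math.cos(pitch)
    ex = radius * scale * math.cos(eye_angle)
    ey = radius * scale * math.sin(eye_angle)
    dx, dy = math.cos(ray_angle), math.sin(ray_angle)
    # Ray/circle intersection: |e + t*d| = radius, forward root.
    b = ex * dx + ey * dy
    disc = b * b - (ex * ex + ey * ey) + radius * radius
    t = -b + math.sqrt(max(0.0, disc))
    hit = math.atan2(ey + t * dy, ex + t * dx)
    # Cameras sit at angles 2*pi*k/n; find the two bracketing the hit
    # point and weight them by angular proximity.
    idx = (hit % (2 * math.pi)) / (2 * math.pi) * n_cams
    i0 = int(idx) % n_cams
    frac = idx - int(idx)
    return [(i0, 1.0 - frac), ((i0 + 1) % n_cams, frac)]
```

When pitch is ±90 degrees the eye collapses to the center, every ray hits the circle exactly at one camera angle, and both eyes receive identical colors – reproducing the zero-stereo behavior at the poles.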
Although this scheme is effective in reducing artifacts near the poles, it can still produce artifacts if not enough cameras are used, particularly if there are very close objects. These objects will appear blurry or doubled. This occurs because the camera rays strike the object at a slightly different location than the eye ray. Fortunately, in a real-time rendered environment, rendering large numbers of camera views is fast. Because there is not enough memory to store all these views at once, typically just a few (often 3 wide-angle virtual cameras) are rendered at a time and used to render a portion of the final view.
There is some subtlety to how the IPD should shrink as you move toward the poles. A simple linear function is continuous, but its slope changes abruptly at the equator, resulting in a visible “crease” there. A good IPD scaling function should have a continuous derivative at the equator, while also producing acceptable visual results at all pitch values. Choosing one is a somewhat empirical, ad-hoc process.
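To illustrate the difference, here are both falloffs side by side (illustrative Python; the cosine curve is one plausible choice, not necessarily what any particular tool ships):

```python
import math

def ipd_scale_linear(pitch):
    """Linear falloff: continuous, but its slope jumps from negative
    to positive at pitch = 0, producing a visible 'crease' at the
    equator. Pitch is in radians, +/- pi/2 at the poles."""
    return 1.0 - abs(pitch) / (math.pi / 2.0)

def ipd_scale_smooth(pitch):
    """Cosine falloff: the derivative is zero at the equator, so
    there is no crease, and the scale still reaches zero at the
    poles as required."""
    return math.cos(pitch)
```

Both functions are 1 at the equator and 0 at the poles; only the smooth one avoids the crease, at the cost of shrinking the stereo effect somewhat faster at moderate pitch values.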
Although 360 capture is slowly becoming available in more and more applications, it still faces a number of limitations compared to traditional monitor-based capture tools like Fraps, OBS, and ShadowPlay.
Requires application support
With regular capture tools, it’s easy to capture video of virtually any application. The application itself does not need to provide any special functionality to support this. Currently, 360 capture is only available in apps whose developers have added special support for it, which requires access to the project’s source code.
In principle, the same kind of injection technology used by tools like vorpX and Vireio, which can manipulate the camera in certain DirectX titles, could be used to capture 360 images, but no one has yet implemented this. This approach would not suffice for 360 video, which cannot be captured in real time, and would also struggle with artifacts from screen-space effects.
No real-time video
Traditional monitor-based capture tools capture applications at full resolution and frame rate in real time during play with little overhead. As noted above, 360 video must be very high-resolution, at least 4K (roughly 4,000 pixels) wide. At least six different views of the scene must be rendered per frame (more for stereoscopic capture). Additionally, in VR it’s essential to maintain a high framerate on the display to ensure a comfortable experience. All this results in a capture framerate of less than 10 FPS.
Since this is unacceptably low, the only alternative is to render the video offline. For on-rails experiences like Welcome to Oculus, Senza Peso, Colosse, or most roller coaster apps, where the player is merely an observer, this is simple: step through the experience one frame at a time, capturing each frame before proceeding to the next. Audio is added in postprocessing.
For other applications where the player is actively controlling the action, it is necessary to implement a frame-by-frame replay system which captures the player’s actions and then replays them afterwards while capturing each frame. Ensuring that the replay reproduces the original scene exactly may be tricky in certain applications that include elements of randomness or examine wall clock time.
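The determinism requirement can be illustrated with a toy loop (illustrative Python; a real engine must also pin the timestep, avoid the wall clock, and record every input source):

```python
import random

def simulate(inputs, seed):
    """Toy game loop: a fixed RNG seed plus the recorded per-frame
    inputs fully determine the trajectory, so a frame-by-frame replay
    reproduces the original session exactly."""
    rng = random.Random(seed)  # a private, seeded RNG, never the wall clock
    state, trajectory = 0.0, []
    for player_input in inputs:
        # All randomness (e.g. enemy jitter) flows from the seeded RNG.
        state += player_input + rng.uniform(-0.1, 0.1)
        trajectory.append(state)
    return trajectory

# Record inputs and seed once during play, then replay offline while
# capturing each frame at full quality:
recorded_inputs = [1.0, 0.5, -0.25]
live = simulate(recorded_inputs, seed=42)
replayed = simulate(recorded_inputs, seed=42)
assert live == replayed
```

Any hidden dependence on wall-clock time or an unseeded RNG breaks this equality, which is exactly why replay capture is tricky in applications that were not built with it in mind.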
Stereo equirectangular format
The stereo equirectangular format, with separate left-eye and right-eye images, makes it easy to implement efficient viewing software, is easy to compress and store using existing image and video formats, and is compact enough to download and stream. However, even with all the rendering fixes outlined in previous sections, this format still has fundamental limitations.
For one, it cannot properly display stereo views where your head is tilted with your ear toward your shoulder (roll); convergence is not possible in this pose. Another example: if you lean your head so far back that you can see behind you, your view will be reversed. Even with your head pointed straight forward, if you use your eyes to look left or right, you will not see the correct view toward the edge of the frame, and this issue becomes more severe as HMD field of view increases. The distance between the eyes (IPD) is “baked in” to the image, which will produce a distorted sense of scale for viewers with unusually large or small IPDs.
Finally, it can’t provide any kind of positional tracking – in addition to preventing the viewer from leaning left, right, forward, or back while viewing, this also causes incorrect parallax during head rotation, because the eyes are rotating around a fixed point between them, rather than around the center of the neck as they do in real life. This manifests as a perceived motion of nearby objects during head rotation, even in static scenes where those objects should remain still. (It may be possible to mitigate parallax problems with the light-field techniques described before, but this may also exacerbate artifacts.)
Where is the player looking?
In a traditional video, it’s easy to see where the original player was looking during gameplay. In a 360 video, the viewer takes over all head movement, and there is no longer any clear way to tell which way the original player was looking. There’s much room for experimentation in this area, but here are a few ideas:
- World-space UI and spatial audio: An arrow might indicate which way the player was looking; a square or circle may highlight the portion of the scene they were looking at; the portion of the scene they’re looking at may light up as they look at it, as though their eyes were shining light on the scene like a flashlight; a spatial audio source such as a hum sound may be located in the direction they are looking.
- Third-person camera with head animation: A viewer is, in some sense, another person hanging out with the original player, so being inside their body is weird, even in a first-person game. By placing the viewer outside the original player’s body, you can watch their body and head turn as they explore and look around, while also having the freedom to look around yourself. A third-person camera also offers the ability to smooth out motion and reduce discomfort when moving around. However, as with all third-person cameras, it can get confused or move unpredictably in cramped environments, like a tiny hut in Minecraft.
Capturing of real-time rendered content in engines like Unity and Unreal Engine 4 offers exciting possibilities for VR developers and gamers to share their creations and experiences while letting their friends be just as immersed as they were, with full freedom to look around. The tools and techniques are available now to begin integrating this into a variety of applications, but a number of daunting obstacles remain before 360 capture will be as fast and straightforward as video capture is today.
This tech doc was originally authored by D ‘eVRydayVR’ Coetzee.
All rights to all original content in this article waived under CC0 Public Domain Dedication