Given a number of images of the same scene, a neural network can synthesize images from novel vantage points, but it can take hours to train. A new approach cuts training time to a few minutes.
What’s new: Thomas Müller and colleagues at Nvidia introduced a new method for learning representations of positions in a 3D scene. It’s compatible with Neural Radiance Fields (NeRF), a popular way to synthesize images of a scene from novel perspectives.
NeRF basics: For a given scene, NeRF learns to reproduce ground-truth images shot by a camera from different positions and angles. At inference, given a camera position and angle, it generates views of a scene by sampling points along virtual light rays that extend from the camera through each pixel. Given an embedding of a point’s position and the ray’s direction, separate fully connected networks compute its color and transparency. (Typically many points occupy empty space, so they’re fully transparent and have no color.) The system combines the color and transparency of points along the ray to find the associated pixel’s color.
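To make the compositing step concrete, here is a minimal sketch in NumPy of the standard NeRF volume-rendering blend; the function names, sample counts, and values are our own illustration, not the authors' code.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Blend sampled points along one ray into a single pixel color.

    colors:    (N, 3) RGB predicted at N points along the ray
    densities: (N,)   predicted volume density (how opaque each point is)
    deltas:    (N,)   distances between consecutive samples
    """
    # Alpha: how much light each sample absorbs over its segment.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: how much light survives all previous samples.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    weights = alphas * transmittance                 # each sample's contribution
    return (weights[:, None] * colors).sum(axis=0)   # final pixel RGB

# Example: 64 samples along one ray from the camera through a pixel.
rng = np.random.default_rng(0)
pixel = composite_ray(rng.random((64, 3)), rng.random(64), np.full(64, 0.03))
print(pixel)   # one RGB value in [0, 1]
```

Points in empty space receive near-zero density, so they contribute almost nothing to the blend, which is why most samples end up effectively transparent and colorless.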
Key insight: Previous efforts to speed up NeRF training impose a 3D grid over the scene and learn an embedding for each grid point. When sampling coordinates along rays, these approaches interpolate the embeddings of the grid points that surround each sampled position. This requires a lot of memory, and rendering is slow because ferrying embeddings between memory and the processor takes a lot of time. Limiting the total number of embeddings so they fit within a processor’s cache eliminates this bottleneck, accelerating rendering. One way to do that is to hash the coordinates; that is, to apply a function that maps each set of coordinates to an index into a fixed-size list (a hash table). This makes it possible to map any number of points to a limited number of embeddings.
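To illustrate the hashing idea, here is a minimal sketch (our own, not the authors' CUDA implementation) that maps integer grid coordinates into a fixed-size embedding table. The table size and prime constants are illustrative assumptions in the style of common spatial-hashing schemes.

```python
import numpy as np

TABLE_SIZE = 2 ** 19   # number of embeddings; small enough to sit in fast memory
EMBED_DIM = 2          # features stored per table entry

# One trainable embedding table; every grid corner maps to one of its rows.
hash_table = np.random.uniform(-1e-4, 1e-4, size=(TABLE_SIZE, EMBED_DIM))

# Large primes, as in common spatial-hashing schemes (illustrative values).
PRIMES = np.array([1, 2_654_435_761, 805_459_861], dtype=np.uint64)

def spatial_hash(corner_xyz):
    """Map integer grid coordinates (x, y, z) to an index into the table."""
    c = np.asarray(corner_xyz, dtype=np.uint64)
    return int(np.bitwise_xor.reduce(c * PRIMES) % TABLE_SIZE)

idx = spatial_hash((17, 4, 230))    # any corner of any grid lands in the table
print(idx, hash_table[idx])         # the index and its trainable embedding
```

However many grid points a scene requires, the table never grows, so the embeddings can stay resident in fast on-chip memory instead of shuttling back and forth to slower memory.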
How it works: The authors trained separate systems of vanilla neural networks to generate 20 synthetic and real scenes used in the original NeRF paper. As in the original NeRF and its variants, the networks learned by minimizing the difference between the ground-truth images and images generated from the same viewpoints (a sketch of this loss appears after the list below). Given a camera position and viewing angle, the system projected a ray through each pixel of the output image and sampled between 3 and 26 points along each ray, depending on the scene’s size.
- The system defined 16 3D grids with resolutions from coarse (16×16×16) to fine (512×512×512).
- Given a point along a ray and a particular resolution, the system located the eight corners of the grid cell that contained the point and hashed their coordinates to retrieve the corresponding embeddings. Then it interpolated those embeddings to compute a vector that represented the point at that resolution.
- It repeated this process at each resolution, maintaining a separate hash table for each of the 16 grids. Hashing each point’s coordinates at multiple resolutions kept points differentiated: two points might map to the same embedding at one resolution (a hash collision), but they were unlikely to collide at every resolution.
- The system concatenated each point’s embeddings from all 16 resolutions and fed the result to two vanilla neural networks: one estimated opacity, the other estimated color. (A minimal sketch of the full pipeline follows this list.)
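Putting the steps above together, the following is a simplified sketch of the encoding pipeline, not Nvidia's implementation: hash the eight corners of the containing cell at each of 16 resolutions, trilinearly interpolate each resolution's embeddings, concatenate them, and feed the result to small networks for opacity and color. The single-layer networks here are stand-ins for the real system's small multi-layer networks, and the table size is an illustrative assumption.

```python
import numpy as np

NUM_LEVELS = 16                                     # 16 grids, coarse to fine
RESOLUTIONS = np.geomspace(16, 512, NUM_LEVELS).astype(int)
TABLE_SIZE = 2 ** 16                                # illustrative table size
EMBED_DIM = 2
PRIMES = np.array([1, 2_654_435_761, 805_459_861], dtype=np.uint64)

# One small trainable embedding table per resolution level.
tables = np.random.uniform(-1e-4, 1e-4, (NUM_LEVELS, TABLE_SIZE, EMBED_DIM))

def spatial_hash(corner):
    c = np.asarray(corner, dtype=np.uint64)
    return int(np.bitwise_xor.reduce(c * PRIMES) % TABLE_SIZE)

def encode_point(xyz):
    """Multiresolution hash encoding of one 3D point in the unit cube."""
    features = []
    for level, res in enumerate(RESOLUTIONS):
        scaled = np.asarray(xyz) * res
        base = np.floor(scaled).astype(int)         # lower corner of the cell
        frac = scaled - base                        # position inside the cell
        feat = np.zeros(EMBED_DIM)
        # Trilinear interpolation over the cell's eight corners.
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    corner = base + np.array([dx, dy, dz])
                    w = ((frac[0] if dx else 1 - frac[0])
                         * (frac[1] if dy else 1 - frac[1])
                         * (frac[2] if dz else 1 - frac[2]))
                    feat += w * tables[level, spatial_hash(corner)]
        features.append(feat)
    return np.concatenate(features)                 # 16 levels x 2 dims = 32

# Tiny stand-ins for the two fully connected networks (random weights here;
# in the real system they are trained jointly with the hash tables).
W_density = np.random.randn(32, 1) * 0.1
W_color = np.random.randn(32 + 3, 3) * 0.1          # color also sees ray direction

def render_sample(xyz, ray_dir):
    h = encode_point(xyz)
    density = np.logaddexp(0.0, h @ W_density)      # softplus -> nonnegative
    rgb = 1.0 / (1.0 + np.exp(-np.concatenate([h, ray_dir]) @ W_color))  # sigmoid
    return density.item(), rgb

print(render_sample([0.3, 0.7, 0.5], np.array([0.0, 0.0, 1.0])))
```

Because the hash tables are plain lookup tables, most of the model's capacity lives in memory-friendly embeddings rather than in network weights, which helps keep the per-sample networks small and fast to evaluate.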
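As for the training signal, "minimizing the difference between the ground-truth images and generated images" amounts to a per-pixel reconstruction loss; a plain mean squared error, assumed here for illustration, is a common choice.

```python
import numpy as np

def photometric_loss(rendered, ground_truth):
    """Mean squared error between rendered and ground-truth pixel colors."""
    return np.mean((rendered - ground_truth) ** 2)

# Toy illustration: compare a rendered 4x4 RGB patch against the real photo.
rng = np.random.default_rng(0)
rendered_patch = rng.random((4, 4, 3))
true_patch = rng.random((4, 4, 3))
print(photometric_loss(rendered_patch, true_patch))

# During training, gradients of this loss flow back into both the hash-table
# embeddings and the weights of the opacity and color networks.
```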
Results: The authors evaluated the system using Peak Signal-to-Noise Ratio (PSNR), which measures image reconstruction quality (higher is better), and compared their results to those of the original NeRF and the similar Mip-NeRF. Averaged across all scenes, the new approach achieved 31.407 PSNR after 15 seconds of training (by contrast, NeRF achieved 31.005 PSNR after more than 12 hours of training) and 33.176 PSNR after five minutes of training (better than Mip-NeRF’s 33.090 PSNR after two to three hours of training).
Yes, but: Hash collisions, while rare, can still happen. When they do, the result is a rough surface texture.
Why it matters: Tailoring neural networks to hardware resources can accelerate processing with very little impact on output quality. This can dramatically reduce the time and money required to tackle modern machine learning tasks.
We’re thinking: The authors used a hash table to reduce the number of embeddings and dramatically accelerate rendering. Would the same method accelerate other models that rely on large numbers of embeddings?