Recognizing actions performed in a video requires understanding each frame as well as the relationships between frames. Previous research devised an efficient way to analyze individual images known as the Active Shift Layer (ASL). New research extends this technique to the steady march of video frames.
What’s new: Linxi Fan, Shyamal Buch, and colleagues at the Stanford Vision and Learning Lab, the University of Texas at Austin, and Nvidia developed RubiksShift, an efficient replacement for convolutional layers that process time-series inputs. The name’s similarity to Rubik’s Cube apparently refers to extracting features by shifting three-dimensional data.
Key insight: A shift filter is a variation on the convolutional filter whose weights are all 0 except for a single 1. This is more computationally efficient than traditional convolution, whose weights are real-valued, but the discrete shifts can’t be learned via ordinary backpropagation, which makes shift filters difficult to train. ASL reformulated backprop for shift filters applied to still images. RubiksShift adapts ASL to video by generalizing it to an additional dimension; in this case, time.
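To make that insight concrete, here’s a minimal sketch (in NumPy/SciPy, not code from the paper) showing that shifting a feature map is equivalent to convolving it with a kernel whose weights are all 0 except a single 1, which is why shifts need no multiplications at all. The `shift2d` helper is hypothetical, written only for illustration.

```python
import numpy as np
from scipy.ndimage import correlate

def shift2d(channel: np.ndarray, dy: int, dx: int) -> np.ndarray:
    """Translate a 2D feature map by (dy, dx), zero-padding pixels that slide out."""
    out = np.zeros_like(channel)
    h, w = channel.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        channel[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

# The same translation expressed as a 3x3 kernel that is all zeros except one 1.
feature = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.zeros((3, 3))
kernel[1, 2] = 1.0   # "copy the pixel one step to the right"

# Correlating with the one-hot kernel reproduces the pure-indexing shift exactly.
assert np.allclose(correlate(feature, kernel, mode="constant"),
                   shift2d(feature, dy=0, dx=-1))
```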
How it works: Convolutional filters typically process images across three dimensions: height, width, and color (red, green, and blue channels). Video adds a fourth dimension, time. RubiksShift is a layer of 4D shift convolutions. The researchers also propose an architecture, RubiksNet, composed of multiple RubiksShift layers.
- Shift filters effectively translate their inputs in a given direction. Applied to images, they re-center the data without stretching or rotating it. Applied to videos, they re-center individual frames and also move data forward or backward in time (within the confines of the frame rate).
- ASL trains shift filters by introducing two parameters that determine how far pixels are translated vertically and horizontally. It allows the parameters to take non-integer values by interpolating between neighboring positions. For instance, a half-pixel shift to the right averages each pixel with the one to its right.
- RubiksShift adds a third parameter that represents the shift across time. During training, it pushes the parameters to converge to integer values, so there’s no need to interpolate at test time (see the sketch below).
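Below is a minimal sketch of the interpolation trick described above, applied to the time dimension. It assumes a `(time, height, width)` tensor and uses hypothetical helpers `shift_t` and `frac_shift_t`; it illustrates the general idea of learning a fractional shift by gradient descent, not the authors’ RubiksShift implementation.

```python
import torch

def shift_t(clip: torch.Tensor, dt: int) -> torch.Tensor:
    """Shift a (T, H, W) clip by dt frames, zero-padding frames that slide out."""
    T = clip.shape[0]
    out = torch.zeros_like(clip)
    if abs(dt) < T:
        if dt >= 0:
            out[dt:] = clip[:T - dt]
        else:
            out[:T + dt] = clip[-dt:]
    return out

def frac_shift_t(clip: torch.Tensor, shift: torch.Tensor) -> torch.Tensor:
    """Differentiable temporal shift: a fractional shift is the weighted average
    of the two nearest integer shifts, so gradients reach the shift parameter."""
    lo = int(torch.floor(shift).item())   # integer part, not differentiated
    frac = shift - lo                     # fractional part, gradient flows here
    return (1.0 - frac) * shift_t(clip, lo) + frac * shift_t(clip, lo + 1)

# Toy usage: learn a temporal shift parameter by gradient descent.
clip = torch.randn(8, 4, 4)                    # (time, height, width)
shift = torch.tensor(0.3, requires_grad=True)  # learnable shift, in frames
loss = frac_shift_t(clip, shift).pow(2).mean()
loss.backward()                                # shift.grad is now populated
```

Once the learned shift converges to an integer, the fractional part is zero and the operation reduces to pure indexing, which is what makes inference cheap.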
Results: The authors evaluated RubiksNet against state-of-the-art action recognition networks designed for efficient computation, such as I3D, using the Something-Something dataset of clips that represent human actions. RubiksNet achieved top-1 accuracy of 46.5 percent compared to I3D’s 45.8 percent, and it executed 10 times fewer floating point operations during classification. RubiksNet more than doubled the accuracy of other methods that used a similar number of operations.
Why it matters: Video is ubiquitous, and we could do a lot more with it — in terms of search, manipulation, generation, and so on — if machines had better ways to understand it.
We’re thinking: Hopefully reading this overview of RubiksNet was less confusing than trying to solve a Rubik’s Cube!