Automated systems for interpreting human emotions have analyzed speech, text, and facial expressions. New research shows progress on the subtle art of reading body language.
What’s new: Researchers at the University of North Carolina and the University of Maryland built EWalk, a collection of 1,348 videos of people walking, each manually labeled with a perceived emotion: happy, sad, angry, or neutral. For instance, gaits labeled happy tend to have a quick pace and widely swinging arms, while sad gaits are slower with a slumped posture. A model trained on EWalk achieved state-of-the-art results in matching gaits to emotional states.
Key insights: The model classifies the emotion expressed by a given gait with a random forest, drawing on a combination of features extracted by hand-crafted rules and by a neural network.
- Decision trees in this case are less data-hungry than neural networks, but they're prone to overfitting high-dimensional inputs.
- So the researchers used an LSTM to reduce the number of dimensions describing the input before handing it to the random forest (a minimal sketch of this idea follows).
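To make the idea concrete, here is a minimal sketch (in PyTorch, not the authors' code) of an LSTM that compresses a variable-length sequence of 3D joint positions into a fixed-length vector a random forest can digest. The joint count, hidden size, and class names are assumptions.

```python
# Minimal sketch (not the paper's implementation): an LSTM compresses a gait
# sequence into a fixed-length vector. Joint count (16), hidden size (32),
# and the four emotion classes are assumptions for illustration.
import torch
import torch.nn as nn

class GaitEncoder(nn.Module):
    def __init__(self, num_joints=16, hidden_size=32, num_emotions=4):
        super().__init__()
        # Each frame is flattened to 3 coordinates per joint.
        self.lstm = nn.LSTM(input_size=3 * num_joints,
                            hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_emotions)

    def forward(self, frames):
        # frames: (batch, num_frames, 3 * num_joints)
        _, (h_n, _) = self.lstm(frames)
        hidden = h_n[-1]                 # (batch, hidden_size) compact representation
        return self.classifier(hidden), hidden

# A 240-frame walking clip with 16 joints becomes a 32-dimensional vector.
encoder = GaitEncoder()
clip = torch.randn(1, 240, 3 * 16)
logits, features = encoder(clip)
print(features.shape)  # torch.Size([1, 32])
```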
How it works: TimePoseNet, a neural network detailed in previous research, extracts a 3D skeletal representation of the gait from each video frame. The researchers compute pose and movement features from the skeleton and feed them to the random forest. They also feed the skeleton to an LSTM network, whose hidden units supplement the random forest’s input.
- Pose measurements include the volume of the bounding box surrounding a skeleton, area of bounding boxes for upper and lower body, distance between feet and hands, and angles between head and shoulders as well as head and torso.
- Each pose measurement is averaged across all frames to produce the pose feature. Maximum stride length is also included in the pose feature, but not averaged over frames.
- Movement measurements include the velocity, acceleration, and jerk (the rate of change of acceleration) of each skeletal joint in each frame. The duration of a single walk cycle, from the moment a foot lifts to the moment it hits the ground, is also a movement measurement.
- Each movement measurement is averaged over all frames to produce the movement feature (one way to compute these features is sketched in code after this list).
- The random forest takes pose and movement features as inputs to predict emotion.
- An LSTM, fed the sequence of skeletons, is also trained to predict emotion. The random forest receives the values of the LSTM’s hidden units, but not its prediction, as additional input (see the second code sketch below).
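For illustration, here is a simplified sketch of how such pose and movement features might be computed from a per-frame array of 3D joint positions. It is not the paper's exact feature set: the joint indices, frame rate, and the subset of measurements shown are assumptions.

```python
# Sketch of hand-crafted gait features (a simplified subset, not the authors'
# exact formulas). `poses` is a (num_frames, num_joints, 3) array of 3D joint
# positions from a pose estimator such as TimePoseNet; joint indices are assumed.
import numpy as np

HEAD, NECK, L_SHOULDER, R_SHOULDER, L_HAND, R_HAND, L_FOOT, R_FOOT = 0, 1, 2, 3, 4, 5, 6, 7

def pose_features(poses):
    # Bounding-box volume around the whole skeleton, averaged over frames.
    extents = poses.max(axis=1) - poses.min(axis=1)          # (frames, 3)
    box_volume = np.prod(extents, axis=1).mean()

    # Distance between the feet and between the hands, averaged over frames.
    foot_dist = np.linalg.norm(poses[:, L_FOOT] - poses[:, R_FOOT], axis=1).mean()
    hand_dist = np.linalg.norm(poses[:, L_HAND] - poses[:, R_HAND], axis=1).mean()

    # Angle at the neck between the head and the shoulder line, averaged over frames.
    v_head = poses[:, HEAD] - poses[:, NECK]
    v_shoulder = poses[:, R_SHOULDER] - poses[:, L_SHOULDER]
    cos = np.sum(v_head * v_shoulder, axis=1) / (
        np.linalg.norm(v_head, axis=1) * np.linalg.norm(v_shoulder, axis=1) + 1e-8)
    head_shoulder_angle = np.arccos(np.clip(cos, -1.0, 1.0)).mean()

    # Maximum stride length is kept as a maximum, not averaged over frames.
    stride = np.linalg.norm(poses[:, L_FOOT] - poses[:, R_FOOT], axis=1).max()

    return np.array([box_volume, foot_dist, hand_dist, head_shoulder_angle, stride])

def movement_features(poses, fps=30.0):
    # Per-joint velocity, acceleration, and jerk, averaged over joints and frames.
    velocity = np.diff(poses, axis=0) * fps
    acceleration = np.diff(velocity, axis=0) * fps
    jerk = np.diff(acceleration, axis=0) * fps
    return np.array([np.linalg.norm(velocity, axis=2).mean(),
                     np.linalg.norm(acceleration, axis=2).mean(),
                     np.linalg.norm(jerk, axis=2).mean()])
```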
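And here is a sketch of the final step: concatenating the hand-crafted features with the LSTM’s hidden units and fitting a random forest with scikit-learn. It assumes the GaitEncoder and feature functions sketched above, and the forest hyperparameters are placeholders rather than the paper’s settings.

```python
# Sketch of the combined classifier (hyperparameters are placeholders): the
# random forest sees hand-crafted features plus the LSTM's hidden units.
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier

def feature_vector(poses, encoder):
    # poses: (num_frames, num_joints, 3) NumPy array for one walking clip.
    handcrafted = np.concatenate([pose_features(poses), movement_features(poses)])
    clip = torch.from_numpy(poses.reshape(1, poses.shape[0], -1)).float()
    with torch.no_grad():
        _, hidden = encoder(clip)          # hidden units, not the LSTM's prediction
    return np.concatenate([handcrafted, hidden.squeeze(0).numpy()])

def train_forest(clips, labels, encoder):
    # clips: list of (num_frames, num_joints, 3) arrays
    # labels: 0=happy, 1=sad, 2=angry, 3=neutral
    X = np.stack([feature_vector(p, encoder) for p in clips])
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, labels)
    return forest
```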
Results: The previous best method achieved 68 percent accuracy on EWalk. Given the pose and movement features plus the LSTM’s hidden units, the random forest achieved 80.1 percent. It also outperformed the LSTM alone.
Why it matters: The better computers understand human emotion, the more effectively we’ll be able to work with them. Beyond that, this work has clear applications in security, where early warning of potential aggression could help stop violent incidents before they escalate. It could also be handy in a retail environment, helping salespeople choose the most productive approach to prospective customers, and possibly in other face-to-face customer service situations.
We’re thinking: Psychology demonstrates that emotion affects human planning and actions. While models exist that are startlingly accurate at predicting future human actions — see this paper — they don’t explicitly take emotion into account. Factoring in emotions could enable such systems to make even better predictions.