Bias arises in machine learning when we fit an overly simple function to a more complex problem. A theoretical study shows that gradient descent itself may introduce such bias and render algorithms unable to fit data properly.
What’s new: Suriya Gunasekar led colleagues at Toyota Technological Institute at Chicago, the University of Southern California, and Technion - Israel Institute of Technology to demonstrate that a model’s predictions can be limited by its optimization method regardless of the volume or distribution of training data. In some cases, gradient descent can’t find the optimal solution even with all the data in the world.
Key insight: The researchers considered a model’s ability to learn the optimal solution given a particular optimization method and loss function, as well as sufficient training data. They divided loss functions into two categories inspired by (A) linear regression with more parameters than data samples and (B) logistic regression when data classes are separable. Loss functions in category A have attainable minima: The optimum sits at the bottom of a valley. Those in category B don’t: The optimum lies at the bottom of an infinitely long downward slope. Considering these categories separately enabled the authors to prove results for a variety of optimization methods.
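In rough terms (our own notation, not necessarily the paper’s), the two categories look like this for a linear model with weights w:

```latex
% Category A: underdetermined least squares. The minimum is attained by
% (infinitely many) finite weight vectors w.
L_A(w) = \lVert Xw - y \rVert_2^2,
\qquad \min_w L_A(w) \text{ is attained whenever } Xw = y.

% Category B: logistic loss on linearly separable data. The infimum is 0,
% but it is approached only as \lVert w \rVert \to \infty.
L_B(w) = \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i x_i^\top w}\bigr),
\qquad \inf_w L_B(w) = 0 \text{ is never attained.}
```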
How it works: The researchers found that optimization hyperparameter values can constrain which solutions particular combinations of loss function and optimization method are able to reach. A well-known example: With enough momentum, gradient descent doesn’t get stuck in local minima, but with too little, it gets trapped.
- Linear regression with quadratic loss defines a convex optimization problem whose minimum is attainable. By contrast, in logistic regression on some datasets, certain parameters must grow toward infinity to approach the optimum. The researchers measured bias in such scenarios by whether the optimizer could approach optimal performance given infinite time.
- In linear models trained with gradient descent, optimizing a loss in category A always reaches the optimal solution. When optimizing a loss in category B, how close the model comes to the optimal solution depends on how long it trains (see the first sketch after this list).
- A natural gradient descent optimizer updates weights by following the gradient together with a defined flow (which encourages certain directions, similar to gravity). By extending their results from gradient descent, the researchers showed that natural gradient descent is also biased by the initialization (see the second sketch after this list).
- For particular combinations of optimization method and loss function, the researchers were unable to quantify the gap between a solution and the optimum. But they did prove that the bias depends on the model’s learning rate and initialization. In such cases, gradient descent may get stuck in a local minimum depending on where it starts.
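To make the contrast in the second bullet concrete, here is a minimal sketch (our toy setup, not the paper’s experiments) of gradient descent on one-dimensional examples of each category. The data, step size, and iteration counts are illustrative assumptions.

```python
# A minimal sketch (not from the paper): plain gradient descent on toy 1-D data,
# contrasting the two loss categories. Data, step size, and iteration counts are
# assumptions chosen for illustration.
import math

# Category A: quadratic loss (2w - 4)^2 with an attainable minimum at w = 2.
w = 0.0
for _ in range(200):
    grad = 2 * 2.0 * (2.0 * w - 4.0)      # d/dw of (2w - 4)^2
    w -= 0.05 * grad
print(f"category A: w = {w:.4f}")          # settles at the optimum, w = 2

# Category B: logistic loss on separable data (all labels +1, features 1.0 and 2.0).
# The infimum (loss 0) is approached only as w -> infinity, so how close the model
# gets depends on how long it trains.
for steps in (100, 10_000, 100_000):
    w = 0.0
    for _ in range(steps):
        grad = sum(-x / (1 + math.exp(x * w)) for x in (1.0, 2.0))
        w -= 0.05 * grad
    loss = sum(math.log1p(math.exp(-x * w)) for x in (1.0, 2.0))
    print(f"category B after {steps:>7} steps: w = {w:.2f}, loss = {loss:.2e}")
```

On the quadratic loss the iterate reaches the optimum after a few hundred steps; on the separable logistic loss the weight keeps growing and the loss keeps shrinking without ever reaching its infimum of zero.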
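The initialization dependence described in the last two bullets can also be seen on a toy problem. The sketch below is again our own construction, with a fixed diagonal preconditioner standing in for the flow of a natural-gradient-style method: plain and preconditioned gradient descent both drive an underdetermined linear regression to zero loss, but which zero-loss solution each finds depends on where it starts and on the geometry of the update.

```python
# A minimal sketch (not the paper's construction): gradient descent and a
# diagonally preconditioned variant on an underdetermined linear regression
# with one sample and two parameters, so there are infinitely many optima.
import numpy as np

x = np.array([1.0, 2.0])      # one sample, two weights
y = 3.0                        # every w with x @ w == 3 has zero loss
D = np.diag([1.0, 10.0])       # assumed preconditioner defining the descent geometry

def run(w, precondition=False, lr=0.05, steps=5_000):
    for _ in range(steps):
        grad = 2 * (x @ w - y) * x                        # gradient of (x @ w - y)^2
        step = np.linalg.solve(D, grad) if precondition else grad
        w = w - lr * step
    return w

for w0 in (np.zeros(2), np.array([5.0, -5.0])):
    print("init", w0,
          "-> gd:", run(w0.copy()).round(3),
          "| preconditioned:", run(w0.copy(), precondition=True).round(3))
# Every run ends with x @ w == 3 (zero loss), but the particular optimum differs
# with the starting point and with the geometry of the update.
```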
Results: In theory, the solution gradient descent finds with loss functions in category A depends on initialization but not on hyperparameters such as step size. With momentum, the solution depends on both initialization and hyperparameters. For category B, gradient descent always converges on the correct solution given infinite time.
Why it matters: Theory and practice sometimes diverge, but theoretical analyses help simplify practical issues and provide clear guidance when questions arise.
We’re thinking: How to select and tune optimization algorithms is often the realm of tribal knowledge. (We teach some of this in the Deep Learning Specialization.) We’re glad to have higher principles to consult as we debug architectures, datasets, and training methods.