Linear regression may be the key statistical method in machine learning, but it didn’t get to be that way without a fight. Two eminent mathematicians claimed credit for it, and 200 years later the matter remains unresolved. The longstanding dispute attests not only to the algorithm’s extraordinary utility but also to its essential simplicity.
Whose algorithm is it anyway? In 1805, French mathematician Adrien-Marie Legendre published the method of fitting a line to a set of points while trying to predict the location of a comet. (Celestial navigation was the science most valuable to global commerce at the time, much as AI is today: the new electricity, if you will, two decades before the electric motor.) Four years later, the 32-year-old German wunderkind Carl Friedrich Gauss insisted that he had been using it since 1795 but had deemed it too trivial to write about. Gauss’ claim prompted Legendre to publish an addendum anonymously observing that “a very celebrated geometer has not hesitated to appropriate this method.”
Slopes and biases: Linear regression is useful any time the relationship between an outcome and a variable that influences it follows a straight line. For instance, a car’s fuel consumption bears a linear relationship to its weight.
- The relationship between fuel consumption y and car weight x depends on the line’s slope w (how steeply fuel consumption rises with weight) and bias term b (fuel consumption at zero weight): y = wx + b.
- During training, given a car’s weight, the algorithm predicts the expected fuel consumption and compares it with the actual value. It then adjusts w and b to minimize the sum of squared differences between the two, typically via the technique of ordinary least squares.
- Taking the car’s drag into account makes it possible to generate more precise predictions. The additional variable extends the line into a plane. In this way, linear regression can accommodate any number of variables (dimensions), as the sketch after this list shows.
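The whole procedure fits in a few lines of numpy. The sketch below fits first the one-variable line, then the two-variable plane, by ordinary least squares; every weight, drag, and fuel figure is made up for illustration.

```python
import numpy as np

# Hypothetical figures: car weight (tonnes) and fuel consumption (L/100 km).
weight = np.array([1.0, 1.2, 1.5, 1.8, 2.1, 2.4])
fuel = np.array([5.2, 5.9, 7.1, 8.0, 9.3, 10.1])

# Ordinary least squares: pick w and b that minimize the sum of squared
# differences between predicted (w * weight + b) and actual fuel consumption.
X = np.column_stack([weight, np.ones_like(weight)])  # column of 1s carries the bias b
(w, b), *_ = np.linalg.lstsq(X, fuel, rcond=None)
print(f"fuel = {w:.2f} * weight + {b:.2f}")

# A second variable (drag coefficients, also made up) extends the line into a plane.
drag = np.array([0.30, 0.32, 0.36, 0.33, 0.37, 0.40])
X2 = np.column_stack([weight, drag, np.ones_like(weight)])
(w1, w2, b2), *_ = np.linalg.lstsq(X2, fuel, rcond=None)
print(f"fuel = {w1:.2f} * weight + {w2:.2f} * drag + {b2:.2f}")
```

Stacking a column of ones alongside the inputs is a standard trick: it lets a single solver call recover the bias b along with the slopes.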
Two steps to ubiquity: The algorithm immediately helped navigators to follow the stars, and later biologists (notably Charles Darwin’s cousin Francis Galton) to identify heritable traits in plants and animals. Two further developments unlocked its broad potential. In 1922, English statisticians Ronald Fisher and Karl Pearson showed how linear regression fit into the general statistical framework of correlation and distribution, making it useful throughout the sciences. And, nearly a century later, the advent of computers provided the data and processing power to take far greater advantage of it.
Coping with ambiguity: Of course, data is never perfectly measured, and some variables are more important than others. These facts of life have spurred more sophisticated variants. For instance, linear regression with L2 regularization (also called ridge regression) encourages a model not to depend too heavily on any one variable, spreading its reliance evenly across the most important ones. It’s a good default choice. If you’re going for simplicity, a different form of regularization (L1 instead of L2) yields lasso, which pushes as many coefficients as possible to exactly zero. In other words, it learns to select the variables with the most predictive power and ignores the rest. Elastic net combines both types of regularization. It’s useful when data is sparse or features appear to be correlated.
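scikit-learn exposes all three variants behind near-identical interfaces, so the differences are easy to see side by side. The sketch below fits each one to synthetic data in which only two of five features actually matter; the alpha and l1_ratio settings are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)

# Synthetic data (made up): five features, but only the first two actually
# drive the outcome; the other three are pure noise.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# L2 regularization (ridge) shrinks all coefficients, spreading reliance evenly.
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 regularization (lasso) drives the least useful coefficients to exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)

# Elastic net mixes both penalties; l1_ratio=0.5 weights them equally.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("ridge:      ", ridge.coef_.round(2))
print("lasso:      ", lasso.coef_.round(2))  # noise features typically land at 0.0
print("elastic net:", enet.coef_.round(2))
```

Printing the coefficients makes the contrast concrete: ridge keeps small nonzero weights on the noise features, while lasso and elastic net tend to zero them out.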
In every neuron: Still, the simple version is enormously useful. The most common sort of neuron in a neural network is a linear regression model followed by a nonlinear activation function, making linear regression a fundamental building block of deep learning.
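To make that concrete, here is a minimal sketch of such a neuron in numpy, using ReLU as the nonlinear activation; the input and parameters are hypothetical.

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: a linear regression (w·x + b) followed by a nonlinear activation."""
    z = np.dot(w, x) + b        # the linear-regression part
    return np.maximum(z, 0.0)   # ReLU, one common choice of activation

# Hypothetical input and parameters, for illustration only.
print(neuron(x=np.array([0.5, -1.2]), w=np.array([0.8, 0.3]), b=0.1))
```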