In machine learning, you start by defining a task and a model. The model consists of an architecture and parameters. For a given architecture, the values of the parameters determine how accurately the model performs the task. But how do you find good values? By defining a loss function that evaluates how well the model performs. The goal is to minimize the loss and thereby to find parameter values that match predictions with reality. This is the essence of training.

# I   Setting up the optimization problem

The loss function will be different in different tasks depending on the output desired. How you define it has a major influence on how the model will train and perform. Let’s consider two examples:

### Example 1: House price prediction

Say your task is to predict the price of houses $y \in \mathbb{R}$ based on features such as floor area, number of bedrooms, and ceiling height. The squared loss function can be summarized by the sentence:

Given a set of house features, the square of the difference between your prediction and the actual price should be as small as possible.

This loss function is

$$\mathcal{L} = ||y-\hat{y}||_2^2$$

where $\hat{y}$ is your predicted price and $y$ is the actual price, also known as ground truth.
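As a minimal sketch of this loss, assuming NumPy and hypothetical prices expressed in thousands of dollars (the function name `squared_loss` is illustrative, not from the article):

```python
import numpy as np

def squared_loss(y, y_hat):
    """Squared L2 distance between the ground truth y and the prediction y_hat."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sum(diff ** 2))

# Hypothetical example: actual price 300, predicted price 280.
loss = squared_loss(300.0, 280.0)
print(loss)  # 400.0
```

The loss is zero only when the prediction equals the actual price, and it grows quadratically with the size of the error.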

### Example 2: Object localization

Let’s consider a more complex example. Say your task is to localize the car in a set of images that contain one. The loss function should frame the following sentence in mathematical terms:

Given an image containing one car, predict a bounding box (bbox) that surrounds the vehicle. The predicted box should match the size and position of the actual car as closely as possible.

In mathematical terms, a possible loss function $\mathcal{L}$ (Redmon et al., 2016) is:

$$\mathcal{L} = \underbrace{(x - \hat{x})^2 + (y - \hat{y})^2}_{\text{BBox Center}} + \underbrace{(w - \hat{w})^2 + (h - \hat{h})^2}_{\text{BBox Width/Height}}$$

This loss function depends on:

• The model’s prediction which, in turn, depends on the parameter values (weights) as well as the input (in this case, images).
• The ground truth corresponding to the input (labels; in this case, bounding boxes).
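The bounding-box loss above could be sketched as follows. This is a simplified illustration that assumes each box is a tuple `(x, y, w, h)` with coordinates normalized to the image size; the function name and the sample boxes are hypothetical.

```python
def bbox_loss(true_box, pred_box):
    """Sum of squared errors on the box center (x, y) and the box size (w, h)."""
    x, y, w, h = true_box
    x_hat, y_hat, w_hat, h_hat = pred_box
    center_term = (x - x_hat) ** 2 + (y - y_hat) ** 2
    size_term = (w - w_hat) ** 2 + (h - h_hat) ** 2
    return center_term + size_term

true_box = (0.5, 0.5, 0.25, 0.5)    # ground-truth label
pred_box = (0.5, 0.25, 0.25, 0.25)  # hypothetical model output
print(bbox_loss(true_box, pred_box))  # 0.125
```

In this sketch the prediction is a fixed tuple; in practice it would come from the model, so the loss depends on the model's parameters through `pred_box`.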

### Cost function

Note that the loss $\mathcal{L}$ takes as input a single example, so minimizing it doesn’t guarantee better model parameters for other examples.

It is common to minimize the average of the loss computed over the entire training data set: $\mathcal{J} = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}^{(i)}$. We call this function the cost. Here, $m$ is the size of the training data set and $\mathcal{L}^{(i)}$ is the loss of a single training example $x^{(i)}$ labelled $y^{(i)}$.
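A short sketch of the cost as an average of per-example losses, assuming the squared loss and NumPy arrays of labels and predictions (the values are made up for illustration):

```python
import numpy as np

def cost(y, y_hat):
    """Average of the per-example squared losses over m examples."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    per_example_loss = (y - y_hat) ** 2    # L^(i) for each example i
    return float(per_example_loss.mean())  # J = (1/m) * sum of L^(i)

y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 5.0, 4.0]  # only the third prediction is wrong
print(cost(y, y_hat))  # 1.0
```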

### Visualizing the cost function

For a given set of examples along with the corresponding ground truth labels, the cost function has a landscape that varies as a function of the parameters of the network.

It is difficult to visualize this landscape if there are more than two parameters. However, the landscape does exist, and our goal is to find the point where the cost function’s value is (approximately) minimal.

Updating the parameter values moves you across this landscape, either closer to or farther from the target minimum point.

### The model versus the cost function

It is important to distinguish between the function $f$ that will perform the task (the model) and the function $\mathcal{J}$ you are optimizing (the cost function).

• The model inputs an unlabeled example (such as a picture) and outputs a label (such as a bbox for a car). It is defined by an architecture and a set of parameters, and approximates a real function that performs the task. Optimized parameter values will enable the model to perform the task with relative accuracy.
• The cost function inputs a set of parameters and outputs a cost, measuring how well that set of parameters performs the task (on the training set).
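The distinction can be made concrete with a small sketch. It assumes a one-dimensional linear model and a mean squared error cost; `model` and `make_cost` are illustrative names, and the data are made up.

```python
import numpy as np

def model(x, w, b):
    """The model f: maps an input x to a prediction, given parameters (w, b)."""
    return w * x + b

def make_cost(x_train, y_train):
    """Build the cost J: maps parameters (w, b) to a number, with the data held fixed."""
    def cost(w, b):
        return float(np.mean((y_train - model(x_train, w, b)) ** 2))
    return cost

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([1.0, 3.0, 5.0])  # generated by y = 2x + 1

J = make_cost(x_train, y_train)
print(J(2.0, 1.0))  # 0.0, the parameters that generated the data
print(J(0.0, 0.0))  # a larger cost for a worse parameter choice
```

Note that `model` takes an example as its main input, whereas `J` takes only the parameters: the training set is baked into the cost.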

### Optimizing the cost function

Initially, good parameter values are unknown. However, you have a formula for the cost function. Minimize the cost function, and theoretically you will find good parameter values. The way to do this is to feed a training data set into the model and adjust the parameters iteratively to make the cost function as small as possible.

In summary, the way you define the cost function will dictate the performance of your model on the task at hand. The diagram below illustrates the process of finding a model that performs well.

# II   Running the optimization process

In this section, we assume that you have chosen a task, a data set, and a cost function. You will minimize the cost to find good parameter values.

To find parameter values that achieve a function’s minimum, you can either try to derive a closed form solution algebraically or approximate it using an iterative method. In machine learning, iterative methods such as gradient descent are often the only option because cost functions are dependent on a large number of variables, and there is almost never any practical way to find a closed form solution for the minimum.
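For contrast, here is one of the rare cases where a closed-form solution is practical: a linear model with a squared-error cost. The sketch assumes NumPy and uses `np.linalg.lstsq` to solve the least-squares problem directly; the noiseless data are made up for illustration.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 21)
y = 3.0 * x + 0.5  # hypothetical ground truth: slope 3, intercept 0.5

# Design matrix with a column of ones so the intercept is learned too.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [w, b] = y, computed without any iteration.
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # close to 3.0 and 0.5
```

For most neural networks no such direct solve exists, which is why the rest of this section focuses on iterative methods.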

For gradient descent, you must first initialize the parameter values so that you have a starting point for optimization. Then, you adjust the parameter values iteratively to reduce the value of the cost function. At every iteration, the parameter values are moved in the direction opposite to the gradient of the cost; that is, in the direction that reduces the cost.

The mathematical procedure to remember is:

$\quad \text{for x in dataset:}$

$\quad \quad \quad \hat{y} = model_W(x) \quad \quad \text{(predict)}$

$\quad \quad \quad W = W - \alpha \frac{\partial \mathcal{J}(y, \hat{y})}{\partial W} \quad \quad \text{(update parameters)}$

Where:

• $\hat{y}$ is the model’s prediction given an input $x$.
• $W$ denotes the parameters.
• $\frac{\partial \mathcal{J}}{\partial W}$ is a gradient indicating the direction to push the value $W$ to decrease $\mathcal{J}$.
• $\alpha$ is the learning rate which you can tune to decide how much you want to adjust the value of $W$ per iteration.
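The procedure above can be sketched for a linear model $\hat{y} = wx + b$ with a mean squared error cost. This is a minimal full-batch version under those assumptions; the function name, default hyperparameters, and data are illustrative.

```python
import numpy as np

def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.1, num_iters=500):
    """Full-batch gradient descent on the mean squared error of y_hat = w*x + b."""
    m = len(x)
    for _ in range(num_iters):
        y_hat = w * x + b                          # predict
        dw = (-2.0 / m) * np.sum((y - y_hat) * x)  # dJ/dw
        db = (-2.0 / m) * np.sum(y - y_hat)        # dJ/db
        w = w - alpha * dw                         # update parameters
        b = b - alpha * db
    return w, b

x = np.linspace(-1.0, 1.0, 21)
y = 3.0 * x + 0.5  # noiseless data from a hypothetical ground-truth line
w, b = gradient_descent(x, y)
print(w, b)  # close to 3.0 and 0.5
```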

You can learn more about gradient-based optimization algorithms in the Deep Learning Specialization. This topic is covered in Course 1, Week 2 (Neural Network Basics) and Course 2, Week 2 (Optimization Algorithms).

Note that the cost $\mathcal{J}$ takes as input the entire training data set, so computing it at every iteration can be slow. Instead, it is common to minimize the average of the loss computed over a batch of examples; for instance, $\mathcal{J}_{\text{mini-batch}} = \frac{1}{m_b} \sum_{i=1}^{m_b} \mathcal{L}^{(i)}$. The gradient of this function is cheaper to compute, so each parameter update is quicker, while the update direction still tends to reduce the training error. $m_b$ is called the batch size. This is a key hyperparameter to tune.

To use gradient descent, you must choose values for hyperparameters such as learning rate and batch size. These values will influence the optimization, so it’s important to set them appropriately.
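A mini-batch variant might look like the sketch below. It assumes the same linear model and squared-error loss, reshuffles the data every epoch, and uses noiseless made-up data so that the result is easy to check; all names and default values are illustrative.

```python
import numpy as np

def minibatch_sgd(x, y, batch_size=8, alpha=0.1, num_epochs=200, seed=0):
    """Mini-batch gradient descent for y_hat = w*x + b with a squared-error loss."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(num_epochs):
        order = rng.permutation(m)  # reshuffle the examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            err = yb - (w * xb + b)
            dw = -2.0 * np.mean(err * xb)  # gradient of the mini-batch cost
            db = -2.0 * np.mean(err)
            w = w - alpha * dw
            b = b - alpha * db
    return w, b

x = np.linspace(-1.0, 1.0, 64)
y = 3.0 * x + 0.5
w, b = minibatch_sgd(x, y)
print(w, b)  # close to 3.0 and 0.5
```

Each update here touches only `batch_size` examples, so one pass over the data produces several updates instead of one.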

In the visualization below, try to discover the parameters used to generate a dataset. You are provided the ground truth from which the data was generated (the blue line) so that you can compare it to your trained model (the red line). Play with the starting point of initialization, learning rate, and batch size. Here are some questions to consider as you explore the visualization:

• Why do the model parameters converge to values different from the ground truth?
• What is the impact of the training set size?
• What is the impact of the learning rate on the optimization?
• Why does the cost landscape look like this?

In this visualization, your goal is to recover the ground truth parameters used to generate a training set. You can fit a linear model $\hat{y} = wx + b$ on the training set using gradient descent. Press the button to generate a dataset of chosen size (1) and observe its impact on the cost landscape. Then, choose initial values of your model parameters by dragging the red dot (2) before optimizing (3).

### 1. Generate a dataset

Select a training set size $m$.

A training set of the chosen size will be sampled with noise from a line representing ground truth. This line is the target line for the model defined by $\hat{y} = wx + b$.

### 2. Observe the cost landscape and initialize parameters.

The cost function is the L2 loss (defined as $\mathcal{L}(y, \hat{y}) = ||y - \hat{y}||_2^2$) averaged over the training set. The blue dot indicates the value of the cost function at the ground-truth slope and intercept. The red dot indicates the value of the cost function at a chosen initialization of the slope and intercept. Drag and drop the red dot to change the initialization.

### 3. Optimize the cost function

Now you can update the parameters iteratively to minimize the cost. Select a learning rate.

Select a batch size.

Train the model.

Here are some takeaways from the visualization:

• Even if you choose the best possible hyperparameters, the trained model will not exactly match the provided ground truth (blue line) because the dataset is just a proxy for the ground-truth distribution.
• The larger the training set size, the closer your trained model parameters will be to the parameters used to generate the data.
• If your learning rate is too large, your algorithm won’t converge. If it is too small, your algorithm will converge slowly.
• If the initial point (the red dot) is close to the ground truth and the hyperparameters (learning rate and batch size) are tuned properly, your algorithm will converge quickly.
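The effect of training set size can be checked numerically. The sketch below assumes a hypothetical ground-truth line with slope 3 and intercept 0.5, Gaussian noise, and a least-squares fit via `np.polyfit`; it averages the slope error over many randomly sampled training sets.

```python
import numpy as np

def mean_slope_error(m, num_trials=200, noise_std=1.0, seed=0):
    """Average absolute error of the fitted slope over many training sets of size m."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(num_trials):
        x = rng.uniform(-1.0, 1.0, size=m)
        y = 3.0 * x + 0.5 + rng.normal(0.0, noise_std, size=m)
        w_fit, b_fit = np.polyfit(x, y, deg=1)  # least-squares line fit
        errors.append(abs(w_fit - 3.0))
    return float(np.mean(errors))

for m in (10, 100, 1000):
    print(m, mean_slope_error(m))
```

Under these assumptions the average error shrinks as `m` grows, but it does not reach zero for any finite training set.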

As you can see, each hyperparameter has a different impact on the convergence of your algorithm. Let’s dig deeper into each hyperparameter.

### Initialization

A good initialization can accelerate optimization and enable it to converge to the minimum or, if there are several minima, the best one. To learn more about initialization, read our AI Note, Initializing Neural Networks.

### Learning rate

The learning rate influences the optimization’s convergence. It also counterbalances the influence of the cost function’s curvature. According to the gradient descent formula above, the direction and magnitude of the parameter update are given by the learning rate multiplied by the gradient (slope) of the cost function at the current point $W$. Specifically: $\alpha \frac{\partial \mathcal{J}}{\partial W}$.

• If the learning rate is too small, updates are small and optimization is slow, especially if the cost curvature is low. Also, you’re likely to settle into a poor local minimum or plateau.
• If the learning rate is too large, updates will be large and the optimization is likely to diverge, especially if the cost function’s curvature is high.
• If the learning rate is chosen well, updates are appropriate and the optimization should converge to a good set of parameters.
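These three regimes can be reproduced on a one-parameter toy cost. The sketch assumes $\mathcal{J}(w) = \frac{1}{2} c\, w^2$, where $c$ is the curvature; each gradient step then multiplies $w$ by $(1 - \alpha c)$, so the iterates shrink only if $0 < \alpha < 2/c$. The function name and values are illustrative.

```python
def run_gd(curvature, alpha, w0=1.0, num_iters=50):
    """Gradient descent on J(w) = 0.5 * curvature * w**2, whose minimum is at w = 0."""
    w = w0
    for _ in range(num_iters):
        grad = curvature * w  # dJ/dw
        w = w - alpha * grad
    return w

curvature = 4.0
for alpha in (0.01, 0.25, 0.6):
    # 0.01: slow progress; 0.25: reaches the minimum; 0.6: diverges (0.6 > 2 / 4)
    print(alpha, abs(run_gd(curvature, alpha)))
```

Doubling the curvature halves the largest learning rate that still converges, which is why a good learning rate depends on the shape of the cost.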

Play with the visualization below to understand how learning rate and cost curvature influence an algorithm’s convergence.

In this visualization, you are trying to find the parameter corresponding to the minimum of a cost function (blue parabola) using gradient descent. At a given point (blue dot), the cost function is approximated by its slope (orange line). You can tune the cost curvature and the learning rate, which together determine the direction and value of each parameter update and the approximation error (red line).

The visualization illustrates that:

• What makes a good learning rate depends on the curvature of the cost function.
• Gradient descent makes a linear approximation of the cost function at a given point. Then it moves downhill along the approximation of the cost function.
• If the cost is highly curved, a larger learning rate (step size) makes the algorithm more likely to overshoot.
• Taking small steps reduces this problem, but also slows down learning.

It is common to start with a large learning rate — say, between 0.1 and 1 — and decay it during training. Choosing the right decay (how often? by how much?) is non-trivial. An excessively aggressive decay schedule slows progress toward the optimum, while a slow-paced decay schedule leads to chaotic updates with small improvements.
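The article does not commit to a particular schedule, so the following is only one common option, shown as an assumption: inverse-time decay, where the learning rate at a given epoch is $\alpha_0 / (1 + k \cdot \text{epoch})$. The function name and the values of $\alpha_0$ and $k$ are hypothetical.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Inverse-time decay: alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1.0 + decay_rate * epoch)

# A larger decay_rate shrinks the learning rate more aggressively.
for epoch in (0, 1, 3, 9):
    print(epoch, decayed_learning_rate(alpha0=0.5, decay_rate=1.0, epoch=epoch))
# learning rates: 0.5, 0.25, 0.125, 0.05
```

The two questions in the text map onto this sketch directly: "how often" is the unit in which `epoch` is counted, and "by how much" is `decay_rate`.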

As noted above, finding the best decay schedule is nontrivial. However, optimizers such as Momentum, Adam, and RMSprop help adjust the effective step size during the optimization process; Adam and RMSprop in particular adapt the learning rate for each parameter. We’ll explain those algorithms below.

### Batch size

Batch size is the number of data points used to train a model in each iteration. Typical small batches are 32, 64, 128, 256, 512, while large batches can be thousands of examples.

Choosing the right batch size is important for the convergence of the cost function and parameter values, and for the generalization of your model. Some research has considered how to make the choice, but there is no consensus. In practice, you can use a hyperparameter search.

Research into batch size has revealed the following principles:

• Batch size determines the frequency of updates. The smaller the batches, the more updates per pass over the data, and the quicker each update is to compute.
• The larger the batch size, the more accurately the batch gradient approximates the gradient of the full cost with respect to the parameters. That is, the update is more likely to point down the local slope of the cost landscape.
• Having larger batch sizes, but not so large that they no longer fit in GPU memory, tends to improve parallelization efficiency and can accelerate training.
• Some authors (Keskar et al., 2016) have also suggested that large batch sizes can hurt the model’s ability to generalize, perhaps by causing the algorithm to find poorer local optima/plateau.

In choosing batch size, there’s a balance to be struck depending on the available computational hardware and the task you’re trying to achieve.
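The link between batch size and gradient accuracy can be illustrated by measuring how much the batch gradient varies from one random batch to the next. The sketch assumes a noisy linear data set with made-up parameters and evaluates the gradient of the mean squared error with respect to the slope at $w = 0$, $b = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000
x = rng.uniform(-1.0, 1.0, size=m)
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.5, size=m)

def batch_gradient_w(idx, w=0.0, b=0.0):
    """dJ/dw of the mean squared error, estimated on the examples in idx."""
    xb, yb = x[idx], y[idx]
    return -2.0 * np.mean((yb - (w * xb + b)) * xb)

def gradient_spread(batch_size, num_batches=500):
    """Standard deviation of the batch gradient across random batches."""
    grads = []
    for _ in range(num_batches):
        idx = rng.choice(m, size=batch_size, replace=False)
        grads.append(batch_gradient_w(idx))
    return float(np.std(grads))

for batch_size in (4, 32, 256):
    print(batch_size, gradient_spread(batch_size))
```

A smaller spread means the batch gradient is a more reliable estimate of the full gradient, at the price of more computation per update.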

### Iterative update

Now that you have a starting point, a learning rate, and a batch size, it’s time to update the parameters iteratively to move toward the cost function’s minimum.

The optimization algorithm is also a core choice. You can play with various optimizers in the visualization below. That will help you build an intuitive sense of the pros and cons of each.

In the visualization below, your goal is to play with hyperparameters to find parameter values that minimize a cost function. You can choose the cost function and starting point of the optimization. Although there’s no explicit model, you can assume that finding the minimum of the cost function is equivalent to finding the best model for your task. For the sake of simplicity, the model only has two parameters and the batch size is always 1.

In this visualization, you can compare optimizers applied to different cost functions and initialization. For a given cost landscape (1) and initialization (2), you can choose optimizers, their learning rate and decay (3). Then, press the play button to see the optimization process (4). There's no explicit model, but you can assume that finding the cost function's minimum is equivalent to finding the best model for your task.

### 1. Choose a cost landscape

Select an artificial landscape $\mathcal{J}(w_1,w_2)$.

### 2. Choose initial parameters

On the cost landscape graph, drag the red dot to choose initial parameter values and thus the initial value of the cost.

### 3. Choose an optimizer

Select the optimizer(s) and hyperparameters.

Available controls: optimizer (for example, Momentum or RMSprop), learning rate, and learning rate decay.

### 4. Optimize the cost function

This 2D plot describes the cost function's value for different values of the two parameters $(w_1,w_2)$. The lighter the color, the smaller the cost value.

The graph below shows how the value of the cost changes through successive epochs for each optimizer.

### Choice of optimizer

The choice of optimizer influences both the speed of convergence and whether it occurs. Several alternatives to the classic gradient descent algorithms have been developed in the past few years and are summarized below, each with its update rule and main attributes. (Notation: $dW = \frac{\partial \mathcal{J}}{\partial W}$)

(Stochastic) Gradient Descent. Update rule:

$$W = W - \alpha dW$$

• Gradient descent can use parallelization efficiently, but is very slow when the data set is larger than the GPU's memory can handle, because the parallelization is then no longer optimal.
• Stochastic gradient descent usually converges faster than gradient descent on large datasets, because updates are more frequent. Plus, the stochastic approximation of the gradient is usually precise without using the whole dataset because the data is often redundant.
• Of the optimizers profiled here, stochastic gradient descent uses the least memory for a given batch size.

Momentum. Update rule:

$$\begin{aligned} V_{dW} &= \beta V_{dW} + ( 1 - \beta ) dW\\ W &= W - \alpha V_{dW} \end{aligned}$$

• Momentum usually speeds up the learning with a very minor implementation change.
• Momentum uses more memory for a given batch size than stochastic gradient descent but less than RMSprop and Adam.

RMSprop. Update rule:

$$\begin{aligned} S_{dW} &= \beta S_{dW} + ( 1 - \beta ) dW^2\\ W &= W - \alpha \frac{dW}{\sqrt{S_{dW}} + \varepsilon} \end{aligned}$$

• RMSprop’s adaptive learning rate usually prevents the learning rate decay from diminishing too slowly or too fast.
• RMSprop maintains per-parameter learning rates.
• RMSprop uses more memory for a given batch size than stochastic gradient descent and Momentum, but less than Adam.

Adam. Update rule, where $t$ is the iteration number:

$$\begin{aligned} V_{dW} &= \beta_1 V_{dW} + ( 1 - \beta_1 ) dW\\ S_{dW} &= \beta_2 S_{dW} + ( 1 - \beta_2 ) dW^2\\ V^{\text{corr}}_{dW} &= \frac{V_{dW}}{1 - \beta_1^t}\\ S^{\text{corr}}_{dW} &= \frac{S_{dW}}{1 - \beta_2^t}\\ W &= W - \alpha \frac{V^{\text{corr}}_{dW}}{\sqrt{S^{\text{corr}}_{dW}} + \varepsilon} \end{aligned}$$

• The hyperparameters of Adam (learning rate, exponential decay rates for the moment estimates, etc.) are usually set to predefined values (given in the paper), and do not need to be tuned.
• Adam performs a form of learning rate annealing with adaptive step-sizes.
• Of the optimizers profiled here, Adam uses the most memory for a given batch size.
• Adam is often the default optimizer in machine learning.
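The four update rules can be sketched in NumPy as follows. This is an illustrative implementation, not a reference one: the function names, the `state` dictionary, the toy cost $\mathcal{J}(w_1, w_2) = \frac{1}{2}(w_1^2 + 10 w_2^2)$, and the learning rates are all assumptions. Adam's bias correction divides by $1 - \beta^t$, as in the Adam paper.

```python
import numpy as np

def sgd_step(w, dw, state, alpha):
    return w - alpha * dw

def momentum_step(w, dw, state, alpha, beta=0.9):
    # Exponentially weighted average of past gradients.
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * dw
    return w - alpha * state["v"]

def rmsprop_step(w, dw, state, alpha, beta=0.9, eps=1e-8):
    # Exponentially weighted average of squared gradients, per parameter.
    state["s"] = beta * state.get("s", 0.0) + (1 - beta) * dw ** 2
    return w - alpha * dw / (np.sqrt(state["s"]) + eps)

def adam_step(w, dw, state, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    state["v"] = beta1 * state.get("v", 0.0) + (1 - beta1) * dw
    state["s"] = beta2 * state.get("s", 0.0) + (1 - beta2) * dw ** 2
    v_corr = state["v"] / (1 - beta1 ** t)  # bias-corrected first moment
    s_corr = state["s"] / (1 - beta2 ** t)  # bias-corrected second moment
    return w - alpha * v_corr / (np.sqrt(s_corr) + eps)

def minimize(step, alpha, w0=(1.0, 1.0), num_iters=1000):
    """Run an optimizer on the toy cost J(w) = 0.5 * (w1**2 + 10 * w2**2)."""
    w, state = np.array(w0, dtype=float), {}
    for _ in range(num_iters):
        dw = np.array([1.0, 10.0]) * w  # gradient of the toy cost
        w = step(w, dw, state, alpha=alpha)
    return w

# Hypothetical learning rates chosen for this toy problem only.
settings = {
    "sgd": (sgd_step, 0.1),
    "momentum": (momentum_step, 0.1),
    "rmsprop": (rmsprop_step, 0.01),
    "adam": (adam_step, 0.01),
}
for name, (step, alpha) in settings.items():
    print(name, minimize(step, alpha))
```

The memory comparison in the list above is visible in `state`: plain gradient descent stores nothing, Momentum and RMSprop store one extra array the size of `w`, and Adam stores two.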

Adaptive optimization methods such as Adam or RMSprop perform well in the initial portion of training, but they have been found to generalize poorly at later stages compared to stochastic gradient descent.

### Conclusion

Exploring optimization methods and hyperparameter values can help you build intuition for optimizing networks for your own tasks. During hyperparameter search, it’s important to understand intuitively the optimization’s sensitivity to learning rate, batch size, optimizer, and so on. That intuitive understanding, combined with the right method (random search or Bayesian optimization), will help you find the right model.

-

By definition, the loss function $\mathcal{L}$ has a low value when the model performs well on the task.

Do you know the mathematical formula that allows a neural network to detect cats in images? Probably not. But using data you can find a function that performs this task. It turns out that a convolutional architecture with the right parameters defines a function that can perform this task well.

Closed-form methods attempt to solve a problem in a finite sequence of algebraic operations. For instance, you can find the point achieving the minimum of $f(x) = x^2 + 1$ by solving $f'(x) = 0$, which leads to $2x = 0 \implies x=0$.

In addition to learning parameters for a model, you also need reasonable choices of hyperparameters that affect training, such as batch size and learning rate.

In theory, if you sampled infinitely many data points from the distribution and fit a linear model, you could recover the ground truth parameters.

We use the term poor local minimum because, in optimizing a machine learning model, the optimization is often non-convex and unlikely to converge to the global minimum.

Generalization refers to your model's ability to perform well on unseen data. In order to evaluate the generalization of your model, you can train your model on a training set and evaluate it on a hold-out test set.

For more information on hyperparameter tuning, see the Deep Learning Specialization Course 2, Week 3 (Hyperparameter Tuning, Batch Normalization and Programming Frameworks).

This term essentially describes inflection points (where the concavity of the landscape changes) for which the gradient is zero in some, but not all, directions.

Gradient descent makes a linear approximation of the cost function at a given point. It then moves downhill along the approximation of the cost function.

#### Authors

1. Kian Katanforoosh - Content and structure.
2. Daniel Kunin - Visualizations (created using D3.js and TensorFlow.js).
3. Jiaju Ma - Static and interactive graphics.

#### Acknowledgments

1. The template for the article was designed by Jingru Guo and inspired by Distill.
2. The loss landscape visualization adapted code from Mike Bostock's visualization of the Goldstein-Price function.
3. The banner visualization adapted code from deeplearn.js's implementation of a CPPN.

#### Footnotes

1. Chapter 4, Deep Learning, Goodfellow et al. (MIT Press)