Gradient Descent & Optimization Cheat Sheet

Gradient descent is a method for finding the input values that make a function as small as possible. It is used in applied math, data science, physics, engineering, and machine learning. This cheat sheet helps students connect slopes, derivatives, and vectors to a practical optimization algorithm.

It is useful when a formula is too complex to minimize by simple algebra.

Key Facts

To minimize a one-variable function f(x), gradient descent uses the update x_new = x_old - alpha f'(x_old).
For a multivariable function f(x, y), the gradient is grad f = <partial f/partial x, partial f/partial y>.
Gradient descent moves in the negative gradient direction because the gradient points in the direction of steepest increase.
The learning rate alpha controls step size, and a common update rule is theta_new = theta_old - alpha grad J(theta_old).
A critical point occurs when grad f = 0, but it may be a minimum, maximum, or saddle point.
Convergence means the updates become very small, often checked by |theta_new - theta_old| < tolerance or |grad J| < tolerance.
A convex function has one global minimum, so gradient descent is less likely to get stuck at a non-best local minimum.
If alpha is too large, the algorithm can overshoot or diverge, and if alpha is too small, the algorithm can be very slow.

Vocabulary

Objective function: The function being minimized or maximized in an optimization problem.
Gradient: A vector of partial derivatives that points in the direction where a multivariable function increases fastest.
Learning rate: The positive number alpha that controls how large each gradient descent step is.
Convergence: The process of an iterative method getting closer to a stable answer or minimum.
Local minimum: A point where the function value is lower than nearby points but not necessarily the lowest value overall.
Convex function: A function shaped so that any local minimum is also the global minimum.

Common Mistakes to Avoid

Using x_new = x_old + alpha f'(x_old) when minimizing is wrong because it moves in the direction of increase, not decrease.
Choosing a learning rate that is too large is wrong because the steps can jump over the minimum and make the values grow instead of shrink.
Assuming grad f = 0 always means a minimum is wrong because the point could be a maximum or a saddle point.
Stopping after one or two iterations is wrong because gradient descent usually needs repeated updates until the change or gradient is small.
Forgetting that the gradient is a vector in multivariable problems is wrong because each variable needs its own update using its partial derivative.

Practice Questions

1 For f(x) = x^2 - 6x + 10, start at x = 0 with alpha = 0.1. Compute one gradient descent update.
2 For f(x) = (x - 4)^2, start at x = 10 with alpha = 0.25. Compute two gradient descent updates.
3 For J(a, b) = a^2 + 3b^2, find grad J at (2, -1) and write the update formula using learning rate alpha.
4 Explain why moving in the negative gradient direction helps minimize a function, and describe what can go wrong if the learning rate is too large.

Sign in to save

Sign in to save

Gradient Descent & Optimization Cheat Sheet

Related Tools

Related Labs

Related Worksheets

Related Infographics

Study as Flashcards

Key Facts

Vocabulary

Common Mistakes to Avoid

Practice Questions