Sign in to save

Bookmark this page so you can find it later.

Sign in to save

Bookmark this page so you can find it later.

Gradient descent is a method for finding the input values that make a function as small as possible. It is used in applied math, data science, physics, engineering, and machine learning. This cheat sheet helps students connect slopes, derivatives, and vectors to a practical optimization algorithm. It is useful when a formula is too complex to minimize by simple algebra.

Key Facts

  • To minimize a one-variable function f(x), gradient descent uses the update x_new = x_old - alpha f'(x_old).
  • For a multivariable function f(x, y), the gradient is grad f = <partial f/partial x, partial f/partial y>.
  • Gradient descent moves in the negative gradient direction because the gradient points in the direction of steepest increase.
  • The learning rate alpha controls step size, and a common update rule is theta_new = theta_old - alpha grad J(theta_old).
  • A critical point occurs when grad f = 0, but it may be a minimum, maximum, or saddle point.
  • Convergence means the updates become very small, often checked by |theta_new - theta_old| < tolerance or |grad J| < tolerance.
  • A convex function has one global minimum, so gradient descent is less likely to get stuck at a non-best local minimum.
  • If alpha is too large, the algorithm can overshoot or diverge, and if alpha is too small, the algorithm can be very slow.

Vocabulary

Objective function
The function being minimized or maximized in an optimization problem.
Gradient
A vector of partial derivatives that points in the direction where a multivariable function increases fastest.
Learning rate
The positive number alpha that controls how large each gradient descent step is.
Convergence
The process of an iterative method getting closer to a stable answer or minimum.
Local minimum
A point where the function value is lower than nearby points but not necessarily the lowest value overall.
Convex function
A function shaped so that any local minimum is also the global minimum.

Common Mistakes to Avoid

  • Using x_new = x_old + alpha f'(x_old) when minimizing is wrong because it moves in the direction of increase, not decrease.
  • Choosing a learning rate that is too large is wrong because the steps can jump over the minimum and make the values grow instead of shrink.
  • Assuming grad f = 0 always means a minimum is wrong because the point could be a maximum or a saddle point.
  • Stopping after one or two iterations is wrong because gradient descent usually needs repeated updates until the change or gradient is small.
  • Forgetting that the gradient is a vector in multivariable problems is wrong because each variable needs its own update using its partial derivative.

Practice Questions

  1. 1 For f(x) = x^2 - 6x + 10, start at x = 0 with alpha = 0.1. Compute one gradient descent update.
  2. 2 For f(x) = (x - 4)^2, start at x = 10 with alpha = 0.25. Compute two gradient descent updates.
  3. 3 For J(a, b) = a^2 + 3b^2, find grad J at (2, -1) and write the update formula using learning rate alpha.
  4. 4 Explain why moving in the negative gradient direction helps minimize a function, and describe what can go wrong if the learning rate is too large.