Applied Math
Grade 11-12
Gradient Descent & Optimization Cheat Sheet
A printable reference covering objective functions, gradients, learning rate, gradient descent updates, convergence, and optimization pitfalls for grades 11-12.
Related Tools
Related Labs
Related Worksheets
Gradient descent is a method for finding the input values that make a function as small as possible. It is used in applied math, data science, physics, engineering, and machine learning. This cheat sheet helps students connect slopes, derivatives, and vectors to a practical optimization algorithm. It is useful when a formula is too complex to minimize by simple algebra.
Key Facts
- To minimize a one-variable function f(x), gradient descent uses the update x_new = x_old - alpha f'(x_old).
- For a multivariable function f(x, y), the gradient is grad f = <partial f/partial x, partial f/partial y>.
- Gradient descent moves in the negative gradient direction because the gradient points in the direction of steepest increase.
- The learning rate alpha controls step size, and a common update rule is theta_new = theta_old - alpha grad J(theta_old).
- A critical point occurs when grad f = 0, but it may be a minimum, maximum, or saddle point.
- Convergence means the updates become very small, often checked by |theta_new - theta_old| < tolerance or |grad J| < tolerance.
- A convex function has one global minimum, so gradient descent is less likely to get stuck at a non-best local minimum.
- If alpha is too large, the algorithm can overshoot or diverge, and if alpha is too small, the algorithm can be very slow.
Vocabulary
- Objective function
- The function being minimized or maximized in an optimization problem.
- Gradient
- A vector of partial derivatives that points in the direction where a multivariable function increases fastest.
- Learning rate
- The positive number alpha that controls how large each gradient descent step is.
- Convergence
- The process of an iterative method getting closer to a stable answer or minimum.
- Local minimum
- A point where the function value is lower than nearby points but not necessarily the lowest value overall.
- Convex function
- A function shaped so that any local minimum is also the global minimum.
Common Mistakes to Avoid
- Using x_new = x_old + alpha f'(x_old) when minimizing is wrong because it moves in the direction of increase, not decrease.
- Choosing a learning rate that is too large is wrong because the steps can jump over the minimum and make the values grow instead of shrink.
- Assuming grad f = 0 always means a minimum is wrong because the point could be a maximum or a saddle point.
- Stopping after one or two iterations is wrong because gradient descent usually needs repeated updates until the change or gradient is small.
- Forgetting that the gradient is a vector in multivariable problems is wrong because each variable needs its own update using its partial derivative.
Practice Questions
- 1 For f(x) = x^2 - 6x + 10, start at x = 0 with alpha = 0.1. Compute one gradient descent update.
- 2 For f(x) = (x - 4)^2, start at x = 10 with alpha = 0.25. Compute two gradient descent updates.
- 3 For J(a, b) = a^2 + 3b^2, find grad J at (2, -1) and write the update formula using learning rate alpha.
- 4 Explain why moving in the negative gradient direction helps minimize a function, and describe what can go wrong if the learning rate is too large.