Siphelele Danisa’s Post

This week, we will focus on a connection I have found interesting recently involving stochastic gradient descent (SGD). We have talked about SGD at a high level before, and this week it is worth taking a step back to loosely define the algorithm. SGD is an iterative method for optimizing an objective function, used widely in machine learning and deep learning for training models. It can be regarded as a stochastic approximation of gradient descent, since it replaces the actual gradient, computed from the entire dataset, with an estimate computed from a randomly selected subset (a minibatch) of the data. Especially in high-dimensional optimization problems, this reduces the computational burden, trading a likely lower convergence rate for much cheaper iterations.

The result attached below provides an alternative perspective on SGD in the continuous-time limit (that is, assuming an infinitesimal step size). Researchers have used this continuous-time formulation to prove several properties of the algorithm; on Thursday, we will see some of these properties and yet another perspective. In the image below: D is called the diffusion matrix and is defined as the covariance of the stochastic gradients in SGD, f is the function being minimized, eta is the step size, and the small curly b is the batch size.
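Before getting to that continuous-time result, here is a minimal minibatch SGD sketch in Python on a toy least-squares problem, just to make the discrete update concrete; the problem, variable names, and hyperparameters are purely illustrative.

import numpy as np

# Toy least-squares objective f(w) = (1/N) * sum_i (x_i . w - y_i)^2.
# Everything below (sizes, seed, step size) is an illustrative choice.
rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

eta = 0.05   # step size ("eta" above)
b = 32       # batch size (the "small curly b" above)
w = np.zeros(d)

for step in range(2000):
    idx = rng.choice(N, size=b, replace=False)      # randomly selected subset of the data
    residual = X[idx] @ w - y[idx]
    grad_est = 2.0 * X[idx].T @ residual / b        # minibatch estimate of the full gradient
    w -= eta * grad_est                             # SGD update

print(np.linalg.norm(w - w_true))   # should end up small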

  • [Image: the continuous-time (stochastic differential equation) result referenced above, involving D, f, eta, and b; no alternative text was provided.]
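Since the image carries no alt text, here is one common way this continuous-time limit is written in the literature (roughly the formulation used by Chaudhari and Soatto). I am assuming, not asserting, that the attached result takes this form, and the exact scaling convention is precisely what the comment thread below is about.

% SDE approximation of SGD (one common convention; other papers rescale differently):
%   x_t  : parameters,   f : objective,   eta : step size,   b : batch size,
%   D(x) : covariance of the per-sample stochastic gradients.
\[
  \mathrm{d}x_t \;=\; -\nabla f(x_t)\,\mathrm{d}t
      \;+\; \sqrt{2\beta^{-1} D(x_t)}\;\mathrm{d}W_t,
  \qquad \beta^{-1} \;=\; \frac{\eta}{2b}.
\]
% Under this convention the effective diffusion coefficient is (eta/2b) times the
% per-sample gradient covariance, i.e. roughly eta/2 times the covariance of the
% minibatch gradient, which is the scaling point raised in the comments.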
Yuri Robbertze

Lecturer at University of Cape Town, Quantitative Risk Analyst at Old Mutual Limited

1mo

Why do you call eta "eta", but beta "the small curly b"? 😂

Rishit Dagli

CS UG University of Toronto | AI Research, Qualcomm | Research ML, Vision UofT, Vector Institute | Prev: Civo, SpaceX, JWST | RT Kubernetes 1.26-9, TEDx, TED-Ed

1mo

No, D is not equal to the SGN (stochastic gradient noise) covariance; it is a scaled version of that: it would be eta/2 times the SGN covariance, right?
