Source: O'Reilly Media
The loss function measures how well the model performs with the current set of parameters (weights and biases), and gradient descent is used to find the set of parameters that minimizes it. This is done by computing the partial derivatives of the loss with respect to each parameter at the current point and then iteratively moving through the search space in the direction of the negative gradient.
As training proceeds, the parameters of the model (weights) are updated in proportion to the derivative of the error, and the loss decreases until it reaches a minimum of the loss function, ideally the global one. The two key aspects of gradient descent are a) the direction in which to move and b) the size of the step (the learning rate, discussed below).
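To make the update rule concrete, here is a minimal sketch in plain Python. The one-parameter loss `(w - 3)^2`, the starting point, and the names `loss`, `gradient`, and `gradient_descent` are all assumptions chosen for illustration; a real model would have many parameters and a loss computed from data.

```python
def loss(w):
    """Toy loss with a single minimum at w = 3 (an assumed example)."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w_start
    for _ in range(steps):
        # Direction: negative gradient. Step size: learning_rate.
        w = w - learning_rate * gradient(w)
    return w


w_final = gradient_descent(w_start=0.0)
print(w_final, loss(w_final))
```

Each pass through the loop moves `w` a little closer to 3, and the steps shrink as the gradient flattens out near the minimum.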
Gradient Descent in action
Gradient descent is used when the model parameters cannot be calculated using straightforward math (e.g., linear algebra) and must be searched for using an optimization algorithm.
There are several variants of gradient descent including batch, stochastic, and mini-batch.
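The three variants differ mainly in how many training examples contribute to each parameter update. The sketch below is one hedged way to show this with a single function; the helper `fit_line`, its default values, and the noise-free toy data are assumptions made for the example, not a reference implementation.

```python
import random


def fit_line(xs, ys, batch_size, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the gradient of the squared error over the batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2 * x + 1 for x in xs]

batch_fit = fit_line(xs, ys, batch_size=len(xs))  # one update per epoch
sgd_fit = fit_line(xs, ys, batch_size=1)          # one update per example
mini_fit = fit_line(xs, ys, batch_size=4)         # one update per 4 examples
print(batch_fit, sgd_fit, mini_fit)
```

On this clean toy data all three settings should recover a slope near 2 and an intercept near 1. On real, noisy data the smaller batches give cheaper but noisier updates.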
There are also several optimization algorithms that build on gradient descent, including momentum, AdaGrad, Nesterov accelerated gradient, RMSprop, Adam, etc. Here is a blog post that covers the differences between these algorithms.
Gradient descent has a parameter called the learning rate, which sets the size of the steps taken as the network navigates the loss curve in search of the valley. If the learning rate is too high, the network may overshoot the minimum. If it is too low, training will take too long and may not reach the minimum in the available time, or it may get stuck in a local minimum.
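The effect of the learning rate can be seen on a small example. The sketch below runs the same number of steps with three different rates on an assumed toy loss `(w - 3)^2`; the function `run` and the specific rates are illustrative choices, and the thresholds at which a rate becomes "too high" depend on the loss being minimized.

```python
def run(learning_rate, steps=50, w_start=0.0):
    """Gradient descent on the toy loss (w - 3)^2; returns the final w."""
    w = w_start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= learning_rate * grad
    return w


too_low = run(learning_rate=0.001)  # barely moves away from the start
moderate = run(learning_rate=0.1)   # ends close to the minimum at 3
too_high = run(learning_rate=1.1)   # every step overshoots by more than the last

for name, w in [("too low", too_low), ("moderate", moderate), ("too high", too_high)]:
    print(f"{name:>8}: distance from minimum = {abs(w - 3.0):.4f}")
```

With the small rate, 50 steps are not enough to get near the minimum. With the large rate, the iterate jumps back and forth across the valley and moves further away each time.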
Source: Rohith Gandhi / Towards Data Science