Gradient Descent

Gradient descent is an iterative optimization algorithm used in machine learning to minimize a loss function.

The loss function describes how well the model performs given its current set of parameters (weights and biases), and gradient descent is used to find the set of parameters that minimizes it. This is achieved by computing the gradient (the partial derivatives) of the loss at the current point and then iteratively stepping through the parameter space in the direction of the negative gradient.

At each step, the parameters of the model (weights) are updated in proportion to the derivative of the error, and this continues until the loss stops improving, ideally at a minimum of the loss function. The two key aspects of gradient descent are (a) the direction in which to move and (b) the size of each step (the learning rate, discussed below).
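
To make the update rule concrete, here is a minimal sketch of plain gradient descent on a made-up one-parameter loss L(w) = (w − 3)²; the loss function, learning rate, and starting point are illustrative choices, not from this article.

```python
# Toy loss L(w) = (w - 3)**2 with gradient dL/dw = 2 * (w - 3).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

learning_rate = 0.1   # step size
w = 0.0               # arbitrary starting parameter value

for step in range(50):
    w = w - learning_rate * grad(w)   # move in the negative gradient direction

print(w)  # approaches the minimum at w = 3
```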

Gradient descent is used when the model parameters cannot be calculated analytically (e.g., with linear algebra) and must instead be found with an optimization algorithm.

There are several variants of gradient descent, including batch, stochastic, and mini-batch gradient descent, which differ in how much of the training data is used to compute each parameter update.
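
As a rough illustration of how the variants differ, the sketch below runs mini-batch updates on a toy linear-regression problem; setting `batch_size` to the full dataset size gives batch gradient descent, and setting it to 1 gives stochastic gradient descent. The data and hyperparameters are invented for the example.

```python
import numpy as np

# Synthetic data: 100 examples, 3 features, known true weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

def gradient(w, Xb, yb):
    # Gradient of the mean squared error for a linear model on one batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr = 0.1
batch_size = 10   # mini-batch; use len(X) for batch GD, or 1 for SGD

for epoch in range(100):
    idx = rng.permutation(len(X))          # shuffle examples each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w -= lr * gradient(w, X[batch], y[batch])

print(w)  # should approach true_w
```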

Gradient descent has a hyperparameter called the learning rate, which controls the size of the steps taken as the network navigates the loss curve in search of the valley. If the learning rate is too high, the network may overshoot the minimum; if it is too low, training will take a long time and may never reach the minimum, or may get stuck in a local minimum.
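
A quick illustration of this trade-off, reusing the toy quadratic loss from the sketch above; the three learning-rate values are arbitrary choices meant to show slow progress, healthy convergence, and divergence from overshooting.

```python
def grad(w):
    return 2.0 * (w - 3.0)   # gradient of L(w) = (w - 3)**2

for lr in (0.01, 0.1, 1.1):   # too small, reasonable, too large
    w = 0.0
    for _ in range(50):
        w -= lr * grad(w)
    print(f"lr={lr}: w after 50 steps = {w:.4f}")

# lr=0.01 crawls toward 3, lr=0.1 gets very close,
# and lr=1.1 overshoots the minimum on every step and diverges.
```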

There are also several gradient-based optimization algorithms, including momentum, Adagrad, Nesterov accelerated gradient, RMSprop, and Adam. Here is a blog post that covers the differences between these algorithms.
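
As a hedged sketch of just one of these, the momentum variant below keeps a running "velocity" of past gradients so that steps build up along consistent directions and oscillations are damped; the hyperparameter values are illustrative, and the other optimizers are covered in the linked post.

```python
def grad(w):
    return 2.0 * (w - 3.0)   # same toy quadratic loss as before

lr, beta = 0.1, 0.9          # step size and momentum coefficient (illustrative)
w, velocity = 0.0, 0.0

for _ in range(200):
    velocity = beta * velocity + grad(w)   # accumulate past gradients
    w -= lr * velocity                     # step using the smoothed direction

print(w)  # still converges toward the minimum at w = 3
```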

Check out the in-depth explanation of Gradient Descent in this blog post.

[Figure: Source: O'Reilly Media]
[Figure: Gradient Descent in action. Source: Rohith Gandhi / Towards Data Science]