Datasets and Machine Learning

Training data used in machine learning can take many forms, including images, MRI scans, text, CSV/tabular data (e.g., database queries), audio recordings, geospatial data (e.g., radar or vector), logs, time-series data (e.g., stock trades), binaries and other computer applications, video frames, and more.

The most common sources of data in machine learning include AWS S3, Snowflake, Redshift, AWS EBS, BigQuery, and on-premises file systems.

Data Sources

There are several types of storage relevant to ML, though not all interface directly with ML pipelines. These include file systems, object storage, databases, and data warehouses/data lakes.

File Systems

File systems are the oldest and most familiar type of storage: every laptop uses one. They are compatible with all ML frameworks and easy to use, but they have limitations that are exacerbated at scale. At large companies, file systems are often distributed (e.g., Ceph, Gluster).

Network-attached storage (NAS) refers to a shared file system that is accessible by multiple users or compute nodes concurrently.
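To make this concrete, here is a minimal sketch of how training code typically enumerates files on a local or NAS-mounted file system; the dataset path is hypothetical:

```python
from pathlib import Path

# Hypothetical dataset location; a NAS mount looks identical to a local path.
DATA_ROOT = Path("/mnt/shared/datasets/cats-vs-dogs")

def list_training_images(root: Path) -> list[Path]:
    """Recursively collect JPEG files from a (possibly network-mounted) file system."""
    return sorted(root.rglob("*.jpg"))

if __name__ == "__main__":
    images = list_training_images(DATA_ROOT)
    print(f"Found {len(images)} training images under {DATA_ROOT}")
```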

Object Storage

Object storage is a relatively new type of storage in which data is stored as objects organized into buckets, each object carrying its own associated metadata. Object storage systems can be scaled massively and are very common in DevOps and web services.

AWS S3 is the best-known object store and is widely used to host datasets for training.
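As a sketch of pulling a dataset out of S3, assuming boto3 (the AWS SDK for Python) and hypothetical bucket and key names:

```python
import boto3

# Hypothetical bucket and object key; substitute your own dataset location.
BUCKET = "my-training-data"
KEY = "datasets/mnist/train-images.gz"

def download_dataset(bucket: str, key: str, destination: str) -> None:
    """Copy a single object out of S3 onto the local file system for training."""
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, destination)

if __name__ == "__main__":
    download_dataset(BUCKET, KEY, "train-images.gz")
```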

Databases

Although databases do not interface directly with ML pipelines, many datasets originate from a database. A dataset must be extracted from a database and stored in a file system or object store such as S3 for training.

Common relational databases include Postgres and MySQL; common NoSQL (unstructured) databases include MongoDB and CouchDB; and common time-series databases include InfluxDB and Prometheus.
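As an illustration of that extract step, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for any relational database; the table and file names are hypothetical, and only the connection call would change with another driver:

```python
import csv
import sqlite3

def export_query_to_csv(db_path: str, query: str, csv_path: str) -> None:
    """Run a SQL query and dump the result set to a CSV file for training."""
    with sqlite3.connect(db_path) as conn:
        cursor = conn.execute(query)
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(col[0] for col in cursor.description)  # header row
            writer.writerows(cursor)

if __name__ == "__main__":
    # Build a tiny demo database so the sketch runs end to end.
    with sqlite3.connect("app.db") as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS transactions (id INTEGER, amount REAL)")
        conn.execute("INSERT INTO transactions VALUES (1, 9.99)")
    export_query_to_csv("app.db", "SELECT * FROM transactions", "transactions.csv")
```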

Data Warehouses & Data Lakes

Data warehouses and data lakes store the petabyte-scale data that companies collect from many different sources. Like databases, they are not used as a direct dataset source for training.

A dataset must be extracted from the data lake or data warehouse and stored in a file system or an object store such as S3 for training.
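Here is a sketch of that extract step against BigQuery, assuming the google-cloud-bigquery client and configured GCP credentials; the project and table names are hypothetical. The same extract-then-store pattern applies to Snowflake, Redshift, and others via their own client libraries:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Hypothetical warehouse table; requires GCP credentials to be configured.
QUERY = "SELECT * FROM `my-project.analytics.user_events` LIMIT 100000"

def extract_training_data(query: str, csv_path: str) -> None:
    """Pull a query result out of the warehouse and persist it as a flat file."""
    client = bigquery.Client()
    df = client.query(query).to_dataframe()  # runs the job and waits for the result
    df.to_csv(csv_path, index=False)

if __name__ == "__main__":
    extract_training_data(QUERY, "user_events.csv")
```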

Train, Test, & Validation Sets Explained

It is standard practice to partition data into two or three datasets: training, test, and (recommended) validation. Each should be a random sample from a larger body of data.

Training dataset: The sample of data used to train the model. The model learns from this data.

Test dataset: The holdout sample of data that is used to evaluate the final model after training and tuning are complete.

Validation dataset: The sample of data used to evaluate the model during development, to see how the model performs on new data. This dataset is also used while tuning the model's hyperparameters. During the hyperparameter tuning phase, the model sees this data, but it does not learn from it, since hyperparameters are not learnable parameters. The validation set is primarily used to avoid overfitting to the training data.

It is recommended to use a curated approach to creating these datasets, not necessarily just a random split, to ensure that the data the model is tested against represents the new real-world data it will see in the future.

Source: Andrew Ng's Machine Learning Coursera class
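For a purely random split, here is a minimal sketch using scikit-learn and synthetic data; the 80/10/10 proportions are illustrative, and a curated split as recommended above would replace the random sampling with deliberate selection:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset: 1,000 samples with 10 features each.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Carve out the 10% test set first, then split the remainder into
# train/validation, yielding roughly an 80/10/10 partition.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1 / 9, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```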

Public Datasets

These are large (typically labeled) datasets made available for public consumption, and they are often the starting point for developing a model. There are many famous public datasets, such as MNIST, ImageNet, CIFAR-10, MS COCO, Sentiment140, IMDB, LSUN, and more. Many come from academia and some come from industry.
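Many of these ship with ML libraries. As a sketch, Keras can download and cache MNIST in a single call; torchvision, Hugging Face Datasets, and sklearn.datasets offer similar loaders:

```python
from tensorflow.keras.datasets import mnist

# Downloads MNIST to a local cache on first use, then loads it from disk.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```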
