Datasets and Machine Learning
Training data used in machine learning can take many forms, including images, MRI scans, text, CSV/tabular data (e.g., database queries), audio recordings, geospatial data (e.g., radar or vector), logs, time-series data (e.g., stock trades), binaries or computer applications, video frames, and many more.
Common sources of data in machine learning include AWS S3, Snowflake, Redshift, AWS EBS, BigQuery, and on-premise file systems.
There are several types of storage relevant to ML, though not all interface directly with ML pipelines. These include file systems, object storage, databases, and data warehouses/data lakes.
File systems are the oldest and most familiar type of storage; every laptop uses one. They are compatible with all ML frameworks and are easy to use, but they have limitations that become more pronounced at scale. At large companies, file systems are often distributed (e.g., Ceph, Gluster).
Network-attached storage (NAS) refers to a shared file system that is accessible by multiple users or compute nodes concurrently.
Object storage is a relatively new type of storage, popular in web apps, in which objects are organized in buckets. Object storage systems can scale massively and are very common in DevOps and web services. Each object also carries associated metadata.
AWS S3 is the best-known object store and is commonly used to host datasets for training.
Although databases do not interface directly with ML pipelines, many datasets originate from a database. A dataset must be extracted from a database and stored in a file system or object store such as S3 for training.
Common relational databases include Postgres and MySQL, common unstructured or NoSQL databases include MongoDB and CouchDB, and common time series databases include InfluxDB and Prometheus.
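The extraction step described above can be sketched as follows. This is a minimal illustration, not a production export job: it uses Python's built-in sqlite3 module as a stand-in for a database such as Postgres or MySQL, and the function name export_query_to_csv, the trades table, and the output file name are all hypothetical.

```python
import csv
import sqlite3
from pathlib import Path


def export_query_to_csv(conn, query, out_path):
    """Run `query` on `conn` and write the result set to a CSV file.

    Returns the number of data rows written (the header is not counted).
    """
    cursor = conn.execute(query)
    # cursor.description holds one tuple per result column; item 0 is the name.
    columns = [col[0] for col in cursor.description]
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        # Iterate the cursor instead of calling fetchall() so that large
        # result sets are streamed rather than held in memory.
        for row in cursor:
            writer.writerow(row)
            rows_written += 1
    return rows_written


if __name__ == "__main__":
    # Hypothetical example data: an in-memory table of stock trades.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (ts TEXT, symbol TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO trades VALUES (?, ?, ?)",
        [("2024-01-02", "ABC", 10.5), ("2024-01-03", "ABC", 10.7)],
    )
    n = export_query_to_csv(
        conn, "SELECT ts, symbol, price FROM trades ORDER BY ts", Path("trades.csv")
    )
    print(f"wrote {n} rows")
```

The resulting file can then be copied to a file system or uploaded to an object store, where a training pipeline can read it.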
Data warehouses are used to store petabyte-scale data that companies collect from various sources. Like databases, they are not used as a direct dataset source for training.
A dataset must be extracted from the data lake or data warehouse and stored in a file system or an object store such as S3 for training.
It is standard practice to partition data into two or three datasets: training, test, and, optionally but recommended, validation. Each should be sampled at random from a larger body of data.
Training dataset: The sample of data used to train the model. The model learns from this data.
Validation dataset: The sample of data used to evaluate the model during development to see how the model performs on new data. This dataset is also used while tuning model hyperparameters. During the hyperparameter tuning phase, the model sees this data, but it does not learn from the data, since hyperparameters are not learnable parameters. The validation set is primarily used to avoid overfitting to the training data.
Test dataset: The holdout sample of data that is used to evaluate the final model after training and tuning are complete.
Source: Andrew Ng's Machine Learning Coursera class
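A random three-way partition of this kind might look like the sketch below. It uses only the Python standard library; the function name random_split and the 70/15/15 default proportions are illustrative assumptions, not values taken from the source.

```python
import random


def random_split(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle `samples` and partition them into (train, val, test) lists."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(samples)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # everything not held out
    return train, val, test


if __name__ == "__main__":
    train, val, test = random_split(range(100))
    print(len(train), len(val), len(test))
```

Fixing the seed matters in practice: if the split changes between runs, examples can move from the test set into the training set, and evaluation results stop being comparable.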
It is recommended to use a curated approach to creating these datasets, not necessarily just a random split. This is to ensure that the data being tested against represents new real-world data the model will see in the future.
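One possible curated approach, shown here purely as an assumption about what such a split could look like, is to partition by time so that the model is evaluated on data newer than anything it was trained on. The function name chronological_split, the "date" field, and the cutoff dates are hypothetical.

```python
from datetime import date


def chronological_split(records, val_start, test_start, key=lambda r: r["date"]):
    """Partition `records` by time into (train, val, test) lists.

    Records before `val_start` go to training, records from `val_start` up to
    (but not including) `test_start` go to validation, and the rest go to test.
    """
    if val_start > test_start:
        raise ValueError("val_start must not be later than test_start")
    train = [r for r in records if key(r) < val_start]
    val = [r for r in records if val_start <= key(r) < test_start]
    test = [r for r in records if key(r) >= test_start]
    return train, val, test


if __name__ == "__main__":
    # Hypothetical records; only the "date" field is used for splitting.
    records = [
        {"date": date(2023, 1, 15), "label": 0},
        {"date": date(2023, 5, 2), "label": 1},
        {"date": date(2023, 8, 20), "label": 0},
        {"date": date(2023, 11, 9), "label": 1},
    ]
    train, val, test = chronological_split(
        records, val_start=date(2023, 7, 1), test_start=date(2023, 10, 1)
    )
    print(len(train), len(val), len(test))
```

Unlike a random split, this keeps future records out of the training set, which is closer to how the model will be used once deployed. Whether it is the right choice depends on the data; other curated splits group by user, site, or device instead.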
Public datasets are large (typically labeled) datasets made available for public consumption and are often the starting point for developing a model. There are many famous public datasets, such as MNIST, ImageNet, CIFAR-10, MS COCO, Sentiment140, IMDB, and LSUN. Many come from academia and some come from industry.