Ultimate Source of Datasets for Machine Learning Projects

James Phoenix

Data science and machine learning require data – that much is obvious!

But obtaining the data required for projects is not always easy. Fortunately, there are near limitless data sources available online, many of which are public and/or open. In fact, leading AIs such as those produced by OpenAI are trained using open data from the internet. 

When using open data from the internet, it’s important to bear in mind the potential drawbacks of public data. The internet is not a panacea for data science, and even though there are huge volumes of free data available for use, this doesn’t guarantee the quality of the data.

It’s always necessary to consider bias in internet datasets. For example, many image classification datasets contain primarily white men, which has led to problematic results for computer vision (CV) projects.
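One simple way to spot this kind of imbalance before training is to count how often each group appears in a dataset's annotations. The sketch below is illustrative only: the group names, the sample labels and the 20% threshold are all assumptions, not values from any real dataset.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(labels, min_share=0.2):
    """List groups whose share falls below min_share (an arbitrary threshold)."""
    shares = group_shares(labels)
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical demographic annotations for a small image dataset.
sample_labels = ["group_a"] * 8 + ["group_b"] * 1 + ["group_c"] * 1

shares = group_shares(sample_labels)
flagged = underrepresented(sample_labels)
print(shares)
print(flagged)
```

A check like this only shows how the labels are distributed. It cannot tell you whether the labels themselves are accurate or whether other, unannotated sources of bias exist.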

Nevertheless, public data for machine learning and data science projects are extremely useful in both professional and educational contexts. 

This is a large list of open and public datasets split into several useful categories.


1: Governmental Datasets

Firstly, there are lots of government data projects that provide free data for general use. Government data covers a huge range of public sector and private sector services, including: 

  • Business and economy: Data on large and small businesses, industry, trade, imports and exports, etc. 
  • Crime and justice: Public data on everything from court and crime data to policing, immigration and prison.
  • Defence: Armed forces-related data. 
  • Education: Students, qualifications, university uptake, etc.
  • Environment: Weather trends, changes to environmental usage, flooding, pollution, air quality, geology and agriculture.
  • Government and government spending: Finances of government and public sector departments. 
  • Health: Covering disease, smoking, drugs, alcohol, medicine and GP and hospital stats, etc. 
  • Mapping: Land, addresses, boundaries, terrain and environmental data.
  • Society: Employment stats, welfare and benefits, household finances, poverty and deprivation, population and demographics. 
  • Towns and cities: Housing, building, infrastructure, land use, etc. 
  • Transport: Airports, roads, freight, cars, electric vehicles, trains and buses.

You can find data in the above categories on sites such as:


Data is typically available in XML, CSV, PDF and DOC formats and sometimes JSON. In addition, you can find GitHub topics on using government data, such as github.com/topics/uk-government. Simply use the search box to find relevant projects. 
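Because the same records may be published in several of these formats, it can help to normalise them into one structure early on. The following is a minimal sketch using only Python's standard library. The field names (region, population) and the sample values are hypothetical and do not come from any real government dataset.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical samples of the same two records in three formats.
CSV_TEXT = "region,population\nNorth,1200\nSouth,3400\n"
JSON_TEXT = '[{"region": "North", "population": 1200}, {"region": "South", "population": 3400}]'
XML_TEXT = (
    "<rows>"
    "<row><region>North</region><population>1200</population></row>"
    "<row><region>South</region><population>3400</population></row>"
    "</rows>"
)


def from_csv(text):
    """Parse CSV text into a list of dicts with an integer population."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"region": r["region"], "population": int(r["population"])} for r in reader]


def from_json(text):
    """Parse a JSON array of objects into the same structure."""
    return [{"region": r["region"], "population": int(r["population"])} for r in json.loads(text)]


def from_xml(text):
    """Parse <row> elements into the same structure."""
    root = ET.fromstring(text)
    return [
        {"region": row.findtext("region"), "population": int(row.findtext("population"))}
        for row in root.iter("row")
    ]


records = from_csv(CSV_TEXT)
print(records)
print(records == from_json(JSON_TEXT) == from_xml(XML_TEXT))
```

Real government files are rarely this tidy, so expect to handle missing values, inconsistent column names and differing encodings. PDF and DOC files usually need dedicated extraction tools.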


2: Open Data From High-Authority Sources 

Government data sources are great but aren’t always optimised for data science and machine learning uses. The following sources are well-maintained data sources created by top authorities: 

  1. Amazon Datasets: The Registry of Open Data on AWS.
  2. Awesome Public Datasets: A massive, topic-centric list of public data sources hosted on GitHub. They are collected and tidied from across the site. Other excellent datasets can be found in sindresorhus’s list.
  3. Dataset Search by Google: A Google-run dataset search tool for searching many types of data by keyword – essentially a data search engine. 
  4. EarthData – NASA’s gateway to NASA Earth Observation Data. 
  5. Kaggle: One of the most well-known dataset search engines and databases. Covers practically every topic and problem in ML and data science, complete with plenty of contextual information and a rich community. All data scientists and anyone else working in data should familiarise themselves with Kaggle.
  6. KDNuggets: Wide array of high-quality datasets, which includes many listed here. 
  7. Microsoft Research Open Data – Microsoft’s datasets covering everything from healthcare and science to crime and education. A great resource for scientific data. 
  8. OpenML: Dataset search optimised for machine learning and artificial intelligence projects. 
  9. UCI Machine Learning Repository: A machine learning and AI repository maintained by the University of California, Irvine (UCI). Used in many high-echelon ML and AI projects.
  10. Visual Data Discovery: An excellent data source designed specifically for computer vision. 
  11. GitHub: You can find many pre-made and pre-processed datasets on GitHub. Many are combined in the Awesome Public Datasets list, but it’s always worth seeing what others are doing along the same lines as your project(s). 

3: Open Data For Computer Vision (CV)

Computer vision (CV) is one of AI and ML’s great pillars. There are hundreds of labelled and unlabelled computer vision datasets out there. Many of the above datasets contain data suitable for computer vision tasks, but the below datasets are built with CV specifically in mind:

  1. CityScapes Dataset – A large dataset that contains stereo video recorded in street scenes from 50 different cities. Contains around 25,000 frames. 
  2. COCO Dataset – the Common Objects in Context dataset for object detection, segmentation, and image captioning. 
  3. ImageNet – Massive dataset of bounding box labelled images for object recognition. Linked to the WordNet database for NLP. 
  4. Kinetics – 650,000 video clips of complex human motions and actions.
  5. Labeled Faces in the Wild – 13,000 faces for benchmarking facial recognition tasks. 
  6. Mapillary Vistas Dataset – Large-scale street-level imagery dataset for various image classification and semantic segmentation tasks. 
  7. NYU Depth V2 – Indoor semantic segmentation dataset featuring interior rooms.
  8. Places and Places2 – Well-known datasets for scene recognition projects. Features a massive 1.8 million images grouped across some 365 scene categories.
  9. StanfordCars – 16,185 images of 196 classes of cars.
  10. The CIFAR-10 dataset – A dataset of 60,000 small 32×32 pixel colour images across 10 classes.
  11. VisualGenome – Datasets that connect images to the corresponding language to structure various concepts in language. 
  12. YouTube Labeled Dataset: Human-verified labels for around 237,000 video segments across 1000 classes. 
  13. Fashion-MNIST: A training set of 60,000 examples and a test set of 10,000 examples consisting of clothing and garments in 28×28 grayscale images across 10 classes.
  14. IMDB Faces: Large dataset of over 500,000 face images with age and gender labels. 
  15. MPII Human Pose Dataset: Includes some 25,000 images containing around 40,000 people with annotated body joints. 

4: Open Data for Natural Language Processing (NLP)

The datasets in the next list are built and designed for NLP tasks. They can be used to train or test chatbots, caption images, and train auto-translation and auto-summarisation tools. NLP data can also be used to train image classification models by teaching them the language concepts related to visual imagery.

For example, OpenAI’s original DALL-E is built on a version of GPT-3 trained on text-image pairs, so the model links written and visual concepts.

  1. Cornell Movie-Dialogs Corpus – Large dataset extracted from movie dialogs. Features 220,579 conversational exchanges between 10,292 pairs of movie characters from 617 movies. 
  2. European Parliament Proceedings Parallel Corpus – Multilingual sentence pair resource across 21 languages.
  3. GitHub NLP Index – GitHub list of NLP datasets. 
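  4. HotpotQA – A question-answering dataset built around multi-hop questions, where answering requires reasoning over more than one document.
  5. Jeopardy – Archive of questions and answers from the US TV quiz show.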
  6. Legal Case Reports Dataset – Text summaries of approximately 4000 various legal cases.
  7. MultiDomain Sentiment Analysis Dataset – Sentiment analysis resource of Amazon reviews across product categories. 
  8. OpinRank Dataset – Contains 300,000 reviews of cars and hotels for sentiment analysis. 
  9. SMS Spam Collection – SMS spam resource with around 6000 messages tagged as legitimate or spam.
  10. The WikiQA Corpus – From Bing and Wikipedia, with some 3000 questions and 29,000 answer sentences. 
  11. Ubuntu Dialogue Corpus – Well-known resource of approximately 1 million tech support conversations with a total of some 7 million utterances and 100 million words.
  12. WordNet – Lexical database with nouns, verbs, adjectives and adverbs grouped into useful sets.
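Many of these corpora ship as simple delimited text, so a first exploration needs very little tooling. As a rough sketch, the code below assumes a file layout of one message per line with a label, a tab and then the text, which is how the SMS Spam Collection is commonly distributed. The three sample messages are invented for illustration.

```python
from collections import Counter

# Invented sample in an assumed "label<TAB>message" layout.
SAMPLE = (
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tWIN a free prize now, reply WIN to claim\n"
    "ham\tRunning late, see you at six\n"
)


def parse_messages(text):
    """Split each non-empty line into a (label, message) pair."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        label, message = line.split("\t", 1)
        rows.append((label, message))
    return rows


def word_counts_by_label(rows):
    """Count lower-cased, whitespace-separated tokens for each label."""
    counts = {}
    for label, message in rows:
        counts.setdefault(label, Counter()).update(message.lower().split())
    return counts


rows = parse_messages(SAMPLE)
label_totals = Counter(label for label, _ in rows)
counts = word_counts_by_label(rows)
print(label_totals)
print(counts["spam"].most_common(3))
```

Whitespace splitting is deliberately crude here (punctuation stays attached to words). A real project would typically use a proper tokeniser before building features.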

5: Datasets for Recommendation Engines

  1. MovieLens – Contains several real-world and synthetic datasets for recommendation systems and research. The MovieLens 25M Dataset contains 25 million movie ratings and 1 million tag applications applied to 62,000 movies by 162,000 users, while the MovieLens Tag Genome Dataset 2021 contains 10.5 million computed tag-movie relevance scores for 9,734 movies.
  2. Jester – Online joke recommendation system featuring anonymous ratings data from 73,421 users.
  3. Recommender Systems Datasets – Many datasets appropriate for recommendation and personalisation training and analysis. Covers topics from health and fitness to music, art, video games and even beer.
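To give a feel for how ratings data like this is used, here is a minimal sketch of a popularity baseline: the average rating per movie, ignoring movies with too few ratings. It assumes a ratings file with userId, movieId, rating and timestamp columns (the layout MovieLens uses for its ratings CSV). The rows themselves are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in an assumed userId,movieId,rating,timestamp layout.
RATINGS_CSV = """userId,movieId,rating,timestamp
1,10,4.0,1000000000
2,10,5.0,1000000001
1,20,3.0,1000000002
3,20,2.0,1000000003
3,30,4.5,1000000004
"""


def mean_rating_per_movie(text, min_ratings=1):
    """Average the ratings for each movie that has at least min_ratings ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        movie_id = int(row["movieId"])
        totals[movie_id] += float(row["rating"])
        counts[movie_id] += 1
    return {m: totals[m] / counts[m] for m in totals if counts[m] >= min_ratings}


means = mean_rating_per_movie(RATINGS_CSV)
popular = mean_rating_per_movie(RATINGS_CSV, min_ratings=2)
# Rank movies with at least two ratings from best to worst average.
ranking = sorted(popular, key=popular.get, reverse=True)
print(means)
print(ranking)
```

A baseline like this is not personalised, but it gives a useful point of comparison for collaborative filtering or other recommendation models.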

6: Audio Datasets

Audio datasets are used to build audio and text translation models, auto-transcription, conversational AIs and many other AIs that can convert spoken natural language into text and vice-versa. 

OpenAI’s Whisper is probably the most modern incarnation of an automatic speech recognition (ASR) system and is trained on 680,000 hours of supervised data collected from the web. Some of that data may have come from sources like the following:

  1. AudioSet – Contains a whopping 2.1 million annotated videos and 5.8 thousand hours of audio across 527 classes.
  2. ESC Environmental Sound – Audio data of around 2000 environmental recordings to help classify background noise and environmental audio. 
  3. GitHub Audio Datasets – Large list of audio datasets in GitHub. 
  4. LJ Speech Dataset – 13,100 short clips of a single speaker reading book passages, each with an associated transcription.
  5. M-AI Labs Speech Dataset – Around 1000 hours of audio with associated transcriptions across different languages and male/female speakers. 
  6. Noisy Speech Database – A noisy speech resource for training AIs to recognise speech in natural environments. 
  7. Spoken Wikipedia Corpora – Speech from Wikipedia articles in English, German, and Dutch. 
  8. Find more audio datasets here.

Using Public and Open Data

When you combine these datasets with feature engineering, the potential is near-limitless. 

Of course, you’ll need to pull data from business or organisation systems for many applications. Open data is certainly not the remedy for all data science projects, but it’s very useful for training large-scale AIs, as OpenAI keeps proving with models that are often trained on open internet data.

Using open data in data science projects is an excellent way to build one’s portfolio, and it can also be used to build models that draw upon real data across a range of subjects. For example, you could use weather data to delve into seasonality or demographic data to model regional sales trends.
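As a rough illustration of the weather example, the sketch below groups temperature readings by calendar month and measures how far each month sits from the overall average, which is a very simple seasonal feature. The observations are hypothetical values, not real weather data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (date, temperature in °C) observations across two years.
observations = [
    (date(2023, 1, 5), 2.0), (date(2023, 1, 20), 4.0),
    (date(2023, 7, 5), 20.0), (date(2023, 7, 20), 22.0),
    (date(2024, 1, 10), 3.0), (date(2024, 7, 10), 21.0),
]


def monthly_means(obs):
    """Average the readings for each calendar month, pooling all years."""
    buckets = defaultdict(list)
    for day, value in obs:
        buckets[day.month].append(value)
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}


def seasonal_deviation(obs):
    """Return each month's mean minus the overall mean of all readings."""
    overall = sum(value for _, value in obs) / len(obs)
    return {month: mean - overall for month, mean in monthly_means(obs).items()}


means = monthly_means(observations)
deviation = seasonal_deviation(observations)
print(means)
print(deviation)
```

The per-month deviation could then be joined onto another dataset, such as sales records, as an input feature.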


Summary: Ultimate Source of Datasets For Machine Learning Projects

These datasets provide near-endless opportunities for AI and ML projects of every discipline and flavour, ranging from NLP and CV to building ASR and recommendation systems. 

Datasets listed on Kaggle and GitHub are ideal for many projects as they’re pretty much ready to use and deploy.

Open datasets can be used fairly liberally, with few rules on how you use them, but there are some exceptions. For example, it might not be possible to monetise projects developed with data that is merely public, whereas ‘open data’ has a more precise meaning that allows total freedom of use.


FAQ

What is open data?

Strictly speaking, open data is data that is openly accessible, usable, exploitable, editable and shareable by anyone for any purpose, including commercial purposes.

What is public data?

Public data is in the public domain or intended for the public domain. It’s not the same as open data, though open data is public by default. Public data can generally be freely accessed and shared, but unlike open data it may carry restrictions on reuse, and there is some debate over what data can be classed as public.

What can I use open data for?

Open data is extremely useful for all manner of personal, academic and commercial uses. Businesses use public data to feed their models, helping them create accurate forecasts and analyses. For science, open data can be used to track everything from air quality to changes in environmental land use. Open data movements seek to place the usefulness of data into the hands of those that can use it for a positive end.

