Data science projects of all varieties require data. While you might be collecting data from your website, app or product, you might also want to analyse public datasets, or build experimental or academic models using pre-existing data.
Moreover, supervised machine learning projects require training data that is labelled and annotated to teach algorithms how to map the input to output functions. Training data for machine learning includes text data for NLP, audio for training conversational AIs, image and video for computer vision (CV) projects and various statistics for all manner classification and regression problems.
Datasets are also used in unsupervised machine learning, often for the sake of exploratory data analysis; hierarchical clustering, K-means clustering, Gaussian mixture models, principal component analysis, etc. Reinforcement learning also requires input data.
In any of these situations, you might need a dataset. Even if you build a model using your own data, it’s often useful to test it using public or open datasets.
This is a guide to finding datasets for data and computer science purposes.
Data For Analysis vs Data For Machine Learning
When you’re searching for datasets, it’s important to make a distinction between data that is suitable for exploration and analysis (with or without applying machine learning) and data suitable for training models.
Training data for supervised machine learning is not raw, unstructured data. Instead, these datasets need to be tuned to the problem space and label, or annotated, to make them learnable.
Labelled or Annotated Data
Data labelling is the process of applying annotations or signposts to data in the form of bounding boxes, polygons, pixel segmentation, masking and other types of labels (for images). Furthermore, datasets for training models might contain a training set as well as a test set. The test set is reserved from the training stage.
For NLP projects, e.g. named entity recognition or sentiment analysis, there are many ways to prepare text for algorithms by labelling words, phrases, grammar, syntax and other linguistic components separately, or reducing text to a ‘bag of words’. You can also combine unsupervised learning to sort and clean text data to extract features automatically without labelling. Similar concepts apply to numerical or audio data.
Fundamentally, all data has something in common. But, if you’re training models then you’ll need clean data which is well-structured and labelled. On the other hand, if you’re analysing the data only (e.g. importing it into a dashboard for visualisation), then you’ll only need clean data and won’t need to label anything.
The Qualities of a Good Dataset
Datasets are not all built the same. Since datasets are inherently finite and limited by the number of samples, it’s crucial to know what actually makes a dataset appropriate for a particular project.
1: Make Sure You Have Enough High-Quality Data
While it is generally true that more data = better, in machine learning, you also need to consider the bias-variance trade-off, which helps obtain accurate models that don’t underfit or overfit based on overly simple or overly complex data. For data analysis, it’s essential to obtain enough data to reach a robust conclusion with high confidence.
In both cases, you can generate new data from old data, and even ‘blend in’ some noisy or anomalous data if your dataset is too simple, sparse or general. Moreover, the data itself should be high-quality and clean. While many of the datasets included in this article are purpose-made for data science, it’s still sensible to vet for quality. When it comes to using self-made or mined data, you’ll need to clean and normalize your data.
2: Make Sure Your Data Is Unbiased
Bias has plagued machine learning projects. For example, Amazon’s recruitment AI was thrown on the backburner after it resulted in hiring prejudice against women. Why? Because the training data used was biased and unrepresentative, largely because it was based on a period where men over-featured in tech rules. In 2021, Google fired two AI researchers, Timnit Gebru and Margaret Mitchell, partly because they spoke out against the lack of inclusion in AI research. Autonomous vehicles have also encountered numerous problems with recognising dark-skinned pedestrians, again because training data was non-representative.
If you’re searching for datasets for either research, analysis or machine learning, then bias is an enduring issue that the human researcher must take into account. This is true regardless of whether you’re embarking on a project for commercial reasons or for academic or experimental research. It’s also worth mentioning that using a biased or prejudiced model that inflicts some form of harm onto an individual can result in committing a civil or criminal offence in some jurisdictions, including the UK.
3: Make Sure You Take Notice of Privacy Regulation
You cannot freely scrape data and use it in algorithms, especially if it involves personally identifiable information (PII). GDPR fines have been inflicted on companies that used PII and other sensitive data to conduct data science projects without explicit permission or justification.
This risk is circumvented when you use public or open datasets as these are strictly free from legal or regulatory control. Even so, always conduct legal and regulatory due diligence if you are a business or individual using potentially sensitive data for analysis or modelling.
The Best Datasets For Data Science and Machine Learning
The following section will provide datasets and guidance for data mining. Datasets are broken down by category into:
- Images and video, for various computer vision projects.
- Text, for NLP.
- Audio, for conversational AI and voice recognition.
- General datasets, for various data analysis projects and tasks.
1: General Datasets
The following are general or mixed datasets and data repositories. Much of this data is suitable for analysis and research, but some can be used for machine learning or many other data science projects.
Many countries make certain public sector and government data accessible, including:
Most of this data is available in various formats such as JSON, XML, CSV, HTML, etc. For example, the UK provides data on everything from the environment, healthcare and crime to government expenditure, transport, defence and education. Much of this data is excellent for academic studies and research that involves data analysis. There is also potential for using some of this data in machine learning. Some, like the Singaporean database, also include some great visualisations.
Kaggle is probably the most widely-known and well-used source of data. There are well over 20,000 public datasets on the site, covering everything from figures and statistics to text, audio and computer vision. Each dataset is well-categorised, and the entire interface is easy to search. Not all Kaggle datasets are open or public use – the licences for each dataset are displayed.
A very well-maintained dataset hosted on GitHub. This is an excellent source of data for science projects of various kinds but you can find a wide range of data on there covering everything from economics to energy, finance, search engines, sports and language. There are datasets designed specifically for machine learning, but everything is good to go for analysis.
Academic Torrents is a data-sharing platform that uses the BitTorrent protocol. The result is an open repository of all manner of data spanning pretty much everything you can think of. There is some 83TB of data on the site!
Google Datasets search does exactly what it says it does and allows users to search from a huge range of datasets uploaded to thousands of repositories across the web. This is an excellent starting point for many data science projects or academic studies.
Not all data available via the search engine is usable for whatever purpose you like, so make sure you check licences and usage restrictions. Google Finance and Google Public Data and Google Trends also provide data in downloadable formats for analysis.
This is Microsoft’s collection of free datasets which covers areas in NLP, computer vision and various domain sciences. There aren’t too many datasets on there, but they are mostly high-quality and ready to use for a variety of machine learning or analytical purposes.
The WHO’s dataset for global health. Covers everything from COVID-19 to fertility, air pollution, nutrition, maternal health, child health, cancer, etc. Read more about AI in healthcare here.
A media and data platform for opinion poll analysis, politics, economics and sports. Contains a wide variety of US-oriented data.
BigQuery datasets for public use. Works with Google Cloud.
Want to improve your data skills?
See the best data engineering & data science books
2: Computer Vision Datasets
The following are datasets for computer vision projects. Some of these datasets are pre-labelled.
Computer Vision – Public and Open Source Datasets
- AWS Registry of Open Data – This is Amazon Web Services’ database for open data and predominantly contains image data and statistics. Ideally suits CV projects built in AWS.
- CityScapes Dataset – Pixel segmentation images of street scenes CV projects.
- COCO Dataset – The Common Objects in Context dataset, which is primarily oriented towards object recognition.
- EarthData – EarthData is NASA’s centre for open datasets. It contains a vast range of earth sciences data covering everything from climate science to space, structural engineering and agriculture.
- Fashion MNIST – A clothing and fashion dataset with grayscale images.
- FloodNet – An image classification and semantic segmentation dataset for natural disasters. Contains images taken by unmanned aerial vehicles (UAVs)
- ImageNet – A large dataset of images designed for object recognition CV. ImageNet is built from the WordNet database.
- IMDB-Wiki – Contains 500k faces with age and gender labels.
- Kinetics – Google’s Kinetics dataset contains a massive 650,000 video clips of human-object and human-human actions
- Labeled Faces in the Wild – A facial recognition dataset that contains 13,000 faces.
- Mapillary Vistas Dataset – A high-quality geospatial and street dataset for urban semantic segmentation. Contains data from most continents.
- MPII Human Pose Dataset – Contains 25,000 images containing over 40,000 people with annotated body joints.
- NYU Depth V2 – An indoor object dataset for semantic segregation.
- MIT’s Places and Places2 – Some 1.8 million images grouped in 365 scene categories for object recognition and other CV projects.
- Open Images – A huge collection of images with 16M bounding boxes for 600 object classes.
- StanfordCars – 16,185 images of 196 classes of cars. Relatively dated and perhaps best for experimental or research purposes.
- The CIFAR-10 dataset – A large dataset of some 60,000 small 32×32 images.
- VisualGenome – Similar to ImageNet, Visual Genome contains images and related words.
3: Natural Language Processing Datasets
The following are datasets for NLP tasks. Some of this data can be used to train chatbots or conversational AIs. Some NLP and related tasks include text classification, question answering, audio-summarisation, translation or image captioning.
- 20 Newsgroups – Contains 20,000 documents collected from over 20 different newsgroups. Covers a wide variety of topics with or without metadata.
- Cornell Movie-Dialogs Corpus – Features a wide range of dialogues extracted from movies.
- Dictionaries for Movies and Finance – Domain-specific dictionaries for sentiment analysis. Entries are arranged with positive or negative connotations.
- Enron Dataset -: Some 500,000 anonymized emails from over 100 users.
- European Parliament Proceedings Parallel Corpus: Sentence pairs from EU Parliament. This is a multilingual resource of some 21 European languages.
- GitHub NLP Index – Contains links to many textual datasets.
- Google Blogger Corpus – Some 700,000 blog posts from blogger.com.
- HotpotQA – A question-answering dataset.
- Jeopardy – Over 200,000 questions from the US TV show.
- Legal Case Reports Dataset – Summaries of around 4000 various legal cases.
- MultiDomain Sentiment Analysis Dataset – Contains Amazon reviews for sentiment analysis. Some product categories have thousands of entries.
- OpinRank Dataset – Contains 300,000 reviews, for sentiment analysis.
- Project Gutenberg – Collection of literary texts. Lots of historical works in different languages.
- Recommender Systems Datasets – A huge range of datasets covering everything from health and fitness, video games and song data to social media and reviews. Contains lots of labels and metadata.
- Sentiment 140 – 160,000 tweets for sentiment analysis, organised by polarity, date, user, text, query, and ID.
- SMS Spam Collection – For spam filtering. Contains around 6000 messages tagged as legitimate or spam.
- The WikiQA Corpus – A large open-source set of question and sentence pairs. Taken from Bing query logs and Wikipedia. Contains some 3000 questions and over 29,000 answer sentences
- Twitter Support – Contains around 3 million tweets and replies, mostly oriented around customer support.
- Ubuntu Dialogue Corpus – Around 1 million tech support conversations.
- WordNet – A lexical database in English. Contains a huge range of nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms.
- Yahoo Language Data – Q&A datasets from Yahoo Answers.
- Yelp Reviews – Restaurant rankings and reviews.
4: Audio Datasets
Audio datasets are required for building speech recognition AIs, conversational AIs and many other AIs with audio sensors. Some datasets contain sounds classified by emotions. Others include ambient or background noise.
- AudioSet – Google Research’s database of some 2.1 million annotated videos, 5.8 thousand hours of audio across 527 classes.
- ESC Environmental Sound – Audio data 2000 environmental recordings. Covers everything from animals to ambience and weather noise.
- GitHub Audio Datasets – A huge range of well-maintained GitHub audio datasets.
- LibriSpeech – Contains some 1000 hours of English speech from audiobook clips.
- LJ Speech Dataset – 13,100 clips of short audiobook passages. Most involve a single speaker with transcription.
- M-AI Labs Speech Dataset – Contains some 1000 hours of audio with transcriptions. Features multiple languages and both male and female speakers.
- Noisy Speech Database – Contains both noisy and clean parallel speech. Used for training and testing speech recognition models for performance with background noise, etc.
- Spoken Wikipedia Corpora – Speech from Wikipedia articles in English, German, and Dutch.
- TowardsDataScience has an extensive list of other audio datasets here.
Summary: Websites to Find Data for Data Science Projects
Finding data for data science projects is reasonably straightforward, but be prepared to sort through the data and ensure that it’s clean and usable for your project. Licencing and usage restrictions may also complicate the process. It’s especially important to take notice of usage terms if you’re planning on publishing your model, copyright it or otherwise call it your own intellectual property.
While the very purpose of open and public datasets is freedom from mechanisms of control, it’s important to comply with the rules when you’re using more general datasets from state or public sector repositories.
If you are an academic, student or researcher, then do your due diligence based on your specific project. However, suppose you are a business, organisation or other commercial entity – it might be wise to seek legal assistance if you’re unsure how the data you use is licenced or controlled.
What is a dataset?
A dataset is an organised set of data. Datasets tend to be organised by theme or purpose and are usually well-structured. Datasets can be used for exploration or analysis, or for building and training models. Some datasets include labelled data for ML projects, but most are unlabelled.
What can you use datasets for?
Datasets can be used for many purposes, e.g. research purposes, academia, science and business. Data scientists who are looking to develop their skills can use datasets for experimental analysis, or for training models.