
50 Public Sources for Machine Learning Datasets


Photo by Markus Winkler on Unsplash

Curated public datasets are widely used to learn data science and machine learning. But their utility in real-world commercial projects is often overlooked.

You must have heard that data scientists spend 80% of their time collecting, cleaning, and preparing the data. Evaluating the viability of an idea will require a decent amount of data. So, do you want to invest time and effort in collecting data only to discover that your idea does not work?

It is far cheaper and faster to try your idea on a dataset similar to the one you need to collect. First, train a (publicly available) model on a similar dataset to see if the results are to your liking.

Public datasets can help you rapidly prototype and get clarity about the data you need to collect, saving you time and money.
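As a rough illustration of such a feasibility check, the sketch below compares a very simple model against a majority-class baseline on a held-out slice of data. The rows are synthetic placeholders standing in for a public dataset, and the function names and the 10-point gain threshold are assumptions made for this example, not a recommended standard.

```python
import math
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


def nearest_neighbor_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of a 1-nearest-neighbor classifier using Euclidean distance."""
    correct = 0
    for x, y in zip(test_x, test_y):
        nearest = min(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
        correct += train_y[nearest] == y
    return correct / len(test_y)


def looks_feasible(model_acc, baseline_acc, min_gain=0.10):
    """Hypothetical rule of thumb: the model must beat the baseline by min_gain."""
    return model_acc - baseline_acc >= min_gain


# Synthetic toy rows (features, label); replace with rows from a public dataset.
rows = [
    ((1.0, 1.1), "a"), ((0.9, 1.0), "a"), ((1.2, 0.8), "a"), ((1.1, 1.2), "a"),
    ((3.0, 3.1), "b"), ((2.9, 3.3), "b"), ((3.2, 2.8), "b"), ((3.1, 3.0), "b"),
]

# Hold out every fourth row for testing; train on the rest.
train = [r for i, r in enumerate(rows) if i % 4 != 3]
test = [r for i, r in enumerate(rows) if i % 4 == 3]

train_x, train_y = [x for x, _ in train], [y for _, y in train]
test_x, test_y = [x for x, _ in test], [y for _, y in test]

baseline_acc = majority_baseline_accuracy(train_y, test_y)
model_acc = nearest_neighbor_accuracy(train_x, train_y, test_x, test_y)
print(baseline_acc, model_acc, looks_feasible(model_acc, baseline_acc))
```

If the model cannot clearly beat such a trivial baseline on a similar public dataset, that is a cheap early signal to rethink the idea before collecting your own data.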

This article gives you a list of 50 public sources for machine learning datasets where you can search and download datasets suitable for your needs.

Datasets from Academic Institutes

Machine Learning has a much longer history in academic research at universities. So it is not surprising that some of the most versatile open datasets were curated by universities.

These are my first go-to places for finding a similar dataset, for two reasons. First, they offer very diverse datasets for all sorts of machine learning problems. Second, they typically have permissive licenses that do not prohibit commercial use (more about that later).

  1. UCI Machine Learning Repository: Datasets for a very diverse set of problems and tasks.

  2. GroupLens Datasets (by Univ of Minnesota): Datasets for recommendation systems for various item types (movies, books, jokes, etc.)

  3. Harvard Dataverse: More than 100k datasets used in research projects.

  4. The Internet Archive: Dataset archives from websites.

  5. Dataverse open-source research data repository: Datasets used in research papers in conferences and journals.

  6. Common Crawl Data: Open repository of web crawl data.

  7. DBpedia: Open Knowledge Graph extracted from Wikipedia.

  8. UK Data Service: UK’s largest collection of economic, social, and population data for research and teaching.

  9. OpenML Datasets: Around 4000 datasets shared by the ML community.

  10. Academic Torrents Datasets: A community-maintained distributed repository for datasets.

Datasets by Major Cloud Providers

Amazon, Microsoft, and Google have rich dataset directories. Some of their datasets are available in cloud data warehouses, which is handy if you are doing machine learning on the cloud.

If you are doing machine learning on AWS, Microsoft Azure, or Google Cloud, you should look for a similar dataset here. These are set up to be used easily on the cloud and have permissive licenses.

However, these are not as diverse as those in the previous section.

  1. AWS Open Data Registry: datasets available as AWS resources.

  2. Microsoft Research Open Data: Free datasets from Microsoft Research.

  3. Microsoft Azure Open Datasets (Catalog): Curated datasets available on Azure.

  4. Google Public Data Directory: Library of public datasets from various sources.

  5. Google Dataset Search: A search engine for datasets.

  6. Google Cloud Datasets: Datasets available on Google Cloud.

  7. Google BigQuery Public Datasets: Datasets stored in BigQuery data warehouse.

Government Datasets

Before data science became a thing, it was known as statistics. Governments have been collecting all sorts of data for more than half a century. If you are building models in sociology, economic development, education, and health care, a government is quite likely to have a dataset sufficiently similar to what you need for your problem.

These are very rich datasets, and it may take you some time to locate the right dataset (but that time will be a fraction of what you will need to collect your own datasets). The licenses are lenient as well to encourage usage.

  1. Indian Government’s Data: Various datasets from the Indian union government and state governments.

  2. European Union’s Data: European Union’s official data source.

  3. UK Government Data: Data published by UK’s central government, local authorities, and public bodies.

  4. US Government’s Data: Diverse datasets from the US government.

  5. US Government Census Data: US census information like education, employment, health, and housing broken down to zip code level.

  6. US Bureau of Labor Statistics: Unemployment, pay & benefits, spending, and other employment data.

  7. US Congressional Budget Office: Budget, economic outlook, and projections.

  8. US Centers for Disease Control and Prevention: Data for alcohol abuse and various diseases.

  9. US Medicare Data: Medicare & Medicaid data.

Socioeconomic Datasets by World Bodies

Just like governments, world bodies such as the United Nations, WHO, and the World Bank have rich socioeconomic datasets.

If you are working on models that span multiple countries, standardizing and stitching together datasets from multiple governments can take quite an effort. Since the goal is to quickly evaluate ML model feasibility before embarking on expensive data collection, cleaning, and labeling, the datasets from world bodies are a better option.

  1. United Nations Data (Catalog): Population (and migration), labor market, agriculture, production, price indices, trade, crime, health, environment, tourism, and development data for various countries.

  2. UNICEF Data: Childbirth, health, hygiene, nutrition, mortality, education, and development data from United Nations International Children’s Emergency Fund.

  3. World Health Organization Data: WHO data on health, diseases, pandemics, immunization, pollution, and environmental health.

  4. World Bank Data (Catalog): Economy, growth, agriculture, education, energy, mining, debt, infrastructure, poverty, trade, rural and urban development data.

  5. International Monetary Fund (IMF) Data: GDP, trade, price index, exchange rate, monetary and financial data.

  6. Asian Development Bank: Similar to World Bank Data but for countries in Asia.

  7. Organization for Economic Co-operation and Development (OECD): Similar to World Bank data but mainly for OECD member countries.
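To give a feel for the standardizing and stitching effort mentioned above, here is a minimal sketch that maps two differently shaped tables onto a common schema keyed by country code. The column names, the country codes "AAA" and "BBB", and all values are placeholders invented for illustration; they are not real statistics or the format of any particular source.

```python
def normalize(rows, country_key, value_key, scale=1.0):
    """Map source-specific rows onto a common {country_code: value} schema."""
    return {
        row[country_key].strip().upper(): float(row[value_key]) * scale
        for row in rows
    }


def stitch(**sources):
    """Join normalized sources on country code; missing values become None."""
    codes = sorted(set().union(*(s.keys() for s in sources.values())))
    return [
        {"country": code, **{name: s.get(code) for name, s in sources.items()}}
        for code in codes
    ]


# Placeholder tables: different key names, casing, and units per source.
source_a = [
    {"iso3": "aaa", "pop_thousands": "1500"},
    {"iso3": "BBB", "pop_thousands": "20"},
]
source_b = [
    {"Country Code": "AAA ", "gdp_usd": "3.5e9"},
]

population = normalize(source_a, "iso3", "pop_thousands", scale=1000)
gdp = normalize(source_b, "Country Code", "gdp_usd")

table = stitch(population=population, gdp=gdp)
for row in table:
    print(row)
```

Every extra government source tends to add another set of key names, units, and gaps to reconcile, which is why a dataset already harmonized by a world body can save so much time during prototyping.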

Financial Data by Stock Exchanges and Central Banks

Almost all stock exchanges provide historical trading data, and central banks of almost every nation publish all kinds of financial data. These are your go-to sources for trying time-series models on stock prices, analyzing stock market trends over a period of time, and studying how equities relate to other asset classes or industrial indicators.

  1. India’s National Stock Exchange (NSE) Historical Data: Trading data of Nifty indices and stocks traded on NSE.

  2. Reserve Bank of India (RBI) Data: Financial data from India’s central bank.

  3. NASDAQ Historical Data (https://www.nasdaq.com/market-activity/quotes/historical and https://data.nasdaq.com/): NASDAQ indices and stock trading data.

  4. NYSE Historical Data: Data for stocks traded on New York Stock Exchange.

  5. US Federal Reserve Data: Financial data from USA’s central bank.

  6. Yahoo! Finance (how-to steps): Data for stocks traded on various stock exchanges around the world.
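Before reaching for a full time-series model, a first look at downloaded price history might be as simple as the sketch below. It assumes a CSV export with "Date" (ISO format) and "Close" columns; those column names are an assumption, so adjust them to whatever your source actually provides. The prices are synthetic.

```python
import csv
import io


def load_closes(csv_text):
    """Parse CSV text and return (date, close) pairs sorted by date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    rows = sorted(reader, key=lambda r: r["Date"])
    return [(r["Date"], float(r["Close"])) for r in rows]


def daily_returns(closes):
    """Simple return between each pair of consecutive closing prices."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]


def moving_average(values, window):
    """Trailing moving average; the first window - 1 points are dropped."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Synthetic sample, deliberately out of order to exercise the sort.
sample_csv = """Date,Close
2024-01-03,110.0
2024-01-02,100.0
2024-01-04,99.0
2024-01-05,108.0
"""

history = load_closes(sample_csv)
closes = [price for _, price in history]
returns = daily_returns(closes)
smoothed = moving_average(closes, window=2)
print(returns)
print(smoothed)
```

In practice you would read the file with `open(path, newline="")` instead of an inline string, and a library such as pandas would make the same steps shorter.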

Image Datasets

There are a number of image datasets. Over time, these have grown very large and richly annotated. These datasets suffice as a starting point for most ML solutions for image-related tasks.

  1. ImageNet Dataset: The famous image dataset, organized according to the WordNet hierarchy.

  2. Common Objects in Context (COCO) Dataset: 300K images (with >200K labeled) with 1.5 million object instances across 80 object categories.

  3. Google Open Image Dataset: Large-scale image datasets like COCO.

  4. VisualData: Community curated Computer Vision datasets.

Miscellaneous Dataset Sources

You may have a problem that does not fall into the broad categories described above. That does not mean that there is no dataset to help you quickly evaluate its feasibility.

Google’s dataset search engine listed earlier is a good starting point. You can search for the businesses relevant to your task, e.g., Yelp for business reviews, Airbnb for rental properties, and YouTube for video.

Here are some datasets for sports, news, and other businesses.

  1. FiveThirtyEight: Sports and election datasets from ABC News.

  2. BuzzFeed News Data: News, crime, polls data curated by BuzzFeed News.

  3. Yelp Open Dataset: Business review dataset from Yelp.

  4. Airbnb Data: Listings and reviews of properties in various cities.

  5. YouTube Video Dataset: YouTube data with human-verified segment annotations.

  6. Kaggle Datasets: Datasets from various Kaggle competitions.

  7. Wikipedia Database: Data dump of all available content on Wikipedia.

Do Check the License

Please check the dataset’s license, especially before using it in a commercial project. A dataset being public does not necessarily imply that you are free to use it as you please. Here are the most common dataset licenses.

Open Data Commons (ODC)

Community Data License Agreement (CDLA)

  • CDLA Sharing: You are allowed to use, share, and enhance the dataset, but you must give credit and share your data enhancements under the same license. It does not impose any obligations or restrictions on results obtained from the computational use of the data.

  • CDLA Permissive: Same as CDLA Sharing except it gives you the choice to distribute your work under a different license as long as you also include this license for the original dataset.

Creative Commons (CC)

PDDL and CC0 are the most permissive licenses; they amount to the creator renouncing all rights. ODC-By and CC-BY come next, as they require only attribution acknowledging the use of the dataset. CDLA-Permissive and then CDLA-Sharing follow, as they do not impose any restriction or obligation on the ML model (a result obtained from the computational use of the data) that you create.

For the rest of the licenses, it depends on whether you intend to use the dataset for commercial purposes. Share-Alike licenses are generally considered “viral”, and No-Derivative licenses are the most restrictive.
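The ordering above can be captured in a small lookup table for triaging candidate datasets. This is only a sketch of the ranking described in this section: the tier numbers, function names, and dataset names are hypothetical, and any license outside the table is simply flagged for manual review rather than judged.

```python
# Tiers follow the ordering described in the text; lower means more permissive.
PERMISSIVENESS_TIER = {
    "PDDL": 1,
    "CC0": 1,
    "ODC-By": 2,
    "CC-BY": 2,
    "CDLA-Permissive": 3,
    "CDLA-Sharing": 4,
}
NEEDS_REVIEW = 99  # placeholder tier for licenses not ranked in the text


def tier(license_id):
    """Return the permissiveness tier, or NEEDS_REVIEW if the license is unranked."""
    return PERMISSIVENESS_TIER.get(license_id, NEEDS_REVIEW)


def needs_manual_review(license_id):
    """True when the license must be read closely, e.g. for commercial use."""
    return license_id not in PERMISSIVENESS_TIER


def rank_candidates(datasets):
    """Sort candidate datasets from most to least permissive license."""
    return sorted(datasets, key=lambda d: (tier(d["license"]), d["name"]))


# Hypothetical candidates; names are made up for illustration.
candidates = [
    {"name": "reviews-corpus", "license": "CC-BY-NC"},
    {"name": "city-listings", "license": "CDLA-Sharing"},
    {"name": "census-extract", "license": "CC0"},
    {"name": "transit-feeds", "license": "ODC-By"},
]

ranked = rank_candidates(candidates)
for d in ranked:
    flag = " (review before commercial use)" if needs_manual_review(d["license"]) else ""
    print(f'{d["name"]}: {d["license"]}{flag}')
```

A table like this only helps you decide which license texts to read first; it does not replace reading them.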

Summary

Public datasets are useful both for learning data science and machine learning and for rapidly prototyping ideas in commercial settings. Only if an idea is promising does one need to embark on costly data collection and labeling.


Written by Satish Chandra Gupta, Data/ML Practitioner