Where to Find Data: A Curated List of Free Datasets and Repositories
In data science and machine learning, data is the new oil—it’s the fuel that powers our models and insights. But finding the right data can be a challenge. Whether you’re a seasoned researcher, a data scientist looking for a competition, or a student just starting, knowing where to look is key.

Here’s a breakdown of some of the most popular and useful open data repositories available today.
Google Dataset Search
Think of this as Google, but specifically for data. Google Dataset Search doesn’t host data itself, but it indexes datasets from thousands of repositories across the web. It’s an excellent starting point when you have a specific topic in mind and want to see what’s available from government, academic, and private sources.
Visit Repository
Hugging Face Datasets
The Hugging Face Hub has become the central community for all things AI, especially Natural Language Processing (NLP). Alongside its famous transformers library, it hosts thousands of datasets, all easily accessible through their datasets library. It’s incredibly convenient for loading and preprocessing text, audio, and image data directly into your workflow.
Kaggle.com
Kaggle is best known for its machine learning competitions, but it’s also a massive community and data-hosting platform. You can find thousands of user-published datasets on almost any topic imaginable. Each dataset often comes with a “Code” section where you can see how others have analyzed the data, making it a fantastic learning environment.
Visit Repository
UC Irvine Machine Learning Repository
A true classic. The UCI Machine Learning Repository is one of the oldest dataset archives on the web (since 1987). It’s a staple in machine learning education. While dataset are often smaller and cleaner, they are perfect for benchmarking algorithms and learning fundamental concepts.
Visit Repository
PapersWithCode.com
Papers With Code connects machine learning research papers to their corresponding code and the datasets they were trained on. If you’ve just read a cutting-edge paper and want to reproduce its results or use its data, this is the first place you should look.
Visit Repository
OpenML.org
OpenML is a collaborative platform for machine learning. It allows users to upload and share datasets, code, and experiment results. Its goal is to make ML research more open and reproducible, allowing everyone to build on each other’s work easily.
Visit RepositoryAmazon’s AWS Datasets
The Registry of Open Data on AWS hosts large-scale datasets that are expensive to store and transfer. By hosting them on AWS, Amazon makes them freely accessible for analysis directly in the cloud. You’ll find massive datasets here, like the 1000 Genomes Project, Landsat satellite imagery, and web crawls.
Visit Repository
U.S. Government’s Open Data
Data.gov is the home of the U.S. government’s open data. You can find data from across federal, state, and local governments on a huge range of topics. It’s a goldmine for data journalists, policymakers, and civic-minded data scientists.
Visit Repository
Stanford Large Network Dataset Collection
SNAP is a go-to resource for anyone working with graph or network data. Maintained by Stanford University, it contains dozens of large-scale, real-world network datasets, from social networks to web graphs and communication networks.
Visit Repository
DataPortals.org
A meta-repository that curates a list of over 600 open data portals from around the world. Organized by country, region, and city, it’s an excellent tool for finding localized or government-specific data from outside the U.S.
Visit Repository
Wikipedia’s List of ML Datasets
Don’t underestimate this resource! Wikipedia maintains several curated lists of datasets for machine learning research. These pages are often well-organized by data type (e.g., images, text, time series) and provide direct links and brief descriptions for each dataset.
Visit Repository