awesome-public-datasets/README.rst
2015-12-23 16:04:01 +08:00


Awesome Public Datasets
=======================
.. image:: https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg
   :alt: Awesome
   :target: https://github.com/sindresorhus/awesome
.. image:: https://travis-ci.org/caesar0301/awesome-public-datasets.svg
   :target: https://travis-ci.org/caesar0301/awesome-public-datasets

`This list of public data sources <https://github.com/caesar0301/awesome-public-datasets>`_
is collected and tidied from blogs, answers, and user responses.
Most of the data sets listed below are free; however, some are not.
Other amazingly awesome lists can be found in the
`awesome-awesomeness <https://github.com/bayandin/awesome-awesomeness>`_ and
`sindresorhus's awesome <https://github.com/sindresorhus/awesome>`_ lists.

* `Visit our Google Group on APD <https://groups.google.com/forum/#!forum/awesomepublicdatasets>`_
Agriculture
------------
* `U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>`_
Biology
-------
* `1000 Genomes <http://www.1000genomes.org/data>`_
* `American Gut (Microbiome Project) <https://github.com/biocore/American-Gut>`_
* `Collaborative Research in Computational Neuroscience (CRCNS) <http://crcns.org/data-sets>`_
* `EBI ArrayExpress <http://www.ebi.ac.uk/arrayexpress/>`_
* `ENCODE project <https://www.encodeproject.org>`_
* `Ensembl Genomes <http://ensemblgenomes.org/info/genomes>`_
* `Gene Expression Omnibus (GEO) <http://www.ncbi.nlm.nih.gov/geo/>`_
* `Gene Ontology (GO) <http://geneontology.org/page/download-annotations>`_
* `Global Biotic Interactions (GloBI) <https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data>`_
* `Human Microbiome Project (HMP) <http://www.hmpdacc.org/reference_genomes/reference_genomes.php>`_
* `ICOS PSP Benchmark <http://ico2s.org/datasets/psp_benchmark.html>`_
* `MIT Cancer Genomics Data <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi>`_
* `NIH Microarray data <http://bit.do/VVW6>`_ or `FTP <ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/>`_
* `OpenSNP genotypes data <https://opensnp.org/>`_
* `Pathguide: Protein-Protein Interactions Catalog <http://www.pathguide.org/>`_
* `Protein Data Bank <http://www.rcsb.org/>`_
* `PubChem Project <https://pubchem.ncbi.nlm.nih.gov/>`_
* `PubGene (now Coremine Medical) <http://www.pubgene.org/>`_
* `Sequence Read Archive (SRA) <http://www.ncbi.nlm.nih.gov/Traces/sra/>`_
* `Stanford Microarray Data <http://smd.stanford.edu/>`_
* `The Catalogue of Life <http://www.catalogueoflife.org/content/annual-checklist-archive>`_
* `The Personal Genome Project <http://www.personalgenomes.org/>`_ or `PGP <https://my.pgp-hms.org/public_genetic_data>`_
* `UCSC Public Data <http://hgdownload.soe.ucsc.edu/downloads.html>`_
* `UniGene <http://www.ncbi.nlm.nih.gov/unigene>`_
Climate/Weather
---------------
* `Australian Weather <http://www.bom.gov.au/climate/dwo/>`_
* `Brazilian Weather - Historical data (In Portuguese) <http://sinda.crn2.inpe.br/PCD/SITE/novo/site/>`_
* `Canadian Meteorological Centre <https://weather.gc.ca/grib/index_e.html>`_
* `Climate Data from UEA (updated monthly) <https://crudata.uea.ac.uk/cru/data/temperature/#datter>`_ and `FTP <ftp://ftp.cmdl.noaa.gov/>`_
* `Global Climate Data Since 1929 <http://en.tutiempo.net/climate>`_
* `NASA Global Imagery Browse Services <https://wiki.earthdata.nasa.gov/display/GIBS>`_
* `NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>`_
* `NOAA Climate Datasets <http://www.ncdc.noaa.gov/data-access/quick-links>`_
* `NOAA Realtime Weather Models <http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction>`_
* `The World Bank Open Data Resources for Climate Change <http://data.worldbank.org/developers/climate-data-api>`_
* `UEA Climatic Research Unit <http://www.cru.uea.ac.uk/data>`_
* `WorldClim - Global Climate Data <http://www.worldclim.org>`_
* `WU Historical Weather Worldwide <https://www.wunderground.com/history/index.html>`_
Complex Networks
----------------
* `CrossRef DOI URLs <https://archive.org/details/doi-urls>`_
* `DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>`_
* `NBER Patent Citations <http://nber.org/patents/>`_
* `NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>`_
* `Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>`_
* `PyPI and Maven Dependency Network <https://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>`_
* `Scopus Citation Database <https://www.elsevier.com/solutions/scopus>`_
* `Small Network Data <http://www-personal.umich.edu/~mejn/netdata/>`_
* `Stanford GraphBase (Steven Skiena) <http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml>`_
* `Stanford Large Network Dataset Collection <http://snap.stanford.edu/data/>`_
* `The Koblenz Network Collection <http://konect.uni-koblenz.de/>`_
* `The Laboratory for Web Algorithmics (UNIMI) <http://law.di.unimi.it/datasets.php>`_
* `The Nexus Network Repository <http://nexus.igraph.org/>`_
* `UCI Network Data Repository <https://networkdata.ics.uci.edu/resources.php>`_
* `UFL sparse matrix collection <http://www.cise.ufl.edu/research/sparse/matrices/>`_
* `WSU Graph Database <http://www.eecs.wsu.edu/mgd/gdb.html>`_
Computer Networks
-----------------
* `3.5B Web Pages from CommonCrawl 2012 <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>`_
* `53.5B Web clicks of 100K users in Indiana Univ. <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/>`_
* `CAIDA Internet Datasets <http://www.caida.org/data/overview/>`_
* `ClueWeb09 - 1B web pages <http://lemurproject.org/clueweb09/>`_
* `ClueWeb12 - 733M web pages <http://lemurproject.org/clueweb12/>`_
* `CommonCrawl Web Data over 7 years <http://commoncrawl.org/the-data/get-started/>`_
* `CRAWDAD Wireless datasets from Dartmouth Univ. <https://crawdad.cs.dartmouth.edu/>`_
* `Criteo click-through data <http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/>`_
* `Open Mobile Data by MobiPerf <https://console.developers.google.com/storage/openmobiledata_public/>`_
* `UCSD Network Telescope, IPv4 /8 net <http://www.caida.org/projects/network_telescope/>`_
Contextual Data
---------------
* `Context-aware data sets from five domains <http://students.depaul.edu/~yzheng8/DataSets.html#Data>`_ or `GitHub <https://github.com/irecsys/CARSKit/tree/master/context-aware_data_sets>`_
Data Challenges
---------------
* `Challenges in Machine Learning <http://www.chalearn.org/>`_
* `CrowdANALYTIX dataX <http://data.crowdanalytix.com>`_
* `D4D Challenge of Orange <http://www.d4d.orange.com/en/home>`_
* `DrivenData Competitions for Social Good <http://www.drivendata.org/>`_
* `ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>`_
* `Kaggle Competition Data <https://www.kaggle.com/>`_
* `KDD Cup by Tencent 2012 <http://www.kddcup2012.org/>`_
* `Localytics Data Visualization Challenge <https://github.com/localytics/data-viz-challenge>`_
* `Netflix Prize <http://www.netflixprize.com/leaderboard>`_
* `Space Apps Challenge <https://2015.spaceappschallenge.org>`_
* `Telecom Italia Big Data Challenge <https://dandelion.eu/datamine/open-big-data/>`_
* `Yelp Dataset Challenge <http://www.yelp.com/dataset_challenge>`_
Economics
---------
* `American Economic Association (AEA) <https://www.aeaweb.org/RFE/toc.php?show=complete>`_
* `EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>`_
* `Internet Product Code Database <http://www.upcdatabase.com/>`_
Energy
------
* `AMPds <http://ampds.org/>`_
* `BLUEd <http://nilm.cmubi.org/>`_
* `COMBED <http://combed.github.io/>`_
* `Dataport <https://dataport.pecanstreet.org/>`_
* `ECO <http://www.vs.inf.ethz.ch/res/show.html?what=eco-data>`_
* `EIA <http://www.eia.gov/electricity/data/eia923/>`_
* `HFED <http://hfed.github.io/>`_
* `iAWE <http://iawe.github.io/>`_
* `Plaid <http://plaidplug.com/>`_
* `REDD <http://redd.csail.mit.edu/>`_
* `UK-Dale <http://www.doc.ic.ac.uk/~dk3810/data/>`_
Finance
-------
* `CBOE Futures Exchange <http://cfe.cboe.com/Data/>`_
* `Google Finance <https://www.google.com/finance>`_
* `Google Trends <http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0>`_
* `NASDAQ <https://data.nasdaq.com/>`_
* `OANDA <http://www.oanda.com/>`_
* `OSU Financial data <http://fisher.osu.edu/fin/fdf/osudata.htm>`_
* `Quandl <https://www.quandl.com/>`_
* `St Louis Federal <https://research.stlouisfed.org/fred2/>`_
* `Yahoo Finance <http://finance.yahoo.com/>`_
Geology
-------
* `Smithsonian Institution Global Volcano and Eruption Database <http://volcano.si.edu/>`_
* `USGS Earthquake Archives <http://earthquake.usgs.gov/earthquakes/search/>`_
GeoSpace/GIS
------------
* `BODC - marine data of ~22K vars <http://www.bodc.ac.uk/data/where_to_find_data/>`_
* `Cambridge, MA, US, GIS data on GitHub <http://cambridgegis.github.io/gisdata.html>`_
* `EOSDIS - NASA's earth observing system data <http://sedac.ciesin.columbia.edu/data/sets/browse>`_
* `Factual Global Location Data <https://www.factual.com/>`_
* `Geo Spatial Data from ASU <http://geodacenter.asu.edu/datalist/>`_
* `GeoNames Worldwide <http://www.geonames.org/>`_
* `Global Administrative Areas Database (GADM) <http://www.gadm.org/>`_
* `Landsat 8 on AWS <https://aws.amazon.com/public-data-sets/landsat/>`_
* `List of all countries in all languages <https://github.com/umpirsky/country-list>`_
* `Natural Earth - vectors and rasters of the world <http://www.naturalearthdata.com/>`_
* `OpenAddresses <http://openaddresses.io/>`_
* `OpenStreetMap (OSM) <http://wiki.openstreetmap.org/wiki/Downloading_data>`_
* `Reverse Geocoder using OSM data <https://github.com/kno10/reversegeocode>`_ & `additional high-resolution data files <http://data.ub.uni-muenchen.de/61/>`_
* `TIGER/Line - U.S. boundaries and roads <http://www.census.gov/geo/maps-data/data/tiger-line.html>`_
* `TwoFishes - Foursquare's coarse geocoder <https://github.com/foursquare/twofishes>`_
* `TZ Timezones shapefiles <http://efele.net/maps/tz/world/>`_
* `World countries in multiple formats <https://github.com/mledoze/countries>`_
Government
----------
* `Antwerp, Belgium <http://opendata.antwerpen.be/datasets>`_
* `Austin, TX, US <https://data.austintexas.gov/>`_
* `Australia (abs.gov.au) <http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument>`_
* `Australia (data.gov.au) <https://data.gov.au/>`_
* `Austria (data.gv.at) <https://www.data.gv.at/>`_
* `Belgium <http://data.gov.be/nl/datasets>`_
* `Brazil <http://dados.gov.br/dataset>`_
* `Cambridge, MA, US <https://data.cambridgema.gov/>`_
* `Canada <http://open.canada.ca/en?lang=En&n=5BCD274E-1>`_
* `Chicago <https://data.cityofchicago.org/>`_
* `Dallas Open Data <https://www.dallasopendata.com/>`_
* `Denver Open Data <http://data.denvergov.org//>`_
* `Durham, NC Open Data <https://opendurham.nc.gov/explore/>`_
* `England LGInform <http://lginform.local.gov.uk/>`_
* `EuroStat <http://ec.europa.eu/eurostat/data/database>`_
* `FedStats <http://fedstats.sites.usa.gov/>`_
* `Finland <https://www.opendata.fi/en>`_
* `France <https://www.data.gouv.fr/en/datasets/>`_
* `Germany <https://www-genesis.destatis.de/genesis/online>`_
* `Ghent, Belgium <https://data.stad.gent/datasets>`_
* `Glasgow, Scotland, UK <https://data.glasgow.gov.uk/>`_
* `Guardian world governments <http://www.guardian.co.uk/world-government-data>`_
* `Houston Open Data <http://data.ohouston.org>`_
* `Indian Government Data <https://data.gov.in/>`_
* `Indonesian Data Portal <http://data.go.id/>`_
* `London Datastore, UK <http://data.london.gov.uk/dataset>`_
* `Los Angeles Open Data <https://data.lacity.org/>`_
* `MassGIS, Massachusetts, U.S. <http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/>`_
* `Mexico <http://catalogo.datos.gob.mx/dataset>`_
* `Netherlands <https://data.overheid.nl/>`_
* `New Zealand <http://www.stats.govt.nz/browse_for_stats.aspx>`_
* `NYC betanyc <http://betanyc.us/>`_
* `NYC Open Data <https://nycplatform.socrata.com/>`_
* `OECD <http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html>`_
* `Oklahoma <https://data.ok.gov/>`_
* `Open Government Data (OGD) Platform India <https://data.gov.in/>`_
* `Oregon <https://data.oregon.gov/>`_
* `Portland, Oregon <https://www.portlandoregon.gov/28130>`_
* `Puerto Rico Government <https://data.pr.gov//>`_
* `Rio de Janeiro, Brazil <http://data.rio.rj.gov.br/>`_
* `Romania <http://data.gov.ro/>`_
* `Russia <http://data.gov.ru>`_
* `San Francisco Data sets <http://datasf.org/>`_
* `Seattle <https://data.seattle.gov/>`_
* `Singapore Government Data <https://data.gov.sg/>`_
* `South Africa <http://beta2.statssa.gov.za/>`_
* `Switzerland <http://www.opendata.admin.ch/>`_
* `Texas Open Data <https://data.texas.gov/>`_
* `The World Bank <http://wdronline.worldbank.org/>`_
* `U.K. Government Data <http://data.gov.uk/data>`_
* `U.S. American Community Survey <http://www.census.gov/acs/www/data_documentation/data_release_info/>`_
* `U.S. CDC Public Health datasets <http://www.cdc.gov/nchs/data_access/ftp_data.htm>`_
* `U.S. Census Bureau <http://www.census.gov/data.html>`_
* `U.S. Department of Housing and Urban Development (HUD) <http://www.huduser.gov/portal/datasets/pdrdatas.html>`_
* `U.S. Federal Government Agencies <http://www.data.gov/metrics>`_
* `U.S. Federal Government Data Catalog <http://catalog.data.gov/dataset>`_
* `U.S. Food and Drug Administration (FDA) <https://open.fda.gov/index.html>`_
* `U.S. National Center for Education Statistics (NCES) <http://nces.ed.gov/>`_
* `U.S. Open Government <http://www.data.gov/open-gov/>`_
* `UK 2011 Census Open Atlas Project <http://www.alex-singleton.com/r/2013/02/05/2011-census-open-atlas-project/>`_
* `United Nations <http://data.un.org/>`_
* `Uruguay <https://catalogodatos.gub.uy/>`_
* `Vancouver, BC Open Data Catalog <http://data.vancouver.ca/datacatalogue/>`_
Healthcare
----------
* `EHDP Large Health Data Sets <http://www.ehdp.com/vitalnet/datasets.htm>`_
* `Gapminder World, demographic databases <http://www.gapminder.org/data/>`_
* `Medicare Coverage Database (MCD), U.S. <https://www.cms.gov/medicare-coverage-database/>`_
* `Medicare Data Engine of medicare.gov Data <https://data.medicare.gov/>`_
* `Medicare Data File <http://go.cms.gov/19xxPN4>`_
* `MeSH, the vocabulary thesaurus used for indexing articles for PubMed <https://www.nlm.nih.gov/mesh/filelist.html>`_
* `Number of Ebola Cases and Deaths in Affected Countries (2014) <https://data.hdx.rwlabs.org/dataset/ebola-cases-2014>`_
* `Open-ODS (structure of the UK NHS) <http://www.openods.co.uk>`_
* `The Cancer Genome Atlas project (TCGA) <https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp>`_ and `BigQuery table <http://google-genomics.readthedocs.org/en/latest/use_cases/discover_public_data/isb_cgc_data.html>`_
Image Processing
----------------
* `10k US Adult Faces Database <http://wilmabainbridge.com/facememorability2.html>`_
* `2GB of Photos of Cats (original down as of 20 Aug 2015) <http://137.189.35.203/WebUI/CatDatabase/catData.html>`_ or `Archive version <https://web.archive.org/web/20150520175645/http://137.189.35.203/WebUI/CatDatabase/catData.html>`_
* `Affective Image Classification <http://www.imageemotion.org/>`_
* `Animals with attributes <http://attributes.kyb.tuebingen.mpg.de/>`_
* `Face Recognition Benchmark <http://www.face-rec.org/databases/>`_
* `ImageNet (in WordNet hierarchy) <http://www.image-net.org/>`_
* `Indoor Scene Recognition <http://web.mit.edu/torralba/www/indoor.html>`_
* `International Affective Picture System, UFL <http://csea.phhp.ufl.edu/media/iapsmessage.html>`_
* `Massive Visual Memory Stimuli, MIT <http://cvcl.mit.edu/MM/stimuli.html>`_
* `Stanford Dogs Dataset <http://vision.stanford.edu/aditya86/ImageNetDogs/>`_
* `SUN database, MIT <http://groups.csail.mit.edu/vision/SUN/hierarchy.html>`_
* `The Oxford-IIIT Pet Dataset <http://www.robots.ox.ac.uk/~vgg/data/pets/>`_
* `YouTube Faces Database <http://www.cs.tau.ac.il/~wolf/ytfaces/>`_
* `Several Shape-from-Silhouette Datasets <http://kaiwolf.no-ip.org/3d-model-repository.html>`_
Machine Learning
----------------
* `Delve Datasets for classification and regression (Univ. of Toronto) <http://www.cs.toronto.edu/~delve/data/datasets.html>`_
* `Discogs Monthly Data <http://data.discogs.com/>`_
* `eBay Online Auctions (2012) <http://www.modelingonlineauctions.com/datasets>`_
* `IMDb Database <http://www.imdb.com/interfaces>`_
* `Keel Repository for classification, regression and time series <http://sci2s.ugr.es/keel/datasets.php>`_
* `Lending Club Loan Data <https://www.lendingclub.com/info/download-data.action>`_
* `Machine Learning Data Set Repository <http://mldata.org/>`_
* `Million Song Dataset <http://labrosa.ee.columbia.edu/millionsong/>`_
* `More Song Datasets <http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets>`_
* `MovieLens Data Sets <http://grouplens.org/datasets/movielens/>`_
* `RDataMining - "R and Data Mining" ebook data <http://www.rdatamining.com/data>`_
* `Registered Meteorites on Earth <http://healthintelligence.drupalgardens.com/content/registered-meteorites-has-impacted-earth-visualized>`_
* `Restaurants Health Score Data in San Francisco <http://missionlocal.org/san-francisco-restaurant-health-inspections/>`_
* `UCI Machine Learning Repository <http://archive.ics.uci.edu/ml/>`_
* `Yahoo! Ratings and Classification Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=r>`_
Museums
-------
* `Cooper-Hewitt's Collection Database <https://github.com/cooperhewitt/collection>`_
* `Minneapolis Institute of Arts metadata <https://github.com/artsmia/collection>`_
* `Natural History Museum (London) Data Portal <http://data.nhm.ac.uk/>`_
* `Rijksmuseum Historical Art Collection <https://www.rijksmuseum.nl/en/api>`_
* `Tate Collection metadata <https://github.com/tategallery/collection>`_
* `The Getty vocabularies <http://vocab.getty.edu>`_
* `Canada Science and Technology Museums Corporation's Open Data <http://techno-science.ca/en/data.php>`_
Natural Language
----------------
* `Blogger Corpus <http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm>`_
* `ClueWeb09 FACC <http://lemurproject.org/clueweb09/FACC1/>`_
* `ClueWeb12 FACC <http://lemurproject.org/clueweb12/FACC1/>`_
* `DBpedia - 4.58M things with 583M facts <http://wiki.dbpedia.org/Datasets>`_
* `Flickr Personal Taxonomies <http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html>`_
* `Google Books Ngrams (2.2TB) <https://aws.amazon.com/datasets/google-books-ngrams/>`_
* `Google Web 5gram (1TB, 2006) <https://catalog.ldc.upenn.edu/LDC2006T13>`_
* `Gutenberg eBooks List <http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs>`_
* `Hansards text chunks of Canadian Parliament <http://www.isi.edu/natural-language/download/hansard/>`_
* `Machine Translation of European languages <http://statmt.org/wmt11/translation-task.html#download>`_
* `Machine Comprehension Test (MCTest) of text from Microsoft Research <http://research.microsoft.com/en-us/um/redmond/projects/mctest/index.html>`_
* `SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles) <https://github.com/ParallelMazen/SaudiNewsNet>`_
* `SMS Spam Collection in English <http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/>`_
* `USENET postings corpus of 2005-2011 <http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html>`_
* `Wikidata - Wikipedia databases <https://www.wikidata.org/wiki/Wikidata:Database_download>`_
* `Wikipedia Links data - 40 Million Entities in Context <https://code.google.com/p/wiki-links/downloads/list>`_
* `WordNet databases and tools <http://wordnet.princeton.edu/wordnet/download/>`_
Physics
-------
* `CERN Open Data Portal <http://opendata.cern.ch/>`_
* `NASA Exoplanet Archive <http://exoplanetarchive.ipac.caltech.edu/>`_
* `NSSDC (NASA) data of 550 spacecraft <http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html>`_
* `Sloan Digital Sky Survey (SDSS) - Mapping the Universe <http://www.sdss.org/>`_
Psychology/Cognition
--------------------
* `OSU Cognitive Modeling Repository Datasets <http://www.cmr.osu.edu/browse/datasets>`_
Public Domains
--------------
* `Amazon <http://aws.amazon.com/datasets/>`_
* `Archive.org Datasets <https://archive.org/details/datasets>`_
* `CMU JASA data archive <http://lib.stat.cmu.edu/jasadata/>`_
* `CMU StatLab collections <http://lib.stat.cmu.edu/datasets/>`_
* `Data360 <http://www.data360.org/index.aspx>`_
* `Datamob.org <http://datamob.org/datasets>`_
* `Google <http://www.google.com/publicdata/directory>`_
* `Infochimps <http://www.infochimps.com/>`_
* `KDNuggets Data Collections <http://www.kdnuggets.com/datasets/index.html>`_
* `Microsoft Azure Data Market Free DataSets <http://datamarket.azure.com/browse/data?price=free>`_
* `Numbrary <http://numbrary.com/>`_
* `Reddit Datasets <https://www.reddit.com/r/datasets>`_
* `RevolutionAnalytics Collection <http://packages.revolutionanalytics.com/datasets/>`_
* `Sample R data sets <http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html>`_
* `Stats4Stem R data sets <http://www.stats4stem.org/data-sets.html>`_
* `StatSci.org <http://www.statsci.org/datasets.html>`_
* `The Washington Post List <http://www.washingtonpost.com/wp-srv/metro/data/datapost.html>`_
* `UCLA SOCR data collection <http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data>`_
* `UFO Reports <http://www.nuforc.org/webreports.html>`_
* `Wikileaks 911 pager intercepts <https://911.wikileaks.org/files/index.html>`_
* `Yahoo Webscope <http://webscope.sandbox.yahoo.com/catalog.php>`_
Search Engines
--------------
* `Academic Torrents of data sharing from UMB <http://academictorrents.com/>`_
* `Archive-it from Internet Archive <https://www.archive-it.org/explore?show=Collections>`_
* `Datahub.io <https://datahub.io/dataset>`_
* `DataMarket (Qlik) <https://datamarket.com/data/list/?q=all>`_
* `Freebase.com of people, places, and things <http://www.freebase.com/>`_
* `Harvard Dataverse Network of scientific data <https://dataverse.harvard.edu/>`_
* `ICPSR (UMICH) <http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp>`_
* `Open Data Certificates (beta) <https://certificates.theodi.org/en/datasets>`_
* `Statista.com - Statistics and Studies <http://www.statista.com/>`_
Social Networks
---------------
* `72 hours #gamergate scrape <http://waxy.org/random/misc/gamergate_tweets.csv>`_
* `Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape <https://archive.org/details/twitter_cikm_2010>`_
* `May 2011 Calufa Twitter Scrape <http://archive.org/details/2011-05-calufa-twitter-sql>`_
* `Network Twitter Data <http://snap.stanford.edu/data/higgs-twitter.html>`_
* `Social Twitter Data <http://snap.stanford.edu/data/egonets-Twitter.html>`_
* `Twitter Data for Sentiment Analysis <http://help.sentiment140.com/for-students/>`_
Social Sciences
---------------
* `Ancestry.com Forum Dataset over 10 years <http://www.cs.cmu.edu/~jelsas/data/ancestry.com/>`_
* `CMU Enron Email of 150 users <http://www.cs.cmu.edu/~enron/>`_
* `EDRM Enron Email of 151 users, hosted on S3 <https://aws.amazon.com/datasets/enron-email-data/>`_
* `Facebook Data Scrape (2005) <https://archive.org/details/oxford-2005-facebook-matrix>`_
* `Facebook Social Networks from LAW (since 2007) <http://law.di.unimi.it/datasets.php>`_
* `FBI Hate Crime 2013 - aggregated data <https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013>`_
* `Foursquare from UMN/Sarwat (2013) <https://archive.org/details/201309_foursquare_dataset_umn>`_
* `GDELT Global Events Database <http://gdeltproject.org/data.html>`_
* `General Social Survey (GSS) since 1972 <http://gss.norc.org>`_
* `GetGlue - users rating TV shows <http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz>`_
* `GitHub Collaboration Archive <https://www.githubarchive.org/>`_
* `Google Scholar citation relations <http://www3.cs.stonybrook.edu/~leman/data/gscholar.db>`_
* `MIT Reality Mining Dataset <http://realitycommons.media.mit.edu/realitymining.html>`_
* `Mobile Social Networks from UMASS <https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks>`_
* `PewResearch Internet Survey Project <http://www.pewinternet.org/datasets/pages/2/>`_
* `Political Polarity Data <http://www3.cs.stonybrook.edu/~leman/data/14-icwsm-political-polarity-data.zip>`_
* `Reddit Comments <https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/>`_
* `Skytrax' Air Travel Reviews Dataset <https://github.com/quankiquanki/skytrax-reviews-dataset>`_
* `SourceForge.net Research Data <http://www3.nd.edu/~oss/Data/data.html>`_
* `StackExchange Data Explorer <http://data.stackexchange.com/help>`_
* `Texas Inmates Executed Since 1984 <http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html>`_
* `Titanic Survival Data Set <https://github.com/caesar0301/awesome-public-datasets/tree/master/Datasets>`_
* `Twitter Graph of entire Twitter site <http://an.kaist.ac.kr/traces/WWW2010.html>`_
* `UCB's Archive of Social Science Data (D-Lab) <http://ucdata.berkeley.edu/>`_
* `UCLA Social Sciences Data Archive <http://dataarchives.ss.ucla.edu/Home.DataPortals.htm>`_
* `UNIMI/LAW Social Network Datasets <http://law.di.unimi.it/datasets.php>`_
* `Universities Worldwide <http://univ.cc/>`_
* `UPJOHN for Labor Employment Research <http://www.upjohn.org/services/resources/employment-research-data-center>`_
* `Yahoo! Graph and Social Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=g>`_
* `YouTube Video Social Graph in 2007, 2008 <http://netsg.cs.sfu.ca/youtubedata/>`_
Sports
------
* `Betfair Historical Exchange Data <http://data.betfair.com/>`_
* `Cricsheet Matches (cricket) <http://cricsheet.org/>`_
* `Ergast Formula 1, from 1950 up to date (API) <http://ergast.com/mrd/db>`_
* `Football/Soccer resources (data and APIs) <http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/>`_
* `Lahman's Baseball Database <http://www.seanlahman.com/baseball-archive/statistics/>`_
* `Retrosheet Baseball Statistics <http://www.retrosheet.org/game.htm>`_
Time Series
-----------
* `Hard Drive Failure Rates <https://www.backblaze.com/hard-drive-test-data.html>`_
* `Heart Rate Time Series from MIT <http://ecg.mit.edu/time-series/>`_
* `Time Series Data Library (TSDL) from MU <https://datamarket.com/data/list/?q=provider:tsdl>`_
* `UC Riverside Time Series Dataset <http://www.cs.ucr.edu/~eamonn/time_series_data/>`_
Transportation
--------------
* `Airlines OD Data 1987-2008 <http://stat-computing.org/dataexpo/2009/the-data.html>`_
* `Bay Area Bike Share Data <http://www.bayareabikeshare.com/open-data>`_
* `Bike Share Systems (BSS) collection <https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems>`_
* `GeoLife GPS Trajectory from Microsoft Research <http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/>`_
* `German train system by Deutsche Bahn <http://data.deutschebahn.com/datasets/>`_
* `Hubway Million Rides in MA <http://hubwaydatachallenge.org/trip-history-data/>`_
* `Marine Traffic - ship tracks, port calls and more <http://www.marinetraffic.com/de/ais-api-services>`_
* `NYC Taxi Trip Data 2009- <http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml>`_
* `NYC Taxi Trip Data 2013 (FOIA/FOILed) <https://archive.org/details/nycTaxiTripData2013>`_
* `NYC Uber trip data April 2014 to September 2014 <https://github.com/fivethirtyeight/uber-tlc-foil-response>`_
* `OpenFlights - airport, airline and route data <http://openflights.org/data.html>`_
* `Plane Crash Database, since 1920 <http://www.planecrashinfo.com/database.htm>`_
* `RITA Airline On-Time Performance data <http://www.transtats.bts.gov/Tables.asp?DB_ID=120>`_
* `RITA/BTS transport data collection (TranStats) <http://www.transtats.bts.gov/DataIndex.asp>`_
* `Transport for London (TFL) <https://tfl.gov.uk/info-for/open-data-users/our-feeds>`_
* `Travel Tracker Survey (TTS) for Chicago <http://www.cmap.illinois.gov/data/transportation/travel-tracker-survey>`_
* `U.S. Bureau of Transportation Statistics (BTS) <http://www.rita.dot.gov/bts/>`_
* `U.S. Domestic Flights 1990 to 2009 <http://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a>`_
* `U.S. Freight Analysis Framework since 2007 <http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm>`_
Complementary Collections
-------------------------
* DataWrangling: `Some Datasets Available on the Web <http://www.datawrangling.com/some-datasets-available-on-the-web>`_
* Inside-r: `Finding Data on the Internet <http://www.inside-r.org/howto/finding-data-internet>`_
* OpenDataMonitor: `An overview of available open data resources in Europe <http://opendatamonitor.eu>`_
* OpenDataNetwork: `A search engine of all Socrata powered data portals ranging from small cities to federal agencies and non-profits <http://www.opendatanetwork.com/>`_
* Quora: `Where can I find large datasets open to the public? <http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public>`_
* RS.io: `100+ Interesting Data Sets for Statistics <http://rs.io/100-interesting-data-sets-for-statistics/>`_
* StaTrek: `Leveraging open data to understand urban lives <http://xiaming.me/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/>`_
* Zenodo: `An open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science. <https://zenodo.org/collection/datasets>`_