A topic-centric list of high-quality open datasets.

Awesome Public Datasets
=======================

`This list of public data sources <https://github.com/caesar0301/awesome-public-datasets>`_
was collected and tidied from blogs, answers, and user responses.
Most of the data sets listed below are free; however, some are not.
Other amazingly awesome lists can be found in the
`awesome-awesomeness <https://github.com/bayandin/awesome-awesomeness>`_ and
`another awesome <https://github.com/sindresorhus/awesome>`_ list.


Agriculture
-----------

* `U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>`_


Biology
-------

* `1000 Genomes <http://www.1000genomes.org/data>`_
* `Collaborative Research in Computational Neuroscience (CRCNS) <http://crcns.org/data-sets>`_
* `Gene Expression Omnibus (GEO) <http://www.ncbi.nlm.nih.gov/geo/>`_
* `Human Microbiome Project (HMP) <http://www.hmpdacc.org/reference_genomes/reference_genomes.php>`_
* `ICOS PSP Benchmark <http://www.infobiotic.net/PSPbenchmarks/>`_
* `MIT Cancer Genomics Data <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi>`_
* `NIH Microarray data (FTP) <http://bit.do/VVW6>`_
* `Protein Data Bank <http://pdb.org/>`_
* `PubChem Project <https://pubchem.ncbi.nlm.nih.gov/>`_
* `PubGene (now Coremine Medical) <http://www.pubgene.org/>`_
* `Stanford Microarray Data <http://smd.stanford.edu/>`_
* `The Personal Genome Project <http://www.personalgenomes.org/>`_ or `PGP <https://my.pgp-hms.org/public_genetic_data>`_
* `UCSC Public Data <http://hgdownload.soe.ucsc.edu/downloads.html>`_
* `UniGene <http://www.ncbi.nlm.nih.gov/unigene>`_


Climate/Weather
---------------

* `Australian Weather <http://www.bom.gov.au/climate/dwo/>`_
* `Canadian Meteorological Centre <https://weather.gc.ca/grib/index_e.html>`_
* `Climate Data from UEA (updated at roughly monthly intervals) <http://www.cru.uea.ac.uk/cru/data/temperature/#datter>`_ and ftp://ftp.cmdl.noaa.gov/
* `Global Climate Data Since 1929 <http://www.tutiempo.net/en/Climate>`_
* `NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>`_
* `NOAA Climate Datasets <http://ncdc.noaa.gov/data-access/quick-links>`_
* `NOAA Realtime Weather Models <http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction>`_
* `WU Historical Weather Worldwide <http://www.wunderground.com/history/index.html>`_


Complex Networks
----------------

* `CrossRef DOI URLs <https://archive.org/details/doi-urls>`_
* `DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>`_
* `NBER Patent Citations <http://nber.org/patents/>`_
* `NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>`_
* `Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>`_
* `PyPI and Maven Dependency Network <http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>`_
* `Scopus Citation Database <http://www.elsevier.com/online-tools/scopus>`_
* `Stanford GraphBase (Steven Skiena) <http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml>`_
* `Stanford Large Network Dataset Collection <http://snap.stanford.edu/data/>`_
* `The Koblenz Network Collection <http://konect.uni-koblenz.de/>`_
* `The Laboratory for Web Algorithmics (UNIMI) <http://law.di.unimi.it/datasets.php>`_
* `UCI Network Data Repository <http://networkdata.ics.uci.edu/resources.php>`_
* `UFL sparse matrix collection <http://www.cise.ufl.edu/research/sparse/matrices/>`_
* `WSU Graph Database <http://www.eecs.wsu.edu/mgd/gdb.html>`_


Computer Networks
-----------------

* `3.5B Web Pages <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>`_
* `53.5B Web clicks <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset>`_
* `CAIDA Internet Datasets <http://www.caida.org/data/overview/>`_
* `ClueWeb09 <http://lemurproject.org/clueweb09/>`_
* `ClueWeb12 <http://lemurproject.org/clueweb12/>`_
* `CommonCrawl Web Data <http://commoncrawl.org/the-data/get-started/>`_
* `Dartmouth CRAWDAD Wireless datasets <http://crawdad.cs.dartmouth.edu/>`_
* `OpenMobileData (MobiPerf) <https://console.developers.google.com/storage/openmobiledata_public/>`_
* `UCSD Network Telescope <http://www.caida.org/projects/network_telescope/>`_


Data Challenges
---------------

* `Challenges in Machine Learning <http://www.chalearn.org/>`_
* `DrivenData Competitions for Social Good <http://www.drivendata.org/>`_
* `ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>`_
* `Kaggle Competition Data <http://www.kaggle.com/>`_
* `KDD Cup by Tencent 2012 <https://www.kddcup2012.org/>`_
* `Localytics Data Visualization Challenge <https://github.com/localytics/data-viz-challenge>`_
* `Netflix Prize <http://www.netflixprize.com/leaderboard>`_
* `Yelp Dataset Challenge <http://www.yelp.com/dataset_challenge>`_


Economics
---------

* `American Economic Association (AEA) <http://www.aeaweb.org/RFE/toc.php?show=complete>`_
* `EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>`_
* `Internet Product Code Database <http://www.upcdatabase.com/>`_


Energy
------

* `AMPds <http://ampds.org/>`_
* `BLUEd <http://nilm.cmubi.org/>`_
* `COMBED <http://combed.github.io/>`_
* `Dataport <https://dataport.pecanstreet.org/>`_
* `ECO <http://www.vs.inf.ethz.ch/res/show.html?what=eco-data>`_
* `EIA <http://www.eia.gov/electricity/data/eia923/>`_
* `HFED <http://hfed.github.io/>`_
* `iAWE <http://iawe.github.io/>`_
* `Plaid <http://plaidplug.com/>`_
* `REDD <http://redd.csail.mit.edu/>`_
* `UK-Dale <http://www.doc.ic.ac.uk/~dk3810/data/>`_


Finance
-------

* `CBOE Futures Exchange <http://cfe.cboe.com/Data/>`_
* `Google Finance <https://www.google.com/finance>`_
* `Google Trends <http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0>`_
* `NASDAQ <https://data.nasdaq.com/>`_
* `OANDA <http://www.oanda.com/>`_
* `OSU Financial data <http://fisher.osu.edu/fin/fdf/osudata.htm>`_
* `Quandl <http://www.quandl.com/>`_
* `St. Louis Federal Reserve (FRED) <http://research.stlouisfed.org/fred2/>`_
* `Yahoo Finance <http://finance.yahoo.com/>`_


GeoSpace/GIS
------------

* `BODC (marine data covering nearly 22,000 oceanographic variables) <http://www.bodc.ac.uk/data/where_to_find_data/>`_
* `EOSDIS <http://sedac.ciesin.columbia.edu/data/sets/browse>`_
* `Factual Global Location Data <http://www.factual.com/>`_
* `GADM (Global Administrative Areas database) <http://www.gadm.org/>`_
* `Geo Spatial Data from ASU <http://geodacenter.asu.edu/datalist/>`_
* `GeoNames (over eight million placenames) <http://www.geonames.org/>`_
* `Natural Earth (vectors and rasters of the world) <http://www.naturalearthdata.com/>`_
* `OpenStreetMap (a free map worldwide) <http://wiki.openstreetmap.org/wiki/Downloading_data>`_
* `TIGER/Line (official United States boundaries and roads) <http://www.census.gov/geo/maps-data/data/tiger-line.html>`_
* `twofishes (Foursquare's coarse geocoder) <https://github.com/foursquare/twofishes>`_
* `tz_world (timezone polygons) <http://efele.net/maps/tz/world/>`_


Government
----------

* `Australia (abs.gov.au) <http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument>`_
* `Australia (data.gov.au) <https://data.gov.au/>`_
* `Canada <http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1>`_
* `Chicago <https://data.cityofchicago.org/>`_
* `EuroStat <http://ec.europa.eu/eurostat/data/database>`_
* `FedStats <http://www.fedstats.gov/cgi-bin/A2Z.cgi>`_
* `Germany <https://www-genesis.destatis.de/genesis/online>`_
* `Glasgow, Scotland, UK <http://data.glasgow.gov.uk/>`_
* `Guardian world governments <http://www.guardian.co.uk/world-government-data>`_
* `London Datastore, U.K. <http://data.london.gov.uk/dataset>`_
* `Netherlands <https://data.overheid.nl/>`_
* `New Zealand <http://www.stats.govt.nz/browse_for_stats.aspx>`_
* `NYC betanyc <http://betanyc.us/>`_
* `NYC Open Data <http://nycplatform.socrata.com/>`_
* `OECD <http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html>`_
* `Open Government Data (OGD) Platform India <http://www.data.gov.in/>`_
* `San Francisco Data sets <http://datasf.org/>`_
* `South Africa <http://beta2.statssa.gov.za/>`_
* `The World Bank <http://wdronline.worldbank.org/>`_
* `U.K. Government Data <http://data.gov.uk/data>`_
* `U.S. American Community Survey <http://www.census.gov/acs/www/data_documentation/data_release_info/>`_
* `U.S. CDC Public Health datasets <http://www.cdc.gov/nchs/data_access/ftp_data.htm>`_
* `U.S. Census Bureau <http://www.census.gov/data.html>`_
* `U.S. Department of Housing and Urban Development (HUD) <http://www.huduser.org/portal/datasets/pdrdatas.html>`_
* `U.S. Federal Government Agencies <http://www.data.gov/metric>`_
* `U.S. Federal Government Data Catalog <http://catalog.data.gov/dataset>`_
* `U.S. Food and Drug Administration (FDA) <https://open.fda.gov/index.html>`_
* `U.S. Open Government <http://www.data.gov/open-gov/>`_
* `UK 2011 Census Open Atlas Project <http://www.alex-singleton.com/2011-census-open-atlas-project/>`_
* `United Nations <http://data.un.org/>`_


Healthcare
----------

* `EHDP Large Health Data Sets <http://www.ehdp.com/vitalnet/datasets.htm>`_
* `Gapminder <http://www.gapminder.org/data/>`_
* `Medicare Data File <http://go.cms.gov/19xxPN4>`_


Image Processing
----------------

* `2GB of photos of cats <http://137.189.35.203/WebUI/CatDatabase/catData.html>`_
* `Face Recognition Benchmark <http://www.face-rec.org/databases/>`_
* `ImageNet <http://www.image-net.org/>`_


Machine Learning
----------------

* `eBay Online Auctions <http://www.modelingonlineauctions.com/datasets>`_
* `IMDb database <http://www.imdb.com/interfaces>`_
* `Keel Repository <http://sci2s.ugr.es/keel/datasets.php>`_
* `Lending Club Loan Data <https://www.lendingclub.com/info/download-data.action>`_
* `Machine Learning Data Set Repository <http://mldata.org/>`_
* `Million Song Dataset <http://blog.echonest.com/post/3639160982/million-song-dataset>`_
* `More Song Datasets <http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets>`_
* `MovieLens Data Sets <http://datahub.io/dataset/movielens>`_
* `RDataMining R and Data Mining ebook data <http://www.rdatamining.com/data>`_
* `Registered meteorites on Earth <http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized>`_
* `SF restaurants dataset <http://missionlocal.org/san-francisco-restaurant-health-inspections/>`_
* `UCI Machine Learning Repository <http://archive.ics.uci.edu/ml/>`_
* `University of Toronto Delve Datasets <http://www.cs.toronto.edu/~delve/data/datasets.html>`_
* `Yahoo Ratings and Classification Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=r>`_


Museums
-------

* `Cooper-Hewitt's Collection Database <https://github.com/cooperhewitt/collection>`_
* `Minneapolis Institute of Arts metadata <https://github.com/artsmia/collection>`_
* `Tate Collection metadata <https://github.com/tategallery/collection>`_
* `The Getty vocabularies <http://vocab.getty.edu>`_


Music
-----

* `Discogs Data <http://www.discogs.com/data/>`_


Natural Language
----------------

* `40 Million Entities in Context <https://code.google.com/p/wiki-links/downloads/list>`_
* `ClueWeb09 FACC <http://lemurproject.org/clueweb09/FACC1/>`_
* `ClueWeb12 FACC <http://lemurproject.org/clueweb12/FACC1/>`_
* `DBpedia <http://wiki.dbpedia.org/Datasets>`_
* `Flickr personal taxonomies <http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html>`_
* `Google Books Ngrams <http://aws.amazon.com/datasets/8172056142375670>`_
* `Google Web 5gram, 2006 (1T) <https://catalog.ldc.upenn.edu/LDC2006T13>`_
* `Gutenberg eBooks List <http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs>`_
* `Hansards <http://www.isi.edu/natural-language/download/hansard/>`_
* `Machine Translation <http://statmt.org/wmt11/translation-task.html#download>`_
* `SMS Spam Collection <http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/>`_
* `USENET corpus <http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html>`_
* `Wikidata <https://www.wikidata.org/wiki/Wikidata:Database_download>`_
* `WordNet <http://wordnet.princeton.edu/wordnet/download/>`_


Physics
-------

* `CERN Open Data Portal <http://opendata.cern.ch/>`_
* `NASA <http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html>`_


Public Domains
--------------

* `Amazon <http://aws.amazon.com/datasets>`_
* `Archive.org Datasets <https://archive.org/details/datasets>`_
* `CMU JASA data archive <http://lib.stat.cmu.edu/jasadata/>`_
* `CMU StatLab collections <http://lib.stat.cmu.edu/datasets/>`_
* `Data360 <http://www.data360.org/index.aspx>`_
* `Datamob.org <http://datamob.org/datasets>`_
* `Google <http://www.google.com/publicdata/directory>`_
* `Infochimps <http://www.infochimps.com/>`_
* `KDNuggets Data Collections <http://www.kdnuggets.com/datasets/index.html>`_
* `Numbrary <http://numbrary.com/>`_
* `Reddit Datasets <http://www.reddit.com/r/datasets>`_
* `RevolutionAnalytics Collection <http://www.revolutionanalytics.com/subscriptions/datasets/>`_
* `Sample R data sets <http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html>`_
* `Stats4Stem R data sets <http://www.stats4stem.org/data-sets.html>`_
* `StatSci.org <http://www.statsci.org/datasets.html>`_
* `The Washington Post List <http://www.washingtonpost.com/wp-srv/metro/data/datapost.html>`_
* `UCLA SOCR data collection <http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data>`_
* `UFO Reports <http://www.nuforc.org/webreports.html>`_
* `Wikileaks 911 pager intercepts <http://911.wikileaks.org/files/index.html>`_
* `Yahoo Webscope <http://webscope.sandbox.yahoo.com/catalog.php>`_


Search Engines
--------------

* `Academic Torrents <http://academictorrents.com/>`_
* `Archive-it <https://www.archive-it.org/explore?show=Collections>`_
* `Datahub.io <http://datahub.io/dataset>`_
* `DataMarket.com <https://datamarket.com/data/list/?q=all>`_
* `Freebase.com <http://www.freebase.com/>`_
* `Harvard Dataverse <http://thedata.harvard.edu/dvn/>`_
* `Statista.com <http://www.statista.com/>`_


Social Sciences
---------------

* `CMU Enron Email <http://www.cs.cmu.edu/~enron/>`_
* `Facebook Social Networks (since 2007) <http://law.di.unimi.it/datasets.php>`_
* `Facebook100 (2005) <https://archive.org/details/oxford-2005-facebook-matrix>`_
* `Foursquare (2010,2011) <http://www.public.asu.edu/~hgao16/dataset.html>`_
* `Foursquare (UMN/Sarwat, 2013) <https://archive.org/details/201309_foursquare_dataset_umn>`_
* `General Social Survey (GSS) <http://www3.norc.org/GSS+Website/>`_
* `GetGlue (users rating TV shows) <http://bit.ly/1aL8XS0>`_
* `GitHub Archive <http://www.githubarchive.org/>`_
* `ICPSR <http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp>`_
* `Mobile Social Networks (UMASS) <https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks>`_
* `PewResearch Internet Project <http://www.pewinternet.org/datasets/pages/2/>`_
* `Social Networking <http://www.cs.cmu.edu/~jelsas/data/ancestry.com/>`_
* `SourceForge Graph <http://www.nd.edu/~oss/Data/data.html>`_
* `Stack Exchange Network (Data Explorer) <http://data.stackexchange.com/help>`_
* `Titanic Survival Data Set <http://bit.do/dataset-titanic-csv-zip>`_
* `Twitter Graph <http://an.kaist.ac.kr/traces/WWW2010.html>`_
* `UC Berkeley's D-Lab Archive <http://ucdata.berkeley.edu/>`_
* `UCLA Social Sciences Data Archive <http://dataarchives.ss.ucla.edu/Home.DataPortals.htm>`_
* `UNIMI Social Network Datasets <http://law.di.unimi.it/datasets.php>`_
* `Universities Worldwide <http://univ.cc/>`_
* `UPJOHN for Employment Research <http://www.upjohn.org/erdc/erdc.html>`_
* `Yahoo Graph and Social Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=g>`_
* `YouTube Graph (2007,2008) <http://netsg.cs.sfu.ca/youtubedata/>`_


Sports
------

* `Betfair (betting exchange) Event Results <http://data.betfair.com/>`_
* `Cricsheet (cricket) <http://cricsheet.org/>`_
* `Ergast Formula 1 (API available) <http://ergast.com/mrd/db>`_
* `Football/Soccer data and APIs <http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/>`_
* `Lahman's Baseball Database <http://www.seanlahman.com/baseball-archive/statistics/>`_
* `Retrosheet (baseball) <http://www.retrosheet.org/game.htm>`_


Time Series
-----------

* `Time Series Data Library (TSDL) <https://datamarket.com/data/list/?q=provider:tsdl>`_: Created by Rob Hyndman, Professor of Statistics at Monash University, Australia.

* `UC Riverside Time Series <http://www.cs.ucr.edu/~eamonn/time_series_data/>`_: This data resource was created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering.


Transportation
--------------

* `Airlines Data 1987-2008 <http://stat-computing.org/dataexpo/2009/the-data.html>`_: Flight origin-destination (OD) data used by the ASA Challenge, 2009.

* `Bike Share Data Systems <https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems>`_: A collection of bike sharing systems and trip histories from around the world.

* `Edge data for US domestic flights 1990 to 2009 <http://data.memect.com/?p=229>`_

* `Half a million Hubway rides <http://hubwaydatachallenge.org/trip-history-data/>`_: Bike trip histories (since 2011) in MA published by Hubway.

* `Marine Traffic - ship tracks, port calls and more <https://www.marinetraffic.com/de/p/api-services>`_

* `NYC Taxi Trip Data 2013 <https://archive.org/details/nycTaxiTripData2013>`_: FOIA/FOILed Taxi Trip Data from the NYC Taxi and Limousine Commission 2013, released by a civic hacker, Chris Whong.

* `OpenFlights <http://openflights.org/data.html>`_: Airport, airline, and route data collected and contributed by open communities.

* `RITA Airline On-Time Performance Data <http://www.transtats.bts.gov/Tables.asp?DB_ID=120>`_: On-time arrival details for domestic flights by major air carriers in the U.S.

* `RITA transport data collection (TranStat) <http://www.transtats.bts.gov/DataIndex.asp>`_: Various transportation databases published by BTS.

* `Transport for London (TFL) <http://www.tfl.gov.uk/info-for/open-data-users/our-feeds>`_: London transportation data, including bike sharing, bus, train, and network statistics.

* `Travel Tracker Survey, Chicago <http://www.cmap.illinois.gov/data/transportation/travel-tracker-survey>`_: Data collection took place between January 2007 and February 2008. A total of 10,552 households participated in either a 1-day or 2-day survey, providing a detailed travel inventory for each member of their household on the assigned travel day(s).

* `U.S. Bureau of Transportation Statistics (BTS) <http://www.rita.dot.gov/bts/>`_: As part of RITA, BTS draws on nearly all transportation resources to create, manage, and share transportation statistical knowledge with the public.

* `U.S. Freight Analysis Framework <http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm>`_: Freight movement data among states and major metropolitan areas since 2007.


Complementary Collections
-------------------------

* DataWrangling: `Some Datasets Available on the Web <http://www.datawrangling.com/some-datasets-available-on-the-web>`_
* Inside-r: `Finding Data on the Internet <http://www.inside-r.org/howto/finding-data-internet>`_
* Quora: `Where can I find large datasets open to the public? <http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public>`_
* RS.io: `100+ Interesting Data Sets for Statistics <http://rs.io/2014/05/29/list-of-data-sets.html>`_
* StaTrek: `Leveraging open data to understand urban lives <http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/>`_