new nlp datasets

Alexandre Pinto 2017-10-06 17:47:15 +01:00
commit ace1a611cc
5 changed files with 520 additions and 177 deletions

10
.travis.yml Normal file

@@ -0,0 +1,10 @@
# language: ruby
# rvm:
# - 2.2
# before_script:
# - gem install awesome_bot
# script:
# - site404=www.datawrangling.com,getglue-data.s3.amazonaws.com,archive.org/details/2011-05-calufa-twitter-sql,www.stats4stem.org,lib.stat.cmu.edu,http://www.oecd.org/document/0,census.gov/acs/www/data_documentation/data_release_info/
# - whtlist=travis,crawdad.cs.dartmouth.edu,data.nasdaq.com,137.189.35.203/WebUI/CatDatabase/catData.html,numbrary.com,www.cmr.osu.edu,gutenberg.org,donnees.gouv.qc.ca,data.rio.rj.gov.br,ntrl.ntis.gov,openflights.org,www.data.gov.bc.ca,earthdata.nasa,pgp-hms,cru.uea.ac.uk,networkdata.ics,datos.argentina,data.gov.ie,isi.edu,data.go.id,wiki.dbpedia,www.laval.ca,www.wunderground.com,data.lexingtonky.gov,arcgis,bixi
# - site503=datamob.org,research.microsoft.com
# - awesome_bot README.rst --allow-dupe --allow-redirect --set-timeout 5 --allow-timeout --white-list $site404,$whtlist,$site503

107
Government.rst Normal file

@@ -0,0 +1,107 @@
Government
----------
* `EveryPolitician, ongoing project collating and sharing data on every politician. <http://everypolitician.org/>`_
* `Alberta, Province of Canada <http://open.alberta.ca>`_
* `Antwerp, Belgium <http://opendata.antwerpen.be/datasets>`_
* `Argentina (non official) <http://datar.noip.me/>`_
* `Argentina <http://datos.argentina.gob.ar/>`_
* `Austin, TX, US <https://data.austintexas.gov/>`_
* `Australia (abs.gov.au) <http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument>`_
* `Australia (data.gov.au) <https://data.gov.au/>`_
* `Austria (data.gv.at) <https://www.data.gv.at/>`_
* `Baton Rouge, LA, US <https://data.brla.gov/>`_
* `Belgium <http://data.gov.be/>`_
* `Brazil <http://dados.gov.br/dataset>`_
* `Buenos Aires, Argentina <http://data.buenosaires.gob.ar/>`_
* `Calgary, AB, Canada <https://data.calgary.ca/OpenData/Pages/DatasetListingAlphabetical.aspx>`_
* `Cambridge, MA, US <https://data.cambridgema.gov/>`_
* `Canada <http://open.canada.ca/en?lang=En&n=5BCD274E-1>`_
* `Chicago <https://data.cityofchicago.org/>`_
* `Chile <http://datos.gob.cl/dataset>`_
* `Dallas Open Data <https://www.dallasopendata.com/>`_
* `DataBC - data from the Province of British Columbia <http://www.data.gov.bc.ca/>`_
* `Denver Open Data <http://data.denvergov.org//>`_
* `Durham, NC Open Data <https://opendurham.nc.gov/explore/>`_
* `Edmonton, AB, Canada <https://data.edmonton.ca/>`_
* `England LGInform <http://lginform.local.gov.uk/>`_
* `EuroStat <http://ec.europa.eu/eurostat/data/database>`_
* `FedStats <http://fedstats.sites.usa.gov/>`_
* `Finland <https://www.opendata.fi/en>`_
* `France <https://www.data.gouv.fr/en/datasets/>`_
* `Fredericton, NB, Canada <http://www.fredericton.ca/en/citygovernment/Catalogue.asp>`_
* `Gatineau, QC, Canada <http://www.gatineau.ca/donneesouvertes/default_fr.aspx>`_
* `Germany <https://www-genesis.destatis.de/genesis/online>`_
* `Ghent, Belgium <https://data.stad.gent/datasets>`_
* `Glasgow, Scotland, UK <https://data.glasgow.gov.uk/>`_
* `Greece <http://www.data.gov.gr/>`_
* `Guardian world governments <http://www.guardian.co.uk/world-government-data>`_
* `Halifax, NS, Canada <http://www.halifax.ca/opendata/index.php>`_
* `Helsinki Region, Finland <http://www.hri.fi/en/>`_
* `Hong Kong, China <https://data.gov.hk/en/>`_
* `Houston Open Data <http://data.ohouston.org>`_
* `Indian Government Data <https://data.gov.in/>`_
* `Indonesian Data Portal <http://data.go.id/>`_
* `Ireland's Open Data Portal <https://data.gov.ie/data>`_
* `Japan <http://www.e-stat.go.jp/SG1/estat/eStatTopPortalE.do>`_
* `Laval, QC, Canada <http://www.laval.ca/Pages/Fr/Citoyens/donnees.aspx>`_
* `Lexington, KY <http://data.lexingtonky.gov/>`_
* `London Datastore, UK <http://data.london.gov.uk/dataset>`_
* `London, ON, Canada <http://www.london.ca/city-hall/open-data/Pages/default.aspx>`_
* `Los Angeles Open Data <https://data.lacity.org/>`_
* `MassGIS, Massachusetts, U.S. <http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/>`_
* `Mexico <http://catalogo.datos.gob.mx/dataset>`_
* `Mississauga, ON, Canada <http://www.mississauga.ca/portal/residents/publicationsopendatacatalogue>`_
* `Moldova <http://data.gov.md/>`_
* `Moncton, NB, Canada <http://www.moncton.ca/Government/Terms_of_use/Open_Data_Purpose/Data_Catalogue.htm>`_
* `Montreal, QC, Canada <http://donnees.ville.montreal.qc.ca/>`_
* `Netherlands <https://data.overheid.nl/>`_
* `New Zealand <http://www.stats.govt.nz/browse_for_stats.aspx>`_
* `NYC betanyc <http://betanyc.us/>`_
* `NYC Open Data <https://nycplatform.socrata.com/>`_
* `OECD <https://data.oecd.org/>`_
* `Oklahoma <https://data.ok.gov/>`_
* `Open Government Data (OGD) Platform India <https://data.gov.in/>`_
* `Oregon <https://data.oregon.gov/>`_
* `Ottawa, ON, Canada <http://data.ottawa.ca/en/>`_
* `Portland, Oregon <https://www.portlandoregon.gov/28130>`_
* `Portugal - Pordata organization <http://www.pordata.pt/en/Home>`_
* `Puerto Rico Government <https://data.pr.gov//>`_
* `Quebec City, QC, Canada <http://donnees.ville.quebec.qc.ca/>`_
* `Quebec Province of Canada <http://donnees.gouv.qc.ca/>`_
* `Regina SK, Canada <http://open.regina.ca/>`_
* `Rio de Janeiro, Brazil <http://data.rio.rj.gov.br/>`_
* `Romania <http://data.gov.ro/>`_
* `Russia <http://data.gov.ru>`_
* `San Francisco Data sets <http://datasf.org/>`_
* `Saskatchewan, Province of Canada <http://opendatask.ca/data/>`_
* `Seattle <https://data.seattle.gov/>`_
* `Singapore Government Data <https://data.gov.sg/>`_
* `South Africa <http://beta2.statssa.gov.za/>`_
* `South Africa Trade Statistics <http://www.econostatistics.co.za/>`_
* `State of Utah, US <https://opendata.utah.gov/>`_
* `Switzerland <http://www.opendata.admin.ch/>`_
* `Taiwan <http://data.gov.tw/>`_
* `Taiwan g0v <http://data.g0v.tw/>`_
* `Texas Open Data <https://data.texas.gov/>`_
* `The World Bank <http://wdronline.worldbank.org/>`_
* `Toronto, ON, Canada <http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=1a66e03bb8d1e310VgnVCM10000071d60f89RCRD>`_
* `Tunisia <http://www.data.gov.tn/>`_
* `U.K. Government Data <http://data.gov.uk/data>`_
* `U.S. American Community Survey <http://www.census.gov/acs/www/data_documentation/data_release_info/>`_
* `U.S. CDC Public Health datasets <http://www.cdc.gov/nchs/data_access/ftp_data.htm>`_
* `U.S. Census Bureau <http://www.census.gov/data.html>`_
* `U.S. Department of Housing and Urban Development (HUD) <http://www.huduser.gov/portal/datasets/pdrdatas.html>`_
* `U.S. Federal Government Agencies <http://www.data.gov/metrics>`_
* `U.S. Federal Government Data Catalog <http://catalog.data.gov/dataset>`_
* `U.S. Food and Drug Administration (FDA) <https://open.fda.gov/index.html>`_
* `U.S. National Center for Education Statistics (NCES) <http://nces.ed.gov/>`_
* `U.S. Open Government <http://www.data.gov/open-gov/>`_
* `Uganda Bureau of Statistics <http://www.ubos.org/unda/index.php/catalog>`_
* `UK 2011 Census Open Atlas Project <http://www.alex-singleton.com/r/2014/02/05/2011-census-open-atlas-project-version-two/>`_
* `United Nations <http://data.un.org/>`_
* `Uruguay <https://catalogodatos.gub.uy/>`_
* `Vancouver, BC Open Data Catalog <http://data.vancouver.ca/datacatalogue/>`_
* `Victoria, BC, Canada <http://www.victoria.ca/EN/main/city/open-data-catalogue.html>`_
* `Vienna, Austria <https://open.wien.gv.at/site/open-data/>`_


@@ -1,6 +1,6 @@
The MIT License (MIT) The MIT License (MIT)
Copyright (c) 2014 Xiaming Copyright (c) 2014-2015 Xiaming Chen and other contributors to this list.
Permission is hereby granted, free of charge, to any person obtaining a copy Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal of this software and associated documentation files (the "Software"), to deal


@@ -0,0 +1,3 @@
# Overview
* `Dataset Description <link to dataset>`_

575
README.rst Normal file → Executable file

@@ -1,66 +1,116 @@
Awesome Public Datasets Awesome Public Datasets
======================= =======================
.. image:: https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg
:alt: Awesome
:target: https://github.com/sindresorhus/awesome
`This list of public data sources <https://github.com/caesar0301/awesome-public-datasets>`_ `This list of a topic-centric public data sources <https://github.com/caesar0301/awesome-public-datasets>`_ in high quality. They
are collected and tidyed from blogs, answers, and user reponses. are collected and tidied from blogs, answers, and user responses.
Most of the data sets listed below are free, however, some are not. Most of the data sets listed below are free, however, some are not.
Other amazingly awesome lists can be found in the Other amazingly awesome lists can be found in the
`awesome-awesomeness <https://github.com/bayandin/awesome-awesomeness>`_ and `awesome-awesomeness <https://github.com/bayandin/awesome-awesomeness>`_ and
`another awesome <https://github.com/sindresorhus/awesome>`_ list. `sindresorhus's awesome <https://github.com/sindresorhus/awesome>`_ list.
.. contents:: Table of Contents
Agriculture Agriculture
------------ ------------
* `U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>`_ * `U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>`_
* `U.S. Department of Agriculture's Nutrient Database <https://www.ars.usda.gov/northeast-area/beltsville-md/beltsville-human-nutrition-research-center/nutrient-data-laboratory/docs/sr28-download-files/>`_
Biology Biology
------- -------
* `1000 Genomes <http://www.1000genomes.org/data>`_ * `1000 Genomes <http://www.1000genomes.org/data>`_
* `Collaborative Research in Computational Neuroscience (CRCNS) <http://crcns.org/data-sets>`_ * `American Gut (Microbiome Project) <https://github.com/biocore/American-Gut>`_
* `Broad Bioimage Benchmark Collection (BBBC) <https://www.broadinstitute.org/bbbc>`_
* `Broad Cancer Cell Line Encyclopedia (CCLE) <http://www.broadinstitute.org/ccle/home>`_
* `Cell Image Library <http://www.cellimagelibrary.org>`_
* `Complete Genomics Public Data <http://www.completegenomics.com/public-data/69-genomes/>`_
* `EBI ArrayExpress <http://www.ebi.ac.uk/arrayexpress/>`_
* `EBI Protein Data Bank in Europe <http://www.ebi.ac.uk/pdbe/emdb/index.html/>`_
* `Electron Microscopy Pilot Image Archive (EMPIAR) <http://www.ebi.ac.uk/pdbe/emdb/empiar/>`_
* `ENCODE project <https://www.encodeproject.org>`_
* `Ensembl Genomes <http://ensemblgenomes.org/info/genomes>`_
* `Gene Expression Omnibus (GEO) <http://www.ncbi.nlm.nih.gov/geo/>`_ * `Gene Expression Omnibus (GEO) <http://www.ncbi.nlm.nih.gov/geo/>`_
* `Gene Ontology (GO) <http://geneontology.org/page/download-annotations>`_
* `Global Biotic Interactions (GloBI) <https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data>`_
* `Harvard Medical School (HMS) LINCS Project <http://lincs.hms.harvard.edu>`_
* `Human Genome Diversity Project <http://www.hagsc.org/hgdp/files.html>`_
* `Human Microbiome Project (HMP) <http://www.hmpdacc.org/reference_genomes/reference_genomes.php>`_ * `Human Microbiome Project (HMP) <http://www.hmpdacc.org/reference_genomes/reference_genomes.php>`_
* `ICOS PSP Benchmark <http://www.infobiotic.net/PSPbenchmarks/>`_ * `ICOS PSP Benchmark <http://ico2s.org/datasets/psp_benchmark.html>`_
* `International HapMap Project <http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en>`_
* `Journal of Cell Biology DataViewer <http://jcb-dataviewer.rupress.org>`_
* `MIT Cancer Genomics Data <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi>`_ * `MIT Cancer Genomics Data <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi>`_
* `NIH Microarray data (FTP) <http://bit.do/VVW6>`_ * `NCBI Proteins <http://www.ncbi.nlm.nih.gov/guide/proteins/#databases>`_
* `Protein Data Bank <http://pdb.org/>`_ * `NCBI Taxonomy <http://www.ncbi.nlm.nih.gov/taxonomy>`_
* `NCI Genomic Data Commons <https://gdc-portal.nci.nih.gov>`_
* `NIH Microarray data <http://bit.do/VVW6>`_ or `FTP <ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/>`_ (see FTP link on `RAW <https://raw.githubusercontent.com/caesar0301/awesome-public-datasets/master/README.rst>`_)
* `OpenSNP genotypes data <https://opensnp.org/>`_
* `Pathguide - Protein-Protein Interactions Catalog <http://www.pathguide.org/>`_
* `Protein Data Bank <http://www.rcsb.org/>`_
* `Psychiatric Genomics Consortium <https://www.med.unc.edu/pgc/downloads>`_
* `PubChem Project <https://pubchem.ncbi.nlm.nih.gov/>`_ * `PubChem Project <https://pubchem.ncbi.nlm.nih.gov/>`_
* `PubGene (now Coremine Medical) <http://www.pubgene.org/>`_ * `PubGene (now Coremine Medical) <http://www.pubgene.org/>`_
* `Sanger Catalogue of Somatic Mutations in Cancer (COSMIC) <http://cancer.sanger.ac.uk/cosmic>`_
* `Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC) <http://www.cancerrxgene.org/>`_
* `Sequence Read Archive (SRA) <http://www.ncbi.nlm.nih.gov/Traces/sra/>`_
* `Stanford Microarray Data <http://smd.stanford.edu/>`_ * `Stanford Microarray Data <http://smd.stanford.edu/>`_
* `Stowers Institute Original Data Repository <http://www.stowers.org/research/publications/odr>`_
* `Systems Science of Biological Dynamics (SSBD) Database <http://ssbd.qbic.riken.jp>`_
* `The Cancer Genome Atlas (TCGA), available via Broad GDAC <https://gdac.broadinstitute.org/>`_
* `The Catalogue of Life <http://www.catalogueoflife.org/content/annual-checklist-archive>`_
* `The Personal Genome Project <http://www.personalgenomes.org/>`_ or `PGP <https://my.pgp-hms.org/public_genetic_data>`_ * `The Personal Genome Project <http://www.personalgenomes.org/>`_ or `PGP <https://my.pgp-hms.org/public_genetic_data>`_
* `UCSC Public Data <http://hgdownload.soe.ucsc.edu/downloads.html>`_ * `UCSC Public Data <http://hgdownload.soe.ucsc.edu/downloads.html>`_
* `UniGene <http://www.ncbi.nlm.nih.gov/unigene>`_ * `UniGene <http://www.ncbi.nlm.nih.gov/unigene>`_
* `Universal Protein Resource (UniProt) <http://www.uniprot.org/downloads>`_
Climate/Weather Climate/Weather
--------------- ---------------
* `Actuaries Climate Index <http://actuariesclimateindex.org/data/>`_
* `Australian Weather <http://www.bom.gov.au/climate/dwo/>`_ * `Australian Weather <http://www.bom.gov.au/climate/dwo/>`_
* `Canadian Meteorological Centre <https://weather.gc.ca/grib/index_e.html>`_ * `Aviation Weather Center - Consistent, timely and accurate weather information for the world airspace system <https://aviationweather.gov/adds/dataserver>`_
* `Climate Data from UEA (updated at roughly monthly intervals) <http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/>`_ * `Brazilian Weather - Historical data (In Portuguese) <http://sinda.crn2.inpe.br/PCD/SITE/novo/site/>`_
* `Global Climate Data Since 1929 <http://www.tutiempo.net/en/Climate>`_ * `Canadian Meteorological Centre <http://weather.gc.ca/grib/index_e.html>`_
* `Climate Data from UEA (updated monthly) <https://crudata.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/>`_
* `European Climate Assessment & Dataset <http://eca.knmi.nl/>`_
* `Global Climate Data Since 1929 <http://en.tutiempo.net/climate>`_
* `NASA Global Imagery Browse Services <https://wiki.earthdata.nasa.gov/display/GIBS>`_
* `NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>`_ * `NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>`_
* `NOAA Climate Datasets <http://ncdc.noaa.gov/data-access/quick-links>`_ * `NOAA Climate Datasets <http://www.ncdc.noaa.gov/data-access/quick-links>`_
* `NOAA Realtime Weather Models <http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction>`_ * `NOAA Realtime Weather Models <http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction>`_
* `WU Historical Weather Worldwide <http://www.wunderground.com/history/index.html>`_ * `NOAA SURFRAD Meteorology and Radiation Datasets <https://www.esrl.noaa.gov/gmd/grad/stardata.html>`_
* `The World Bank Open Data Resources for Climate Change <http://data.worldbank.org/developers/climate-data-api>`_
* `UEA Climatic Research Unit <http://www.cru.uea.ac.uk/data>`_
* `WorldClim - Global Climate Data <http://www.worldclim.org>`_
* `WU Historical Weather Worldwide <https://www.wunderground.com/history/index.html>`_
Complex Networks Complex Networks
---------------- ----------------
* `AMiner Citation Network Dataset <http://aminer.org/citation>`_
* `CrossRef DOI URLs <https://archive.org/details/doi-urls>`_ * `CrossRef DOI URLs <https://archive.org/details/doi-urls>`_
* `DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>`_ * `DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>`_
* `DIMACS Road Networks Collection <http://www.dis.uniroma1.it/challenge9/download.shtml>`_
* `NBER Patent Citations <http://nber.org/patents/>`_ * `NBER Patent Citations <http://nber.org/patents/>`_
* `Network Repository with Interactive Exploratory Analysis Tools <http://networkrepository.com/>`_
* `NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>`_ * `NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>`_
* `Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>`_ * `Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>`_
* `PyPI and Maven Dependency Network <http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>`_ * `PyPI and Maven Dependency Network <https://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>`_
* `Scopus Citation Database <http://www.elsevier.com/online-tools/scopus>`_ * `Scopus Citation Database <https://www.elsevier.com/solutions/scopus>`_
* `Small Network Data <http://www-personal.umich.edu/~mejn/netdata/>`_
* `Stanford GraphBase (Steven Skiena) <http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml>`_ * `Stanford GraphBase (Steven Skiena) <http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml>`_
* `Stanford Large Network Dataset Collection <http://snap.stanford.edu/data/>`_ * `Stanford Large Network Dataset Collection <http://snap.stanford.edu/data/>`_
* `Stanford Longitudinal Network Data Sources <http://stanford.edu/group/sonia/dataSources/index.html>`_
* `The Koblenz Network Collection <http://konect.uni-koblenz.de/>`_ * `The Koblenz Network Collection <http://konect.uni-koblenz.de/>`_
* `The Laboratory for Web Algorithmics (UNIMI) <http://law.di.unimi.it/datasets.php>`_ * `The Laboratory for Web Algorithmics (UNIMI) <http://law.di.unimi.it/datasets.php>`_
* `UCI Network Data Repository <http://networkdata.ics.uci.edu/resources.php>`_ * `The Nexus Network Repository <http://nexus.igraph.org/>`_
* `UCI Network Data Repository <https://networkdata.ics.uci.edu/resources.php>`_
* `UFL sparse matrix collection <http://www.cise.ufl.edu/research/sparse/matrices/>`_ * `UFL sparse matrix collection <http://www.cise.ufl.edu/research/sparse/matrices/>`_
* `WSU Graph Database <http://www.eecs.wsu.edu/mgd/gdb.html>`_ * `WSU Graph Database <http://www.eecs.wsu.edu/mgd/gdb.html>`_
@@ -68,36 +118,79 @@ Complex Networks
Computer Networks Computer Networks
----------------- -----------------
* `3.5B Web Pages <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>`_ * `3.5B Web Pages from CommonCrawl 2012 <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>`_
* `53.5B Web clicks <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset>`_ * `53.5B Web clicks of 100K users in Indiana Univ. <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/>`_
* `CAIDA Internet Datasets <http://www.caida.org/data/overview/>`_ * `CAIDA Internet Datasets <http://www.caida.org/data/overview/>`_
* `ClueWeb09 <http://lemurproject.org/clueweb09/>`_ * `ClueWeb09 - 1B web pages <http://lemurproject.org/clueweb09/>`_
* `ClueWeb12 <http://lemurproject.org/clueweb12/>`_ * `ClueWeb12 - 733M web pages <http://lemurproject.org/clueweb12/>`_
* `CommonCrawl Web Data <http://commoncrawl.org/the-data/get-started/>`_ * `CommonCrawl Web Data over 7 years <http://commoncrawl.org/the-data/get-started/>`_
* `Dartmouth CRAWDAD Wireless datasets <http://crawdad.cs.dartmouth.edu/>`_ * `CRAWDAD Wireless datasets from Dartmouth Univ. <https://crawdad.cs.dartmouth.edu/>`_
* `OpenMobileData (MobiPerf) <https://console.developers.google.com/storage/openmobiledata_public/>`_ * `Criteo click-through data <http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/>`_
* `UCSD Network Telescope <http://www.caida.org/projects/network_telescope/>`_ * `OONI: Open Observatory of Network Interference - Internet censorship data <https://ooni.torproject.org/data/>`_
* `Open Mobile Data by MobiPerf <https://console.developers.google.com/storage/openmobiledata_public/>`_
* `Rapid7 Sonar Internet Scans <https://sonar.labs.rapid7.com/>`_
* `UCSD Network Telescope, IPv4 /8 net <http://www.caida.org/projects/network_telescope/>`_
Data Challenges Data Challenges
--------------- ---------------
* `Bruteforce Database <https://github.com/duyetdev/bruteforce-database>`_
* `Challenges in Machine Learning <http://www.chalearn.org/>`_ * `Challenges in Machine Learning <http://www.chalearn.org/>`_
* `CrowdANALYTIX dataX <http://data.crowdanalytix.com>`_
* `D4D Challenge of Orange <http://www.d4d.orange.com/en/home>`_
* `DrivenData Competitions for Social Good <http://www.drivendata.org/>`_ * `DrivenData Competitions for Social Good <http://www.drivendata.org/>`_
* `ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>`_ * `ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>`_
* `Kaggle Competition Data <http://www.kaggle.com/>`_ * `Kaggle Competition Data <https://www.kaggle.com/>`_
* `KDD Cup by Tencent 2012 <https://www.kddcup2012.org/>`_ * `KDD Cup by Tencent 2012 <http://www.kddcup2012.org/>`_
* `Localytics Data Visualization Challenge <https://github.com/localytics/data-viz-challenge>`_ * `Localytics Data Visualization Challenge <https://github.com/localytics/data-viz-challenge>`_
* `Netflix Prize <http://www.netflixprize.com/leaderboard>`_ * `Netflix Prize <http://netflixprize.com/leaderboard.html>`_
* `Space Apps Challenge <https://2015.spaceappschallenge.org>`_
* `Telecom Italia Big Data Challenge <https://dandelion.eu/datamine/open-big-data/>`_
* `TravisTorrent Dataset - MSR'2017 Mining Challenge <https://travistorrent.testroots.org/>`_
* `Yelp Dataset Challenge <http://www.yelp.com/dataset_challenge>`_ * `Yelp Dataset Challenge <http://www.yelp.com/dataset_challenge>`_
Earth Science
-------------
* `AQUASTAT - Global water resources and uses <http://www.fao.org/nr/water/aquastat/data/query/index.html?lang=en>`_
* `BODC - marine data of ~22K vars <https://www.bodc.ac.uk/data/>`_
* `Earth Models <http://www.earthmodels.org/>`_
* `EOSDIS - NASA's earth observing system data <http://sedac.ciesin.columbia.edu/data/sets/browse>`_
* `Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements <https://imos.aodn.org.au>`_ or `on S3 <http://imos-data.s3-website-ap-southeast-2.amazonaws.com/>`_
* `Marinexplore - Open Oceanographic Data <http://marinexplore.org/>`_
* `Smithsonian Institution Global Volcano and Eruption Database <http://volcano.si.edu/>`_
* `USGS Earthquake Archives <http://earthquake.usgs.gov/earthquakes/search/>`_
Economics Economics
--------- ---------
* `American Economic Ass. (AEA) <http://www.aeaweb.org/RFE/toc.php?show=complete>`_ * `American Economic Association (AEA) <https://www.aeaweb.org/resources/data>`_
* `EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>`_ * `EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>`_
* `Economic Freedom of the World Data <http://www.freetheworld.com/datasets_efw.html>`_
* `Historical Macroeconomic Statistics <http://www.historicalstatistics.org/>`_
* `International Economics Database <http://widukind.cepremap.org/>`_ and `various data tools <https://github.com/Widukind>`_
* `International Trade Statistics <http://www.econostatistics.co.za/>`_
* `Internet Product Code Database <http://www.upcdatabase.com/>`_ * `Internet Product Code Database <http://www.upcdatabase.com/>`_
* `Joint External Debt Data Hub <http://www.jedh.org/>`_
* `Jon Haveman International Trade Data Links <http://www.macalester.edu/research/economics/PAGE/HAVEMAN/Trade.Resources/TradeData.html>`_
* `OpenCorporates Database of Companies in the World <https://opencorporates.com/>`_
* `Our World in Data <http://ourworldindata.org/>`_
* `SciencesPo World Trade Gravity Datasets <http://econ.sciences-po.fr/thierry-mayer/data>`_
* `The Atlas of Economic Complexity <http://atlas.cid.harvard.edu>`_
* `The Center for International Data <http://cid.econ.ucdavis.edu>`_
* `The Observatory of Economic Complexity <http://atlas.media.mit.edu/en/>`_
* `UN Commodity Trade Statistics <http://comtrade.un.org/db/>`_
* `UN Human Development Reports <http://hdr.undp.org/en>`_
Education
------------
* `College Scorecard Data <https://collegescorecard.ed.gov/data/>`_
* `Student Data from Free Code Camp <http://academictorrents.com/details/030b10dad0846b5aecc3905692890fb02404adbf>`_
Energy Energy
@@ -107,13 +200,17 @@ Energy
* `BLUEd <http://nilm.cmubi.org/>`_ * `BLUEd <http://nilm.cmubi.org/>`_
* `COMBED <http://combed.github.io/>`_ * `COMBED <http://combed.github.io/>`_
* `Dataport <https://dataport.pecanstreet.org/>`_ * `Dataport <https://dataport.pecanstreet.org/>`_
* `DRED <http://www.st.ewi.tudelft.nl/~akshay/dred/>`_
* `ECO <http://www.vs.inf.ethz.ch/res/show.html?what=eco-data>`_ * `ECO <http://www.vs.inf.ethz.ch/res/show.html?what=eco-data>`_
* `EIA <http://www.eia.gov/electricity/data/eia923/>`_ * `EIA <http://www.eia.gov/electricity/data/eia923/>`_
* `HES <http://randd.defra.gov.uk/Default.aspx?Menu=Menu&Module=More&Location=None&ProjectID=17359&FromSearch=Y&Publisher=1&SearchText=EV0702&SortString=ProjectCode&SortOrder=Asc&Paging=10#Description>`_ - Household Electricity Study, UK
* `HFED <http://hfed.github.io/>`_ * `HFED <http://hfed.github.io/>`_
* `iAWE <http://iawe.github.io/>`_ * `iAWE <http://iawe.github.io/>`_
* `Plaid <http://plaidplug.com/>`_ * `PLAID <http://plaidplug.com/>`_ - the Plug Load Appliance Identification Dataset
* `REDD <http://redd.csail.mit.edu/>`_ * `REDD <http://redd.csail.mit.edu/>`_
* `UK-Dale <http://www.doc.ic.ac.uk/~dk3810/data/>`_ * `Tracebase <https://www.tracebase.org>`_
* `UK-DALE <http://www.doc.ic.ac.uk/~dk3810/data/>`_ - UK Domestic Appliance-Level Electricity
* `WHITED <http://nilmworkshop.org/2016/proceedings/Poster_ID18.pdf>`_
Finance Finance
@@ -123,264 +220,390 @@ Finance
* `Google Finance <https://www.google.com/finance>`_ * `Google Finance <https://www.google.com/finance>`_
* `Google Trends <http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0>`_ * `Google Trends <http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0>`_
* `NASDAQ <https://data.nasdaq.com/>`_ * `NASDAQ <https://data.nasdaq.com/>`_
* `NYSE Market Data <ftp://ftp.nyxdata.com>`_ (see FTP link on `RAW <https://raw.githubusercontent.com/caesar0301/awesome-public-datasets/master/README.rst>`_)
* `OANDA <http://www.oanda.com/>`_ * `OANDA <http://www.oanda.com/>`_
* `OSU Financial data <http://fisher.osu.edu/fin/fdf/osudata.htm>`_ * `OSU Financial data <http://fisher.osu.edu/fin/fdf/osudata.htm>`_
* `Quandl <http://www.quandl.com/>`_ * `Quandl <https://www.quandl.com/>`_
* `St Louis Federal <http://research.stlouisfed.org/fred2/>`_ * `St Louis Federal <https://research.stlouisfed.org/fred2/>`_
* `Yahoo Finance <http://finance.yahoo.com/>`_ * `Yahoo Finance <http://finance.yahoo.com/>`_
GeoSpace/GIS GIS
------------ ---
* `BODC (marine data of nearly 22,000 oceanographic vars) <http://www.bodc.ac.uk/data/where_to_find_data/>`_ * `ArcGIS Open Data portal <http://opendata.arcgis.com/>`_
* `EOSDIS <http://sedac.ciesin.columbia.edu/data/sets/browse>`_ * `Cambridge, MA, US, GIS data on GitHub <http://cambridgegis.github.io/gisdata.html>`_
* `Factual Global Location Data <http://www.factual.com/>`_ * `Factual Global Location Data <https://www.factual.com/>`_
* `GADM (Global Administrative Areas database) <http://www.gadm.org/>`_
* `Geo Spatial Data from ASU <http://geodacenter.asu.edu/datalist/>`_ * `Geo Spatial Data from ASU <http://geodacenter.asu.edu/datalist/>`_
* `GeoNames (over eight million placenames) <http://www.geonames.org/>`_ * `Geo Wiki Project - Citizen-driven Environmental Monitoring <http://geo-wiki.org/>`_
* `Natural Earth (vectors and rasters of the world) <http://www.naturalearthdata.com/>`_ * `GeoFabrik - OSM data extracted to a variety of formats and areas <http://download.geofabrik.de/>`_
* `OpenStreetMap (a free map worldwide) <http://wiki.openstreetmap.org/wiki/Downloading_data>`_ * `GeoNames Worldwide <http://www.geonames.org/>`_
* `TIGER/Line (official United States boundaries and roads) <http://www.census.gov/geo/maps-data/data/tiger-line.html>`_ * `Global Administrative Areas Database (GADM) <http://www.gadm.org/>`_
* `twofishes (Foursquare's coarse geocoder) <https://github.com/foursquare/twofishes>`_ * `Homeland Infrastructure Foundation-Level Data <https://hifld-dhs-gii.opendata.arcgis.com/>`_
* `tz_world (timezone polygons) <http://efele.net/maps/tz/world/>`_ * `Landsat 8 on AWS <https://aws.amazon.com/public-data-sets/landsat/>`_
* `List of all countries in all languages <https://github.com/umpirsky/country-list>`_
* `National Weather Service GIS Data Portal <http://www.nws.noaa.gov/gis/>`_
* `Natural Earth - vectors and rasters of the world <http://www.naturalearthdata.com/>`_
* `OpenAddresses <http://openaddresses.io/>`_
* `OpenStreetMap (OSM) <http://wiki.openstreetmap.org/wiki/Downloading_data>`_
* `Pleiades - Gazetteer and graph of ancient places <http://pleiades.stoa.org/>`_
* `Reverse Geocoder using OSM data <https://github.com/kno10/reversegeocode>`_ & `additional high-resolution data files <http://data.ub.uni-muenchen.de/61/>`_
* `TIGER/Line - U.S. boundaries and roads <http://www.census.gov/geo/maps-data/data/tiger-line.html>`_
* `TwoFishes - Foursquare's coarse geocoder <https://github.com/foursquare/twofishes>`_
* `TZ Timezones shapefiles <http://efele.net/maps/tz/world/>`_
* `UN Environmental Data <http://geodata.grid.unep.ch/>`_
* `World boundaries from the U.S. Department of State <https://hiu.state.gov/data/data.aspx>`_
* `World countries in multiple formats <https://github.com/mledoze/countries>`_
Government Government
---------- ----------
* `Australia <http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument>`_ (abs.gov.au) * `A list of cities and countries contributed by community <https://github.com/caesar0301/awesome-public-datasets/blob/master/Government.rst>`_
* `Australia <https://data.gov.au/>`_ (data.gov.au) * `Open Data for Africa <http://opendataforafrica.org/>`_
* `Canada <http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1>`_ * `OpenDataSoft's list of 1,600 open data <https://www.opendatasoft.com/a-comprehensive-list-of-all-open-data-portals-around-the-world/>`_
* `Chicago <https://data.cityofchicago.org/>`_
* `EuroStat <http://ec.europa.eu/eurostat/data/database>`_
* `FedStats <http://www.fedstats.gov/cgi-bin/A2Z.cgi>`_
* `Germany <https://www-genesis.destatis.de/genesis/online>`_
* `Glasgow, Scotland, UK <http://data.glasgow.gov.uk/>`_
* `Guardian world governments <http://www.guardian.co.uk/world-government-data>`_
* `London Datastore, U.K. <http://data.london.gov.uk/dataset>`_
* `Netherlands <https://data.overheid.nl/>`_
* `New Zealand <http://www.stats.govt.nz/browse_for_stats.aspx>`_
* `NYC betanyc <http://betanyc.us/>`_
* `NYC Open Data <http://nycplatform.socrata.com/>`_
* `OECD <http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html>`_
* `Open Government Data (OGD) Platform India <http://www.data.gov.in/>`_
* `San Francisco Data sets <http://datasf.org/>`_
* `South Africa <http://beta2.statssa.gov.za/>`_
* `The World Bank <http://wdronline.worldbank.org/>`_
* `U.K. Government Data <http://data.gov.uk/data>`_
* `U.S. American Community Survey <http://www.census.gov/acs/www/data_documentation/data_release_info/>`_
* `U.S. CDC Public Health datasets <http://www.cdc.gov/nchs/data_access/ftp_data.htm>`_
* `U.S. Census Bureau <http://www.census.gov/data.html>`_
* `U.S. Department of Housing and Urban Development (HUD) <http://www.huduser.org/portal/datasets/pdrdatas.html>`_
* `U.S. Federal Government Agencies <http://www.data.gov/metric>`_
* `U.S. Federal Government Data Catalog <http://catalog.data.gov/dataset>`_
* `U.S. Food and Drug Administration (FDA) <https://open.fda.gov/index.html>`_
* `U.S. Open Government <http://www.data.gov/open-gov/>`_
* `UK 2011 Census Open Atlas Project <http://www.alex-singleton.com/2011-census-open-atlas-project/>`_
* `United Nations <http://data.un.org/>`_
Healthcare Healthcare
---------- ----------
* `EHDP Large Health Data Sets <http://www.ehdp.com/vitalnet/datasets.htm>`_ * `EHDP Large Health Data Sets <http://www.ehdp.com/vitalnet/datasets.htm>`_
* `Gapminder <http://www.gapminder.org/data/>`_ * `Gapminder World demographic databases <http://www.gapminder.org/data/>`_
* `GDC supports several cancer genome programs for CCG, TCGA, TARGET etc. <https://gdc.cancer.gov/>`_
* `PhysioBank Databases - a large and growing archive of physiological data <https://www.physionet.org/physiobank/database/>`_
* `Medicare Coverage Database (MCD), U.S. <https://www.cms.gov/medicare-coverage-database/>`_
* `Medicare Data Engine of medicare.gov Data <https://data.medicare.gov/>`_
* `Medicare Data File <http://go.cms.gov/19xxPN4>`_ * `Medicare Data File <http://go.cms.gov/19xxPN4>`_
* `MeSH, the vocabulary thesaurus used for indexing articles for PubMed <https://www.nlm.nih.gov/mesh/filelist.html>`_
* `Number of Ebola Cases and Deaths in Affected Countries (2014) <https://data.hdx.rwlabs.org/dataset/ebola-cases-2014>`_
* `Open-ODS (structure of the UK NHS) <http://www.openods.co.uk>`_
* `OpenPaymentsData, Healthcare financial relationship data <https://openpaymentsdata.cms.gov>`_
* The Cancer Genome Atlas project (TCGA) (refer to `GDC <https://portal.gdc.cancer.gov/>`_ and `BigQuery table <http://google-genomics.readthedocs.org/en/latest/use_cases/discover_public_data/isb_cgc_data.html>`_)
* `World Health Organization Global Health Observatory <http://www.who.int/gho/en/>`_
Image Processing Image Processing
---------------- ----------------
* `2GB of photos of cats <http://137.189.35.203/WebUI/CatDatabase/catData.html>`_
* `Face Recognition Benchmark <http://www.face-rec.org/databases/>`_
* `ImageNet <http://www.image-net.org/>`_
* `SUN database <http://groups.csail.mit.edu/vision/SUN/hierarchy.html>`_
* `10k US Adult Faces Database <http://wilmabainbridge.com/facememorability2.html>`_ * `10k US Adult Faces Database <http://wilmabainbridge.com/facememorability2.html>`_
* `2GB of Photos of Cats <http://137.189.35.203/WebUI/CatDatabase/catData.html>`_ or `Archive version <https://web.archive.org/web/20150520175645/http://137.189.35.203/WebUI/CatDatabase/catData.html>`_
* `Adience Unfiltered faces for gender and age classification <http://www.openu.ac.il/home/hassner/Adience/data.html>`_
* `Affective Image Classification <http://www.imageemotion.org/>`_ * `Affective Image Classification <http://www.imageemotion.org/>`_
* `International Affective Picture System <http://csea.phhp.ufl.edu/media/iapsmessage.html>`_ * `Animals with attributes <http://attributes.kyb.tuebingen.mpg.de/>`_
* `Massive Visual Memory Stimuli <http://cvcl.mit.edu/MM/stimuli.html>`_ * `Caltech Pedestrian Detection Benchmark <https://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/>`_
* `Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available) <http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/>`_
* `Face Recognition Benchmark <http://www.face-rec.org/databases/>`_
* `Flickr: 32 Class Brand Logos <http://www.multimedia-computing.de/flickrlogos/>`_
* `GDXray: X-ray images for X-ray testing and Computer Vision <http://dmery.ing.puc.cl/index.php/material/gdxray/>`_
* `ImageNet (in WordNet hierarchy) <http://www.image-net.org/>`_
* `Indoor Scene Recognition <http://web.mit.edu/torralba/www/indoor.html>`_
* `International Affective Picture System, UFL <http://csea.phhp.ufl.edu/media/iapsmessage.html>`_
* `Massive Visual Memory Stimuli, MIT <http://cvcl.mit.edu/MM/stimuli.html>`_
* `MNIST database of handwritten digits, near 1 million examples <http://yann.lecun.com/exdb/mnist/>`_
* `Several Shape-from-Silhouette Datasets <http://kaiwolf.no-ip.org/3d-model-repository.html>`_
* `Stanford Dogs Dataset <http://vision.stanford.edu/aditya86/ImageNetDogs/>`_
* `SUN database, MIT <http://groups.csail.mit.edu/vision/SUN/hierarchy.html>`_
* `The Action Similarity Labeling (ASLAN) Challenge <http://www.openu.ac.il/home/hassner/data/ASLAN/ASLAN.html>`_
* `The Oxford-IIIT Pet Dataset <http://www.robots.ox.ac.uk/~vgg/data/pets/>`_
* `Violent-Flows - Crowd Violence / Non-violence Database and benchmark <http://www.openu.ac.il/home/hassner/data/violentflows/>`_
* `Visual genome <http://visualgenome.org/api/v0/api_home.html>`_
* `YouTube Faces Database <http://www.cs.tau.ac.il/~wolf/ytfaces/>`_
Machine Learning Machine Learning
---------------- ----------------
* `eBay Online Auctions <http://www.modelingonlineauctions.com/datasets>`_ * `Context-aware data sets from five domains <https://github.com/irecsys/CARSKit/tree/master/context-aware_data_sets>`_
* `IMDb database <http://www.imdb.com/interfaces>`_ * `Delve Datasets for classification and regression (Univ. of Toronto) <http://www.cs.toronto.edu/~delve/data/datasets.html>`_
* `Keel Repository <http://sci2s.ugr.es/keel/datasets.php>`_ * `Discogs Monthly Data <http://data.discogs.com/>`_
* `eBay Online Auctions (2012) <http://www.modelingonlineauctions.com/datasets>`_
* `IMDb Database <http://www.imdb.com/interfaces>`_
* `Keel Repository for classification, regression and time series <http://sci2s.ugr.es/keel/datasets.php>`_
* `Labeled Faces in the Wild (LFW) <http://vis-www.cs.umass.edu/lfw/>`_
* `Lending Club Loan Data <https://www.lendingclub.com/info/download-data.action>`_ * `Lending Club Loan Data <https://www.lendingclub.com/info/download-data.action>`_
* `Machine Learning Data Set Repository <http://mldata.org/>`_ * `Machine Learning Data Set Repository <http://mldata.org/>`_
* `Million Song Dataset <http://blog.echonest.com/post/3639160982/million-song-dataset>`_ * `Free Music Archive <https://github.com/mdeff/fma>`_
* `Million Song Dataset <http://labrosa.ee.columbia.edu/millionsong/>`_
* `More Song Datasets <http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets>`_ * `More Song Datasets <http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets>`_
* `MovieLens Data Sets <http://datahub.io/dataset/movielens>`_ * `MovieLens Data Sets <http://grouplens.org/datasets/movielens/>`_
* `RDataMining R and Data Mining ebook data <http://www.rdatamining.com/data>`_ * `New Yorker caption contest ratings <https://github.com/nextml/caption-contest-data>`_
* `Registered meteorites on Earth <http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized>`_ * `RDataMining - "R and Data Mining" ebook data <http://www.rdatamining.com/data>`_
* `SF restaurants dataset <http://missionlocal.org/san-francisco-restaurant-health-inspections/>`_ * `Registered Meteorites on Earth <http://healthintelligence.drupalgardens.com/content/registered-meteorites-has-impacted-earth-visualized>`_
* `Restaurants Health Score Data in San Francisco <http://missionlocal.org/san-francisco-restaurant-health-inspections/>`_
* `UCI Machine Learning Repository <http://archive.ics.uci.edu/ml/>`_ * `UCI Machine Learning Repository <http://archive.ics.uci.edu/ml/>`_
* `University of Toronto Delve Datasets <http://www.cs.toronto.edu/~delve/data/datasets.html>`_ * `Yahoo! Ratings and Classification Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=r>`_
* `Yahoo Ratings and Classification Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=r>`_ * `YouTube-8M <https://research.google.com/youtube8m/download.html>`_
Museums Museums
------- -------
* `Canada Science and Technology Museums Corporation's Open Data <http://techno-science.ca/en/data.php>`_
* `Cooper-Hewitt's Collection Database <https://github.com/cooperhewitt/collection>`_ * `Cooper-Hewitt's Collection Database <https://github.com/cooperhewitt/collection>`_
* `Minneapolis Institute of Arts metadata <https://github.com/artsmia/collection>`_ * `Minneapolis Institute of Arts metadata <https://github.com/artsmia/collection>`_
* `Natural History Museum (London) Data Portal <http://data.nhm.ac.uk/>`_
* `Rijksmuseum Historical Art Collection <https://www.rijksmuseum.nl/en/api>`_
* `Tate Collection metadata <https://github.com/tategallery/collection>`_ * `Tate Collection metadata <https://github.com/tategallery/collection>`_
* `The Getty vocabularies <http://vocab.getty.edu>`_ * `The Getty vocabularies <http://vocab.getty.edu>`_
Music
-----
* `Discogs Data <http://www.discogs.com/data/>`_
Natural Language Natural Language
---------------- ----------------
* `40 Million Entities in Context <https://code.google.com/p/wiki-links/downloads/list>`_ * `POS/NER/Chunk annotated data <https://github.com/aritter/twitter_nlp/tree/master/data/annotated>`_
* `Automatic Keyphrase Extraction <https://github.com/snkim/AutomaticKeyphraseExtraction/>`_
* `Blogger Corpus <http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm>`_
* `CLiPS Stylometry Investigation Corpus <http://www.clips.uantwerpen.be/datasets/csi-corpus>`_
* `ClueWeb09 FACC <http://lemurproject.org/clueweb09/FACC1/>`_ * `ClueWeb09 FACC <http://lemurproject.org/clueweb09/FACC1/>`_
* `ClueWeb12 FACC <http://lemurproject.org/clueweb12/FACC1/>`_ * `ClueWeb12 FACC <http://lemurproject.org/clueweb12/FACC1/>`_
* `DBpedia <http://wiki.dbpedia.org/Datasets>`_ * `DBpedia - 4.58M things with 583M facts <http://wiki.dbpedia.org/Datasets>`_
* `Flickr personal taxonomies <http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html>`_ * `Flickr Personal Taxonomies <http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html>`_
* `Google Books Ngrams <http://aws.amazon.com/datasets/8172056142375670>`_ * `Freebase.com of people, places, and things <http://www.freebase.com/>`_
* `Google Web 5gram, 2006 (1T) <https://catalog.ldc.upenn.edu/LDC2006T13>`_ * `Google Books Ngrams (2.2TB) <https://aws.amazon.com/datasets/google-books-ngrams/>`_
* `Google MC-AFP, generated based on the public available Gigaword dataset using Paragraph Vectors <https://github.com/google/mcafp>`_
* `Google Web 5gram (1TB, 2006) <https://catalog.ldc.upenn.edu/LDC2006T13>`_
* `Gutenberg eBooks List <http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs>`_ * `Gutenberg eBooks List <http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs>`_
* `Hansards <http://www.isi.edu/natural-language/download/hansard/>`_ * `Hansards text chunks of Canadian Parliament <http://www.isi.edu/natural-language/download/hansard/>`_
* `Machine Translation <http://statmt.org/wmt11/translation-task.html#download>`_ * `Machine Comprehension Test (MCTest) of text from Microsoft Research <http://research.microsoft.com/en-us/um/redmond/projects/mctest/index.html>`_
* `SMS Spam Collection <http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/>`_ * `Machine Translation of European languages <http://statmt.org/wmt11/translation-task.html#download>`_
* `USENET corpus <http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html>`_ * `Making Sense of Microposts 2013 - Concept Extraction <http://oak.dcs.shef.ac.uk/msm2013/challenge.html>`_
* `Wikidata <https://www.wikidata.org/wiki/Wikidata:Database_download>`_ * `Making Sense of Microposts 2016 - Named Entity rEcognition and Linking <http://microposts2016.seas.upenn.edu/challenge.html>`_
* `WordNet <http://wordnet.princeton.edu/wordnet/download/>`_ * `Microsoft MAchine Reading COmprehension Dataset (or MS MARCO) <http://www.msmarco.org/dataset.aspx>`_
* `Multi-Domain Sentiment Dataset (version 2.0) <http://www.cs.jhu.edu/~mdredze/datasets/sentiment/>`_
* `Open Multilingual Wordnet <http://compling.hss.ntu.edu.sg/omw/>`_
* `Personae Corpus <http://www.clips.uantwerpen.be/datasets/personae-corpus>`_
* `SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles) <https://github.com/ParallelMazen/SaudiNewsNet>`_
* `SMS Spam Collection in English <http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/>`_
* `Universal Dependencies <http://universaldependencies.org>`_
* `USENET postings corpus of 2005~2011 <http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html>`_
* `Webhose - News/Blogs in multiple languages <https://webhose.io/datasets>`_
* `Wikidata - Wikipedia databases <https://www.wikidata.org/wiki/Wikidata:Database_download>`_
* `Wikipedia Links data - 40 Million Entities in Context <https://code.google.com/p/wiki-links/downloads/list>`_
* `WordNet databases and tools <http://wordnet.princeton.edu/wordnet/download/>`_
Neuroscience
-------------
* `Allen Institute Datasets <http://www.brain-map.org/>`_
* `Brain Catalogue <http://braincatalogue.org/>`_
* `Brainomics <http://brainomics.cea.fr/localizer>`_
* `CodeNeuro Datasets <http://datasets.codeneuro.org/>`_
* `Collaborative Research in Computational Neuroscience (CRCNS) <http://crcns.org/data-sets>`_
* `FCP-INDI <http://fcon_1000.projects.nitrc.org/index.html>`_
* `Human Connectome Project <http://www.humanconnectome.org/data/>`_
* `NDAR <https://ndar.nih.gov/>`_
* `NeuroData <http://neurodata.io>`_
* `Neuroelectro <http://neuroelectro.org/>`_
* `NIMH Data Archive <http://data-archive.nimh.nih.gov/>`_
* `OASIS <http://www.oasis-brains.org/>`_
* `OpenfMRI <https://openfmri.org/>`_
* `Study Forrest <http://studyforrest.org>`_
Physics Physics
------- -------
* `CERN Open Data Portal <http://opendata.cern.ch/>`_ * `CERN Open Data Portal <http://opendata.cern.ch/>`_
* `NASA <http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html>`_ * `Crystallography Open Database <http://www.crystallography.net/>`_
* `NASA Exoplanet Archive <http://exoplanetarchive.ipac.caltech.edu/>`_
* `NSSDC (NASA) data of 550 spacecraft <http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html>`_
* `Sloan Digital Sky Survey (SDSS) - Mapping the Universe <http://www.sdss.org/>`_
Psychology/Cognition
--------------------
* `OSU Cognitive Modeling Repository Datasets <http://www.cmr.osu.edu/browse/datasets>`_
Public Domains Public Domains
-------------- --------------
* `Amazon <http://aws.amazon.com/datasets>`_ * `Amazon <http://aws.amazon.com/datasets/>`_
* `Archive-it from Internet Archive <https://www.archive-it.org/explore?show=Collections>`_
* `Archive.org Datasets <https://archive.org/details/datasets>`_ * `Archive.org Datasets <https://archive.org/details/datasets>`_
* `CMU JASA data archive <http://lib.stat.cmu.edu/jasadata/>`_ * `CMU JASA data archive <http://lib.stat.cmu.edu/jasadata/>`_
* `CMU StatLab collections <http://lib.stat.cmu.edu/datasets/>`_ * `CMU StatLab collections <http://lib.stat.cmu.edu/datasets/>`_
* `Data.World <https://data.world>`_
* `Data360 <http://www.data360.org/index.aspx>`_ * `Data360 <http://www.data360.org/index.aspx>`_
* `Datamob.org <http://datamob.org/datasets>`_
* `Google <http://www.google.com/publicdata/directory>`_ * `Google <http://www.google.com/publicdata/directory>`_
* `Infochimps <http://www.infochimps.com/>`_ * `Infochimps <http://www.infochimps.com/>`_
* `KDNuggets Data Collections <http://www.kdnuggets.com/datasets/index.html>`_ * `KDNuggets Data Collections <http://www.kdnuggets.com/datasets/index.html>`_
* `Microsoft Azure Data Market Free DataSets <http://datamarket.azure.com/browse/data?price=free>`_
* `Microsoft Data Science for Research <http://aka.ms/Data-Science>`_
* `Numbrary <http://numbrary.com/>`_ * `Numbrary <http://numbrary.com/>`_
* `Reddit Datasets <http://www.reddit.com/r/datasets>`_ * `Open Library Data Dumps <https://openlibrary.org/developers/dumps>`_
* `RevolutionAnalytics Collection <http://www.revolutionanalytics.com/subscriptions/datasets/>`_ * `Reddit Datasets <https://www.reddit.com/r/datasets>`_
* `RevolutionAnalytics Collection <http://packages.revolutionanalytics.com/datasets/>`_
* `Sample R data sets <http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html>`_ * `Sample R data sets <http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html>`_
* `Stats4Stem R data sets <http://www.stats4stem.org/data-sets.html>`_ * `Stats4Stem R data sets <http://www.stats4stem.org/data-sets.html>`_
* `StatSci.org <http://www.statsci.org/datasets.html>`_ * `StatSci.org <http://www.statsci.org/datasets.html>`_
* `The Washington Post List <http://www.washingtonpost.com/wp-srv/metro/data/datapost.html>`_ * `The Washington Post List <http://www.washingtonpost.com/wp-srv/metro/data/datapost.html>`_
* `UCLA SOCR data collection <http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data>`_ * `UCLA SOCR data collection <http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data>`_
* `UFO Reports <http://www.nuforc.org/webreports.html>`_ * `UFO Reports <http://www.nuforc.org/webreports.html>`_
* `Wikileaks 911 pager intercepts <http://911.wikileaks.org/files/index.html>`_ * `Wikileaks 911 pager intercepts <https://911.wikileaks.org/files/index.html>`_
* `Yahoo Webscope <http://webscope.sandbox.yahoo.com/catalog.php>`_ * `Yahoo Webscope <http://webscope.sandbox.yahoo.com/catalog.php>`_
Search Engines Search Engines
-------------- --------------
* `Academic Torrents <http://academictorrents.com/>`_ * `Academic Torrents of data sharing from UMB <http://academictorrents.com/>`_
* `Archive-it <https://www.archive-it.org/explore?show=Collections>`_ * `Datahub.io <https://datahub.io/dataset>`_
* `Datahub.io <http://datahub.io/dataset>`_ * `DataMarket (Qlik) <https://datamarket.com/data/list/?q=all>`_
* `DataMarket.com <https://datamarket.com/data/list/?q=all>`_ * `Harvard Dataverse Network of scientific data <https://dataverse.harvard.edu/>`_
* `Freebase.com <http://www.freebase.com/>`_ * `ICPSR (UMICH) <http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp>`_
* `Harvard Dataverse <http://thedata.harvard.edu/dvn/>`_ * `Institute of Education Sciences <http://eric.ed.gov>`_
* `Statista.com <http://www.statista.com/>`_ * `National Technical Reports Library <http://www.ntis.gov/products/ntrl/>`_
* `Open Data Certificates (beta) <https://certificates.theodi.org/en/datasets>`_
* `OpenDataNetwork - A search engine of all Socrata powered data portals <http://www.opendatanetwork.com/>`_
* `Statista.com - Statistics and Studies <http://www.statista.com/>`_
* `Zenodo - An open dependable home for the long-tail of science <https://zenodo.org/collection/datasets>`_
Social Networks
---------------
* `72 hours #gamergate Twitter Scrape <http://waxy.org/random/misc/gamergate_tweets.csv>`_
* `Ancestry.com Forum Dataset over 10 years <http://www.cs.cmu.edu/~jelsas/data/ancestry.com/>`_
* `Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape <https://archive.org/details/twitter_cikm_2010>`_
* `CMU Enron Email of 150 users <http://www.cs.cmu.edu/~enron/>`_
* `EDRM Enron EMail of 151 users, hosted on S3 <https://aws.amazon.com/datasets/enron-email-data/>`_
* `Facebook Data Scrape (2005) <https://archive.org/details/oxford-2005-facebook-matrix>`_
* `Facebook Social Networks from LAW (since 2007) <http://law.di.unimi.it/datasets.php>`_
* `Foursquare from UMN/Sarwat (2013) <https://archive.org/details/201309_foursquare_dataset_umn>`_
* `GitHub Collaboration Archive <https://www.githubarchive.org/>`_
* `Google Scholar citation relations <http://www3.cs.stonybrook.edu/~leman/data/gscholar.db>`_
* `High-Resolution Contact Networks from Wearable Sensors <http://www.sociopatterns.org/datasets/>`_
* `Indie Map: social graph and crawl of top IndieWeb sites <http://www.indiemap.org/>`_
* `Mobile Social Networks from UMASS <https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks>`_
* `Network Twitter Data <http://snap.stanford.edu/data/higgs-twitter.html>`_
* `Reddit Comments <https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/>`_
* `Skytrax' Air Travel Reviews Dataset <https://github.com/quankiquanki/skytrax-reviews-dataset>`_
* `Social Twitter Data <http://snap.stanford.edu/data/egonets-Twitter.html>`_
* `SourceForge.net Research Data <http://www3.nd.edu/~oss/Data/data.html>`_
* `Twitter Data for Online Reputation Management <http://nlp.uned.es/replab2013/>`_
* `Twitter Data for Sentiment Analysis <http://help.sentiment140.com/for-students/>`_
* `Twitter Graph of entire Twitter site <http://an.kaist.ac.kr/traces/WWW2010.html>`_
* `Twitter Scrape Calufa May 2011 <http://archive.org/details/2011-05-calufa-twitter-sql>`_
* `UNIMI/LAW Social Network Datasets <http://law.di.unimi.it/datasets.php>`_
* `Yahoo! Graph and Social Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=g>`_
* `Youtube Video Social Graph in 2007,2008 <http://netsg.cs.sfu.ca/youtubedata/>`_
Social Sciences Social Sciences
--------------- ---------------
* `CMU Enron Email <http://www.cs.cmu.edu/~enron/>`_ * `ACLED (Armed Conflict Location & Event Data Project) <http://www.acleddata.com/>`_
* `Facebook Social Networks (since 2007) <http://law.di.unimi.it/datasets.php>`_ * `Canadian Legal Information Institute <https://www.canlii.org/en/index.php>`_
* `Facebook100 (2005) <https://archive.org/details/oxford-2005-facebook-matrix>`_ * `Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc <http://www.systemicpeace.org/>`_
* `Foursquare (2010,2011) <http://www.public.asu.edu/~hgao16/dataset.html>`_ * `Correlates of War Project <http://www.correlatesofwar.org/>`_
* `Foursquare (UMN/Sarwat, 2013) <https://archive.org/details/201309_foursquare_dataset_umn>`_ * `Cryptome Conspiracy Theory Items <http://cryptome.org>`_
* `General Social Survey (GSS) <http://www3.norc.org/GSS+Website/>`_ * `Datacards <http://datacards.org>`_
* `GetGlue (users rating TV shows) <http://bit.ly/1aL8XS0>`_ * `European Social Survey <http://www.europeansocialsurvey.org/data/>`_
* `GitHub Archive <http://www.githubarchive.org/>`_ * `FBI Hate Crime 2013 - aggregated data <https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013>`_
* `ICPSR <http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp>`_ * `Fragile States Index <http://fsi.fundforpeace.org/data>`_
* `Mobile Social Networks (UMASS) <https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks>`_ * `GDELT Global Events Database <http://gdeltproject.org/data.html>`_
* `PewResearch Internet Project <http://www.pewinternet.org/datasets/pages/2/>`_ * `General Social Survey (GSS) since 1972 <http://gss.norc.org>`_
* `Social Networking <http://www.cs.cmu.edu/~jelsas/data/ancestry.com/>`_ * `German Social Survey <http://www.gesis.org/en/home/>`_
* `SourceForge Graph <http://www.nd.edu/~oss/Data/data.html>`_ * `Global Religious Futures Project <http://www.globalreligiousfutures.org/>`_
* `Stack Exchange Network (Data Explorer) <http://data.stackexchange.com/help>`_ * `Humanitarian Data Exchange <https://data.hdx.rwlabs.org/>`_
* `Titanic Survival Data Set <http://bit.do/dataset-titanic-csv-zip>`_ * `INFORM Index for Risk Management <http://www.inform-index.org/Results/Global>`_
* `Twitter Graph <http://an.kaist.ac.kr/traces/WWW2010.html>`_ * `Institute for Demographic Studies <http://www.ined.fr/en/>`_
* `UC Berkeley's D-Lab Archive <http://ucdata.berkeley.edu/>`_ * `International Networks Archive <http://www.princeton.edu/~ina/>`_
* `International Social Survey Program ISSP <http://www.issp.org>`_
* `International Studies Compendium Project <http://www.isacompendium.com/public/>`_
* `James McGuire Cross National Data <http://jmcguire.faculty.wesleyan.edu/welcome/cross-national-data/>`_
* `MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste <http://nsd.uib.no>`_
* `Minnesota Population Center <https://www.ipums.org/>`_
* `MIT Reality Mining Dataset <http://realitycommons.media.mit.edu/realitymining.html>`_
* `Notre Dame Global Adaptation Index (ND-GAIN) <http://index.gain.org/about/download>`_
* `Open Crime and Policing Data in England, Wales and Northern Ireland <https://data.police.uk/data/>`_
* `Paul Hensel General International Data Page <http://www.paulhensel.org/dataintl.html>`_
* `PewResearch Internet Survey Project <http://www.pewinternet.org/datasets/pages/2/>`_
* `PewResearch Society Data Collection <http://www.pewresearch.org/data/download-datasets/>`_
* `Political Polarity Data <http://www3.cs.stonybrook.edu/~leman/data/14-icwsm-political-polarity-data.zip>`_
* `StackExchange Data Explorer <http://data.stackexchange.com/help>`_
* `Terrorism Research and Analysis Consortium <http://www.trackingterrorism.org/>`_
* `Texas Inmates Executed Since 1984 <http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html>`_
* `Titanic Survival Data Set <https://github.com/caesar0301/awesome-public-datasets/tree/master/Datasets>`_ or `on Kaggle <https://www.kaggle.com/c/titanic/data>`_
* `UCB's Archive of Social Science Data (D-Lab) <http://ucdata.berkeley.edu/>`_
* `UCLA Social Sciences Data Archive <http://dataarchives.ss.ucla.edu/Home.DataPortals.htm>`_ * `UCLA Social Sciences Data Archive <http://dataarchives.ss.ucla.edu/Home.DataPortals.htm>`_
* `UNIMI Social Network Datasets <http://law.di.unimi.it/datasets.php>`_ * `UN Civil Society Database <http://esango.un.org/civilsociety/>`_
* `Universities Worldwide <http://univ.cc/>`_ * `Universities Worldwide <http://univ.cc/>`_
* `UPJOHN for Employment Research <http://www.upjohn.org/erdc/erdc.html>`_ * `UPJOHN for Labor Employment Research <http://www.upjohn.org/services/resources/employment-research-data-center>`_
* `Yahoo Graph and Social Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=g>`_ * `Uppsala Conflict Data Program <http://ucdp.uu.se/>`_
* `Youtube Graph (2007,2008) <http://netsg.cs.sfu.ca/youtubedata/>`_ * `World Bank Open Data <http://data.worldbank.org/>`_
* `WorldPop project - Worldwide human population distributions <http://www.worldpop.org.uk/data/get_data/>`_
Software
--------
* `FLOSSmole data about free, libre, and open source software development <http://flossdata.syr.edu/data/>`_
Sports Sports
------ ------
* `Betfair (betting exchange) Event Results <http://data.betfair.com/>`_ * `Basketball (NBA/NCAA/Euro) Player Database and Statistics <http://www.draftexpress.com/stats.php>`_
* `Cricsheet (cricket) <http://cricsheet.org/>`_ * `Betfair Historical Exchange Data <http://data.betfair.com/>`_
* `Ergast Formula 1 (API available) <http://ergast.com/mrd/db>`_ * `Cricsheet Matches (cricket) <http://cricsheet.org/>`_
* `Football/Soccer data and APIs <http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/>`_ * `Ergast Formula 1, from 1950 up to date (API) <http://ergast.com/mrd/db>`_
* `Football/Soccer resources (data and APIs) <http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/>`_
* `Lahman's Baseball Database <http://www.seanlahman.com/baseball-archive/statistics/>`_ * `Lahman's Baseball Database <http://www.seanlahman.com/baseball-archive/statistics/>`_
* `Retrosheet (baseball) <http://www.retrosheet.org/game.htm>`_ * `Pinhooker: Thoroughbred Bloodstock Sale Data <https://github.com/phillc73/pinhooker>`_
* `Retrosheet Baseball Statistics <http://www.retrosheet.org/game.htm>`_
* `Tennis database of rankings, results, and stats for ATP <https://github.com/JeffSackmann/tennis_atp>`_, `WTA <https://github.com/JeffSackmann/tennis_wta>`_, `Grand Slams <https://github.com/JeffSackmann/tennis_slam_pointbypoint>`_ and `Match Charting Project <https://github.com/JeffSackmann/tennis_MatchChartingProject>`_
Time Series Time Series
----------- -----------
* `Time Series data Library (TSDL) <https://datamarket.com/data/list/?q=provider:tsdl>`_: The Time Series Data Library was created by Rob Hyndman, Professor of Statistics at Monash University, Australia. * `Databanks International Cross National Time Series Data Archive <http://www.cntsdata.com>`_
* `Hard Drive Failure Rates <https://www.backblaze.com/hard-drive-test-data.html>`_
* `UC Riverside Time Series <http://www.cs.ucr.edu/~eamonn/time_series_data/>`_: This data resource was created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering. * `Heart Rate Time Series from MIT <http://ecg.mit.edu/time-series/>`_
* `Time Series Data Library (TSDL) from MU <https://datamarket.com/data/list/?q=provider:tsdl>`_
* `UC Riverside Time Series Dataset <http://www.cs.ucr.edu/~eamonn/time_series_data/>`_
Transportation Transportation
-------------- --------------
* `Airlines Data 1987-2008 <http://stat-computing.org/dataexpo/2009/the-data.html>`_: Flight OD data used by ASA Challenge, 2009. * `Airlines OD Data 1987-2008 <http://stat-computing.org/dataexpo/2009/the-data.html>`_
* `Bay Area Bike Share Data <http://www.bayareabikeshare.com/open-data>`_
* `Bike Share Data Systems <https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems>`_: A collection of bike sharing systems and trip histories over the world. * `Bike Share Systems (BSS) collection <https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems>`_
* `GeoLife GPS Trajectory from Microsoft Research <http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/>`_
* `Edge data for US domestic flights 1990 to 2009 <http://data.memect.com/?p=229>`_ * `German train system by Deutsche Bahn <http://data.deutschebahn.com/datasets/>`_
* `Hubway Million Rides in MA <http://hubwaydatachallenge.org/trip-history-data/>`_
* `Half a million Hubway rides <http://hubwaydatachallenge.org/trip-history-data/>`_: Bike trip histories (since 2011) in MA published by Hubway. * `Marine Traffic - ship tracks, port calls and more <http://www.marinetraffic.com/de/ais-api-services>`_
* `Montreal BIXI Bike Share <https://montreal.bixi.com/en/open-data>`_
* `Marine Traffic - ship tracks, port calls and more <https://www.marinetraffic.com/de/p/api-services>`_ * `NYC Taxi Trip Data 2009- <http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml>`_
* `NYC Taxi Trip Data 2013 (FOIA/FOILed) <https://archive.org/details/nycTaxiTripData2013>`_
* `NYC Taxi Trip Data 2013 <https://archive.org/details/nycTaxiTripData2013>`_: FOIA/FOILed Taxi Trip Data from the NYC Taxi and Limousine Commission 2013, released by a civic hacker, Chris Whong. * `NYC Uber trip data April 2014 to September 2014 <https://github.com/fivethirtyeight/uber-tlc-foil-response>`_
* `Open Traffic collection <https://github.com/graphhopper/open-traffic-collection>`_
* `OpenFlights <http://openflights.org/data.html>`_: Airport, airline and route data contributed by open communities. * `OpenFlights - airport, airline and route data <http://openflights.org/data.html>`_
* `Philadelphia Bike Share Stations (JSON) <https://www.rideindego.com/stations/json/>`_
* `RITA Airline On-Time Performance Data <http://www.transtats.bts.gov/Tables.asp?DB_ID=120>`_: On-time arrival details for domestic flights by major air carriers in US. * `Plane Crash Database, since 1920 <http://www.planecrashinfo.com/database.htm>`_
* `RITA Airline On-Time Performance data <http://www.transtats.bts.gov/Tables.asp?DB_ID=120>`_
* `RITA transport data collection (TranStat) <http://www.transtats.bts.gov/DataIndex.asp>`_: Various transportation databases published by BTS.
* `Toronto Bike Share Stations (XML file) <http://www.bikesharetoronto.com/data/stations/bikeStations.xml>`_
* `Transport for London (TFL) <http://www.tfl.gov.uk/info-for/open-data-users/our-feeds>`_: London transportation data, including bike sharing system, bus, train, and network statistics.
* `Transport for London (TFL) <https://tfl.gov.uk/info-for/open-data-users/our-open-data>`_
* `Travel Tracker Survey (TTS) for Chicago <http://www.cmap.illinois.gov/data/transportation/travel-tracker-survey>`_
* `Travel Tracker Survey, Chicago <http://www.cmap.illinois.gov/data/transportation/travel-tracker-survey>`_: Data collection took place between January 2007 and February 2008. A total of 10,552 households participated in either a 1-day or 2-day survey, providing a detailed travel inventory for each member of their household on the assigned travel day(s).
* `U.S. Bureau of Transportation Statistics (BTS) <http://www.rita.dot.gov/bts/>`_
* `U.S. Domestic Flights 1990 to 2009 <http://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a>`_
* `U.S. Bureau of Transportation Statistics (BTS) <http://www.rita.dot.gov/bts/>`_: As part of RITA, BTS covers nearly all transportation resources to create, manage, and share transportation statistics with the public.
* `U.S. Freight Analysis Framework since 2007 <http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm>`_
* `U.S. Freight Analysis Framework <http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm>`_: Freight movement data among states and major metropolitan areas since 2007.
Complementary Collections
-------------------------
* `Data Packaged Core Datasets <https://github.com/datasets/>`_
* `Database of Scientific Code Contributions <https://mozillascience.org/collaborate>`_
* A growing collection of public datasets: `CoolDatasets. <http://cooldatasets.com/>`_
* DataWrangling: `Some Datasets Available on the Web <http://www.datawrangling.com/some-datasets-available-on-the-web>`_
* Inside-r: `Finding Data on the Internet <http://www.inside-r.org/howto/finding-data-internet>`_
* OpenDataMonitor: `An overview of available open data resources in Europe <http://opendatamonitor.eu>`_
* Quora: `Where can I find large datasets open to the public? <http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public>`_
* RS.io: `100+ Interesting Data Sets for Statistics <http://rs.io/2014/05/29/list-of-data-sets.html>`_
* RS.io: `100+ Interesting Data Sets for Statistics <http://rs.io/100-interesting-data-sets-for-statistics/>`_
* StaTrek: `Leveraging open data to understand urban lives <http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/>`_
* StaTrek: `Leveraging open data to understand urban lives <http://xiaming.me/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/>`_