Awesome Public Datasets
=======================
`This list of public data sources <https://github.com/caesar0301/awesome-public-datasets>`_
is collected and tidied from blogs, answers, and user responses.
Most of the data sets listed below are free; however, some are not.
Other amazingly awesome lists can be found in the
`awesome-awesomeness <https://github.com/bayandin/awesome-awesomeness>`_ and
`another awesome <https://github.com/sindresorhus/awesome>`_ list.

Agriculture
-----------

* U.S. Department of Agriculture's PLANTS Database: http://www.plants.usda.gov/dl_all.html

Biology
-------

* 1000 Genomes: http://www.1000genomes.org/data
* CRCNS: http://crcns.org/data-sets
* Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
* Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
* MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
* NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/
* Protein Data Bank: http://pdb.org/
* Protein structure: http://www.infobiotic.net/PSPbenchmarks/
* PubChem Project: https://pubchem.ncbi.nlm.nih.gov/
* Public Gene Data: http://www.pubgene.org/
* Stanford Microarray Data: http://smd.stanford.edu/
* The Personal Genome Project: http://www.personalgenomes.org/ or https://my.pgp-hms.org/public_genetic_data
* UCSC Public Data: http://hgdownload.soe.ucsc.edu/downloads.html
* UniGene: http://www.ncbi.nlm.nih.gov/unigene

Climate/Weather
---------------

* Australian Weather: http://www.bom.gov.au/climate/dwo/
* Canadian Meteorological Centre: https://weather.gc.ca/grib/index_e.html
* Climate Data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
* Global Climate Data Since 1929: http://www.tutiempo.net/en/Climate
* NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/
* NOAA Climate Datasets: http://ncdc.noaa.gov/data-access/quick-links
* NOAA Realtime Weather Models: http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction
* WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html

Complex Networks
----------------

* CrossRef DOI URLs: https://archive.org/details/doi-urls
* DBLP Citation dataset: https://kdl.cs.umass.edu/display/public/DBLP
* NBER Patent Citations: http://nber.org/patents/
* NIST complex networks data collection: http://math.nist.gov/~RPozo/complex_datasets.html
* Protein-protein interaction network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
* PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
* Scopus Citation Database: http://www.elsevier.com/online-tools/scopus
* Stanford GraphBase (Steven Skiena): http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml
* Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
* The Koblenz Network Collection: http://konect.uni-koblenz.de/
* The Laboratory for Web Algorithmics (UNIMI): http://law.di.unimi.it/datasets.php
* UCI Network Data Repository: http://networkdata.ics.uci.edu/resources.php
* UFL sparse matrix collection: http://www.cise.ufl.edu/research/sparse/matrices/
* WSU Graph Database: http://www.eecs.wsu.edu/mgd/gdb.html

Computer Networks
-----------------

* 3.5B Web Pages: http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
* 53.5B Web clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
* CAIDA Internet Datasets: http://www.caida.org/data/overview/
* ClueWeb09: http://lemurproject.org/clueweb09/
* ClueWeb12: http://lemurproject.org/clueweb12/
* CommonCrawl Web Data: http://commoncrawl.org/the-data/get-started/
* Dartmouth CRAWDAD Wireless datasets: http://crawdad.cs.dartmouth.edu/
* OpenMobileData (MobiPerf): https://console.developers.google.com/storage/openmobiledata_public/
* UCSD Network Telescope: http://www.caida.org/projects/network_telescope/

Data Challenges
---------------

* Challenges in Machine Learning: http://www.chalearn.org/
* DrivenData Competitions for Social Good: http://www.drivendata.org/
* ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/
* Kaggle Competition Data: http://www.kaggle.com/
* KDD Cup by Tencent 2012: https://www.kddcup2012.org/
* Localytics Data Visualization Challenge: https://github.com/localytics/data-viz-challenge
* Netflix Prize: http://www.netflixprize.com/leaderboard
* Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge

Economics
---------

* American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
* EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html
* Internet Product Code Database: http://www.upcdatabase.com/
* World Bank: http://data.worldbank.org/indicator

Energy
------

* AMPds: http://ampds.org/
* BLUED: http://nilm.cmubi.org/
* COMBED: http://combed.github.io/
* Dataport: https://dataport.pecanstreet.org/
* ECO: http://www.vs.inf.ethz.ch/res/show.html?what=eco-data
* EIA: http://www.eia.gov/electricity/data/eia923/
* HFED: http://hfed.github.io/
* iAWE: http://iawe.github.io/
* Plaid: http://plaidplug.com/
* REDD: http://redd.csail.mit.edu/
* UK-Dale: http://www.doc.ic.ac.uk/~dk3810/data/

Finance
-------

* CBOE Futures Exchange: http://cfe.cboe.com/Data/
* Google Finance: https://www.google.com/finance
* Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
* NASDAQ: https://data.nasdaq.com/
* OANDA: http://www.oanda.com/
* OSU Financial data: http://fisher.osu.edu/fin/osudata.htm or http://fisher.osu.edu/fin/fdf/osudata.htm
* Quandl: http://www.quandl.com/
* St. Louis Federal Reserve (FRED): http://research.stlouisfed.org/fred2/
* Yahoo Finance: http://finance.yahoo.com/

GeoSpace/GIS
------------

* BODC (marine data of nearly 22,000 oceanographic variables): http://www.bodc.ac.uk/data/where_to_find_data/
* EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse
* Factual Global Location Data: http://www.factual.com/
* GADM (Global Administrative Areas database): http://www.gadm.org/
* Geo Spatial Data: http://geodacenter.asu.edu/datalist/
* GeoNames (over eight million placenames): http://www.geonames.org/
* Natural Earth (vectors and rasters of the world): http://www.naturalearthdata.com/
* OpenStreetMap (a free map worldwide): http://wiki.openstreetmap.org/wiki/Downloading_data
* TIGER/Line (official United States boundaries and roads): http://www.census.gov/geo/maps-data/data/tiger-line.html
* twofishes (Foursquare's coarse geocoder): https://github.com/foursquare/twofishes
* tz_world (timezone polygons): http://efele.net/maps/tz/world/

Government
----------

* Archive-it: https://www.archive-it.org/explore?show=Collections
* Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
* Australia: https://data.gov.au/
* Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
* Chicago: https://data.cityofchicago.org/
* EU: http://ec.europa.eu/eurostat/data/database
* FDA: https://open.fda.gov/index.html
* Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
* Germany: https://www-genesis.destatis.de/genesis/online
* Glasgow, Scotland, UK: http://data.glasgow.gov.uk/
* Guardian world governments: http://www.guardian.co.uk/world-government-data
* HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
* London Datastore, U.K.: http://data.london.gov.uk/dataset
* Netherlands: https://data.overheid.nl/
* New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx
* NYC betanyc: http://betanyc.us/
* NYC Open Data: http://nycplatform.socrata.com/
* OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
* Open Government Data (OGD) Platform India: http://www.data.gov.in/
* RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
* San Francisco Data sets: http://datasf.org/
* South Africa: http://beta2.statssa.gov.za/
* The World Bank: http://wdronline.worldbank.org/
* U.K. Government Data: http://data.gov.uk/data
* U.S. American Community Survey: http://www.census.gov/acs/www/data_documentation/data_release_info/
* U.S. Census Bureau: http://www.census.gov/data.html
* U.S. Federal Government Agencies: http://www.data.gov/metric
* U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
* U.S. Open Government: http://www.data.gov/open-gov/
* UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/
* United Nations: http://data.un.org/
* US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm

Healthcare
----------

* EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
* Gapminder: http://www.gapminder.org/data/
* Medicare Data File: http://go.cms.gov/19xxPN4

Image Processing
----------------

* 2GB of photos of cats: http://137.189.35.203/WebUI/CatDatabase/catData.html
* Face Recognition Benchmark: http://www.face-rec.org/databases/
* ImageNet: http://www.image-net.org/

Machine Learning
----------------

* eBay Online Auctions: http://www.modelingonlineauctions.com/datasets
* IMDb database: http://www.imdb.com/interfaces
* Keel Repository: http://sci2s.ugr.es/keel/datasets.php
* Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action
* Machine Learning Data Set Repository: http://mldata.org/
* Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
* More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
* MovieLens Data Sets: http://datahub.io/dataset/movielens
* RDataMining R and Data Mining ebook data: http://www.rdatamining.com/data
* Registered meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized
* SF restaurants dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/
* UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
* University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html
* Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Museums
-------

* Cooper-Hewitt's Collection Database: https://github.com/cooperhewitt/collection
* Minneapolis Institute of Arts metadata: https://github.com/artsmia/collection
* Tate Collection metadata: https://github.com/tategallery/collection
* The Getty vocabularies: http://vocab.getty.edu

Music
-----

* Discogs Data: http://www.discogs.com/data/

Natural Language
----------------

* 40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list
* ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/
* ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/
* DBpedia: http://wiki.dbpedia.org/Datasets
* Flickr personal taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
* Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670
* Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13
* Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
* Hansards: http://www.isi.edu/natural-language/download/hansard/
* Machine Translation: http://statmt.org/wmt11/translation-task.html#download
* SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
* USENET corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
* Wikidata: https://www.wikidata.org/wiki/Wikidata:Database_download
* WordNet: http://wordnet.princeton.edu/wordnet/download/

Physics
-------

* CERN Open Data Portal: http://opendata.cern.ch/
* NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html

Public Domains
--------------

* Amazon: http://aws.amazon.com/datasets
* Archive.org Datasets: https://archive.org/details/datasets
* CMU JASA data archive: http://lib.stat.cmu.edu/jasadata/
* CMU StatLab collections: http://lib.stat.cmu.edu/datasets/
* Data360: http://www.data360.org/index.aspx
* Datamob.org: http://datamob.org/datasets
* Google: http://www.google.com/publicdata/directory
* infochimps: http://www.infochimps.com/
* KDNuggets Data Collections: http://www.kdnuggets.com/datasets/index.html
* Numbrary: http://numbrary.com/
* RevolutionAnalytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
* Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
* Stats4Stem R data sets: http://www.stats4stem.org/data-sets.html
* StatSci.org: http://www.statsci.org/datasets.html
* The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
* UCLA SOCR data collection: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
* UFO Reports: http://www.nuforc.org/webreports.html
* Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
* Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php

Search Engines
--------------

* Academic Torrents: http://academictorrents.com/
* Datahub.io: http://datahub.io/dataset
* DataMarket: https://datamarket.com/data/list/?q=all
* Freebase: http://www.freebase.com/
* Harvard Dataverse: http://thedata.harvard.edu/dvn/
* Statista: http://www.statista.com/

Social Sciences
---------------

* CMU Enron Email: http://www.cs.cmu.edu/~enron/
* Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php
* Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix
* Foursquare (2010,2011): http://www.public.asu.edu/~hgao16/dataset.html
* Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn
* General Social Survey (GSS): http://www3.norc.org/GSS+Website/
* GetGlue (users rating TV shows): http://bit.ly/1aL8XS0
* GitHub Archive: http://www.githubarchive.org/
* ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
* Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
* PewResearch Internet Project: http://www.pewinternet.org/datasets/pages/2/
* Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
* SourceForge Graph: http://www.nd.edu/~oss/Data/data.html
* Stack Exchange Network (Data Explorer): http://data.stackexchange.com/help
* Titanic Survival Data Set: http://bit.do/dataset-titanic-csv-zip
* Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html
* UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/
* UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
* UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php
* Universities Worldwide: http://univ.cc/
* UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html
* Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g
* YouTube Graph (2007, 2008): http://netsg.cs.sfu.ca/youtubedata/

Sports
------

* Betfair (betting exchange) Event Results: http://data.betfair.com/
* Cricsheet (cricket): http://cricsheet.org/
* Ergast Formula 1 (API available): http://ergast.com/mrd/db
* Football/Soccer data and APIs: http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/
* Lahman's Baseball Database: http://www.seanlahman.com/baseball-archive/statistics/
* Retrosheet (baseball): http://www.retrosheet.org/game.htm

Time Series
-----------

* Time Series Data Library: https://datamarket.com/data/list/?q=provider:tsdl
* UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/

Transportation
--------------

* Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
* Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems
* Edge data for US domestic flights 1990 to 2009: http://data.memect.com/?p=229
* Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/
* Marine Traffic - ship tracks, port calls and more: https://www.marinetraffic.com/de/p/api-services
* NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013
* OpenFlights (airport, airline and route data): http://openflights.org/data.html
* RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120
* RITA transport data collection: http://www.transtats.bts.gov/DataIndex.asp
* Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
* U.S. Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm

Complementary Collections
-------------------------

* DataWrangling: http://www.datawrangling.com/some-datasets-available-on-the-web
* Inside-r: http://www.inside-r.org/howto/finding-data-internet
* Quora: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
* Reddit: http://www.reddit.com/r/datasets
* RS Collection 100+: http://rs.io/2014/05/29/list-of-data-sets.html
* StaTrek: http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/