diff --git a/README.md b/README.md
deleted file mode 100644
index 36cfa83..0000000
--- a/README.md
+++ /dev/null
@@ -1,4 +0,0 @@
-awesome-public-datasets
-=======================
-
-A awesome list of (large-scale) public datasets on the Internet. (On-going collection)
diff --git a/README.rst b/README.rst
new file mode 100644
index 0000000..f0ca088
--- /dev/null
+++ b/README.rst
@@ -0,0 +1,284 @@
+Awesome Public Datasets
+=======================
+
+This list of public data sources is collected and tidied from blogs, answers,
+and user responses. Most of the data sets listed below are free; however, some
+are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.
+
+Climate
+-------
+
+* Australian Weather: http://www.bom.gov.au/climate/dwo/
+* Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
+* Global climate data since 1929: http://www.tutiempo.net/en/Climate
+* NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/
+* NOAA climate datasets: http://ncdc.noaa.gov/data-access/quick-links
+* WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html
+
+Economics
+---------
+
+* American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
+* EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html
+* Internet Product Code Database: http://www.upcdatabase.com/
+* World Bank: http://data.worldbank.org/indicator
+
+Finance
+-------
+
+* CBOE Futures Exchange: http://cfe.cboe.com/Data/
+* Google Finance: https://www.google.com/finance
+* Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
+* NASDAQ: https://data.nasdaq.com/
+* OANDA: http://www.oanda.com/
+* OSU Financial data: http://fisher.osu.edu/fin/osudata.htm
+* Quandl: http://www.quandl.com/
+* St Louis Federal: http://research.stlouisfed.org/fred2/
+* Yahoo Finance: http://finance.yahoo.com/
+
+Biology
+-------
+
+* CRCNS: http://crcns.org/data-sets
+* Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
+* Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
+* MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
+* NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/
+* Protein structure: http://www.infobiotic.net/PSPbenchmarks/
+* Public Gene Data: http://www.pubgene.org/
+* Stanford Microarray Data: http://smd.stanford.edu/
+* UniGene: http://www.ncbi.nlm.nih.gov/unigene
+
+
+Physics
+-------
+
+* NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
+
+
+Healthcare
+----------
+
+* EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
+* Gapminder: http://www.gapminder.org/data/
+* Medicare Data File: http://go.cms.gov/19xxPN4
+
+GeoSpace
+--------
+
+* EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse
+* Factual Global Location Data: http://www.factual.com/
+* Geo Spatial Data: http://geodacenter.asu.edu/datalist/
+
+
+Transportation
+--------------
+
+* Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
+* Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations
+* Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems
+* Edge data for US domestic flights 1990 to 2009: http://data.memect.com/?p=229
+* Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/
+* NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013
+* OpenFlights (airport, airline and route data): http://openflights.org/data.html
+* RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120
+* RITA transport data collection: http://www.transtats.bts.gov/DataIndex.asp
+* Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
+* U.S. Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm
+
+
+Government
+----------
+
+* Archive-it: https://www.archive-it.org/explore?show=Collections
+* Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
+* Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
+* Chicago: https://data.cityofchicago.org/
+* FDA: https://open.fda.gov/index.html
+* Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
+* Guardian world governments: http://www.guardian.co.uk/world-government-data
+* HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
+* London Datastore, U.K: http://data.london.gov.uk/dataset
+* New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx
+* NYC betanyc: http://betanyc.us/
+* NYC Open Data: http://nycplatform.socrata.com/
+* OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
+* RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
+* San Francisco Data sets: http://datasf.org/
+* The World Bank: http://wdronline.worldbank.org/
+* U.K. Government Data: http://data.gov.uk/data
+* U.S. Census Bureau: http://www.census.gov/data.html
+* U.S. Federal Government Agencies: http://www.data.gov/metric
+* U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
+* U.S. Open Government: http://www.data.gov/open-gov/
+* UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/
+* United Nations: http://data.un.org/
+* US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
+
+
+Data Challenges
+---------------
+
+* Challenges in Machine Learning: http://www.chalearn.org/
+* ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/
+* Kaggle Competition Data: http://www.kaggle.com/
+* KDD Cup by Tencent 2012: https://www.kddcup2012.org/
+* Netflix Prize: http://www.netflixprize.com/leaderboard
+* Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge
+
+
+Machine Learning
+----------------
+
+* eBay Online Auctions: http://www.modelingonlineauctions.com/datasets
+* IMDb database: http://www.imdb.com/interfaces
+* Keel Repository: http://sci2s.ugr.es/keel/datasets.php
+* Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action
+* Machine Learning Data Set Repository: http://mldata.org/
+* Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
+* More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
+* MovieLens Data Sets: http://datahub.io/dataset/movielens
+* RDataMining R and Data Mining ebook data: http://www.rdatamining.com/data
+* Registered meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized
+* SF restaurants dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/
+* UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
+* University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html
+* Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
+
+
+Natural Language
+----------------
+
+* 40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list
+* ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/
+* ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/
+* Flickr personal taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
+* Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670
+* Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13
+* Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
+* Hansards: http://www.isi.edu/natural-language/download/hansard/
+* Machine Translation: http://statmt.org/wmt11/translation-task.html#download
+* SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
+* USENET corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
+* WordNet: http://wordnet.princeton.edu/wordnet/download/
+
+
+Image Processing
+----------------
+
+* 2GB of photos of cats: http://137.189.35.203/WebUI/CatDatabase/catData.html
+* Face Recognition Benchmark: http://www.face-rec.org/databases/
+* ImageNet: http://www.image-net.org/
+
+
+Time Series
+-----------
+
+* Time Series Data Library: https://datamarket.com/data/list/?q=provider:tsdl
+* UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/
+
+
+Social Sciences
+---------------
+
+* China Hotel Checkin/out data: http://www.360doc.com/content/13/1105/13/7863900_326788919.shtml
+* CMU Enron Email: http://www.cs.cmu.edu/~enron/
+* Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php
+* Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix
+* Foursquare (2010,2011): http://www.public.asu.edu/~hgao16/dataset.html
+* Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn
+* General Social Survey (GSS): http://www3.norc.org/GSS+Website/
+* GetGlue (users rating TV shows): http://bit.ly/1aL8XS0
+* GitHub Archive: http://www.githubarchive.org/
+* ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
+* Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
+* PewResearch Internet Project: http://www.pewinternet.org/datasets/pages/2/
+* Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
+* SourceForge Graph: http://www.nd.edu/~oss/Data/data.html
+* Titanic Survival Data Set:
+* Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html
+* UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/
+* UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
+* UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php
+* Universities Worldwide: http://univ.cc/
+* UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html
+* Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g
+* YouTube Graph (2007,2008): http://netsg.cs.sfu.ca/youtubedata/
+
+
+Complex Networks
+----------------
+
+* CrossRef DOI URLs: https://archive.org/details/doi-urls
+* DBLP Citation dataset: https://kdl.cs.umass.edu/display/public/DBLP
+* NBER Patent Citations: http://nber.org/patents/
+* NIST complex networks data collection: http://math.nist.gov/~RPozo/complex_datasets.html
+* Protein-protein interaction network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
+* PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
+* Scopus Citation Database: http://www.elsevier.com/online-tools/scopus
+* Stanford GraphBase (Steven Skiena): http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml
+* Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
+* The Koblenz Network Collection: http://konect.uni-koblenz.de/
+* UCI Network Data Repository: http://networkdata.ics.uci.edu/resources.php
+* UFL sparse matrix collection: http://www.cise.ufl.edu/research/sparse/matrices/
+* UNIMI Large Web Graph: http://law.di.unimi.it/datasets.php
+* WSU Graph Database: http://www.eecs.wsu.edu/mgd/gdb.html
+
+
+Computer Networks
+-----------------
+
+* 3.5B Web Pages: http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
+* 53.5B Web clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
+* CAIDA Internet Datasets: http://www.caida.org/data/overview/
+* ClueWeb09: http://lemurproject.org/clueweb09/
+* ClueWeb12: http://lemurproject.org/clueweb12/
+* CommonCrawl Web Data: http://commoncrawl.org/the-data/get-started/
+* Dartmouth CRAWDAD Wireless datasets: http://crawdad.cs.dartmouth.edu/
+* OpenMobileData (MobiPerf): https://console.developers.google.com/storage/openmobiledata_public/
+* UCSD Network Telescope: http://www.caida.org/projects/network_telescope/
+
+
+Data SEs
+--------
+
+* Academic Torrents: http://academictorrents.com/
+* Datahub.io: http://datahub.io/dataset
+* DataMarket: https://datamarket.com/data/list/?q=all
+* Harvard Dataverse: http://thedata.harvard.edu/dvn/
+* Statista: http://www.statista.com/
+* Freebase: http://www.freebase.com/
+
+
+Public Domains
+--------------
+
+* Amazon: http://aws.amazon.com/datasets
+* Archive.org Datasets: https://archive.org/details/datasets
+* CMU JASA data archive: http://lib.stat.cmu.edu/jasadata/
+* CMU StatLab collections: http://lib.stat.cmu.edu/datasets/
+* Data360: http://www.data360.org/index.aspx
+* Datamob.org: http://datamob.org/datasets
+* Google: http://www.google.com/publicdata/directory
+* infochimps: http://www.infochimps.com/
+* KDNuggets Data Collections: http://www.kdnuggets.com/datasets/index.html
+* Numbrary: http://numbrary.com/
+* RevolutionAnalytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
+* Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
+* Stats4Stem R data sets: http://www.stats4stem.org/data-sets.html
+* StatSci.org: http://www.statsci.org/datasets.html
+* The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
+* UCLA SOCR data collection: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
+* UFO Reports: http://www.nuforc.org/webreports.html
+* Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
+* Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php
+
+
+Complementary Collections
+-------------------------
+
+* DataWrangling: http://www.datawrangling.com/some-datasets-available-on-the-web
+* Inside-r: http://www.inside-r.org/howto/finding-data-internet
+* Quora: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
+* RS Collection 100+: http://rs.io/2014/05/29/list-of-data-sets.html
+* StaTrek: http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/