mirror of
https://github.com/awesomedata/awesome-public-datasets.git
synced 2024-04-18 07:30:58 +08:00
Added Network data
commit
4a805a8934
253
README.rst
|
@ -38,7 +38,7 @@ Climate/Weather
|
||||||
|
|
||||||
* `Australian Weather <http://www.bom.gov.au/climate/dwo/>`_
|
* `Australian Weather <http://www.bom.gov.au/climate/dwo/>`_
|
||||||
* `Canadian Meteorological Centre <https://weather.gc.ca/grib/index_e.html>`_
|
* `Canadian Meteorological Centre <https://weather.gc.ca/grib/index_e.html>`_
|
||||||
* `Climate Data from UEA (updated at roughly monthly intervals) <http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/>`_
|
* `Climate Data from UEA (updated monthly) <http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/>`_
|
||||||
* `Global Climate Data Since 1929 <http://www.tutiempo.net/en/Climate>`_
|
* `Global Climate Data Since 1929 <http://www.tutiempo.net/en/Climate>`_
|
||||||
* `NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>`_
|
* `NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>`_
|
||||||
* `NOAA Climate Datasets <http://ncdc.noaa.gov/data-access/quick-links>`_
|
* `NOAA Climate Datasets <http://ncdc.noaa.gov/data-access/quick-links>`_
|
||||||
|
@ -52,6 +52,8 @@ Complex Networks
|
||||||
* `CrossRef DOI URLs <https://archive.org/details/doi-urls>`_
|
* `CrossRef DOI URLs <https://archive.org/details/doi-urls>`_
|
||||||
* `DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>`_
|
* `DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>`_
|
||||||
* `NBER Patent Citations <http://nber.org/patents/>`_
|
* `NBER Patent Citations <http://nber.org/patents/>`_
|
||||||
|
* `Network Data <http://www-personal.umich.edu/~mejn/netdata/>`_
|
||||||
|
* `UCI Network Data Repository <https://networkdata.ics.uci.edu/resources.php>`_
|
||||||
* `NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>`_
|
* `NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>`_
|
||||||
* `Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>`_
|
* `Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>`_
|
||||||
* `PyPI and Maven Dependency Network <http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>`_
|
* `PyPI and Maven Dependency Network <http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>`_
|
||||||
|
@ -68,21 +70,23 @@ Complex Networks
|
||||||
Computer Networks
|
Computer Networks
|
||||||
-----------------
|
-----------------
|
||||||
|
|
||||||
* `3.5B Web Pages - Web graph extracted from CommonCraw 2012 web corpus. <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>`_
|
* `3.5B Web Pages from CommonCraw 2012 <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>`_
|
||||||
* `53.5B Web clicks - Anonymized HTTP records from 100K users in Indiana Univ. <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset>`_
|
* `53.5B Web clicks of 100K users in Indiana Univ. <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset>`_
|
||||||
* `CAIDA Internet Datasets - Network traces and topologies at geographically diverse locations. <http://www.caida.org/data/overview/>`_
|
* `CAIDA Internet Datasets <http://www.caida.org/data/overview/>`_
|
||||||
* `ClueWeb09 - About 1B web pages in ten languages that were collected in Jan. and Feb. 2009. <http://lemurproject.org/clueweb09/>`_
|
* `ClueWeb09 - 1B web pages <http://lemurproject.org/clueweb09/>`_
|
||||||
* `ClueWeb12 - About 733M web pages collected between Feb. and May 2012. <http://lemurproject.org/clueweb12/>`_
|
* `ClueWeb12 - 733M web pages <http://lemurproject.org/clueweb12/>`_
|
||||||
* `CommonCrawl Web Data - Petabytes of data collected over 7 years of web crawling. <http://commoncrawl.org/the-data/get-started/>`_
|
* `CommonCrawl Web Data over 7 years <http://commoncrawl.org/the-data/get-started/>`_
|
||||||
* `CRAWDAD Wireless datasets (Dartmouth) - A wireless network data resource for research communities. <http://crawdad.cs.dartmouth.edu/>`_
|
* `CRAWDAD Wireless datasets from Dartmouth Univ. <http://crawdad.cs.dartmouth.edu/>`_
|
||||||
* `OpenMobileData (MobiPerf) - Mobile performance measurement data collected with active tests. <https://console.developers.google.com/storage/openmobiledata_public/>`_
|
* `Criteo click-through data <http://labs.criteo.com/2015/03/criteo-releses-its-new-dataset/>`_
|
||||||
* `UCSD Network Telescope - A passive traffic monitoring system covering IPv4 /8 net. <http://www.caida.org/projects/network_telescope/>`_
|
* `Open Mobile Data by MobiPerf <https://console.developers.google.com/storage/openmobiledata_public/>`_
|
||||||
|
* `UCSD Network Telescope, IPv4 /8 net <http://www.caida.org/projects/network_telescope/>`_
|
||||||
|
|
||||||
|
|
||||||
Data Challenges
|
Data Challenges
|
||||||
---------------
|
---------------
|
||||||
|
|
||||||
* `Challenges in Machine Learning <http://www.chalearn.org/>`_
|
* `Challenges in Machine Learning <http://www.chalearn.org/>`_
|
||||||
|
* `D4D Challenge of Orange <http://www.d4d.orange.com/en/home>`_
|
||||||
* `DrivenData Competitions for Social Good <http://www.drivendata.org/>`_
|
* `DrivenData Competitions for Social Good <http://www.drivendata.org/>`_
|
||||||
* `ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>`_
|
* `ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>`_
|
||||||
* `Kaggle Competition Data <http://www.kaggle.com/>`_
|
* `Kaggle Competition Data <http://www.kaggle.com/>`_
|
||||||
|
@ -95,7 +99,7 @@ Data Challenges
|
||||||
Economics
|
Economics
|
||||||
---------
|
---------
|
||||||
|
|
||||||
* `American Economic Ass. (AEA) <http://www.aeaweb.org/RFE/toc.php?show=complete>`_
|
* `American Economic Ass (AEA) <http://www.aeaweb.org/RFE/toc.php?show=complete>`_
|
||||||
* `EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>`_
|
* `EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>`_
|
||||||
* `Internet Product Code Database <http://www.upcdatabase.com/>`_
|
* `Internet Product Code Database <http://www.upcdatabase.com/>`_
|
||||||
|
|
||||||
|
@ -133,32 +137,41 @@ Finance
|
||||||
GeoSpace/GIS
|
GeoSpace/GIS
|
||||||
------------
|
------------
|
||||||
|
|
||||||
* `BODC - Marine data of nearly 22,000 oceanographic vars. <http://www.bodc.ac.uk/data/where_to_find_data/>`_
|
* `BODC - marine data of ~22K vars <http://www.bodc.ac.uk/data/where_to_find_data/>`_
|
||||||
* `EOSDIS - A data collection of NASA's earth observing system data and information system. <http://sedac.ciesin.columbia.edu/data/sets/browse>`_
|
* `Cambridge, MA, US, GIS data on GitHub <http://cambridgegis.github.io/gisdata.html>`_
|
||||||
* `Factual Global Location Data - 65M POIs with extended attributes in 50 countries. <http://www.factual.com/>`_
|
* `EOSDIS - NASA's earth observing system data <http://sedac.ciesin.columbia.edu/data/sets/browse>`_
|
||||||
* `Global Administrative Areas Database (GADM) - For countries and low-level subdivisions. <http://www.gadm.org/>`_
|
* `Factual Global Location Data <http://www.factual.com/>`_
|
||||||
* `Geo Spatial Data from ASU - Several small spatial or GIS datasets. <http://geodacenter.asu.edu/datalist/>`_
|
* `Geo Spatial Data from ASU <http://geodacenter.asu.edu/datalist/>`_
|
||||||
* `GeoNames - Over eight million placenames (countries, city stat etc.) of the world. <http://www.geonames.org/>`_
|
* `GeoNames Worldwide <http://www.geonames.org/>`_
|
||||||
* `Natural Earth - Vectors and rasters of the world in multiple scales. <http://www.naturalearthdata.com/>`_
|
* `Global Administrative Areas Database (GADM) <http://www.gadm.org/>`_
|
||||||
* `OpenStreetMap - A free map worldwide maintained by the communities. <http://wiki.openstreetmap.org/wiki/Downloading_data>`_
|
* `Landsat 8 on AWS <https://aws.amazon.com/public-data-sets/landsat/>`_
|
||||||
* `TIGER/Line - Official United States boundaries and roads. <http://www.census.gov/geo/maps-data/data/tiger-line.html>`_
|
* `Natural Earth - vectors and rasters of the world <http://www.naturalearthdata.com/>`_
|
||||||
* `TwoFishes - Foursquare's coarse geocoder. <https://github.com/foursquare/twofishes>`_
|
* `Open Street Map (OSM) <http://wiki.openstreetmap.org/wiki/Downloading_data>`_
|
||||||
* `TZ Timezones - A shapefile of the TZ timezones of the world. <http://efele.net/maps/tz/world/>`_
|
* `TIGER/Line - U.S. boundaries and roads <http://www.census.gov/geo/maps-data/data/tiger-line.html>`_
|
||||||
|
* `TwoFishes - Foursquare's coarse geocoder <https://github.com/foursquare/twofishes>`_
|
||||||
|
* `TZ Timezones shapfiles <http://efele.net/maps/tz/world/>`_
|
||||||
|
|
||||||
|
|
||||||
Government
|
Government
|
||||||
----------
|
----------
|
||||||
|
|
||||||
* `Australia <http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument>`_ (abs.gov.au)
|
* `Australia (abs.gov.au) <http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument>`_
|
||||||
* `Australia <https://data.gov.au/>`_ (data.gov.au)
|
* `Australia (data.gov.au) <https://data.gov.au/>`_
|
||||||
|
* `Brazil <http://dados.gov.br/dataset>`_
|
||||||
|
* `Cambridge, MA, US <https://data.cambridgema.gov/>`_
|
||||||
* `Canada <http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1>`_
|
* `Canada <http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1>`_
|
||||||
* `Chicago <https://data.cityofchicago.org/>`_
|
* `Chicago <https://data.cityofchicago.org/>`_
|
||||||
|
* `Dallas Open Data <https://www.dallasopendata.com/>`_
|
||||||
|
* `Denver Open Data <http://data.denvergov.org//>`_
|
||||||
* `EuroStat <http://ec.europa.eu/eurostat/data/database>`_
|
* `EuroStat <http://ec.europa.eu/eurostat/data/database>`_
|
||||||
* `FedStats <http://www.fedstats.gov/cgi-bin/A2Z.cgi>`_
|
* `FedStats <http://www.fedstats.gov/cgi-bin/A2Z.cgi>`_
|
||||||
|
* `France <https://www.data.gouv.fr/en/datasets/>`_
|
||||||
* `Germany <https://www-genesis.destatis.de/genesis/online>`_
|
* `Germany <https://www-genesis.destatis.de/genesis/online>`_
|
||||||
* `Glasgow, Scotland, UK <http://data.glasgow.gov.uk/>`_
|
* `Glasgow, Scotland, UK <http://data.glasgow.gov.uk/>`_
|
||||||
* `Guardian world governments <http://www.guardian.co.uk/world-government-data>`_
|
* `Guardian world governments <http://www.guardian.co.uk/world-government-data>`_
|
||||||
* `London Datastore, U.K <http://data.london.gov.uk/dataset>`_
|
* `Indian Government Data <http://www.data.gov.in>`_
|
||||||
|
* `London Datastore, UK <http://data.london.gov.uk/dataset>`_
|
||||||
|
* `MassGIS, Massachusetts, U.S. <http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/>`_
|
||||||
* `Netherlands <https://data.overheid.nl/>`_
|
* `Netherlands <https://data.overheid.nl/>`_
|
||||||
* `New Zealand <http://www.stats.govt.nz/browse_for_stats.aspx>`_
|
* `New Zealand <http://www.stats.govt.nz/browse_for_stats.aspx>`_
|
||||||
* `NYC betanyc <http://betanyc.us/>`_
|
* `NYC betanyc <http://betanyc.us/>`_
|
||||||
|
@ -166,6 +179,7 @@ Government
|
||||||
* `OECD <http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html>`_
|
* `OECD <http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html>`_
|
||||||
* `Open Government Data (OGD) Platform India <http://www.data.gov.in/>`_
|
* `Open Government Data (OGD) Platform India <http://www.data.gov.in/>`_
|
||||||
* `San Francisco Data sets <http://datasf.org/>`_
|
* `San Francisco Data sets <http://datasf.org/>`_
|
||||||
|
* `Seattle <https://data.seattle.gov/>`_
|
||||||
* `South Africa <http://beta2.statssa.gov.za/>`_
|
* `South Africa <http://beta2.statssa.gov.za/>`_
|
||||||
* `The World Bank <http://wdronline.worldbank.org/>`_
|
* `The World Bank <http://wdronline.worldbank.org/>`_
|
||||||
* `U.K. Government Data <http://data.gov.uk/data>`_
|
* `U.K. Government Data <http://data.gov.uk/data>`_
|
||||||
|
@ -184,39 +198,45 @@ Government
|
||||||
Healthcare
|
Healthcare
|
||||||
----------
|
----------
|
||||||
|
|
||||||
* `EHDP Large Health Data Sets - A collection of health datasets across domains and countries. <http://www.ehdp.com/vitalnet/datasets.htm>`_
|
* `EHDP Large Health Data Sets <http://www.ehdp.com/vitalnet/datasets.htm>`_
|
||||||
* `Gapminder World - A collection of multi-domain, demographic databases for our world. <http://www.gapminder.org/data/>`_
|
* `Gapminder World, demographic databases <http://www.gapminder.org/data/>`_
|
||||||
* `Medicare Coverage Database (MCD) - Containing national and local Coverage Determinations. <http://www.cms.gov/medicare-coverage-database/>`_
|
* `Medicare Coverage Database (MCD), U.S. <http://www.cms.gov/medicare-coverage-database/>`_
|
||||||
* `Medicare Data Engine - Download, explore, and visualize Medicare.gov Data. <https://data.medicare.gov/>`_
|
* `Medicare Data Engine of medicare.gov Data <https://data.medicare.gov/>`_
|
||||||
* `Medicare Data File <http://go.cms.gov/19xxPN4>`_
|
* `Medicare Data File <http://go.cms.gov/19xxPN4>`_
|
||||||
|
* `Number of Ebola Cases and Deaths in Affected Countries (2014) <https://data.hdx.rwlabs.org/dataset/ebola-cases-2014>`_
|
||||||
|
|
||||||
|
|
||||||
Image Processing
|
Image Processing
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
* `2GB of Photos of Cats - 10K cat images with basic annotations. <http://137.189.35.203/WebUI/CatDatabase/catData.html>`_
|
* `10k US Adult Faces Database <http://wilmabainbridge.com/facememorability2.html>`_
|
||||||
* `Face Recognition Benchmark - A collection of face datasets for benchmarking algorithms. <http://www.face-rec.org/databases/>`_
|
* `2GB of Photos of Cats <http://137.189.35.203/WebUI/CatDatabase/catData.html>`_
|
||||||
* `ImageNet - An image database organized according to the WordNet hierarchy. <http://www.image-net.org/>`_
|
* `Affective Image Classification <http://www.imageemotion.org/>`_
|
||||||
|
* `Face Recognition Benchmark <http://www.face-rec.org/databases/>`_
|
||||||
|
* `ImageNet (in WordNet hierarchy) <http://www.image-net.org/>`_
|
||||||
|
* `International Affective Picture System, UFL <http://csea.phhp.ufl.edu/media/iapsmessage.html>`_
|
||||||
|
* `Massive Visual Memory Stimuli, MIT <http://cvcl.mit.edu/MM/stimuli.html>`_
|
||||||
|
* `SUN database, MIT <http://groups.csail.mit.edu/vision/SUN/hierarchy.html>`_
|
||||||
|
|
||||||
|
|
||||||
Machine Learning
|
Machine Learning
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
* `Delve Datasets (Univ. of Toronto) - Evaluating datasets for classification and regression. <http://www.cs.toronto.edu/~delve/data/datasets.html>`_
|
* `Delve Datasets for classification and regression (Univ. of Toronto) <http://www.cs.toronto.edu/~delve/data/datasets.html>`_
|
||||||
* `eBay Online Auctions (2012) - Seller-auction-bidder data with closing prices. <http://www.modelingonlineauctions.com/datasets>`_
|
* `Discogs Monthly Data <http://www.discogs.com/data/>`_
|
||||||
* `IMDb Database - An online database of films, TB programs, and video games. <http://www.imdb.com/interfaces>`_
|
* `eBay Online Auctions (2012) <http://www.modelingonlineauctions.com/datasets>`_
|
||||||
* `Keel Repository - Multiple datasets for classification, regression, time series. <http://sci2s.ugr.es/keel/datasets.php>`_
|
* `IMDb Database <http://www.imdb.com/interfaces>`_
|
||||||
* `Lending Club Loan Data - Loan status (Current, Late, Fully Paid, etc.) and latest payment info. <https://www.lendingclub.com/info/download-data.action>`_
|
* `Keel Repository for classification, regression and time series <http://sci2s.ugr.es/keel/datasets.php>`_
|
||||||
* `Machine Learning Data Set Repository - A data search engine for machine learning tasks. <http://mldata.org/>`_
|
* `Lending Club Loan Data <https://www.lendingclub.com/info/download-data.action>`_
|
||||||
* `Million Song Dataset - Audio features and metadata for a million popular music tracks. <http://labrosa.ee.columbia.edu/millionsong/>`_
|
* `Machine Learning Data Set Repository <http://mldata.org/>`_
|
||||||
* `More Song Datasets - Complementary data of cover songs, lyrics, user listening data. <http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets>`_
|
* `Million Song Dataset <http://labrosa.ee.columbia.edu/millionsong/>`_
|
||||||
* `MovieLens Data Sets - Online movie recommendation including movie tags, user ratings. <http://grouplens.org/datasets/movielens/>`_
|
* `More Song Datasets <http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets>`_
|
||||||
|
* `MovieLens Data Sets <http://grouplens.org/datasets/movielens/>`_
|
||||||
* `RDataMining - "R and Data Mining" ebook data <http://www.rdatamining.com/data>`_
|
* `RDataMining - "R and Data Mining" ebook data <http://www.rdatamining.com/data>`_
|
||||||
* `Registered Meteorites on Earth - 34,513 meteorites updated to 2012. <http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized>`_
|
* `Registered Meteorites on Earth <http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized>`_
|
||||||
* `Restaurants Health Score Data - Health status of restaurants in San Francisco. <http://missionlocal.org/san-francisco-restaurant-health-inspections/>`_
|
* `Restaurants Health Score Data in San Francisco <http://missionlocal.org/san-francisco-restaurant-health-inspections/>`_
|
||||||
* `UCI Machine Learning Repository - One of most famous ML data repositories. <http://archive.ics.uci.edu/ml/>`_
|
* `UCI Machine Learning Repository <http://archive.ics.uci.edu/ml/>`_
|
||||||
* `Yahoo Ratings and Classification Data - About music, movies, user clicks, images etc. <http://webscope.sandbox.yahoo.com/catalog.php?datatype=r>`_
|
* `Yahoo! Ratings and Classification Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=r>`_
|
||||||
|
|
||||||
|
|
||||||
Museums
|
Museums
|
||||||
|
@ -228,36 +248,31 @@ Museums
|
||||||
* `The Getty vocabularies <http://vocab.getty.edu>`_
|
* `The Getty vocabularies <http://vocab.getty.edu>`_
|
||||||
|
|
||||||
|
|
||||||
Music
|
|
||||||
-----
|
|
||||||
|
|
||||||
* `Discogs Data - Monthly dumps of Discogs Release, Artist and Label data. <http://www.discogs.com/data/>`_
|
|
||||||
|
|
||||||
|
|
||||||
Natural Language
|
Natural Language
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
* `ClueWeb09 FACC - Annotated English-language Web pages from the ClueWeb09 corpora. <http://lemurproject.org/clueweb09/FACC1/>`_
|
* `Blogger Corpus <http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm>`_
|
||||||
* `ClueWeb12 FACC - Annotated English-language Web pages from the ClueWeb12 corpora. <http://lemurproject.org/clueweb12/FACC1/>`_
|
* `ClueWeb09 FACC <http://lemurproject.org/clueweb09/FACC1/>`_
|
||||||
* `DBpedia - Multi-domain ontology describing 4.58M “things” with 583M “facts”. <http://wiki.dbpedia.org/Datasets>`_
|
* `ClueWeb12 FACC <http://lemurproject.org/clueweb12/FACC1/>`_
|
||||||
* `Flickr Personal Taxonomies - Personalized tagging pictures with descriptive labels. <http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html>`_
|
* `DBpedia - 4.58M things with 583M facts <http://wiki.dbpedia.org/Datasets>`_
|
||||||
* `Google Books Ngrams (2.2TB) - N-gram corpuses extracted from Google Books. <http://aws.amazon.com/datasets/8172056142375670>`_
|
* `Flickr Personal Taxonomies <http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html>`_
|
||||||
* `Google Web 5gram (1TB, 2006) - 5-gram corpuses extracted from Web pages. <https://catalog.ldc.upenn.edu/LDC2006T13>`_
|
* `Google Books Ngrams (2.2TB) <http://aws.amazon.com/datasets/8172056142375670>`_
|
||||||
* `Gutenberg eBooks List - Basic information about each eBook from Project Gutenberg. <http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs>`_
|
* `Google Web 5gram (1TB, 2006) <https://catalog.ldc.upenn.edu/LDC2006T13>`_
|
||||||
* `Hansards - 1.3M aligned text chunks from official records of Canadian Parliament. <http://www.isi.edu/natural-language/download/hansard/>`_
|
* `Gutenberg eBooks List <http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs>`_
|
||||||
* `Machine Translation - The recurring translation task focusing on European languages. <http://statmt.org/wmt11/translation-task.html#download>`_
|
* `Hansards text chunks of Canadian Parliament <http://www.isi.edu/natural-language/download/hansard/>`_
|
||||||
* `SMS Spam Collection - 5,574 real English messages, labled as being ham or spam. <http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/>`_
|
* `Machine Translation of European languages <http://statmt.org/wmt11/translation-task.html#download>`_
|
||||||
* `USENET corpus - A collection of public USENET postings between Oct 2005 and Jan 2011. <http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html>`_
|
* `SMS Spam Collection in English <http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/>`_
|
||||||
* `Wikidata - Wikipedia databases available in JSON and XML formats. <https://www.wikidata.org/wiki/Wikidata:Database_download>`_
|
* `USENET postings corpus of 2005~2011 <http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html>`_
|
||||||
* `Wikipedia Links data - 40 Million Entities in Context. <https://code.google.com/p/wiki-links/downloads/list>`_
|
* `Wikidata - Wikipedia databases <https://www.wikidata.org/wiki/Wikidata:Database_download>`_
|
||||||
* `WordNet - Databases, associated packages and tools. <http://wordnet.princeton.edu/wordnet/download/>`_
|
* `Wikipedia Links data - 40 Million Entities in Context <https://code.google.com/p/wiki-links/downloads/list>`_
|
||||||
|
* `WordNet databases and tools <http://wordnet.princeton.edu/wordnet/download/>`_
|
||||||
|
|
||||||
|
|
||||||
Physics
|
Physics
|
||||||
-------
|
-------
|
||||||
|
|
||||||
* `CERN Open Data Portal - Experimental data of CMS experiment, ALICE, ATLAS and LHCb <http://opendata.cern.ch/>`_
|
* `CERN Open Data Portal <http://opendata.cern.ch/>`_
|
||||||
* `NSSDC (NASA) - More than 230 TB of data from about 550 space science spacecraft <http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html>`_
|
* `NSSDC (NASA) data of 550 space spacecraft <http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html>`_
|
||||||
|
|
||||||
|
|
||||||
Public Domains
|
Public Domains
|
||||||
|
@ -288,78 +303,84 @@ Public Domains
|
||||||
Search Engines
|
Search Engines
|
||||||
--------------
|
--------------
|
||||||
|
|
||||||
* `Academic Torrents (UMB) - Sharing enormous datasets, for researchers, by researchers. <http://academictorrents.com/>`_
|
* `Academic Torrents of data sharing from UMB <http://academictorrents.com/>`_
|
||||||
* `Archive-it - Web archiving service built at the Internet Archive <https://www.archive-it.org/explore?show=Collections>`_
|
* `Archive-it from Internet Archive <https://www.archive-it.org/explore?show=Collections>`_
|
||||||
* `Datahub.io - The easy way to get, use and share data <http://datahub.io/dataset>`_
|
* `Datahub.io <http://datahub.io/dataset>`_
|
||||||
* `DataMarket (Qlik) <https://datamarket.com/data/list/?q=all>`_
|
* `DataMarket (Qlik) <https://datamarket.com/data/list/?q=all>`_
|
||||||
* `Freebase.com - A community-curated database of well-known people, places, and things <http://www.freebase.com/>`_
|
* `Freebase.com of people, places, and things <http://www.freebase.com/>`_
|
||||||
* `Harvard Dataverse Network - Scientific data for reproducible research <http://thedata.harvard.edu/dvn/>`_
|
* `Harvard Dataverse Network of scientific data <http://thedata.harvard.edu/dvn/>`_
|
||||||
* `ICPSR (UMICH) - Find and analyze data <http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp>`_
|
* `ICPSR (UMICH) <http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp>`_
|
||||||
* `Statista.com - Statistics and Studies from more than 18,000 Sources <http://www.statista.com/>`_
|
* `Open Data Certificates (beta) <https://certificates.theodi.org/datasets>`_
|
||||||
|
* `Statista.com - statistics and Studies <http://www.statista.com/>`_
|
||||||
|
|
||||||
|
|
||||||
Social Sciences
|
Social Sciences
|
||||||
---------------
|
---------------
|
||||||
|
|
||||||
* `Ancestry.com Forum Dataset - Forum users and messages over ten years <http://www.cs.cmu.edu/~jelsas/data/ancestry.com/>`_
|
* `Ancestry.com Forum Dataset over 10 years <http://www.cs.cmu.edu/~jelsas/data/ancestry.com/>`_
|
||||||
* `CMU Enron Email - 150 users, mostly senior management of Enron <http://www.cs.cmu.edu/~enron/>`_
|
* `CMU Enron Email of 150 users <http://www.cs.cmu.edu/~enron/>`_
|
||||||
* `Facebook Data Scrape (2005) - 100 American colleges and univ. <https://archive.org/details/oxford-2005-facebook-matrix>`_
|
* `Facebook Data Scrape (2005) <https://archive.org/details/oxford-2005-facebook-matrix>`_
|
||||||
* `Facebook Social Networks from LAW (since 2007) <http://law.di.unimi.it/datasets.php>`_
|
* `Facebook Social Networks from LAW (since 2007) <http://law.di.unimi.it/datasets.php>`_
|
||||||
* `Foursquare (2010, 2011) - Social networks, check-in locations and categories <http://www.public.asu.edu/~hgao16/dataset.html>`_
|
* `Foursquare Social Network in 2010, 2011 <http://www.public.asu.edu/~hgao16/dataset.html>`_
|
||||||
* `Foursquare from UMN/Sarwat (2013) - Users, venues, check-ins, ratings etc. <https://archive.org/details/201309_foursquare_dataset_umn>`_
|
* `Foursquare from UMN/Sarwat (2013) <https://archive.org/details/201309_foursquare_dataset_umn>`_
|
||||||
* `General Social Survey (GSS, since 1972) - Demographic and attitudinal questions, topics etc. <http://www3.norc.org/GSS+Website/>`_
|
* `General Social Survey (GSS) since 1972 <http://www3.norc.org/GSS+Website/>`_
|
||||||
* `GetGlue - Users rating TV shows <http://bit.ly/1aL8XS0>`_
|
* `GetGlue - users rating TV shows <http://bit.ly/1aL8XS0>`_
|
||||||
* `GitHub Archive - Programmers collaboration, projects progress etc. <http://www.githubarchive.org/>`_
|
* `GitHub Collaboration Archive <http://www.githubarchive.org/>`_
|
||||||
* `Mobile Social Networks (UMASS) - Timestamped mote-to-mote (up to 27 subjects) connections <https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks>`_
|
* `MIT Reality Mining Dataset <http://realitycommons.media.mit.edu/realitymining.html>`_
|
||||||
* `PewResearch Internet Project - A wide range of surveys about library usage, online dating etc. <http://www.pewinternet.org/datasets/pages/2/>`_
|
* `Mobile Social Networks from UMASS <https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks>`_
|
||||||
* `SourceForge.net Research Data - Historic and status statistics of projects and users' activities <http://www.nd.edu/~oss/Data/data.html>`_
|
* `PewResearch Internet Survey Project <http://www.pewinternet.org/datasets/pages/2/>`_
|
||||||
* `Stack Exchange Data Explorer - User-contributed content on the Stack Exchange network <http://data.stackexchange.com/help>`_
|
* `SourceForge.net Research Data <http://www.nd.edu/~oss/Data/data.html>`_
|
||||||
* `Titanic Survival Data Set - Demographic information of Titanic passengers <http://bit.do/dataset-titanic-csv-zip>`_
|
* `StackExchange Data Explorer <http://data.stackexchange.com/help>`_
|
||||||
* `Twitter Graph - Crawled entire Twitter site including tweets, user profiles, relations <http://an.kaist.ac.kr/traces/WWW2010.html>`_
|
* `Titanic Survival Data Set <http://bit.do/dataset-titanic-csv-zip>`_
|
||||||
* `UCB's Archive of Social Science Data (D-Lab) - Holdings of political, social and health areas <http://ucdata.berkeley.edu/>`_
|
* `Twitter Graph of entire Twitter site <http://an.kaist.ac.kr/traces/WWW2010.html>`_
|
||||||
* `UCLA Social Sciences Data Archive - A collection of social science data on the Web <http://dataarchives.ss.ucla.edu/Home.DataPortals.htm>`_
|
* `UCB's Archive of Social Science Data (D-Lab) <http://ucdata.berkeley.edu/>`_
|
||||||
* `UNIMI/LAW Social Network Datasets - Social networks like amazon, LiveJournal, dblp and more <http://law.di.unimi.it/datasets.php>`_
|
* `UCLA Social Sciences Data Archive <http://dataarchives.ss.ucla.edu/Home.DataPortals.htm>`_
|
||||||
* `Universities Worldwide - Links to 9307 Universities in 205 countries <http://univ.cc/>`_
|
* `UNIMI/LAW Social Network Datasets <http://law.di.unimi.it/datasets.php>`_
|
||||||
* `UPJOHN for Employment Research - Labor surveys, unemployment spells and more <http://www.upjohn.org/erdc/erdc.html>`_
|
* `Universities Worldwide <http://univ.cc/>`_
|
||||||
* `Yahoo Graph and Social Data - Web page graph, user-group membership, IM friends etc. <http://webscope.sandbox.yahoo.com/catalog.php?datatype=g>`_
|
* `UPJOHN for Labor Employment Research <http://www.upjohn.org/erdc/erdc.html>`_
|
||||||
* `Youtube Video Graph (2007,2008) - Video relations, uploaders, views, ratings and more <http://netsg.cs.sfu.ca/youtubedata/>`_
|
* `Yahoo! Graph and Social Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=g>`_
|
||||||
|
* `Youtube Video Social Graph in 2007,2008 <http://netsg.cs.sfu.ca/youtubedata/>`_
|
||||||
|
* `Google Scholar citation relations <http://www3.cs.stonybrook.edu/~leman/data/gscholar.db>`_
|
||||||
|
* `Political Polarity Data <http://www3.cs.stonybrook.edu/~leman/data/14-icwsm-political-polarity-data.zip>`_
|
||||||
|
|
||||||
|
|
||||||
Sports
|
Sports
|
||||||
------
|
------
|
||||||
|
|
||||||
* `Betfair Event Results - Fully time-stamped historical Betfair exchange data <http://data.betfair.com/>`_
|
* `Betfair Historical Exchange Data <http://data.betfair.com/>`_
|
||||||
* `Cricsheet (baseball) - Thousands of Cricket matches <http://cricsheet.org/>`_
|
* `Cricsheet Matches (baseball) <http://cricsheet.org/>`_
|
||||||
* `Ergast Formula 1, from 1950 up to date (API available) <http://ergast.com/mrd/db>`_
|
* `Ergast Formula 1, from 1950 up to date (API) <http://ergast.com/mrd/db>`_
|
||||||
* `Football/Soccer resouces (data and APIs) <http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/>`_
|
* `Football/Soccer resouces (data and APIs) <http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/>`_
|
||||||
* `Lahman's Baseball Database - Batting and pitching statistics, team stats etc. <http://www.seanlahman.com/baseball-archive/statistics/>`_
|
* `Lahman's Baseball Database <http://www.seanlahman.com/baseball-archive/statistics/>`_
|
||||||
* `Retrosheet (baseball) - Play-by-Play files, game logs and schedules <http://www.retrosheet.org/game.htm>`_
|
* `Retrosheet Baseball Statistics <http://www.retrosheet.org/game.htm>`_
|
||||||
|
|
||||||
|
|
||||||
Time Series
|
Time Series
|
||||||
-----------
|
-----------
|
||||||
|
|
||||||
* `Time Series data Library (TSDL), created by Rob Hyndman, MU <https://datamarket.com/data/list/?q=provider:tsdl>`_
|
* `Time Series Data Library (TSDL) from MU <https://datamarket.com/data/list/?q=provider:tsdl>`_
|
||||||
* `UC Riverside Time Series, for classification and clustering. <http://www.cs.ucr.edu/~eamonn/time_series_data/>`_
|
* `UC Riverside Time Series Dataset <http://www.cs.ucr.edu/~eamonn/time_series_data/>`_
|
||||||
|
* `Hard Drive Failure Rates <https://www.backblaze.com/hard-drive-test-data.html>`_
|
||||||
|
|
||||||
|
|
||||||
Transportation
|
Transportation
|
||||||
--------------
|
--------------
|
||||||
|
|
||||||
* `Airlines OD Data 1987-2008, used by ASA Challenge 2009 <http://stat-computing.org/dataexpo/2009/the-data.html>`_
|
* `Airlines OD Data 1987-2008 <http://stat-computing.org/dataexpo/2009/the-data.html>`_
|
||||||
* `Bike Share Data Systems - Trip histories, site maps etc. <https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems>`_
|
* `Bike Share Systems (BSS) collection <https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems>`_
|
||||||
* `Bay Area Bike Share Data - Bike availability and trip history <http://www.bayareabikeshare.com/datachallenge>`_
|
* `Bay Area Bike Share Data <http://www.bayareabikeshare.com/datachallenge>`_
|
||||||
* `Edge data for US domestic flights 1990 to 2009 <http://data.memect.com/?p=229>`_
|
* `GeoLife GPS Trajectory from Microsoft Research <http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/>`_
|
||||||
* `Half a million Hubway rides in MA <http://hubwaydatachallenge.org/trip-history-data/>`_
|
* `Hubway Million Rides in MA <http://hubwaydatachallenge.org/trip-history-data/>`_
|
||||||
* `Marine Traffic - Ship tracks, port calls and more <https://www.marinetraffic.com/de/p/api-services>`_
|
* `Marine Traffic - ship tracks, port calls and more <https://www.marinetraffic.com/de/p/api-services>`_
|
||||||
* `NYC Taxi Trip Data 2013 - FOIA/FOILed by Chris Whong <https://archive.org/details/nycTaxiTripData2013>`_
|
* `NYC Taxi Trip Data 2013 (FOIA/FOILed) <https://archive.org/details/nycTaxiTripData2013>`_
|
||||||
* `OpenFlights - Airport, airline and route data <http://openflights.org/data.html>`_
|
* `OpenFlights - airport, airline and route data <http://openflights.org/data.html>`_
|
||||||
* `RITA Airline On-Time Performance data of major air carriers in US <http://www.transtats.bts.gov/Tables.asp?DB_ID=120>`_
|
* `RITA Airline On-Time Performance data <http://www.transtats.bts.gov/Tables.asp?DB_ID=120>`_
|
||||||
* `RITA/BTS transport data collection (TranStat) <http://www.transtats.bts.gov/DataIndex.asp>`_
|
* `RITA/BTS transport data collection (TranStat) <http://www.transtats.bts.gov/DataIndex.asp>`_
|
||||||
* `Transport for London (TFL) - Trip histories and networking statistics <http://www.tfl.gov.uk/info-for/open-data-users/our-feeds>`_
|
* `Transport for London (TFL) <http://www.tfl.gov.uk/info-for/open-data-users/our-feeds>`_
|
||||||
* `Travel Tracker Survey (TTS), Chicago, 1990, 2007-2008 <http://www.cmap.illinois.gov/data/transportation/travel-tracker-survey>`_
|
* `Travel Tracker Survey (TTS) for Chicago <http://www.cmap.illinois.gov/data/transportation/travel-tracker-survey>`_
|
||||||
* `U.S. Bureau of Transportation Statistics (BTS) <http://www.rita.dot.gov/bts/>`_
|
* `U.S. Bureau of Transportation Statistics (BTS) <http://www.rita.dot.gov/bts/>`_
|
||||||
* `U.S. Freight Analysis Framework - Freight movement among states since 2007 <http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm>`_
|
* `U.S. Domestic Flights 1990 to 2009 <http://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a>`_
|
||||||
|
* `U.S. Freight Analysis Framework since 2007 <http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm>`_
|
||||||
|
|
||||||
|
|
||||||
Complementary Collections
|
Complementary Collections
|
||||||
|
@ -369,4 +390,4 @@ Complementary Collections
|
||||||
* Inside-r: `Finding Data on the Internet <http://www.inside-r.org/howto/finding-data-internet>`_
|
* Inside-r: `Finding Data on the Internet <http://www.inside-r.org/howto/finding-data-internet>`_
|
||||||
* Quora: `Where can I find large datasets open to the public? <http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public>`_
|
* Quora: `Where can I find large datasets open to the public? <http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public>`_
|
||||||
* RS.io: `100+ Interesting Data Sets for Statistics <http://rs.io/2014/05/29/list-of-data-sets.html>`_
|
* RS.io: `100+ Interesting Data Sets for Statistics <http://rs.io/2014/05/29/list-of-data-sets.html>`_
|
||||||
* StaTrek: `Leveraging open data to understand urban lives <http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/>`_
|
* StaTrek: `Leveraging open data to understand urban lives <http://xiaming.me/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/>`_
|
||||||
|
|