diff --git a/README.rst b/README.rst
index efe7a9d..0876da2 100644
--- a/README.rst
+++ b/README.rst
@@ -38,7 +38,7 @@ Climate/Weather
 
 * `Australian Weather `_
 * `Canadian Meteorological Centre `_
-* `Climate Data from UEA (updated at roughly monthly intervals) `_
+* `Climate Data from UEA (updated monthly) `_
 * `Global Climate Data Since 1929 `_
 * `NOAA Bering Sea Climate `_
 * `NOAA Climate Datasets `_
@@ -52,6 +52,8 @@ Complex Networks
 * `CrossRef DOI URLs `_
 * `DBLP Citation dataset `_
 * `NBER Patent Citations `_
+* `Network Data `_
+* `UCI Network Data Repository `_
 * `NIST complex networks data collection `_
 * `Protein-protein interaction network `_
 * `PyPI and Maven Dependency Network `_
@@ -68,21 +70,23 @@ Complex Networks
 Computer Networks
 -----------------
 
-* `3.5B Web Pages - Web graph extracted from CommonCraw 2012 web corpus. `_
-* `53.5B Web clicks - Anonymized HTTP records from 100K users in Indiana Univ. `_
-* `CAIDA Internet Datasets - Network traces and topologies at geographically diverse locations. `_
-* `ClueWeb09 - About 1B web pages in ten languages that were collected in Jan. and Feb. 2009. `_
-* `ClueWeb12 - About 733M web pages collected between Feb. and May 2012. `_
-* `CommonCrawl Web Data - Petabytes of data collected over 7 years of web crawling. `_
-* `CRAWDAD Wireless datasets (Dartmouth) - A wireless network data resource for research communities. `_
-* `OpenMobileData (MobiPerf) - Mobile performance measurement data collected with active tests. `_
-* `UCSD Network Telescope - A passive traffic monitoring system covering IPv4 /8 net. `_
+* `3.5B Web Pages from CommonCraw 2012 `_
+* `53.5B Web clicks of 100K users in Indiana Univ. `_
+* `CAIDA Internet Datasets `_
+* `ClueWeb09 - 1B web pages `_
+* `ClueWeb12 - 733M web pages `_
+* `CommonCrawl Web Data over 7 years `_
+* `CRAWDAD Wireless datasets from Dartmouth Univ. `_
+* `Criteo click-through data `_
+* `Open Mobile Data by MobiPerf `_
+* `UCSD Network Telescope, IPv4 /8 net `_
 
 
 Data Challenges
 ---------------
 
 * `Challenges in Machine Learning `_
+* `D4D Challenge of Orange `_
 * `DrivenData Competitions for Social Good `_
 * `ICWSM Data Challenge (since 2009) `_
 * `Kaggle Competition Data `_
@@ -95,7 +99,7 @@ Data Challenges
 Economics
 ---------
 
-* `American Economic Ass. (AEA) `_
+* `American Economic Ass (AEA) `_
 * `EconData from UMD `_
 * `Internet Product Code Database `_
 
@@ -133,32 +137,41 @@ Finance
 GeoSpace/GIS
 ------------
 
-* `BODC - Marine data of nearly 22,000 oceanographic vars. `_
-* `EOSDIS - A data collection of NASA's earth observing system data and information system. `_
-* `Factual Global Location Data - 65M POIs with extended attributes in 50 countries. `_
-* `Global Administrative Areas Database (GADM) - For countries and low-level subdivisions. `_
-* `Geo Spatial Data from ASU - Several small spatial or GIS datasets. `_
-* `GeoNames - Over eight million placenames (countries, city stat etc.) of the world. `_
-* `Natural Earth - Vectors and rasters of the world in multiple scales. `_
-* `OpenStreetMap - A free map worldwide maintained by the communities. `_
-* `TIGER/Line - Official United States boundaries and roads. `_
-* `TwoFishes - Foursquare's coarse geocoder. `_
-* `TZ Timezones - A shapefile of the TZ timezones of the world. `_
+* `BODC - marine data of ~22K vars `_
+* `Cambridge, MA, US, GIS data on GitHub `_
+* `EOSDIS - NASA's earth observing system data `_
+* `Factual Global Location Data `_
+* `Geo Spatial Data from ASU `_
+* `GeoNames Worldwide `_
+* `Global Administrative Areas Database (GADM) `_
+* `Landsat 8 on AWS `_
+* `Natural Earth - vectors and rasters of the world `_
+* `Open Street Map (OSM) `_
+* `TIGER/Line - U.S. boundaries and roads `_
+* `TwoFishes - Foursquare's coarse geocoder `_
+* `TZ Timezones shapfiles `_
 
 
 Government
 ----------
 
-* `Australia `_ (abs.gov.au)
-* `Australia `_ (data.gov.au)
+* `Australia (abs.gov.au) `_
+* `Australia (data.gov.au) `_
+* `Brazil `_
+* `Cambridge, MA, US `_
 * `Canada `_
 * `Chicago `_
+* `Dallas Open Data `_
+* `Denver Open Data `_
 * `EuroStat `_
 * `FedStats `_
+* `France `_
 * `Germany `_
 * `Glasgow, Scotland, UK `_
 * `Guardian world governments `_
-* `London Datastore, U.K `_
+* `Indian Government Data `_
+* `London Datastore, UK `_
+* `MassGIS, Massachusetts, U.S. `_
 * `Netherlands `_
 * `New Zealand `_
 * `NYC betanyc `_
@@ -166,6 +179,7 @@ Government
 * `OECD `_
 * `Open Government Data (OGD) Platform India `_
 * `San Francisco Data sets `_
+* `Seattle `_
 * `South Africa `_
 * `The World Bank `_
 * `U.K. Government Data `_
@@ -184,39 +198,45 @@ Government
 Healthcare
 ----------
 
-* `EHDP Large Health Data Sets - A collection of health datasets across domains and countries. `_
-* `Gapminder World - A collection of multi-domain, demographic databases for our world. `_
-* `Medicare Coverage Database (MCD) - Containing national and local Coverage Determinations. `_
-* `Medicare Data Engine - Download, explore, and visualize Medicare.gov Data. `_
+* `EHDP Large Health Data Sets `_
+* `Gapminder World, demographic databases `_
+* `Medicare Coverage Database (MCD), U.S. `_
+* `Medicare Data Engine of medicare.gov Data `_
 * `Medicare Data File `_
-
+* `Number of Ebola Cases and Deaths in Affected Countries (2014) `_
 
 
 Image Processing
 ----------------
 
-* `2GB of Photos of Cats - 10K cat images with basic annotations. `_
-* `Face Recognition Benchmark - A collection of face datasets for benchmarking algorithms. `_
-* `ImageNet - An image database organized according to the WordNet hierarchy. `_
+* `10k US Adult Faces Database `_
+* `2GB of Photos of Cats `_
+* `Affective Image Classification `_
+* `Face Recognition Benchmark `_
+* `ImageNet (in WordNet hierarchy) `_
+* `International Affective Picture System, UFL `_
+* `Massive Visual Memory Stimuli, MIT `_
+* `SUN database, MIT `_
 
 
 Machine Learning
 ----------------
 
-* `Delve Datasets (Univ. of Toronto) - Evaluating datasets for classification and regression. `_
-* `eBay Online Auctions (2012) - Seller-auction-bidder data with closing prices. `_
-* `IMDb Database - An online database of films, TB programs, and video games. `_
-* `Keel Repository - Multiple datasets for classification, regression, time series. `_
-* `Lending Club Loan Data - Loan status (Current, Late, Fully Paid, etc.) and latest payment info. `_
-* `Machine Learning Data Set Repository - A data search engine for machine learning tasks. `_
-* `Million Song Dataset - Audio features and metadata for a million popular music tracks. `_
-* `More Song Datasets - Complementary data of cover songs, lyrics, user listening data. `_
-* `MovieLens Data Sets - Online movie recommendation including movie tags, user ratings. `_
+* `Delve Datasets for classification and regression (Univ. of Toronto) `_
+* `Discogs Monthly Data `_
+* `eBay Online Auctions (2012) `_
+* `IMDb Database `_
+* `Keel Repository for classification, regression and time series `_
+* `Lending Club Loan Data `_
+* `Machine Learning Data Set Repository `_
+* `Million Song Dataset `_
+* `More Song Datasets `_
+* `MovieLens Data Sets `_
 * `RDataMining - "R and Data Mining" ebook data `_
-* `Registered Meteorites on Earth - 34,513 meteorites updated to 2012. `_
-* `Restaurants Health Score Data - Health status of restaurants in San Francisco. `_
-* `UCI Machine Learning Repository - One of most famous ML data repositories. `_
-* `Yahoo Ratings and Classification Data - About music, movies, user clicks, images etc. `_
+* `Registered Meteorites on Earth `_
+* `Restaurants Health Score Data in San Francisco `_
+* `UCI Machine Learning Repository `_
+* `Yahoo! Ratings and Classification Data `_
 
 
 Museums
@@ -228,36 +248,31 @@ Museums
 * `The Getty vocabularies `_
 
 
-Music
------
-
-* `Discogs Data - Monthly dumps of Discogs Release, Artist and Label data. `_
-
-
 Natural Language
 ----------------
 
-* `ClueWeb09 FACC - Annotated English-language Web pages from the ClueWeb09 corpora. `_
-* `ClueWeb12 FACC - Annotated English-language Web pages from the ClueWeb12 corpora. `_
-* `DBpedia - Multi-domain ontology describing 4.58M “things” with 583M “facts”. `_
-* `Flickr Personal Taxonomies - Personalized tagging pictures with descriptive labels. `_
-* `Google Books Ngrams (2.2TB) - N-gram corpuses extracted from Google Books. `_
-* `Google Web 5gram (1TB, 2006) - 5-gram corpuses extracted from Web pages. `_
-* `Gutenberg eBooks List - Basic information about each eBook from Project Gutenberg. `_
-* `Hansards - 1.3M aligned text chunks from official records of Canadian Parliament. `_
-* `Machine Translation - The recurring translation task focusing on European languages. `_
-* `SMS Spam Collection - 5,574 real English messages, labled as being ham or spam. `_
-* `USENET corpus - A collection of public USENET postings between Oct 2005 and Jan 2011. `_
-* `Wikidata - Wikipedia databases available in JSON and XML formats. `_
-* `Wikipedia Links data - 40 Million Entities in Context. `_
-* `WordNet - Databases, associated packages and tools. `_
+* `Blogger Corpus `_
+* `ClueWeb09 FACC `_
+* `ClueWeb12 FACC `_
+* `DBpedia - 4.58M things with 583M facts `_
+* `Flickr Personal Taxonomies `_
+* `Google Books Ngrams (2.2TB) `_
+* `Google Web 5gram (1TB, 2006) `_
+* `Gutenberg eBooks List `_
+* `Hansards text chunks of Canadian Parliament `_
+* `Machine Translation of European languages `_
+* `SMS Spam Collection in English `_
+* `USENET postings corpus of 2005~2011 `_
+* `Wikidata - Wikipedia databases `_
+* `Wikipedia Links data - 40 Million Entities in Context `_
+* `WordNet databases and tools `_
 
 
 Physics
 -------
 
-* `CERN Open Data Portal - Experimental data of CMS experiment, ALICE, ATLAS and LHCb `_
-* `NSSDC (NASA) - More than 230 TB of data from about 550 space science spacecraft `_
+* `CERN Open Data Portal `_
+* `NSSDC (NASA) data of 550 space spacecraft `_
 
 
 Public Domains
@@ -288,78 +303,84 @@ Public Domains
 Search Engines
 --------------
 
-* `Academic Torrents (UMB) - Sharing enormous datasets, for researchers, by researchers. `_
-* `Archive-it - Web archiving service built at the Internet Archive `_
-* `Datahub.io - The easy way to get, use and share data `_
+* `Academic Torrents of data sharing from UMB `_
+* `Archive-it from Internet Archive `_
+* `Datahub.io `_
 * `DataMarket (Qlik) `_
-* `Freebase.com - A community-curated database of well-known people, places, and things `_
-* `Harvard Dataverse Network - Scientific data for reproducible research `_
-* `ICPSR (UMICH) - Find and analyze data `_
-* `Statista.com - Statistics and Studies from more than 18,000 Sources `_
+* `Freebase.com of people, places, and things `_
+* `Harvard Dataverse Network of scientific data `_
+* `ICPSR (UMICH) `_
+* `Open Data Certificates (beta) `_
+* `Statista.com - statistics and Studies `_
 
 
 Social Sciences
 ---------------
 
-* `Ancestry.com Forum Dataset - Forum users and messages over ten years `_
-* `CMU Enron Email - 150 users, mostly senior management of Enron `_
-* `Facebook Data Scrape (2005) - 100 American colleges and univ. `_
+* `Ancestry.com Forum Dataset over 10 years `_
+* `CMU Enron Email of 150 users `_
+* `Facebook Data Scrape (2005) `_
 * `Facebook Social Networks from LAW (since 2007) `_
-* `Foursquare (2010, 2011) - Social networks, check-in locations and categories `_
-* `Foursquare from UMN/Sarwat (2013) - Users, venues, check-ins, ratings etc. `_
-* `General Social Survey (GSS, since 1972) - Demographic and attitudinal questions, topics etc. `_
-* `GetGlue - Users rating TV shows `_
-* `GitHub Archive - Programmers collaboration, projects progress etc. `_
-* `Mobile Social Networks (UMASS) - Timestamped mote-to-mote (up to 27 subjects) connections `_
-* `PewResearch Internet Project - A wide range of surveys about library usage, online dating etc. `_
-* `SourceForge.net Research Data - Historic and status statistics of projects and users' activities `_
-* `Stack Exchange Data Explorer - User-contributed content on the Stack Exchange network `_
-* `Titanic Survival Data Set - Demographic information of Titanic passengers `_
-* `Twitter Graph - Crawled entire Twitter site including tweets, user profiles, relations `_
-* `UCB's Archive of Social Science Data (D-Lab) - Holdings of political, social and health areas `_
-* `UCLA Social Sciences Data Archive - A collection of social science data on the Web `_
-* `UNIMI/LAW Social Network Datasets - Social networks like amazon, LiveJournal, dblp and more `_
-* `Universities Worldwide - Links to 9307 Universities in 205 countries `_
-* `UPJOHN for Employment Research - Labor surveys, unemployment spells and more `_
-* `Yahoo Graph and Social Data - Web page graph, user-group membership, IM friends etc. `_
-* `Youtube Video Graph (2007,2008) - Video relations, uploaders, views, ratings and more `_
+* `Foursquare Social Network in 2010, 2011 `_
+* `Foursquare from UMN/Sarwat (2013) `_
+* `General Social Survey (GSS) since 1972 `_
+* `GetGlue - users rating TV shows `_
+* `GitHub Collaboration Archive `_
+* `MIT Reality Mining Dataset `_
+* `Mobile Social Networks from UMASS `_
+* `PewResearch Internet Survey Project `_
+* `SourceForge.net Research Data `_
+* `StackExchange Data Explorer `_
+* `Titanic Survival Data Set `_
+* `Twitter Graph of entire Twitter site `_
+* `UCB's Archive of Social Science Data (D-Lab) `_
+* `UCLA Social Sciences Data Archive `_
+* `UNIMI/LAW Social Network Datasets `_
+* `Universities Worldwide `_
+* `UPJOHN for Labor Employment Research `_
+* `Yahoo! Graph and Social Data `_
+* `Youtube Video Social Graph in 2007,2008 `_
+* `Google Scholar citation relations `_
+* `Political Polarity Data `_
 
 
 Sports
 ------
 
-* `Betfair Event Results - Fully time-stamped historical Betfair exchange data `_
-* `Cricsheet (baseball) - Thousands of Cricket matches `_
-* `Ergast Formula 1, from 1950 up to date (API available) `_
+* `Betfair Historical Exchange Data `_
+* `Cricsheet Matches (baseball) `_
+* `Ergast Formula 1, from 1950 up to date (API) `_
 * `Football/Soccer resouces (data and APIs) `_
-* `Lahman's Baseball Database - Batting and pitching statistics, team stats etc. `_
-* `Retrosheet (baseball) - Play-by-Play files, game logs and schedules `_
+* `Lahman's Baseball Database `_
+* `Retrosheet Baseball Statistics `_
 
 
 Time Series
 -----------
 
-* `Time Series data Library (TSDL), created by Rob Hyndman, MU `_
-* `UC Riverside Time Series, for classification and clustering. `_
+* `Time Series Data Library (TSDL) from MU `_
+* `UC Riverside Time Series Dataset `_
+* `Hard Drive Failure Rates `_
 
 
 Transportation
 --------------
 
-* `Airlines OD Data 1987-2008, used by ASA Challenge 2009 `_
-* `Bike Share Data Systems - Trip histories, site maps etc. `_
-* `Bay Area Bike Share Data - Bike availability and trip history `_
-* `Edge data for US domestic flights 1990 to 2009 `_
-* `Half a million Hubway rides in MA `_
-* `Marine Traffic - Ship tracks, port calls and more `_
-* `NYC Taxi Trip Data 2013 - FOIA/FOILed by Chris Whong `_
-* `OpenFlights - Airport, airline and route data `_
-* `RITA Airline On-Time Performance data of major air carriers in US `_
+* `Airlines OD Data 1987-2008 `_
+* `Bike Share Systems (BSS) collection `_
+* `Bay Area Bike Share Data `_
+* `GeoLife GPS Trajectory from Microsoft Research `_
+* `Hubway Million Rides in MA `_
+* `Marine Traffic - ship tracks, port calls and more `_
+* `NYC Taxi Trip Data 2013 (FOIA/FOILed) `_
+* `OpenFlights - airport, airline and route data `_
+* `RITA Airline On-Time Performance data `_
 * `RITA/BTS transport data collection (TranStat) `_
-* `Transport for London (TFL) - Trip histories and networking statistics `_
-* `Travel Tracker Survey (TTS), Chicago, 1990, 2007-2008 `_
+* `Transport for London (TFL) `_
+* `Travel Tracker Survey (TTS) for Chicago `_
 * `U.S. Bureau of Transportation Statistics (BTS) `_
-* `U.S. Freight Analysis Framework - Freight movement among states since 2007 `_
+* `U.S. Domestic Flights 1990 to 2009 `_
+* `U.S. Freight Analysis Framework since 2007 `_
 
 
 Complementary Collections
@@ -369,4 +390,4 @@ Complementary Collections
 * Inside-r: `Finding Data on the Internet `_
 * Quora: `Where can I find large datasets open to the public? `_
 * RS.io: `100+ Interesting Data Sets for Statistics `_
-* StaTrek: `Leveraging open data to understand urban lives `_
+* StaTrek: `Leveraging open data to understand urban lives `_