From 43889987053539f58940af8f8ce3cc05ec19dc28 Mon Sep 17 00:00:00 2001
From: Xiaming Chen
Date: Sat, 31 Jan 2015 17:18:37 +0800
Subject: [PATCH] Tidy data description

---
 README.rst | 221 ++++++++++++++++++++++++++---------------------------
 1 file changed, 108 insertions(+), 113 deletions(-)

diff --git a/README.rst b/README.rst
index 1168426..2a43faf 100644
--- a/README.rst
+++ b/README.rst
@@ -38,7 +38,7 @@ Climate/Weather

 * `Australian Weather `_
 * `Canadian Meteorological Centre `_
-* `Climate Data from UEA (updated at roughly monthly intervals) `_
+* `Climate Data from UEA (updated monthly) `_
 * `Global Climate Data Since 1929 `_
 * `NOAA Bering Sea Climate `_
 * `NOAA Climate Datasets `_
@@ -68,15 +68,15 @@ Complex Networks
 Computer Networks
 -----------------

-* `3.5B Web Pages - Web graph extracted from CommonCraw 2012 web corpus. `_
-* `53.5B Web clicks - Anonymized HTTP records from 100K users in Indiana Univ. `_
-* `CAIDA Internet Datasets - Network traces and topologies at geographically diverse locations. `_
-* `ClueWeb09 - About 1B web pages in ten languages that were collected in Jan. and Feb. 2009. `_
-* `ClueWeb12 - About 733M web pages collected between Feb. and May 2012. `_
-* `CommonCrawl Web Data - Petabytes of data collected over 7 years of web crawling. `_
-* `CRAWDAD Wireless datasets (Dartmouth) - A wireless network data resource for research communities. `_
-* `OpenMobileData (MobiPerf) - Mobile performance measurement data collected with active tests. `_
-* `UCSD Network Telescope - A passive traffic monitoring system covering IPv4 /8 net. `_
+* `3.5B Web Pages from CommonCraw 2012 `_
+* `53.5B Web clicks of 100K users in Indiana Univ. `_
+* `CAIDA Internet Datasets `_
+* `ClueWeb09 - 1B web pages `_
+* `ClueWeb12 - 733M web pages `_
+* `CommonCrawl Web Data over 7 years `_
+* `CRAWDAD Wireless datasets from Dartmouth Univ. `_
+* `Open Mobile Data by MobiPerf `_
+* `UCSD Network Telescope, IPv4 /8 net `_


 Data Challenges
@@ -95,7 +95,7 @@ Data Challenges
 Economics
 ---------

-* `American Economic Ass. (AEA) `_
+* `American Economic Ass (AEA) `_
 * `EconData from UMD `_
 * `Internet Product Code Database `_

@@ -133,24 +133,24 @@ Finance
 GeoSpace/GIS
 ------------

-* `BODC - Marine data of nearly 22,000 oceanographic vars. `_
-* `EOSDIS - A data collection of NASA's earth observing system data and information system. `_
-* `Factual Global Location Data - 65M POIs with extended attributes in 50 countries. `_
-* `Global Administrative Areas Database (GADM) - For countries and low-level subdivisions. `_
-* `Geo Spatial Data from ASU - Several small spatial or GIS datasets. `_
-* `GeoNames - Over eight million placenames (countries, city stat etc.) of the world. `_
-* `Natural Earth - Vectors and rasters of the world in multiple scales. `_
-* `OpenStreetMap - A free map worldwide maintained by the communities. `_
-* `TIGER/Line - Official United States boundaries and roads. `_
-* `TwoFishes - Foursquare's coarse geocoder. `_
-* `TZ Timezones - A shapefile of the TZ timezones of the world. `_
+* `BODC - marine data of ~22K vars `_
+* `EOSDIS - NASA's earth observing system data `_
+* `Factual Global Location Data `_
+* `Global Administrative Areas Database (GADM) `_
+* `Geo Spatial Data from ASU `_
+* `GeoNames Worldwide `_
+* `Natural Earth - vectors and rasters of the world `_
+* `Open Street Map (OSM) `_
+* `TIGER/Line - U.S. boundaries and roads `_
+* `TwoFishes - Foursquare's coarse geocoder `_
+* `TZ Timezones shapfiles `_


 Government
 ----------

-* `Australia `_ (abs.gov.au)
-* `Australia `_ (data.gov.au)
+* `Australia (abs.gov.au) `_
+* `Australia (data.gov.au) `_
 * `Canada `_
 * `Chicago `_
 * `EuroStat `_
@@ -185,10 +185,10 @@ Government
 Healthcare
 ----------

-* `EHDP Large Health Data Sets - A collection of health datasets across domains and countries. `_
-* `Gapminder World - A collection of multi-domain, demographic databases for our world. `_
-* `Medicare Coverage Database (MCD) - Containing national and local Coverage Determinations. `_
-* `Medicare Data Engine - Download, explore, and visualize Medicare.gov Data. `_
+* `EHDP Large Health Data Sets `_
+* `Gapminder World, demographic databases `_
+* `Medicare Coverage Database (MCD), U.S. `_
+* `Medicare Data Engine of medicare.gov Data `_
 * `Medicare Data File `_


@@ -196,28 +196,29 @@ Healthcare
 Image Processing
 ----------------

-* `2GB of Photos of Cats - 10K cat images with basic annotations. `_
-* `Face Recognition Benchmark - A collection of face datasets for benchmarking algorithms. `_
-* `ImageNet - An image database organized according to the WordNet hierarchy. `_
+* `2GB of Photos of Cats `_
+* `Face Recognition Benchmark `_
+* `ImageNet - an image database in WordNet hierarchy `_


 Machine Learning
 ----------------

-* `Delve Datasets (Univ. of Toronto) - Evaluating datasets for classification and regression. `_
-* `eBay Online Auctions (2012) - Seller-auction-bidder data with closing prices. `_
-* `IMDb Database - An online database of films, TB programs, and video games. `_
-* `Keel Repository - Multiple datasets for classification, regression, time series. `_
-* `Lending Club Loan Data - Loan status (Current, Late, Fully Paid, etc.) and latest payment info. `_
-* `Machine Learning Data Set Repository - A data search engine for machine learning tasks. `_
-* `Million Song Dataset - Audio features and metadata for a million popular music tracks. `_
-* `More Song Datasets - Complementary data of cover songs, lyrics, user listening data. `_
-* `MovieLens Data Sets - Online movie recommendation including movie tags, user ratings. `_
+* `Delve Datasets for classification and regression (Univ. of Toronto) `_
+* `Discogs Monthly Data `_
+* `eBay Online Auctions (2012) `_
+* `IMDb Database `_
+* `Keel Repository for classification, regression and time series `_
+* `Lending Club Loan Data `_
+* `Machine Learning Data Set Repository `_
+* `Million Song Dataset `_
+* `More Song Datasets `_
+* `MovieLens Data Sets `_
 * `RDataMining - "R and Data Mining" ebook data `_
-* `Registered Meteorites on Earth - 34,513 meteorites updated to 2012. `_
-* `Restaurants Health Score Data - Health status of restaurants in San Francisco. `_
-* `UCI Machine Learning Repository - One of most famous ML data repositories. `_
-* `Yahoo Ratings and Classification Data - About music, movies, user clicks, images etc. `_
+* `Registered Meteorites on Earth `_
+* `Restaurants Health Score Data in San Francisco `_
+* `UCI Machine Learning Repository `_
+* `Yahoo! Ratings and Classification Data `_


 Museums
@@ -229,36 +230,30 @@ Museums
 * `The Getty vocabularies `_


-Music
------
-
-* `Discogs Data - Monthly dumps of Discogs Release, Artist and Label data. `_
-
-
 Natural Language
 ----------------

-* `ClueWeb09 FACC - Annotated English-language Web pages from the ClueWeb09 corpora. `_
-* `ClueWeb12 FACC - Annotated English-language Web pages from the ClueWeb12 corpora. `_
-* `DBpedia - Multi-domain ontology describing 4.58M “things” with 583M “facts”. `_
-* `Flickr Personal Taxonomies - Personalized tagging pictures with descriptive labels. `_
-* `Google Books Ngrams (2.2TB) - N-gram corpuses extracted from Google Books. `_
-* `Google Web 5gram (1TB, 2006) - 5-gram corpuses extracted from Web pages. `_
-* `Gutenberg eBooks List - Basic information about each eBook from Project Gutenberg. `_
-* `Hansards - 1.3M aligned text chunks from official records of Canadian Parliament. `_
-* `Machine Translation - The recurring translation task focusing on European languages. `_
-* `SMS Spam Collection - 5,574 real English messages, labled as being ham or spam. `_
-* `USENET corpus - A collection of public USENET postings between Oct 2005 and Jan 2011. `_
-* `Wikidata - Wikipedia databases available in JSON and XML formats. `_
-* `Wikipedia Links data - 40 Million Entities in Context. `_
-* `WordNet - Databases, associated packages and tools. `_
+* `ClueWeb09 FACC `_
+* `ClueWeb12 FACC `_
+* `DBpedia - 4.58M “things” with 583M “facts”`_
+* `Flickr Personal Taxonomies `_
+* `Google Books Ngrams (2.2TB) `_
+* `Google Web 5gram (1TB, 2006) `_
+* `Gutenberg eBooks List `_
+* `Hansards text chunks of Canadian Parliament `_
+* `Machine Translation of European languages `_
+* `SMS Spam Collection in English `_
+* `USENET postings corpus of 2005~2011 `_
+* `Wikidata - Wikipedia databases `_
+* `Wikipedia Links data - 40 Million Entities in Context `_
+* `WordNet databases and tools `_


 Physics
 -------

-* `CERN Open Data Portal - Experimental data of CMS experiment, ALICE, ATLAS and LHCb `_
-* `NSSDC (NASA) - More than 230 TB of data from about 550 space science spacecraft `_
+* `CERN Open Data Portal `_
+* `NSSDC (NASA) data of 550 space spacecraft `_


 Public Domains
@@ -289,77 +284,77 @@ Public Domains
 Search Engines
 --------------

-* `Academic Torrents (UMB) - Sharing enormous datasets, for researchers, by researchers. `_
-* `Archive-it - Web archiving service built at the Internet Archive `_
-* `Datahub.io - The easy way to get, use and share data `_
+* `Academic Torrents of data sharing from UMB `_
+* `Archive-it from Internet Archive `_
+* `Datahub.io `_
 * `DataMarket (Qlik) `_
-* `Freebase.com - A community-curated database of well-known people, places, and things `_
-* `Harvard Dataverse Network - Scientific data for reproducible research `_
-* `ICPSR (UMICH) - Find and analyze data `_
-* `Statista.com - Statistics and Studies from more than 18,000 Sources `_
+* `Freebase.com of people, places, and things `_
+* `Harvard Dataverse Network of scientific data `_
+* `ICPSR (UMICH) `_
+* `Statista.com - statistics and Studies `_


 Social Sciences
 ---------------

-* `Ancestry.com Forum Dataset - Forum users and messages over ten years `_
-* `CMU Enron Email - 150 users, mostly senior management of Enron `_
-* `Facebook Data Scrape (2005) - 100 American colleges and univ. `_
+* `Ancestry.com Forum Dataset over 10 years `_
+* `CMU Enron Email of 150 users `_
+* `Facebook Data Scrape (2005) `_
 * `Facebook Social Networks from LAW (since 2007) `_
-* `Foursquare (2010, 2011) - Social networks, check-in locations and categories `_
-* `Foursquare from UMN/Sarwat (2013) - Users, venues, check-ins, ratings etc. `_
-* `General Social Survey (GSS, since 1972) - Demographic and attitudinal questions, topics etc. `_
-* `GetGlue - Users rating TV shows `_
-* `GitHub Archive - Programmers collaboration, projects progress etc. `_
-* `Mobile Social Networks (UMASS) - Timestamped mote-to-mote (up to 27 subjects) connections `_
-* `PewResearch Internet Project - A wide range of surveys about library usage, online dating etc. `_
-* `SourceForge.net Research Data - Historic and status statistics of projects and users' activities `_
-* `Stack Exchange Data Explorer - User-contributed content on the Stack Exchange network `_
-* `Titanic Survival Data Set - Demographic information of Titanic passengers `_
-* `Twitter Graph - Crawled entire Twitter site including tweets, user profiles, relations `_
-* `UCB's Archive of Social Science Data (D-Lab) - Holdings of political, social and health areas `_
-* `UCLA Social Sciences Data Archive - A collection of social science data on the Web `_
-* `UNIMI/LAW Social Network Datasets - Social networks like amazon, LiveJournal, dblp and more `_
-* `Universities Worldwide - Links to 9307 Universities in 205 countries `_
-* `UPJOHN for Employment Research - Labor surveys, unemployment spells and more `_
-* `Yahoo Graph and Social Data - Web page graph, user-group membership, IM friends etc. `_
-* `Youtube Video Graph (2007,2008) - Video relations, uploaders, views, ratings and more `_
+* `Foursquare Social Network in 2010, 2011 `_
+* `Foursquare from UMN/Sarwat (2013) `_
+* `General Social Survey (GSS) since 1972 `_
+* `GetGlue - users rating TV shows `_
+* `GitHub Collaboration Archive `_
+* `Mobile Social Networks from UMASS `_
+* `PewResearch Internet Survey Project `_
+* `SourceForge.net Research Data `_
+* `StackExchange Data Explorer `_
+* `Titanic Survival Data Set `_
+* `Twitter Graph of entire Twitter site `_
+* `UCB's Archive of Social Science Data (D-Lab) `_
+* `UCLA Social Sciences Data Archive `_
+* `UNIMI/LAW Social Network Datasets `_
+* `Universities Worldwide `_
+* `UPJOHN for Labor Employment Research `_
+* `Yahoo! Graph and Social Data `_
+* `Youtube Video Social Graph in 2007,2008 `_


 Sports
 ------

-* `Betfair Event Results - Fully time-stamped historical Betfair exchange data `_
-* `Cricsheet (baseball) - Thousands of Cricket matches `_
-* `Ergast Formula 1, from 1950 up to date (API available) `_
+* `Betfair Historical Exchange Data `_
+* `Cricsheet Matches (baseball) `_
+* `Ergast Formula 1, from 1950 up to date (API) `_
 * `Football/Soccer resouces (data and APIs) `_
-* `Lahman's Baseball Database - Batting and pitching statistics, team stats etc. `_
-* `Retrosheet (baseball) - Play-by-Play files, game logs and schedules `_
+* `Lahman's Baseball Database `_
+* `Retrosheet Baseball Statistics `_


 Time Series
 -----------

-* `Time Series data Library (TSDL), created by Rob Hyndman, MU `_
-* `UC Riverside Time Series, for classification and clustering. `_
+* `Time Series Data Library (TSDL) from MU `_
+* `UC Riverside Time Series Dataset `_


 Transportation
 --------------

-* `Airlines OD Data 1987-2008, used by ASA Challenge 2009 `_
-* `Bike Share Data Systems - Trip histories, site maps etc. `_
-* `Edge data for US domestic flights 1990 to 2009 `_
-* `Half a million Hubway rides in MA `_
-* `Marine Traffic - Ship tracks, port calls and more `_
-* `NYC Taxi Trip Data 2013 - FOIA/FOILed by Chris Whong `_
-* `OpenFlights - Airport, airline and route data `_
-* `RITA Airline On-Time Performance data of major air carriers in US `_
+* `Airlines OD Data 1987-2008 `_
+* `Bike Share Systems (BSS) collection `_
+* `Hubway Million Rides in MA `_
+* `Marine Traffic - ship tracks, port calls and more `_
+* `NYC Taxi Trip Data 2013 (FOIA/FOILed) `_
+* `OpenFlights - airport, airline and route data `_
+* `RITA Airline On-Time Performance data `_
 * `RITA/BTS transport data collection (TranStat) `_
-* `Transport for London (TFL) - Trip histories and networking statistics `_
-* `Travel Tracker Survey (TTS), Chicago, 1990, 2007-2008 `_
+* `Transport for London (TFL) `_
+* `Travel Tracker Survey (TTS) for Chicago `_
 * `U.S. Bureau of Transportation Statistics (BTS) `_
-* `U.S. Freight Analysis Framework - Freight movement among states since 2007 `_
+* `U.S. Domestic Flights 1990 to 2009 `_
+* `U.S. Freight Analysis Framework since 2007 `_


 Complementary Collections
@@ -369,4 +364,4 @@ Complementary Collections
 * Inside-r: `Finding Data on the Internet `_
 * Quora: `Where can I find large datasets open to the public? `_
 * RS.io: `100+ Interesting Data Sets for Statistics `_
-* StaTrek: `Leveraging open data to understand urban lives `_
+* StaTrek: `Leveraging open data to understand urban lives `_
\ No newline at end of file