Added a preliminary Kaggle Titanic survivor analysis containing the competition description, evaluation, data set, and a snippet to read the data into pandas.
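
The read step in this commit loads `train.csv` into a pandas DataFrame. Below is a minimal sketch of that step plus a first sanity check. It uses an in-memory CSV so that it runs without the competition files. The sample rows are made up, the capitalized column names are an assumption about the `train.csv` header, and `load_titanic` is a hypothetical helper that is not part of the notebook.

```python
import io

import pandas as pd

# Made-up rows for illustration only. The header is an assumption about
# the layout of Kaggle's train.csv, not something stated in this commit.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A 0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B 0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,C 0003,7.92,,S
"""


def load_titanic(source):
    """Read a Titanic-style CSV (path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


df = load_titanic(io.StringIO(SAMPLE_CSV))

# Quick sanity checks: dimensions, missing values per column, survival rate
print(df.shape)
print(df.isnull().sum())
print(df["Survived"].mean())
```

In the notebook itself, the same `pd.read_csv` call takes the path `'../data/titanic/train.csv'` instead of the in-memory sample.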

Donne Martin 2015-03-14 19:53:56 -04:00
parent 1fbbd20c68
commit bcfae90101

kaggle/titanic.ipynb Normal file

@@ -0,0 +1,160 @@
{
"metadata": {
"name": "",
"signature": "sha256:da82018e898cd7c48f4841109f673f2618fe52d5d4553d5353acc59f8cfb0c07"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kaggle Machine Learning Competition: Predicting Survivors on the Titanic\n",
"\n",
"* Description\n",
"* Evaluation\n",
"* Data Set\n",
"* Read the Data\n",
"* Explore the Data\n",
"* Feature: Passenger Classes\n",
"* Feature: Sex (Gender)\n",
"* Feature: Embarked\n",
"* Feature: Age\n",
"* Feature: Family Size\n",
"* Random Forest: Training\n",
"* Random Forest: Predicting\n",
"* Support Vector Machine: Training\n",
"* Support Vector Machine: Predicting"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n",
"\n",
"One reason the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although luck played a part in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.\n",
"\n",
"In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The historical data has been split into two groups, a 'training set' and a 'test set'. For the training set, we provide the outcome ('ground truth') for each passenger. You will use this set to build your model to generate predictions for the test set.\n",
"\n",
"For each passenger in the test set, you must predict whether or not they survived the sinking (0 for deceased, 1 for survived). Your score is the percentage of passengers you correctly predict.\n",
"\n",
"The Kaggle leaderboard has a public and private component. 50% of your predictions for the test set have been randomly assigned to the public leaderboard (the same 50% for all users). Your score on this public portion is what will appear on the leaderboard. At the end of the contest, we will reveal your score on the private 50% of the data, which will determine the final winner. This method prevents users from 'overfitting' to the leaderboard."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| File Name | Available Formats |\n",
"|------------------|-------------------|\n",
"| train | .csv (59.76 kb) |\n",
"| gendermodel | .csv (3.18 kb) |\n",
"| genderclassmodel | .csv (3.18 kb) |\n",
"| test | .csv (27.96 kb) |\n",
"| gendermodel | .py (3.58 kb) |\n",
"| genderclassmodel | .py (5.63 kb) |\n",
"| myfirstforest | .py (3.99 kb) |"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"VARIABLE DESCRIPTIONS:\n",
"survival        Survival\n",
"                (0 = No; 1 = Yes)\n",
"pclass          Passenger Class\n",
"                (1 = 1st; 2 = 2nd; 3 = 3rd)\n",
"name            Name\n",
"sex             Sex\n",
"age             Age\n",
"sibsp           Number of Siblings/Spouses Aboard\n",
"parch           Number of Parents/Children Aboard\n",
"ticket          Ticket Number\n",
"fare            Passenger Fare\n",
"cabin           Cabin\n",
"embarked        Port of Embarkation\n",
"                (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
"\n",
"SPECIAL NOTES:\n",
"Pclass is a proxy for socio-economic status (SES)\n",
" 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower\n",
"\n",
"Age is in Years; Fractional if Age less than One (1)\n",
" If the Age is Estimated, it is in the form xx.5\n",
"\n",
"With respect to the family relation variables (i.e., sibsp and parch),\n",
"some relations were ignored. The following are the definitions used\n",
"for sibsp and parch.\n",
"\n",
"Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic\n",
"Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)\n",
"Parent: Mother or Father of Passenger Aboard Titanic\n",
"Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic\n",
"\n",
"Other family relatives excluded from this study include cousins,\n",
"nephews/nieces, aunts/uncles, and in-laws. Some children travelled\n",
"only with a nanny, so parch=0 for them. Others travelled with\n",
"very close friends or neighbors in a village; however,\n",
"the definitions do not cover such relations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read the Data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"df = pd.read_csv('../data/titanic/train.csv')  # Kaggle training set"
],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}