{
"metadata": {
"name": "",
"signature": "sha256:65b853762ab4a84c820902849d99ff6205fb1cc37e8c4b9b84d15cfd3ce9ecab"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kaggle Machine Learning Competition: Predicting Survivors on the Titanic\n",
"\n",
"* Description\n",
"* Evaluation\n",
"* Data Set\n",
"* Read the Data\n",
"* Explore the Data\n",
"* Feature: Passenger Classes\n",
"* Feature: Sex (Gender)\n",
"* Feature: Embarked\n",
"* Feature: Age\n",
"* Feature: Family Size\n",
"* Random Forest: Training\n",
"* Random Forest: Predicting\n",
"* Support Vector Machine: Training\n",
"* Support Vector Machine: Predicting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n",
"\n",
"One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.\n",
"\n",
"In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The historical data has been split into two groups, a 'training set' and a 'test set'. For the training set, we provide the outcome (the 'ground truth') for each passenger. You will use this set to build a model that generates predictions for the test set.\n",
"\n",
"For each passenger in the test set, you must predict whether or not they survived the sinking (0 for deceased, 1 for survived). Your score is the percentage of passengers you correctly predict.\n",
"\n",
"The Kaggle leaderboard has a public and private component. 50% of your predictions for the test set have been randomly assigned to the public leaderboard (the same 50% for all users). Your score on this public portion is what will appear on the leaderboard. At the end of the contest, we will reveal your score on the private 50% of the data, which will determine the final winner. This method prevents users from 'overfitting' to the leaderboard."
]
},
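{
"cell_type": "markdown",
"metadata": {},
"source": [
"The score described above is plain classification accuracy: the fraction of predictions that match the true outcome. The cell below is a minimal sketch of that calculation; the labels in `y_true` and `y_pred` are made up for illustration and are not taken from the competition data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"# Hypothetical labels for illustration only (0 = deceased, 1 = survived).\n",
"y_true = np.array([0, 1, 1, 0, 1])\n",
"y_pred = np.array([0, 1, 0, 0, 1])\n",
"\n",
"# Accuracy is the share of positions where prediction and truth agree.\n",
"accuracy = (y_true == y_pred).mean()\n",
"print('Accuracy: {:.0%}'.format(accuracy))"
],
"language": "python",
"metadata": {},
"outputs": []
},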
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| File Name | Available Formats |\n",
"|------------------|-------------------|\n",
"| train | .csv (59.76 kb) |\n",
"| gendermodel | .csv (3.18 kb) |\n",
"| genderclassmodel | .csv (3.18 kb) |\n",
"| test | .csv (27.96 kb) |\n",
"| gendermodel | .py (3.58 kb) |\n",
"| genderclassmodel | .py (5.63 kb) |\n",
"| myfirstforest | .py (3.99 kb) |"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"VARIABLE DESCRIPTIONS:\n",
"survival Survival\n",
" (0 = No; 1 = Yes)\n",
"pclass Passenger Class\n",
" (1 = 1st; 2 = 2nd; 3 = 3rd)\n",
"name Name\n",
"sex Sex\n",
"age Age\n",
"sibsp Number of Siblings/Spouses Aboard\n",
"parch Number of Parents/Children Aboard\n",
"ticket Ticket Number\n",
"fare Passenger Fare\n",
"cabin Cabin\n",
"embarked Port of Embarkation\n",
" (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
"\n",
"SPECIAL NOTES:\n",
"Pclass is a proxy for socio-economic status (SES)\n",
" 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower\n",
"\n",
"Age is in Years; Fractional if Age less than One (1)\n",
" If the Age is Estimated, it is in the form xx.5\n",
"\n",
"With respect to the family relation variables (i.e. sibsp and parch)\n",
"some relations were ignored. The following are the definitions used\n",
"for sibsp and parch.\n",
"\n",
"Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic\n",
"Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)\n",
"Parent: Mother or Father of Passenger Aboard Titanic\n",
"Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic\n",
"\n",
"Other family relatives excluded from this study include cousins,\n",
"nephews/nieces, aunts/uncles, and in-laws. Some children travelled\n",
"only with a nanny, therefore parch=0 for them. As well, some\n",
"travelled with very close friends or neighbors in a village, however,\n",
"the definitions do not support such relations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read the Data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"df = pd.read_csv('../data/titanic/train.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the Data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Braund, Mr. Owen Harris</td>\n",
" <td> male</td>\n",
" <td> 22</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> A/5 21171</td>\n",
" <td> 7.2500</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td> female</td>\n",
" <td> 38</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> PC 17599</td>\n",
" <td> 71.2833</td>\n",
" <td> C85</td>\n",
" <td> C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 3</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> Heikkinen, Miss. Laina</td>\n",
" <td> female</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> STON/O2. 3101282</td>\n",
" <td> 7.9250</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n",
"2 Heikkinen, Miss. Laina female 26 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S "
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.tail(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>888</th>\n",
" <td> 889</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td> female</td>\n",
" <td>NaN</td>\n",
" <td> 1</td>\n",
" <td> 2</td>\n",
" <td> W./C. 6607</td>\n",
" <td> 23.45</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td> 890</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Behr, Mr. Karl Howell</td>\n",
" <td> male</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 111369</td>\n",
" <td> 30.00</td>\n",
" <td> C148</td>\n",
" <td> C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td> 891</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Dooley, Mr. Patrick</td>\n",
" <td> male</td>\n",
" <td> 32</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 370376</td>\n",
" <td> 7.75</td>\n",
" <td> NaN</td>\n",
" <td> Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
" PassengerId Survived Pclass Name \\\n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"888 female NaN 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26 0 0 111369 30.00 C148 C \n",
"890 male 32 0 0 370376 7.75 NaN Q "
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View the data types of each column:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.dtypes"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"PassengerId int64\n",
"Survived int64\n",
"Pclass int64\n",
"Name object\n",
"Sex object\n",
"Age float64\n",
"SibSp int64\n",
"Parch int64\n",
"Ticket object\n",
"Fare float64\n",
"Cabin object\n",
"Embarked object\n",
"dtype: object"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get some basic information on the DataFrame:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.info()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 891 entries, 0 to 890\n",
"Data columns (total 12 columns):\n",
"PassengerId 891 non-null int64\n",
"Survived 891 non-null int64\n",
"Pclass 891 non-null int64\n",
"Name 891 non-null object\n",
"Sex 891 non-null object\n",
"Age 714 non-null float64\n",
"SibSp 891 non-null int64\n",
"Parch 891 non-null int64\n",
"Ticket 891 non-null object\n",
"Fare 891 non-null float64\n",
"Cabin 204 non-null object\n",
"Embarked 889 non-null object\n",
"dtypes: float64(2), int64(5), object(5)"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that Age, Cabin, and Embarked have missing values: only 714, 204, and 889 of the 891 rows are non-null, respectively."
]
},
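{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to count the gaps per column is `isnull().sum()`. The sketch below runs it on a small stand-in frame (called `sample` here, a hypothetical name) built from the Age, Cabin, and Embarked values of the three rows shown by `df.head(3)`; calling the same method chain on `df` would give the counts for the full training set."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Stand-in for df: Age, Cabin, and Embarked from the first three rows above.\n",
"sample = pd.DataFrame({\n",
"    'Age': [22.0, 38.0, 26.0],\n",
"    'Cabin': [np.nan, 'C85', np.nan],\n",
"    'Embarked': ['S', 'C', 'S'],\n",
"})\n",
"\n",
"# isnull() flags missing entries; sum() counts them per column.\n",
"missing = sample.isnull().sum()\n",
"print(missing)"
],
"language": "python",
"metadata": {},
"outputs": []
},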
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate various descriptive statistics on the DataFrame:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.describe()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td> 891.000000</td>\n",
" <td> 891.000000</td>\n",
" <td> 891.000000</td>\n",
" <td> 714.000000</td>\n",
" <td> 891.000000</td>\n",
" <td> 891.000000</td>\n",
" <td> 891.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td> 446.000000</td>\n",
" <td> 0.383838</td>\n",
" <td> 2.308642</td>\n",
" <td> 29.699118</td>\n",
" <td> 0.523008</td>\n",
" <td> 0.381594</td>\n",
" <td> 32.204208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td> 257.353842</td>\n",
" <td> 0.486592</td>\n",
" <td> 0.836071</td>\n",
" <td> 14.526497</td>\n",
" <td> 1.102743</td>\n",
" <td> 0.806057</td>\n",
" <td> 49.693429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td> 1.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 1.000000</td>\n",
" <td> 0.420000</td>\n",
" <td> 0.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td> 223.500000</td>\n",
" <td> 0.000000</td>\n",
" <td> 2.000000</td>\n",
" <td> 20.125000</td>\n",
" <td> 0.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 7.910400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td> 446.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 3.000000</td>\n",
" <td> 28.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 14.454200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td> 668.500000</td>\n",
" <td> 1.000000</td>\n",
" <td> 3.000000</td>\n",
" <td> 38.000000</td>\n",
" <td> 1.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 31.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td> 891.000000</td>\n",
" <td> 1.000000</td>\n",
" <td> 3.000000</td>\n",
" <td> 80.000000</td>\n",
" <td> 8.000000</td>\n",
" <td> 6.000000</td>\n",
" <td> 512.329200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
" PassengerId Survived Pclass Age SibSp \\\n",
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
"\n",
" Parch Fare \n",
"count 891.000000 891.000000 \n",
"mean 0.381594 32.204208 \n",
"std 0.806057 49.693429 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 7.910400 \n",
"50% 0.000000 14.454200 \n",
"75% 0.000000 31.000000 \n",
"max 6.000000 512.329200 "
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature: Passenger Classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the unique values of Pclass:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"passenger_classes = np.sort(df['Pclass'].unique())\n",
"passenger_classes"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"array([1, 2, 3])"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a histogram of Pclass:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['Pclass'].hist(bins=len(passenger_classes))\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x10a29b650>"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a cross tab of Pclass and Survived:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pclass_xt = pd.crosstab(df['Pclass'], df['Survived'])\n",
"pclass_xt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>Survived</th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 80</td>\n",
" <td> 136</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 97</td>\n",
" <td> 87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 372</td>\n",
" <td> 119</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"Survived 0 1\n",
"Pclass \n",
"1 80 136\n",
"2 97 87\n",
"3 372 119"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalize each row of the cross tab so that it sums to 1:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pclass_xt_pct = pclass_xt.div(pclass_xt.sum(1).astype(float), axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
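{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check on the row normalization, the same division can be reproduced from the raw counts. The cell below is a sketch that re-enters the six counts from the Pclass/Survived cross tab by hand; the names `counts` and `rates` are introduced here for illustration only. Each row is divided by its own total, so the `1` column becomes the survival rate for that class."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"# Counts copied from the Pclass x Survived cross tab shown earlier.\n",
"counts = pd.DataFrame({0: [80, 97, 372], 1: [136, 87, 119]}, index=[1, 2, 3])\n",
"counts.index.name = 'Pclass'\n",
"counts.columns.name = 'Survived'\n",
"\n",
"# Divide each row by its row total so that every row sums to 1.\n",
"rates = counts.div(counts.sum(axis=1).astype(float), axis=0)\n",
"print(rates.round(3))"
],
"language": "python",
"metadata": {},
"outputs": []
},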
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the cross tab:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pclass_xt_pct.plot(kind='bar', stacked=True)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"<matplotlib.axes._subplots.AxesSubplot at 0x10a0b0c50>"
]
},
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x10a267e90>"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
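"Passenger class is strongly associated with survival: about 63% of first-class passengers in the training set survived (136 of 216), compared with 47% of second-class passengers (87 of 184) and 24% of third-class passengers (119 of 491)."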
]
}
],
"metadata": {}
}
]
}