{
"metadata": {
"name": "",
"signature": "sha256:875d946727100c856e2ed2888b9ea936d7ae31f29a4b53bbf1b72946e02b7f70"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kaggle Machine Learning Competition: Predicting Survivors on the Titanic\n",
"\n",
"* Competition Site\n",
"* Description\n",
"* Evaluation\n",
"* Data Set\n",
"* Setup Imports and Variables\n",
"* Explore the Data\n",
"* Feature: Passenger Classes\n",
"* Feature: Sex (Gender)\n",
"* Feature: Embarked\n",
"* Feature: Age\n",
"* Feature: Family Size\n",
"* Random Forest: Training\n",
"* Random Forest: Predicting\n",
"* Support Vector Machine: Training\n",
"* Support Vector Machine: Predicting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Competition Site"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/c/titanic-gettingStarted"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n",
"\n",
"One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.\n",
"\n",
"In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The historical data has been split into two groups, a 'training set' and a 'test set'. For the training set, we provide the outcome ( 'ground truth' ) for each passenger. You will use this set to build your model to generate predictions for the test set.\n",
"\n",
"For each passenger in the test set, you must predict whether or not they survived the sinking ( 0 for deceased, 1 for survived ). Your score is the percentage of passengers you correctly predict.\n",
"\n",
" The Kaggle leaderboard has a public and private component. 50% of your predictions for the test set have been randomly assigned to the public leaderboard ( the same 50% for all users ). Your score on this public portion is what will appear on the leaderboard. At the end of the contest, we will reveal your score on the private 50% of the data, which will determine the final winner. This method prevents users from 'overfitting' to the leaderboard."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| File Name | Available Formats |\n",
"|------------------|-------------------|\n",
"| train | .csv (59.76 kb) |\n",
"| gendermodel | .csv (3.18 kb) |\n",
"| genderclassmodel | .csv (3.18 kb) |\n",
"| test | .csv (27.96 kb) |\n",
"| gendermodel | .py (3.58 kb) |\n",
"| genderclassmodel | .py (5.63 kb) |\n",
"| myfirstforest | .py (3.99 kb) |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<pre>\n",
"VARIABLE DESCRIPTIONS:\n",
"survival Survival\n",
" (0 = No; 1 = Yes)\n",
"pclass Passenger Class\n",
" (1 = 1st; 2 = 2nd; 3 = 3rd)\n",
"name Name\n",
"sex Sex\n",
"age Age\n",
"sibsp Number of Siblings/Spouses Aboard\n",
"parch Number of Parents/Children Aboard\n",
"ticket Ticket Number\n",
"fare Passenger Fare\n",
"cabin Cabin\n",
"embarked Port of Embarkation\n",
" (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
"\n",
"SPECIAL NOTES:\n",
"Pclass is a proxy for socio-economic status (SES)\n",
" 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower\n",
"\n",
"Age is in Years; Fractional if Age less than One (1)\n",
" If the Age is Estimated, it is in the form xx.5\n",
"\n",
"With respect to the family relation variables (i.e. sibsp and parch)\n",
"some relations were ignored. The following are the definitions used\n",
"for sibsp and parch.\n",
"\n",
"Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic\n",
"Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)\n",
"Parent: Mother or Father of Passenger Aboard Titanic\n",
"Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic\n",
"\n",
"Other family relatives excluded from this study include cousins,\n",
"nephews/nieces, aunts/uncles, and in-laws. Some children travelled\n",
"only with a nanny, therefore parch=0 for them. As well, some\n",
"travelled with very close friends or neighbors in a village, however,\n",
"the definitions do not support such relations.\n",
"</pre>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup Imports and Variables"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Set the global default size of our matplotlib figures\n",
"plt.rc('figure', figsize=(10, 5))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the data:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = pd.read_csv('../data/titanic/train.csv')\n",
"df.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Braund, Mr. Owen Harris</td>\n",
" <td> male</td>\n",
" <td> 22</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> A/5 21171</td>\n",
" <td> 7.2500</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td> female</td>\n",
" <td> 38</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> PC 17599</td>\n",
" <td> 71.2833</td>\n",
" <td> C85</td>\n",
" <td> C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 3</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> Heikkinen, Miss. Laina</td>\n",
" <td> female</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> STON/O2. 3101282</td>\n",
" <td> 7.9250</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n",
"2 Heikkinen, Miss. Laina female 26 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S "
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.tail(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>888</th>\n",
" <td> 889</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td> female</td>\n",
" <td>NaN</td>\n",
" <td> 1</td>\n",
" <td> 2</td>\n",
" <td> W./C. 6607</td>\n",
" <td> 23.45</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td> 890</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Behr, Mr. Karl Howell</td>\n",
" <td> male</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 111369</td>\n",
" <td> 30.00</td>\n",
" <td> C148</td>\n",
" <td> C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td> 891</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Dooley, Mr. Patrick</td>\n",
" <td> male</td>\n",
" <td> 32</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 370376</td>\n",
" <td> 7.75</td>\n",
" <td> NaN</td>\n",
" <td> Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
" PassengerId Survived Pclass Name \\\n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"888 female NaN 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26 0 0 111369 30.00 C148 C \n",
"890 male 32 0 0 370376 7.75 NaN Q "
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View the data types of each column:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.dtypes"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"PassengerId int64\n",
"Survived int64\n",
"Pclass int64\n",
"Name object\n",
"Sex object\n",
"Age float64\n",
"SibSp int64\n",
"Parch int64\n",
"Ticket object\n",
"Fare float64\n",
"Cabin object\n",
"Embarked object\n",
"dtype: object"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas reports string columns with dtype 'object'. Most machine learning algorithms expect numeric input, so to use these columns as features we'll need to convert them to numeric representations."
]
},
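{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (a minimal sketch, not part of the original analysis), we can list the columns pandas loaded as 'object'. This assumes the `df` DataFrame read above and a pandas version that provides `select_dtypes`; the name `object_cols` is introduced here only for illustration."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Illustrative only: list the string ('object') columns that would need\n",
"# a numeric encoding before being used as features\n",
"object_cols = df.select_dtypes(include=['object']).columns.tolist()\n",
"object_cols"
],
"language": "python",
"metadata": {},
"outputs": []
},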
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get some basic information on the DataFrame:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.info()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 891 entries, 0 to 890\n",
"Data columns (total 12 columns):\n",
"PassengerId 891 non-null int64\n",
"Survived 891 non-null int64\n",
"Pclass 891 non-null int64\n",
"Name 891 non-null object\n",
"Sex 891 non-null object\n",
"Age 714 non-null float64\n",
"SibSp 891 non-null int64\n",
"Parch 891 non-null int64\n",
"Ticket 891 non-null object\n",
"Fare 891 non-null float64\n",
"Cabin 204 non-null object\n",
"Embarked 889 non-null object\n",
"dtypes: float64(2), int64(5), object(5)"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Age, Cabin, and Embarked have missing values. Cabin is missing for most passengers (only 204 of 891 values are present), whereas we might be able to infer the missing values for Age and Embarked."
]
},
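{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how many values each column is missing, one option (a sketch assuming the `df` loaded above; `missing_counts` is a name introduced here for illustration) is to count nulls per column. The counts should be consistent with the non-null totals reported by `df.info()`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Illustrative only: count missing values per column,\n",
"# then keep only the columns that actually have gaps\n",
"missing_counts = df.isnull().sum()\n",
"missing_counts[missing_counts > 0]"
],
"language": "python",
"metadata": {},
"outputs": []
},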
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate various descriptive statistics on the DataFrame:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.describe()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td> 891.000000</td>\n",
" <td> 891.000000</td>\n",
" <td> 891.000000</td>\n",
" <td> 714.000000</td>\n",
" <td> 891.000000</td>\n",
" <td> 891.000000</td>\n",
" <td> 891.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td> 446.000000</td>\n",
" <td> 0.383838</td>\n",
" <td> 2.308642</td>\n",
" <td> 29.699118</td>\n",
" <td> 0.523008</td>\n",
" <td> 0.381594</td>\n",
" <td> 32.204208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td> 257.353842</td>\n",
" <td> 0.486592</td>\n",
" <td> 0.836071</td>\n",
" <td> 14.526497</td>\n",
" <td> 1.102743</td>\n",
" <td> 0.806057</td>\n",
" <td> 49.693429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td> 1.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 1.000000</td>\n",
" <td> 0.420000</td>\n",
" <td> 0.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td> 223.500000</td>\n",
" <td> 0.000000</td>\n",
" <td> 2.000000</td>\n",
" <td> 20.125000</td>\n",
" <td> 0.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 7.910400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td> 446.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 3.000000</td>\n",
" <td> 28.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 14.454200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td> 668.500000</td>\n",
" <td> 1.000000</td>\n",
" <td> 3.000000</td>\n",
" <td> 38.000000</td>\n",
" <td> 1.000000</td>\n",
" <td> 0.000000</td>\n",
" <td> 31.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td> 891.000000</td>\n",
" <td> 1.000000</td>\n",
" <td> 3.000000</td>\n",
" <td> 80.000000</td>\n",
" <td> 8.000000</td>\n",
" <td> 6.000000</td>\n",
" <td> 512.329200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
" PassengerId Survived Pclass Age SibSp \\\n",
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
"\n",
" Parch Fare \n",
"count 891.000000 891.000000 \n",
"mean 0.381594 32.204208 \n",
"std 0.806057 49.693429 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 7.910400 \n",
"50% 0.000000 14.454200 \n",
"75% 0.000000 31.000000 \n",
"max 6.000000 512.329200 "
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a general idea of the data set contents, we can dive deeper into each column. We'll be doing exploratory data analysis and cleaning data to set up the 'features' we'll be using in our machine learning algorithms.\n",
"\n",
"Plot a few features to get a better idea of each:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Set up a grid of plots\n",
"fig = plt.figure(figsize=(10, 10))\n",
"\n",
"# Plot death and survival counts\n",
"ax0 = plt.subplot2grid((3, 2), (0, 0))\n",
"df['Survived'].value_counts().plot(kind='bar', title='Death and Survival Counts')\n",
"plt.xticks((0, 1), ('Died', 'Survived'), rotation=0)\n",
"\n",
"# Plot Pclass counts\n",
"ax1 = plt.subplot2grid((3, 2), (0, 1))\n",
"df['Pclass'].value_counts().plot(kind='bar', title='Passenger Class Counts')\n",
"\n",
"# Plot Sex counts; value_counts() orders by frequency (male first here),\n",
"# so sort by index to get (female, male) and match the tick labels below\n",
"ax2 = plt.subplot2grid((3, 2), (1, 0))\n",
"df['Sex'].value_counts().sort_index().plot(kind='bar', title='Gender Counts')\n",
"plt.xticks((0, 1), ('Female', 'Male'), rotation=0)\n",
"\n",
"# Plot Embarked counts\n",
"ax3 = plt.subplot2grid((3, 2), (1, 1))\n",
"df['Embarked'].value_counts().plot(kind='bar', title='Ports of Embarkation Counts')\n",
"\n",
"# Plot the Age histogram\n",
"ax4 = plt.subplot2grid((3, 2), (2, 0))\n",
"df['Age'].hist()\n",
"plt.title('Age Histogram')\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Count')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"<matplotlib.text.Text at 0x10a5d70d0>"
]
},
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x1084d9c50>"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll explore various features to view their impact on survival rates."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature: Passenger Classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From our exploratory data analysis in the previous section, we see there are three passenger classes: First, Second, and Third class. We'll determine what proportion of passengers survived in each passenger class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a cross tab of Pclass and Survived:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pclass_xt = pd.crosstab(df['Pclass'], df['Survived'])\n",
"pclass_xt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>Survived</th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 80</td>\n",
" <td> 136</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 97</td>\n",
" <td> 87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 372</td>\n",
" <td> 119</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"Survived 0 1\n",
"Pclass \n",
"1 80 136\n",
"2 97 87\n",
"3 372 119"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the cross tab:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Normalize each row of the cross tab so that it sums to 1:\n",
"pclass_xt_pct = pclass_xt.div(pclass_xt.sum(1).astype(float), axis=0)\n",
"\n",
"pclass_xt_pct.plot(kind='bar', stacked=True, title='Survival Rate by Passenger Classes')\n",
"plt.xlabel('Passenger Class')\n",
"plt.ylabel('Survival Rate')\n",
"plt.xticks((0, 1, 2), ('First', 'Second', 'Third'), rotation=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 52,
"text": [
"([<matplotlib.axis.XTick at 0x10e60f610>,\n",
" <matplotlib.axis.XTick at 0x10e621c90>,\n",
" <matplotlib.axis.XTick at 0x10e6c7f50>],\n",
" <a list of 3 Text xticklabel objects>)"
]
},
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x10e606e10>"
]
}
],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that passenger class seems to have a significant impact on whether a passenger survived. Those in First Class had the highest chance of survival."
]
},
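{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same comparison can be read off numerically. A minimal sketch, assuming the `df` loaded above: because Survived is coded 0/1, its mean within each class is that class's survival rate."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Illustrative only: mean of the 0/1 Survived column per passenger class\n",
"df.groupby('Pclass')['Survived'].mean()"
],
"language": "python",
"metadata": {},
"outputs": []
},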
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature: Sex (Gender)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Gender might have also played a role in determining a passenger's survival rate. We'll need to map Sex from a string to a number to prepare it for machine learning algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a mapping of Sex from a string to a number representation:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Use np.sort explicitly; a bare sort() only exists in a pylab namespace\n",
"sexes = np.sort(df['Sex'].unique())\n",
"genders_mapping = dict(zip(sexes, range(len(sexes))))\n",
"genders_mapping"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 45,
"text": [
"{'female': 0, 'male': 1}"
]
}
],
"prompt_number": 45
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transform Sex from a string to a number representation:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['Sex_Val'] = df['Sex'].map(genders_mapping).astype(int)\n",
"df.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_Val</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Braund, Mr. Owen Harris</td>\n",
" <td> male</td>\n",
" <td> 22</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> A/5 21171</td>\n",
" <td> 7.2500</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td> female</td>\n",
" <td> 38</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> PC 17599</td>\n",
" <td> 71.2833</td>\n",
" <td> C85</td>\n",
" <td> C</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 3</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> Heikkinen, Miss. Laina</td>\n",
" <td> female</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> STON/O2. 3101282</td>\n",
" <td> 7.9250</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n",
"2 Heikkinen, Miss. Laina female 26 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Sex_Val \n",
"0 0 A/5 21171 7.2500 NaN S 1 \n",
"1 0 PC 17599 71.2833 C85 C 0 \n",
"2 0 STON/O2. 3101282 7.9250 NaN S 0 "
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a normalized cross tab for Sex_Val and Survived:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sex_val_xt = pd.crosstab(df['Sex_Val'], df['Survived'])\n",
"sex_val_xt_pct = sex_val_xt.div(sex_val_xt.sum(1).astype(float), axis=0)\n",
"sex_val_xt_pct.plot(kind='bar', stacked=True, title='Survival Rate by Gender')\n",
"plt.xlabel('Gender')\n",
"plt.ylabel('Survival Rate')\n",
"plt.xticks((0, 1), ('Female', 'Male'), rotation=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 56,
"text": [
"([<matplotlib.axis.XTick at 0x10e86b450>,\n",
" <matplotlib.axis.XTick at 0x10e86b0d0>],\n",
" <a list of 2 Text xticklabel objects>)"
]
},
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x10dcca110>"
]
}
],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The majority of females survived, whereas the majority of males did not."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll determine whether we can gain any insights on survival rate by looking at both Sex and Pclass."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count males and females in each Pclass:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Get the unique values of Pclass:\n",
"passenger_classes = sort(df['Pclass'].unique())\n",
"\n",
"for p_class in passenger_classes:\n",
"    print 'M: ', p_class, len(df[(df['Sex'] == 'male') & (df['Pclass'] == p_class)])\n",
"    print 'F: ', p_class, len(df[(df['Sex'] == 'female') & (df['Pclass'] == p_class)])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"M: 1 122\n",
"F: 1 94\n",
"M: 2 108\n",
"F: 2 76\n",
"M: 3 347\n",
"F: 3 144\n"
]
}
],
"prompt_number": 15
},
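{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above filters the DataFrame once per Sex and Pclass combination. As a hedged alternative sketch, the same kind of counts can be produced in one step with `groupby(...).size()`. The `toy_df` frame below is a hypothetical stand-in with made-up rows (an assumption for illustration), not the Titanic data loaded in this notebook:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy rows standing in for the notebook's df (not Titanic data)\n",
"toy_df = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female', 'male'],\n",
"                       'Pclass': [1, 1, 3, 3, 3]})\n",
"\n",
"# One count per (Pclass, Sex) pair present in the data\n",
"counts = toy_df.groupby(['Pclass', 'Sex']).size()\n",
"print(counts)"
],
"language": "python",
"metadata": {},
"outputs": []
},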
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot survival rate by Sex and Pclass:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Survival rate by passenger class, females only\n",
"females_df = df[df['Sex'] == 'female']\n",
"females_xt = pd.crosstab(females_df['Pclass'], females_df['Survived'])\n",
"females_xt_pct = females_xt.div(females_xt.sum(1).astype(float), axis=0)\n",
"females_xt_pct.plot(kind='bar', stacked=True, title='Female Survival Rate by Passenger Class')\n",
"plt.xlabel('Passenger Class')\n",
"plt.ylabel('Survival Rate')\n",
"plt.xticks((0, 1, 2), ('First', 'Second', 'Third'), rotation=0)\n",
"\n",
"# Survival rate by passenger class, males only\n",
"males_df = df[df['Sex'] == 'male']\n",
"males_xt = pd.crosstab(males_df['Pclass'], males_df['Survived'])\n",
"males_xt_pct = males_xt.div(males_xt.sum(1).astype(float), axis=0)\n",
"males_xt_pct.plot(kind='bar', stacked=True, title='Male Survival Rate by Passenger Class')\n",
"plt.xlabel('Passenger Class')\n",
"plt.ylabel('Survival Rate')\n",
"plt.xticks((0, 1, 2), ('First', 'Second', 'Third'), rotation=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 88,
"text": [
"([<matplotlib.axis.XTick at 0x110f1fb90>,\n",
" <matplotlib.axis.XTick at 0x110f1fa50>,\n",
" <matplotlib.axis.XTick at 0x112bba0d0>],\n",
" <a list of 3 Text xticklabel objects>)"
]
},
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x1102a1890>"
]
},
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x1103af550>"
]
}
],
"prompt_number": 88
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The vast majority of females in First and Second class survived. Among males, those in First class had the highest chance of survival."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature: Embarked"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Embarked column might be an important feature, but it is missing two values:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df[df['Embarked'].isnull()]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_Val</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>61 </th>\n",
" <td> 62</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Icard, Miss. Amelie</td>\n",
" <td> female</td>\n",
" <td> 38</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 113572</td>\n",
" <td> 80</td>\n",
" <td> B28</td>\n",
" <td> NaN</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>829</th>\n",
" <td> 830</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Stone, Mrs. George Nelson (Martha Evelyn)</td>\n",
" <td> female</td>\n",
" <td> 62</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 113572</td>\n",
" <td> 80</td>\n",
" <td> B28</td>\n",
" <td> NaN</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": [
" PassengerId Survived Pclass Name \\\n",
"61 62 1 1 Icard, Miss. Amelie \n",
"829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked Sex_Val \n",
"61 female 38 0 0 113572 80 B28 NaN 0 \n",
"829 female 62 0 0 113572 80 B28 NaN 0 "
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Prepare to map Embarked from a string to a number representation:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Get the unique values of Embarked (includes NaN for the missing entries)\n",
"embarked_locations = sort(df['Embarked'].unique())\n",
"\n",
"# Map each unique value to its position in the sorted array\n",
"embarked_locations_mapping = dict(zip(embarked_locations,\n",
"                                      range(len(embarked_locations))))\n",
"embarked_locations_mapping"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"{nan: 0, 'C': 1, 'Q': 2, 'S': 3}"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transform Embarked from a string to a number representation to prepare it for machine learning algorithms:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['Embarked_Val'] = df['Embarked'].map(embarked_locations_mapping).astype(int)\n",
"df.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_Val</th>\n",
" <th>Embarked_Val</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Braund, Mr. Owen Harris</td>\n",
" <td> male</td>\n",
" <td> 22</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> A/5 21171</td>\n",
" <td> 7.2500</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td> female</td>\n",
" <td> 38</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> PC 17599</td>\n",
" <td> 71.2833</td>\n",
" <td> C85</td>\n",
" <td> C</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 3</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> Heikkinen, Miss. Laina</td>\n",
" <td> female</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> STON/O2. 3101282</td>\n",
" <td> 7.9250</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 4</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td> female</td>\n",
" <td> 35</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 113803</td>\n",
" <td> 53.1000</td>\n",
" <td> C123</td>\n",
" <td> S</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 5</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Allen, Mr. William Henry</td>\n",
" <td> male</td>\n",
" <td> 35</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 373450</td>\n",
" <td> 8.0500</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"   PassengerId  Survived  Pclass  \\\n",
"0            1         0       3   \n",
"1            2         1       1   \n",
"2            3         1       3   \n",
"3            4         1       1   \n",
"4            5         0       3   \n",
"\n",
"                                                Name     Sex  Age  SibSp  \\\n",
"0                            Braund, Mr. Owen Harris    male   22      1   \n",
"1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   \n",
"2                             Heikkinen, Miss. Laina  female   26      0   \n",
"3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   \n",
"4                           Allen, Mr. William Henry    male   35      0   \n",
"\n",
"   Parch            Ticket     Fare  Cabin  Embarked  Sex_Val  Embarked_Val  \n",
"0      0         A/5 21171   7.2500    NaN         S        1             3  \n",
"1      0          PC 17599  71.2833    C85         C        0             1  \n",
"2      0  STON/O2. 3101282   7.9250    NaN         S        0             3  \n",
"3      0            113803  53.1000   C123         S        0             3  \n",
"4      0            373450   8.0500    NaN         S        1             3  "
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of passengers by Embarked:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for embarked in range(0, len(embarked_locations)):\n",
"    print embarked, len(df[df['Embarked_Val'] == embarked])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0 2\n",
"1 168\n",
"2 77\n",
"3 644\n"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the histogram for Embarked_Val:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['Embarked_Val'].hist(bins=len(embarked_locations), range=(0, 3))\n",
"plt.title('Port of Embarkation Histogram')\n",
"plt.xlabel('Port of Embarkation')\n",
"plt.ylabel('Count')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x10a19f490>"
]
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the vast majority of passengers embarked at 'S' (Embarked_Val 3), we assign the two missing Embarked values the code for 'S':"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Assign the code for 'S' to the rows where Embarked is missing\n",
"df.loc[df['Embarked'].isnull(), 'Embarked_Val'] = embarked_locations_mapping['S']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 21
},
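{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fill above hard-codes 'S' after reading the histogram. As a hedged sketch of a variation, the most frequent value can instead be looked up from the column itself and used to fill the gaps. The `toy_df` frame is a hypothetical stand-in with made-up rows (an assumption for illustration), and it fills the string column directly instead of the numeric Embarked_Val column used in this notebook:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy column with one missing port (not Titanic data)\n",
"toy_df = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']})\n",
"\n",
"# value_counts() ignores NaN, so idxmax() returns the most frequent port\n",
"most_common = toy_df['Embarked'].value_counts().idxmax()\n",
"toy_df['Embarked'] = toy_df['Embarked'].fillna(most_common)\n",
"print(toy_df['Embarked'].tolist())"
],
"language": "python",
"metadata": {},
"outputs": []
},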
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Verify we do not have any more NaNs for Embarked_Val:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sort(df['Embarked_Val'].unique())"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 22,
"text": [
"array([1, 2, 3])"
]
}
],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a normalized cross tab for Embarked_Val and Survived:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"embarked_val_xt = pd.crosstab(df['Embarked_Val'], df['Survived'])\n",
"embarked_val_xt_pct = embarked_val_xt.div(embarked_val_xt.sum(1).astype(float), axis=0)\n",
"embarked_val_xt_pct.plot(kind='bar', stacked=True)\n",
"plt.title('Survival Rate by Port of Embarkation')\n",
"plt.xlabel('Port of Embarkation')\n",
"plt.ylabel('Survival Rate')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 23,
"text": [
"<matplotlib.text.Text at 0x10a54f690>"
]
},
{
"metadata": {},
"output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x10aae21d0>"
]
}
],
"prompt_number": 23
},
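{
"cell_type": "markdown",
"metadata": {},
"source": [
"The normalized cross tabs plotted in this notebook divide each crosstab row by its row total, so every row becomes a pair of proportions that sums to 1. A minimal sketch of that step on a hypothetical `toy_df` with made-up rows (an assumption for illustration, not the Titanic data):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy rows (not Titanic data)\n",
"toy_df = pd.DataFrame({'Embarked_Val': [1, 1, 2, 3, 3, 3],\n",
"                       'Survived': [1, 0, 0, 1, 0, 0]})\n",
"\n",
"# Raw counts: one row per port, one column per Survived value\n",
"toy_xt = pd.crosstab(toy_df['Embarked_Val'], toy_df['Survived'])\n",
"\n",
"# Divide each row by its own total so the row becomes proportions\n",
"toy_xt_pct = toy_xt.div(toy_xt.sum(1).astype(float), axis=0)\n",
"print(toy_xt_pct)"
],
"language": "python",
"metadata": {},
"outputs": []
},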
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It appears that passengers who embarked at 'C' (Embarked_Val 1) had the highest rate of survival."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature: Age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Age column seems like an important feature but is missing many values. \n",
"\n",
"Get the first 10 rows of the Age column:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['Age'][:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
"0    22\n",
"1    38\n",
"2    26\n",
"3    35\n",
"4    35\n",
"5   NaN\n",
"6    54\n",
"7     2\n",
"8    27\n",
"9    14\n",
"Name: Age, dtype: float64"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the Age histogram:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['Age'].hist()\n",
"plt.title('Age Histogram')\n",
"plt.xlabel('Age')\n",
"plt.ylabel('Count')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAAAmYAAAFRCAYAAADeu2ECAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu0ZHdV4PHvTpqgAeQSwCQkgRvFAMFAEzC8NLkgMIgY\noiiv4dGAOGshQhhFEmYpMDOCQUUEwZkRkmlRouGhEuWRBDiIg4A8Wh4hBJCLaSAdSAgGFBLsPX/U\nueni9qnue2+q6vc7p76ftWqlfufUY/92qs7dfX67qiIzkSRJUnmHlA5AkiRJIxZmkiRJlbAwkyRJ\nqoSFmSRJUiUszCRJkiphYSZJklQJCzNJCyEi3hYRTyodhyQdiIWZpJmKiCYiromIw2b8HE9ft20l\nIq5YG2fmIzLz9Rt4rL0R8UOziFOSDsbCTNLMRMQycApwFXD6DJ8q28u0xBQfa9+DRhw6i8eVNBwW\nZpJm6cnAJcDrgaeM74iI20bEhRHxjYj4UET8z4h439j+u0bExRFxdURcFhG/cFMCGT+rFhF3joj3\nRsS1EfHViDi/3f537c3/KSKuW3vOiHhGRHy2jeWvI+Loscd9WER8pn2sV7ePu/Y8OyLi/0XEyyPi\na8ALI+KHIuLdEfG19rn/NCJuPfZ4qxHxaxHx8TaG10XEkRHx9jZXF0fE0k3JhaR6WZhJmqUnA38B\nXAD8p4j4wbF9rwauA45kVLQ9mfasV0TcArgY+FPg9sDjgNdExN0O8FwHO8s1flbtfwDvyMwl4Bjg\nVQCZeWq7/x6ZeavMfGNEPBh4CfALwNHAF4E/b+O8HfBG4PnAEcBngPvzvWfvTgE+D/xg+zgB/Fb7\nWHcDjgNetC7OnwN+ErgL8Ejg7cBZ7WMcAjz7IHOV1FMWZpJmIiJ+nFHR89bM/CxwKfCEdt+hjIqP\nF2bmtzPz08BO9hVXjwS+kJk7M3NvZu4C3sKoOOp8OuCVEfH1tQtwIZOXN68HliPimMy8PjPff4Cp\n/GfgdZm5KzOvB84G7h8RdwIeAXwyM/+qjfOVwJXr7v/lzHx1u//bmfn5zHxXZt6QmV8Dfh84bd19\nXpWZX83MLwPvA/4hM/8pM78D/CVwrwPEK6nHLMwkzcpTgIsy87p2/Eb2LWfeHtgGXDF2+91j1+8E\n3HddofUERmfXuiTwK5l5m7ULo+Ju0lm0X2/3fSgiPhkRTz3APNbOko2eKPNbwNWMis6j18W9fh7w\nvXOkXZb884jYHRHfYLTMe9t199kzdv3f142/DdzyAPFK6rFtpQOQNDwR8f3AY4BDIuIr7eabA0sR\ncRKjs2ffZbSM99l2/3FjD/EvwHsz82E3JYxJOzJzD/BLbawPBC6JiPdm5j933PzLwPKNDzpaZr0t\nowLsK8CxY/tifLz2dOvGLwH+A/jRzLw2Is6gXUrdylwkDYtnzCTNwhmMCq+7AfdsL3djtCz3lMz8\nD0ZLky+KiO+PiLsCT2JfEfO3wAkR8cSIuFl7+bH2dpNsuHiJiF+IiLUC6tr2efe24z3AD4/d/Hzg\nqRFxz4i4OaPC6gOZ+S/A24CTIuJREbEN+GXgqIM8/S2BbwH/GhHHAM/baNyShs/CTNIsPBk4NzN3\nZ+ZV7WUP8IfAEyLiEOBZwK0Z9WTtZFQAXQ/QLn8+jFHT/5cYnZl6KXCg70Lr6ieb1GN2H+ADEXEd\n8NfAszNztd33ImBnu4T685n5LuA3gDczOnt2fBsXbY/YLwAvA77GqPj8MPCdsedfH8OLgZOBbzDq\ng3vzAeLsmse0vxpEUkUiczbv74g4F/hp4KrMPKnddgqjA/PNGP1r+pmZ+Y/tvrOBpzE6xf/szLxo\nJoFJqlJEnAP8YGYeqN+ram3BeQXwhMx8b+l4JPXPLM+YnQc8fN22lwG/kZn3An6zHRMRJwKPBU5s\n7/Oa9gAnaaAi4i4RcY8YOYXRP8z+snRcm9V+j9lSu8
z5gnbzB0rGJKm/Zlb8ZOb7gK+v2/wVRksX\nAEuMligAHgWc3358fBX4HKPv/pE0XLditIz3TUbfC/a7mfnWsiFtyf0ZHbO+ymiV4Iz2ay0kadPm\n/anMs4C/j4jfZVQU3r/dfge+91+Yuxl9FF3SQGXmh4EfKR3HTZWZL2bUNyZJN9m8lwtfx6h/7I7A\nc4FzD3Bbm1slSdJCmfcZs1My8yHt9TcBr22vf4nv/Q6jY9m3zHmjiLBYkyRJvZGZm/oewnmfMftc\nRKz99MiDgcvb628FHhcRh0XE8YyWNz7U9QCZ6WXd5YUvfGHxGGq8mBfzYk7Mi3kxLyUvWzGzM2YR\ncT6j33+7XURcwehTmL8EvLr99NK/t2My89KIuIB93wb+zNzqjBbQ6upq6RCqZF66mZf9mZNu5qWb\neelmXqZjZoVZZj5+wq77Trj9Sxh9o7YkSdJC8rcyB2DHjh2lQ9iU0c8JzsfOnTvn9lzjaj7h27fX\nyzyYk27mpZt56WZepmNm3/w/CxHhCucAjAqzIf9/jKoLM0nSfEQEWXnzv2agaZrSIVSqKR1AlXy9\n7M+cdDMv3cxLN/MyHRZmkiRJlXApU3PnUqYkaRG4lClJktRjFmYD4Lr+JE3pAKrk62V/5qSbeelm\nXrqZl+mwMJMkSaqEPWaaO3vMJEmLwB4zSZKkHrMwGwDX9SdpSgdQJV8v+zMn3cxLN/PSzbxMh4WZ\nJElSJewx09zZYyZJWgRb6THzR8ylGZjnD7XPm0WnJM2OS5kD4Lr+JE3B586KL++5CfcdJt9D3cxL\nN/PSzbxMh4WZJElSJewx09wtQo/ZcOdn/5wkbZTfYyZJktRjFmYD4Lr+JE3pACrVlA6gOr6HupmX\nbualm3mZDgszSZKkSthjprmzx6zP7DGTpI2yx0ySJKnHZlaYRcS5EbEnIj6xbvuvRMSnI+KTEXHO\n2PazI+KzEXFZRDxsVnENkev6kzSlA6hUUzqA6vge6mZeupmXbuZlOmb5zf/nAa8C/mRtQ0Q8CDgd\nuEdm3hARt2+3nwg8FjgROAa4JCJOyMy9M4xPkiSpKjPtMYuIZeDCzDypHV8A/K/MfPe6250N7M3M\nc9rxO4AXZeYH1t3OHrMBsMesz+wxk6SN6kOP2Y8Ap0bEByKiiYj7tNvvAOweu91uRmfOJEmSFsa8\nC7NtwG0y837A84ALDnBb/1m+Qa7rT9KUDqBSTekAquN7qJt56WZeupmX6Zhlj1mX3cBbADLzHyNi\nb0TcDvgScNzY7Y5tt+1nx44dLC8vA7C0tMT27dtZWVkB9r0oFm28ppZ4NhrvvgJhZUbjXTN+/Elj\nDrK/9PimxteOKnk9OZ7deNeuXVXF47jusa8Xbry+urrKVs27x+y/AHfIzBdGxAnAJZl5x7b5/w3A\nKbTN/8Cd1zeU2WM2DPaY9Zk9ZpK0UVvpMZvZGbOIOB84DbhtRFwB/CZwLnBu+xUa1wNPBsjMS9sP\nBlwKfBd4phWYJElaNIfM6oEz8/GZeYfMvHlmHpeZ52XmDZn5pMw8KTPvnZnN2O1fkpl3zsy7ZuY7\nZxXXEI2fQtW4pnQAlWpKB1Ad30PdzEs389LNvEzHzAozSZIkbY6/lam5s8esz+wxk6SN6sP3mEmS\nJGkCC7MBcF1/kqZ0AJVqSgdQHd9D3cxLN/PSzbxMh4WZJElSJewx09zZY9Zn9phJ0kbZYyZJktRj\nFmYD4Lr+JE3pACrVlA6gOr6HupmXbualm3mZDgszSZKkSthjprmzx6zP7DGTpI2yx0ySJKnHLMwG\nwHX9SZrSAVSqKR1AdXwPdTMv3cxLN/MyHRZmkiRJlbDHTHNnj1mf2WMmSRtlj5kkSVKPWZgNgOv6\nkzSlA6hUUzqA6vge6mZeupmXbuZlOizMJEmSKmGPmebOHrM+s8dMkjbKHjNJkqQeszAbANf1J2lK\nB1CppnQA1fE91M
28dDMv3czLdFiYSZIkVcIeM82dPWZ9Zo+ZJG1UVT1mEXFuROyJiE907PvViNgb\nEUeMbTs7Ij4bEZdFxMNmFZck
"text": [
"<matplotlib.figure.Figure at 0x10a5ac450>"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Filter the DataFrame to the columns we'll be looking at with Age:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df[['Sex', 'Pclass', 'Age']].head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> male</td>\n",
" <td> 3</td>\n",
" <td> 22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> female</td>\n",
" <td> 1</td>\n",
" <td> 38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> female</td>\n",
" <td> 3</td>\n",
" <td> 26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 26,
"text": [
" Sex Pclass Age\n",
"0 male 3 22\n",
"1 female 1 38\n",
"2 female 3 26"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Determine the max age:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"max_age = df['Age'].max()\n",
"max_age"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 27,
"text": [
"80.0"
]
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Filter the DataFrame to see more information on seniors (Age > 60):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df[df['Age'] > 60][['Sex', 'Pclass', 'Age']]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>33 </th>\n",
" <td> male</td>\n",
" <td> 2</td>\n",
" <td> 66.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54 </th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 65.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96 </th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 71.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>116</th>\n",
" <td> male</td>\n",
" <td> 3</td>\n",
" <td> 70.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>170</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 61.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>252</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 62.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>275</th>\n",
" <td> female</td>\n",
" <td> 1</td>\n",
" <td> 63.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>280</th>\n",
" <td> male</td>\n",
" <td> 3</td>\n",
" <td> 65.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>326</th>\n",
" <td> male</td>\n",
" <td> 3</td>\n",
" <td> 61.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>438</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 64.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>456</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 65.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>483</th>\n",
" <td> female</td>\n",
" <td> 3</td>\n",
" <td> 63.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>493</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 71.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>545</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 64.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>555</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 62.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>570</th>\n",
" <td> male</td>\n",
" <td> 2</td>\n",
" <td> 62.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>625</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 61.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>630</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 80.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>672</th>\n",
" <td> male</td>\n",
" <td> 2</td>\n",
" <td> 70.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>745</th>\n",
" <td> male</td>\n",
" <td> 1</td>\n",
" <td> 70.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>829</th>\n",
" <td> female</td>\n",
" <td> 1</td>\n",
" <td> 62.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>851</th>\n",
" <td> male</td>\n",
" <td> 3</td>\n",
" <td> 74.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
" Sex Pclass Age\n",
"33 male 2 66.0\n",
"54 male 1 65.0\n",
"96 male 1 71.0\n",
"116 male 3 70.5\n",
"170 male 1 61.0\n",
"252 male 1 62.0\n",
"275 female 1 63.0\n",
"280 male 3 65.0\n",
"326 male 3 61.0\n",
"438 male 1 64.0\n",
"456 male 1 65.0\n",
"483 female 3 63.0\n",
"493 male 1 71.0\n",
"545 male 1 64.0\n",
"555 male 1 62.0\n",
"570 male 2 62.0\n",
"625 male 1 61.0\n",
"630 male 1 80.0\n",
"672 male 2 70.0\n",
"745 male 1 70.0\n",
"829 female 1 62.0\n",
"851 male 3 74.0"
]
}
],
"prompt_number": 28
},
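{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check on the table above, the following sketch summarizes the seniors numerically. It assumes `df` still holds the training data loaded earlier, including the `Survived` column; the name `seniors` is introduced here only for illustration:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Passengers older than 60 (rows with a missing Age never satisfy the comparison)\n",
"seniors = df[df['Age'] > 60]\n",
"\n",
"# Share of each sex, passenger class, and survival outcome among seniors\n",
"print(seniors['Sex'].value_counts(normalize=True))\n",
"print(seniors['Pclass'].value_counts(normalize=True))\n",
"print(seniors['Survived'].value_counts(normalize=True))"
],
"language": "python",
"metadata": {},
"outputs": []
},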
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that most are men and most are 1st class; most also perished, although Survived is not among the columns shown above. Age seems like a good feature for predicting whether a passenger survived, but it needs some cleaning first: some values are missing, and many machine learning algorithms cannot handle missing values directly.\n",
"\n",
"Filter to view missing Age values:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df[df['Age'].isnull()][['Sex', 'Pclass', 'Age']].head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5 </th>\n",
" <td> male</td>\n",
" <td> 3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td> male</td>\n",
" <td> 2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td> female</td>\n",
" <td> 3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 29,
"text": [
" Sex Pclass Age\n",
"5 male 3 NaN\n",
"17 male 2 NaN\n",
"19 female 3 NaN"
]
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Determine the typical Age for each passenger class by Sex_Val. We'll use the median rather than the mean, as the Age histogram seems to be right-skewed:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Get the unique values for gender\n",
"genders = np.sort(df['Sex_Val'].unique())\n",
"\n",
"median_ages = np.zeros((len(genders), len(passenger_classes)))\n",
"median_ages"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 46,
"text": [
"array([[ 0., 0., 0.],\n",
" [ 0., 0., 0.]])"
]
}
],
"prompt_number": 46
},
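{
"cell_type": "markdown",
"metadata": {},
"source": [
"The right-skew assumption can be checked directly. This is a minimal sketch, assuming `df` is the training DataFrame used above; a mean noticeably above the median and a positive skewness are both consistent with a right-skewed distribution:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Both methods skip missing values by default\n",
"print(df['Age'].mean())\n",
"print(df['Age'].median())\n",
"\n",
"# Sample skewness; positive values indicate a longer right tail\n",
"print(df['Age'].skew())"
],
"language": "python",
"metadata": {},
"outputs": []
},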
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fill in our median ages array:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for genderIdx in range(0, len(genders)):\n",
"    for pclassIdx in range(0, len(passenger_classes)):\n",
"        # Rows for this gender and passenger class\n",
"        # (Sex_Val is 0 or 1 and Pclass is 1 to 3, so the indices map directly)\n",
"        subset = df[(df['Sex_Val'] == genderIdx) & \\\n",
"                    (df['Pclass'] == pclassIdx + 1)]\n",
"        # median() skips missing values by default\n",
"        median_ages[genderIdx, pclassIdx] = subset['Age'].median()\n",
"\n",
"median_ages"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
"array([[ 35. , 28. , 21.5],\n",
" [ 40. , 30. , 25. ]])"
]
}
],
"prompt_number": 31
},
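{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same medians can also be computed without explicit loops. The sketch below is an alternative, not the approach used in the rest of this notebook; it assumes `Sex_Val` and `Pclass` are the columns used above and introduces the hypothetical name `median_ages_by_group`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Median Age for every (Sex_Val, Pclass) combination; missing ages are skipped\n",
"median_ages_by_group = df.groupby(['Sex_Val', 'Pclass'])['Age'].median()\n",
"\n",
"# Pivot Pclass into columns so the layout matches the median_ages array\n",
"median_ages_by_group.unstack()"
],
"language": "python",
"metadata": {},
"outputs": []
},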
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make a copy of Age:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['AgeFill'] = df['Age']\n",
"df.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_Val</th>\n",
" <th>Embarked_Val</th>\n",
" <th>AgeFill</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Braund, Mr. Owen Harris</td>\n",
" <td> male</td>\n",
" <td> 22</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> A/5 21171</td>\n",
" <td> 7.2500</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> 22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td> female</td>\n",
" <td> 38</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> PC 17599</td>\n",
" <td> 71.2833</td>\n",
" <td> C85</td>\n",
" <td> C</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 3</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> Heikkinen, Miss. Laina</td>\n",
" <td> female</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> STON/O2. 3101282</td>\n",
" <td> 7.9250</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> 26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 32,
"text": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n",
"2 Heikkinen, Miss. Laina female 26 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Sex_Val Embarked_Val \\\n",
"0 0 A/5 21171 7.2500 NaN S 1 3 \n",
"1 0 PC 17599 71.2833 C85 C 0 1 \n",
"2 0 STON/O2. 3101282 7.9250 NaN S 0 3 \n",
"\n",
" AgeFill \n",
"0 22 \n",
"1 38 \n",
"2 26 "
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Populate AgeFill based on our median_ages array:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for genderIdx in range(0, len(genders)):\n",
"    for pclassIdx in range(0, len(passenger_classes)):\n",
"        df.loc[(df['Age'].isnull()) & \n",
"               (df['Sex_Val'] == genderIdx) & \n",
"               (df['Pclass'] == pclassIdx + 1), \\\n",
"               'AgeFill'] = median_ages[genderIdx, pclassIdx]\n",
" \n",
"df[df['Age'].isnull()].head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_Val</th>\n",
" <th>Embarked_Val</th>\n",
" <th>AgeFill</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5 </th>\n",
" <td> 6</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Moran, Mr. James</td>\n",
" <td> male</td>\n",
" <td>NaN</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 330877</td>\n",
" <td> 8.4583</td>\n",
" <td> NaN</td>\n",
" <td> Q</td>\n",
" <td> 1</td>\n",
" <td> 2</td>\n",
" <td> 25.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td> 18</td>\n",
" <td> 1</td>\n",
" <td> 2</td>\n",
" <td> Williams, Mr. Charles Eugene</td>\n",
" <td> male</td>\n",
" <td>NaN</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 244373</td>\n",
" <td> 13.0000</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> 30.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td> 20</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> Masselmani, Mrs. Fatima</td>\n",
" <td> female</td>\n",
" <td>NaN</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 2649</td>\n",
" <td> 7.2250</td>\n",
" <td> NaN</td>\n",
" <td> C</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 21.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 33,
"text": [
" PassengerId Survived Pclass Name Sex Age \\\n",
"5 6 0 3 Moran, Mr. James male NaN \n",
"17 18 1 2 Williams, Mr. Charles Eugene male NaN \n",
"19 20 1 3 Masselmani, Mrs. Fatima female NaN \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked Sex_Val Embarked_Val \\\n",
"5 0 0 330877 8.4583 NaN Q 1 2 \n",
"17 0 0 244373 13.0000 NaN S 1 3 \n",
"19 0 0 2649 7.2250 NaN C 0 1 \n",
"\n",
" AgeFill \n",
"5 25.0 \n",
"17 30.0 \n",
"19 21.5 "
]
}
],
"prompt_number": 33
},
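{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative sketch, the fill can be expressed with `groupby` and `transform`. It assumes the same `Sex_Val` and `Pclass` grouping as above; `group_median` and `age_fill_alt` are hypothetical names, and the result is kept in a separate Series so that `df` is left untouched:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Per-row median Age of the passenger's (Sex_Val, Pclass) group\n",
"group_median = df.groupby(['Sex_Val', 'Pclass'])['Age'].transform('median')\n",
"\n",
"# Keep known ages and fill only the missing ones\n",
"age_fill_alt = df['Age'].fillna(group_median)\n",
"\n",
"# Number of rows where the two approaches disagree\n",
"(age_fill_alt != df['AgeFill']).sum()"
],
"language": "python",
"metadata": {},
"outputs": []
},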
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ensure AgeFill does not contain any missing values:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(df[df['AgeFill'].isnull()])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 34,
"text": [
"0"
]
}
],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a feature that records whether Age was originally missing:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['AgeIsNull'] = pd.isnull(df['Age']).astype(int)\n",
"df.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_Val</th>\n",
" <th>Embarked_Val</th>\n",
" <th>AgeFill</th>\n",
" <th>AgeIsNull</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Braund, Mr. Owen Harris</td>\n",
" <td> male</td>\n",
" <td> 22</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> A/5 21171</td>\n",
" <td> 7.2500</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> 22</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td> female</td>\n",
" <td> 38</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> PC 17599</td>\n",
" <td> 71.2833</td>\n",
" <td> C85</td>\n",
" <td> C</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 38</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 3</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> Heikkinen, Miss. Laina</td>\n",
" <td> female</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> STON/O2. 3101282</td>\n",
" <td> 7.9250</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 35,
"text": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n",
"2 Heikkinen, Miss. Laina female 26 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Sex_Val Embarked_Val \\\n",
"0 0 A/5 21171 7.2500 NaN S 1 3 \n",
"1 0 PC 17599 71.2833 C85 C 0 1 \n",
"2 0 STON/O2. 3101282 7.9250 NaN S 0 3 \n",
"\n",
" AgeFill AgeIsNull \n",
"0 22 0 \n",
"1 38 0 \n",
"2 26 0 "
]
}
],
"prompt_number": 35
},
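{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whether the missing-age indicator carries any signal can be checked with a quick comparison. A minimal sketch, assuming `df` contains the `AgeIsNull` column created above:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Survival rate (mean of the 0/1 Survived flag) for known vs. missing ages\n",
"df.groupby('AgeIsNull')['Survived'].mean()"
],
"language": "python",
"metadata": {},
"outputs": []
},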
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a histogram of Age segmented by Survived, a normalized cross tab of AgeFill and Survived, and a scatter plot of Survived against AgeFill:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig, axes = plt.subplots(3, 1, figsize=(10, 20))\n",
"\n",
"# Histogram of Age (not AgeFill) segmented by Survived\n",
"df1 = df[df['Survived'] == 0]['Age']\n",
"df2 = df[df['Survived'] == 1]['Age']\n",
"# bins must be an integer; max_age is a float\n",
"axes[0].hist([df1, df2], bins=int(max_age / 10), range=(1, max_age), stacked=True)\n",
"axes[0].legend(('Died', 'Survived'), loc='best')\n",
"axes[0].set_title('Survivors by Age Groups')\n",
"axes[0].set_xlabel('Age')\n",
"axes[0].set_ylabel('Count')\n",
"\n",
"# Plot a normalized cross tab for AgeFill and Survived\n",
"age_fill_xt = pd.crosstab(df['AgeFill'], df['Survived'])\n",
"age_fill_xt_pct = age_fill_xt.div(age_fill_xt.sum(1).astype(float), axis=0)\n",
"age_fill_xt_pct.plot(ax=axes[1], stacked=True)\n",
"axes[1].legend(loc='best')\n",
"axes[1].set_title('Survival Rate by Age')\n",
"axes[1].set_xlabel('Age')\n",
"axes[1].set_ylabel('Proportion')\n",
"\n",
"# Scatter plot of Survived and AgeFill\n",
"axes[2].scatter(df['Survived'], df['AgeFill'])\n",
"axes[2].set_title('Age by Survivors')\n",
"axes[2].set_xlabel('Survived')\n",
"axes[2].set_ylabel('Age')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 36,
"text": [
"<matplotlib.text.Text at 0x10b5b8510>"
]
},
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAAAmcAAASWCAYAAABvrAuXAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3XmYXGWZ/vH7yUZ2kpAFCAnJDBBBkICAsji0ivwQgYAM\nS5AQlEFldBSXmRFHoBUVGTdGHWBAZU8EBGRTZNFmQAFB2TdhJCEESAIJhOyd5Pn9cc5JTirV3VXV\ndc55q+r7ua6+0rWdeuvtStfdz7scc3cBAAAgDH2KbgAAAAA2IpwBAAAEhHAGAAAQEMIZAABAQAhn\nAAAAASGcAQAABIRwBqCuzOxCM/tawW24zMzOKbINAFArwhnQAszsADP7o5m9aWZvmNl9ZrZXFs/l\n7qe5+zezOHY1zYi/6sIifzOzp+p1zG6e63gze9DMlpnZAjN7wMxOy/p5AYSDcAY0OTMbLulWSf8l\naaSk8ZK+Lml1DccyM7P6trCi5+1Xy8Pq2IR/kLSFpDFZhVpJMrMvSTpf0nmSxrn7OEmflrS/mQ3o\n4jH8HgeaDP+pgea3kyR392s8ssrd73T3JyTJzNrN7MrkzmY2yczWJx/6ZtZhZt80sz9IWi7pX83s\nofQTmNkXzOym+PsNQ4pm9oyZfSR1v35mtsjMpsaXjzCzp8xsiZn93szekbrvHDP7NzN7XNLbZtbX\nzP7dzF42s6Vm9qyZfaCb1z3azO6I79thZhPj4/63mX2vpP03m9np3RxrpqTrJd0Uf59+7GQz+9/4\nee6Mj5/uz/fGVcslZvaomR1Y7gnMbEtFofk0d7/B3ZdLkrs/6u4nuvuaVP9eaGa/NrNlktrMbOf4\nNS4xsyfN7PDUcTvM7JTU5ZPN7N7U5fVm9i9m9n/xz+Y/kwBuZjuY2T1xxXWRmf2imz4CUCeEM6D5\nPSdpXfyhfoiZjSy5vZLhvxMl/ZOkoZIukjTFzHZI3X6CpKtTx0uOOUvS9NT9/p+khe7+qJntFN/+\nOUmjJf1a0i0lVbLjJX1Y0ghJO0j6jKS93H24pIMlzemivSbpY5K+ER/70VT7LpM0PRVARkv6YOr2\nTQ9kNljS0ZKukXStpOPNrH/qLrMkPSBplKR2RX3l8WPHK6pafsPdR0r6sqTr4+csta+i6txNXbym\ntOmSznH3oZIeknSLpNsljZH0L5KuNrMd4/tWMsR7pKR3S9pT0jRJn4ivP0fS7e4+QlHF9UcVtA1A\nLxHOgCbn7m9LOkDRB/Qlkhaa2U1mNja+S0/Dfy7pMnd/xt3Xu/tSRQFiuiTFIWCKpJtTj0mOOVvS\nEWY2ML58QnydJB0n6VZ3v9vd10n6nqRBkvZLPe+P3H2+u6+WtE5ReHmnmfV395fc/W/dtPtWd78v\nrjj9h6R9zWy8uz8k6S1FgUyKAuDv3X1RF8f5qKSl7v4HSb+Lr/tI/NonStpL0lnuvja+T7ofTpT0\na3e/XZLc/S5JD0s6tMzzjJb0uruvT65IVdxWmNkBqfv+yt3vj7+fKmmIu38nbsPvFQXCE7rpm1Ln\nufub7j5P0bBqEqjXSJoU99sad/9jFccEUCPCGdAC3P1Zd/+4u0+QtKukbRV9CFdqXsnldEXsBEk3\nuvuqMs/7gqRnFAW0wZIOjx8rSdtIeil1X4+fZ3y5542Pdbqi6tQCM5ttZtt00V6X9HLqscslLVb0\nuiXpCkXBSfG/V6prMyXdEB9nnaRfaePQ5raSFpe89pe1MZxuL+mYOGAtMbMlkvaXtHWZ53lD0VDs\nht/L7r5fXHF7Qxt/X2/y2uI2lP585qZeayXSj38p9dh/i1/Ln+Lh0o9XcUwANSKcAS3G3Z+TdLmi\nkCZF88gGp+5SLjiUDovdpWhy/O6KKk+zNn/IBrMVBblpkp5OVbteURReJEWLDSRNkDS/q+d199nu\n/r74ca5o4nxXJqSOPVTRsOMr8VVXSZoWt/8digLXZs
xsO0kfkDTTzF41s1clHSvpUDMbJelVSaPM\nbFDJ8ybtfknSle4+MvU1zN3/s8zT3a9okcaR3bymRLpfXpE0IRmmjW2vjf24XNKQ1G3lfr4TS76f\nL0nuvsDdP+nu4yV9StIFZvZ3FbQPQC8QzoAmZ2ZTzOyL8fwnmdkERWEpGRZ7VNI/mNmEeFL6GeUO\nk77g7p2SrlM0FDlS0p1d3VfSLxTNNfu0Np3Xda2kj5jZB+I5XF+StEpS2aEzM9spvu8WikLMKkVD\nnWXvrihAJascz5F0v7snoeNlRcOLV0j6ZTxsWs4MSc8qWlSxe/y1k6LK1QnuPjc+TruZ9TezfSUd\nlnr8VZION7OD4wUNA82sLflZpLn7m4oWBFxgZkeb2TAz62PR4ol0uCrt3wckrZD0b3Eb2uI2JJP3\nH5X0UTMbFM8TPEWb+7KZjYjfG59TNL9OZnZMHFAl6U1FoXB9mccDqCPCGdD83pb0HkkPxqv77pf0\nuKIwJHe/U9GH8ePaOLm8tFJWbkL5LEXztq5Lz5NSyQR0d39NUeDaN36e5Pq/KhpS/LGkRYrmcR3u\n7mu7eB1bSDo3vu+riuZolQuSSRuulnS2oiHBPbRxGDNxuaTd1P2Q5kmSLnD3hamvBYoWRZwU3+dj\n8Wt7Q1EIvEbRXK0kBE6T9FVJCxVV0r6kLn73uvt3JX1R0XDia/HXRfHlJEyX9m+nouHiDyvqm59I\nmhH3ryT9MG7PAkmXKgqMpT/PmyT9WdIjiuar/Sy+fi9JD5jZ2/F9Pufuc7rpLwB1YNE0jwwOHP0F\ndoWksYp+EVzs7j8ys3ZFq76SybdfdfffxI85Q9EqoXWKfgnckUnjALQ8M3ufpKvcffse71zdca9R\nNHz79XoeNytmtl7SDj0srgCQo1o2dqxUp6QvxEvmh0r6s5ndqSio/cDdf5C+s5ntomj11i6KJgTf\nZWY7lfxFDgC9Fg+jnq5o9Wpvj7WXpCWSXlQ0fHuEpG/39rgAWldmw5ru/pq7Pxp/v0zRiq1knkW5\npfvTJM129864bP6CpH2yah+A1mRmOysKU+NU3YrVrmwt6feKho9/KOnT7v5YHY6bl2yGTwDULMvK\n2QZmNknRnI8HFC0j/xczO0nRRNovxRNht41vT7ysTZfUA0CvufszijbTrdfxblU0T6shuXvfotsA\nYFOZLwiIhzR/KenzcQXtQkmTFW2c+Kqk73fzcP6iAwAALSXTylk8r+N6RZNufyVJ7r4wdftPFa0M\nk6J9dSakHr6dNt3vKHkMgQ0AADQMd+/pTCybyKxyFm+I+DNFq5bOT12f3tH7KElPxN/frOicdQPM\nbLKkHSX9qdyx3Z2vkq+zzz678DaE9kWf0C/0C/1Cn9AvRX/VIsvK2f6K9hV63Mweia/7qqITDk9V\nNGT5oqJdp+XuT5vZtZKelrRW0j97ra8KAACgQWUWztz9PpWvzP2mm8d8WyxBBwAALYwzBDSJtra2\nopsQHPqkPPqlPPqlPPplc/RJefRL/WR2hoCsmBmjnQAAoCGYmbzKBQG57HMGAADyEa3HQxHqVTwi\nnAEA0GQYYcpfPUMxc84AAAACQjgDAAAICOEMAAAgIIQzAABQuNNOO03f/OY3a3rsySefrDPPPLPO\nLSoOCwIAAGhyeazg7GkRwqRJk7Rw4UL169dPffv21S677KKTTjpJn/zkJ2VmuvDCC2t+bjNrqlWq\nVM4AAGgJnuFXz8xMt956q5YuXaqXXnpJX/nKV3TeeefplFNOqc+ra6IVqoQzAACQq2HDhunwww/X\nNddco8svv1xPPfXUZkOTt956q6ZOnaqRI0dq//331xNPPLHhtkceeUR77rmnhg8fruOPP16rVq0q\n4mVkhnAGAAAKsffee2u77bbTvffeu8mw5COPPKJTTjlFl1xyiRYvXqxPfepTOuKII9TZ2ak1a9bo\nyCOP1MyZM7VkyR
Idc8wxuv766xnWBAAAqIdtt91WixcvlrRxbtzFF1+sT33qU9p7771lZjrppJO0\nxRZb6P7779cDDzygtWvX6vOf
"text": [
"<matplotlib.figure.Figure at 0x10a5e2850>"
]
}
],
"prompt_number": 36
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot AgeFill density by Pclass:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for pclass in passenger_classes:\n",
"    df.AgeFill[df.Pclass == pclass].plot(kind='kde')\n",
"\n",
"plt.title('Age Density Plot by Passenger Class')\n",
"plt.xlabel('Age')\n",
"plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 37,
"text": [
"<matplotlib.legend.Legend at 0x10d2e3ed0>"
]
},
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAAAm0AAAFRCAYAAAAmW5r1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xd8FHX6wPHPQ+ikQUBagFClCIICKp5nzooIIqgoiohi\nORF/gPVOVEBFz4IoIqInh4oHeFIEFUFBg9hAVBAklAChg9TQW/L9/TGTEMIk2d3M7G42z/v12peZ\nndnvfPNkgk++VYwxKKWUUkqp8FYq1BVQSimllFKF06RNKaWUUqoY0KRNKaWUUqoY0KRNKaWUUqoY\n0KRNKaWUUqoY0KRNKaWUUqoY0KRNKeUKEakrIgdERIJwr3QRudylst4TkWfdKEv5xs2fn1IliSZt\nSgWRiKSIyB4RKevxPY6IyH4RyRCRxSLyuJf3BDDGbDTGxBh78Ue7Hn0DLU9EskTkoJ0IbhaRESKS\n/W+WsV+FlZEsIpsKq7ovZfmikDqXGCISKyKvicgGOxZpIjJSRBLsS1yLuVIlSYn7x0SpUBGRJKA9\n8CdwnYe3MsADxphYoAbwMHALMMvDe+ZXj6JqZYyJAS4HbgXucaFMJ262DgarziEnIqUd3isLzAOa\nAVfbsbgI2AW0C24NlYosmrQpFTy9gbnABOCO3CdEJEFEPrVbxhaJyHMisiDX+aYi8pWI7BaRlSJy\nUyH3EgBjzBFjzHysJPEiEbnWLk9E5B92C8guEflIRCrb55LsFqPedkvJThF5Ildd2tutdxkisl1E\nRuT5XJSIDAcuAUbbLS1viMhoEXklz/c9U0QGFhY4Y8wqYAHQ4oxvVKSc3aqzxX6NFJGyIlIJ+AKo\nZddhv4jUyOcWVUXkS/uaFBGpa5f9pht1FpEGIvK1HeudIvKhiMTlKvNxu2Vuv/3zvcx+3zHW9rkL\nReQHEdkrIktE5NJc51JE5BkR+c4uc06uVi5y/Wx3iciTkqu70sdn4y4R2YD1POfVG6gDdDPGrLRj\nsdMYM9wYMzvvxfb3+KP9fWy1n5Uyuc6PFJEddgx+F5EW9vudROQP+/vbLCIPF/YzUarYM8boS1/6\nCsILSANuAxoDx4Gzcp2bDEwEymO1UGwEvrXPVQI2YSV6pYDWwE6gWT73+Qa4y+H9+cC/7K8HAD8A\ntYAywFhgon0uCcgC3gbKAa2Ao8DZ9vkfgdvsrysCF+T5XCmnemC1smwBxD6uChwCquXzfWQBDe2v\nmwPbgDvt4/XAZfbXz9jfS1X79T3wjH3uUmBTIT+X94D9wF+AssBrwAI36ww0xGp5K2OXMR8YaV93\ntv3zrmEf1wUaFBLr2lgtVx3t4yvs4wT7OAVYAzTCeqa+AV7IVa8DQAe7Pi9jPY/Z8fTl2XgPqACU\nc4jBZGB8ITHP/fM7D6sFuhRQD1gBDLDPXQ0sBmJzxSo7TtuAi+2v44A2of4d15e+vH5pS5tSQSAi\nf8H6H+1MY8warP8x3WqfiwK6A0OMMUeNManA+5zqsusMrDfGvG+MyTLGLAGmAYW1tuW1Fahsf/13\n4EljzFZjzAlgGHCjnD7+apgx5pgx5ndgKXCu/f5xoLGIVDXGHDbGLCzoW8/+whjzM5CBlbyA1WX7\njTFmZwGf/1VE9gAzgX8bY8Y7XHMrVpK2yxizy/5ebs97/0J8Zoz5zhhzHBiM1SpZ2406A+8ZY9Ya\nY+YZY07YdRyJlVACZGIlxy1EpIyxxgaus8/lF+tewCxjt1wZY+ZiJTfX2ucNVuKUZow5CvwPK9kH\nuBHrOfzB/tk/zeld2fdR+LMx1FituMccvv8qWAmVT4wxvxpjFtnP9gbgnVyxOQHEAM1EpJQxZpUx\nZnuu2LQQkVhjTIYx5jdf76lUcaVJm1LBcQfwpTHmgH38Mae6SKsBpbFa07JtzvV1PeACu/tor4js\nxUpUqvtZh0RgT64yp+cqbwVwMk+Z23N9fRiItr/uCz
QBUsXqyr2W/OUd1/YBVsKB/d8JhdS5jTGm\nijGmkTHm6XyuqQVsyHW80X7PV4Zc8TbGHMKKU3YZRaqzMcaISHURmWx342XYZSTY90sDBgJDgR0i\nMklEatpl5RfresBNeZ6Ji7HGMGbL/fM7wqmfX6083+8RYHeua5Mo/NkoaHLHbvyIv4g0EZHPRGSb\nHZvhnIrN18Bo4E2s2LwtIjH2R28AOgHpdnfwhb7eU6niSpM2pTwmIhWAHsBl9v+YtmFNDjhXRFpi\ndXWexBoHlC331xuB+caYyrleMcaYB/yoQx2sbqjscXIbsbrWcpdZ0RhTaAuJ3XpzqzGmGvAiMMX+\nHs+41OG9D4GuInIu0BT4xNfvoQBbsRKNbHXt9/Krg5OceItINFZrUXYZbtT5eawWtXOMMXFYLYE5\n//4aYyYZYy7BSsYMVlzzi3VFrJ/fBIdn4iUf6rIVK4HP/n4rYCdJNl+ejYLiOhe42q6nL97CSgwb\n2bEZzOmxecMY0xarW7cJ8Kj9/mJjzPVYf/R8gtWaqFRE06RNKe9dj5WUNcPqYjzX/noBcIcxJhOr\nu3OoiFQQkaZY/1PP/h/j50ATEeklImXsVzv7uvwIgIhUtAeozwAWGmOyZ5COBZ6XUwPuq4mITzNa\n7XpUsw8z7HpmOVy6A2ssVw5jzGasbrwPgCn5dK/5axLwpIhUFZGqWN192a1hO4AEEYkt4PMCdBKR\ni8Wa+fgs8KMxZouLdY7GGgu3X0RqYycekNPSdJmIlAOOYY0fzLTPOcU6EyuR7CIiV4k18aO8WMub\n1M7zfTmZan/2Ivv7HZrn2oCfDdsErJa4qSJytoiUEmuizRMico3D9dFYY+wO28/0/fb3iYi0FZEL\n7IkJh7NjY/8O3CYicfbvz4HsmCkVyTRpU8p7vYH/GGM2G2P+tF87sLp9brXHCvXHGky9HWs82ySs\nMTvYXapXYY2n2oI1XugFrEHz+RktIvvt8kZidcd2zHX+dawxV1/a1/2INRg8W0EtKVcDy0XkgF32\nLbkSmdyfex1rLNQeEXkt1/vvAy0pvJvR11ay57CSqt/t12L7PYw1e3ESsM6uh9PsUQP8FxiC1bXX\nhlPdoW7VeRhWS2cG8ClW4pR9bTmsn+dOrJ9tVeCf9jnHWNuJZFfgCawlZDZitd7mTr5Mnq8NgDHm\nD+BBrAkDW7ESnj+xEkYo2rOBPS7wCmAl8JX9PS/Ear38yeEjj2B19+/HGs82Ode5WPu9PUA61mSL\nl+1zvYD1dpfqvViTfJSKaNkzorwpXKQj1kysKOBdY8yLDteMAq7B+iuqT/ZgUhGJB97FmuJvsGah\nOf3CKxVxRORFrNmld4a6Lm4TkUuAD40x9UJdF18Vxzr7yu4O3ovVPbmhsOuVUqHjWUubPSNuNNZf\n982BniLSLM81nbD+oWiM9ZfSW7lOv441O6oZ1pIDqV7VValQs7uRWomlPXAXMD3U9XKb3c01EGtW\nZbFQHOtcGBHpYnedVwJeAX7XhE2p8Odl92h7IM0Yk25PG5+M1Zyf23VY3Q7YU9nj7VlWccAlxpj/\n2OdOGmMyPKyrUqEWg9VldhDrd+UVY8zM0FbJXfYfbXuxZiG+VsjlYaE41tlH12F1tW/BGnd4S2ir\no5TyxRlbkLioNmcuYXCBD9ckYg0o3Ski47EGbf+CtdjiYe+qq1ToGGMWYy26G7GMtf5cdKEXhpHi\nWGdfGGPuIYK311IqUnnZ0ubrYLm8M5wMVjJ5HjDGGHMe1qyrf7hYN6WUUkqpYsXLlrYtnLnu1OZC\nrkm03xNgs70aOcAUHJI2EfFuFoVSSimllMuMMb7u1HIGL1vaFmNtv5JkrwV0M9Y08txmYi2HgL2a\n9T5jzA57m5JNItLEvu4K4A+nm5gw2Ass3F5DhgwJeR3C8aVx0bhoTDQuGheNSyhfReVZS5sx5qSI\n9AfmYC35Mc4Yky
oi99nn3zbGzBKRTiKShtUFmnt5gweB/9oJ39o851QB0tPTQ12FsKRxcaZxOZPG\nxJnGxZnGxZnGxX1edo9ijPkC
"text": [
"<matplotlib.figure.Figure at 0x10b35b490>"
]
}
],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When looking at AgeFill density by Pclass, we see that first-class passengers were generally older than second-class passengers, who in turn were older than third-class passengers. We've already determined that first-class passengers had a higher survival rate than second-class passengers, who in turn had a higher survival rate than third-class passengers."
]
},
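{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two observations above can be put side by side. A minimal sketch, assuming `df` contains the `AgeFill` column built earlier:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Median filled age and survival rate (mean of the 0/1 Survived flag) per class\n",
"df.groupby('Pclass').agg({'AgeFill': 'median', 'Survived': 'mean'})"
],
"language": "python",
"metadata": {},
"outputs": []
},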
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature: Family Size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a new feature FamilySize that is the sum of Parch (number of parents or children on board) and SibSp (number of siblings or spouses):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['FamilySize'] = df['SibSp'] + df['Parch']\n",
"df.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_Val</th>\n",
" <th>Embarked_Val</th>\n",
" <th>AgeFill</th>\n",
" <th>AgeIsNull</th>\n",
" <th>FamilySize</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> Braund, Mr. Owen Harris</td>\n",
" <td> male</td>\n",
" <td> 22</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> A/5 21171</td>\n",
" <td> 7.2500</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> 22</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td> female</td>\n",
" <td> 38</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> PC 17599</td>\n",
" <td> 71.2833</td>\n",
" <td> C85</td>\n",
" <td> C</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 38</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 3</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> Heikkinen, Miss. Laina</td>\n",
" <td> female</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> STON/O2. 3101282</td>\n",
" <td> 7.9250</td>\n",
" <td> NaN</td>\n",
" <td> S</td>\n",
" <td> 0</td>\n",
" <td> 3</td>\n",
" <td> 26</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 41,
"text": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n",
"2 Heikkinen, Miss. Laina female 26 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Sex_Val Embarked_Val \\\n",
"0 0 A/5 21171 7.2500 NaN S 1 3 \n",
"1 0 PC 17599 71.2833 C85 C 0 1 \n",
"2 0 STON/O2. 3101282 7.9250 NaN S 0 3 \n",
"\n",
" AgeFill AgeIsNull FamilySize \n",
"0 22 0 1 \n",
"1 38 0 1 \n",
"2 26 0 0 "
]
}
],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a histogram of FamilySize:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df['FamilySize'].hist()\n",
"plt.title('Family Size Histogram')\n",
"plt.xlabel('Family Size')\n",
"plt.ylabel('Count')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
    {
     "metadata": {},
     "output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x10d42dcd0>"
]
}
],
"prompt_number": 42
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot a histogram of FamilySize segmented by Survived:"
]
},
  {
   "cell_type": "code",
   "collapsed": false,
   "input": [
    "# Get the unique values of FamilySize and its maximum\n",
    "family_sizes = sorted(df['FamilySize'].unique())\n",
"family_size_max = max(family_sizes)\n",
"\n",
"df1 = df[df['Survived'] == 0]['FamilySize']\n",
"df2 = df[df['Survived'] == 1]['FamilySize']\n",
"plt.hist([df1, df2], bins=family_size_max + 1, \n",
" range=(0, family_size_max), stacked=True)\n",
"plt.legend(('Died', 'Survived'), loc='best')\n",
"plt.title('Survivors by Family Size')\n",
"plt.xlabel('Family Size')\n",
"plt.ylabel('Count')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 43,
"text": [
"<matplotlib.text.Text at 0x10b3f5cd0>"
]
},
    {
     "metadata": {},
     "output_type": "display_data",
"text": [
"<matplotlib.figure.Figure at 0x10d767110>"
]
}
],
"prompt_number": 43
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}