Titanic Data Analysis

Questions about the Data:

  1. What did the passenger distribution look like (wealth, age, etc.)?
  2. Is there a correlation between the odds of survival and the passenger class?
  3. What impact did a passenger's sex have on survival?
  4. Can we build a high-performing model to predict whether an individual survived?

1. Wrangling Data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams.update({'font.size': 22})

titanic_df = pd.read_csv('titanic_data.csv')
Populating the interactive namespace from numpy and matplotlib
In [2]:
titanic_df.head()
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

2. Data Overview

2.1 Demographics

Data Overview and Crosstables:

To get things started, let's look at the overall survival rate:

In [3]:
print "Survival Rate:"
print titanic_df['Survived'].mean()
Survival Rate:
0.383838383838

So the overall survival rate was roughly 38%, but I expect the survival rate of men to be significantly lower than that of women. Let's take a look by creating a cross table:

In [4]:
titanic_df.groupby('Sex').mean()['Survived']
Out[4]:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
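
The groupby above gives the survival rates directly; for a literal cross table of counts, pandas also offers crosstab (a minimal sketch on the same dataframe):

pd.crosstab(titanic_df['Sex'], titanic_df['Survived'])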

With about 74.2% of women surviving, compared to 18.9% of their male counterparts, the odds of surviving the Titanic were significantly higher for women than for men. But there is more to it, so let's factor in the Cabin Class (1 = highest; 3 = lowest):

Passenger Class and Gender

In [5]:
print titanic_df.groupby(['Pclass']).mean()['Survived']
print titanic_df.groupby(['Pclass']).mean()['Survived'].plot.bar()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
Axes(0.125,0.125;0.775x0.775)
In [6]:
print titanic_df.groupby(['Pclass','Sex']).mean()['Survived']
print titanic_df.groupby(['Pclass','Sex']).mean()['Survived'].plot.bar()
Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64
Axes(0.125,0.125;0.775x0.775)

At first glance there seems to have been a big discrepancy in survival odds between the cabin classes, with 62.9% for first class, 47.3% for second class, and 24.2% for third class.

But when we factor in gender, we can clearly see that the survival odds in every class are pulled up by the female survival rate. For instance, while 96.8% of female first-class passengers survived, only 36.9% of male first-class passengers did.

The most striking numbers to me are that significantly fewer people survived in third class, and that the gap in survival rates between the two genders reaches roughly 76.4 percentage points in second class (92.1% vs. 15.7%).
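
A minimal sketch to compute that gap directly from the grouped rates (same dataframe as above):

## Survival-rate gap between women and men within each class:
rates = titanic_df.groupby(['Pclass', 'Sex']).mean()['Survived'].unstack()
print rates['female'] - rates['male']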

Impact of Age

In [7]:
group_by_age = pd.cut(titanic_df['Age'], np.arange(0, 90, 10))
age_grouping = titanic_df.groupby(group_by_age).mean()
print age_grouping['Survived']
age_grouping['Survived'].plot.bar()
Age
(0, 10]     0.593750
(10, 20]    0.382609
(20, 30]    0.365217
(30, 40]    0.445161
(40, 50]    0.383721
(50, 60]    0.404762
(60, 70]    0.235294
(70, 80]    0.200000
Name: Survived, dtype: float64
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0xa501978>

From those numbers, we can tell that children had the highest odds of surviving and the elderly had lower odds (which might also depend on other factors, e.g. physical capability). However, there is no clear trend across the other bins.
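
Before reading too much into the oldest bins, it is worth checking how many passengers each bin actually contains; a quick sketch reusing group_by_age from the cell above:

## Survival rate alongside bin size -- small bins make the rates noisy:
print titanic_df.groupby(group_by_age)['Survived'].agg(['mean', 'count'])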

Distribution Plots

Survival Distribution:

In [8]:
gender = titanic_df["Survived"].value_counts().plot(kind="bar", title= 'Passenger Survival (1=Survived)')
gender.set_xlabel("Survival")
gender.set_ylabel("Passengers")
print gender
print "Odds of Survival:"
print titanic_df['Survived'].mean()
Axes(0.125,0.125;0.775x0.775)
Odds of Survival:
0.383838383838

Age Distribution

In [9]:
age = titanic_df['Age'].hist()
age.set(xlabel="Age", ylabel="Number of Passengers", title = 'Age Distribution')
print age
print ""
print titanic_df['Age'].describe()
Axes(0.125,0.125;0.775x0.775)

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

Number of Parents/Children aboard -- Distribution:

In [10]:
parch = titanic_df['Parch'].hist(bins=5)
parch.set(xlabel="Count", ylabel="Number of Passengers", title = 'Number of Parents/Children Aboard')
print parch
print ""
print titanic_df['Parch'].describe()
Axes(0.125,0.125;0.775x0.775)

count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: Parch, dtype: float64

Gender Distribution

In [11]:
gender = titanic_df["Sex"].value_counts().plot(kind="bar", title= 'Passenger Gender Distribution')
gender.set_xlabel("Gender")
gender.set_ylabel("Passengers")
Out[11]:
<matplotlib.text.Text at 0xbcaff98>

Cabin Class Distribution:

1 = First Class, 2 = Second Class, 3 = Third Class

In [12]:
cabin = titanic_df["Pclass"].value_counts().plot(kind="barh", title= 'Passenger Class Distribution')
cabin.set_xlabel("Passengers")
cabin.set_ylabel("Class (1st, 2nd, 3rd)")
print cabin
print ""
print titanic_df['Pclass'].describe()
Axes(0.125,0.125;0.775x0.775)

count    891.000000
mean       2.308642
std        0.836071
min        1.000000
25%        2.000000
50%        3.000000
75%        3.000000
max        3.000000
Name: Pclass, dtype: float64
In [13]:
fare = titanic_df['Fare'].hist(bins=5)
fare.set(xlabel="Fare", ylabel="Number of Passengers", title = 'Fare Distribution')
print fare
print ""
print titanic_df['Fare'].describe()
Axes(0.125,0.125;0.775x0.775)

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

Port of Embarkation Distribution:

In [14]:
Embark = titanic_df["Embarked"].value_counts().plot(kind="barh", title= 'Passenger Port Distribution')
Embark.set_xlabel("Passengers")
Embark.set_ylabel("Port")
print Embark
print 'Ports: (C = Cherbourg; Q = Queenstown; S = Southampton)'
Axes(0.125,0.125;0.775x0.775)
Ports: (C = Cherbourg; Q = Queenstown; S = Southampton)

3. Analysis:

Correlation:

In [15]:
def pearson_r(x, y):
    ## Standardize both series (population std, ddof=0). Pandas skips NaN
    ## entries when computing means and stds, so rows with missing values
    ## simply drop out of the final product average.
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    corr = (std_x * std_y).mean()
    return corr


## Conversion for Gender:
def convert_gender(series):
    if series == 'male':
        return 0
    elif series == 'female':
        return 1
    

## Data as Pandas Series:
survival_ser = titanic_df['Survived']
pclass_ser = titanic_df['Pclass']
age_ser = titanic_df['Age']
fare_ser = titanic_df['Fare']
parch_ser = titanic_df['Parch']
gender_ser = titanic_df['Sex'].apply(convert_gender)

Correlation Values:

In [16]:
print "Age and Survival:"
print pearson_r(survival_ser, age_ser)
print ""

print "Fare and Survival:"
print pearson_r(survival_ser, fare_ser)
print ""

print "Survival and Parents/Children:"
print pearson_r(survival_ser, parch_ser)
print ""

print "Survival and Cabin Class:"
print pearson_r(survival_ser, pclass_ser)
print "-- Note: Low Values indicate that First Class passengers were more likely to survive"
print ""

print "Survival and Gender:"
print pearson_r(survival_ser, gender_ser)
print "-- Note: High Values indicate that female passengers were more likely to survive"
print ""
Age and Survival:
-0.0779826784139

Fare and Survival:
0.257306522385

Survival and Parents/Children:
0.0816294070835

Survival and Cabin Class:
-0.338481035961
-- Note: the negative value means first-class (Pclass=1) passengers were more likely to survive

Survival and Gender:
0.543351380658
-- Note: the positive value means female passengers (coded 1) were more likely to survive
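
As a sanity check, pandas' built-in Series.corr computes Pearson's r as well. The values should match closely, though the Age figure can differ slightly: .corr drops NaN pairs before standardizing, while pearson_r standardizes each series over all of its own non-missing values.

print "Pandas corr, Survival and Age:", survival_ser.corr(age_ser)
print "Pandas corr, Survival and Gender:", survival_ser.corr(gender_ser)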

Investigation - Survival and Gender:

In [17]:
grouped_gender = titanic_df.groupby('Sex')
print grouped_gender['Survived'].mean()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

Conclusion:

Of all female passengers, 74.2% survived, but only 18.9% of all male passengers did.

In [18]:
gends = grouped_gender['Survived'].mean().plot(kind="barh", title= 'Survival Rate by Gender')
gends.set_xlabel("Odds of Survival")
gends.set_ylabel("Gender")
print gends
Axes(0.125,0.125;0.775x0.775)

Investigation - Survival and Cabin Class:

In [19]:
grouped_cabin = titanic_df.groupby('Pclass')
print grouped_cabin['Survived'].mean()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

Conclusion:

Of all first-class passengers, 62.96% survived, but only 24.24% of all third-class passengers did. The second-class rate sits in between at 47.28%.

In [20]:
pclass = grouped_cabin['Survived'].mean().plot(kind="barh", title= 'Survival Rate by Cabin Class')
pclass.set_xlabel("Odds of Survival")
pclass.set_ylabel("Cabin Class")
print pclass
Axes(0.125,0.125;0.775x0.775)

Investigation - Survival and Age:

In [21]:
## Delete rows where Age is NaN:
titanic_age_df = titanic_df[pd.notnull(titanic_df['Age'])]


## Conversion for Age (bucket ages into four 20-year bands; applied in the sketch below):
def convert_age(age):
    if age <= 20:
        return 1
    if age <= 40:
        return 2
    if age <= 60:
        return 3
    if age <= 80:
        return 4
In [22]:
titanic_age_df['Age'].describe()
Out[22]:
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
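
convert_age is defined above but never applied in these cells; a minimal sketch of how it could be used on the NaN-free frame:

## Survival rate per 20-year age band (1 = 0-20, ..., 4 = 61-80):
age_band = titanic_age_df['Age'].apply(convert_age)
print titanic_age_df.groupby(age_band)['Survived'].mean()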

Machine Learning

In [23]:
from sklearn.metrics import f1_score, recall_score, precision_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
import sys
import pickle
from time import time

Create Feature Numpy Array

In [24]:
## Unused scaffolding -- the features below are built from 'Sex' alone:
feat = []
passclass = titanic_df['Pclass']
age = titanic_df['Age']
In [25]:
features = []
sex_list = titanic_df['Sex']
for each in sex_list:
    if each == "male":
        features.append(0)
    else:
        features.append(1)
labels = titanic_df['Survived']

labels = np.array(labels)
features = np.array(features).reshape(-1, 1)
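
The loop above works; an equivalent vectorized version (a sketch, keeping the same male=0 / female=1 encoding) would be:

## Vectorized alternative to the conversion loop:
features = (titanic_df['Sex'] == 'female').astype(int).values.reshape(-1, 1)
labels = titanic_df['Survived'].values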
In [26]:
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.2, random_state=20)
In [27]:
print ""
print 'SVM Results:'
from sklearn import svm, grid_search

parameters = {'kernel': ['rbf'], 'C':[100, 1000, 10000, 50000]}
svr = svm.SVC()
clf = grid_search.GridSearchCV(svr, parameters) 


t0 = time() 
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t1 = time()
pred = clf.predict(features_test)
print "Accuracy:", accuracy_score(pred, labels_test)
print "predicting time:", round(time()-t1, 3), "s"
SVM Results:
training time: 0.136 s
Accuracy: 0.832402234637
predicting time: 0.003 s
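
GridSearchCV also stores the parameter combination it selected; a quick peek (assuming the fitted clf from above):

print "Best parameters found:", clf.best_params_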

Put Data in a Dictionary

In [28]:
import csv
reader = csv.DictReader(open('ml_titanic_data.csv'))

result = {}
for row in reader:
    ## PassengerId is the unique key; a duplicate id would simply overwrite.
    key = row.pop('PassengerId')
    result[key] = row

for passenger in result:
    if result[passenger]['Sex'] == 'male':
        result[passenger]['Sex'] = 0
    else:
        result[passenger]['Sex'] = 1

for passenger in result:
    result[passenger]['Fare'] = float(result[passenger]['Fare'])
    if result[passenger]['Age'] == "":
        result[passenger]['Age'] = 'NaN'
    else:
        result[passenger]['Age'] = float(result[passenger]['Age'])
    result[passenger]['Parch'] = float(result[passenger]['Parch'])
    result[passenger]['Pclass'] = float(result[passenger]['Pclass'])
    result[passenger]['Sex'] = float(result[passenger]['Sex'])
    result[passenger]['SibSp'] = float(result[passenger]['SibSp'])
    result[passenger]['Survived'] = float(result[passenger]['Survived'])

Helper Functions:

In [29]:
from feature_format import featureFormat, targetFeatureSplit
In [30]:
features_list = ['Survived','Sex', 'Age', 'Pclass']
my_dataset = result
In [31]:
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
In [32]:
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.20, random_state=20)

Naive Bayes:

In [33]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
clf = GaussianNB()


print ""
print 'Naive Bayes Results:'
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t1 = time()
pred = clf.predict(features_test)

print "predicting time:", round(time()-t1, 3), "s"

print "-------------------------"
print "Test Metrics:"
print "Accuracy:", accuracy_score(pred, labels_test)
print "Precision Score:", precision_score(labels_test, pred)
print "F1 Score:", f1_score(labels_test, pred)
print "Recall Score:", recall_score(labels_test, pred)
Naive Bayes Results:
training time: 0.002 s
predicting time: 0.001 s
-------------------------
Test Metrics:
Accuracy: 0.782122905028
Precision Score: 0.706896551724
F1 Score: 0.677685950413
Recall Score: 0.650793650794

Naive Bayes - PCA:

In [34]:
#### PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=3).fit(features_train)

X_train_pca = pca.transform(features_train)
X_test_pca = pca.transform(features_test)

print ""
print ""
print "Naive Bayes PCA:"

t0 = time()
clf.fit(X_train_pca, labels_train)
print "training time:", round(time()-t0, 3), "s"

t1 = time()
pred = clf.predict(X_test_pca)


print "predicting time:", round(time()-t1, 3), "s"

print "Explained Variance:", pca.explained_variance_ratio_

print "-------------------------"
print "Test Metrics:"
print "Accuracy:", accuracy_score(pred, labels_test)
print "Precision Score:", precision_score(labels_test, pred)
print "F1 Score:", f1_score(labels_test, pred)
print "Recall Score:", recall_score(labels_test, pred)
print "-------------------------"

Naive Bayes PCA:
training time: 0.001 s
predicting time: 0.0 s
Explained Variance: [  9.97259971e-01   2.00656636e-03   7.33462577e-04]
-------------------------
Test Metrics:
Accuracy: 0.748603351955
Precision Score: 0.640625
F1 Score: 0.645669291339
Recall Score: 0.650793650794
-------------------------

PCA Explanation

Principal component analysis transforms the features into principal components such that the first principal component captures the highest variance, in this case 99.7%. This way we turned the features "Sex", "Age" and "Passenger Class" into essentially one feature that explains the vast majority of the variance in the dataset.

The Naive Bayes classifier using PCA underperforms the one without PCA by roughly 3.4 percentage points, but in exchange we reduce the complexity of the model from three features to more or less one (the other components are nearly irrelevant, explaining less than 0.3% of the data's variance combined). Note that the PCA here runs on unscaled features, so the first component is dominated by Age, whose numeric range dwarfs that of Sex and Pclass.
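
A minimal sketch of what changes when the features are standardized before PCA (assuming the same train/test split; StandardScaler comes from sklearn.preprocessing):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Standardize so each feature has unit variance, then re-run PCA;
## the variance should now spread across the components instead of
## concentrating in the Age-dominated first one.
scaler = StandardScaler().fit(features_train)
pca_scaled = PCA(n_components=3).fit(scaler.transform(features_train))
print "Explained Variance (scaled):", pca_scaled.explained_variance_ratio_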

SVM

In [35]:
##### SVM
print ""
print 'SVM Results:'
from sklearn import svm, grid_search

parameters = {'kernel': ['rbf'], 'C':[100, 250, 1000, 2500, 10000, 25000, 50000]}
svr = svm.SVC()
clf = grid_search.GridSearchCV(svr, parameters) 


t0 = time() 
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t1 = time()
pred = clf.predict(features_test)
print "Accuracy:", accuracy_score(pred, labels_test)
print "predicting time:", round(time()-t1, 3), "s"

print "-------------------------"
print "Test Metrics:"
print "Accuracy:", accuracy_score(pred, labels_test)
print "Precision Score:", precision_score(labels_test, pred)
print "F1 Score:", f1_score(labels_test, pred)
print "Recall Score:", recall_score(labels_test, pred)
print "-------------------------"
SVM Results:
training time: 0.963 s
Accuracy: 0.810055865922
predicting time: 0.003 s
-------------------------
Test Metrics:
Accuracy: 0.810055865922
Precision Score: 0.716417910448
F1 Score: 0.738461538462
Recall Score: 0.761904761905
-------------------------

Prediction

We can use this classifier to predict whether a person survived, with an accuracy of 81%, using gender, age, and cabin class.

In the first example we enter the data for a 24-year-old female first-class passenger (1 = female, 0 = male), and our model outputs a 1, which stands for survival.

In [36]:
print "Female, 24, 1st class: ", clf.predict([[1, 24, 1]])
print "Male, 36, 2nd class: ", clf.predict([[0, 36, 2]])
print "Female, 10, 3rd class: ", clf.predict([[1, 10, 3]])
Female, 24, 1st class:  [ 1.]
Male, 36, 2nd class:  [ 0.]
Female, 10, 3rd class:  [ 0.]

Random Forest

In [37]:
print ""
print 'Random Forest Results:'
from sklearn import ensemble

clf = ensemble.RandomForestClassifier(n_estimators=100)

t0 = time() 
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t1 = time()
pred = clf.predict(features_test)
print "Accuracy:", accuracy_score(pred, labels_test)
print "predicting time:", round(time()-t1, 3), "s"

print "-------------------------"
print "Test Metrics:"
print "Accuracy:", accuracy_score(pred, labels_test)
print "Precision Score:", precision_score(labels_test, pred)
print "F1 Score:", f1_score(labels_test, pred)
print "Recall Score:", recall_score(labels_test, pred)
print "-------------------------"
Random Forest Results:
training time: 0.612 s
Accuracy: 0.798882681564
predicting time: 0.019 s
-------------------------
Test Metrics:
Accuracy: 0.798882681564
Precision Score: 0.707692307692
F1 Score: 0.71875
Recall Score: 0.730158730159
-------------------------
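
Random forests also expose per-feature importance scores; a quick peek (assuming the fitted clf above and the feature order Sex, Age, Pclass from features_list):

## Which features the forest leaned on most:
for name, importance in zip(['Sex', 'Age', 'Pclass'], clf.feature_importances_):
    print name, ":", importance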

Gradient Boosting

In [38]:
print ""
print 'Gradient Boosting Results:'

clf = ensemble.GradientBoostingClassifier(n_estimators=100)

t0 = time() 
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t1 = time()
pred = clf.predict(features_test)
print "Accuracy:", accuracy_score(pred, labels_test)
print "predicting time:", round(time()-t1, 3), "s"

print "-------------------------"
print "Test Metrics:"
print "Accuracy:", accuracy_score(pred, labels_test)
print "Precision Score:", precision_score(labels_test, pred)
print "F1 Score:", f1_score(labels_test, pred)
print "Recall Score:", recall_score(labels_test, pred)
print "-------------------------"
Gradient Boosting Results:
training time: 0.115 s
Accuracy: 0.821229050279
predicting time: 0.002 s
-------------------------
Test Metrics:
Accuracy: 0.821229050279
Precision Score: 0.762711864407
F1 Score: 0.737704918033
Recall Score: 0.714285714286
-------------------------
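
All of the scores above come from a single 80/20 split, so they carry some variance. A minimal sketch of a more robust estimate via cross-validation (using the same era-appropriate sklearn.cross_validation module and the features/labels from above):

from sklearn.cross_validation import cross_val_score

## 5-fold cross-validated accuracy for the gradient boosting model:
scores = cross_val_score(ensemble.GradientBoostingClassifier(n_estimators=100),
                         features, labels, cv=5)
print "CV accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2)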