RMS Titanic

The RMS Titanic began her maiden voyage on April 10, 1912, and sank five days later after striking an iceberg. To learn more about the Titanic, you can visit this page on Wikipedia.

Session Information

Before showing the plots, here is some information about how my R session is set up:

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.2     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   rmarkdown_0.8   knitr_1.11     
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7.2

The Dataset

The data set used to produce the following graphs contains information on almost 900 passengers of that fateful ship. Here is a subset and summary of that data:

source("../01 Data/Titanic_DataFrame.R", echo = TRUE)
## 
## > require("jsonlite")
## Loading required package: jsonlite
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View
## 
## > require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
## 
## > Titanic_df <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from TITANIC\""), 
## +     httpheader = c(DB = " ..." ... [TRUNCATED] 
## 
## > summary(Titanic_df)
##                                       PASSENGERID  SURVIVED    PCLASS   
##  </pre></body></html>Xtext/csvUUTF-8(?N`vPJS:  1   0   :549   1   :216  
##  1                                          :  1   1   :342   2   :184  
##  10                                         :  1   null:  2   3   :491  
##  100                                        :  1              null:  2  
##  101                                        :  1                        
##  102                                        :  1                        
##  (Other)                                    :887                        
##                                     NAME         SEX           AGE     
##  null                                 :  2   female:314   null   :179  
##  Abbing, Mr. Anthony                  :  1   male  :577   24     : 30  
##  Abbott, Mr. Rossmore Edward          :  1   null  :  2   22     : 27  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                18     : 26  
##  Abelson, Mr. Samuel                  :  1                19     : 25  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                28     : 25  
##  (Other)                              :886                (Other):581  
##      SIBSP         PARCH          TICKET                  FARE    
##  0      :608   0      :678   1601    :  7   8.05000019073486: 43  
##  1      :209   1      :118   347082  :  7   13              : 42  
##  2      : 28   2      : 80   CA. 2343:  7   7.89580011367798: 38  
##  4      : 18   3      :  5   3101295 :  6   7.75            : 34  
##  3      : 16   5      :  5   347088  :  6   26              : 31  
##  8      :  7   4      :  4   CA 2144 :  6   10.5            : 24  
##  (Other):  7   (Other):  3   (Other) :854   (Other)         :681  
##          CABIN     EMBARKED  
##  null       :689   C   :168  
##  B96 B98    :  4   null:  4  
##  C23 C25 C27:  4   Q   : 77  
##  G6         :  4   S   :644  
##  C22 C26    :  3             
##  D          :  3             
##  (Other)    :186             
## 
## > head(Titanic_df)
##   PASSENGERID SURVIVED PCLASS                        NAME    SEX AGE SIBSP
## 1         207        0      3  Backstrom, Mr. Karl Alfred   male  32     1
## 2         208        1      3 Albimona, Mr. Nassef Cassem   male  26     0
## 3         209        1      3     Carr, Miss. Helen Ellen female  16     0
## 4         210        1      1            Blank, Mr. Henry   male  40     0
## 5         211        0      3              Ali, Mr. Ahmed   male  24     0
## 6         212        1      2  Cameron, Miss. Clear Annie female  35     0
##   PARCH             TICKET             FARE CABIN EMBARKED
## 1     0            3101278 15.8500003814697  null        S
## 2     0               2699 18.7875003814697  null        C
## 3     0             367231             7.75  null        Q
## 4     0             112277               31   A31        C
## 5     0 SOTON/O.Q. 3101311 7.05000019073486  null        S
## 6     0       F.C.C. 13528               21  null        S
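
The summary above counts values per level because every column arrives as text, with missing entries spelled "null". Below is a minimal sketch of one possible clean-up step. The helper name clean_nulls is hypothetical and not part of the original scripts. The first two sample rows reuse values from the head() output above, and the third is an all-"null" row like the ones the summary reports.

```r
# Three sample rows: every column is text, and missing values are the string "null".
raw <- data.frame(
  SURVIVED = c("1", "1", "null"),
  SEX      = c("female", "male", "null"),
  AGE      = c("16", "40", "null"),
  FARE     = c("7.75", "31", "null"),
  stringsAsFactors = TRUE  # mimic the factor columns seen in summary(Titanic_df)
)

# Hypothetical helper: re-read each column as text, treating "null" as NA,
# and let type.convert() pick integer, double or character per column.
clean_nulls <- function(df) {
  df[] <- lapply(df, function(col) {
    type.convert(as.character(col), na.strings = "null", as.is = TRUE)
  })
  df
}

clean <- clean_nulls(raw)
str(clean)      # SEX is character; SURVIVED, AGE and FARE are numeric
summary(clean)  # numeric columns now report quantiles and an NA count
```

With a step like this applied to Titanic_df, the plots would no longer need as.numeric(as.character(...)) around AGE and FARE.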

Other data frames

These other two data frames form the basis for the other 5 visualization plots:

source("../01 Data/Titanic_Age_Less_Ten.r", echo = TRUE)
## 
## > require("jsonlite")
## 
## > require("RCurl")
## 
## > Titanic_Age_Less_Ten_df <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from TITANIC where SEX IS NOT NUL .... [TRUNCATED]
source("../01 Data/Titanic_NoNullSex_df.r", echo = TRUE)
## 
## > require("jsonlite")
## 
## > require("RCurl")
## 
## > Titanic_NoNullSex_df <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from TITANIC where SEX IS NOT NULL\" ..." ... [TRUNCATED]

How to Reproduce the following graphs:

There are a total of 6 plots in this document. To reproduce them, you need to:

  1. Load the three data frames from the 01 Data folder in our GitHub repository.
  2. Run the visualization scripts in the 02 Visualizations folder.
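
The two steps above can be sketched as a single script. The file names are the ones passed to source() elsewhere in this document. The helper run_scripts is hypothetical, and the sketch assumes the working directory is the folder that holds this document, so that the relative paths resolve.

```r
# Scripts that build the three data frames (step 1).
data_scripts <- file.path("..", "01 Data", c(
  "Titanic_DataFrame.R",
  "Titanic_Age_Less_Ten.r",
  "Titanic_NoNullSex_df.r"
))

# Scripts that draw the six plots (step 2).
plot_scripts <- file.path("..", "02 Visualizations", c(
  "Age_Fare_Sex.R",
  "Fare_Sex_Age_NoNull.R",
  "Fare_Survived_Character.R",
  "Fare_Survived_Sex.R",
  "Fare_Survived_Sex_Ten.r",
  "New_Interesting_plot.r"
))

# Hypothetical helper: source every script that exists and report the rest.
run_scripts <- function(paths) {
  found <- file.exists(paths)
  for (p in paths[!found]) message("Skipping missing script: ", p)
  for (p in paths[found]) source(p, echo = TRUE)
  invisible(found)
}

loaded <- run_scripts(data_scripts)
# Only draw the plots once all three data frames have been loaded.
if (all(loaded)) run_scripts(plot_scripts)
```

The data scripts run first because every visualization script expects its data frame to exist already.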

Plot 1: Fare and Age of Titanic Passengers broken down by gender

This plot looks at the relationship between Fare and Age based on the gender of the passenger. This plot includes null gender values.

source("../02 Visualizations/Age_Fare_Sex.R", echo = TRUE)
## 
## > require(ggplot2)
## Loading required package: ggplot2
## 
## > ggplot(data = Titanic_df, aes(x = as.numeric(as.character(AGE)), 
## +     y = as.numeric(as.character(FARE)), color = SEX)) + coord_cartesian() + 
## +   .... [TRUNCATED]
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: Removed 179 rows containing missing values (geom_point).

Here is the R code used to produce this plot:

require(ggplot2)
ggplot(data=Titanic_df,aes(x=as.numeric(as.character(AGE)), y=as.numeric(as.character(FARE)), color=SEX)) + 
  coord_cartesian() + 
  scale_x_continuous() +
  scale_y_continuous() +
  labs(title='Titanic',x="Age", y=paste("Fare")) +
  geom_point(stat="identity",position=position_jitter(width=.3, height=0))

Here is the SQL statement used to produce this plot:

select * from Titanic;
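
The plot code wraps AGE and FARE in as.numeric(as.character(...)) and the log shows "NAs introduced by coercion". The small example below illustrates why, assuming AGE is stored as a factor as the summary output suggests. The four values are made up for illustration.

```r
# A factor column like AGE in Titanic_df, including one "null" entry.
age <- factor(c("32", "26", "null", "16"))

# Converting the factor directly returns the internal level codes, not the ages.
codes <- as.numeric(age)
print(codes)  # 3 2 4 1

# Going through character first recovers the numbers; "null" cannot be
# parsed and becomes NA. Without suppressWarnings() R reports
# "NAs introduced by coercion", the warning seen in the output above.
ages <- suppressWarnings(as.numeric(as.character(age)))
print(ages)   # 32 26 NA 16

# Rows with NA cannot be drawn, which is what ggplot2 counts in its
# "Removed ... rows containing missing values" message.
sum(is.na(ages))  # 1
```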

Plot 2: Fare and Age of Titanic Passengers broken down by gender with no null sex

This plot looks at the relationship between Fare and Age based on gender. This plot does not contain null values for gender.

source("../02 Visualizations/Fare_Sex_Age_NoNull.R", echo = TRUE)
## 
## > require(ggplot2)
## 
## > df <- Titanic_NoNullSex_df
## 
## > ggplot(data = df, aes(x = as.numeric(as.character(AGE)), 
## +     y = as.numeric(as.character(FARE)), color = SEX)) + coord_cartesian() + 
## +     scale .... [TRUNCATED]
## Warning: NAs introduced by coercion
## Warning: Removed 177 rows containing missing values (geom_point).

Here is the R code used to produce this plot:

require(ggplot2)
df<-Titanic_NoNullSex_df
ggplot(data=df,aes(x=as.numeric(as.character(AGE)), y=as.numeric(as.character(FARE)), color=SEX)) + 
  coord_cartesian() + 
  scale_x_continuous() +
  scale_y_continuous() +
  labs(title='Titanic',x="Age", y=paste("Fare")) +
  geom_point(stat="identity",position=position_jitter(width=.3, height=0))

Here is the SQL statement used to produce this plot:

select * from Titanic where SEX IS NOT NULL;

Plot 3: Fare and Survival of Titanic Passengers broken down by gender

This plot looks at the relationship between Fare and Survival based on gender. This plot does not include null gender values.

source("../02 Visualizations/Fare_Survived_Character.R", echo = TRUE)
## 
## > require(ggplot2)
## 
## > df <- Titanic_NoNullSex_df
## 
## > ggplot(data = df, aes(x = SEX, y = as.numeric(as.character(FARE)), 
## +     color = as.character(SURVIVED))) + coord_cartesian() + scale_x_discrete()  .... [TRUNCATED]

Here is the R code used to produce this plot:

require(ggplot2)
df<-Titanic_NoNullSex_df
ggplot(data=df,aes(x=SEX, y=as.numeric(as.character(FARE)), color=as.character(SURVIVED))) + 
  coord_cartesian() + 
  scale_x_discrete() +
  scale_y_continuous() +
  labs(title='Titanic',x="SEX", y=paste("FARE")) +
  geom_point(stat="identity",position=position_jitter(width=.3, height=0))

Here is the SQL statement used to produce this plot:

select * from Titanic where SEX IS NOT NULL;

Plot 4: Fare and Survival of Titanic Passengers broken down by gender and passenger class

This plot looks at the relationship of Fare and Survival through the lens of both gender and passenger class. Again, no null gender values are included in this plot.

source("../02 Visualizations/Fare_Survived_Sex.R", echo = TRUE)
## 
## > require(extrafont)
## Loading required package: extrafont
## Registering fonts with R
## 
## > require(ggplot2)
## 
## > ggplot() + coord_cartesian() + scale_x_discrete() + 
## +     scale_y_continuous() + facet_grid(PCLASS ~ SURVIVED, labeller = label_both) + 
## +     labs .... [TRUNCATED]

Here is the R code used to produce this plot:

require(extrafont)
require(ggplot2)
ggplot() + 
  coord_cartesian() + 
  scale_x_discrete() +
  scale_y_continuous() +
  facet_grid(PCLASS~SURVIVED, labeller=label_both) +
  labs(title='Titanic') +
  labs(x="SEX", y=paste("FARE")) +
  layer(data=Titanic_NoNullSex_df, 
        mapping=aes(x=as.character(SEX), y=as.numeric(as.character(FARE)), color=SEX), 
        stat="identity", 
        stat_params=list(), 
        geom="point",
        geom_params=list(), 
        position=position_jitter(width=0.3, height=0)
  )

Here is the SQL statement used to produce this plot:

select * from Titanic where SEX IS NOT NULL;

Plot 5: Fare and Survival of Titanic Passengers 10 years or younger broken down by gender

This plot looks at the Fare and Survival of passengers who are 10 years old or younger and breaks it down by gender. It excludes null gender values and any passenger older than 10.

source("../02 Visualizations/Fare_Survived_Sex_Ten.r", echo = TRUE)
## 
## > require(extrafont)
## 
## > require(ggplot2)
## 
## > ggplot() + coord_cartesian() + scale_x_discrete() + 
## +     scale_y_continuous() + facet_grid(PCLASS ~ SURVIVED, labeller = label_both) + 
## +     labs .... [TRUNCATED]

Here is the R code used to produce this plot:

require(extrafont)
require(ggplot2)
ggplot() + 
  coord_cartesian() + 
  scale_x_discrete() +
  scale_y_continuous() +
  facet_grid(PCLASS~SURVIVED, labeller=label_both) +
  labs(title='Titanic where age <= 10') +
  labs(x="SEX", y=paste("FARE")) +
  layer(data=Titanic_Age_Less_Ten_df, 
        mapping=aes(x=as.character(SEX), y=as.numeric(as.character(FARE)), color=SEX), 
        stat="identity", 
        stat_params=list(), 
        geom="point",
        geom_params=list(), 
        position=position_jitter(width=0.3, height=0)
  )

Here is the SQL statement used to produce this plot:

select * from Titanic where SEX IS NOT NULL AND AGE<=10;
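
For comparison, the same filter can be applied in R to a data frame that has already been loaded. This is a sketch under two assumptions: the passengers data frame is made up for illustration, and missing values are taken to arrive as the string "null", as they do in the summary output earlier in this document.

```r
# Made-up rows covering each case the WHERE clause has to handle.
passengers <- data.frame(
  SEX = c("male", "female", "null", "female", "male"),
  AGE = c("4", "32", "2", "null", "10"),
  stringsAsFactors = FALSE
)

# "null" ages become NA; the warning is expected, so it is silenced here.
age <- suppressWarnings(as.numeric(passengers$AGE))

# SEX IS NOT NULL AND AGE <= 10. As in SQL, a missing age fails the test.
keep  <- passengers$SEX != "null" & !is.na(age) & age <= 10
young <- passengers[keep, ]
print(young)  # rows 1 and 5 remain
```

Filtering in the SQL statement, as the original script does, keeps the transfer small. Filtering in R avoids a second request when Titanic_df is already in memory.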

Plot 6: Fare charged by gender and passenger class (new interesting plot)

source("../02 Visualizations/New_Interesting_plot.r", echo = TRUE)
## 
## > require(extrafont)
## 
## > require(ggplot2)
## 
## > ggplot() + coord_cartesian() + scale_x_discrete() + 
## +     scale_y_continuous() + facet_grid(. ~ PCLASS) + labs(title = "Titanic fare's charged by c ..." ... [TRUNCATED]

This new plot examines the relationship between the fare charged and the passenger's gender while also showing how fare varies with passenger class. It confirms the idea that, on average, the lower the class, the lower the fare. A rather surprising observation, however, is that the difference between second and third class is small: the average fare charged to a second-class passenger is similar to the average fare charged to a third-class passenger. Also notice that among first-class passengers, women paid considerably higher average fares than men.

Here is the R code used to produce this plot:

require(extrafont)
require(ggplot2)
ggplot() + 
  coord_cartesian() + 
  scale_x_discrete() +
  scale_y_continuous() +
  facet_grid(.~PCLASS) +
  labs(title="Titanic fare's charged by class") +
  labs(x="Gender", y=paste("FARE")) +
  layer(data=Titanic_NoNullSex_df, 
        mapping=aes(x=SEX, y=as.numeric(as.character(FARE)), color=SEX), 
        stat="identity", 
        stat_params=list(), 
        geom="point",
        geom_params=list(), 
        position=position_jitter(width=0.3, height=0)
  )+
  layer(data=Titanic_NoNullSex_df,
        mapping=aes(x=SEX, y=as.numeric(as.character(FARE)), color=SEX),
        stat="boxplot",
        stat_params=list(),
        geom="boxplot",
        geom_params=list(color="red",fill="red", alpha=.4),
        position=position_identity()
  )

Here is the SQL statement used to produce this plot:

select * from Titanic where SEX IS NOT NULL;
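
The observations about average fares can be checked numerically as well as visually, since the centre line of a box plot marks the median rather than the mean. Below is a minimal sketch using aggregate(). To stay self-contained it uses only the six passengers from the head(Titanic_df) output earlier in this document, which is far too few rows to support any conclusion. The names sample_df and avg_fare are illustrative.

```r
# First six passengers, copied from the head(Titanic_df) output shown earlier.
sample_df <- data.frame(
  PCLASS = c("3", "3", "3", "1", "3", "2"),
  SEX    = c("male", "male", "female", "male", "male", "female"),
  FARE   = c("15.8500003814697", "18.7875003814697", "7.75",
             "31", "7.05000019073486", "21"),
  stringsAsFactors = FALSE
)

# FARE arrives as text, so convert it before averaging.
sample_df$FARE <- as.numeric(sample_df$FARE)

# Mean fare for every class and gender combination present in the data.
avg_fare <- aggregate(FARE ~ PCLASS + SEX, data = sample_df, FUN = mean)
print(avg_fare)
```

To run the check on the real data, replace sample_df with Titanic_NoNullSex_df and convert FARE with as.numeric(as.character(FARE)), as the plot code does.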