The RMS Titanic began her maiden voyage on April 10, 1912, and sank five days later after striking an iceberg. To learn more about the Titanic, you can visit this page on Wikipedia.
Before showing the plots, here is some information about how my RStudio session is set up:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8 knitr_1.11
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.7.2
The data set used to produce the following graphs contains information on almost 900 passengers of that fateful ship. Here is a subset and summary of that data:
source("../01 Data/Titanic_DataFrame.R", echo = TRUE)
##
## > require("jsonlite")
## Loading required package: jsonlite
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
##
## > require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
##
## > Titanic_df <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from TITANIC\""),
## + httpheader = c(DB = " ..." ... [TRUNCATED]
##
## > summary(Titanic_df)
## PASSENGERID SURVIVED PCLASS
## </pre></body></html>Xtext/csvUUTF-8(?N`vPJS: 1 0 :549 1 :216
## 1 : 1 1 :342 2 :184
## 10 : 1 null: 2 3 :491
## 100 : 1 null: 2
## 101 : 1
## 102 : 1
## (Other) :887
## NAME SEX AGE
## null : 2 female:314 null :179
## Abbing, Mr. Anthony : 1 male :577 24 : 30
## Abbott, Mr. Rossmore Edward : 1 null : 2 22 : 27
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 18 : 26
## Abelson, Mr. Samuel : 1 19 : 25
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 28 : 25
## (Other) :886 (Other):581
## SIBSP PARCH TICKET FARE
## 0 :608 0 :678 1601 : 7 8.05000019073486: 43
## 1 :209 1 :118 347082 : 7 13 : 42
## 2 : 28 2 : 80 CA. 2343: 7 7.89580011367798: 38
## 4 : 18 3 : 5 3101295 : 6 7.75 : 34
## 3 : 16 5 : 5 347088 : 6 26 : 31
## 8 : 7 4 : 4 CA 2144 : 6 10.5 : 24
## (Other): 7 (Other): 3 (Other) :854 (Other) :681
## CABIN EMBARKED
## null :689 C :168
## B96 B98 : 4 null: 4
## C23 C25 C27: 4 Q : 77
## G6 : 4 S :644
## C22 C26 : 3
## D : 3
## (Other) :186
##
## > head(Titanic_df)
## PASSENGERID SURVIVED PCLASS NAME SEX AGE SIBSP
## 1 207 0 3 Backstrom, Mr. Karl Alfred male 32 1
## 2 208 1 3 Albimona, Mr. Nassef Cassem male 26 0
## 3 209 1 3 Carr, Miss. Helen Ellen female 16 0
## 4 210 1 1 Blank, Mr. Henry male 40 0
## 5 211 0 3 Ali, Mr. Ahmed male 24 0
## 6 212 1 2 Cameron, Miss. Clear Annie female 35 0
## PARCH TICKET FARE CABIN EMBARKED
## 1 0 3101278 15.8500003814697 null S
## 2 0 2699 18.7875003814697 null C
## 3 0 367231 7.75 null Q
## 4 0 112277 31 A31 C
## 5 0 SOTON/O.Q. 3101311 7.05000019073486 null S
## 6 0 F.C.C. 13528 21 null S
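The summary above shows every column stored as a factor, with missing values encoded as the literal string `null`. Below is a minimal sketch of how such columns could be cleaned before plotting. It uses a small hypothetical data frame: `toy_df` and `null_to_na` are illustrative names and are not part of the original scripts.

```r
# Hypothetical stand-in for a few rows of Titanic_df: all columns are factors,
# as in the summary output, and missing values are the string "null".
toy_df <- data.frame(
  SEX  = c("male", "female", "null"),
  AGE  = c("32", "null", "16"),
  FARE = c("15.85", "7.75", "null"),
  stringsAsFactors = TRUE
)

# Convert a factor to character and turn the literal "null" into NA.
null_to_na <- function(x) {
  x <- as.character(x)
  x[x == "null"] <- NA
  x
}

clean_df <- data.frame(lapply(toy_df, null_to_na), stringsAsFactors = FALSE)

# With real NAs in place, numeric conversion no longer warns about coercion.
clean_df$AGE  <- as.numeric(clean_df$AGE)
clean_df$FARE <- as.numeric(clean_df$FARE)

# Count the missing values per column.
colSums(is.na(clean_df))
```

Applied to the real data, a step like this would let the plot code use `AGE` and `FARE` directly instead of converting them inside `aes()`.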
The following two data frames form the basis for the other five plots:
source("../01 Data/Titanic_Age_Less_Ten.r", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > Titanic_Age_Less_Ten_df <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from TITANIC where SEX IS NOT NUL .... [TRUNCATED]
source("../01 Data/Titanic_NoNullSex_df.r", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > Titanic_NoNullSex_df <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from TITANIC where SEX IS NOT NULL\" ..." ... [TRUNCATED]
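There are a total of six plots in this document. To reproduce them, you need to source the data scripts shown above and then each of the plot scripts in the sections that follow.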
This plot looks at the relationship between Fare and Age based on the gender of the passenger. This plot includes null gender values.
source("../02 Visualizations/Age_Fare_Sex.R", echo = TRUE)
##
## > require(ggplot2)
## Loading required package: ggplot2
##
## > ggplot(data = Titanic_df, aes(x = as.numeric(as.character(AGE)),
## + y = as.numeric(as.character(FARE)), color = SEX)) + coord_cartesian() +
## + .... [TRUNCATED]
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: Removed 179 rows containing missing values (geom_point).
Here is the R code used to produce this plot:
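require(ggplot2)
# AGE and FARE are factors, so convert them via character before plotting
ggplot(data = Titanic_df,
       aes(x = as.numeric(as.character(AGE)),
           y = as.numeric(as.character(FARE)),
           color = SEX)) +
  coord_cartesian() +
  scale_x_continuous() +
  scale_y_continuous() +
  labs(title = "Titanic", x = "Age", y = "Fare") +
  # jitter horizontally only, so overlapping points become visible
  geom_point(stat = "identity", position = position_jitter(width = 0.3, height = 0))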
Here is the SQL statement used to produce this plot:
select * from Titanic;
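The plot code wraps `AGE` and `FARE` in `as.numeric(as.character(...))` because those columns are factors. Here is a small sketch of why the intermediate `as.character()` matters, and where the "NAs introduced by coercion" warning comes from. The factor `age_f` is a hypothetical example.

```r
# Hypothetical factor mimicking the AGE column, including a "null" entry.
age_f <- factor(c("32", "26", "null", "16"))

# Calling as.numeric() directly returns the internal level codes, not the ages.
codes <- as.numeric(age_f)

# Going through character first recovers the values. "null" cannot be parsed,
# so it becomes NA (this is what triggers the coercion warning in the output).
ages <- suppressWarnings(as.numeric(as.character(age_f)))

codes
ages
```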
This plot looks at the relationship between Fare and Age based on gender. This plot does not contain null values for gender.
source("../02 Visualizations/Fare_Sex_Age_NoNull.R", echo = TRUE)
##
## > require(ggplot2)
##
## > df <- Titanic_NoNullSex_df
##
## > ggplot(data = df, aes(x = as.numeric(as.character(AGE)),
## + y = as.numeric(as.character(FARE)), color = SEX)) + coord_cartesian() +
## + scale .... [TRUNCATED]
## Warning: NAs introduced by coercion
## Warning: Removed 177 rows containing missing values (geom_point).
Here is the R code used to produce this plot:
require(ggplot2)
df <- Titanic_NoNullSex_df
# Same plot as before, but on the data frame without null SEX values
ggplot(data = df,
       aes(x = as.numeric(as.character(AGE)),
           y = as.numeric(as.character(FARE)),
           color = SEX)) +
  coord_cartesian() +
  scale_x_continuous() +
  scale_y_continuous() +
  labs(title = "Titanic", x = "Age", y = "Fare") +
  geom_point(stat = "identity", position = position_jitter(width = 0.3, height = 0))
Here is the SQL statement used to produce this plot:
select * from Titanic where SEX IS NOT NULL;
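The `SEX IS NOT NULL` filter runs on the database side. As a rough sketch, a similar subset could be taken locally, assuming the full table has already been loaded and missing values appear as the string `null` (as they do in the summary output). The data frame `toy_df` is a hypothetical stand-in for `Titanic_df`.

```r
# Hypothetical stand-in for Titanic_df with one row whose SEX is "null".
toy_df <- data.frame(
  NAME = c("A", "B", "C"),
  SEX  = c("male", "null", "female"),
  stringsAsFactors = TRUE
)

# Keep only rows with a real SEX value, then drop the unused "null" level
# so it does not show up in plot legends.
no_null_sex <- droplevels(toy_df[toy_df$SEX != "null", ])

nrow(no_null_sex)
levels(no_null_sex$SEX)
```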
This plot looks at the relationship between Fare and Survival based on gender. This plot does not include null gender values.
source("../02 Visualizations/Fare_Survived_Character.R", echo = TRUE)
##
## > require(ggplot2)
##
## > df <- Titanic_NoNullSex_df
##
## > ggplot(data = df, aes(x = SEX, y = as.numeric(as.character(FARE)),
## + color = as.character(SURVIVED))) + coord_cartesian() + scale_x_discrete() .... [TRUNCATED]
Here is the R code used to produce this plot:
require(ggplot2)
df <- Titanic_NoNullSex_df
# Gender on the x axis, fare on the y axis, points colored by survival
ggplot(data = df,
       aes(x = SEX,
           y = as.numeric(as.character(FARE)),
           color = as.character(SURVIVED))) +
  coord_cartesian() +
  scale_x_discrete() +
  scale_y_continuous() +
  # the x axis shows SEX; SURVIVED is carried by the color legend
  labs(title = "Titanic", x = "SEX", y = "FARE") +
  geom_point(stat = "identity", position = position_jitter(width = 0.3, height = 0))
Here is the SQL statement used to produce this plot:
select * from Titanic where SEX IS NOT NULL;
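`position_jitter(width = .3, height = 0)` spreads the points horizontally so that passengers sharing the same x value do not sit on top of each other, while the fares themselves stay untouched. A base R sketch of the same idea uses `jitter()` on hypothetical x positions and fares:

```r
set.seed(1)  # make the random offsets reproducible

# Hypothetical data: two categories drawn at x = 1 and x = 2.
x <- rep(c(1, 2), each = 5)
y <- c(7.75, 8.05, 13, 26, 31, 7.9, 10.5, 15.85, 21, 18.79)

# Add uniform noise of at most 0.3 in either direction to x only.
x_jittered <- jitter(x, amount = 0.3)

# height = 0 corresponds to leaving y exactly as it is.
y_unchanged <- y

range(x_jittered - x)
```

Keeping the vertical jitter at zero matters here because the y axis carries the fare, and moving points vertically would misstate what passengers paid.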
This plot looks at the relationship of Fare and Survival through the lens of both gender and passenger class. Again, no null gender values are included in this plot.
source("../02 Visualizations/Fare_Survived_Sex.R", echo = TRUE)
##
## > require(extrafont)
## Loading required package: extrafont
## Registering fonts with R
##
## > require(ggplot2)
##
## > ggplot() + coord_cartesian() + scale_x_discrete() +
## + scale_y_continuous() + facet_grid(PCLASS ~ SURVIVED, labeller = label_both) +
## + labs .... [TRUNCATED]
Here is the R code used to produce this plot:
require(extrafont)
require(ggplot2)
# One panel per PCLASS (rows) and SURVIVED (columns); gender on the x axis
ggplot() +
  coord_cartesian() +
  scale_x_discrete() +
  scale_y_continuous() +
  facet_grid(PCLASS ~ SURVIVED, labeller = label_both) +
  labs(title = "Titanic") +
  # the x axis within each panel shows SEX; SURVIVED is a facet variable
  labs(x = "SEX", y = "FARE") +
  layer(data = Titanic_NoNullSex_df,
        mapping = aes(x = as.character(SEX), y = as.numeric(as.character(FARE)), color = SEX),
        stat = "identity",
        stat_params = list(),
        geom = "point",
        geom_params = list(),
        position = position_jitter(width = 0.3, height = 0)
  )
Here is the SQL statement used to produce this plot:
select * from Titanic where SEX IS NOT NULL;
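The `layer()` call above uses the `stat_params` and `geom_params` arguments of the ggplot2 version that was current when this document was written. In ggplot2 2.0.0 and later those arguments were replaced by a single `params` list, so the code as shown is not expected to run unchanged on a recent installation. Below is a hedged sketch of an equivalent faceted plot using `geom_point()`. The small data frame `toy_df` is hypothetical and stands in for `Titanic_NoNullSex_df`.

```r
library(ggplot2)

# Hypothetical stand-in for Titanic_NoNullSex_df (all columns as factors).
toy_df <- data.frame(
  SEX      = c("male", "female", "male", "female"),
  FARE     = c("15.85", "7.75", "31", "21"),
  PCLASS   = c("3", "3", "1", "2"),
  SURVIVED = c("0", "1", "1", "1"),
  stringsAsFactors = TRUE
)

# geom_point() replaces layer(geom = "point", stat = "identity", ...).
p <- ggplot(toy_df,
            aes(x = SEX, y = as.numeric(as.character(FARE)), color = SEX)) +
  facet_grid(PCLASS ~ SURVIVED, labeller = label_both) +
  labs(title = "Titanic", x = "SEX", y = "FARE") +
  geom_point(position = position_jitter(width = 0.3, height = 0))

# print(p) would draw the plot in an interactive session.
```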
This plot looks at the Fare and Survival of passengers who are 10 years old or younger and breaks it down based on gender. This plot does not include null gender values nor any passenger who is 11 years old or older.
source("../02 Visualizations/Fare_Survived_Sex_Ten.r", echo = TRUE)
##
## > require(extrafont)
##
## > require(ggplot2)
##
## > ggplot() + coord_cartesian() + scale_x_discrete() +
## + scale_y_continuous() + facet_grid(PCLASS ~ SURVIVED, labeller = label_both) +
## + labs .... [TRUNCATED]
Here is the R code used to produce this plot:
require(extrafont)
require(ggplot2)
# Same layout as the previous plot, restricted to passengers aged 10 or younger
ggplot() +
  coord_cartesian() +
  scale_x_discrete() +
  scale_y_continuous() +
  facet_grid(PCLASS ~ SURVIVED, labeller = label_both) +
  labs(title = "Titanic where age <= 10") +
  # the x axis within each panel shows SEX; SURVIVED is a facet variable
  labs(x = "SEX", y = "FARE") +
  layer(data = Titanic_Age_Less_Ten_df,
        mapping = aes(x = as.character(SEX), y = as.numeric(as.character(FARE)), color = SEX),
        stat = "identity",
        stat_params = list(),
        geom = "point",
        geom_params = list(),
        position = position_jitter(width = 0.3, height = 0)
  )
Here is the SQL statement used to produce this plot:
select * from Titanic where SEX IS NOT NULL AND AGE<=10;
source("../02 Visualizations/New_Interesting_plot.r", echo = TRUE)
##
## > require(extrafont)
##
## > require(ggplot2)
##
## > ggplot() + coord_cartesian() + scale_x_discrete() +
## + scale_y_continuous() + facet_grid(. ~ PCLASS) + labs(title = "Titanic fare's charged by c ..." ... [TRUNCATED]
This new plot examines the relationship between the fare charged and the passenger's gender while also comparing fares across passenger classes. It confirms the idea that, on average, the lower the class, the lower the fare. A rather surprising observation, however, is that the difference in average fare between second and third class is small enough that the two can be considered similar. Also notice that among first class passengers, the women paid noticeably higher average fares than the men.
Here is the R code used to produce this plot:
require(extrafont)
require(ggplot2)
ggplot() +
  coord_cartesian() +
  scale_x_discrete() +
  scale_y_continuous() +
  facet_grid(. ~ PCLASS) +
  labs(title = "Titanic fare's charged by class") +
  labs(x = "Gender", y = "FARE") +
  # first layer: individual fares as jittered points
  layer(data = Titanic_NoNullSex_df,
        mapping = aes(x = SEX, y = as.numeric(as.character(FARE)), color = SEX),
        stat = "identity",
        stat_params = list(),
        geom = "point",
        geom_params = list(),
        position = position_jitter(width = 0.3, height = 0)
  ) +
  # second layer: semi-transparent red boxplot drawn over the points
  layer(data = Titanic_NoNullSex_df,
        mapping = aes(x = SEX, y = as.numeric(as.character(FARE)), color = SEX),
        stat = "boxplot",
        stat_params = list(),
        geom = "boxplot",
        geom_params = list(color = "red", fill = "red", alpha = 0.4),
        position = position_identity()  # was misspelled "posiion" and silently ignored
  )
Here is the SQL statement used to produce this plot:
select * from Titanic where SEX IS NOT NULL;
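The observations about average fares can also be checked numerically rather than read off the plot. Below is a minimal sketch using `aggregate()` on a hypothetical data frame `toy_df`. The values are made up for illustration and are not results from the Titanic data.

```r
# Hypothetical fares by class and gender, stored as factors like the real data.
toy_df <- data.frame(
  PCLASS = c("1", "1", "1", "1", "2", "2", "3", "3"),
  SEX    = c("female", "female", "male", "male",
             "female", "male", "female", "male"),
  FARE   = c("100", "80", "60", "40", "20", "22", "12", "10"),
  stringsAsFactors = TRUE
)

# Convert the factor to numeric via character, as in the plot code.
toy_df$FARE_NUM <- as.numeric(as.character(toy_df$FARE))

# Mean fare for every combination of class and gender.
mean_fare <- aggregate(FARE_NUM ~ PCLASS + SEX, data = toy_df, FUN = mean)
mean_fare
```

The same call could be pointed at `Titanic_NoNullSex_df` once `FARE` has been converted, which would give the group means that the boxplot only hints at.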