For this project, we joined two datasets to produce our plots. There are an endless list of fields that one could join on, like dates, gender, state, and so on. We decided to join two datasets based on the common zip code field present in both to see if the number of fast food locations in a zip code is influenced by the number of people living in that zip code and/or the overall wealth of the zip code as measured by the mean and median salary.
The first dataset comes from Fast Food Maps, a website created to show all restaruant locations of the top ten fast food chains in America. The data is from the summer of 2007. Here is a summary of that dataset:
require("jsonlite")
## Loading required package: jsonlite
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
# Loads the data from Fast Food table into Fast Food dataframe
# Change the USER and PASS below to be your UTEid
fast_food <- data.frame(fromJSON(getURL(URLencode('129.152.144.84:5001/rest/native/?query="select * from FASTFOODMAPS_LOCATIONS_2007"'),httpheader=c(DB='jdbc:oracle:thin:@129.152.144.84:1521/PDBF15DV.usuniversi01134.oraclecloud.internal', USER='cs329e_sm44585', PASS='orcl_sm44585', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE), ))
summary(fast_food)
## RESTAURANT ADDRESS CITY
## m :13745 5505 Blue Lagoon Drive: 4 Houston : 487
## b : 7435 100 Main Street : 3 Los Angeles: 309
## p : 6214 265 Washington Street : 3 Las Vegas : 294
## t : 6164 3000 Island Ave : 3 Chicago : 266
## w : 5971 303 Main St : 3 Columbus : 258
## k : 5424 808 Main St : 3 Phoenix : 256
## (Other): 5049 (Other) :49983 (Other) :48132
## STATE ZIP PHONE ROW_NUM
## CA : 6078 Min. : 1001 null :20025 Min. : 1
## TX : 3844 1st Qu.:30260 () : 76 1st Qu.:12501
## FL : 2996 Median :48624 (216)524-4444: 28 Median :25002
## OH : 2336 Mean :53222 (901)362-3333: 25 Mean :25002
## IL : 2009 3rd Qu.:78222 (614)895-1111: 22 3rd Qu.:37502
## NY : 1916 Max. :99901 (205)978-8000: 21 Max. :50002
## (Other):30823 (Other) :29805
## LONGNITUDE LAT
## Min. :19.52 Min. :-159.59
## 1st Qu.:33.82 1st Qu.: -98.56
## Median :37.72 Median : -87.70
## Mean :37.27 Mean : -92.60
## 3rd Qu.:40.92 3rd Qu.: -81.33
## Max. :64.86 Max. : -67.28
##
head(fast_food)
## RESTAURANT ADDRESS CITY STATE ZIP
## 1 b 22740 Lorain Road Fairview Park OH 44126
## 2 b 2626 Franklin Road, S. W. Roanoke VA 24014
## 3 b 6730 Harrisburg Boulevard Houston TX 77011
## 4 b 655 Bridge Street North Weymouth MA 2190
## 5 b 4024 Davison Road Burton MI 48509
## 6 b 7314 Ritchie Highway Glen Burnie MD 21061
## PHONE ROW_NUM LONGNITUDE LAT
## 1 (440) 734-2200 186 41.4387 -81.8718
## 2 (540) 343-6038 187 37.2493 -79.9561
## 3 (713) 926-6559 188 29.7372 -95.3076
## 4 (781) 335-4764 189 42.2446 -70.9391
## 5 (313) 744-0734 190 43.0331 -83.6343
## 6 (301) 760-9834 191 39.1712 -76.6201
Here is the script we used to extract, transform, and load the fast food dataset into Oracle:
#Before running this R file make sure you set you working directory to where the CSV file located.
file_path <- "fastfoodmaps_locations_2007.csv"
df <- read.csv(file_path, stringsAsFactors = FALSE)
# Replace "." (i.e., period) with "_" in the column names.
names(df) <- gsub("\\.+", "_", names(df))
str(df) # Uncomment this and run just the lines to here to get column types to use for getting the list of measures.
# Generate List of Measures
measures <- c("Row_num", "Longnitude", "Lat")
# Make Zip codes all five digits
df$Zip <- gsub(df$Zip, pattern="-.*", replacement = "")
# remove leading zero on zip codes to match other data set
df$Zip <- gsub(df$Zip, pattern="^0", replacement = "")
# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
dimensions <- setdiff(names(df), measures)
#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
# Get rid of " and ' in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
# Change & to and in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
# Change : to ; in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
}
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
}
}
write.csv(df, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")
tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
sql <- paste(sql, paste(d, "varchar2(4000),\n"))
}
}
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
else sql <- paste(sql, paste(m, "number(38,4)\n"))
}
}
sql <- paste(sql, ");")
cat(sql)
The second dataset comes from the University of Michigan’s Population Studies Center. The dataset originally came from the 2010 American Community Survey and the Center stripped out all of the columns except ZIP, POP, MEAN, and MEDIAN salaries. Here is a summary of that dataset:
require("jsonlite")
require("RCurl")
# Loads median, mean, and population data into Zip Code dataframe
zip_code <- data.frame(fromJSON(getURL(URLencode('129.152.144.84:5001/rest/native/?query="select * from MedianZIP32"'),httpheader=c(DB='jdbc:oracle:thin:@129.152.144.84:1521/PDBF15DV.usuniversi01134.oraclecloud.internal', USER='cs329e_sm44585', PASS='orcl_sm44585', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE), ))
zip_code$MEAN <- as.numeric(levels(zip_code$MEAN))[zip_code$MEAN]
## Warning: NAs introduced by coercion
summary(zip_code)
## ZIP MEDIAN MEAN POP
## Min. : 1001 Min. : 33 Min. : 54 Min. : 1
## 1st Qu.:27301 1st Qu.: 38462 1st Qu.: 48593 1st Qu.: 736
## Median :49875 Median : 46504 Median : 56950 Median : 2756
## Mean :49875 Mean : 50938 Mean : 63452 Mean : 9193
## 3rd Qu.:72134 3rd Qu.: 58256 3rd Qu.: 70341 3rd Qu.: 12513
## Max. :99929 Max. :223106 Max. :361842 Max. :113916
## NA's :7
head(zip_code)
## ZIP MEDIAN MEAN POP
## 1 1001 56663 66688 16445
## 2 1002 49853 75063 28069
## 3 1003 28462 35121 8491
## 4 1005 75423 82442 4798
## 5 1007 79076 85802 12962
## 6 1008 63980 78391 1244
Here is the script we used to extract, transform, and load the zip code dataset into Oracle:
#Before running this R file make sure you set you working directory to where the CSV file located.
file_path <- "MedianZIP-3-2.csv"
df <- read.csv(file_path, stringsAsFactors = FALSE)
# Replace "." (i.e., period) with "_" in the column names.
names(df) <- gsub("\\.+", "_", names(df))
str(df) # Uncomment this and run just the lines to here to get column types to use for getting the list of measures.
measures <- c("Median", "Mean", "Pop")
# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
dimensions <- setdiff(names(df), measures)
#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
# Get rid of " and ' in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
# Change & to and in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
# Change : to ; in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
}
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
}
}
df$Median <- as.numeric(levels(df$Median))[df$Median]
df$Mean <- as.numeric(levels(df$Mean))[df$Mean]
df$Pop <- as.numeric(levels(df$Pop))[df$Pop]
write.csv(df, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")
tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
sql <- paste(sql, paste(d, "varchar2(4000),\n"))
}
}
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
else sql <- paste(sql, paste(m, "number(38,4)\n"))
}
}
sql <- paste(sql, ");")
cat(sql)
This is how Rstudio is set up in order to execute the experiment and produce these results:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RCurl_1.95-4.7 bitops_1.0-6 jsonlite_0.9.16
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8 knitr_1.11
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.7.2
source("../02 Data Wrangling/Project3_Plot1.R", echo = TRUE)
##
## > require(tidyr)
## Loading required package: tidyr
##
## > require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## > require(ggplot2)
## Loading required package: ggplot2
##
## > Plot1_df <- dplyr::left_join(fast_food, zip_code,
## + by = "ZIP")
##
## > Plot1_df$RESTAURANT <- factor(Plot1_df$RESTAURANT,
## + levels = c("m", "b", "p", "t", "w", "j", "h", "c", "i", "k"),
## + labels = c("McDonalds ..." ... [TRUNCATED]
##
## > Plot1_df <- Plot1_df %>% select(ZIP, MEDIAN, RESTAURANT)
##
## > Plot1_df <- Plot1_df %>% mutate(MEDIAN_SAL_PERCENT = cume_dist(Plot1_df$MEDIAN)) %>%
## + filter(MEDIAN_SAL_PERCENT <= 0.1 | MEDIAN_SAL_PERCENT >= .... [TRUNCATED]
##
## > Plot1_df <- Plot1_df %>% group_by(ZIP) %>% summarize(MEDIAN = first(MEDIAN),
## + TOTAL_RESTAURANTS = n())
##
## > ggplot() + coord_cartesian() + scale_x_continuous() +
## + scale_y_continuous() + labs(title = "Top & Bottom 10 percent of Zip codes by median sal ..." ... [TRUNCATED]
This plot was created using a LEFT_JOIN to see whether there exists a relationship between the wealth of the zip code (we used median salary as a proxy for wealth) and the number of fast food restarants in that zip code. So, we looked at the top and bottom ten percent of zip codes based on median salary. We intially believed that the wealthier the zip code was, the less fast food restaurants there would be because wealthy people tend to have healthier foods because they can afford it. The plot seems to confirm that phenomenon.
Going off the same logic, we thought that the poorer a zip code was, the more fast food restaurants that would be located in that zip code. However, the plot shows that this does not seem to be the case, and instead implies that the zip codes with the middle 80% of wealth (middle class) have the most number of fast food restaurants.
Plot1_df %>% select(ZIP, MEDIAN, RESTAURANT) %>% mutate(MEDIAN_SAL_PERCENT = cume_dist(Plot1_df$MEDIAN)) %>% filter(MEDIAN_SAL_PERCENT <= .1 | MEDIAN_SAL_PERCENT >= .9) %>% group_by(ZIP) %>% summarize(MEDIAN = first(MEDIAN), TOTAL_RESTAURANTS = n())
source("../02 Data Wrangling/Project3_Plot2.R", echo = TRUE)
##
## > require(tidyr)
##
## > require(dplyr)
##
## > require(ggplot2)
##
## > Plot2_df <- dplyr::semi_join(zip_code, fast_food,
## + by = "ZIP")
##
## > Plot2_df <- dplyr::inner_join(Plot2_df, fast_food,
## + by = "ZIP")
##
## > Plot2_df$RESTAURANT <- factor(Plot2_df$RESTAURANT,
## + levels = c("m", "b", "p", "t", "w", "j", "h", "c", "i", "k"),
## + labels = c("McDonalds ..." ... [TRUNCATED]
##
## > Plot2_df <- Plot2_df %>% select(ZIP, MEDIAN, RESTAURANT,
## + POP)
##
## > Plot2_df_both <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>%
## + filter(POP_PERCENT <= 0.1 | POP_PERCENT >= 0.9)
##
## > Plot2_df_bot10 <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>%
## + filter(POP_PERCENT <= 0.1)
##
## > Plot2_df_top10 <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>%
## + filter(POP_PERCENT >= 0.9)
##
## > Plot2_df_both <- Plot2_df_both %>% group_by(ZIP) %>%
## + summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
##
## > Plot2_df_bot10 <- Plot2_df_bot10 %>% group_by(ZIP) %>%
## + summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
##
## > Plot2_df_top10 <- Plot2_df_top10 %>% group_by(ZIP) %>%
## + summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
##
## > ggplot() + coord_cartesian() + scale_x_continuous() +
## + scale_y_continuous() + labs(title = "Top & Bottom 10 percent of Zip Codes by Population ..." ... [TRUNCATED]
This plot was created using a SEMI_JOIN to first eliminate any zip codes in the zip code dataset where there wasn’t a corresponding restaurant location in the fast food dataset. Then an INNER_JOIN was used to combine both datasets to see whether there exists a relationship between the population density of the zip code and the number of fast food restaruants in that zip code. We filtered the data by the top and bottom 10 percent of zip codes by population. Our hypothesis was that the more people that lived in a zip code, the more fast food restaurant locations there would be in that particular zip code.
The plot did reveal such a trend, but what was so suprising and interesting was how stark that trend was: in looking at box plots for the top and bottom 10 percent, there is almost a 10 fold difference in the number of fast food restaurant locations between the least and most populated zip codes. For the bottom 10 percent, the average total number of fast food restaruants was just slightly above one, whereas for the most populated, that number was almost ten locations!
Plot2_df_both <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% filter(POP_PERCENT <= .1 | POP_PERCENT >= .9) %>% group_by(ZIP) %>% summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
Plot2_df_bot10 <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% filter(POP_PERCENT <= .1) %>% group_by(ZIP) %>% summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
Plot2_df_top10 <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% filter(POP_PERCENT >= .9) %>% group_by(ZIP) %>% summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
source("../02 Data Wrangling/Project3_Plot3.R", echo = TRUE)
##
## > require(tidyr)
##
## > require(dplyr)
##
## > require(ggplot2)
##
## > Plot3_df <- dplyr::inner_join(fast_food, zip_code,
## + by = "ZIP")
##
## > Plot3_df$RESTAURANT <- factor(Plot3_df$RESTAURANT,
## + levels = c("m", "b", "p", "t", "w", "j", "h", "c", "i", "k"),
## + labels = c("McDonalds ..." ... [TRUNCATED]
##
## > Plot3_df <- Plot3_df %>% select(ZIP, MEDIAN, RESTAURANT,
## + POP)
##
## > Plot3_df <- Plot3_df %>% group_by(ZIP, POP) %>% summarize(TOTAL_RESTAURANTS = n()) %>%
## + ungroup() %>% arrange(POP)
##
## > ggplot() + coord_cartesian() + scale_x_continuous() +
## + scale_y_continuous() + labs(title = "Relationship of zip code population and number of ..." ... [TRUNCATED]
This plot was created using a INNER_JOIN to see whether there exists the ability to predict the number of fast food restaruants in a zip code given its population. Before creating this plot, plot 2 confirmed for us that the more populated zip codes had much higher total fast food reastuarant locations. As such, we hypothesized that there was a positive trend between those two variables because one could reason that the more people there are, the more restaruants there would be to serve all of those people.
However, upon adding the trend line, the plot tells a different story and completely surprised us. When a zip code has more than roughly 80,000 people living within it, the number of fast food restaurant locations not only levels off, but the trend actually suggests that the number of locations actually decreases, which is not what we orignally hypothesized.
Plot3_df <- Plot3_df %>% select(ZIP, MEDIAN, RESTAURANT, POP) %>%group_by(ZIP,POP) %>% summarize(TOTAL_RESTAURANTS = n()) %>%ungroup()%>%arrange(POP)