What happens when you join Fast Food Restaurant Location data and Zip Code data?

For this project, we joined two datasets to produce our plots. There are an endless list of fields that one could join on, like dates, gender, state, and so on. We decided to join two datasets based on the common zip code field present in both to see if the number of fast food locations in a zip code is influenced by the number of people living in that zip code and/or the overall wealth of the zip code as measured by the mean and median salary.

Fast Food map dataset

The first dataset comes from Fast Food Maps, a website created to show all restaruant locations of the top ten fast food chains in America. The data is from the summer of 2007. Here is a summary of that dataset:

require("jsonlite")
## Loading required package: jsonlite
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View
require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
# Loads the data from Fast Food table into Fast Food dataframe
# Change the USER and PASS below to be your UTEid
fast_food <- data.frame(fromJSON(getURL(URLencode('129.152.144.84:5001/rest/native/?query="select * from FASTFOODMAPS_LOCATIONS_2007"'),httpheader=c(DB='jdbc:oracle:thin:@129.152.144.84:1521/PDBF15DV.usuniversi01134.oraclecloud.internal', USER='cs329e_sm44585', PASS='orcl_sm44585', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE), ))
summary(fast_food)
##    RESTAURANT                      ADDRESS               CITY      
##  m      :13745   5505 Blue Lagoon Drive:    4   Houston    :  487  
##  b      : 7435   100 Main Street       :    3   Los Angeles:  309  
##  p      : 6214   265 Washington Street :    3   Las Vegas  :  294  
##  t      : 6164   3000 Island Ave       :    3   Chicago    :  266  
##  w      : 5971   303 Main St           :    3   Columbus   :  258  
##  k      : 5424   808 Main St           :    3   Phoenix    :  256  
##  (Other): 5049   (Other)               :49983   (Other)    :48132  
##      STATE            ZIP                  PHONE          ROW_NUM     
##  CA     : 6078   Min.   : 1001   null         :20025   Min.   :    1  
##  TX     : 3844   1st Qu.:30260   ()           :   76   1st Qu.:12501  
##  FL     : 2996   Median :48624   (216)524-4444:   28   Median :25002  
##  OH     : 2336   Mean   :53222   (901)362-3333:   25   Mean   :25002  
##  IL     : 2009   3rd Qu.:78222   (614)895-1111:   22   3rd Qu.:37502  
##  NY     : 1916   Max.   :99901   (205)978-8000:   21   Max.   :50002  
##  (Other):30823                   (Other)      :29805                  
##    LONGNITUDE         LAT         
##  Min.   :19.52   Min.   :-159.59  
##  1st Qu.:33.82   1st Qu.: -98.56  
##  Median :37.72   Median : -87.70  
##  Mean   :37.27   Mean   : -92.60  
##  3rd Qu.:40.92   3rd Qu.: -81.33  
##  Max.   :64.86   Max.   : -67.28  
## 
head(fast_food)
##   RESTAURANT                   ADDRESS           CITY STATE   ZIP
## 1          b         22740 Lorain Road  Fairview Park    OH 44126
## 2          b 2626 Franklin Road, S. W.        Roanoke    VA 24014
## 3          b 6730 Harrisburg Boulevard        Houston    TX 77011
## 4          b         655 Bridge Street North Weymouth    MA  2190
## 5          b         4024 Davison Road         Burton    MI 48509
## 6          b      7314 Ritchie Highway    Glen Burnie    MD 21061
##            PHONE ROW_NUM LONGNITUDE      LAT
## 1 (440) 734-2200     186    41.4387 -81.8718
## 2 (540) 343-6038     187    37.2493 -79.9561
## 3 (713) 926-6559     188    29.7372 -95.3076
## 4 (781) 335-4764     189    42.2446 -70.9391
## 5 (313) 744-0734     190    43.0331 -83.6343
## 6 (301) 760-9834     191    39.1712 -76.6201

Fast Food dataset Extract, Transform, Load (ETL) script

Here is the script we used to extract, transform, and load the fast food dataset into Oracle:

#Before running this R file make sure you set you working directory to where the CSV file located.

file_path <- "fastfoodmaps_locations_2007.csv"

df <- read.csv(file_path, stringsAsFactors = FALSE)

# Replace "." (i.e., period) with "_" in the column names.
names(df) <- gsub("\\.+", "_", names(df))

str(df) # Uncomment this and  run just the lines to here to get column types to use for getting the list of measures.

# Generate List of Measures
measures <- c("Row_num", "Longnitude", "Lat")


# Make Zip codes all five digits
df$Zip <- gsub(df$Zip, pattern="-.*", replacement = "")
# remove leading zero on zip codes to match other data set
df$Zip <- gsub(df$Zip, pattern="^0", replacement = "")

# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
  df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}

dimensions <- setdiff(names(df), measures)

#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
  for(d in dimensions) {
    # Get rid of " and ' in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
    # Change & to and in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
    # Change : to ; in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
  }
}


# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions
if( length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
  }
}

write.csv(df, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")

tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
  for(d in dimensions) {
    sql <- paste(sql, paste(d, "varchar2(4000),\n"))
  }
}
if( length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
    else sql <- paste(sql, paste(m, "number(38,4)\n"))
  }
}
sql <- paste(sql, ");")
cat(sql)

A detailed explanation of each column in the fast food location data set:

  • RESTAURANT - First character of fast food restaurant name (b = Burger King, M = Mcdonald’s, etc)
  • ADDRESS - Fast food restaurant’s address
  • CITY - The city that the fast food restaurant is located in
  • STATE - The state that the fast food restaurant is located in
  • ZIP - Zip code for that specific fast food restaurant
  • PHONE - Phone number for each fast food restaurant
  • ROW_NUM - Record number
  • LONGNITUDE - GPS Longitude location
  • LAT - GPS latitude location

Zip code dataset

The second dataset comes from the University of Michigan’s Population Studies Center. The dataset originally came from the 2010 American Community Survey and the Center stripped out all of the columns except ZIP, POP, MEAN, and MEDIAN salaries. Here is a summary of that dataset:

require("jsonlite")
require("RCurl")
# Loads median, mean, and population data into Zip Code dataframe
zip_code <- data.frame(fromJSON(getURL(URLencode('129.152.144.84:5001/rest/native/?query="select * from MedianZIP32"'),httpheader=c(DB='jdbc:oracle:thin:@129.152.144.84:1521/PDBF15DV.usuniversi01134.oraclecloud.internal', USER='cs329e_sm44585', PASS='orcl_sm44585', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE), ))

zip_code$MEAN <- as.numeric(levels(zip_code$MEAN))[zip_code$MEAN]
## Warning: NAs introduced by coercion
summary(zip_code)
##       ZIP            MEDIAN            MEAN             POP        
##  Min.   : 1001   Min.   :    33   Min.   :    54   Min.   :     1  
##  1st Qu.:27301   1st Qu.: 38462   1st Qu.: 48593   1st Qu.:   736  
##  Median :49875   Median : 46504   Median : 56950   Median :  2756  
##  Mean   :49875   Mean   : 50938   Mean   : 63452   Mean   :  9193  
##  3rd Qu.:72134   3rd Qu.: 58256   3rd Qu.: 70341   3rd Qu.: 12513  
##  Max.   :99929   Max.   :223106   Max.   :361842   Max.   :113916  
##                                   NA's   :7
head(zip_code)
##    ZIP MEDIAN  MEAN   POP
## 1 1001  56663 66688 16445
## 2 1002  49853 75063 28069
## 3 1003  28462 35121  8491
## 4 1005  75423 82442  4798
## 5 1007  79076 85802 12962
## 6 1008  63980 78391  1244

Zip Code data set Extract, Transform, Load (ETL) script

Here is the script we used to extract, transform, and load the zip code dataset into Oracle:

#Before running this R file make sure you set you working directory to where the CSV file located.

file_path <- "MedianZIP-3-2.csv"

df <- read.csv(file_path, stringsAsFactors = FALSE)

# Replace "." (i.e., period) with "_" in the column names.
names(df) <- gsub("\\.+", "_", names(df))

str(df) # Uncomment this and  run just the lines to here to get column types to use for getting the list of measures.


measures <- c("Median", "Mean", "Pop")

# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
  df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}

dimensions <- setdiff(names(df), measures)
#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
  for(d in dimensions) {
    # Get rid of " and ' in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
    # Change & to and in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
    # Change : to ; in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
  }
}


# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions
if( length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
  }
}

df$Median <- as.numeric(levels(df$Median))[df$Median]
df$Mean <- as.numeric(levels(df$Mean))[df$Mean]
df$Pop <- as.numeric(levels(df$Pop))[df$Pop]

write.csv(df, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")

tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
  for(d in dimensions) {
    sql <- paste(sql, paste(d, "varchar2(4000),\n"))
  }
}
if( length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
    else sql <- paste(sql, paste(m, "number(38,4)\n"))
  }
}
sql <- paste(sql, ");")
cat(sql)

A detailed explanation of each column in the Median Zip code data set:

  • ZIP - US Postal Zip code
  • MEDIAN - Median salary of each zip code
  • MEAN - Mean salary of each zip code
  • POP - Population of each zip code

Session Info

This is how Rstudio is set up in order to execute the experiment and produce these results:

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RCurl_1.95-4.7  bitops_1.0-6    jsonlite_0.9.16
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.2     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   rmarkdown_0.8   knitr_1.11     
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7.2

Plot 1: Does the wealth of the zip code affect the number of fast food restaruants in a given zip code?

source("../02 Data Wrangling/Project3_Plot1.R", echo = TRUE)
## 
## > require(tidyr)
## Loading required package: tidyr
## 
## > require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## > require(ggplot2)
## Loading required package: ggplot2
## 
## > Plot1_df <- dplyr::left_join(fast_food, zip_code, 
## +     by = "ZIP")
## 
## > Plot1_df$RESTAURANT <- factor(Plot1_df$RESTAURANT, 
## +     levels = c("m", "b", "p", "t", "w", "j", "h", "c", "i", "k"), 
## +     labels = c("McDonalds ..." ... [TRUNCATED] 
## 
## > Plot1_df <- Plot1_df %>% select(ZIP, MEDIAN, RESTAURANT)
## 
## > Plot1_df <- Plot1_df %>% mutate(MEDIAN_SAL_PERCENT = cume_dist(Plot1_df$MEDIAN)) %>% 
## +     filter(MEDIAN_SAL_PERCENT <= 0.1 | MEDIAN_SAL_PERCENT >= .... [TRUNCATED] 
## 
## > Plot1_df <- Plot1_df %>% group_by(ZIP) %>% summarize(MEDIAN = first(MEDIAN), 
## +     TOTAL_RESTAURANTS = n())
## 
## > ggplot() + coord_cartesian() + scale_x_continuous() + 
## +     scale_y_continuous() + labs(title = "Top & Bottom 10 percent of Zip codes by median sal ..." ... [TRUNCATED]

This plot was created using a LEFT_JOIN to see whether there exists a relationship between the wealth of the zip code (we used median salary as a proxy for wealth) and the number of fast food restarants in that zip code. So, we looked at the top and bottom ten percent of zip codes based on median salary. We intially believed that the wealthier the zip code was, the less fast food restaurants there would be because wealthy people tend to have healthier foods because they can afford it. The plot seems to confirm that phenomenon.

Going off the same logic, we thought that the poorer a zip code was, the more fast food restaurants that would be located in that zip code. However, the plot shows that this does not seem to be the case, and instead implies that the zip codes with the middle 80% of wealth (middle class) have the most number of fast food restaurants.

Data Wrangling R Workflow to produce the dataframe that powers this plot:

Plot1_df %>% select(ZIP, MEDIAN, RESTAURANT) %>% mutate(MEDIAN_SAL_PERCENT = cume_dist(Plot1_df$MEDIAN)) %>% filter(MEDIAN_SAL_PERCENT  <= .1 | MEDIAN_SAL_PERCENT  >= .9) %>% group_by(ZIP) %>% summarize(MEDIAN = first(MEDIAN), TOTAL_RESTAURANTS = n())

Plot 2: Does population density influence the number of restaurants in a zip code?

source("../02 Data Wrangling/Project3_Plot2.R", echo = TRUE)
## 
## > require(tidyr)
## 
## > require(dplyr)
## 
## > require(ggplot2)
## 
## > Plot2_df <- dplyr::semi_join(zip_code, fast_food, 
## +     by = "ZIP")
## 
## > Plot2_df <- dplyr::inner_join(Plot2_df, fast_food, 
## +     by = "ZIP")
## 
## > Plot2_df$RESTAURANT <- factor(Plot2_df$RESTAURANT, 
## +     levels = c("m", "b", "p", "t", "w", "j", "h", "c", "i", "k"), 
## +     labels = c("McDonalds ..." ... [TRUNCATED] 
## 
## > Plot2_df <- Plot2_df %>% select(ZIP, MEDIAN, RESTAURANT, 
## +     POP)
## 
## > Plot2_df_both <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% 
## +     filter(POP_PERCENT <= 0.1 | POP_PERCENT >= 0.9)
## 
## > Plot2_df_bot10 <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% 
## +     filter(POP_PERCENT <= 0.1)
## 
## > Plot2_df_top10 <- Plot2_df %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% 
## +     filter(POP_PERCENT >= 0.9)
## 
## > Plot2_df_both <- Plot2_df_both %>% group_by(ZIP) %>% 
## +     summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
## 
## > Plot2_df_bot10 <- Plot2_df_bot10 %>% group_by(ZIP) %>% 
## +     summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
## 
## > Plot2_df_top10 <- Plot2_df_top10 %>% group_by(ZIP) %>% 
## +     summarize(POP = first(POP), TOTAL_RESTAURANTS = n())
## 
## > ggplot() + coord_cartesian() + scale_x_continuous() + 
## +     scale_y_continuous() + labs(title = "Top & Bottom 10 percent of Zip Codes by Population ..." ... [TRUNCATED]

This plot was created using a SEMI_JOIN to first eliminate any zip codes in the zip code dataset where there wasn’t a corresponding restaurant location in the fast food dataset. Then an INNER_JOIN was used to combine both datasets to see whether there exists a relationship between the population density of the zip code and the number of fast food restaruants in that zip code. We filtered the data by the top and bottom 10 percent of zip codes by population. Our hypothesis was that the more people that lived in a zip code, the more fast food restaurant locations there would be in that particular zip code.

The plot did reveal such a trend, but what was so suprising and interesting was how stark that trend was: in looking at box plots for the top and bottom 10 percent, there is almost a 10 fold difference in the number of fast food restaurant locations between the least and most populated zip codes. For the bottom 10 percent, the average total number of fast food restaruants was just slightly above one, whereas for the most populated, that number was almost ten locations!

Data Wrangling R Workflow to produce the dataframe that powers this plot:

Plot2_df_both <- Plot2_df  %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% filter(POP_PERCENT <= .1 | POP_PERCENT >= .9) %>% group_by(ZIP) %>% summarize(POP = first(POP), TOTAL_RESTAURANTS = n())

Plot2_df_bot10 <- Plot2_df  %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% filter(POP_PERCENT <= .1) %>% group_by(ZIP) %>% summarize(POP = first(POP), TOTAL_RESTAURANTS = n())

Plot2_df_top10 <- Plot2_df  %>% mutate(POP_PERCENT = cume_dist(Plot2_df$POP)) %>% filter(POP_PERCENT >= .9) %>% group_by(ZIP) %>% summarize(POP = first(POP), TOTAL_RESTAURANTS = n())

Plot 3: Can you predict the number of restaurants located in a zip code based on its population?

source("../02 Data Wrangling/Project3_Plot3.R", echo = TRUE)
## 
## > require(tidyr)
## 
## > require(dplyr)
## 
## > require(ggplot2)
## 
## > Plot3_df <- dplyr::inner_join(fast_food, zip_code, 
## +     by = "ZIP")
## 
## > Plot3_df$RESTAURANT <- factor(Plot3_df$RESTAURANT, 
## +     levels = c("m", "b", "p", "t", "w", "j", "h", "c", "i", "k"), 
## +     labels = c("McDonalds ..." ... [TRUNCATED] 
## 
## > Plot3_df <- Plot3_df %>% select(ZIP, MEDIAN, RESTAURANT, 
## +     POP)
## 
## > Plot3_df <- Plot3_df %>% group_by(ZIP, POP) %>% summarize(TOTAL_RESTAURANTS = n()) %>% 
## +     ungroup() %>% arrange(POP)
## 
## > ggplot() + coord_cartesian() + scale_x_continuous() + 
## +     scale_y_continuous() + labs(title = "Relationship of zip code population and number of  ..." ... [TRUNCATED]

This plot was created using a INNER_JOIN to see whether there exists the ability to predict the number of fast food restaruants in a zip code given its population. Before creating this plot, plot 2 confirmed for us that the more populated zip codes had much higher total fast food reastuarant locations. As such, we hypothesized that there was a positive trend between those two variables because one could reason that the more people there are, the more restaruants there would be to serve all of those people.

However, upon adding the trend line, the plot tells a different story and completely surprised us. When a zip code has more than roughly 80,000 people living within it, the number of fast food restaurant locations not only levels off, but the trend actually suggests that the number of locations actually decreases, which is not what we orignally hypothesized.

Data Wrangling R Workflow to produce the dataframe that powers this plot:

Plot3_df <- Plot3_df %>% select(ZIP, MEDIAN, RESTAURANT, POP) %>%group_by(ZIP,POP) %>% summarize(TOTAL_RESTAURANTS = n()) %>%ungroup()%>%arrange(POP)