For this project, we joined three datasets to produce our plots. The first contains location data for ten fast food chains in the United States. The second contains salary and population data for every United States ZIP code. The third contains the average sales revenue generated by a single franchise unit for each of the ten chains in the first dataset. With these three datasets, we can join the first two on the ZIP code field they share, and join the first and third on the RESTAURANT field they share. The joined data supports analysis well beyond what the three datasets could provide separately. We followed all of the steps in Dr. Cannata’s presentation and generated some pretty interesting plots. Specifically, we were able to look at the market share of each restaurant, each state’s contribution to the total sales revenue of each restaurant, and which restaurants were located in wealthier ZIP codes.
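As a minimal sketch (using the fast_food, zip_code, and fast_food_sale data frames that are loaded later in this report), the two joins look like this in dplyr:
require("dplyr")
# Join locations to ZIP-level salary/population data on the shared ZIP field,
# then join the per-unit sales figures on the shared RESTAURANT field.
joined <- fast_food %>%
  dplyr::left_join(zip_code, by = "ZIP") %>%
  dplyr::inner_join(fast_food_sale, by = "RESTAURANT")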
This is how RStudio was set up to execute the experiment and produce these results:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8 knitr_1.11
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.7.2
For this project, we also built and deployed a Shiny Application that showcases three of the most interesting plots we saw when analyzing these datasets.
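The app code itself is not reproduced here, but its layout is roughly the following hypothetical skeleton (the tab names other than "Boxplot" are illustrative, and empty plots stand in for the three plots built below):
require("shiny")
require("ggplot2")
ui <- navbarPage("Fast Food Analysis",
  tabPanel("Boxplot",  plotOutput("boxplot")),
  tabPanel("Crosstab", plotOutput("crosstab")),
  tabPanel("Barchart", plotOutput("barchart"))
)
server <- function(input, output) {
  # Each tab would render one of the plots generated later in this report.
  output$boxplot  <- renderPlot(ggplot())
  output$crosstab <- renderPlot(ggplot())
  output$barchart <- renderPlot(ggplot())
}
shinyApp(ui, server)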
The first dataset comes from Fast Food Maps, a website created to show all restaurant locations of the top ten fast food chains in America. The data is from the summer of 2007. Here is a summary of that dataset:
require("jsonlite")
## Loading required package: jsonlite
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
# Loads the data from the Fast Food table into the fast_food dataframe
# Change the USER and PASS below to be your UTEid
fast_food <- data.frame(fromJSON(getURL(
  URLencode('skipper.cs.utexas.edu:5001/rest/native/?query="select * from FASTFOODMAPS_LOCATIONS_2007"'),
  httpheader=c(DB='jdbc:oracle:thin:@sayonara.microlab.cs.utexas.edu:1521:orcl',
               USER='C##cs329e_cz4795', PASS='orcl_cz4795', MODE='native_mode',
               MODEL='model', returnDimensions = 'False', returnFor = 'JSON'),
  verbose = TRUE)))
summary(fast_food)
## RESTAURANT ADDRESS CITY
## McDonalds :13745 5505 Blue Lagoon Drive: 4 Houston : 487
## Burger King: 7435 100 Main Street : 3 Los Angeles: 309
## Pizza Hut : 6214 265 Washington Street : 3 Las Vegas : 294
## Taco Bell : 6164 3000 Island Ave : 3 Chicago : 266
## Wendys : 5971 303 Main St : 3 Columbus : 258
## KFC : 5424 808 Main St : 3 Phoenix : 256
## (Other) : 5049 (Other) :49983 (Other) :48132
## STATE PHONE X ZIP
## CA : 6078 null :20025 Min. :0.0000 Min. : 1001
## TX : 3844 () : 76 1st Qu.:0.0000 1st Qu.:30260
## FL : 2996 (216)524-4444: 28 Median :0.0000 Median :48624
## OH : 2336 (901)362-3333: 25 Mean :0.0062 Mean :53222
## IL : 2009 (614)895-1111: 22 3rd Qu.:0.0000 3rd Qu.:78222
## NY : 1916 (205)978-8000: 21 Max. :1.0000 Max. :99901
## (Other):30823 (Other) :29805
## ROW_NUM LAT LONGI
## Min. : 1 Min. :19.52 Min. :-159.59
## 1st Qu.:12501 1st Qu.:33.82 1st Qu.: -98.56
## Median :25002 Median :37.72 Median : -87.70
## Mean :25002 Mean :37.27 Mean : -92.60
## 3rd Qu.:37502 3rd Qu.:40.92 3rd Qu.: -81.33
## Max. :50002 Max. :64.86 Max. : -67.28
##
head(fast_food)
## RESTAURANT ADDRESS CITY STATE
## 1 Burger King 2423 South Carrollton Ave. New Orleans LA
## 2 Burger King 4901 South Broadway Englewood CO
## 3 Burger King 1436 Apple Avenue Muskegon MI
## 4 Burger King 5456 East Mockingbird Lane, 400 Dallas TX
## 5 Burger King 10170 West Grand Avenue Franklin Park IL
## 6 Burger King 1621 Carter Hill Road Montgomery AL
## PHONE X ZIP ROW_NUM LAT LONGI
## 1 (504) 866-4144 0 70118 163 29.9560 -90.1200
## 2 (303) 789-9526 0 80110 164 39.6275 -104.9880
## 3 (231) 773-9321 0 49442 165 43.2343 -86.2083
## 4 (214) 828-9146 0 75206 166 32.8364 -96.7745
## 5 (708) 455-0203 0 60131 167 41.9303 -87.8793
## 6 (334) 265-7944 0 36106 168 32.3595 -86.2866
Here is the script we used to extract, transform, and load the fast food dataset into Oracle:
# Before running this R file, make sure you set your working directory to where the CSV file is located.
file_path <- "fastfoodmaps_locations_2007.csv"
df <- read.csv(file_path, stringsAsFactors = FALSE)
# Replace "." (i.e., period) with "_" in the column names.
names(df) <- gsub("\\.+", "_", names(df))
str(df) # Run the script just to here to see the column types, which help in building the list of measures.
# Generate List of Measures
measures <- c("ZIP", "Row_Num", "LAT", "LONGI")
# Make Zip codes all five digits
df$ZIP <- gsub(df$ZIP, pattern="-.*", replacement = "")
# Remove the leading zero on ZIP codes to match the other dataset
df$ZIP <- gsub(df$ZIP, pattern="^0", replacement = "")
# Relabel the one-letter restaurant codes with full restaurant names
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^m", replacement = "McDonalds")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^b", replacement = "Burger King")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^p", replacement = "Pizza Hut")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^t", replacement = "Taco Bell")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^w", replacement = "Wendys")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^j", replacement = "Jack in the Box")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^h", replacement = "Hardees")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^c", replacement = "Carls Jr")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^i", replacement = "In-N-Out")
df$RESTAURANT <- gsub(df$RESTAURANT, pattern="^k", replacement = "KFC")
# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
dimensions <- setdiff(names(df), measures)
#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
# Get rid of " and ' in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
# Change & to and in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
# Change : to ; in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
}
}
# Get rid of all characters in measures except for numbers, the - sign, and period.
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
}
}
write.csv(df, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")
tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
sql <- paste(sql, paste(d, "varchar2(4000),\n"))
}
}
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
else sql <- paste(sql, paste(m, "number(38,4)\n"))
}
}
sql <- paste(sql, ");")
cat(sql)
The second dataset comes from the University of Michigan’s Population Studies Center. The data originally came from the 2010 American Community Survey; the Center stripped out all of the columns except ZIP, POP, and the MEAN and MEDIAN salaries. Here is a summary of that dataset:
require("jsonlite")
require("RCurl")
# Loads median, mean, and population data into Zip Code dataframe
zip_code <- data.frame(fromJSON(getURL(
  URLencode('skipper.cs.utexas.edu:5001/rest/native/?query="select * from MedianZIP"'),
  httpheader=c(DB='jdbc:oracle:thin:@sayonara.microlab.cs.utexas.edu:1521:orcl',
               USER='C##cs329e_cz4795', PASS='orcl_cz4795', MODE='native_mode',
               MODEL='model', returnDimensions = 'False', returnFor = 'JSON'),
  verbose = TRUE)))
summary(zip_code)
## ZIP MEDIAN MEAN POP
## Min. : 1001 Min. : 33 49548 : 21 Min. : 1
## 1st Qu.:27301 1st Qu.: 38462 46477 : 14 1st Qu.: 736
## Median :49875 Median : 46504 47396 : 14 Median : 2756
## Mean :49875 Mean : 50938 55699 : 14 Mean : 9193
## 3rd Qu.:72134 3rd Qu.: 58256 49486 : 11 3rd Qu.: 12513
## Max. :99929 Max. :223106 54748 : 11 Max. :113916
## (Other):32549
head(zip_code)
## ZIP MEDIAN MEAN POP
## 1 3049 112885 159945 7013
## 2 3051 77996 88948 22952
## 3 3052 100051 102541 7331
## 4 3053 93465 104876 23364
## 5 3054 92954 100328 25094
## 6 3055 70617 80757 13437
Here is the script we used to extract, transform, and load the ZIP code dataset into Oracle:
# Before running this R file, make sure you set your working directory to where the CSV file is located.
file_path <- "MedianZIP.csv"
df <- read.csv(file_path, stringsAsFactors = FALSE)
# Replace "." (i.e., period) with "_" in the column names.
names(df) <- gsub("\\.+", "_", names(df))
str(df) # Run the script just to here to see the column types, which help in building the list of measures.
measures <- c("Zip", "Median", "Mean", "Pop")
# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
dimensions <- setdiff(names(df), measures)
#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
# Get rid of " and ' in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
# Change & to and in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
# Change : to ; in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
}
}
# Get rid of all characters in measures except for numbers, the - sign, and period.
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
}
}
df$Median <- as.numeric(levels(df$Median))[df$Median]
df$Mean <- as.numeric(levels(df$Mean))[df$Mean]
df$Pop <- as.numeric(levels(df$Pop))[df$Pop]
write.csv(df, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")
tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
sql <- paste(sql, paste(d, "varchar2(4000),\n"))
}
}
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
else sql <- paste(sql, paste(m, "number(38,4)\n"))
}
}
sql <- paste(sql, ");")
cat(sql)
We created this dataset using data from QSR Magazine’s Top 50 quick-serve and fast-casual rankings. It is VERY important to note that joining this to the fast food locations dataset implies that each franchise unit makes exactly the average sales revenue for its chain. We realize this is not entirely accurate, since individual franchise units make well above and well below the average, but we do not have access to unit-level data, which is privately held by each chain’s parent company. Nonetheless, this estimation makes for some interesting visualizations; a short sketch of what it implies follows the summary below. Here is a summary of that dataset:
require("jsonlite")
require("RCurl")
# Loads the restaurant and sales data into the fast_food_sale dataframe
# Change the USER and PASS below to be your UTEid
fast_food_sale <- data.frame(fromJSON(getURL(
  URLencode('skipper.cs.utexas.edu:5001/rest/native/?query="select RESTAURANT, SALES from FASTFOOD_SALES_RANK"'),
  httpheader=c(DB='jdbc:oracle:thin:@sayonara.microlab.cs.utexas.edu:1521:orcl',
               USER='C##cs329e_cz4795', PASS='orcl_cz4795', MODE='native_mode',
               MODEL='model', returnDimensions = 'False', returnFor = 'JSON'),
  verbose = TRUE)))
summary(fast_food_sale)
## RESTAURANT SALES
## Burger King :1 Min. : 855000
## Carls Jr :1 1st Qu.:1098000
## Hardees :1 Median :1305000
## In-N-Out :1 Mean :1380000
## Jack in the Box:1 3rd Qu.:1407250
## KFC :1 Max. :2400000
## (Other) :4
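As a minimal sketch of what this averaging assumption implies (mirroring the aggregation used in the crosstab script later in this report, with the fast_food and fast_food_sale data frames loaded above), a chain’s estimated revenue in a state is simply its unit count times its average unit sales:
require("dplyr")
# Every location is assumed to earn its chain's average unit sales, so a
# chain's estimated state revenue is its unit count times that average.
state_revenue <- fast_food %>%
  dplyr::inner_join(fast_food_sale, by = "RESTAURANT") %>%
  dplyr::group_by(STATE, RESTAURANT) %>%
  dplyr::summarise(units = n(), est_revenue = sum(as.numeric(SALES)))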
Here is the script we used to extract, transform, and load the Fast Food Sales Revenue dataset into Oracle:
# Before running this R file, make sure you set your working directory to where the CSV file is located.
file_path <- "Fastfood_sales_rank.csv"
df <- read.csv(file_path, stringsAsFactors = FALSE)
# Replace "." (i.e., period) with "_" in the column names.
names(df) <- gsub("\\.+", "_", names(df))
str(df) # Run the script just to here to see the column types, which help in building the list of measures.
# Generate List of Measures
measures <- c("Sales")
# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
dimensions <- setdiff(names(df), measures)
#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
# Get rid of " and ' in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
# Change & to and in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
# Change : to ; in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
}
}
# Get rid of all characters in measures except for numbers, the - sign, and period.
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
}
}
write.csv(df, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")
tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
sql <- paste(sql, paste(d, "varchar2(4000),\n"))
}
}
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
else sql <- paste(sql, paste(m, "number(38,4)\n"))
}
}
sql <- paste(sql, ");")
cat(sql)
We followed Dr. Cannata’s detailed, step-by-step methodology for performing data analysis to produce these various plots.
The first step in the methodology is to “Start with a green thing (usually a Measure) in non-aggregated mode and make a Boxplot.” So, we first made a boxplot of the median salary for all ZIP codes from the ZIP code dataset. To recreate this plot, uncheck “Aggregate Measures” in the Analysis menu, drag the Median green pill from the Measures column onto Rows, and click Boxplot on Show Me. Here is what it looked like:
[Figure: boxplot of the median salary across all ZIP codes]
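A minimal sketch of this first boxplot in R, assuming the zip_code data frame loaded above (the restaurant-level version appears later in this report):
require("ggplot2")
# One box summarizing the MEDIAN salary over all ZIP codes.
ggplot(zip_code, aes(x = "", y = MEDIAN)) +
  geom_boxplot() +
  labs(title = "Median Salary Across All ZIP Codes", x = NULL, y = "Median Salary")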
We took the boxplot one step further by adding Restaurant to Columns. That is when we found something interesting: all but two fast food chains had boxplots showing roughly the same median salary for the ZIP codes where they had franchises. The exceptions were In-N-Out and Hardee’s: In-N-Out’s median salary was almost $10,000 above the others, while Hardee’s was about $5,000 below. You can find this plot in our Shiny App under the “Boxplot” tab. Here is the plot in Tableau:
[Figure: Tableau boxplots of median ZIP-code salary by restaurant]
Here is the plot and the code used to generate it in R:
source("../02 R SQL/Plot3_Boxplot.R", echo = TRUE)
##
## > require("ggplot2")
## Loading required package: ggplot2
##
## > require("ggthemes")
## Loading required package: ggthemes
##
## > require("gplots")
## Loading required package: gplots
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
##
## > require("grid")
## Loading required package: grid
##
## > require("RCurl")
##
## > require("reshape2")
## Loading required package: reshape2
##
## > require("tableplot")
## Loading required package: tableplot
##
## > require("tidyr")
## Loading required package: tidyr
##
## > require("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## > require("jsonlite")
##
## > require("extrafont")
## Loading required package: extrafont
## Registering fonts with R
##
## > require("lubridate")
## Loading required package: lubridate
##
## > boxplot <- dplyr::left_join(fast_food, zip_code, by = "ZIP")
##
## > ggplot() + coord_cartesian() + scale_x_discrete() +
## + scale_y_continuous() + labs(title = "Median Salary of ZIP Codes by Restaurant") +
## + labs(x = .... [TRUNCATED]
## Warning: Removed 572 rows containing non-finite values (stat_boxplot).
## Warning: Removed 572 rows containing non-finite values (stat_boxplot).
## Warning: Removed 572 rows containing missing values (geom_point).
require("ggplot2")
require("ggthemes")
require("gplots")
require("grid")
require("RCurl")
require("reshape2")
require("tableplot")
require("tidyr")
require("dplyr")
require("jsonlite")
require("extrafont")
require("lubridate")
boxplot <- dplyr::left_join(fast_food, zip_code, by="ZIP")
ggplot() +
  coord_cartesian() +
  scale_x_discrete() +
  scale_y_continuous() +
  labs(title="Median Salary of ZIP Codes by Restaurant") +
  labs(x="Restaurant", y="Median Salary") +
  layer(data=boxplot,
        mapping=aes(x=RESTAURANT, y=MEDIAN),
        stat="identity",
        stat_params=list(),
        geom="point",
        geom_params=list(color="red"),
        position=position_identity()
  ) +
  layer(data=boxplot,
        mapping=aes(x=RESTAURANT, y=MEDIAN),
        stat="boxplot",
        stat_params=list(),
        geom="boxplot",
        geom_params=list(color="black", fill="red", alpha=.4),
        position=position_identity()
  ) +
  layer(data=boxplot,
        mapping=aes(x=RESTAURANT, y=MEDIAN),
        stat="boxplot",
        stat_params=list(),
        geom="errorbar",
        geom_params=list(color="black", fill="red", alpha=.4),
        position=position_identity()
  )
The second step in the methodology is to “Start with a green thing in aggregated mode and make a Histogram.” The first histogram we made depicts the total number of fast food locations in the United States for each chain. To recreate this plot, click on Number of Records under Measures, click “Histogram” under Show Me, and then drag Restaurant to Columns. Here is what it looked like:
[Figure: histogram of the number of locations per restaurant]
We spent some time making various histograms, but none of them, this one included, produced any interesting visualizations, because they merely reinforced common knowledge. In the case of the first plot, McDonald’s clearly has more franchises than any of the other fast food chains, which did not surprise any of us.
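For reference, a minimal ggplot2 sketch of the same count-by-restaurant view, assuming the fast_food data frame loaded earlier:
require("ggplot2")
# geom_bar counts rows per RESTAURANT, matching Number of Records in Tableau.
ggplot(fast_food, aes(x = RESTAURANT)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of Locations per Restaurant", x = "Restaurant", y = "Number of Records") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))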
The third step in the methodology is to “Start with 2 green things and make a Scatter plot.” So, the first scatterplot we made depicts the median salary and population of all ZIP codes in the United States. To recreate this plot, drag Median under Measures to Columns and drag Pop under Measures to Rows. Here is what it looked like:
At first, we thought we had found an interesting scatterplot when we added a polynomial trend line on top of a plot that looked at the population and number of fast food franchises for a particular restaurant. Here is that plot in Tableau:
However, we realized rather quickly that trend lines suggest a correlation between two variables, which was a conjecture we could not reasonably support given our datasets, so we did not include this plot in our Shiny App.
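A minimal sketch of the basic salary-versus-population scatterplot in R, assuming the zip_code data frame loaded earlier (no trend line, for the reason above):
require("ggplot2")
# One point per ZIP code; alpha reduces overplotting across tens of thousands of ZIPs.
ggplot(zip_code, aes(x = MEDIAN, y = POP)) +
  geom_point(alpha = 0.3) +
  labs(title = "Population vs. Median Salary by ZIP Code", x = "Median Salary", y = "Population")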
The fourth step in the methodology is to “Start with two blue things (usually Dimensions) plus a green thing and make a Crosstab.” So, we first made a crosstab that shows the sum of the sales revenues for each restaurant by state. To recreate this plot, take the Restaurant pill under Dimensions and add it to the Columns area. Then take the State pill under Dimensions and add it to the Rows area. Finally, take the Sales per Unit pill under Measures and drop it on ABC/123 (Text). You should get this plot:
After seeing what our first crosstab looked like, we looked into creating a KPI. After some searching, we created a KPI that measures the market share a particular fast food chain has in a given state. We then added two parameter controls that let the user specify the upper and lower boundaries used to determine how dominant a company is in terms of market share. By default, a weak market leader has a market share of less than 20%, an average market leader has a market share between 20% and 40%, and a strong market leader has a market share greater than 40%. To recreate this plot, take the plot from Step 4: Crosstab in Tableau, create a calculated field called Total Sales Revenue % using this calculation: ROUND(SUM([Sales per Unit]) / TOTAL(SUM([Sales per Unit])), 2), and drag that pill twice: once to the Color area and again to the Text area. Here is that plot in Tableau:
Here is the plot and the code to generate it in R:
source("../02 R SQL/Plot1_Crosstab.R", echo = TRUE)
##
## > require(jsonlite)
##
## > require(RCurl)
##
## > KPI_LOW = 0.2
##
## > KPI_HIGH = 0.4
##
## > crosstab <- dplyr::inner_join(fast_food, fast_food_sale,
## + by = "RESTAURANT")
##
## > crosstab <- crosstab %>% select(RESTAURANT, SALES,
## + STATE) %>% group_by(STATE) %>% mutate(sum_total_sales = sum(as.numeric(SALES))) %>%
## + .... [TRUNCATED]
##
## > crosstab <- crosstab %>% transform(STATE = ordered(STATE),
## + RESTAURANT = ordered(RESTAURANT))
##
## > ggplot() + coord_cartesian() + scale_x_discrete() +
## + scale_y_discrete() + labs(title = "Market Share of Fast Food Restaurants By State") +
## + .... [TRUNCATED]
require(jsonlite)
require(RCurl)
require(dplyr)
require(ggplot2)
# KPI boundaries for market share (see the parameter controls described above)
KPI_LOW = .2
KPI_HIGH = .4
# Join the fast food locations and fast food sales datasets
crosstab <- dplyr::inner_join(fast_food, fast_food_sale, by="RESTAURANT")
# The following is equivalent to creating a crosstab with two KPIs in Tableau:
# compute each restaurant's share of its state's total sales, then bucket it.
crosstab <- crosstab %>%
  select(RESTAURANT, SALES, STATE) %>%
  group_by(STATE) %>%
  mutate(sum_total_sales = sum(as.numeric(SALES))) %>%
  group_by(STATE, RESTAURANT) %>%
  mutate(sum_restaurant_sales = sum(as.numeric(SALES))) %>%
  group_by(STATE, RESTAURANT) %>%
  summarise(sum_total_sales = mean(sum_total_sales),
            sum_restaurant_sales = mean(sum_restaurant_sales)) %>%
  mutate(ratio_1 = sum_restaurant_sales / sum_total_sales) %>%
  mutate(kpi_1 = ifelse(ratio_1 < KPI_LOW, '03 Low Market Share',
                 ifelse(ratio_1 <= KPI_HIGH, '02 Average Market Share',
                                             '01 High Market Share')))
# This line turns the State and Restaurant columns into ordered factors.
crosstab <- crosstab %>% transform(STATE = ordered(STATE), RESTAURANT = ordered(RESTAURANT))
#This generates the crosstab plot
ggplot() +
coord_cartesian() +
scale_x_discrete() +
scale_y_discrete() +
labs(title='Market Share of Fast Food Restaurants By State') +
labs(x=paste("Fast Food Restaurant"), y=paste("State")) +
layer(data=crosstab,
mapping=aes(x=RESTAURANT, y=STATE, label=round(ratio_1, 4)),
stat="identity",
stat_params=list(),
geom="text",
geom_params=list(colour="black"),
position=position_identity()
) +
layer(data=crosstab,
mapping=aes(x=RESTAURANT, y=STATE, fill=kpi_1),
stat="identity",
stat_params=list(),
geom="tile",
geom_params=list(alpha=0.50),
position=position_identity()
)
Step 5 of the methodology says to “Start with a blue thing and a green thing and make a Barchart.” So, the first bar chart we made looked at the sales revenue generated by all franchise locations in the United States for the 10 restaurants. This visualization did not produce anything interesting except for the fact that McDonald’s makes more than all the other fast food chains, and by an enormous margin. To recreate this plot, take the Restaurant pill under Dimensions and place it in the Columns area, and take the Sales per Unit pill under Measures and place it in the Rows area. Here is that plot in Tableau:
We took the original bar chart one step further by seeing whether some states contributed more to a fast food chain’s overall U.S. sales revenue than others. For example, the Market Share KPI crosstab showed that McDonald’s had a dominant market share in almost every state, but we were curious to see which state, if any, contributed the most to McDonald’s total U.S. sales revenue. To do this, we added a KPI that looks at the difference between each state’s sales and the average state sales for each restaurant. To recreate this plot, flip the axes and add the State pill to the Rows area. Then, create a calculated field called Calculated Difference from the Average (a table calculation) with the following calculation: SUM([Sales per Unit]) - WINDOW_AVG(SUM([Sales per Unit])). Drag that pill twice: once to Color and once to Text. Finally, navigate to the Analytics tab and drag the Average Line onto Panes. Here is that plot in Tableau:
When recreating this plot in R and in the Shiny App, we added an average line showing the average state sales for each restaurant. Additionally, the red numbers represent the difference in dollars between each state’s sales and the average across all 50 states. Here is the plot and the code to generate it in R:
source("../02 R SQL/Plot2_Barchart.R", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > require(dplyr)
##
## > require(tidyr)
##
## > require(ggplot2)
##
## > require(reshape2)
##
## > library(scales)
##
## > join_data <- dplyr::inner_join(fast_food, fast_food_sale,
## + by = "RESTAURANT")
##
## > bar_chart <- join_data %>% select(STATE, RESTAURANT,
## + SALES) %>% subset(STATE %in% c("AK", "AL", "AR", "AZ", "CA",
## + "CO", "CT", "DC", "D ..." ... [TRUNCATED]
##
## > WINDOW_AVG = aggregate(bar_chart[, 4], list(RESTAURANT = bar_chart$RESTAURANT),
## + mean)
##
## > bar_chart <- dplyr::right_join(bar_chart, WINDOW_AVG,
## + by = "RESTAURANT")
##
## > bar_chart <- bar_chart %>% select(STATE, RESTAURANT,
## + variable, value.x, value.y) %>% mutate(Diff_To_Avg = value.x -
## + value.y)
##
## > options(scipen = 999)
##
## > ggplot() + coord_cartesian() + scale_x_discrete() +
## + scale_y_continuous(labels = dollar) + facet_wrap(~RESTAURANT) +
## + labs(title = "Tota ..." ... [TRUNCATED]
require("jsonlite")
require("RCurl")
require(dplyr)
require(tidyr)
require(ggplot2)
require(reshape2)
library(scales)
#Join fast food locations and fast food sales by franchise unit datasets
join_data<-dplyr::inner_join(fast_food, fast_food_sale, by="RESTAURANT")
#R workflow to generate the bar chart
bar_chart <- join_data %>%
  select(STATE, RESTAURANT, SALES) %>%
  subset(STATE %in% c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL",
                      "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA",
                      "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE",
                      "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI",
                      "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV",
                      "WY")) %>%
  subset(RESTAURANT %in% c("McDonalds", "Burger King")) %>%
  group_by(STATE, RESTAURANT) %>%
  summarise(sum_sales = sum(as.numeric(SALES))) %>%
  melt(id.vars = c("STATE", "RESTAURANT")) %>%
  group_by(STATE)
#Generate window average
WINDOW_AVG=aggregate(bar_chart[, 4], list(RESTAURANT=bar_chart$RESTAURANT), mean)
bar_chart<-dplyr::right_join(bar_chart, WINDOW_AVG, by="RESTAURANT")
bar_chart<-bar_chart %>% select (STATE,RESTAURANT,variable,value.x,value.y)%>%mutate(Diff_To_Avg = value.x - value.y)
#Plot Function to generate bar chart with reference line and values
ggplot() +
coord_cartesian() +
scale_x_discrete() +
scale_y_continuous(labels=dollar) +
facet_wrap(~RESTAURANT) +
labs(title='Total sales of every fast food restaurant in every state') +
labs(x=paste("State"), y=paste("Sales")) +
layer(data=bar_chart,
mapping=aes(x=STATE, y=value.x),
stat="identity",
stat_params=list(),
geom="bar",
geom_params=list(fill="steelblue"),
position=position_dodge()
)+ coord_flip()+
layer(data=bar_chart,
mapping=aes(x=STATE, y=value.x, label=round(value.x,2)),
stat="identity",
stat_params=list(),
geom="text",
geom_params=list(colour="black", hjust=0),
position=position_identity()
)+
layer(data=bar_chart,
mapping=aes(yintercept = value.y),
geom="hline",
geom_params=list(colour="red", linetype="dashed", size=2)
)+
layer(data=bar_chart,
mapping=aes(x=STATE, y=value.x, label=round(Diff_To_Avg)),
stat="identity",
stat_params=list(),
geom="text",
geom_params=list(colour="red", hjust=-2),
position=position_identity()
)
This map plot was created because we had location data for more than 50,000 fast food restaurants. Maps in R are not as flexible as they are in Tableau; thus, we were only able to generate a map that plots the location of each restaurant. It is still interesting in that you can visually see where the majority of restaurants are located and where a particular brand has a heavy concentration of locations. Here is the plot in Tableau:
The map will not render in an R Markdown HTML document. To view the plot, see the Bonus_Map.R file under the 02 R SQL folder. Here is the code that generated that plot:
require(leaflet)
locations <- fast_food %>% subset(RESTAURANT %in% c("Burger King", "McDonalds"))
content <- paste(sep = "<br/>",
                 locations$RESTAURANT,
                 locations$ADDRESS,
                 paste(locations$CITY, locations$STATE, locations$ZIP, sep = " "))
leaflet(data = locations) %>% addTiles() %>%
addMarkers( ~LONGI, ~LAT, popup = ~content ,
clusterOptions = markerClusterOptions()
)