This data set was referenced in a September 1998 article written by Craig Olson in the Journal of Applied Econometrics. The data was used to determine the validity of the hypothesis that wives who do not have health insurance through their husband’s employer are more likely to work full-time to get health insurance than wives who are covered by their husband’s health insurance.
Here is the article in its entirety if you would like to read it: 1998 Olson Journal of Applied Econometrics
The data first had to be extracted, transformed, and then loaded into the Oracle database. Here is the code that performs that ETL:
#Before running this R file make sure you set you working directory to where the CSV file located.
file_path <- "HI.csv"
df <- read.csv(file_path, stringsAsFactors = FALSE)
Newdf <- subset(df, select = c("whrswk","hhi","whi","hhi2","education","race","hispanic","experience","kidslt6","kids618","husby","region"))
#head(Newdf)
# Replace "." (i.e., period) with "_" in the column names.
names(Newdf) <- gsub("\\.+", "_", names(Newdf))
str(Newdf) # Uncomment this and run just the lines to here to get column types to use for getting the list of measures.
measures <- c("whrswk", "experience", "kidslt6", "kids618", "husby")
# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(Newdf)) {
Newdf[n] <- data.frame(lapply(Newdf[n], gsub, pattern="[^ -~]",replacement= ""))
}
dimensions <- setdiff(names(Newdf), measures)
#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
# Get rid of " and ' in dimensions.
Newdf[d] <- data.frame(lapply(Newdf[d], gsub, pattern="[\"']",replacement= ""))
# Change & to and in dimensions.
Newdf[d] <- data.frame(lapply(Newdf[d], gsub, pattern="&",replacement= " and "))
# Change : to ; in dimensions.
Newdf[d] <- data.frame(lapply(Newdf[d], gsub, pattern=":",replacement= ";"))
}
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
Newdf[m] <- data.frame(lapply(Newdf[m], gsub, pattern="[^--.0-9]",replacement= ""))
}
}
write.csv(Newdf, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")
tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
for(d in dimensions) {
sql <- paste(sql, paste(d, "varchar2(4000),\n"))
}
}
if( length(measures) > 1 || ! is.na(measures)) {
for(m in measures) {
if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
else sql <- paste(sql, paste(m, "number(38,4)\n"))
}
}
sql <- paste(sql, ");")
cat(sql)
The data set is from the March 1993 US Current Population Survey. Here is a summary of that data set:
require("jsonlite")
## Loading required package: jsonlite
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
# Change the USER and PASS below to be your UTEid
df <- data.frame(fromJSON(getURL(URLencode('129.152.144.84:5001/rest/native/?query="select * from hi"'),httpheader=c(DB='jdbc:oracle:thin:@129.152.144.84:1521/PDBF15DV.usuniversi01134.oraclecloud.internal', USER='cs329e_cz4795', PASS='orcl_cz4795', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE), ))
summary(df)
## HHI WHI HHI2 EDUCATION RACE
## no :11219 no :13961 no : 8696 <9years :1122 black: 1241
## yes:11053 yes: 8311 yes:13576 >16years :1440 other: 171
## 12years :8677 white:20860
## 13-15years:5790
## 16years :3472
## 9-11years :1771
## HISPANIC REGION WHRSWK EXPERIENCE
## no :20601 northcentral:5491 Min. : 0.00 Min. :-1.00
## yes: 1671 other :5170 1st Qu.: 0.00 1st Qu.:14.00
## south :6778 Median :35.00 Median :21.00
## west :4833 Mean :25.57 Mean :22.94
## 3rd Qu.:40.00 3rd Qu.:31.00
## Max. :90.00 Max. :51.00
## KIDSLT6 KIDS618 HUSBY
## Min. :0.0000 Min. :0.000 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 7.5
## Median :0.0000 Median :0.000 Median : 25.0
## Mean :0.3494 Mean :0.691 Mean : 27.1
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.: 40.0
## Max. :5.0000 Max. :8.000 Max. :183.7
Here is the first ten rows of that data set:
head(df)
## HHI WHI HHI2 EDUCATION RACE HISPANIC REGION WHRSWK EXPERIENCE
## 1 no no no 13-15years white no northcentral 0 13.0
## 2 no yes no 13-15years white no northcentral 50 24.0
## 3 yes no yes 12years white no northcentral 40 43.0
## 4 no yes yes 13-15years white no northcentral 40 17.0
## 5 yes no yes 9-11years white no northcentral 0 44.5
## 6 yes yes yes 12years white no northcentral 40 32.0
## KIDSLT6 KIDS618 HUSBY
## 1 2 1 11.960
## 2 0 1 1.200
## 3 0 0 31.275
## 4 0 1 9.000
## 5 0 0 0.000
## 6 0 0 15.690
A detailed explanation of each column in the data set:
This is how Rstudio is set up in order to execute the experiment and produce these results:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RCurl_1.95-4.7 bitops_1.0-6 jsonlite_0.9.16
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8 knitr_1.11
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.7.2
source("../02 Data Wrangling/Project2_Plot1_df.r", echo = TRUE)
##
## > require(tidyr)
## Loading required package: tidyr
##
## > require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## > require(ggplot2)
## Loading required package: ggplot2
##
## > Plot1_df <- df
##
## > Plot1_df$EDUCATION <- factor(df$EDUCATION, levels = c("<9years",
## + "9-11years", "12years", "13-15years", "16years", ">16years"),
## + labels .... [TRUNCATED]
##
## > Plot1_df <- Plot1_df %>% group_by(EDUCATION, HUSBY) %>%
## + summarise(mean_kids = mean(KIDSLT6 + KIDS618))
##
## > ggplot() + coord_cartesian() + scale_x_continuous() +
## + scale_y_continuous() + facet_grid(. ~ EDUCATION) + labs(title = "Average Number of Chil ..." ... [TRUNCATED]
This plot was originally created to verify the hypothesis that there is an inverse relationship between the husband’s income and average children as well as an inverse relationship between the wife’s education level and the average number of children. In other words, the more educated a woman is, and the higher the husband’s income, the less children they have on average.
However, after adding a box plot layer showing the mean and outlier income level of the husband for each education level of the wife, the data reveals an interesting phenomenon; the higher the wife’s education level is, the higher the husband’s income is. This suggests that the more educated a woman is, the higher her spouse’s income is.
Data Wrangling R Workflow to produce the dataframe that powers this plot:
Plot1_df <- df %>% group_by (EDUCATION, HUSBY) %>% summarise(mean_kids = mean(KIDSLT6 + KIDS618))
source("../02 Data Wrangling/Project2_Plot2_df.r", echo = TRUE)
##
## > require(tidyr)
##
## > require(dplyr)
##
## > require(ggplot2)
##
## > Plot2_df <- df %>% mutate(EXPERIENCE_PERCENT = cume_dist(EXPERIENCE)) %>%
## + filter(EXPERIENCE_PERCENT <= 0.1 | EXPERIENCE_PERCENT >=
## + .... [TRUNCATED]
##
## > ggplot() + coord_cartesian() + scale_x_continuous() +
## + scale_y_continuous() + labs(title = "Top & Bottom 10 percent of Husband's Work Experien ..." ... [TRUNCATED]
This plot looks at how likely the husband is to have health insurance based on how much work experience he has and his income level.
This plot is interesting because it suggests that a husband’s likelihood of having heath insurance is influenced more by his work experience than his income. The husbands in the top 10 percent of work experience are almost 6 times as likely to have health insurance than are the husbands who are in the bottom 10 percent of work experience.
Data Wrangling R Workflow to produce the dataframe that powers this plot:
Plot2_df <- df %>% mutate(EXPERIENCE_PERCENT = cume_dist(EXPERIENCE)) %>% filter(EXPERIENCE_PERCENT <= .1 | EXPERIENCE_PERCENT >= .9)
source("../02 Data Wrangling/Project2_Plot3_df.r", echo = TRUE)
##
## > require(tidyr)
##
## > require(dplyr)
##
## > require(ggplot2)
##
## > Plot3_df <- df %>% select(HHI, KIDSLT6, KIDS618, WHRSWK) %>%
## + filter(HHI == "no") %>% mutate(TOTAL_KIDS = KIDSLT6 + KIDS618)
##
## > ggplot() + coord_cartesian() + scale_x_continuous() +
## + scale_y_continuous() + facet_grid(. ~ TOTAL_KIDS, labeller = label_both) +
## + labs( .... [TRUNCATED]
This plot is generated by a subset of the data where the husbands do not have health insurance through their employer. From that subset, this plot looks at how the number of kids affects the hours worked per week by the wife.
Overall, the trend shown in the plot provides support for the hypothesis that even when a husband does not have health insurance, the wife will ultimately work less hours per week as she has more children.
Data Wrangling R Workflow to produce the dataframe that powers this plot:
Plot3_df <- df %>% select(HHI, KIDSLT6, KIDS618, WHRSWK) %>% filter(HHI == "no") %>% mutate(TOTAL_KIDS = KIDSLT6 + KIDS618)