Health Insurance and Hours Worked By Wives

This data set was referenced in a September 1998 article written by Craig Olson in the Journal of Applied Econometrics. The data was used to determine the validity of the hypothesis that wives who do not have health insurance through their husband’s employer are more likely to work full-time to get health insurance than wives who are covered by their husband’s health insurance.

Here is the article in its entirety if you would like to read it: 1998 Olson Journal of Applied Econometrics

The data set

The data first had to be extracted, transformed, and then loaded into the Oracle database. Here is the code that performs that ETL:

#Before running this R file make sure you set you working directory to where the CSV file located.

file_path <- "HI.csv"

df <- read.csv(file_path, stringsAsFactors = FALSE)

Newdf <- subset(df, select = c("whrswk","hhi","whi","hhi2","education","race","hispanic","experience","kidslt6","kids618","husby","region"))

#head(Newdf)

# Replace "." (i.e., period) with "_" in the column names.
names(Newdf) <- gsub("\\.+", "_", names(Newdf))

str(Newdf) # Uncomment this and  run just the lines to here to get column types to use for getting the list of measures.

measures <- c("whrswk", "experience", "kidslt6", "kids618", "husby")

# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(Newdf)) {
    Newdf[n] <- data.frame(lapply(Newdf[n], gsub, pattern="[^ -~]",replacement= ""))
}

dimensions <- setdiff(names(Newdf), measures)
#dimensions
if( length(measures) > 1 || ! is.na(dimensions)) {
  for(d in dimensions) {
    # Get rid of " and ' in dimensions.
    Newdf[d] <- data.frame(lapply(Newdf[d], gsub, pattern="[\"']",replacement= ""))
    # Change & to and in dimensions.
    Newdf[d] <- data.frame(lapply(Newdf[d], gsub, pattern="&",replacement= " and "))
    # Change : to ; in dimensions.
    Newdf[d] <- data.frame(lapply(Newdf[d], gsub, pattern=":",replacement= ";"))
  }
}


# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions
if( length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    Newdf[m] <- data.frame(lapply(Newdf[m], gsub, pattern="[^--.0-9]",replacement= ""))
  }
}

write.csv(Newdf, paste(gsub(".csv", "", file_path), ".reformatted.csv", sep=""), row.names=FALSE, na = "")

tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n-- Change table_name to the table name you want.\n")
if( length(measures) > 1 || ! is.na(dimensions)) {
  for(d in dimensions) {
    sql <- paste(sql, paste(d, "varchar2(4000),\n"))
  }
}
if( length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
    else sql <- paste(sql, paste(m, "number(38,4)\n"))
  }
}
sql <- paste(sql, ");")
cat(sql)

The data set is from the March 1993 US Current Population Survey. Here is a summary of that data set:

require("jsonlite")
## Loading required package: jsonlite
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View
require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
# Change the USER and PASS below to be your UTEid
df <- data.frame(fromJSON(getURL(URLencode('129.152.144.84:5001/rest/native/?query="select * from hi"'),httpheader=c(DB='jdbc:oracle:thin:@129.152.144.84:1521/PDBF15DV.usuniversi01134.oraclecloud.internal', USER='cs329e_cz4795', PASS='orcl_cz4795', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE), ))
summary(df)
##   HHI         WHI         HHI2            EDUCATION       RACE      
##  no :11219   no :13961   no : 8696   <9years   :1122   black: 1241  
##  yes:11053   yes: 8311   yes:13576   >16years  :1440   other:  171  
##                                      12years   :8677   white:20860  
##                                      13-15years:5790                
##                                      16years   :3472                
##                                      9-11years :1771                
##  HISPANIC             REGION         WHRSWK        EXPERIENCE   
##  no :20601   northcentral:5491   Min.   : 0.00   Min.   :-1.00  
##  yes: 1671   other       :5170   1st Qu.: 0.00   1st Qu.:14.00  
##              south       :6778   Median :35.00   Median :21.00  
##              west        :4833   Mean   :25.57   Mean   :22.94  
##                                  3rd Qu.:40.00   3rd Qu.:31.00  
##                                  Max.   :90.00   Max.   :51.00  
##     KIDSLT6          KIDS618          HUSBY      
##  Min.   :0.0000   Min.   :0.000   Min.   :  0.0  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:  7.5  
##  Median :0.0000   Median :0.000   Median : 25.0  
##  Mean   :0.3494   Mean   :0.691   Mean   : 27.1  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.: 40.0  
##  Max.   :5.0000   Max.   :8.000   Max.   :183.7

Here is the first ten rows of that data set:

head(df)
##   HHI WHI HHI2  EDUCATION  RACE HISPANIC       REGION WHRSWK EXPERIENCE
## 1  no  no   no 13-15years white       no northcentral      0       13.0
## 2  no yes   no 13-15years white       no northcentral     50       24.0
## 3 yes  no  yes    12years white       no northcentral     40       43.0
## 4  no yes  yes 13-15years white       no northcentral     40       17.0
## 5 yes  no  yes  9-11years white       no northcentral      0       44.5
## 6 yes yes  yes    12years white       no northcentral     40       32.0
##   KIDSLT6 KIDS618  HUSBY
## 1       2       1 11.960
## 2       0       1  1.200
## 3       0       0 31.275
## 4       0       1  9.000
## 5       0       0  0.000
## 6       0       0 15.690

A detailed explanation of each column in the data set:

  • WHRSWK - Hours worked per week by wife
  • HHI - Is the wife covered by the husband’s health insurance? “yes”, “no”
  • WHI - Does the wife have health insurnace through her job? “yes”, “no”
  • HHI2 - Does the husband has health insurance through his own job? “yes”, “no”
  • education - Education level of the wife, “<9years”, “9-11years”, “12years”, “13-15years”, “16years”, “>16years”
  • race - Race of wife: “white”, “black”, “other”
  • hispanic - Is the wife Hispanic? “yes”, “no”
  • region - Region where wife and husband live: “other”, “northcentral”, “south”, “west”
  • experience - Years of husbands potential work experience
  • kidslt6 - Number of kids under age of 6
  • kids618 - Number of kids 6–18 years old
  • husby - Husband’s income in thousands of dollars

Session Info

This is how Rstudio is set up in order to execute the experiment and produce these results:

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RCurl_1.95-4.7  bitops_1.0-6    jsonlite_0.9.16
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.2     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   rmarkdown_0.8   knitr_1.11     
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7.2

Plot 1: Average Number of Children per Wife’s Education Level and Husband’s Income

source("../02 Data Wrangling/Project2_Plot1_df.r", echo = TRUE)
## 
## > require(tidyr)
## Loading required package: tidyr
## 
## > require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## > require(ggplot2)
## Loading required package: ggplot2
## 
## > Plot1_df <- df
## 
## > Plot1_df$EDUCATION <- factor(df$EDUCATION, levels = c("<9years", 
## +     "9-11years", "12years", "13-15years", "16years", ">16years"), 
## +     labels  .... [TRUNCATED] 
## 
## > Plot1_df <- Plot1_df %>% group_by(EDUCATION, HUSBY) %>% 
## +     summarise(mean_kids = mean(KIDSLT6 + KIDS618))
## 
## > ggplot() + coord_cartesian() + scale_x_continuous() + 
## +     scale_y_continuous() + facet_grid(. ~ EDUCATION) + labs(title = "Average Number of Chil ..." ... [TRUNCATED]

This plot was originally created to verify the hypothesis that there is an inverse relationship between the husband’s income and average children as well as an inverse relationship between the wife’s education level and the average number of children. In other words, the more educated a woman is, and the higher the husband’s income, the less children they have on average.

However, after adding a box plot layer showing the mean and outlier income level of the husband for each education level of the wife, the data reveals an interesting phenomenon; the higher the wife’s education level is, the higher the husband’s income is. This suggests that the more educated a woman is, the higher her spouse’s income is.

Data Wrangling R Workflow to produce the dataframe that powers this plot:

Plot1_df <- df %>% group_by (EDUCATION, HUSBY) %>% summarise(mean_kids = mean(KIDSLT6 + KIDS618))

Plot 2: Top and Bottom 10 Percent of Husband’s Work Experience

source("../02 Data Wrangling/Project2_Plot2_df.r", echo = TRUE)
## 
## > require(tidyr)
## 
## > require(dplyr)
## 
## > require(ggplot2)
## 
## > Plot2_df <- df %>% mutate(EXPERIENCE_PERCENT = cume_dist(EXPERIENCE)) %>% 
## +     filter(EXPERIENCE_PERCENT <= 0.1 | EXPERIENCE_PERCENT >= 
## +         .... [TRUNCATED] 
## 
## > ggplot() + coord_cartesian() + scale_x_continuous() + 
## +     scale_y_continuous() + labs(title = "Top & Bottom 10 percent of Husband's Work Experien ..." ... [TRUNCATED]

This plot looks at how likely the husband is to have health insurance based on how much work experience he has and his income level.

This plot is interesting because it suggests that a husband’s likelihood of having heath insurance is influenced more by his work experience than his income. The husbands in the top 10 percent of work experience are almost 6 times as likely to have health insurance than are the husbands who are in the bottom 10 percent of work experience.

Data Wrangling R Workflow to produce the dataframe that powers this plot:

Plot2_df <- df %>% mutate(EXPERIENCE_PERCENT = cume_dist(EXPERIENCE)) %>% filter(EXPERIENCE_PERCENT  <= .1 | EXPERIENCE_PERCENT  >= .9)

Plot 3: Influence of Number of Kids and No Health Insurance from Husband on Hours Worked by Wives

source("../02 Data Wrangling/Project2_Plot3_df.r", echo = TRUE)
## 
## > require(tidyr)
## 
## > require(dplyr)
## 
## > require(ggplot2)
## 
## > Plot3_df <- df %>% select(HHI, KIDSLT6, KIDS618, WHRSWK) %>% 
## +     filter(HHI == "no") %>% mutate(TOTAL_KIDS = KIDSLT6 + KIDS618)
## 
## > ggplot() + coord_cartesian() + scale_x_continuous() + 
## +     scale_y_continuous() + facet_grid(. ~ TOTAL_KIDS, labeller = label_both) + 
## +     labs( .... [TRUNCATED]

This plot is generated by a subset of the data where the husbands do not have health insurance through their employer. From that subset, this plot looks at how the number of kids affects the hours worked per week by the wife.

Overall, the trend shown in the plot provides support for the hypothesis that even when a husband does not have health insurance, the wife will ultimately work less hours per week as she has more children.

Data Wrangling R Workflow to produce the dataframe that powers this plot:

Plot3_df <- df %>% select(HHI, KIDSLT6, KIDS618, WHRSWK) %>% filter(HHI == "no") %>% mutate(TOTAL_KIDS = KIDSLT6 + KIDS618)