Emulate ggplot using plot(): logit regression

I find ggplot confusing for logistic regression, and I have already spent quite some time in the past coding my own template for logit regression, so here is a quick “fix” to make simple plots produced by plot() emulate the ggplot look!

Why would one bother to do this?

Because whoever doesn’t think iris blue (#00BFC4) is pretty has a problem.

The coral-ish color can be defined by col = rgb(248/255, 118/255, 109/255), and the iris blue by col = rgb(0, 191/255, 196/255) (the hex channels run from 0 to 255, so divide by 255 rather than 256).
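Equivalently, rgb() accepts a maxColorValue argument, which saves the manual division:

# ggplot2's default two-class palette, rebuilt for base plot()
ggcoral <- rgb(248, 118, 109, maxColorValue = 255)  # "#F8766D"
ggblue  <- rgb(0, 191, 196, maxColorValue = 255)    # "#00BFC4"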

The figure below was generated by (1) first plotting the raw data of the response variable against its covariate while suppressing all labels (ann = FALSE and axes = FALSE), and (2) then overlaying the fitted logit curve on the original plot using par(new = TRUE).
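In code, the two steps look roughly like this (dat, x, and y are placeholder names for a data frame with a binary response and a single covariate, not objects from my original script):

# fit the logit model
fit <- glm(y ~ x, data = dat, family = binomial)

# (1) raw data, with every annotation and axis suppressed
plot(dat$x, dat$y, pch = 16, col = rgb(248/255, 118/255, 109/255),
     ylim = c(0, 1), ann = FALSE, axes = FALSE)

# (2) overlay the fitted curve on the same device
par(new = TRUE)
xs <- seq(min(dat$x), max(dat$x), length.out = 200)
plot(xs, predict(fit, newdata = data.frame(x = xs), type = "response"),
     type = "l", lwd = 2, col = rgb(0, 191/255, 196/255),
     ylim = c(0, 1), xlab = "x", ylab = "Pr(y = 1)")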

Maybe it saves more time if I just spend ten minutes reading ggplot’s grammar while plotting logit regression…

 

Figure: Emulating ggplot using plot()


Loop through files to loop through variables!

Say you have a bunch of data files formatted in exactly the same way (not rare if you are scraping, or if the data are clean). How do you loop through all the files at once, extract the useful information, and bind everything into one big data set? Consider the following code.

Suppose all my files are named “1.csv”, …, “5.csv”. We can loop through them like this:

file.names <- c("1", "2", "3", "4", "5")
for (i in 1:length(file.names)) {
  # read the raw lines, then parse them as csv
  data <- readLines(paste(file.names[i], "csv", sep = "."))
  read <- read.csv(textConnection(data), header = TRUE, stringsAsFactors = FALSE)
  # store the parsed data frame under the file's name ("1", ..., "5")
  assign(file.names[i], read)
}
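Note that assign() stores the data frames under the names "1", …, "5"; they can be pulled back out with get(), for instance

var1 <- get(file.names[1])  # the data frame read from "1.csv"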

Oftentimes you will need to reshape your data. Suppose we are looking at data like this:

year    place1    place2
1999    1.1       7.8
...

An efficient way to reshape the data is to write a small wrapper around melt() from the reshape2 package:

library(reshape2)

my.melt <- function(x) {
  x <- melt(x, id.vars = "year", variable.name = "place")
  x
}
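Applied to the wide table above, my.melt() stacks the place columns into long format:

year    place     value
1999    place1    1.1
...
1999    place2    7.8
...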

Since all the files have the same layout, we end up with a long list of variables of the same dimension, so we can merge all of them. Consider the example where I want to reshape (and then merge) two of my variables:

var.names <- list(var1, var2)  # e.g. the first two data frames read in above
for (i in 1:length(var.names)) {
  var.names[[i]] <- my.melt(var.names[[i]])
}

Alternatively, you can use lapply()

reshape <- lapply(var.names, my.melt)

Now, we need to cbind() all our data:

datalist <- list()                           # create an empty list
for (i in 1:length(reshape)) {
  datalist[[i]] <- reshape[[i]]
}
merge <- do.call(cbind, datalist)            # column-bind all the melted data frames
names(merge) <- make.unique(names(merge))    # cbind() leaves duplicated column names
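Since reshape is already a list, the copying loop above is optional:

merge <- do.call(cbind, reshape)  # same result in one line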

Definitely not the smartest way — but it works.

A quick note on causal effect

This post is a quick note, as I have been reading papers and books on causal inference recently. I am sure the material is extremely intuitive to most social scientists, but I hope some of my notes can help beginning grad students quickly understand the notion of a causal effect.

I recently came across a paper by Barnow, Cain, and Goldberger that was published thirty-six years ago. In it, they discuss how to operationalize the following equation

y=\alpha z+w+\varepsilon,

where \alpha is the true treatment effect, z is treatment status, y is the outcome, w is an unobserved variable, and \varepsilon is a random error term. The basic idea is to introduce into the equation observable variables that determine treatment assignment:

“Assume that an observed variable, t, was used to determine assignment into the treatment group and the control group… [S]ince t is the only systematic determinant of treatment status, t will capture any correlation between z and w. Thus, the observed t could replace the unobserved w as the explanatory variable.”

To understand their argument, let’s quickly review the conditional independence assumption (CIA). The CIA states that, conditional on t, the potential outcomes are independent of treatment status, that is, \{y_{0},y_{1}\}\perp\!\!\!\perp z|t.

Suppose we have
\underbrace{ E[y|z=1] - E[y|z=0]}_{\text{observed difference}} = \underbrace{ E[y_{1}-y_{0}|z=1]}_{\text{treatment effect}} + \underbrace{ (E[y_{0}|z=1] - E[y_{0}|z=0])}_{\text{selection bias}}.
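(The decomposition follows from adding and subtracting E[y_{0}|z=1]: the observed difference E[y_{1}|z=1]-E[y_{0}|z=0] equals E[y_{1}-y_{0}|z=1] plus E[y_{0}|z=1]-E[y_{0}|z=0].)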

However, the selection bias term is exactly what we want to eliminate. Conditioning on t, we obtain
E[y|t,z=1] - E[y|t,z=0] = E[y_{1}-y_{0}|t].
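(This is where the CIA does the work: conditional on t, E[y_{0}|t,z=1]=E[y_{0}|t,z=0]=E[y_{0}|t], so the selection-bias term vanishes, and E[y_{1}|t,z=1]=E[y_{1}|t] turns the remaining difference into E[y_{1}-y_{0}|t].)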

In this way, selection bias disappears! Now, let’s go back to Barnow, Cain, and Goldberger’s equation —

y=\alpha z+w+\varepsilon.

With the CIA, we can decompose w into w=\beta t + \varepsilon^*, where \beta is a vector of population regression coefficients that is assumed to satisfy

E[w|t]=\beta t .

Substituting this into the equation gives

y=\alpha z +\beta t + (\varepsilon^{*}+\varepsilon),

where \alpha is the causal effect.
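A throwaway simulation (all numbers made up, purely for illustration) shows the argument numerically: omitting t biases the estimate of \alpha, while conditioning on t recovers it.

set.seed(1)
n <- 10000
t <- rnorm(n)                      # observed variable that drives assignment
z <- as.numeric(t + rnorm(n) > 0)  # treatment status depends on t only
w <- 2 * t + rnorm(n)              # unobserved, but correlated with z through t
y <- 1.5 * z + w + rnorm(n)        # true alpha = 1.5

coef(lm(y ~ z))      # biased: z picks up part of w
coef(lm(y ~ z + t))  # conditioning on t recovers alpha of roughly 1.5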