17  Instrumental Variables

17.1 Problem

17.2 Review of Regression and Assumptions

Let there be \(N\) units indexed by \(i = 1, 2, \ldots, N\). For example, these could be people, households, firms, states, stocks, etc. For each unit, we observe an outcome variable of interest, \(Y_i\), and a regressor \(X_i\). We can run the following regression:

\[\begin{equation} Y_i = \beta_0 + \beta_1 X_i + U_i. \end{equation}\]

Recall that \(U_i\) is unobservable and \(\beta_0\) and \(\beta_1\) are the parameters.

If the following assumptions hold, then we can interpret the model as causal. That is, increasing \(X_i\) by one unit causes \(Y_i\) to change by \(\beta_1\) units.

  1. Linear Model.
  2. Strict Exogeneity: \(\mathbb{E}(U \vert X) = \mathbb{E}(U) = 0\).
  3. No Perfect Multicollinearity.
  4. Homoscedasticity.

We saw this interpretation with randomized controlled trials. Suppose that \(Y_i\) is income and \(X_i\) is an indicator of whether an individual was randomized to attend job training. That is, if \(X_i = 1\), then individual \(i\) was randomized into the treatment group and attended the job training. If \(X_i = 0\), then individual \(i\) was randomized into the control group and did not attend the job training. Then \(\beta_1\) is the causal effect of job training on income. Note that \(X_i\) is exogenous here because randomization balances not only the observed variables but also the unobserved factors (\(U_i\)) between treatment and control.
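As a quick illustration (a simulation of my own with hypothetical effect sizes, not from the text), when \(X_i\) is randomized and therefore independent of \(U_i\), OLS recovers the causal effect:

```r
# Hypothetical simulation: X is randomized, so it is independent of U
set.seed(7)
n <- 10000
X <- rbinom(n, 1, 0.5)   # random assignment to job training (1 = treated)
U <- rnorm(n)            # unobserved factors, balanced across groups
Y <- 50 + 5 * X + U      # true causal effect of training is 5

coef(lm(Y ~ X))["X"]     # close to the true effect of 5
```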

When exogeneity breaks down, we cannot interpret the regression causally. There are several reasons why exogeneity may fail (for example, omitted variables or simultaneity), but regardless of the reason, \(\mathbb{E}(U\vert X) \neq \mathbb{E}(U)\).
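To see the consequence, here is a small simulated example (all numbers hypothetical) in which \(X_i\) and \(U_i\) share an unobserved component, so OLS is biased:

```r
# Hypothetical simulation of endogeneity: X and U share a confounder
set.seed(123)
n <- 10000
confounder <- rnorm(n)       # unobserved in practice
X <- confounder + rnorm(n)   # regressor contaminated by the confounder
U <- confounder + rnorm(n)   # error term contains the same confounder
Y <- 1 + 2 * X + U           # true beta_1 = 2

# OLS is biased upward because cov(X, U) > 0: the probability limit of
# the estimate is 2 + cov(X, U) / var(X) = 2 + 1/2 = 2.5
coef(lm(Y ~ X))["X"]
```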

17.3 Basic Instrumental Variable Setup

Suppose we have the regression model from above:

\[ Y_i = \beta_0 + \beta_1 X_i + U_i. \tag{17.1}\]

We know that \(X_i\) is endogenous, meaning that \(\mathbb{E}(U\vert X) \neq \mathbb{E}(U)\). Suppose we observe \(Z_i\), which is a valid instrumental variable if the following conditions hold:

  1. Relevance. \(X\) and \(Z\) are correlated.

  2. Exogeneity. \(Z\) cannot be correlated with \(U\).

We can check for relevance in the data. That is because \(X\) and \(Z\) are both observed. We can simply calculate the correlation between them to see that they are related. However, we cannot check for exogeneity. This is an assumption on the unobserved term \(U\), so we must rely on logical arguments to justify exogeneity.
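A small simulated example (hypothetical setup of my own) makes the asymmetry concrete: relevance can be computed from the data, while exogeneity involves the unobserved \(U\):

```r
# Hypothetical simulation: Z is relevant (shifts X) and exogenous
set.seed(1)
n <- 5000
U <- rnorm(n)                 # unobserved in practice
Z <- rnorm(n)                 # instrument, generated independently of U
X <- 0.5 * Z + U + rnorm(n)   # endogenous regressor: contains U

# Relevance IS checkable, since X and Z are both observed
cor(X, Z)                     # clearly positive (about 1/3 by construction)

# Exogeneity is NOT checkable: cor(Z, U) involves the unobserved U,
# so in real data it must be defended with a logical argument
```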

17.3.1 Two-Stage Least Squares Estimator

How do we put into practice the basic instrumental variable setup? We need an estimation strategy. Two-stage least squares (TSLS) is a common approach. There are two stages.

  1. Regress the endogenous regressor on the instrument:

\[\begin{equation} X_i = \pi_0 + \pi_1 Z_i + V_i. \end{equation}\]

The unobserved variable \(V_i\) is the component of \(X_i\) not explained by the instrument. From this regression, we can predict the regressor, calculating \(\hat{X}_i\) for all units. As long as \(Z\) satisfies the two assumptions, we can treat \(\hat{X}\) as exogenous in the regression for \(Y\).

  2. Regress the outcome on the predicted regressor:

\[\begin{equation} Y_i = \alpha_0 + \alpha_1 \hat{X}_i + W_i. \end{equation}\]

Then, \(\hat{\alpha}_0\) and \(\hat{\alpha}_1\) are the TSLS estimates of \(\beta_0\) and \(\beta_1\) from Equation 17.1.
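The two stages can be sketched end to end on simulated data (a minimal sketch with hypothetical parameter values; the variable names are my own):

```r
# Hypothetical simulation: Z is a valid instrument for the endogenous X
set.seed(42)
n <- 20000
Z <- rnorm(n)                              # instrument, independent of U
confounder <- rnorm(n)                     # unobserved in practice
X <- 1 + 0.8 * Z + confounder + rnorm(n)   # relevant and endogenous
U <- confounder + rnorm(n)
Y <- 3 + 2 * X + U                         # true beta_0 = 3, beta_1 = 2

# Stage 1: regress the endogenous regressor on the instrument
stage1 <- lm(X ~ Z)
X_hat  <- fitted(stage1)     # predicted regressor

# Stage 2: regress the outcome on the predicted regressor
stage2 <- lm(Y ~ X_hat)
coef(stage2)   # close to (3, 2), while plain OLS of Y on X is biased up
```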

17.3.1.1 R Implementation

Suppose a policymaker wants to decrease smoking. They suggest increasing taxes on cigarettes to raise the price. In order to know the effect this policy will have on cigarette consumption, we need to know: what is the causal effect of increasing the price of cigarettes on the demand for cigarettes?

Why is this a difficult question to analyze? Demand and supply are determined simultaneously. It is possible that both the demand for cigarettes affects the price, and that the price affects the demand. This general source of endogeneity is called simultaneity.

We can use taxes on cigarettes as instruments for the price. Let us evaluate the two assumptions for this instrument.

  1. Relevance. Taxes affect the prices of cigarettes.

  2. Exogeneity. Taxes affect the demand for cigarettes only indirectly, through prices. This assumption can always be debated, but the exogeneity argument is that, apart from the effect through prices, taxes have no other effect on demand.

Let us run the analysis in R. We will use data compiled in Stock and Watson (2020). We can run some of the usual commands to explore the data.

# Load this package so we can access the data
library(AER) 
# Load other packages that will be useful
library(dplyr)

library(ggplot2)
library(magrittr)

# Data on cigarette consumption for the 48 continental US from 1985-1995
data("CigarettesSW") 

# Number of rows and columns
dim(CigarettesSW)
[1] 96  9
# Names of variables
names(CigarettesSW)
[1] "state"      "year"       "cpi"        "population" "packs"     
[6] "income"     "tax"        "price"      "taxs"      
# Summary statistics of all variables
summary(CigarettesSW)
     state      year         cpi          population           packs       
 AL     : 2   1985:48   Min.   :1.076   Min.   :  478447   Min.   : 49.27  
 AR     : 2   1995:48   1st Qu.:1.076   1st Qu.: 1622606   1st Qu.: 92.45  
 AZ     : 2             Median :1.300   Median : 3697472   Median :110.16  
 CA     : 2             Mean   :1.300   Mean   : 5168866   Mean   :109.18  
 CO     : 2             3rd Qu.:1.524   3rd Qu.: 5901500   3rd Qu.:123.52  
 CT     : 2             Max.   :1.524   Max.   :31493524   Max.   :197.99  
 (Other):84                                                                
     income               tax            price             taxs       
 Min.   :  6887097   Min.   :18.00   Min.   : 84.97   Min.   : 21.27  
 1st Qu.: 25520384   1st Qu.:31.00   1st Qu.:102.71   1st Qu.: 34.77  
 Median : 61661644   Median :37.00   Median :137.72   Median : 41.05  
 Mean   : 99878736   Mean   :42.68   Mean   :143.45   Mean   : 48.33  
 3rd Qu.:127313964   3rd Qu.:50.88   3rd Qu.:176.15   3rd Qu.: 59.48  
 Max.   :771470144   Max.   :99.00   Max.   :240.85   Max.   :112.63  
                                                                      

We want to run this regression:

\[\begin{equation} \log(Q_i) = \beta_0 + \beta_1 \log(P_i) + U_i. \end{equation}\]

The index \(i\) corresponds to states. Here, \(Q_i\) is the number of cigarette packs sold per capita and \(P_i\) is the average real price (after tax) per pack of cigarettes. The instrument is \(S_i\), the real sales tax per pack. We are going to focus on one year of data (1995). After exploring the data, the next step is to clean it for this analysis.

# Limit to 1995
cig <- CigarettesSW %>%
  filter(year == 1995)

# Calculate real prices
cig <- cig %>%
  mutate(P = price / cpi, # Real price 
         S = (taxs - tax) / cpi) # Real sales tax

# Summarize variables we will use in the analysis
summary(cig[c("P", "S", "packs")])
       P                S              packs       
 Min.   : 95.79   Min.   : 0.000   Min.   : 49.27  
 1st Qu.:109.32   1st Qu.: 4.641   1st Qu.: 80.23  
 Median :115.59   Median : 5.634   Median : 92.84  
 Mean   :120.24   Mean   : 5.364   Mean   : 96.33  
 3rd Qu.:130.26   3rd Qu.: 7.280   3rd Qu.:109.27  
 Max.   :158.04   Max.   :10.264   Max.   :172.65  

Now, we can perform the first stage of the TSLS estimator:

\[\begin{equation} \log(P_i) = \pi_0 + \pi_1 S_i + V_i. \end{equation}\]

tsls_s1 <- lm(log(P) ~ S,
              data = cig)

summary(tsls_s1)

Call:
lm(formula = log(P) ~ S, data = cig)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.221027 -0.044324  0.000111  0.063730  0.210717 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.616546   0.029108   158.6  < 2e-16 ***
S           0.030729   0.004802     6.4 7.27e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09394 on 46 degrees of freedom
Multiple R-squared:  0.471, Adjusted R-squared:  0.4595 
F-statistic: 40.96 on 1 and 46 DF,  p-value: 7.271e-08

The interpretation is that increasing the sales tax by 1 dollar per pack increases the per pack price by 3.07%. This is because the outcome variable is logged. We can use this regression to evaluate the first assumption of the instrument (relevance). All of the criteria below are related, but each is useful for understanding the first stage.

  • Significance of the estimated coefficients. We see that the coefficients are significantly different from 0. This means that the relationship between the sales tax and the price is statistically significant.
  • Magnitude of the estimated coefficients. We determined that there is statistical significance, but what about economic significance? We can judge this subjectively from the magnitude of the estimated coefficients. Increasing the price by approximately 3% seems like a substantial change. Often this is discussed alongside other estimates or papers. For reference, consider that inflation was 2.9% in 2024.
  • \(R^2\). We can see how much of the observed variation of the price is explained by the sales tax by examining the \(R^2\). We find that 47% of the variation is explained.
summary(tsls_s1)$r.squared
[1] 0.4709961
  • \(F\) statistic. The \(F\) statistic tells us the overall significance of the model. A common, but arbitrary, rule-of-thumb is that an instrument satisfies the relevance assumption if the \(F\) statistic is at least 10.
summary(tsls_s1)$fstatistic
   value    numdf    dendf 
40.95588  1.00000 46.00000 

We can be satisfied that the instrument is relevant. From the regression, we calculate the fitted values. Note that I add them to the data.

cig <- cig %>%
  mutate(logP_pred = tsls_s1$fitted.values)

Then, we use this to run the second stage:

\[\begin{equation} \log(Q_i) = \alpha_0 + \alpha_1 \widehat{\log(P_i)} + W_i. \end{equation}\]

tsls_s2 <- lm(log(packs) ~ logP_pred,
              data = cig)

summary(tsls_s2)

Call:
lm(formula = log(packs) ~ logP_pred, data = cig)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.63180 -0.15802  0.00524  0.13574  0.61434 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.7199     1.8012   5.396  2.3e-06 ***
logP_pred    -1.0836     0.3766  -2.877  0.00607 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2264 on 46 degrees of freedom
Multiple R-squared:  0.1525,    Adjusted R-squared:  0.1341 
F-statistic: 8.277 on 1 and 46 DF,  p-value: 0.006069

The interpretation is that increasing the price per pack of cigarettes by 1% causally reduces cigarette demand by 1.08%. This holds under the assumptions above. If there are other threats to exogeneity, such as omitted variable bias, then these estimates can no longer be interpreted causally.

We can use the variable income to control for the general economic conditions of states. Let us see how including a covariate changes the process.

# Create per capita real income
cig <- cig %>%
  mutate(real_inc = (income / population) / cpi)

# First stage
tsls_X_s1 <- lm(log(P) ~ log(real_inc) + S, data = cig)

summary(tsls_X_s1)

Call:
lm(formula = log(P) ~ log(real_inc) + S, data = cig)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.163799 -0.033049  0.001907  0.049322  0.185542 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.590811   0.225558  15.920  < 2e-16 ***
log(real_inc) 0.389283   0.085104   4.574 3.74e-05 ***
S             0.027395   0.004077   6.720 2.65e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07848 on 45 degrees of freedom
Multiple R-squared:  0.6389,    Adjusted R-squared:  0.6228 
F-statistic: 39.81 on 2 and 45 DF,  p-value: 1.114e-10
summary(tsls_X_s1)$fstatistic
 value  numdf  dendf 
39.809  2.000 45.000 
summary(tsls_X_s1)$r.squared
[1] 0.6388965
cig <- cig %>%
  mutate(logP_X_pred = tsls_X_s1$fitted.values)


# Second stage
tsls_X_s2 <- lm(log(packs) ~ log(real_inc) + logP_X_pred, data = cig)
summary(tsls_X_s2)

Call:
lm(formula = log(packs) ~ log(real_inc) + logP_X_pred, data = cig)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.67048 -0.13306  0.00598  0.13361  0.58044 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.4307     1.6247   5.805 6.09e-07 ***
log(real_inc)   0.2145     0.3212   0.668   0.5077    
logP_X_pred    -1.1434     0.4300  -2.659   0.0108 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2267 on 45 degrees of freedom
Multiple R-squared:  0.1687,    Adjusted R-squared:  0.1318 
F-statistic: 4.567 on 2 and 45 DF,  p-value: 0.01564

The AER package has the function ivreg, which is a one-stop function for the above protocol. The first equation is our desired equation. After the vertical bar (\(\vert\)), put the right-hand side of the first stage. The coefficient estimates match what we obtained above with the manual TSLS approach. Note, however, that the standard errors differ: ivreg reports valid standard errors, while the manual second-stage standard errors are not correct because they treat \(\hat{X}_i\) as data rather than as an estimate.

ivreg(log(packs) ~ log(P) + log(real_inc) | log(real_inc) + S,
      data = cig) %>%
  summary()

Call:
ivreg(formula = log(packs) ~ log(P) + log(real_inc) | log(real_inc) + 
    S, data = cig)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.611000 -0.086072  0.009423  0.106912  0.393159 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.4307     1.3584   6.943 1.24e-08 ***
log(P)         -1.1434     0.3595  -3.181  0.00266 ** 
log(real_inc)   0.2145     0.2686   0.799  0.42867    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1896 on 45 degrees of freedom
Multiple R-Squared: 0.4189, Adjusted R-squared: 0.3931 
Wald test: 6.534 on 2 and 45 DF,  p-value: 0.003227 

17.4 Further Reading

See Stock and Watson (2020) for more details on IV regression. I adapted parts of chapter 12 from Hanck et al. (2018) for these notes. The example is from Stock and Watson (2020).

17.4.1 References

Hanck, Christoph, Martin Arnold, Alexander Gerber, and Martin Schmelzer. 2018. Introduction to Econometrics with R. https://bookdown.org/machar1991/ITER/.
Stock, James H., and Mark W. Watson. 2020. Introduction to Econometrics. 4th ed. New York, NY: Pearson. https://www.pearson.com/en-us/pearsonplus/p/9780136879787?srsltid=AfmBOoq1UPESY9Ez-JJYsWKIOM907u7A75_4qkiJiqT6ZPyAdlBH_5b4.