Let there be \(N\) units indexed by \(i = 1, 2, \ldots, N\). For example, these could be people, households, firms, states, stocks, etc. For each unit, we observe an outcome variable of interest, \(Y_i\), and a regressor \(X_i\). We can run the following regression:
\[ Y_i = \beta_0 + \beta_1 X_i + U_i \tag{17.1} \]
Recall that \(U_i\) is unobservable and \(\beta_0\) and \(\beta_1\) are the parameters.
If \(X_i\) is exogenous, meaning \(\mathbb{E}(U \vert X) = \mathbb{E}(U)\), then we can interpret the model as causal. That is, increasing \(X_i\) by one unit causes \(Y_i\) to change by \(\beta_1\) units.
We saw this interpretation with randomized controlled trials. Suppose that \(Y_i\) is income and \(X_i\) is an indicator of whether an individual was randomized to attend job training. That is, if \(X_i = 1\) then individual \(i\) was randomized into the treatment group and attended the job training. If \(X_i = 0\), then individual \(i\) was randomized into the control group and did not attend the job training. Then \(\beta_1\) will be the causal effect of job training on income. Note that in this case, \(X_i\) is exogenous under the assumption that randomizing means that not only are the observed variables the same between treatment and control, but also the unobserved factors (\(U_i\)) are the same between treatment and control.
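To make this concrete, here is a small simulated example in R (the data and the true effect of 2 are made up purely for illustration). Because \(X_i\) is assigned at random, it is independent of \(U_i\), and the OLS slope recovers the causal effect.
# Simulated example: randomized X is independent of the unobserved U
set.seed(123)
N <- 1000
U <- rnorm(N)            # Unobserved factors (e.g., ability, motivation)
X <- rbinom(N, 1, 0.5)   # Randomized treatment indicator (job training)
Y <- 5 + 2 * X + U       # True causal effect of X on Y is 2
coef(lm(Y ~ X))          # Estimated slope should be close to 2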
When exogeneity breaks down, we cannot interpret the regression causally. There are several reasons why exogeneity may fail, but regardless of the reason, \(\mathbb{E}(U \vert X) \neq \mathbb{E}(U)\).
We know that \(X_i\) is endogenous, meaning that \(\mathbb{E}(U \vert X) \neq \mathbb{E}(U)\). Suppose we observe \(Z_i\), which is a valid instrumental variable if the following conditions hold:
Relevance. \(X\) and \(Z\) are correlated.
Exogeneity. \(Z\) cannot be correlated with \(U\).
We can check for relevance in the data. That is because \(X\) and \(Z\) are both observed. We can simply calculate the correlation between them to see that they are related. However, we cannot check for exogeneity. This is an assumption on the unobserved term \(U\), so we must rely on logical arguments to justify exogeneity.
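For example, with observed data on \(X\) and \(Z\) (simulated, made-up values below), the relevance check is a one-line calculation; exogeneity, by contrast, involves the unobserved \(U\) and cannot be checked this way.
# Simulated illustration of the relevance check
set.seed(1)
Z <- rnorm(500)
X <- 0.6 * Z + rnorm(500)   # X depends partly on Z
cor(X, Z)                   # Sample correlation between regressor and instrument
cor.test(X, Z)              # Formal test that the correlation is nonzero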
17.3.1 Two-Stage Least Squares Estimator
How do we put into practice the basic instrumental variable setup? We need an estimation strategy. Two-stage least squares (TSLS) is a common approach. There are two stages.
Regress the endogenous regressor on the instrument:
\[ X_i = \pi_0 + \pi_1 Z_i + V_i \]
The unobserved variable \(V_i\) is the component of \(X_i\) not explained by the instrument. From this regression, we can predict the regressor. That means calculating \(\hat{X}_i\) for all units. As long as \(Z\) satisfies the two assumptions, we can treat \(\hat{X}\) as exogenous with respect to \(Y\).
Regress the outcome on the predicted regressor:
\[ Y_i = \alpha_0 + \alpha_1 \hat{X}_i + W_i \]
Then, \(\hat{\alpha}_0\) and \(\hat{\alpha}_1\) are the TSLS estimates of \(\beta_0\) and \(\beta_1\) from Equation 17.1.
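To see why this works in the simple case of one regressor and one instrument, a short derivation (using only the two instrument conditions) shows what the TSLS slope is estimating:
\[ \operatorname{Cov}(Z_i, Y_i) = \operatorname{Cov}(Z_i, \beta_0 + \beta_1 X_i + U_i) = \beta_1 \operatorname{Cov}(Z_i, X_i) + \operatorname{Cov}(Z_i, U_i). \]
Exogeneity sets \(\operatorname{Cov}(Z_i, U_i) = 0\), and relevance guarantees \(\operatorname{Cov}(Z_i, X_i) \neq 0\), so
\[ \beta_1 = \frac{\operatorname{Cov}(Z_i, Y_i)}{\operatorname{Cov}(Z_i, X_i)}. \]
Replacing the covariances with their sample counterparts yields the TSLS estimator of \(\beta_1\).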
17.3.1.1 R Implementation
Suppose a policymaker wants to decrease smoking. They suggest increasing taxes on cigarettes to raise the price. In order to know the effect this policy will have on cigarette consumption, we need to know: what is the causal effect of increasing the price of cigarettes on the demand for cigarettes?
Why is this a difficult question to analyze? Demand and supply are determined simultaneously. The price affects the demand for cigarettes, but the demand also affects the price. This general source of endogeneity is called simultaneity.
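A stylized way to see the problem is to write demand and supply as two equations that jointly determine quantity and price (a sketch with generic notation, just to illustrate the mechanism):
\[ Q_i^{d} = \beta_0 + \beta_1 P_i + U_i \quad \text{(demand)}, \qquad Q_i^{s} = \gamma_0 + \gamma_1 P_i + V_i \quad \text{(supply)}. \]
In equilibrium, \(Q_i^{d} = Q_i^{s}\), which implies \(P_i = (\beta_0 - \gamma_0 + U_i - V_i)/(\gamma_1 - \beta_1)\). The equilibrium price depends directly on the demand shock \(U_i\), so in a regression of quantity on price, \(\mathbb{E}(U \vert P) \neq \mathbb{E}(U)\).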
We can use taxes on cigarettes as instruments for the price. Let us evaluate the two assumptions for this instrument.
Relevance. Taxes affect the prices of cigarettes.
Exogeneity. Taxes affect the demand for cigarettes indirectly, through prices. However, there are no other plausible channels through which taxes may affect the demand for cigarettes. This assumption can always be debated, but the argument for exogeneity is that, apart from the effect through prices, there is no other effect.
Let us run the analysis in R. We will use data compiled in Stock and Watson (2015). We can run some standard commands to explore the data.
# Load this package so we can access the data
library(AER)
# Load other packages that will be useful
library(dplyr)
library(ggplot2)
library(magrittr)
# Data on cigarette consumption for the 48 continental US states from 1985-1995
data("CigarettesSW")
# Number of rows and columns
dim(CigarettesSW)
# Summary statistics of all variables
summary(CigarettesSW)
state year cpi population packs
AL : 2 1985:48 Min. :1.076 Min. : 478447 Min. : 49.27
AR : 2 1995:48 1st Qu.:1.076 1st Qu.: 1622606 1st Qu.: 92.45
AZ : 2 Median :1.300 Median : 3697472 Median :110.16
CA : 2 Mean :1.300 Mean : 5168866 Mean :109.18
CO : 2 3rd Qu.:1.524 3rd Qu.: 5901500 3rd Qu.:123.52
CT : 2 Max. :1.524 Max. :31493524 Max. :197.99
(Other):84
income tax price taxs
Min. : 6887097 Min. :18.00 Min. : 84.97 Min. : 21.27
1st Qu.: 25520384 1st Qu.:31.00 1st Qu.:102.71 1st Qu.: 34.77
Median : 61661644 Median :37.00 Median :137.72 Median : 41.05
Mean : 99878736 Mean :42.68 Mean :143.45 Mean : 48.33
3rd Qu.:127313964 3rd Qu.:50.88 3rd Qu.:176.15 3rd Qu.: 59.48
Max. :771470144 Max. :99.00 Max. :240.85 Max. :112.63
The index \(i\) corresponds to states. Here, \(Q_i\) is the number of cigarette packs sold per capita and \(P_i\) is the average real price (after tax) per pack of cigarettes. The instrument is \(S_i\), the real sales tax per pack. We are going to focus on one year of data (1995). After exploring the data, the next step is to clean the data for this analysis.
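Written out, the demand equation we want to estimate and the first stage we will run below are (this simply restates the regressions in the code that follows):
\[ \log(Q_i) = \beta_0 + \beta_1 \log(P_i) + U_i, \qquad \log(P_i) = \pi_0 + \pi_1 S_i + V_i. \]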
# Limit to 1995
cig <- CigarettesSW %>%
  filter(year == 1995)
# Calculate real prices and the real sales tax
cig <- cig %>%
  mutate(P = price / cpi,        # Real price
         S = (taxs - tax) / cpi) # Real sales tax
# Summarize variables we will use in the analysis
summary(cig[c("P", "S", "packs")])
P S packs
Min. : 95.79 Min. : 0.000 Min. : 49.27
1st Qu.:109.32 1st Qu.: 4.641 1st Qu.: 80.23
Median :115.59 Median : 5.634 Median : 92.84
Mean :120.24 Mean : 5.364 Mean : 96.33
3rd Qu.:130.26 3rd Qu.: 7.280 3rd Qu.:109.27
Max. :158.04 Max. :10.264 Max. :172.65
Now, we can perform the first stage of the TSLS estimator:
# First stage: regress the log real price on the sales tax
tsls_s1 <- lm(log(P) ~ S, data = cig)
summary(tsls_s1)
Call:
lm(formula = log(P) ~ S, data = cig)
Residuals:
Min 1Q Median 3Q Max
-0.221027 -0.044324 0.000111 0.063730 0.210717
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.616546 0.029108 158.6 < 2e-16 ***
S 0.030729 0.004802 6.4 7.27e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09394 on 46 degrees of freedom
Multiple R-squared: 0.471, Adjusted R-squared: 0.4595
F-statistic: 40.96 on 1 and 46 DF, p-value: 7.271e-08
The interpretation is that increasing the sales tax by 1 dollar per pack increases the price per pack by about 3.07%. This is because the outcome variable is in logs. We can use this regression to evaluate the first assumption of the instrument (relevance). The criteria below are all related, but each is useful for understanding the first stage.
Significance of the estimated coefficients. We see that the coefficients are significantly different from 0. This means that the relationship between the sales tax and the price is statistically significant.
Magnitude of the estimated coefficients. We established statistical significance, but what about economic significance? We can judge this subjectively from the magnitude of the estimated coefficients. Increasing the price by approximately 3% seems like a substantial change. Often this is discussed alongside other estimates or papers. For reference, consider that inflation was 2.9% in 2024.
\(R^2\). We can see how much of the observed variation of the price is explained by the sales tax by examining the \(R^2\). We find that 47% of the variation is explained.
summary(tsls_s1)$r.squared
[1] 0.4709961
\(F\) statistic. The \(F\) statistic tells us the overall significance of the model. A common, but arbitrary, rule-of-thumb is that an instrument satisfies the relevance assumption if the \(F\) statistic is at least 10.
summary(tsls_s1)$fstatistic
value numdf dendf
40.95588 1.00000 46.00000
We can be satisfied that the instrument is relevant. From the first-stage regression, we calculate the fitted values, add them to the data, and run the second stage:
cig <- cig %>%
  mutate(logP_pred = tsls_s1$fitted.values)
# Second stage
tsls_s2 <- lm(log(packs) ~ logP_pred, data = cig)
summary(tsls_s2)
Call:
lm(formula = log(packs) ~ logP_pred, data = cig)
Residuals:
Min 1Q Median 3Q Max
-0.63180 -0.15802 0.00524 0.13574 0.61434
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.7199 1.8012 5.396 2.3e-06 ***
logP_pred -1.0836 0.3766 -2.877 0.00607 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2264 on 46 degrees of freedom
Multiple R-squared: 0.1525, Adjusted R-squared: 0.1341
F-statistic: 8.277 on 1 and 46 DF, p-value: 0.006069
The interpretation is that increasing the price per pack of cigarettes by 1% causally reduces cigarette demand by 1.08%. This holds under the assumptions above. If there are other threats to exogeneity, such as omitted variable bias, then these estimates can no longer be interpreted causally.
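Because both the outcome and the regressor are in logs, the estimated coefficient can be read (approximately) as a price elasticity of demand:
\[ \hat{\beta}_1 \approx \frac{\%\Delta Q}{\%\Delta P} = -1.08, \]
so, for example, a 10% increase in the real price corresponds to roughly a \(10 \times 1.08\% \approx 10.8\%\) decline in packs sold per capita.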
We can use the variable income to control for the general economic conditions of the states. Let us see how including a covariate changes the process.
# Create per capita real income
cig <- cig %>%
  mutate(real_inc = (income / population) / cpi)
# First stage
tsls_X_s1 <- lm(log(P) ~ log(real_inc) + S, data = cig)
summary(tsls_X_s1)
Call:
lm(formula = log(P) ~ log(real_inc) + S, data = cig)
Residuals:
Min 1Q Median 3Q Max
-0.163799 -0.033049 0.001907 0.049322 0.185542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.590811 0.225558 15.920 < 2e-16 ***
log(real_inc) 0.389283 0.085104 4.574 3.74e-05 ***
S 0.027395 0.004077 6.720 2.65e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.07848 on 45 degrees of freedom
Multiple R-squared: 0.6389, Adjusted R-squared: 0.6228
F-statistic: 39.81 on 2 and 45 DF, p-value: 1.114e-10
summary(tsls_X_s1)$fstatistic
value numdf dendf
39.809 2.000 45.000
summary(tsls_X_s1)$r.squared
[1] 0.6388965
cig <- cig %>%
  mutate(logP_X_pred = tsls_X_s1$fitted.values)
# Second stage
tsls_X_s2 <- lm(log(packs) ~ log(real_inc) + logP_X_pred, data = cig)
summary(tsls_X_s2)
Call:
lm(formula = log(packs) ~ log(real_inc) + logP_X_pred, data = cig)
Residuals:
Min 1Q Median 3Q Max
-0.67048 -0.13306 0.00598 0.13361 0.58044
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4307 1.6247 5.805 6.09e-07 ***
log(real_inc) 0.2145 0.3212 0.668 0.5077
logP_X_pred -1.1434 0.4300 -2.659 0.0108 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2267 on 45 degrees of freedom
Multiple R-squared: 0.1687, Adjusted R-squared: 0.1318
F-statistic: 4.567 on 2 and 45 DF, p-value: 0.01564
The AER package has the function ivreg, which is a one-stop function for the above protocol. Before the vertical bar (\(\vert\)) goes our desired equation; after it, put the right-hand side of the first stage (the instrument plus any exogenous covariates, so log(real_inc) appears on both sides).
# TSLS in one call
tsls_ivreg <- ivreg(log(packs) ~ log(P) + log(real_inc) | log(real_inc) + S, data = cig)
summary(tsls_ivreg)
You can see that the coefficient estimates match what we obtained above from our more manual TSLS approach. The standard errors differ, however, because ivreg accounts for the fact that the first stage was estimated, whereas the manual second-stage standard errors do not.
Call:
ivreg(formula = log(packs) ~ log(P) + log(real_inc) | log(real_inc) +
S, data = cig)
Residuals:
Min 1Q Median 3Q Max
-0.611000 -0.086072 0.009423 0.106912 0.393159
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4307 1.3584 6.943 1.24e-08 ***
log(P) -1.1434 0.3595 -3.181 0.00266 **
log(real_inc) 0.2145 0.2686 0.799 0.42867
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1896 on 45 degrees of freedom
Multiple R-Squared: 0.4189, Adjusted R-squared: 0.3931
Wald test: 6.534 on 2 and 45 DF, p-value: 0.003227
17.4 Further Reading
See Stock and Watson (2020) for more details on IV regression. I adapted parts of chapter 12 from Hanck et al. (2018) for these notes. The example is from Stock and Watson (2020).
17.4.1 References
Hanck, Christoph, Martin Arnold, Alexander Gerber, and Martin Schmelzer. 2018. Introduction to Econometrics with R. https://bookdown.org/machar1991/ITER/.