Regression is a powerful tool for investigating the dependence of one variable on another (Casella and Berger 2002). In this chapter, we will review linear regression, its common pitfalls, and some useful non-linear specifications (logit and probit).
15.1 Regression
In the population, suppose there are three random variables \(X\), \(U\), and \(Y\). Because they are random variables, they have underlying probability distributions and may have a statistical relationship. Generally, we may think of \(Y\) as a function of \(X\) and \(U\):
\[\begin{equation*}
Y = f(X, U).
\end{equation*}\]
We are not saying that \(X\) and \(U\) cause \(Y\) or anything like that. All we are saying is that we think there is a way to write \(Y\) as depending on the values that \(X\) and \(U\) take. In practice, working with a general function \(f()\) makes the problem harder, so we assume that the function is linear. Even if the true relationship is not linear, a linear function is often an adequate approximation.
Suppose we have a sample of size \(N\) units, indexed by \(i\). That is, \(i = 1, 2, 3, \ldots, N\). Each unit \(i\) has a realization of the random variables: \(Y_i\), \(X_i\), and \(U_i\). Suppose that \(U\) is unobservable. That means that our dataset contains \((X_1, X_2, \ldots, X_N)\) and \((Y_1, Y_2, \ldots, Y_N)\). We can set up the linear regression as
\[\begin{equation*}
Y = \beta_0 + \beta_1 X + U.
\end{equation*}\]
Note that \(U\) is unobserved, so it does not matter whether it has a coefficient attached to it. It is common to assume that \(\mathbb{E}(U) = 0\). Even if this were not the case, the intercept \(\beta_0\) can be shifted to absorb it. Then, the interpretation is that the expected value of \(Y\) given the value of \(X = x\) is:
\[\begin{equation*}
\mathbb{E}(Y \vert X = x) = \beta_0 + \beta_1 x.
\end{equation*}\]
The estimation of \(\beta_0\) and \(\beta_1\) can be done using Ordinary Least Squares (OLS). Let us use R for this. We will use data from the General Social Survey (GSS), available on Canvas for download. Suppose we are interested in the relationship between education and income.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(haven)
library(magrittr)
# Remember to set your working directory first
# setwd("Your/Working/Directory/Here")
gss <- read_dta("Data/Regression/GSS2022.dta")
# Summarize education and income
gss %>%
  select(educ, conrinc) %>%
  summary()
educ conrinc
Min. : 0.00 Min. : 316.5
1st Qu.:12.00 1st Qu.: 15033.8
Median :14.00 Median : 28485.0
Mean :14.15 Mean : 40508.0
3rd Qu.:16.00 3rd Qu.: 52222.5
Max. :20.00 Max. :163572.7
NA's :29 NA's :1804
While it would work to write out the expressions for the OLS estimator explicitly in R, it is more practical to use a function for linear regression. The built-in function lm() does the job. Note that it uses the formula syntax, in which ~ separates the outcome on the left from the covariates on the right. Because we pass the dataset through the data argument, we do not need to reference variables with subsetting such as gss$conrinc. Note also that lm() automatically estimates a regression with an intercept.
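As a minimal example, the call below fits the simple regression of income on education (the same model that is assigned to an object in the next chunk) and prints the estimated intercept and slope:

lm(formula = conrinc ~ educ, data = gss)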
While the printout of this command is useful for getting the broad strokes of the regression, there is a lot more information hidden in the object that lm() returns. Assign the output to an object in the environment to see it. That way you can access its elements as a list.
out <- lm(formula = conrinc ~ educ, data = gss)
# str(out)
names(out)
Notice that the object returned by summary() has additional elements beyond those in out. These elements may be particularly useful (see the sketch after this list).
coefficients. Note that this looks different from out$coefficients and contains more information: the standard errors, \(t\)-statistics (the estimated coefficient divided by its standard error), and the two-sided \(p\)-values testing whether each coefficient is different from 0.
sigma. The residual standard error (the square root of the mean squared error), where \(p\) is the number of coefficients: \(\sqrt{\frac{1}{n - p} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2}\).
df. A vector with three numbers: the number of coefficients from covariates that are not linearly dependent \((p)\), the number of observations minus this \((n - p)\), and the total number of coefficients including those that are linearly dependent. Note that you want the first and last numbers to be the same; otherwise, you are specifying a linear regression with multicollinearity.
r.squared. If there is an intercept, it can be intuitively understood as the fraction of the variance of \(Y\) that the model explains. The formula is \(1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}\).
adj.r.squared. This \(R^2\) is the same idea but it penalizes additional covariates.
fstatistic. A vector with three numbers: the \(F\)-statistic and its degrees of freedom.
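As a sketch, the elements above can be pulled from the summary object like any other list, using the out object fitted earlier:

s <- summary(out)
s$coefficients    # estimates, standard errors, t-statistics, p-values
s$sigma           # residual standard error
s$df              # degrees-of-freedom vector
s$r.squared       # R-squared
s$adj.r.squared   # adjusted R-squared
s$fstatistic      # F-statistic and its degrees of freedom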
Now that we are comfortable with the basics of lm(), let us explore adding more variables and using other specifications. To add variables, use +.
# Create a variable for experience
gss <- gss %>% mutate(exp = age - (educ + 5))
lm(formula = conrinc ~ educ + exp, data = gss) %>% summary()
Call:
lm(formula = conrinc ~ educ + exp, data = gss)
Residuals:
Min 1Q Median 3Q Max
-73546 -20917 -7928 9913 170963
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -42069.86 4337.30 -9.70 < 2e-16 ***
educ 5030.36 270.28 18.61 < 2e-16 ***
exp 383.12 51.91 7.38 2.22e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 35790 on 2237 degrees of freedom
(1909 observations deleted due to missingness)
Multiple R-squared: 0.1425, Adjusted R-squared: 0.1418
F-statistic: 185.9 on 2 and 2237 DF, p-value: < 2.2e-16
The classic Mincer equation regresses the log of the wage on education, experience, and experience squared. We can apply these mathematical transformations within the formula call; alternatively, we could create a new variable or transform the existing variable in the dataset. Just be mindful of the values: for example, since we are taking the log of income, we should check that there are no incomes less than or equal to zero. The function I() ensures that the expression is interpreted arithmetically (i.e., as squaring) rather than as formula syntax.
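A call along the following lines produces the output shown below:

lm(formula = log(conrinc) ~ educ + exp + I(exp^2), data = gss) %>% summary()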
Call:
lm(formula = log(conrinc) ~ educ + exp + I(exp^2), data = gss)
Residuals:
Min 1Q Median 3Q Max
-5.3915 -0.4422 0.1464 0.6327 3.5533
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.143e+00 1.268e-01 56.32 <2e-16 ***
educ 1.443e-01 7.551e-03 19.11 <2e-16 ***
exp 7.815e-02 5.146e-03 15.19 <2e-16 ***
I(exp^2) -1.254e-03 9.171e-05 -13.67 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9974 on 2236 degrees of freedom
(1909 observations deleted due to missingness)
Multiple R-squared: 0.2156, Adjusted R-squared: 0.2145
F-statistic: 204.9 on 3 and 2236 DF, p-value: < 2.2e-16
Also remember that linear regression means linear in the coefficients, so it is fine to include non-linear functions of the covariates, as we do with experience squared. So far, all of the covariates have been continuous. It is also common to include dummies; it does not matter whether the variable is numeric (0 or 1) or logical (TRUE or FALSE).
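For example, a sketch of adding the sex dummy to the Mincer specification (matching the output below):

lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + sex, data = gss) %>% summary()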
Call:
lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + sex, data = gss)
Residuals:
Min 1Q Median 3Q Max
-5.1993 -0.4232 0.1642 0.5970 3.3478
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.740e+00 1.378e-01 56.17 <2e-16 ***
educ 1.478e-01 7.401e-03 19.98 <2e-16 ***
exp 7.623e-02 5.042e-03 15.12 <2e-16 ***
I(exp^2) -1.227e-03 8.983e-05 -13.66 <2e-16 ***
sex -4.128e-01 4.136e-02 -9.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9763 on 2234 degrees of freedom
(1910 observations deleted due to missingness)
Multiple R-squared: 0.2491, Adjusted R-squared: 0.2477
F-statistic: 185.2 on 4 and 2234 DF, p-value: < 2.2e-16
Let us add a categorical variable that takes more than two values (not an indicator). Notice that the values of race are 1, 2, 3, and NA, corresponding to white, black, other, and missing. If we include the variable as-is in the regression, interpretation will be difficult. An easy fix is to transform it to a factor variable.
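As a sketch, we first tabulate race and include it as-is; this numeric specification matches the output immediately below, and the factor versions follow afterwards.

# Values of race: 1 = white, 2 = black, 3 = other
table(gss$race, useNA = "always")
lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + sex + race, data = gss) %>% summary()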
Call:
lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + sex + race,
data = gss)
Residuals:
Min 1Q Median 3Q Max
-5.1945 -0.4223 0.1623 0.6018 3.3459
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.732e+00 1.459e-01 52.985 <2e-16 ***
educ 1.481e-01 7.428e-03 19.936 <2e-16 ***
exp 7.604e-02 5.063e-03 15.020 <2e-16 ***
I(exp^2) -1.223e-03 9.026e-05 -13.550 <2e-16 ***
sex -4.161e-01 4.157e-02 -10.009 <2e-16 ***
race 5.859e-03 2.745e-02 0.213 0.831
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9776 on 2216 degrees of freedom
(1927 observations deleted due to missingness)
Multiple R-squared: 0.2496, Adjusted R-squared: 0.2479
F-statistic: 147.4 on 5 and 2216 DF, p-value: < 2.2e-16
# Transformed to a factor variable
lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + sex + as.factor(race), data = gss) %>% summary()
Call:
lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + sex + as.factor(race),
data = gss)
Residuals:
Min 1Q Median 3Q Max
-5.2060 -0.4303 0.1736 0.5881 3.3330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.760e+00 1.398e-01 55.502 <2e-16 ***
educ 1.467e-01 7.435e-03 19.730 <2e-16 ***
exp 7.612e-02 5.056e-03 15.057 <2e-16 ***
I(exp^2) -1.225e-03 9.014e-05 -13.595 <2e-16 ***
sex -4.076e-01 4.164e-02 -9.789 <2e-16 ***
as.factor(race)2 -1.258e-01 5.604e-02 -2.246 0.0248 *
as.factor(race)3 5.791e-02 5.745e-02 1.008 0.3135
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9762 on 2215 degrees of freedom
(1927 observations deleted due to missingness)
Multiple R-squared: 0.252, Adjusted R-squared: 0.25
F-statistic: 124.4 on 6 and 2215 DF, p-value: < 2.2e-16
# More careful transformation to a factor variable
gss <- gss %>% mutate(race_factor = factor(race, labels = c("WHITE", "BLACK", "OTHER")))
# Note that it automatically omits the first level of the factor, in this case WHITE
# Use relevel to set a different reference
gss <- gss %>% mutate(race_factor = relevel(race_factor, ref = "BLACK"))
lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + sex + race_factor, data = gss) %>% summary()
Call:
lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + sex + race_factor,
data = gss)
Residuals:
Min 1Q Median 3Q Max
-5.2060 -0.4303 0.1736 0.5881 3.3330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.635e+00 1.444e-01 52.855 <2e-16 ***
educ 1.467e-01 7.435e-03 19.730 <2e-16 ***
exp 7.612e-02 5.056e-03 15.057 <2e-16 ***
I(exp^2) -1.225e-03 9.014e-05 -13.595 <2e-16 ***
sex -4.076e-01 4.164e-02 -9.789 <2e-16 ***
race_factorWHITE 1.258e-01 5.604e-02 2.246 0.0248 *
race_factorOTHER 1.838e-01 7.148e-02 2.571 0.0102 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9762 on 2215 degrees of freedom
(1927 observations deleted due to missingness)
Multiple R-squared: 0.252, Adjusted R-squared: 0.25
F-statistic: 124.4 on 6 and 2215 DF, p-value: < 2.2e-16
Interactions can be added with * or :. Note the difference below.
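A minimal sketch of the two forms: sex * educ expands to the main effects plus the interaction (sex + educ + sex:educ), while sex:educ includes only the interaction term. Only the output of the : specification is reproduced below.

# Main effects plus interaction
lm(formula = log(conrinc) ~ sex*educ + exp + I(exp^2) + race_factor, data = gss) %>% summary()
# Interaction only (output shown below)
lm(formula = log(conrinc) ~ sex:educ + exp + I(exp^2) + race_factor, data = gss) %>% summary()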
Call:
lm(formula = log(conrinc) ~ sex:educ + exp + I(exp^2) + race_factor,
data = gss)
Residuals:
Min 1Q Median 3Q Max
-4.8045 -0.5021 0.1699 0.6831 3.1987
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.9383502 0.1034027 86.442 < 2e-16 ***
exp 0.0811234 0.0055512 14.614 < 2e-16 ***
I(exp^2) -0.0013716 0.0000988 -13.882 < 2e-16 ***
race_factorWHITE 0.2438159 0.0612984 3.978 7.19e-05 ***
race_factorOTHER 0.3035977 0.0783431 3.875 0.00011 ***
sex:educ 0.0050231 0.0026394 1.903 0.05715 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.073 on 2216 degrees of freedom
(1927 observations deleted due to missingness)
Multiple R-squared: 0.09584, Adjusted R-squared: 0.0938
F-statistic: 46.98 on 5 and 2216 DF, p-value: < 2.2e-16
15.1.1 Interpretation of Linear and Non-Linear Specifications
The interpretation of the coefficients depends on the specification. I go through common specifications below.
Linear. \(Y = \beta_0 + \beta_1 X + U\). Increasing \(X\) by one unit changes \(Y\) by \(\hat{\beta}_1\) units on average.
Log-Linear. \(\ln(Y) = \beta_0 + \beta_1 X+ U\). Increasing \(X\) by one unit approximately changes \(Y\) by \(\hat{\beta}_1 \times 100\) percent on average.
Linear-Log. \(Y = \beta_0 + \beta_1 \ln(X) + U\). Increasing \(X\) by one percent approximately changes \(Y\) by \(\hat{\beta}_1 / 100\) units on average.
Log-Log. \(\ln(Y) = \beta_0 + \beta_1 \ln(X) + U\). Increasing \(X\) by one percent approximately changes \(Y\) by \(\hat{\beta}_1\) percent on average.
Multivariate Regression. \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + U\). Increasing \(X_1\) by one unit changes \(Y\) by \(\hat{\beta}_1\) units on average, holding all other covariates constant.
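For example, in the log-linear Mincer regression above, the coefficient on educ is roughly \(0.144\), so an additional year of education is associated with approximately \(0.144 \times 100 \approx 14\) percent higher income on average, holding experience fixed.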
15.1.2 Regression Assumptions
Now that we are empowered with R to run regressions, it can be easy to lose sight of the theory. Recall the core assumptions of the classic framework.
Linear Model. This assumption states that the true model of the relationship between \(X\) and \(Y\) is linear.
Strict Exogeneity. \(\mathbb{E}(U \vert X) = \mathbb{E}(U) = 0.\) This is a stronger condition than requiring that \(U\) and \(X\) are uncorrelated. A common way to break this assumption is omitted variable bias.
No Perfect Multicollinearity. The covariates need to be linearly independent. For example, the below regression drops age because it is a linear function of experience. But, experience squared is allowed because it is not a linear function of experience.
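A call along the following lines (a sketch matching the output below) adds age to the Mincer specification:

lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + age, data = gss) %>% summary()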
Call:
lm(formula = log(conrinc) ~ educ + exp + I(exp^2) + age, data = gss)
Residuals:
Min 1Q Median 3Q Max
-5.3915 -0.4422 0.1464 0.6327 3.5533
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.143e+00 1.268e-01 56.32 <2e-16 ***
educ 1.443e-01 7.551e-03 19.11 <2e-16 ***
exp 7.815e-02 5.146e-03 15.19 <2e-16 ***
I(exp^2) -1.254e-03 9.171e-05 -13.67 <2e-16 ***
age NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9974 on 2236 degrees of freedom
(1909 observations deleted due to missingness)
Multiple R-squared: 0.2156, Adjusted R-squared: 0.2145
F-statistic: 204.9 on 3 and 2236 DF, p-value: < 2.2e-16
Homoscedasticity. The variance of the error term is constant and finite.
15.2 Categorical Outcome Variables
Above, we always assumed that \(Y\) was continuous. If \(Y\) is discrete, then we want to consider alternative specifications. We can certainly use the variable as is. However, the interpretation may not capture the goal of the analysis.
# Diabetes
table(gss$diabetes, useNA = "always")
1 2 <NA>
293 2060 1796
# Interpreting variable as if it were numeric
lm(diabetes ~ educ, data = gss) %>% summary()
Call:
lm(formula = diabetes ~ educ, data = gss)
Residuals:
Min 1Q Median 3Q Max
-0.8756 0.1246 0.1248 0.1251 0.1260
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.874e+00 3.611e-02 51.893 <2e-16 ***
educ 8.692e-05 2.433e-03 0.036 0.972
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3307 on 2336 degrees of freedom
(1811 observations deleted due to missingness)
Multiple R-squared: 5.465e-07, Adjusted R-squared: -0.0004275
F-statistic: 0.001277 on 1 and 2336 DF, p-value: 0.9715
For the case of a binary \(Y\) variable (it only takes two values), common specifications are probit and logit. Probit uses the cumulative distribution function (CDF) of the Normal distribution
\[\begin{equation*}
\mathbb{E}(Y \vert X = x) = \Pr(Y = 1 \vert X = x) = \Phi(\beta_0 + \beta_1 X),
\end{equation*}\]
where \(\Phi(x) = \Pr(Z \leq x)\) for \(Z\) distributed according to the standard Normal distribution (mean is 0 and standard deviation is 1). We can use glm() to estimate probit regressions.
# Transform the variable to be 0/1 instead of 1/2
gss <- gss %>% mutate(diabetes_binary = diabetes - 1)
table(gss$diabetes_binary)
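A minimal sketch of both specifications with glm(), using the recoded outcome from the chunk above; the family argument sets the link function:

# Probit: Pr(Y = 1 | X) modeled with the standard Normal CDF
glm(diabetes_binary ~ educ, data = gss, family = binomial(link = "probit")) %>% summary()
# Logit: Pr(Y = 1 | X) modeled with the logistic CDF
glm(diabetes_binary ~ educ, data = gss, family = binomial(link = "logit")) %>% summary()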
For the logit specification, the coefficient \(\hat{\beta}_1\) is the change in the log odds of \(Y = 1\) from a one-unit increase in \(X\). Another way to put this is that if \(X\) increases by one unit, the odds are multiplied by \(\exp(\hat{\beta}_1)\).
These are two examples of regression models with a binary outcome variable. If \(Y\) is discrete but takes more than two values, then another model is needed. Multinomial logit is popular. In R, it is convenient to use the package nnet.
library(nnet)
# Would you say that in general your health is Excellent, Very good, Good, Fair, or Poor?
table(gss$health1, useNA = "always")
1 2 3 4 5 <NA>
383 711 848 385 36 1786
multinom(health1 ~ educ, data = gss) %>% summary()
# weights: 15 (8 variable)
initial value 3777.350780
iter 10 value 3222.809202
final value 3221.902559
converged
Here we can interpret the coefficients as follows. We are comparing the log odds of each value of \(Y\) relative to a baseline value of \(Y\). In this example, the baseline is \(HEALTH1 = 1\), which corresponds to excellent health. You can convert the variable to a factor and use relevel() to specify a different baseline category. The intercept and the coefficient on \(X\) depend on the level of \(Y\).
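As a sketch, suppose we want good health (\(HEALTH1 = 3\)) as the baseline; the variable name health_factor and the choice of reference level are illustrative:

gss <- gss %>% mutate(health_factor = relevel(as.factor(health1), ref = "3"))
multinom(health_factor ~ educ, data = gss) %>% summary()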
15.3 The Goal of Program Evaluation
The goal of program evaluation is to study the effectiveness of interventions. Decision makers, including policymakers and industry leaders, need to know to what extent (possibly costly) interventions will have their intended effects.
In public policy and economics, emblematic questions in program evaluation include:
Do programs that transfer cash to individuals reduce poverty?
Does increasing the minimum wage affect employment?
Does health insurance improve health outcomes?
In industry, questions requiring causal analysis might be:
Does changing the price of one item affect the overall amount customers spend?
How does changing an interface affect user engagement?
Do personalized discounts or recommendations result in additional spending?
To consider these questions, we cannot rely solely on basic regression analysis.
15.3.1 Causality vs. Correlation
Regression gets at correlation, not at causality. While this may be intuitive, we can review the fundamental concepts to have a more complete understanding of why regression may be inappropriate for drawing certain conclusions from the data.
15.3.1.1 Causality
In everyday situations, there can be a clear understanding of cause and effect. If I enter my door code, that causes the front door of my house to unlock. There is no question here about what is the cause and what is the effect. Decision makers are interested in understanding causality when it is not so obvious.
Causality refers to effects within a causal model, holding other conditions the same (ceteris paribus). A causal model contains the following elements:
Variables determined inside the model (\(Y\)). These are called outcome or dependent variables.
Variables determined outside the model \((X, U)\). These are called covariates, regressors, or independent variables.
Functional relationships between \((X, U)\) and \(Y\). This can be written generally as \(Y = g(X,U)\).
Causality refers to some inherent relationship of cause \((X, U)\) and effect \(Y\).
15.3.1.2 Correlation
Correlation refers to a statistical relationship between \(X\) and \(Y\). Mathematically, it is
\[\begin{equation*}
\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y},
\end{equation*}\]
where:
\(\text{Cov}(X, Y)\) is the covariance between \(X\) and \(Y\). It is equal to \(\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]\), where \(\mu_X\) and \(\mu_Y\) are the means of \(X\) and \(Y\). The covariance measures how \(X\) varies with \(Y\) and how \(Y\) varies with \(X\).
\(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\).
Because the correlation is the covariance divided by the standard deviations, it can be thought of as a rescaled covariance. There are some important properties of correlation.
Correlation is symmetric. This is one clear reason why the concepts of correlation and causality are separate.
The sign of the correlation is equal to the sign of the covariance.
The correlation is between \(-1\) and \(1\). This helps give a sense of the direction and magnitude of the linear relationship between \(X\) and \(Y\).
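As a quick sketch in R, using the gss data loaded above, the sample correlation between education and income can be computed with cor(), dropping missing values:

cor(gss$educ, gss$conrinc, use = "complete.obs")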
15.4 Activity: Practice with Regression
This activity will help you practice regression in R. Write your code in a .R file and upload it to Canvas at the end of class for participation points in today’s class. You can write answers to the questions using comments. Do not worry about finishing it; just get as far as you can. You will be graded for completion.
Go to Canvas > Modules > Module 1 > Data. Download GSS2022.dta and the codebook. Save them to a convenient folder.
What format is the data in? Load the data into R. Hint: Use the haven package.
In this activity, you will be looking at the variables vote20 and vote16. Search these variables in the codebook until you get to the part that shows what the values represent.
Explore these variables. What is the voting turn-out in the GSS for the years 2016 and 2020? Using Google, how do they compare to the general population turn-out? What might this say about the GSS sample?
Create two new variables vote_binary20 and vote_binary16. They should take the value TRUE if the individual voted and FALSE if they did not vote. If the individual is ineligible, the value should be NA. Using the command table(), check that your new variables are correctly defined.
Suppose you are interested in understanding how different characteristics are correlated with voting in the 2020 presidential election. Check the documentation and run summary statistics on the following variables about basic demographic characteristics: conrinc, educ, hispanic, race, marital, age, sex, born, partyid. Run a linear regression using these variables as predictors for voting in 2020. Make sure you use the variable you created in question 5.
Run probit and logit regressions using these variables as predictors for voting in 2020.
Based on 6 and 7, do you find anything interesting or surprising?
Do your own exploration of the codebook and look for other variables you think could be useful to predict voter turn-out for the 2020 presidential election. Feel free to keep or drop the variables from question 6. For each variable you use, make sure you first check the basic summary statistics of the variable.
Did you find anything interesting or surprising? If someone were interested in predicting voter turn-out in 2024 and 2028, what other variables might they need that are not included in this dataset?
15.5 References
The GSS extract comes from NORC. See Chatterjee and Hadi (2006) for an accessible background on regression. The notes on causality draw from Matt Masten’s Identification and Causality notes.