21  Machine Learning Basics

Machine learning (ML) is a class of statistical tools that help analysts find patterns in data. At its core, ML is about prediction problems. Many ML methods are modern, with new developments appearing every day, but the fundamental components have been around for a long time and can now be applied widely thanks to the greater availability of large datasets and computing power. Academic researchers use ML, but it is also common in the public and private sectors, and it is a core part of recent advances in AI tools. We will go through some ML basics and see how to implement them in R. You will get practice doing so with an activity.

There are two broad categories of ML: supervised learning, where we predict an observed outcome, and unsupervised learning, where there is no outcome variable and we instead look for structure in the data.

21.1 Supervised Learning

21.1.1 Linear Regression

You are already familiar with a method that can be used for prediction. Previously, we used linear regression to understand the relationship between the outcome variable (\(Y\)) and covariates (\(X_1, X_2, \ldots\)). Now, instead of focusing on interpreting the coefficients, we focus on minimizing prediction error. Specifically, linear regression minimizes the mean squared error (MSE) between the predicted and observed outcomes. If the fitted value is \(\hat{Y}_i\), the MSE is

\[\begin{equation*} \text{MSE} = \frac{1}{N} \sum_{i =1}^N (Y_i - \hat{Y}_i)^2. \end{equation*}\]

Consider two datasets. One is the training dataset: we use it to estimate the linear regression and obtain the estimated coefficients. In the testing dataset, we predict the outcome using the coefficients estimated from the training dataset. Because we also observe the outcome in the testing dataset, we can compute the squared prediction error. Suppose we have one dataset that we partition into a training dataset \((i = 1, 2, \ldots, N)\) and a testing dataset \((i = N+1, N+2, \ldots, M)\).

Let us see how to implement this in R. We will use a dataset from the ggplot2 package called diamonds. This is a very clean dataset with the price and attributes of diamonds. We will try to predict the price of diamonds based on their attributes. You can think of this exercise as extending to any situation where you want to predict prices.

library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(magrittr)

data(diamonds)
summary(diamonds)
     carat               cut        color        clarity          depth      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
                                    J: 2808   (Other): 2531                  
     table           price             x                y         
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
                                                                  
       z         
 Min.   : 0.000  
 1st Qu.: 2.910  
 Median : 3.530  
 Mean   : 3.539  
 3rd Qu.: 4.040  
 Max.   :31.800  
                 
# Make the factors unordered (makes the regression easier)
diamonds <- diamonds %>%
  mutate(across(c(cut, color, clarity), ~ factor(.x, ordered = FALSE)))

Let us first define the training and testing datasets. We can randomly divide the data into the two datasets. We use 70% for the training dataset because we have a really big sample here.

set.seed(470500)

# Create an ID variable
diamonds <- diamonds %>%
  mutate(id = 1:nrow(diamonds))

# Randomly select 70% to the training dataset
train_ids <- sample(diamonds$id, size = 0.7 * nrow(diamonds))

train <- diamonds[(diamonds$id %in% train_ids), ]
test <- diamonds[!(diamonds$id %in% train_ids), ]

Now, we can use the training dataset to estimate the regression of price on characteristics.

train_lm <- lm(price ~ carat + cut + color + 
                 clarity + depth + table + x + y + z,
               data = train)

summary(train_lm)

Call:
lm(formula = price ~ carat + cut + color + clarity + depth + 
    table + x + y + z, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-21376.0   -592.4   -183.5    376.4  10694.2 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2184.477    408.197   5.352 8.76e-08 ***
carat        11256.978     48.628 231.494  < 2e-16 ***
cutGood        579.751     33.592  17.259  < 2e-16 ***
cutVery Good   726.783     32.241  22.542  < 2e-16 ***
cutPremium     762.144     32.228  23.649  < 2e-16 ***
cutIdeal       832.912     33.407  24.932  < 2e-16 ***
colorE        -209.118     17.893 -11.687  < 2e-16 ***
colorF        -272.854     18.093 -15.081  < 2e-16 ***
colorG        -482.039     17.716 -27.209  < 2e-16 ***
colorH        -980.267     18.836 -52.043  < 2e-16 ***
colorI       -1466.244     21.162 -69.286  < 2e-16 ***
colorJ       -2369.398     26.131 -90.674  < 2e-16 ***
claritySI2    2702.586     43.818  61.677  < 2e-16 ***
claritySI1    3665.472     43.634  84.005  < 2e-16 ***
clarityVS2    4267.224     43.853  97.306  < 2e-16 ***
clarityVS1    4578.398     44.546 102.779  < 2e-16 ***
clarityVVS2   4950.814     45.855 107.967  < 2e-16 ***
clarityVVS1   5007.759     47.160 106.187  < 2e-16 ***
clarityIF     5345.102     51.024 104.757  < 2e-16 ***
depth          -63.806      4.535 -14.071  < 2e-16 ***
table          -26.474      2.912  -9.092  < 2e-16 ***
x            -1008.261     32.898 -30.648  < 2e-16 ***
y                9.609     19.333   0.497    0.619    
z              -50.119     33.486  -1.497    0.134    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared:  0.9198,    Adjusted R-squared:  0.9198 
F-statistic: 2.688e+04 on 23 and 53916 DF,  p-value: < 2.2e-16

To test our model, we can use predict() to predict the price in the testing data with our estimated coefficients.

test_pred <- predict(train_lm, newdata = test)
summary(test_pred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -4308    1089    2819    3937    5879   39394 
(test$price - test_pred)^2 %>%
  mean()
[1] 1261473

The MSE on its own does not tell us much; it becomes informative when we compare it to the MSE of another model.

train_lm2 <- lm(price ~ carat + cut + color + clarity,
               data = train)

summary(train_lm2)

Call:
lm(formula = price ~ carat + cut + color + clarity, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-16813.5   -680.4   -197.6    466.4  10394.9 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -7362.80      51.68 -142.46   <2e-16 ***
carat         8886.13      12.03  738.44   <2e-16 ***
cutGood        655.77      33.63   19.50   <2e-16 ***
cutVery Good   848.72      31.28   27.14   <2e-16 ***
cutPremium     869.40      30.93   28.11   <2e-16 ***
cutIdeal       998.25      30.66   32.56   <2e-16 ***
colorE        -211.68      18.32  -11.56   <2e-16 ***
colorF        -303.31      18.51  -16.39   <2e-16 ***
colorG        -506.20      18.12  -27.93   <2e-16 ***
colorH        -978.70      19.27  -50.78   <2e-16 ***
colorI       -1440.30      21.65  -66.54   <2e-16 ***
colorJ       -2325.22      26.72  -87.01   <2e-16 ***
claritySI2    2625.95      44.79   58.63   <2e-16 ***
claritySI1    3573.69      44.60   80.13   <2e-16 ***
clarityVS2    4217.83      44.84   94.06   <2e-16 ***
clarityVS1    4534.88      45.54   99.59   <2e-16 ***
clarityVVS2   4967.20      46.89  105.93   <2e-16 ***
clarityVVS1   5072.03      48.21  105.20   <2e-16 ***
clarityIF     5419.65      52.14  103.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1157 on 53921 degrees of freedom
Multiple R-squared:  0.9159,    Adjusted R-squared:  0.9159 
F-statistic: 3.264e+04 on 18 and 53921 DF,  p-value: < 2.2e-16
test_pred2 <- predict(train_lm2, newdata = test)

(test$price - test_pred2)^2 %>%
  mean()
[1] 1328692

Comparing the MSE between these two models, the first model has better predictive power. But what about other models? What about all the possible combinations of covariates? What about adding polynomials of those covariates? ML methods can help with that too (see the LASSO section below).
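Before turning to those methods, here is a minimal sketch of what comparing specifications by hand looks like. The candidate formulas below are purely illustrative, not recommendations:

# Compare test MSE across a few hand-picked specifications (illustrative only)
candidates <- list(
  m1 = price ~ carat,
  m2 = price ~ carat + cut + color + clarity,
  m3 = price ~ poly(carat, 2) + cut + color + clarity + depth + table
)

sapply(candidates, function(f) {
  fit <- lm(f, data = train)
  mean((test$price - predict(fit, newdata = test))^2)
})

Hand-picking specifications like this quickly becomes impractical as the number of candidate predictors grows, which is exactly where model-selection methods like LASSO come in.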

21.1.2 Logistic Regression

We can apply the same logic when the outcome is binary, but we use logistic regression. The model predicts the probability of “success,” that is, the probability that the outcome equals one. Suppose we want to predict whether a diamond has a price below someone’s budget of $1,000.

train <- train %>%
  mutate(inbudget = (price <= 1000))

test <- test %>%
  mutate(inbudget = (price <= 1000))


# Train
ib_train_lm <- glm(inbudget ~ carat + cut + color + clarity + depth + table + price + x + y + z,
    data = train,
    family = "binomial")
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(ib_train_lm)

Call:
glm(formula = inbudget ~ carat + cut + color + clarity + depth + 
    table + price + x + y + z, family = "binomial", data = train)

Coefficients:
               Estimate Std. Error    z value Pr(>|z|)    
(Intercept)   1.199e+16  2.871e+07  417678354   <2e-16 ***
carat        -1.145e+15  4.777e+06 -239754796   <2e-16 ***
cutGood       2.742e+13  2.394e+06   11454293   <2e-16 ***
cutVery Good -1.798e+13  2.304e+06   -7801079   <2e-16 ***
cutPremium   -1.965e+14  2.307e+06  -85191457   <2e-16 ***
cutIdeal     -1.899e+14  2.388e+06  -79502937   <2e-16 ***
colorE        1.142e+14  1.271e+06   89862090   <2e-16 ***
colorF        2.088e+14  1.282e+06  162806470   <2e-16 ***
colorG        3.120e+14  1.266e+06  246437038   <2e-16 ***
colorH        4.192e+14  1.367e+06  306732002   <2e-16 ***
colorI        6.511e+14  1.563e+06  416413920   <2e-16 ***
colorJ        7.919e+14  1.998e+06  396405362   <2e-16 ***
claritySI2   -8.918e+14  3.187e+06 -279870015   <2e-16 ***
claritySI1   -1.073e+15  3.264e+06 -328657932   <2e-16 ***
clarityVS2   -1.340e+15  3.346e+06 -400416873   <2e-16 ***
clarityVS1   -1.424e+15  3.429e+06 -415227268   <2e-16 ***
clarityVVS2  -1.591e+15  3.565e+06 -446331901   <2e-16 ***
clarityVVS1  -1.911e+15  3.653e+06 -523092303   <2e-16 ***
clarityIF    -2.321e+15  3.951e+06 -587473604   <2e-16 ***
depth        -3.769e+13  3.159e+05 -119307594   <2e-16 ***
table        -1.878e+13  2.080e+05  -90303057   <2e-16 ***
price         2.039e+11  3.049e+02  668533089   <2e-16 ***
x            -1.344e+15  2.194e+06 -612489770   <2e-16 ***
y             6.368e+13  1.294e+06   49228723   <2e-16 ***
z            -3.354e+14  2.044e+06 -164050373   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  44034  on 37757  degrees of freedom
Residual deviance: 135524  on 37733  degrees of freedom
AIC: 135574

Number of Fisher Scoring iterations: 25
# Predict in testing data
ib_test_pred <- predict(ib_train_lm, newdata = test, type = "response")

summary(ib_test_pred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.2444  0.0000  1.0000 
ib_test_pred_binary <- (ib_test_pred > 0.5)

# Confusion matrix 
table(observed = test$inbudget, predicted = ib_test_pred_binary)
        predicted
observed FALSE  TRUE
   FALSE 11634   213
   TRUE    593  3742
# Accuracy
mean(test$inbudget == ib_test_pred_binary)
[1] 0.9501916
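
The glm.fit warnings above arise from perfect separation: inbudget is defined directly from price, and price is included as a predictor, so the model can classify the training data perfectly and the fitted probabilities collapse to 0 or 1. As a minimal sketch (an alternative specification, assuming we only want to predict from the diamond’s physical characteristics), we could drop price from the predictors:

# Same exercise without price among the predictors (avoids the separation warnings)
ib_train_lm2 <- glm(inbudget ~ carat + cut + color + clarity + 
                      depth + table + x + y + z,
                    data = train,
                    family = "binomial")

ib_test_pred2 <- predict(ib_train_lm2, newdata = test, type = "response")

# Accuracy using the 0.5 threshold, comparable to the accuracy above
mean(test$inbudget == (ib_test_pred2 > 0.5))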

21.1.3 LASSO

Following this intuition, ML methods can help us find a model (what to put on the right-hand side) for the best prediction (what is on the left-hand side). We will go through the workflow of LASSO, which is one method you can use.

  1. Split the cleaned data into training and testing. We can use the tidymodels library to split the data into training and testing datasets.
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.8     ✔ rsample      1.3.0
✔ dials        1.4.0     ✔ tibble       3.2.1
✔ infer        1.0.8     ✔ tidyr        1.3.1
✔ modeldata    1.4.0     ✔ tune         1.3.0
✔ parsnip      1.3.1     ✔ workflows    1.2.0
✔ purrr        1.0.4     ✔ workflowsets 1.1.0
✔ recipes      1.2.1     ✔ yardstick    1.3.2
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard()   masks scales::discard()
✖ tidyr::extract()   masks magrittr::extract()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
✖ recipes::step()    masks stats::step()
d_split <- initial_split(diamonds) # Creates the split (default is 75/25)

d_train <- training(d_split) # Define training data based on split
d_test <- testing(d_split) # Define testing data based on split
  2. Build a “recipe.” This tells R which variable is the outcome and helps format the possible predictors.
recipe(price ~ ., data = d_train) %>%
  summary()
# A tibble: 11 × 4
   variable type      role      source  
   <chr>    <list>    <chr>     <chr>   
 1 carat    <chr [2]> predictor original
 2 cut      <chr [3]> predictor original
 3 color    <chr [3]> predictor original
 4 clarity  <chr [3]> predictor original
 5 depth    <chr [2]> predictor original
 6 table    <chr [2]> predictor original
 7 x        <chr [2]> predictor original
 8 y        <chr [2]> predictor original
 9 z        <chr [2]> predictor original
10 id       <chr [2]> predictor original
11 price    <chr [2]> outcome   original
# Account for ID variable and factor variables

d_rec <- recipe(price ~ ., data = d_train) %>%
  step_dummy(all_nominal_predictors())  %>% 
  update_role(id, new_role = "ID") 

summary(d_rec)
# A tibble: 11 × 4
   variable type      role      source  
   <chr>    <list>    <chr>     <chr>   
 1 carat    <chr [2]> predictor original
 2 cut      <chr [3]> predictor original
 3 color    <chr [3]> predictor original
 4 clarity  <chr [3]> predictor original
 5 depth    <chr [2]> predictor original
 6 table    <chr [2]> predictor original
 7 x        <chr [2]> predictor original
 8 y        <chr [2]> predictor original
 9 z        <chr [2]> predictor original
10 id       <chr [2]> ID        original
11 price    <chr [2]> outcome   original
  3. “Prep” the recipe. Given the recipe, R then puts together the data inputs. You can explore the outputs of recipe() and prep() to better understand these steps.
d_prep <- d_rec %>%
  prep()

summary(d_rec)
# A tibble: 11 × 4
   variable type      role      source  
   <chr>    <list>    <chr>     <chr>   
 1 carat    <chr [2]> predictor original
 2 cut      <chr [3]> predictor original
 3 color    <chr [3]> predictor original
 4 clarity  <chr [3]> predictor original
 5 depth    <chr [2]> predictor original
 6 table    <chr [2]> predictor original
 7 x        <chr [2]> predictor original
 8 y        <chr [2]> predictor original
 9 z        <chr [2]> predictor original
10 id       <chr [2]> ID        original
11 price    <chr [2]> outcome   original
  4. Specify the parameters of the model. The basic idea of model selection is that we have an objective function with a penalty term. The penalty grows as more predictors enter the model with nonzero coefficients, which prevents us from just using every predictor available. The question, then, is what parameters we choose for our objective function. Here is the general format for LASSO:

\[\begin{equation} \arg\min_{\beta} \sum_{i = 1}^N (Y_i - \beta ' X_i)^2 + \lambda \sum_{k=1}^K \vert \beta_k \vert \end{equation}\]

We can dissect this term by term. The first term in Equation (1) is the usual OLS objective, the sum of squared residuals. The second term is the penalty. The parameter \(\lambda\), often called the penalty parameter, is the first thing we need to set. Larger values of \(\lambda\) result in a more parsimonious model (fewer predictors) because the penalty is larger. If \(\lambda\) is zero, then the model is the same as OLS. In practice, candidate values of \(\lambda\) can cover a large range, and cross-validation offers a more principled way to select \(\lambda\) than simply fixing it (see the sketch in the Other Tools section below). The second choice the analyst makes is the function applied to the parameters \(\beta\). Suppose there are \(K\) possible predictors. LASSO penalizes the sum of the absolute values of the parameters; other methods, like Ridge regression (not covered in these notes), use other functions instead of the absolute value (e.g., the square of the parameters).

In the code below, the penalty argument is \(\lambda\) and the mixture argument set to 1 means that we are using LASSO. We use set_engine() to tell R which algorithm to use. Because we are using LASSO, we use “glmnet.” This also works for Ridge or similar methods. Other engines include “lm” for OLS or “keras” for Neural Nets.

d_lasso <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")
  5. Create the workflow. This puts together the model and the recipe.
d_wf <- workflow() %>%
  add_recipe(d_rec)
  6. Fit the model and investigate the coefficients. If a coefficient is 0, LASSO dropped that predictor because it did not add sufficient predictive power. Nonzero coefficients are read in the same units as in standard linear regression, although LASSO shrinks them toward zero.
# Run LASSO!
d_fit <- d_wf %>%
  add_model(d_lasso) %>%
  fit(data = d_train)

# Check out coefficients
d_fit %>%
  extract_fit_parsnip() %>%
  tidy() 

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-8
# A tibble: 24 × 3
   term          estimate penalty
   <chr>            <dbl>   <dbl>
 1 (Intercept)     2338.      0.1
 2 carat          11015.      0.1
 3 depth            -64.7     0.1
 4 table            -28.5     0.1
 5 x               -912.      0.1
 6 y                  0       0.1
 7 z                -45.9     0.1
 8 cut_Good         537.      0.1
 9 cut_Very.Good    690.      0.1
10 cut_Premium      712.      0.1
# ℹ 14 more rows
  7. Apply the model to the testing data and calculate the MSE (or another metric). If we were to use other models or tuning parameters, this MSE could be used to compare them.
# Apply recipe and model to test data
d_results <- predict(d_fit, new_data = d_test) %>%
  bind_cols(d_test)

# Assess MSE
mean((d_results$.pred - d_results$price)^2)
[1] 1263936

Here is a visualization of how the coefficients change with different values of the penalty parameter \(\lambda\).

# Raw glmnet object
glmnet_fit <- d_fit %>%
  extract_fit_engine() 

# Clean coefficient paths
# broom provides tidy() for glmnet objects
library(broom)
coefs_long <- tidy(glmnet_fit, return_zeros = TRUE, matrix = "beta")  

# Plot
ggplot(coefs_long, aes(x = log(lambda), y = estimate, color = term)) +
  geom_line(alpha = 0.8) +
  labs(title = "LASSO Coefficient Paths (Workflow Model)",
       x = "log(Lambda)",
       y = "Coefficient Value") +
  theme_minimal() +
  theme(legend.position = "bottom")

Line plot showing the paths of LASSO regression coefficients for different terms as log(Lambda) increases. Each term is represented by a colored line. The x-axis shows log(Lambda); the y-axis shows the coefficient value. The plot title is `LASSO Coefficient Paths (Workflow Model)`. A legend for the terms appears at the bottom.

21.1.4 Other Tools

Decision trees and the more general random forest methods provide other ways to predict. There are various packages to implement these methods in R, including rpart. One issue that arises in all of these methods is overfitting: the model performs very well in the training data but not so well in the testing data. Cross-validation is a tool that addresses this issue. The basic idea is that instead of a single split into training and testing data, the data is repeatedly divided into multiple subsets (or “folds”), so that each part of the data is used for both training and testing at different stages. For example, in 5-fold cross-validation, the data is split into five parts: the model is trained on four and tested on the fifth, and this process repeats five times. The average performance across all folds gives a more reliable estimate of how the model is likely to perform on new, unseen data. Sketches of both ideas follow.
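
As a minimal sketch (reusing the train/test split from the linear regression section), a regression tree from rpart can be fit and evaluated with the same test-MSE logic as before:

library(rpart)

# Fit a regression tree on the training data (default tuning settings)
tree_fit <- rpart(price ~ carat + cut + color + clarity + depth + table + x + y + z,
                  data = train)

# Test MSE, comparable to the linear models above
tree_pred <- predict(tree_fit, newdata = test)
mean((test$price - tree_pred)^2)

And here is a minimal sketch of 5-fold cross-validation to choose the LASSO penalty with tidymodels, reusing the d_rec recipe and d_train data from the workflow above; the grid of candidate \(\lambda\) values is an illustrative assumption, not a recommendation:

# Five folds of the training data
folds <- vfold_cv(d_train, v = 5)

# Same LASSO model, but with the penalty left to be tuned
lasso_tune <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Candidate penalty values on the log10 scale (illustrative range)
lambda_grid <- grid_regular(penalty(range = c(-2, 4)), levels = 20)

# Fit the model on each fold for each candidate penalty
tuned <- workflow() %>%
  add_recipe(d_rec) %>%
  add_model(lasso_tune) %>%
  tune_grid(resamples = folds, grid = lambda_grid)

# Penalty with the lowest cross-validated RMSE
select_best(tuned, metric = "rmse")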

21.2 Unsupervised Learning

A common application of unsupervised learning is “clustering” or “pattern recognition.” The most foundational approach is the k-means algorithm, which attempts to group observations into \(k\) clusters based on similarity. The intuition behind k-means is simple: the algorithm assigns each observation to the nearest cluster center, then updates the cluster centers to be the average of the observations assigned to them. This process repeats until the assignments stop changing. The result is a division of the data into groups that are internally similar but distinct from one another. While it is fast and widely used, it does require the user to specify the number of clusters in advance. K-means and similar tools are useful for summarizing large datasets, detecting outliers, or identifying natural groupings when no outcome variable is available.

# Load built-in iris dataset
data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
# Remove the species label for unsupervised learning
iris_data <- iris[, 1:4]

# Run k-means with 3 clusters (we know there are 3 species)
set.seed(470500)
iris_kmeans <- kmeans(iris_data, centers = 3, nstart = 20)

# Add the cluster labels to the original data
iris_clustered <- iris %>%
  mutate(cluster = as.factor(iris_kmeans$cluster))

# Plot the clusters (e.g., using petal length vs. width)
library(ggplot2)
ggplot(iris_clustered, aes(x = Petal.Length, y = Petal.Width, color = cluster)) +
  geom_point() 

Scatter plot of iris dataset clustered with k-means (3 clusters). Points are plotted by petal length (x-axis) and petal width (y-axis). Each point is colored by its cluster label. No species labels are shown.
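
Because k-means requires choosing the number of clusters in advance, one informal check (a minimal sketch, not a formal selection rule) is to compare the total within-cluster sum of squares across different values of \(k\) and look for an “elbow” where adding clusters stops helping much:

# Total within-cluster sum of squares for k = 1, ..., 8
set.seed(470500)
wss <- sapply(1:8, function(k) {
  kmeans(iris_data, centers = k, nstart = 20)$tot.withinss
})

# Plot the within-cluster sum of squares against k
ggplot(data.frame(k = 1:8, wss = wss), aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of clusters (k)", y = "Total within-cluster sum of squares")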

21.3 Further Reading

There are many great resources online to learn ML methods. Athey and Imbens (2019) provide a nice overview of the classes of methods. Text analysis is another useful application of machine learning; you can use the tidytext or tm libraries for that, as sketched below.
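
As a minimal sketch (with made-up example sentences, purely to show the mechanics), tidytext turns raw text into a tidy table of tokens that can then be counted or modeled:

library(tidytext)

# Two made-up documents
docs <- tibble(id = 1:2,
               text = c("Diamonds are a classic example dataset.",
                        "Machine learning helps with prediction problems."))

# One row per word, then count word frequencies
docs %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)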

Athey, Susan, and Guido W. Imbens. 2019. “Machine Learning Methods That Economists Should Know About.” Annual Review of Economics 11 (1): 685–725. https://doi.org/10.1146/annurev-economics-080217-053433.