10 Summary Statistics

This chapter covers statistics useful to understand and describe data that are already pre-processed. At the end of this chapter is an interactive exercise to write a function to output a table of descriptive statistics.

Here are all the libraries you should install for this chapter.

library(dplyr)
library(ggplot2)
library(purrr)
library(readr)
library(stargazer)
library(stringr)
library(tidyr)

10.1 Built-in Functions

Again, we will use the dataset gapminder_large.csv which contains measures of development, environment, and society from the countries in the world. GDP and the Gini Index are measured in 2015. The Corruption Perception Index (cpi) is measured between 2012 and 2017. A higher score means less corruption. Life expectancy (lifeexp) is measured between 2012 and 2018. It is measured in years and is the average number of years a newborn would live holding constant contemporaneous mortality patterns. C02 emissions (co2) is measured between 2015 and 2018. The units are metric tonnes of CO2 per person.

df <- read.csv("Data/Gapminder/gapminder_large.csv")

First, we want to get a sense of the data. How many observations are there? How many variables? What are the names of the variables and what classes are they?

head(df) # Display the first 6 rows

              country gdp_2015 gini_2015              region co2_2015 co2_2016
1         Afghanistan      574      36.8      Asia & Pacific    0.262    0.245
2             Albania     4520      29.0              Europe    1.600    1.570
3             Algeria     4780      27.6         Arab States    3.800    3.640
4             Andorra    42100      40.0              Europe    5.970    6.070
5              Angola     3750      42.6              Africa    1.220    1.180
6 Antigua and Barbuda    13300      40.0 South/Latin America    5.840    5.900
  co2_2017 co2_2018 cpi_2012 cpi_2013 cpi_2014 cpi_2015 cpi_2016 cpi_2017
1    0.247    0.254        8        8       12       11       15       15
2    1.610    1.590       33       31       33       36       39       38
3    3.560    3.690       34       36       36       36       34       33
4    6.270    6.120       NA       NA       NA       NA       NA       NA
5    1.140    1.120       22       23       19       15       18       19
6    5.890    5.880       NA       NA       NA       NA       NA       NA
  lifeexp_2012 lifeexp_2013 lifeexp_2014 lifeexp_2015 lifeexp_2016 lifeexp_2017
1         60.8         61.3         61.2         61.2         61.2         63.4
2         77.8         77.9         77.9         78.0         78.1         78.2
3         76.8         76.9         77.0         77.1         77.4         77.7
4         82.4         82.5         82.5         82.6         82.7         82.7
5         61.3         61.9         62.8         63.3         63.8         64.2
6         76.7         76.8         76.8         76.9         77.0         77.0
  lifeexp_2018
1         63.7
2         78.3
3         77.9
4           NA
5         64.6
6         77.2

dim(df) # Confirm the number of rows and columns

[1] 195  21

names(df) # List the variable names

 [1] "country"      "gdp_2015"     "gini_2015"    "region"       "co2_2015"    
 [6] "co2_2016"     "co2_2017"     "co2_2018"     "cpi_2012"     "cpi_2013"    
[11] "cpi_2014"     "cpi_2015"     "cpi_2016"     "cpi_2017"     "lifeexp_2012"
[16] "lifeexp_2013" "lifeexp_2014" "lifeexp_2015" "lifeexp_2016" "lifeexp_2017"
[21] "lifeexp_2018"

sapply(df, typeof)

     country     gdp_2015    gini_2015       region     co2_2015     co2_2016 
 "character"    "integer"     "double"  "character"     "double"     "double" 
    co2_2017     co2_2018     cpi_2012     cpi_2013     cpi_2014     cpi_2015 
    "double"     "double"    "integer"    "integer"    "integer"    "integer" 
    cpi_2016     cpi_2017 lifeexp_2012 lifeexp_2013 lifeexp_2014 lifeexp_2015 
   "integer"    "integer"     "double"     "double"     "double"     "double" 
lifeexp_2016 lifeexp_2017 lifeexp_2018 
    "double"     "double"     "double"

The function mean() calculates the arithmetic mean. Here is a simple demonstration of it with a vector of 50 draws from the N(0,1) distribution.

x <- rnorm(50, mean = 0, sd = 1)
mean(x)

[1] -0.134323

If there are any elements of the input that are NA, you must specify the argument na.rm = TRUE. Otherwise, the result will be NA.

mean(df$gdp_2015)

[1] NA

mean(df$gdp_2015, na.rm = TRUE)

[1] 14298.43

If you will only be using one data frame and do not want to repeatedly call variables using the format above, you can attach the data and then refer just to the variable name.

attach(df)
mean(gdp_2015, na.rm = TRUE)

[1] 14298.43

While this is convenient, it is not always clear to which data frame the variable belongs. Also, if any variables have the same names as functions, those functions will be masked. This chapter thus relies on data$colname format for clarity. Let us detach the dataset and continue.

detach(df)

To calculate the mean for more than one column, we can use apply-like functions. Here, we are calculating the mean of every column except country and region, which are string variables.

sapply(df[, -c(1, 4)], mean, na.rm = TRUE)

    gdp_2015    gini_2015     co2_2015     co2_2016     co2_2017     co2_2018 
14298.427807    38.932821     4.456147     4.423509     4.446485     4.455041 
    cpi_2012     cpi_2013     cpi_2014     cpi_2015     cpi_2016     cpi_2017 
   42.906977    42.306358    42.929825    42.339394    42.687861    42.790960 
lifeexp_2012 lifeexp_2013 lifeexp_2014 lifeexp_2015 lifeexp_2016 lifeexp_2017 
   71.309091    71.642781    71.867380    72.144385    72.448128    72.737433 
lifeexp_2018 
   72.969022

The median is calculated with a function that is very similar to the mean function.

median(x)

[1] -0.2529011

median(df$gini_2015, na.rm = TRUE)

[1] 39.1

The function quantile() allows you to calculate other percentiles. Without specifying the probabilities in the probs argument, the function automatically outputs the minimum and maximum values, and the 25th, 50th, and 75th percentiles.

quantile(x)

        0%        25%        50%        75%       100% 
-2.1427658 -0.7534393 -0.2529011  0.4379559  1.4941623

quantile(df$lifeexp_2015, probs = c(0.10, 0.90), na.rm = TRUE)

 10%  90% 
61.3 81.5

Here are some functions to calculate measures of dispersion. Note the importance of specifying na.rm = TRUE.

min(df$co2_2015, na.rm = TRUE)

[1] 0.0367

max(df$co2_2015, na.rm = TRUE)

[1] 41.3

range(df$co2_2015, na.rm = TRUE)

[1]  0.0367 41.3000

IQR(df$co2_2015, na.rm = TRUE)

[1] 5.18525

var(df$co2_2015, na.rm = TRUE) # Unbiased estimator

[1] 33.79678

sd(df$co2_2015, na.rm = TRUE)

[1] 5.813499

The function summary() is a fast way to calculate many summary statistics at once. There is no need to add the na.rm = TRUE argument, and the function actually counts the number of NA values, if there are any.

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.1428 -0.7534 -0.2529 -0.1343  0.4380  1.4942

summary(df$cpi_2015)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   8.00   28.00   37.00   42.34   54.00   91.00      30

The covariance and correlation coefficient are calculated using cov() and corr(). Specifying what to do with NA values is a little more complicated for these functions. The argument use determines the strategy more precisely. If use = "pairwise.complete.obs", then the covariance/correlation is only calculated for observations with two non-missing values. .

cov(df$gdp_2015, df$co2_2015)

[1] NA

cov(df$gdp_2015, df$co2_2015, use = "pairwise.complete.obs")

[1] 65913.78

cor(df$gdp_2015, df$co2_2015, use = "pairwise.complete.obs")

[1] 0.6068677

Data frames can be input into these functions, producing pairwise correlations.

cov(df[, c(2, 3, 5)], use = "pairwise.complete.obs")

              gdp_2015     gini_2015     co2_2015
gdp_2015  510161355.07 -42624.157967 65913.780200
gini_2015    -42624.16     54.232732    -7.687662
co2_2015      65913.78     -7.687662    33.796776

cor(df[, c(2, 3, 5)], use = "pairwise.complete.obs")

            gdp_2015  gini_2015   co2_2015
gdp_2015   1.0000000 -0.2532801  0.6068677
gini_2015 -0.2532801  1.0000000 -0.1782023
co2_2015   0.6068677 -0.1782023  1.0000000

The function t.test() performs a t-test. The arguments augment the details of the test, including the null and alternative hypotheses.

t.test(df$lifeexp_2012, mu = 72, alternative = "two.sided")


    One Sample t-test

data:  df$lifeexp_2012
t = -1.1733, df = 186, p-value = 0.2422
alternative hypothesis: true mean is not equal to 72
95 percent confidence interval:
 70.14742 72.47076
sample estimates:
mean of x 
 71.30909

t.test(df$lifeexp_2012, df$lifeexp_2017, paired = TRUE, var.equal = FALSE, conf.level = 0.90)


    Paired t-test

data:  df$lifeexp_2012 and df$lifeexp_2017
t = -12.386, df = 186, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
90 percent confidence interval:
 -1.618977 -1.237707
sample estimates:
mean difference 
      -1.428342

The function ks.test() performs the Kolmogorov-Smirnov Test to compare two distributions

ks.test(df[df$region == "Africa", "co2_2015"],
        df[df$region == "Middle east", "co2_2015"],
        alternative = "two.sided")


    Exact two-sample Kolmogorov-Smirnov test

data:  df[df$region == "Africa", "co2_2015"] and df[df$region == "Middle east", "co2_2015"]
D = 0.7803, p-value = 2.778e-06
alternative hypothesis: two-sided

10.1.1 Practice Exercises

Save the below code to an object. What is the data structure of this object? How can you extract information from this object?

t.test(df$lifeexp_2012, mu = 72, alternative = "two.sided")

10.2 `tidyverse` Functions

The advantages of dplyr functions and pipes are especially clear for producing summary statistics. We read in the data as a tibble.

tib <- read_csv("Data/Gapminder/gapminder_large.csv")

Rows: 195 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country, region
dbl (19): gdp_2015, gini_2015, co2_2015, co2_2016, co2_2017, co2_2018, cpi_2...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The function summarise() allows for many types of summary statistics. The output is itself a tibble. Here are examples naming the column of the output. Note that we need to specify na.rm = TRUE.

tib %>%
  summarise("Mean GDP 2015" = mean(gdp_2015, na.rm = TRUE))

# A tibble: 1 × 1
  `Mean GDP 2015`
            <dbl>
1          14298.

tib %>%
  summarise(MeanGDP2015 = mean(gdp_2015, na.rm = TRUE))

# A tibble: 1 × 1
  MeanGDP2015
        <dbl>
1      14298.

It is fine to refrain from naming the column. R automatically assigns the name based on the statistic.

tib %>%
  summarise(mean(gdp_2015, na.rm = TRUE))

# A tibble: 1 × 1
  `mean(gdp_2015, na.rm = TRUE)`
                           <dbl>
1                         14298.

It is possible to calculate many statistics at once.

tib %>%
  summarise(Median = median(gdp_2015, na.rm = TRUE),
            Variance = var(gdp_2015, na.rm =TRUE),
            SD = sd(gdp_2015, na.rm = TRUE),
            Minimum = min(gdp_2015, na.rm = TRUE),
            Maximum = max(gdp_2015, na.rm = TRUE),
            N = n())

# A tibble: 1 × 6
  Median   Variance     SD Minimum Maximum     N
   <dbl>      <dbl>  <dbl>   <dbl>   <dbl> <int>
1   5740 510161355. 22587.     228  190000   195

The function across() can be used inside summarise() and mutate(). In the first argument, specify the vector of column names or indices. In the second argument, specify the function(s) to apply. The format comes from the package purrr and allows you to specify the values of the other arguments of the function.

tib %>%
  summarise(across(c(2, 3, 5), ~ mean(.x, na.rm = TRUE)))

# A tibble: 1 × 3
  gdp_2015 gini_2015 co2_2015
     <dbl>     <dbl>    <dbl>
1   14298.      38.9     4.46

tib %>%
  summarise(across(starts_with("co2"), ~ median(.x, na.rm = TRUE)),
            across(starts_with("lifeexp"), ~ median(.x, na.rm = TRUE)))

# A tibble: 1 × 11
  co2_2015 co2_2016 co2_2017 co2_2018 lifeexp_2012 lifeexp_2013 lifeexp_2014
     <dbl>    <dbl>    <dbl>    <dbl>        <dbl>        <dbl>        <dbl>
1     2.48     2.48     2.50     2.53         73.2         73.1         73.1
# ℹ 4 more variables: lifeexp_2015 <dbl>, lifeexp_2016 <dbl>,
#   lifeexp_2017 <dbl>, lifeexp_2018 <dbl>

tib %>%
  summarise(across(c(2, 3, 5), list(mean = ~ mean(.x, na.rm = TRUE),
                                    median = ~ median(.x, na.rm = TRUE))))

# A tibble: 1 × 6
  gdp_2015_mean gdp_2015_median gini_2015_mean gini_2015_median co2_2015_mean
          <dbl>           <dbl>          <dbl>            <dbl>         <dbl>
1        14298.            5740           38.9             39.1          4.46
# ℹ 1 more variable: co2_2015_median <dbl>

Now that we are comfortable with summarise(), let’s add layers using the pipe operator. Adding group_by() beforehand allows for this.

tib %>%
  group_by(region) %>%
  summarise(MeanGINI = mean(gini_2015, na.rm = TRUE),
            N = n(),
            N_NA = sum(is.na(gini_2015)))

# A tibble: 7 × 4
  region              MeanGINI     N  N_NA
  <chr>                  <dbl> <int> <int>
1 Africa                  43.8    44     0
2 Arab States             38.3    10     0
3 Asia & Pacific          36.7    45     0
4 Europe                  32.8    49     0
5 Middle east             36.8    12     0
6 North America           36.5     2     0
7 South/Latin America     45.7    33     0

We can also filter to only focus on certain observations.

tib %>%
  filter(region %in% c("Africa", "Middle east")) %>%
  group_by(region) %>%
  summarise(Mean_Gini = mean(gini_2015),
            SD_Gini = sd(gini_2015))

# A tibble: 2 × 3
  region      Mean_Gini SD_Gini
  <chr>           <dbl>   <dbl>
1 Africa           43.8    7.85
2 Middle east      36.8    4.09

The data itself can be transformed in the pipe operations. Here, we are creating a variable that is then summarized.

tib %>%
  mutate(gini_rescaled = gini_2015/100) %>%
  group_by(region) %>%
  summarise(InterQuartileRange = IQR(gini_rescaled))

# A tibble: 7 × 2
  region              InterQuartileRange
  <chr>                            <dbl>
1 Africa                          0.0865
2 Arab States                     0.088 
3 Asia & Pacific                  0.065 
4 Europe                          0.0720
5 Middle east                     0.0677
6 North America                   0.0480
7 South/Latin America             0.067

10.3 Tables

10.3.1 Creating Tables with `stargazer`

The package stargazer provides a simple way to output summary statistics from data frames. The simplest way to use the stargazer() function is to input a data frame. By default, it will return the LaTeX code for a table with summary statistics for all numeric variables. The default statistics are the number of observations, the mean, the standard deviation, the minimum, the 25th percentile, the 75th percentile, and the maximum.

Stargazer will return a raw output to be copied into another file. The type will determine the formatting. The type latex is for use in PDF compilation of RMarkdown files or for LaTeX programming (outside the scope of this class). The type html can be used for HTML compilation of RMarkdown files.

stargazer(df, type = "latex")


% Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
% Date and time: Sun, Aug 31, 2025 - 08:52:26
\begin{table}[!htbp] \centering 
  \caption{} 
  \label{} 
\begin{tabular}{@{\extracolsep{5pt}}lccccc} 
\\[-1.8ex]\hline 
\hline \\[-1.8ex] 
Statistic & \multicolumn{1}{c}{N} & \multicolumn{1}{c}{Mean} & \multicolumn{1}{c}{St. Dev.} & \multicolumn{1}{c}{Min} & \multicolumn{1}{c}{Max} \\ 
\hline \\[-1.8ex] 
gdp\_2015 & 187 & 14,298.430 & 22,586.750 & 228 & 190,000 \\ 
gini\_2015 & 195 & 38.933 & 7.364 & 24.800 & 63.100 \\ 
co2\_2015 & 192 & 4.456 & 5.813 & 0.037 & 41.300 \\ 
co2\_2016 & 192 & 4.424 & 5.644 & 0.025 & 38.500 \\ 
co2\_2017 & 192 & 4.446 & 5.652 & 0.024 & 39.800 \\ 
co2\_2018 & 192 & 4.455 & 5.609 & 0.024 & 38.000 \\ 
cpi\_2012 & 172 & 42.907 & 19.614 & 8 & 90 \\ 
cpi\_2013 & 173 & 42.306 & 19.881 & 8 & 91 \\ 
cpi\_2014 & 171 & 42.930 & 19.811 & 8 & 92 \\ 
cpi\_2015 & 165 & 42.339 & 20.150 & 8 & 91 \\ 
cpi\_2016 & 173 & 42.688 & 19.375 & 10 & 90 \\ 
cpi\_2017 & 177 & 42.791 & 18.978 & 9 & 89 \\ 
lifeexp\_2012 & 187 & 71.309 & 8.052 & 48.900 & 83.600 \\ 
lifeexp\_2013 & 187 & 71.643 & 7.882 & 48.500 & 83.900 \\ 
lifeexp\_2014 & 187 & 71.867 & 7.752 & 48.700 & 84.200 \\ 
lifeexp\_2015 & 187 & 72.144 & 7.497 & 50.500 & 84.400 \\ 
lifeexp\_2016 & 187 & 72.448 & 7.296 & 51.700 & 84.700 \\ 
lifeexp\_2017 & 187 & 72.737 & 7.070 & 51.900 & 84.800 \\ 
lifeexp\_2018 & 184 & 72.969 & 6.968 & 52.400 & 85.000 \\ 
\hline \\[-1.8ex] 
\end{tabular} 
\end{table}

stargazer(df, type = "html")


<table style="text-align:center"><tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Max</td></tr>
<tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">gdp_2015</td><td>187</td><td>14,298.430</td><td>22,586.750</td><td>228</td><td>190,000</td></tr>
<tr><td style="text-align:left">gini_2015</td><td>195</td><td>38.933</td><td>7.364</td><td>24.800</td><td>63.100</td></tr>
<tr><td style="text-align:left">co2_2015</td><td>192</td><td>4.456</td><td>5.813</td><td>0.037</td><td>41.300</td></tr>
<tr><td style="text-align:left">co2_2016</td><td>192</td><td>4.424</td><td>5.644</td><td>0.025</td><td>38.500</td></tr>
<tr><td style="text-align:left">co2_2017</td><td>192</td><td>4.446</td><td>5.652</td><td>0.024</td><td>39.800</td></tr>
<tr><td style="text-align:left">co2_2018</td><td>192</td><td>4.455</td><td>5.609</td><td>0.024</td><td>38.000</td></tr>
<tr><td style="text-align:left">cpi_2012</td><td>172</td><td>42.907</td><td>19.614</td><td>8</td><td>90</td></tr>
<tr><td style="text-align:left">cpi_2013</td><td>173</td><td>42.306</td><td>19.881</td><td>8</td><td>91</td></tr>
<tr><td style="text-align:left">cpi_2014</td><td>171</td><td>42.930</td><td>19.811</td><td>8</td><td>92</td></tr>
<tr><td style="text-align:left">cpi_2015</td><td>165</td><td>42.339</td><td>20.150</td><td>8</td><td>91</td></tr>
<tr><td style="text-align:left">cpi_2016</td><td>173</td><td>42.688</td><td>19.375</td><td>10</td><td>90</td></tr>
<tr><td style="text-align:left">cpi_2017</td><td>177</td><td>42.791</td><td>18.978</td><td>9</td><td>89</td></tr>
<tr><td style="text-align:left">lifeexp_2012</td><td>187</td><td>71.309</td><td>8.052</td><td>48.900</td><td>83.600</td></tr>
<tr><td style="text-align:left">lifeexp_2013</td><td>187</td><td>71.643</td><td>7.882</td><td>48.500</td><td>83.900</td></tr>
<tr><td style="text-align:left">lifeexp_2014</td><td>187</td><td>71.867</td><td>7.752</td><td>48.700</td><td>84.200</td></tr>
<tr><td style="text-align:left">lifeexp_2015</td><td>187</td><td>72.144</td><td>7.497</td><td>50.500</td><td>84.400</td></tr>
<tr><td style="text-align:left">lifeexp_2016</td><td>187</td><td>72.448</td><td>7.296</td><td>51.700</td><td>84.700</td></tr>
<tr><td style="text-align:left">lifeexp_2017</td><td>187</td><td>72.737</td><td>7.070</td><td>51.900</td><td>84.800</td></tr>
<tr><td style="text-align:left">lifeexp_2018</td><td>184</td><td>72.969</td><td>6.968</td><td>52.400</td><td>85.000</td></tr>
<tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr></table>

If we want the table to show up formatted when we Knit, we need to add the option results = "asis" to the code chunk.

```{r, results = "asis"}
stargazer(df, type = "html")
```

Here is what that code chunk looks like. I switched the type to be latex so that it will compile in a PDF. Table 1 may appear on a different page away from the code chunk. We can add header = FALSE to remove the comment with the citation for the stargazer package.

stargazer(df, type = "latex")

% Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com % Date and time: Sun, Aug 31, 2025 - 08:52:26

Inputting a selected set of variables will restrict the table.

stargazer(df[, 5:8], header = FALSE)


\begin{table}[!htbp] \centering 
  \caption{} 
  \label{} 
\begin{tabular}{@{\extracolsep{5pt}}lccccc} 
\\[-1.8ex]\hline 
\hline \\[-1.8ex] 
Statistic & \multicolumn{1}{c}{N} & \multicolumn{1}{c}{Mean} & \multicolumn{1}{c}{St. Dev.} & \multicolumn{1}{c}{Min} & \multicolumn{1}{c}{Max} \\ 
\hline \\[-1.8ex] 
co2\_2015 & 192 & 4.456 & 5.813 & 0.037 & 41.300 \\ 
co2\_2016 & 192 & 4.424 & 5.644 & 0.025 & 38.500 \\ 
co2\_2017 & 192 & 4.446 & 5.652 & 0.024 & 39.800 \\ 
co2\_2018 & 192 & 4.455 & 5.609 & 0.024 & 38.000 \\ 
\hline \\[-1.8ex] 
\end{tabular} 
\end{table}

There are many options to alter the output. Here are a few examples. See ?stargazer and this document for more examples. Remember to change the type to html if you are compiling the below to HTML.

stargazer(df[, 5:8], title = "CO2 Emissions", type = "latex", header = FALSE)

stargazer(df[, 5:8], summary.stat = c("n", "mean", "sd"), type = "latex", header = FALSE)

stargazer(df[, 5:8], flip = TRUE, type = "latex", header = FALSE)

10.4 Advanced: Creating Tables from Scratch

This is outside the scope of the class.

These exercises will take you through creating a function that outputs a table of summary statistics for inclusion in a LaTeX document.

This function will take a tibble of summary statistics as an input. Create a tibble that lists, for each region, the mean, standard deviation, minimum, maximum, and number of non-missing observations for 2015 life expectancy. Save this tibble to an object so you can access it.

gapminder <- read_csv("Data/Gapminder/gapminder_large.csv")
out <- gapminder %>%
  group_by(region) %>%
  summarise(mean(lifeexp_2015, na.rm = TRUE),
            sd(lifeexp_2015, na.rm = TRUE),
            min(lifeexp_2015, na.rm = TRUE),
            max(lifeexp_2015, na.rm = TRUE),
            sum(!is.na(lifeexp_2015)))

Define the name of your function. The first argument will be a tibble like the one you produced in question 1.

create_table <- function(tib) {
  
  # Open file connection
  # Define header lines
  # Define body lines
  # Define footer lines
  # Write header, body, and footer lines
  # Close file connection
  
}

We want to write an output to a LaTeX file. This will require a file connection. That is, you will open a file with a certain file name, write lines from the tibble of question 1, and close the file. The second argument will be the filename. A file connection is opened and closed with the following commands.

connection <- file(filename) # Open a file
close(connection)
Add these to your function and include an argument for the filename.
create_table <- function(tib, filename) {
  
  # Open file connection
  connection <- file(filename)
  
  # Define header lines
  # Define body lines
  # Define footer lines
  # Write header, body, and footer lines
  
  # Close file connection
  close(connection)
  
}

Tables in LaTeX require a header and footer to open and close the tabular environment. Start by defining the footer as this is simplest. Add the following object to your function in between opening and closing the file connection. Why do we use two backslashes instead of one?

create_table <- function(tib, filename) {
  # Open file connection
  connection <- file(filename)
  
  # Define header lines
  # Define body lines
  
  # Define footer lines
  foot <- c("\\bottomrule", "\\end{tabular}")
  
  # Write header, body, and footer lines
  
  # Close file connection
  close(connection)
  
}

To actually write a line in the file, we will use the function writeLines(). Add this to your function. Start by writing the footer to the file. At this point, test the function out to see how the writeLines() function works.

create_table <- function(tib, filename) {
  # Open file connection
  connection <- file(filename)
  
  # Define header lines
  # Define body lines
  
  # Define footer lines
  foot <- c("\\bottomrule", "\\end{tabular}")
  
  # Write header, body, and footer lines
  writeLines(foot, connection)
  
  # Close file connection
  close(connection)
  
}
create_table(out, "Data/Gapminder/Output_Data/test.tex")

Now let’s define the header. We need to begin the tabular environment, the title columns, and the alignment of the columns. The names of the columns will be the third argument. Let’s start with all centrally aligned columns. Add the header to the writeLines() function. We have a tibble with 1 column for the region and 5 columns for summary statistics. What does your file look like now? Make adjustments if there are some oddities.

create_table <- function(tib, filename, colnames) {
  # Open file connection
  connection <- file(filename)
  
  # Define header lines
  head <- c(paste0("\\begin{tabular}{", str_dup("c", dim(tib)[2]), "}"),
            "\\toprule",
            paste(str_c(colnames, collapse = " & "), "\\\\"),
            "\\midrule")
  
  # Define body lines
  
  # Define footer lines
  foot <- c("\\bottomrule", "\\end{tabular}")
  
  # Write header, body, and footer lines
  writeLines(c(head, foot), connection)
  
  # Close file connection
  close(connection)
  
}
create_table(out, "Data/Gapminder/Output_Data/test.tex", c("Region", "Mean", "SD", "Min.", "Max.", "N"))

Finally, loop through each row in the tibble to print each line.

create_table <- function(tib, filename, colnames) {
  # Open file connection
  connection <- file(filename)
  
  # Define header lines
  head <- c(paste0("\\begin{tabular}{", str_dup("c", dim(tib)[2]), "}"),
            "\\toprule",
            paste(str_c(colnames, collapse = " & "), "\\\\"),
            "\\midrule")
  
  # Define body lines
  for (i in 1:dim(tib)[1]) {
    
    if (i == 1) {
      body <- paste(str_c(tib[i, ], collapse = " & "), "\\\\")
    } else {
      body <- c(body, paste(str_c(tib[i, ], collapse = " & "), "\\\\"))
    }
    
  }
  
  # Define footer lines
  foot <- c("\\bottomrule", "\\end{tabular}")
  
  # Write header, body, and footer lines
  writeLines(c(head, foot), connection)
  
  # Close file connection
  close(connection)
  
}
create_table(out, "Data/Gapminder/Output_Data/test.tex", c("Region", "Mean", "SD", "Min.", "Max.", "N"))

10.5 Further Reading

Reference the dplyr cheat sheet. Higher-order moments are available in the moments package.