Introduction to {equatiomatic}

Background and Motivation

If you use R to statistically analyze your data, you might be used to seeing and interpreting the output from functions for models, like lm() and glm(). For example, here is the code and output for a single regression model, fit using the lm() function. We’ll examine how the depth of penguin’s bills relates to their bill length using data from the {palmerpenguins} package:

lm(bill_length_mm ~ bill_depth_mm, penguins)
#> 
#> Call:
#> lm(formula = bill_length_mm ~ bill_depth_mm, data = penguins)
#> 
#> Coefficients:
#>   (Intercept)  bill_depth_mm  
#>       55.0674        -0.6498

At the same time, you might have come across—or written!—equations that appear in books, journal articles, and reports. Equations that look like this:

\[ \operatorname{bill\_length\_mm} = \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \epsilon \]

The purpose of {equatiomatic} is to help you go from model output, like the lm() output above, to the equation we just saw.

As we detail in this vignette, {equatiomatic} provides the underlying equation corresponding to the statistical model output. It does this through the use of the TeX equation formatting system, which can then be included in documents written in a number of formats, including (importantly) R Markdown documents rendered to either HTML or PDF formats.

Example Usage

A basic example for a multiple linear regression

Let’s look at another basic example, again using the {palmerpenguins} data:

library(equatiomatic)

# fit a basic multiple linear regression model
m <- lm(bill_length_mm ~ bill_depth_mm + flipper_length_mm, penguins)

Now we can pull the TeX code with extract_eq

extract_eq(m)

\[ \operatorname{bill\_length\_mm} = \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{flipper\_length\_mm}) + \epsilon \]

You can of course set echo = FALSE as well, and then you’ll get just the equation, which will look like the below.

\[ \operatorname{bill\_length\_mm} = \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{flipper\_length\_mm}) + \epsilon \]

Let’s dive into a bit more complex example.

An example with a slightly more complicated model

The above was super simple, but—often—you have categorical variables with lots of levels and interactions, which can make the output of your models (and writing a model equation for your models) a bit more complicated. {equatiomatic} is intended to “smooth out” some of these issues, requiring the same process as in the above example to extract (and present) the model’s equation:

For example, here is an example using a categorical variable (island) in an interaction with the bill_depth_mm variable we used in the example above:

m2 <- lm(bill_length_mm ~ bill_depth_mm * island, penguins)
extract_eq(m2)

\[ \operatorname{bill\_length\_mm} = \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{island}_{\operatorname{Dream}}) + \beta_{3}(\operatorname{island}_{\operatorname{Torgersen}}) + \beta_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \beta_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) + \epsilon \]

Sometimes, for such models, the equations can get overly long.

That’s where the wrap and terms_per_line arguments come in.

extract_eq(m2, wrap = TRUE) # default terms_per_line = 4

\[ \begin{aligned} \operatorname{bill\_length\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{island}_{\operatorname{Dream}}) + \beta_{3}(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \beta_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \beta_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) + \epsilon \end{aligned} \]

extract_eq(m2, wrap = TRUE, terms_per_line = 2)

\[ \begin{aligned} \operatorname{bill\_length\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_depth\_mm})\ + \\ &\quad \beta_{2}(\operatorname{island}_{\operatorname{Dream}}) + \beta_{3}(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \beta_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \beta_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \epsilon \end{aligned} \]

Maybe you want different intercept notation, such as \(\beta_0\)? Simply pass “beta” to the intercept argument, as follows.

extract_eq(m2, wrap = TRUE, intercept = "beta")

\[ \begin{aligned} \operatorname{bill\_length\_mm} &= \beta_{0} + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{island}_{\operatorname{Dream}}) + \beta_{3}(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \beta_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \beta_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) + \epsilon \end{aligned} \]

We also wrap all the variable names in \operatorname by default so they show up as plain text, but if you’d like your variable names to be italicized just set ital_vars = TRUE.

extract_eq(m2, wrap = TRUE, ital_vars = TRUE)

\[ \begin{aligned} bill\_length\_mm &= \alpha + \beta_{1}(bill\_depth\_mm) + \beta_{2}(island_{Dream}) + \beta_{3}(island_{Torgersen})\ + \\ &\quad \beta_{4}(bill\_depth\_mm \times island_{Dream}) + \beta_{5}(bill\_depth\_mm \times island_{Torgersen}) + \epsilon \end{aligned} \]

Raw TeX code

Currently, the intercept argument defaults to "alpha" and only takes one additional argument, "beta". However, the raw_tex and greek arguments allows you to specify whatever syntax you would like both for the intercept and for the coefficients. For example

extract_eq(m2,
  wrap = TRUE,
  intercept = "\\hat{\\phi}",
  greek = "\\hat{\\gamma}",
  raw_tex = TRUE
)

\[ \begin{aligned} \operatorname{bill\_length\_mm} &= \hat{\phi} + \hat{\gamma}_{1}(\operatorname{bill\_depth\_mm}) + \hat{\gamma}_{2}(\operatorname{island}_{\operatorname{Dream}}) + \hat{\gamma}_{3}(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \hat{\gamma}_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \hat{\gamma}_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) + \epsilon \end{aligned} \]

Coefficients

In many cases, you’re interested more in including the coefficient estimates (e.g., 3.04), than the Greek symbols (e.g., \(\\beta\_1\)). This may be helpful for communicating the results of a model (and, possibly, for teaching about the statistical model). To do this, simply change the use_coefs argument:

extract_eq(m2, wrap = TRUE, use_coefs = TRUE)

\[ \begin{aligned} \operatorname{\widehat{bill\_length\_mm}} &= 63.72 - 1.16(\operatorname{bill\_depth\_mm}) - 54.12(\operatorname{island}_{\operatorname{Dream}}) - 35.13(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad 3.05(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + 1.73(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) \end{aligned} \]

Other models

While the above examples focused on regression models (and the lm() function), {equatiomatic} supports output from other model types as well, specifically:

Model Packages/Functions
linear regression stats::lm
logistic regression stats::glm(family = binomial(link = 'logit'))
probit regression stats::glm(family = binomial(link = 'probit'))
ordinal logistic regression MASS::polr(method = 'logistic'); ordinal::clm(link = 'logit')
ordinal probit regression MASS::polr(method = 'probit'); ordinal::clm(link = 'probit')
auto-regressive integrated moving average forecast::Arima
regression with auto-regressive integrated moving average errors forecast::Arima

Here are a few basic examples using these supported model types:

Logistic Regression

lr <- glm(sex ~ species * bill_length_mm,
  data = penguins,
  family = binomial(link = "logit")
)

extract_eq(lr, wrap = TRUE)

\[ \begin{aligned} \log\left[ \frac { P( \operatorname{sex} = \operatorname{male} ) }{ 1 - P( \operatorname{sex} = \operatorname{male} ) } \right] &= \alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \beta_{3}(\operatorname{bill\_length\_mm})\ + \\ &\quad \beta_{4}(\operatorname{species}_{\operatorname{Chinstrap}} \times \operatorname{bill\_length\_mm}) + \beta_{5}(\operatorname{species}_{\operatorname{Gentoo}} \times \operatorname{bill\_length\_mm}) \end{aligned} \]

You can also (optionally) show the how the data are assumed to be distributed.

extract_eq(lr, wrap = TRUE, show_distribution = TRUE)

\[ \begin{aligned} \operatorname{sex} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{sex} = \operatorname{male}}= \hat{P}\right) \\ \log\left[ \frac { \hat{P} }{ 1 - \hat{P} } \right] &= \alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \beta_{3}(\operatorname{bill\_length\_mm})\ + \\ &\quad \beta_{4}(\operatorname{species}_{\operatorname{Chinstrap}} \times \operatorname{bill\_length\_mm}) + \beta_{5}(\operatorname{species}_{\operatorname{Gentoo}} \times \operatorname{bill\_length\_mm}) \end{aligned} \]

Probit Regression

Probit regression works similarly to logistic regression:

pr <- glm(sex ~ species * bill_length_mm,
  data = penguins,
  family = binomial(link = "probit")
)

extract_eq(pr, wrap = TRUE)

\[ \begin{aligned} P( \operatorname{sex} = \operatorname{male} ) &= \Phi[\alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \beta_{3}(\operatorname{bill\_length\_mm})\ + \\ &\qquad\ \beta_{4}(\operatorname{species}_{\operatorname{Chinstrap}} \times \operatorname{bill\_length\_mm}) + \beta_{5}(\operatorname{species}_{\operatorname{Gentoo}} \times \operatorname{bill\_length\_mm})] \end{aligned} \]

Again, you can (optionally) show the how the data are assumed distributed.

extract_eq(pr, wrap = TRUE, show_distribution = TRUE)

\[ \begin{aligned} \operatorname{sex} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{sex} = \operatorname{male}}= \hat{P}\right) \\ \hat{P} &= \Phi[\alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \beta_{3}(\operatorname{bill\_length\_mm})\ + \\ &\qquad\ \beta_{4}(\operatorname{species}_{\operatorname{Chinstrap}} \times \operatorname{bill\_length\_mm}) + \beta_{5}(\operatorname{species}_{\operatorname{Gentoo}} \times \operatorname{bill\_length\_mm})] \end{aligned} \]

Ordinal Regression

For these examples, we’ll use wine tasting data from the {ordinal} package. If you don’t already have this package, you can download it with

install.packages("ordinal")

The data look like this

response rating temp contact bottle judge
36 2 cold no 1 1
48 3 cold no 2 1
47 3 cold yes 3 1
67 4 cold yes 4 1
77 4 warm no 5 1
60 4 warm no 6 1

We’ll fit a model from the documentation, with the ordinal rating response predicted by an interaction between the temperature and and contact.

You can fit ordinal response models with either MASS::polr or ordinal::clm, and {equatiomatic} works with either, and with logistic or probit link functions.

{MASS}

Logit method

mass_ologit <- MASS::polr(rating ~ temp * contact,
  data = ordinal::wine
)

extract_eq(mass_ologit, wrap = TRUE, terms_per_line = 2)
#> 
#> Re-fitting to get Hessian
#> 
#> 
#> Re-fitting to get Hessian

\[ \begin{aligned} \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{1} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{1} ) } \right] &= \alpha_{1} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{2} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{2} ) } \right] &= \alpha_{2} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{3} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{3} ) } \right] &= \alpha_{3} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{4} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{4} ) } \right] &= \alpha_{4} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \end{aligned} \]

Probit method

mass_oprobit <- MASS::polr(rating ~ temp * contact,
  data = ordinal::wine,
  method = "probit"
)

extract_eq(mass_oprobit, wrap = TRUE, terms_per_line = 2)
#> 
#> Re-fitting to get Hessian
#> 
#> 
#> Re-fitting to get Hessian

\[ \begin{aligned} P( \operatorname{rating} \leq \operatorname{1} ) &= \Phi[\alpha_{1} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{2} ) &= \Phi[\alpha_{2} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{3} ) &= \Phi[\alpha_{3} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{4} ) &= \Phi[\alpha_{4} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \end{aligned} \]

{ordinal}

And we can do the same thing with the {ordinal} package

Logit method

ordinal_ologit <- ordinal::clm(rating ~ temp * contact,
  data = ordinal::wine,
  link = "logit"
)

extract_eq(ordinal_ologit, wrap = TRUE, terms_per_line = 2)

\[ \begin{aligned} \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{1} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{1} ) } \right] &= \alpha_{1} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{2} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{2} ) } \right] &= \alpha_{2} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{3} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{3} ) } \right] &= \alpha_{3} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{4} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{4} ) } \right] &= \alpha_{4} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \end{aligned} \]

Probit method

ordinal_probit <- ordinal::clm(rating ~ temp * contact,
  data = ordinal::wine,
  link = "probit"
)

extract_eq(ordinal_probit, wrap = TRUE, terms_per_line = 2)

\[ \begin{aligned} P( \operatorname{rating} \leq \operatorname{1} ) &= \Phi[\alpha_{1} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{2} ) &= \Phi[\alpha_{2} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{3} ) &= \Phi[\alpha_{3} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{4} ) &= \Phi[\alpha_{4} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \end{aligned} \]

Things the package does not yet do

We are aware of a few things the package doesn’t yet do, but that we hope to add later. These include:

Regarding this last point, we are hopeful that we can incorporate essentially all the models covered by broom. Multilevel models are particularly high on our wish list. But we have not yet had the time to develop these.

Contributing

We would LOVE to have you as a contributor! Is there a model that we don’t currently fit that you want implemented? There are a few ways to go about this. You can either (a) fork the repository and implement the method on your own, then submit a PR, or (b) file an issue.

If you file an issue it would be really helpful if you could provide an example of a fitted model and what the equation for that model should look like. We will try to get to these as soon as possible.

Also, the next planned vignette (at this particular moment) is on contributing to the package, with a step-by-step example of implementing a new method. So stay tuned for that, if you’re interested in (a) but not yet sure how to get started.