If you use R to statistically analyze your data, you might be used to
seeing and interpreting the output from functions for models, like
lm()
and glm()
. For example, here is the code
and output for a single regression model, fit using the
lm()
function. We’ll examine how the depth of penguin’s
bills relates to their bill length using data from the {palmerpenguins}
package:
lm(bill_length_mm ~ bill_depth_mm, penguins)
#>
#> Call:
#> lm(formula = bill_length_mm ~ bill_depth_mm, data = penguins)
#>
#> Coefficients:
#> (Intercept) bill_depth_mm
#> 55.0674 -0.6498
At the same time, you might have come across—or written!—equations that appear in books, journal articles, and reports. Equations that look like this:
\[ \operatorname{bill\_length\_mm} = \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \epsilon \]
The purpose of {equatiomatic} is
to help you go from model output, like the lm()
output
above, to the equation we just saw.
As we detail in this vignette, {equatiomatic} provides the underlying equation corresponding to the statistical model output. It does this through the use of the TeX equation formatting system, which can then be included in documents written in a number of formats, including (importantly) R Markdown documents rendered to either HTML or PDF formats.
Let’s look at another basic example, again using the {palmerpenguins} data:
library(equatiomatic)
# fit a basic multiple linear regression model
m <- lm(bill_length_mm ~ bill_depth_mm + flipper_length_mm, penguins)
Now we can pull the TeX code with extract_eq
\[ \operatorname{bill\_length\_mm} = \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{flipper\_length\_mm}) + \epsilon \]
You can of course set echo = FALSE
as well, and then
you’ll get just the equation, which will look like the below.
\[ \operatorname{bill\_length\_mm} = \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{flipper\_length\_mm}) + \epsilon \]
Let’s dive into a bit more complex example.
The above was super simple, but—often—you have categorical variables with lots of levels and interactions, which can make the output of your models (and writing a model equation for your models) a bit more complicated. {equatiomatic} is intended to “smooth out” some of these issues, requiring the same process as in the above example to extract (and present) the model’s equation:
For example, here is an example using a categorical variable
(island
) in an interaction with the
bill_depth_mm
variable we used in the example above:
\[ \operatorname{bill\_length\_mm} = \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{island}_{\operatorname{Dream}}) + \beta_{3}(\operatorname{island}_{\operatorname{Torgersen}}) + \beta_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \beta_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) + \epsilon \]
Sometimes, for such models, the equations can get overly long.
That’s where the wrap
and terms_per_line
arguments come in.
\[ \begin{aligned} \operatorname{bill\_length\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{island}_{\operatorname{Dream}}) + \beta_{3}(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \beta_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \beta_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) + \epsilon \end{aligned} \]
\[ \begin{aligned} \operatorname{bill\_length\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_depth\_mm})\ + \\ &\quad \beta_{2}(\operatorname{island}_{\operatorname{Dream}}) + \beta_{3}(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \beta_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \beta_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \epsilon \end{aligned} \]
Maybe you want different intercept notation, such as \(\beta_0\)? Simply pass “beta” to the
intercept
argument, as follows.
\[ \begin{aligned} \operatorname{bill\_length\_mm} &= \beta_{0} + \beta_{1}(\operatorname{bill\_depth\_mm}) + \beta_{2}(\operatorname{island}_{\operatorname{Dream}}) + \beta_{3}(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \beta_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \beta_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) + \epsilon \end{aligned} \]
We also wrap all the variable names in \operatorname
by
default so they show up as plain text, but if you’d like your variable
names to be italicized just set ital_vars = TRUE
.
\[ \begin{aligned} bill\_length\_mm &= \alpha + \beta_{1}(bill\_depth\_mm) + \beta_{2}(island_{Dream}) + \beta_{3}(island_{Torgersen})\ + \\ &\quad \beta_{4}(bill\_depth\_mm \times island_{Dream}) + \beta_{5}(bill\_depth\_mm \times island_{Torgersen}) + \epsilon \end{aligned} \]
Currently, the intercept
argument defaults to
"alpha"
and only takes one additional argument,
"beta"
. However, the raw_tex
and
greek
arguments allows you to specify whatever syntax you
would like both for the intercept and for the coefficients. For
example
\[ \begin{aligned} \operatorname{bill\_length\_mm} &= \hat{\phi} + \hat{\gamma}_{1}(\operatorname{bill\_depth\_mm}) + \hat{\gamma}_{2}(\operatorname{island}_{\operatorname{Dream}}) + \hat{\gamma}_{3}(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad \hat{\gamma}_{4}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + \hat{\gamma}_{5}(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) + \epsilon \end{aligned} \]
In many cases, you’re interested more in including the coefficient
estimates (e.g., 3.04
), than the Greek symbols (e.g., \(\\beta\_1\)). This may be helpful for
communicating the results of a model (and, possibly, for teaching about
the statistical model). To do this, simply change the
use_coefs
argument:
\[ \begin{aligned} \operatorname{\widehat{bill\_length\_mm}} &= 63.72 - 1.16(\operatorname{bill\_depth\_mm}) - 54.12(\operatorname{island}_{\operatorname{Dream}}) - 35.13(\operatorname{island}_{\operatorname{Torgersen}})\ + \\ &\quad 3.05(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Dream}}) + 1.73(\operatorname{bill\_depth\_mm} \times \operatorname{island}_{\operatorname{Torgersen}}) \end{aligned} \]
While the above examples focused on regression models (and the
lm()
function), {equatiomatic} supports output from other
model types as well, specifically:
Model | Packages/Functions |
---|---|
linear regression | stats::lm |
logistic regression | stats::glm(family = binomial(link = 'logit')) |
probit regression | stats::glm(family = binomial(link = 'probit')) |
ordinal logistic regression | MASS::polr(method = 'logistic') ;
ordinal::clm(link = 'logit') |
ordinal probit regression | MASS::polr(method = 'probit') ;
ordinal::clm(link = 'probit') |
auto-regressive integrated moving average | forecast::Arima |
regression with auto-regressive integrated moving average errors | forecast::Arima |
Here are a few basic examples using these supported model types:
lr <- glm(sex ~ species * bill_length_mm,
data = penguins,
family = binomial(link = "logit")
)
extract_eq(lr, wrap = TRUE)
\[ \begin{aligned} \log\left[ \frac { P( \operatorname{sex} = \operatorname{male} ) }{ 1 - P( \operatorname{sex} = \operatorname{male} ) } \right] &= \alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \beta_{3}(\operatorname{bill\_length\_mm})\ + \\ &\quad \beta_{4}(\operatorname{species}_{\operatorname{Chinstrap}} \times \operatorname{bill\_length\_mm}) + \beta_{5}(\operatorname{species}_{\operatorname{Gentoo}} \times \operatorname{bill\_length\_mm}) \end{aligned} \]
You can also (optionally) show the how the data are assumed to be distributed.
\[ \begin{aligned} \operatorname{sex} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{sex} = \operatorname{male}}= \hat{P}\right) \\ \log\left[ \frac { \hat{P} }{ 1 - \hat{P} } \right] &= \alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \beta_{3}(\operatorname{bill\_length\_mm})\ + \\ &\quad \beta_{4}(\operatorname{species}_{\operatorname{Chinstrap}} \times \operatorname{bill\_length\_mm}) + \beta_{5}(\operatorname{species}_{\operatorname{Gentoo}} \times \operatorname{bill\_length\_mm}) \end{aligned} \]
Probit regression works similarly to logistic regression:
pr <- glm(sex ~ species * bill_length_mm,
data = penguins,
family = binomial(link = "probit")
)
extract_eq(pr, wrap = TRUE)
\[ \begin{aligned} P( \operatorname{sex} = \operatorname{male} ) &= \Phi[\alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \beta_{3}(\operatorname{bill\_length\_mm})\ + \\ &\qquad\ \beta_{4}(\operatorname{species}_{\operatorname{Chinstrap}} \times \operatorname{bill\_length\_mm}) + \beta_{5}(\operatorname{species}_{\operatorname{Gentoo}} \times \operatorname{bill\_length\_mm})] \end{aligned} \]
Again, you can (optionally) show the how the data are assumed distributed.
\[ \begin{aligned} \operatorname{sex} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{sex} = \operatorname{male}}= \hat{P}\right) \\ \hat{P} &= \Phi[\alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \beta_{3}(\operatorname{bill\_length\_mm})\ + \\ &\qquad\ \beta_{4}(\operatorname{species}_{\operatorname{Chinstrap}} \times \operatorname{bill\_length\_mm}) + \beta_{5}(\operatorname{species}_{\operatorname{Gentoo}} \times \operatorname{bill\_length\_mm})] \end{aligned} \]
For these examples, we’ll use wine tasting data from the {ordinal} package. If you don’t already have this package, you can download it with
The data look like this
response | rating | temp | contact | bottle | judge |
---|---|---|---|---|---|
36 | 2 | cold | no | 1 | 1 |
48 | 3 | cold | no | 2 | 1 |
47 | 3 | cold | yes | 3 | 1 |
67 | 4 | cold | yes | 4 | 1 |
77 | 4 | warm | no | 5 | 1 |
60 | 4 | warm | no | 6 | 1 |
We’ll fit a model from the documentation, with the ordinal rating response predicted by an interaction between the temperature and and contact.
You can fit ordinal response models with either
MASS::polr
or ordinal::clm
, and {equatiomatic}
works with either, and with logistic or probit link functions.
mass_ologit <- MASS::polr(rating ~ temp * contact,
data = ordinal::wine
)
extract_eq(mass_ologit, wrap = TRUE, terms_per_line = 2)
#>
#> Re-fitting to get Hessian
#>
#>
#> Re-fitting to get Hessian
\[ \begin{aligned} \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{1} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{1} ) } \right] &= \alpha_{1} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{2} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{2} ) } \right] &= \alpha_{2} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{3} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{3} ) } \right] &= \alpha_{3} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{4} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{4} ) } \right] &= \alpha_{4} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \end{aligned} \]
mass_oprobit <- MASS::polr(rating ~ temp * contact,
data = ordinal::wine,
method = "probit"
)
extract_eq(mass_oprobit, wrap = TRUE, terms_per_line = 2)
#>
#> Re-fitting to get Hessian
#>
#>
#> Re-fitting to get Hessian
\[ \begin{aligned} P( \operatorname{rating} \leq \operatorname{1} ) &= \Phi[\alpha_{1} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{2} ) &= \Phi[\alpha_{2} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{3} ) &= \Phi[\alpha_{3} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{4} ) &= \Phi[\alpha_{4} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \end{aligned} \]
And we can do the same thing with the {ordinal} package
ordinal_ologit <- ordinal::clm(rating ~ temp * contact,
data = ordinal::wine,
link = "logit"
)
extract_eq(ordinal_ologit, wrap = TRUE, terms_per_line = 2)
\[ \begin{aligned} \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{1} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{1} ) } \right] &= \alpha_{1} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{2} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{2} ) } \right] &= \alpha_{2} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{3} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{3} ) } \right] &= \alpha_{3} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \\ \log\left[ \frac { P( \operatorname{rating} \leq \operatorname{4} ) }{ 1 - P( \operatorname{rating} \leq \operatorname{4} ) } \right] &= \alpha_{4} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\quad \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}}) \end{aligned} \]
ordinal_probit <- ordinal::clm(rating ~ temp * contact,
data = ordinal::wine,
link = "probit"
)
extract_eq(ordinal_probit, wrap = TRUE, terms_per_line = 2)
\[ \begin{aligned} P( \operatorname{rating} \leq \operatorname{1} ) &= \Phi[\alpha_{1} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{2} ) &= \Phi[\alpha_{2} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{3} ) &= \Phi[\alpha_{3} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \\ P( \operatorname{rating} \leq \operatorname{4} ) &= \Phi[\alpha_{4} + \beta_{1}(\operatorname{temp}_{\operatorname{warm}})\ + \\ &\qquad\ \beta_{2}(\operatorname{contact}_{\operatorname{yes}}) + \beta_{3}(\operatorname{temp}_{\operatorname{warm}} \times \operatorname{contact}_{\operatorname{yes}})] \end{aligned} \]
We are aware of a few things the package doesn’t yet do, but that we hope to add later. These include:
log
, exp
,
sqrt
)lm(y ~ poly(x, 3))
)Regarding this last point, we are hopeful that we can incorporate essentially all the models covered by broom. Multilevel models are particularly high on our wish list. But we have not yet had the time to develop these.
We would LOVE to have you as a contributor! Is there a model that we don’t currently fit that you want implemented? There are a few ways to go about this. You can either (a) fork the repository and implement the method on your own, then submit a PR, or (b) file an issue.
If you file an issue it would be really helpful if you could provide an example of a fitted model and what the equation for that model should look like. We will try to get to these as soon as possible.
Also, the next planned vignette (at this particular moment) is on contributing to the package, with a step-by-step example of implementing a new method. So stay tuned for that, if you’re interested in (a) but not yet sure how to get started.