RRPP is an acronym for randomization of residuals in a permutation procedure. RRPP refers to the package utilizing RRPP.
Analysis of variance (ANOVA) means different things to different people, but generally speaking, one recognizes an ANOVA table as a table of statistics including sums of squares, \(SS\), mean squares , \(MS\), \(F\)-statistics, and \(P\)-values. In this table, such statistics might be found for different model terms and can be calculated for different types of \(SS\).
ANOVA in RRPP is a concept that generalizes the statistics used in univariate ANOVA to multivariate data. The fundamental statistic is \(SS\), calculated as the trace of the sums of squares and cross-products matrix, \(\mathbf{S}\). Thus, \(SS\) is sum of each variable’s \(SS\), meaning for univariate data, \(SS\) is found for a single variable, only. Dividing this trace form of \(SS\) by appropriate degrees of freedom, produces \(MS\) values, which can be used to calculate \(F\)-statistics; i.e.,
\[ F = \frac{MS_{effect}}{MS_{Random}} = \frac{SS_{effect}/df_{effect}}{SS_{random}/df_{random}}= \frac{trace(\mathbf{S}_{effect})/df_{effect}}{trace(\mathbf{S}_{random})/df_{random}},\]
where \(random\) can refer to a random effect or residuals, and \(df\) is the appropriate degrees of freedom for effect and random \(SS\).
In this manner, there is no fundamental difference in how univariate and multivariate statistics are calculated (the univariate statistics are a simplified version of the same multivariate statistics). Therefore, ANOVA in RRPP does not distinguish between univariate and multivariate data, but refers to this consistency in calculation of statistics.
Multivariate analysis of variance (MANOVA) is perhaps typically thought of as an analog to ANOVA when data are multivariate. In what is classically referred to as, “MANOVA,” multivariate statistics are calculated, including Roy’s maximum root, Pillai trace, Hotelling-Lawley trace, and Wilks \(\Lambda\). These statistics do not have probability distributions (density functions), so they are converted to statistics that are assumed to approximately follow \(F\)-distributions. This works in cases that the number of observations are sufficiently larger than the number of response variables. Therefore, the M in MANOVA can be thought of as an indicator that multivariate data are analyzed and/or that multivariate statistics are calculated as a means to estimate \(F\)-statistics, but the ANOVA in MANOVA is not different than parametric ANOVA for univariate data, once \(F\)-statistics are calculated.
In RRPP, the M in MANOVA references the calculation of multivariate statistics and generating their empirical sampling distributions. Calculating multivariate statistics is not different than parametric MANOVA, but in RRPP, the statistics are calculated in every random permutation, eliminating the need to find an approximate probability distribution (the empirical sampling distribution is sufficient as an approximation). Whereas parametric MANOVA could be thought of as the forcing of multivariate statistics into a univariate ANOVA-style framework, MANOVA in RRPP is better defined as using randomization of residuals in a permutation procedure on multivariate statistics in much the same way as the traces of \(\mathbf{S}\) matrices in ANOVA. Description of how RRPP works can be found in another vignette
All multivariate statistics are derived from eigenanalysis of
\[\mathbf{S}_{random}^{-1}\mathbf{S}_{effect},\]
where \(\mathbf{S}_{random}\) is typically the \(\mathbf{S}\) matrix for residuals. Singular matrices are not a concern if data are projected into an Euclidean space of appropriate dimensions before matrix inversion. However, Wilks \(\Lambda\) might be less appropriate than other multivariate statistics, as it relies on products of eigenvalues rather than summation.
When fitting a linear model in RRPP over many permutations, some ANOVA statistics are calculated automatically in every permutation. Others can be obtained from those statistics.
library(RRPP)
data(Pupfish)
fit <- lm.rrpp(coords ~ Sex*Pop, SS.type = "I",
data = Pupfish, print.progress = FALSE)
attributes(fit)
## $names
## [1] "call" "LM" "ANOVA" "PermInfo" "turbo" "verbose" "subTest"
##
## $class
## [1] "lm.rrpp"
## $names
## [1] "SS.type" "SS" "RSS" "TSS" "RSS.model"
## [6] "Rsq" "full.resid" "n" "p" "df"
## $names
## [1] "SS" "MS" "Rsq" "Fs" "cohenf"
The distributions of ANOVA statistics can then be used to construct an ANOVA table, via the anova.lm.rrpp S3 generic; i.e.,
##
## Analysis of Variance, using Residual Randomization
## Permutation procedure: Randomization of null model residuals
## Number of permutations: 1000
## Estimation method: Ordinary Least Squares
## Sums of Squares and Cross-products: Type I
## Effect sizes (Z) based on F distributions
##
## Df SS MS Rsq F Z Pr(>F)
## Sex 1 0.015780 0.0157802 0.28012 28.209 4.7773 0.001 ***
## Pop 1 0.009129 0.0091294 0.16206 16.320 4.7097 0.001 ***
## Sex:Pop 1 0.003453 0.0034532 0.06130 6.173 3.7015 0.001 ***
## Residuals 50 0.027970 0.0005594 0.49651
## Total 53 0.056333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: lm.rrpp(f1 = coords ~ Sex * Pop, SS.type = "I", data = Pupfish,
## print.progress = FALSE)
To switch to MANOVA statistics, the linear model fit must be updated to include MANOVA statistics, in addition to the ANOVA statistics already generated. The MANOVA statistics take more time to calculate, because of matrix inversion and eigenanalysis in every permutation, so it is not performed unless requested.
## $names
## [1] "call" "LM" "ANOVA" "PermInfo" "turbo" "verbose" "subTest"
## [8] "MANOVA"
##
## $class
## [1] "manova.lm.rrpp" "lm.rrpp"
## $names
## [1] "SSCP" "invR.H" "eigs" "error"
## [5] "PCA" "manova.pc.dims" "e.rank" "SS.tot"
The “eigs” object returns all the eigenvalues of \(\mathbf{S}_{random}^{-1}\mathbf{S}_{effect}\), because the default argument. One can return the matrices, themselves, with “invR.H”, where \(\mathbf{H} = \mathbf{S}_{effect}\), and \(\mathbf{R} = \mathbf{S}_{residual}\), in line with common notation. (\(\mathbf{H}\) and \(\mathbf{R}\) are SSCP matrices for model effects and residuals, respectively.) The class lm.rrpp fit has class manova.lm.rrpp added, and the S3 summary function produces a MANOVA-like table.
##
## Linear Model fit with lm.rrpp
##
## Number of observations: 54
## Number of dependent variables: 112
## Data space dimensions: 53
## Residual covariance matrix rank: 50
## Sums of Squares and Cross-products: Type I
## Number of permutations: 1000
##
## Df Rand Roy Z Pr(>Roy)
## Sex 1 Residuals 17.73326 6.8151620 0.001
## Pop 1 Residuals 50.34889 1.5093978 0.070
## Sex:Pop 1 Residuals 12.71263 -0.5293145 0.706
## Full.Model 3 Residuals 54.35676 8.0972310 0.001
## Residuals 50
##
## Linear Model fit with lm.rrpp
##
## Number of observations: 54
## Number of dependent variables: 112
## Data space dimensions: 53
## Residual covariance matrix rank: 50
## Sums of Squares and Cross-products: Type I
## Number of permutations: 1000
##
## Df Rand Pillai Z Pr(>Pillai)
## Sex 1 Residuals 0.9466190 4.9765825 0.001
## Pop 1 Residuals 0.9805254 1.3801816 0.070
## Sex:Pop 1 Residuals 0.9270745 -0.3379822 0.706
## Full.Model 3 Residuals 2.8339784 9.6895441 0.001
## Residuals 50
It should be apparent that although the ANOVA and MANOVA results have some similarities, the effect sizes (\(Z\)-scores) and \(P\)-values can vary. Comparatively, the results of the MANOVA statistics can be influenced by variable covariances more so than ANOVA results (which are influenced solely by variances; i.e., the dispersion of values in the data space). The results are also influenced by dimensionality. In this case, there are more shape variables (112) than residual degrees of freedom (50), so data are projected into a lower-dimension Euclidean space to allow matrix inversion; i.e., \(\mathbf{S}_{residual}^{-1}\mathbf{S}_{effect}\). A tolerance of 0 means all possible PCs were retained for analysis. (Tolerance is the permissible relative change in eigenvalues. For example, tol = 0.01 means that if the k + 1st eigenvalue is not at least 1% smaller than the kth eigenvalue, retain only k eigenvectors. The k + 1, k + 2, k + 3, … eigenvalues decay at such a small rate that they are considered inconsequential.)
When the number of observations far exceeds the data space dimensions, this is okay, and the results will be qualitatively more similar to ANOVA results. Consider this example (although reducing dimensions is not recommended as a typical solution):
##
## Analysis of Variance, using Residual Randomization
## Permutation procedure: Randomization of null model residuals
## Number of permutations: 1000
## Estimation method: Ordinary Least Squares
## Sums of Squares and Cross-products: Type I
## Effect sizes (Z) based on F distributions
##
## Df SS MS Rsq F Z Pr(>F)
## Sex 1 0.015780 0.0157802 0.28012 28.209 4.7773 0.001 ***
## Pop 1 0.009129 0.0091294 0.16206 16.320 4.7097 0.001 ***
## Sex:Pop 1 0.003453 0.0034532 0.06130 6.173 3.7015 0.001 ***
## Residuals 50 0.027970 0.0005594 0.49651
## Total 53 0.056333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: lm.rrpp(f1 = coords ~ Sex * Pop, SS.type = "I", data = Pupfish,
## print.progress = FALSE)
##
## Linear Model fit with lm.rrpp
##
## Number of observations: 54
## Number of dependent variables: 112
## Data space dimensions: 53
## Residual covariance matrix rank: 50
## Sums of Squares and Cross-products: Type I
## Number of permutations: 1000
##
## Df Rand Roy Z Pr(>Roy)
## Sex 1 Residuals 17.73326 6.8151620 0.001
## Pop 1 Residuals 50.34889 1.5093978 0.070
## Sex:Pop 1 Residuals 12.71263 -0.5293145 0.706
## Full.Model 3 Residuals 54.35676 8.0972310 0.001
## Residuals 50
Alternatively, one could specify an exact number of PCs to use in analysis.
##
## Analysis of Variance, using Residual Randomization
## Permutation procedure: Randomization of null model residuals
## Number of permutations: 1000
## Estimation method: Ordinary Least Squares
## Sums of Squares and Cross-products: Type I
## Effect sizes (Z) based on F distributions
##
## Df SS MS Rsq F Z Pr(>F)
## Sex 1 0.015780 0.0157802 0.28012 28.209 4.7773 0.001 ***
## Pop 1 0.009129 0.0091294 0.16206 16.320 4.7097 0.001 ***
## Sex:Pop 1 0.003453 0.0034532 0.06130 6.173 3.7015 0.001 ***
## Residuals 50 0.027970 0.0005594 0.49651
## Total 53 0.056333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: lm.rrpp(f1 = coords ~ Sex * Pop, SS.type = "I", data = Pupfish,
## print.progress = FALSE)
##
## Linear Model fit with lm.rrpp
##
## Number of observations: 54
## Number of dependent variables: 112
## Data space dimensions: 53
## Residual covariance matrix rank: 50
## Data reduced to 10 PCs, as required or prescribed.
## 2 % of overall variation explained by these PCs.
## See $MANOVA$PCA from manova.lm.rrpp object for more information.
## Sums of Squares and Cross-products: Type I
## Number of permutations: 1000
##
## Df Rand Roy Z Pr(>Roy)
## Sex 1 Residuals 7.357775 8.928089 0.001
## Pop 1 Residuals 18.318603 11.680405 0.001
## Sex:Pop 1 Residuals 5.132268 8.500135 0.001
## Full.Model 3 Residuals 20.461447 10.525117 0.001
## Residuals 50
If the maximum number of possible PCs is chosen (full data space dimensionality) AND the number of variables exceeds the number of observations, RRPP will attempt to assess the number of real dimensions in each random permutation and adjust SSCP matrices, accordingly. (It is possible to randomly generate SSCP matrices of lower rank in random permutations, via RRPP.) However, it cannot be guaranteed that eigenvalues will be positive in such cases. If a warning message is delivered, “NaNs produced,” this might be the reason. Using a slightly lower number of PCs will probably resolve the issue.
ANOVA in RRPP generalizes univariate ANOVA to multivariate data, its statistics are directly associated with the amount of dispersion in multivariate data spaces, irrespective of variable covariances, and data dimensionality has no effect or limitation on their calculation.
MANOVA in RRPP produces the same multivariate statistics found in parametric MANOVA (when the number of observations exceeds residual degrees of freedom), but unlike parametric MANOVA, \(P\)-values are estimated from empirical sampling distributions of statistics (rather than estimating values that approximately follow \(F\)-distributions, with constraints). The choice of multivariate statistic is inconsequential for \(P\)-values, but makes a difference for effect sizes.
Because high-dimensional data force a projection of data into Euclidean subspaces, to make matrix inversion and eigenanalysis possible, MANOVA results could differ from ANOVA results (both qualitatively and in terms of effect sizes). For multivariate data with many more observations than variables (ideal conditions), results using either ANOVA or MANOVA will be comparable. For high-dimensional data, ANOVA results might be more reliable.