Path coefficient analysis, introduced by Sewall Wright in 1921 under the title "Correlation and Causation", is an extension of multiple regression analysis that decomposes correlation coefficients into direct, indirect, spurious and unanalyzed effects. It is a vital tool for studying cause-effect relationships among normally distributed variables. It comes in three forms: simple, sequential and multivariate; in the simple form there is a single dependent (endogenous) variable and one or more independent (exogenous) variables. Sewall Wright is certainly the pioneer of path coefficient analysis, with numerous publications on the subject from 1916 to 1980. The method was initially viewed with skepticism but was later accepted and widely used in the social sciences; today it is applied in almost every field. For more on path coefficient analysis see (Bondari 1990; Wright 1923, 1934, 1960; Li 1975; Wolfle 2003). To become more familiar with the underlying statistical topics, such as descriptive statistics, refer to statistical references such as (Snedecor and Cochran 1980; Bhattacharyya and Johnson 1997; Draper and Smith 1981; Neter, Whitmore, and Wasserman 1992).
In a path coefficient analysis, descriptive statistics and Pearson correlation coefficients (double-headed arrows) between variables may be estimated, a task this package performs. Simple or multiple linear regression of the dependent (endogenous) variable(s) on the independent variable(s) may also be carried out, which is likewise handled here. In a sequential path coefficient analysis, intervening (endogenous) variables exist and the analyses are performed step by step with this package, whereas in a simple path coefficient analysis a single step suffices; the package carries out that step and draws the path diagram automatically. For complicated or sequential paths, some additional work is required, as discussed later in this manual. In a path model, the path coefficients or direct effects (\(P_i\)'s) indicate the direct effect of one variable on another; they are standardized partial regression coefficients (in Wright's terminology) because they are estimated from correlations or from the transformed (standardized) data as \(P_i = \beta_i\frac{\sigma_{X_i}}{\sigma_Y}\). The path equations are as follows:
\[\begin{aligned} P_1 + P_2r_{12} + P_3r_{13} + \dots + P_nr_{1n} &= r_{Y1} \\ P_1r_{21} + P_2 + P_3r_{23} + \dots + P_nr_{2n} &= r_{Y2} \\ P_1r_{31} + P_2r_{32} + P_3 + \dots + P_nr_{3n} &= r_{Y3} \\ &\ \ \vdots \\ P_1r_{n1} + P_2r_{n2} + P_3r_{n3} + \dots + P_n &= r_{Yn} \end{aligned}\]
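For a small illustration of how these equations are solved in practice, the following sketch (not package code; it assumes the built-in dtsimp data with Y in the first column) obtains the direct effects by solving the system in matrix form, \(\mathbf{P} = \mathbf{R}_{XX}^{-1}\mathbf{r}_{XY}\):
data(dtsimp)
d <- as.data.frame(dtsimp)
Rxx <- cor(d[, -1])          # correlations among the independent variables
rxy <- cor(d[, -1], d[, 1])  # correlations of each independent variable with Y
P <- solve(Rxx, rxy)         # direct effects (path coefficients)
P
The same values are obtained as the standardized partial regression coefficients from a regression fitted to the scale()d data.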
Our package performs this task in a straightforward way, with detailed explanations. As stated by Bondari (1990), for two dependent variables \(Y_1\) and \(Y_2\):
\[Y_1 = p_1X_1 + p_2X_2 + p_3X_3 + \dots + p_nX_n\\ Y_2 = p'_1X_1 + p'_2X_2 + p'_3X_3 + \dots + p'_nX_n\]
where
\[r_{Y_1Y_2} = p_1p'_1 + p_2p'_2 + p_3p'_3 + \dots + p_np'_n + \sum_{i \neq j} p_ip'_jr_{ij} = \sum_{i,j} p_ip'_jr_{ij}\]
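In matrix notation this decomposition is simply \(r_{Y_1Y_2} = \mathbf{p}^\top \mathbf{R}\,\mathbf{p}'\); a small numeric check with purely hypothetical direct effects and predictor correlations:
p1 <- c(0.5, 0.3, -0.2)   # hypothetical direct effects on Y1
p2 <- c(0.4, -0.1, 0.25)  # hypothetical direct effects on Y2
R <- matrix(c(1, 0.2, 0.1,
              0.2, 1, -0.3,
              0.1, -0.3, 1), nrow = 3)  # hypothetical correlations among X1-X3
drop(t(p1) %*% R %*% p2)  # implied correlation between Y1 and Y2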
The corresponding commands and the simple path diagram are shown in Figures 1 and 2.
The opening part of this vignette (instruction manual) provides a brief introduction to the concepts underpinning path coefficient analysis. The subsequent part showcases two practical demonstrations. In a path coefficient analysis, the Pearson correlation coefficients between dependent variables and their related independent variables are decomposed, as previously mentioned.
The Path.Analysis package can be applied in two cases: simple and sequential path coefficient analysis. If not already installed, the package is first installed via:
if (!require('Path.Analysis')) {
  install.packages('Path.Analysis')
}
#> Loading required package: Path.Analysis
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
library('Path.Analysis')
The analyses require the following R packages:
The dtsimp data are stored in the data folder of the Path.Analysis package. This is the simplest dataset in the package, consisting of a dependent variable called Y and three independent variables called X1, X2 and X3. Type the following commands at the console prompt and run the analyses:
data(dtsimp)
head(dtsimp[1:3, ])
Correlation between variables:
corr(dtsimp, verbose = FALSE)
Linear regression of Y on X1-X3:
reg(dtsimp, 1, verbose = FALSE)
Plot the main path diagram:
matdiag(dtsimp, 1)
#> [[1]]
#> y x1 x2 x3
#> y 1.00 0.43 -0.12 0.03
#> x1 0.43 1.00 -0.14 0.08
#> x2 -0.12 -0.14 1.00 -0.08
#> x3 0.03 0.08 -0.08 1.00
#>
#> n= 105
#>
#>
#> P
#> y x1 x2 x3
#> y 0.0000 0.2226 0.7772
#> x1 0.0000 0.1682 0.4333
#> x2 0.2226 0.1682 0.4329
#> x3 0.7772 0.4333 0.4329
#>
#> [[2]]
#> [[2]]$p
#> y x1 x2 x3
#> y 0.000000e+00 4.281686e-06 0.2225777 0.7772096
#> x1 4.281686e-06 0.000000e+00 0.1682316 0.4333210
#> x2 2.225777e-01 1.682316e-01 0.0000000 0.4328677
#> x3 7.772096e-01 4.333210e-01 0.4328677 0.0000000
#>
#> [[2]]$lowCI
#> y x1 x2 x3
#> y 1.0000000 0.2616079 -0.3046920 -0.1646039
#> x1 0.2616079 1.0000000 -0.3188570 -0.1161105
#> x2 -0.3046920 -0.3188570 1.0000000 -0.2650856
#> x3 -0.1646039 -0.1161105 -0.2650856 1.0000000
#>
#> [[2]]$uppCI
#> y x1 x2 x3
#> y 1.00000000 0.57567143 0.07331527 0.2184383
#> x1 0.57567143 1.00000000 0.05769229 0.2650146
#> x2 0.07331527 0.05769229 1.00000000 0.1160351
#> x3 0.21843826 0.26501461 0.11603511 1.0000000
#> Warning in summary.lm(mlreg): essentially perfect fit: summary may be
#> unreliable
#> [[1]]
#>
#> Call:
#> lm(formula = datap[, resp] ~ ., data = datap)
#>
#> Coefficients:
#> (Intercept) y x1 x2 x3
#> 1.109e-14 1.000e+00 3.295e-18 -1.762e-17 7.056e-17
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = datap[, resp] ~ ., data = datap)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.397e-15 -1.069e-15 -1.786e-16 9.219e-16 1.232e-14
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.109e-14 1.644e-15 6.75e+00 9.81e-10 ***
#> y 1.000e+00 1.065e-17 9.39e+16 < 2e-16 ***
#> x1 3.295e-18 1.457e-17 2.26e-01 0.821
#> x2 -1.762e-17 9.490e-17 -1.86e-01 0.853
#> x3 7.056e-17 2.392e-16 2.95e-01 0.769
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.845e-15 on 100 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 2.722e+33 on 4 and 100 DF, p-value: < 2.2e-16
Fig. 3: Diagram of the path coefficient analysis of ‘dtsimp’ sample dataset.
> Note: when the user works with an external dataset:
Suppose the data are stored on a hard drive at the path Path/to/data in a file called mydata.xls. To read them in the RStudio console, follow these steps (the readxl package must be installed):
library(readxl)
dtraw <- read_excel("Path/to/data/mydata.xls")
The next dataset, called dtraw, is used in this part. It is also built into Path.Analysis and contains nine variables: one dependent variable called Y and eight independent variables labeled X1 through X8. The dataset comes from a population of the oil crop camelina, in which seed oil content (Y) and the fatty acids C18, C18.1, C18.2, C18.3, C20.0, C20.1, C20.2 and C22.1 (marked as X1-X8) were measured. Type the following commands in the RStudio console and run them:
data(dtraw)
dtraw <- as.data.frame(dtraw)  # so that row names can be assigned
rownames(dtraw) <- dtraw[, 1]  # use the genotype labels as row names
dtraw[, 1] <- NULL             # drop the label column
head(dtraw[1:4, ])
The output is as follows:
#> Y X1 X2 X3 X4 X5 X6 X7 X8
#> DH1 38.58 2.20 15.61 15.05 35.37 1.29 14.16 1.49 3.20
#> DH2 38.73 2.23 15.34 15.56 34.50 1.23 14.46 1.47 3.33
#> DH3 38.87 2.14 16.66 15.41 36.82 1.24 14.06 2.07 3.19
#> DH4 36.72 2.84 14.46 16.42 34.33 1.27 14.07 1.38 3.13
Correlograms can also be produced with the 'corr_plot' function of the 'metan' package (Olivoto and Dal'Col Lúcio 2020).
Running 'corr_plot' for the 'dtsimp' dataset:
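The call is not echoed above; assuming the metan package is installed, it would look something like:
library(metan)
corr_plot(dtsimp)  # pairwise scatterplots, distributions and correlations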
Fig. 4: Correlogram of dtsimp dataset, a built-in sample data.
Running the 'matdiag' function on the 'dtraw' dataset, ignoring the first column from the left (the genotype names moved to row names above):
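The call itself is not shown; following the pattern used for dtsimp, it would presumably be:
matdiag(dtraw, 1)  # Y (seed oil) as the dependent variable, X1-X8 as causal variables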
Fig. 5: Diagram of the path coefficient analysis of the dtraw dataset.
The most significant feature of the Path.Analysis package is drawing such diagrams, which are produced with the assistance of the DiagrammeR package.
Exercise caution when the Plot window in RStudio is too short. To resolve this, position the cursor at the top edge of the plot pane until a four-way arrow appears, then drag the top of the plot region upwards towards the variable list. If the figure-region problem was caused by a too-small pane, re-running the code without modification will then generate the expected graph. Also make sure that the outer default margins are correctly sized and that the R plot labels are not truncated; see https://www.programmingr.com/r-error-messages/r-figure-margins-too-large/.
When the dependent (response) variable is present but is not the first column from the left:
data(heart)
desc(heart, 2)
#> $`Descriptive statistics:`
#> Biking Heart.disease Smoking
#> Min. : 1.119 Min. : 0.5519 Min. : 0.5259
#> 1st Qu.:20.205 1st Qu.: 6.5137 1st Qu.: 8.2798
#> Median :35.824 Median :10.3853 Median :15.8146
#> Mean :37.788 Mean :10.1745 Mean :15.4350
#> 3rd Qu.:57.853 3rd Qu.:13.7240 3rd Qu.:22.5689
#> Max. :74.907 Max. :20.4535 Max. :29.9467
#>
#> $`Descriptive statistics:`
#> Biking Heart.disease Smoking
#> nbr.val 4.980000e+02 498.0000000 498.0000000
#> nbr.null 0.000000e+00 0.0000000 0.0000000
#> nbr.na 0.000000e+00 0.0000000 0.0000000
#> min 1.119154e+00 0.5518982 0.5258500
#> max 7.490711e+01 20.4534962 29.9467431
#> range 7.378796e+01 19.9015981 29.4208931
#> sum 1.881863e+04 5066.9199578 7686.6471384
#> median 3.582446e+01 10.3852547 15.8146139
#> mean 3.778841e+01 10.1745381 15.4350344
#> SE.mean 9.626099e-01 0.2048706 0.3714820
#> CI.mean.0.95 1.891286e+00 0.4025192 0.7298687
#> var 4.614556e+02 20.9020349 68.7234260
#> std.dev 2.148152e+01 4.5718743 8.2899593
#> coef.var 5.684684e-01 0.4493447 0.5370872
#>
#> $`Correlation coefficients:`
#> Biking Heart.disease Smoking
#> Biking 1.00000000 -0.9354555 0.01513618
#> Heart.disease -0.93545547 1.0000000 0.30913098
#> Smoking 0.01513618 0.3091310 1.00000000
# matdiag(heart, 2)
*Please note that the diagram is only produced automatically when there is a single dependent variable together with its related independent (causative) variables. In the dataset, the dependent variable (Y) should be the first variable from the left and the other variables should be ordered from left to right, as in dtsimp or dtraw. In other words, when the target is a simple path coefficient analysis, you can call the function as matdiag(dtsimp, 1). The package produces the textual outputs (without graphs) under any conditions, even when there are missing data.*
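If the dependent variable is not already the first column, the columns can simply be reordered before calling matdiag(); a sketch with the heart data (Heart.disease as the response; the reordering is plain base R, not a package feature):
heart2 <- heart[, c("Heart.disease", "Biking", "Smoking")]  # put the response first
matdiag(heart2, 1)  # column 1 is now the dependent variable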
As mentioned earlier, there are two types of path diagrams or methodologies: simple and multivariate. The multivariate form requires more steps and work, but the relationships between variables are the same and easy to understand. A sequential path diagram is more complex because it includes intervening variables that need to be accounted for. Let's consider a specific scenario with a dataset; for more information see (Arminian et al. 2008). Assume our data are stored on a hard drive at the path "~path_to_data/" and are named 'dtseq.xls'. To load this dataset into the RStudio console, follow these steps:
library(readxl)  # after installing the readxl package
dtseq <- read_excel("~path_to_data/dtseq.xls")
Methods like ‘Pearson’ or ‘Spearman’ can be used to analyze the correlation between variables. A correlogram is a tool that combines scatterplots and histograms, making it possible to examine the relationship between each pair of numeric variables in a matrix. The correlation is visually depicted in scatterplots, while the diagonal of the correlogram showcases the distribution of each variable using a histogram or density plot. (Source: https://python-graph-gallery.com/correlogram/) This analysis can be presented in the form of tables or matrices, which can be generated using the ‘PerformanceAnalytics’ and ‘metan’ packages.
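As a sketch of the correlogram route mentioned above (assuming PerformanceAnalytics is installed and that the first column of dtseq holds genotype labels rather than measurements):
library(PerformanceAnalytics)
chart.Correlation(dtseq[, -1], histogram = TRUE, method = "pearson")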
Step 1: YLD vs. FS, DFT, FW, FV:
library(metan)
data(dtseq)
dtseq1 <- dtseq[, c(2, 4, 3, 6, 5)]
head(dtseq1)
matdiag(dtseq1, 1)
#> # A tibble: 6 × 5
#> YLD FS DFT FW FV
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 410. 31.6 51.7 8.69 10.0
#> 2 84.7 38.5 52 8.16 8.33
#> 3 360. 25.3 54.7 7.48 5.65
#> 4 380. 33.9 49 10.0 9.33
#> 5 311. 24.7 50 9.19 10.0
#> 6 404. 19.1 52.5 9.97 9.87
Fig. 6: Diagram of dtseq1, derived from the dtseq data.
Network diagrams, also known as graphs, visually depict the connections between a group of entities. Each entity is represented as a node (vertex), and the connections between nodes are shown as links or edges (source: https://www.data-to-viz.com/graph/network.html). In R, network plots of the relationships between objects can be created with the 'corrr' package, which draws colored links, thin or thick depending on the strength of the correlation. Take a look at the graph illustrating the correlations for 'dtseq1'; it shows a larger number of variables, making it visually appealing and informative.
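A minimal corrr sketch (the actual call behind Fig. 7 is not shown; the min_cor threshold is an arbitrary choice):
library(corrr)
network_plot(correlate(dtseq1), min_cor = 0.2)  # only links with |r| >= 0.2 are drawn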
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'
Fig. 7: Network plot of the dtraw2.
Fig. 8: Heatmap of the dtraw2 dataset.
For plotting the heatmaps and clustering of observations and
variables simultaneously, we can use some packages developed such as
ComplexHeatmap
<10.1002/imt2.43> (Gu Z (2022).
“Complex Heatmap Visualization.” iMeta. doi:10.1002/imt2.43.), and pheatmap
packages. We here introduce the application of
ComplexHeatmap
package in clustering the
dtraw2
dataset measured on 35 genotypes of a plant with 9
traits.
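A minimal ComplexHeatmap (a Bioconductor package) sketch; dtraw is used here as a stand-in for dtraw2, whose construction is not shown, and standardizing the traits first is an optional choice:
library(ComplexHeatmap)
m <- scale(as.matrix(dtraw))  # column-standardize the traits
Heatmap(m, name = "z-score",
        cluster_rows = TRUE,     # cluster genotypes
        cluster_columns = TRUE)  # cluster traits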
Fig. 9: Complex heatmap of the dtraw2 dataset.
Step 2: FS vs. FLP, DFL:
#> # A tibble: 6 × 3
#> FS FLP DFL
#> <dbl> <dbl> <dbl>
#> 1 31.6 55.2 16.7
#> 2 38.5 55.7 18.3
#> 3 25.3 49.8 17.6
#> 4 33.9 59.1 17.9
#> 5 24.7 49.5 15.5
#> 6 19.1 67.3 17.3
Fig. 10: Diagram of the path coefficient analysis of the dtseq2 (part of dtseq)
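For completeness, a sketch of how such a step could be reproduced; the column selection by name is an assumption based on the variables shown above:
dtseq2 <- dtseq[, c("FS", "FLP", "DFL")]
head(dtseq2)
matdiag(dtseq2, 1)  # FS is the endogenous variable in this step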
Step 3: DFT vs. FLP, DFL:
#> # A tibble: 6 × 3
#> DFT FLP DFL
#> <dbl> <dbl> <dbl>
#> 1 51.7 55.2 16.7
#> 2 52 55.7 18.3
#> 3 54.7 49.8 17.6
#> 4 49 59.1 17.9
#> 5 50 49.5 15.5
#> 6 52.5 67.3 17.3
Fig. 11: Diagram of the path coefficient analysis of dtseq3 (part of dtseq)
Step 4: FW vs. FLP, DFL:
#> # A tibble: 6 × 3
#> FW FLP DFL
#> <dbl> <dbl> <dbl>
#> 1 8.69 55.2 16.7
#> 2 8.16 55.7 18.3
#> 3 7.48 49.8 17.6
#> 4 10.0 59.1 17.9
#> 5 9.19 49.5 15.5
#> 6 9.97 67.3 17.3
Fig. 12: Diagram of the path coefficient analysis of dtseq4 (part of dtseq)
Step 5: FV vs. FLP, DFL:
#> # A tibble: 6 × 3
#> FV FLP DFL
#> <dbl> <dbl> <dbl>
#> 1 10.0 55.2 16.7
#> 2 8.33 55.7 18.3
#> 3 5.65 49.8 17.6
#> 4 9.33 59.1 17.9
#> 5 10.0 49.5 15.5
#> 6 9.87 67.3 17.3
Fig. 13: Correlation plot of the dtseq5 (part of dtseq)
Step 6: DFL vs. FLP:
#> # A tibble: 6 × 2
#> DFL FLP
#> <dbl> <dbl>
#> 1 16.7 55.2
#> 2 18.3 55.7
#> 3 17.6 49.8
#> 4 17.9 59.1
#> 5 15.5 49.5
#> 6 17.3 67.3
Fig. 14: Network plot of the dtseq6 (part of dtseq).
A multivariate analysis of variance (MANOVA) is used to estimate the SSCP (sums of squares and cross-products) matrices. This requires the 'car' package, which provides the Anova() function, to be installed:
library(car)  # provides Anova() for the MANOVA tests
data(dtseqr)
dtseqr <- as.data.frame(dtseqr)
dtseqr[, 1] <- as.factor(dtseqr[, 1]) # Rep
dtseqr[, 2] <- as.factor(dtseqr[, 2]) # Genotypes
f <- lm(cbind(YLD, DFT, FS, FV, FW, DFL, FLP) ~ Rep + Genotypes, dtseqr)
summary(Anova(f)) # all MANOVA results
#>
#> Type II MANOVA Tests:
#>
#> Sum of squares and products for error:
#> YLD DFT FS FV FW DFL FLP
#> YLD 30872.750 1305.82650 -1284.15150 -402.8620 -420.6260 -213.0040 1418.89700
#> DFT 1305.827 677.12718 86.45078 -1.4006 50.1155 118.7334 69.47155
#> FS -1284.151 86.45078 254.45997 -23.1646 24.3031 43.1752 41.54635
#> FV -402.862 -1.40060 -23.16460 28.2584 6.8566 18.5540 -39.72960
#> FW -420.626 50.11550 24.30310 6.8566 24.0384 22.1664 -42.02800
#> DFL -213.004 118.73340 43.17520 18.5540 22.1664 56.4220 -35.56200
#> FLP 1418.897 69.47155 41.54635 -39.7296 -42.0280 -35.5620 271.59110
#>
#> ------------------------------------------
#>
#> Term: Rep
#>
#> Sum of squares and products for the hypothesis:
#> YLD DFT FS FV FW DFL FLP
#> YLD 327.6100 195.02750 124.79950 10.860 -53.2140 7.240 -336.8410
#> DFT 195.0275 116.10063 74.29363 6.465 -31.6785 4.310 -200.5228
#> FS 124.7995 74.29363 47.54103 4.137 -20.2713 2.758 -128.3159
#> FV 10.8600 6.46500 4.13700 0.360 -1.7640 0.240 -11.1660
#> FW -53.2140 -31.67850 -20.27130 -1.764 8.6436 -1.176 54.7134
#> DFL 7.2400 4.31000 2.75800 0.240 -1.1760 0.160 -7.4440
#> FLP -336.8410 -200.52275 -128.31595 -11.166 54.7134 -7.444 346.3321
#>
#> Multivariate Tests: Rep
#> Df test stat approx F num Df den Df Pr(>F)
#> Pillai 2 0.98891 1.25750 14 18 0.31903
#> Wilks 2 0.01109 9.70834 14 16 2.5259e-05 ***
#> Hotelling-Lawley 2 89.15113 44.57557 14 14 3.7404e-09 ***
#> Roy 2 89.15113 114.62289 7 9 4.5158e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------
#>
#> Term: Genotypes
#>
#> Sum of squares and products for the hypothesis:
#> YLD DFT FS FV FW DFL
#> YLD 697553.096 5033.04937 -11241.57825 4343.21250 2541.13650 -2906.41425
#> DFT 5033.049 116.01356 -91.18987 7.00395 -1.07955 -11.41207
#> FS -11241.578 -91.18987 926.14245 3.20730 -24.60870 134.06535
#> FV 4343.213 7.00395 3.20730 83.14020 37.53540 -13.49070
#> FW 2541.136 -1.07955 -24.60870 37.53540 23.42280 -5.21070
#> DFL -2906.414 -11.41207 134.06535 -13.49070 -5.21070 54.84165
#> FLP 5863.115 22.55888 -56.60295 126.96570 90.78030 72.84285
#> FLP
#> YLD 5863.11525
#> DFT 22.55888
#> FS -56.60295
#> FV 126.96570
#> FW 90.78030
#> DFL 72.84285
#> FLP 769.10145
#>
#> Multivariate Tests: Genotypes
#> Df test stat approx F num Df den Df Pr(>F)
#> Pillai 7 3.896 2.510 49 98.00000 5.5418e-05 ***
#> Wilks 7 0.000 16.018 49 45.03719 < 2.22e-16 ***
#> Hotelling-Lawley 7 4971.966 637.803 49 44.00000 < 2.22e-16 ***
#> Roy 7 4949.541 9899.081 7 14.00000 < 2.22e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Anova(f)$SSPE # print the full SSCP matrix of error on its own
Anova(f)$SSPE[4:5, 4:5] # SSCP matrix of error for the two final dependent variables, i.e., FV and FW
#> FV FW
#> FV 28.2584 6.8566
#> FW 6.8566 24.0384
When performing a multivariate path coefficient analysis, it is necessary to estimate the correlation coefficient between the residuals (here FV and FW are the final dependent variables) as follows. To do this, the error SSCP matrix from the MANOVA is used.
ru1u2 <- Anova(f)$SSPE[4, 5]/( sqrt(Anova(f)$SSPE[4, 4])*sqrt(Anova(f)$SSPE[5, 5]))
cat("\nCorrelation coefficient between residuals is:\n", ru1u2)
#>
#> Correlation coefficient between residuals is:
#> 0.2630766
After analyzing the sequential path step by step, it is time to draw the sequential path diagram, including a multivariate path diagram. To do this, one can use Graphviz through the DiagrammeR package. If a specific section of the sequential model is treated as a multivariate path, one can draw a multivariate path diagram (Arminian et al. 2008) and use the correlation coefficient between residuals estimated above, as follows:
Fig. 15: Sequential univariate path diagram. Note that residuals, estimated in Steps 1 to 6 above, can be added to each endogenous variable.
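To give a feel for how a diagram like Fig. 15 can be written by hand, here is a minimal Graphviz sketch via DiagrammeR::grViz() following the step structure above; layout and styling are arbitrary, and the edge labels (the estimated coefficients from Steps 1-6 and the residual correlation) are omitted:
library(DiagrammeR)
grViz("
digraph sequential_path {
  rankdir = LR
  node [shape = box]
  FLP -> DFL
  FLP -> FS;  DFL -> FS
  FLP -> DFT; DFL -> DFT
  FLP -> FW;  DFL -> FW
  FLP -> FV;  DFL -> FV
  FS -> YLD;  DFT -> YLD;  FW -> YLD;  FV -> YLD
}
")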
For the full list of color names, node and edge attributes, and other Graphviz features that can be used in the diagrams via DiagrammeR, see manuals and guides such as https://rich-iannone.github.io/DiagrammeR/articles/graphviz-mermaid.html (also: vignettes/graphviz-mermaid.Rmd).
Fig. 16: The sequential multivariate path Diagram
Notice: users may also consider the 'lavaan' package in R, the simple 'PATHSAS' code written by Cramer et al. (Cramer, Wehner, and Donaghy 1999), and the 'semPlot' and 'OpenMx' packages as further tools for conducting path analyses and SEM (structural equation modeling).