library(geodl)
Accuracy assessment is an important component of the modeling process. Specifically, it is important to assess your model against withheld data. Assessing the model relative to the training samples can be misleading due to the issue of overfitting. As a result, using a withheld, randomized, unbiased test set to assess the final model is important for quantifying how well the model generalizes to new data, which is generally the point of creating a model. Before we begin, here are a few notes on key terminology:
In this example, we are primarily interested in the test set. Once a final model is generated, it can be used to predict to a test set. The test set labels can then be compared to the predictions to generate a confusion matrix and associated assessment metrics.
An example confusion matrix is shown below. geodl uses the confusion matrix configuration standard within the field of remote sensing, where the columns represent the reference labels and the rows represent the predictions. In the example confusion matrix, 50 samples were examples of Class A and were correctly predicted as Class A. 8 samples were examples of Class A but were incorrectly predicted as Class B. 10 samples were examples of Class B but were incorrectly labeled as Class A. Relative to Class A, the 8 samples that were mislabeled as Class B represent omission errors: they were incorrectly omitted from Class A. In contrast, the 10 reference samples that were from Class B but incorrectly predicted as Class A are examples of commission error relative to Class A: they were incorrectly included in Class A.
In short, the confusion matrix describes not just the overall amount of error, but differentiates the types of errors. This allows analysts and users to understand which classes were most commonly confused or which classes were most difficult to map or differentiate.
\[ \begin{array}{c|ccc} & \text{Reference A} & \text{Reference B} & \text{Reference C} \\ \hline \text{Prediction A} & 50 & 10 & 5 \\ \text{Prediction B} & 8 & 45 & 7 \\ \text{Prediction C} & 2 & 5 & 60 \\ \end{array} \]
Overall accuracy represents the percentage or proportion of the total samples that were correctly predicted.
\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
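As a quick check of this formula, the overall accuracy of the example three-class confusion matrix above can be computed in base R (the counts below are copied from that matrix):

```r
# Example confusion matrix from above:
# rows = predictions, columns = reference labels.
cm <- matrix(c(50, 10,  5,
                8, 45,  7,
                2,  5, 60),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("A", "B", "C"),
                             Reference = c("A", "B", "C")))

# Correct predictions fall on the diagonal.
oa <- sum(diag(cm)) / sum(cm)
round(oa, 4)
# 0.8073
```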
Outside of an aggregated overall accuracy, it can be useful to report class-level metrics. In remote sensing 1 - omission error for a class is generally termed producer’s accuracy while 1 - commission error is termed user’s accuracy. When the confusion matrix is configured such that the reference labels define the columns and the predictions define the rows, producer’s accuracy is calculated as the number correct for the class divided by the column total while user’s accuracy is calculated as the number correct for the class divided by the associated row total.
\[ \text{Producer's Accuracy (PA)}_i = \frac{\text{Number of Correctly Classified Samples of Class } i}{\text{Total Number of Reference Samples of Class } i} \]

\[ \text{User's Accuracy (UA)}_i = \frac{\text{Number of Correctly Classified Samples of Class } i}{\text{Total Number of Samples Classified as Class } i} \]
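Applied to the example matrix above, these definitions reduce to dividing the diagonal by the column totals (producer's accuracy) or the row totals (user's accuracy). A minimal base R sketch:

```r
# Same example matrix: rows = predictions, columns = reference.
cm <- matrix(c(50, 10,  5,
                8, 45,  7,
                2,  5, 60),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("A", "B", "C"),
                             Reference = c("A", "B", "C")))

pa <- diag(cm) / colSums(cm)  # producer's accuracy: 1 - omission error
ua <- diag(cm) / rowSums(cm)  # user's accuracy: 1 - commission error
round(pa, 4)
# A: 0.8333, B: 0.7500, C: 0.8333
round(ua, 4)
# A: 0.7692, B: 0.7500, C: 0.8955
```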
For a binary classification problem where one class is the positive case or case of interest and the other class is the background or negative case, it is common to use different terminology. The confusion matrix below represents a binary confusion matrix. Here is an explanation of the associated terminology:
\[ \begin{array}{c|cc} & \text{Reference Positive} & \text{Reference Negative} \\ \hline \text{Prediction Positive} & TP & FP \\ \text{Prediction Negative} & FN & TN \\ \end{array} \]
From the binary confusion matrix, we can calculate overall accuracy as stated above. Overall accuracy can also be defined relative to TP, TN, FN, and FP counts as follows:
\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \]
At the class-level, recall for each class can be calculated using the TP and FN counts. Recall is equivalent to class-level producer’s accuracy and quantifies 1 - omission error relative to the positive case.
\[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]
Class-level precision quantifies 1 - commission error and is equivalent to user’s accuracy for the positive case. It is calculated using the TP and FP counts.
\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \]
Precision and recall can be combined into a single class-level metric, the F1-score, which is the harmonic mean of precision and recall. It can be stated relative to precision and recall or relative to TP, FP, and FN counts.
\[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
\[ F1 = 2 \cdot \frac{TP}{2TP + FP + FN} \]
For the negative or background class, specificity represents 1 - omission error while negative predictive value (NPV) represents 1 - commission error.
\[ \text{Specificity} = \frac{TN}{TN + FP} \] \[ \text{NPV} = \frac{TN}{TN + FN} \]
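All of the binary metrics above can be computed directly from the four counts. The sketch below uses illustrative TP/TN/FP/FN values; the variable names are arbitrary, not part of geodl:

```r
# Illustrative binary counts (positive = class of interest).
TP <- 158; FP <- 2; FN <- 20; TN <- 4820

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
recall      <- TP / (TP + FN)        # producer's accuracy, positive class
precision   <- TP / (TP + FP)        # user's accuracy, positive class
specificity <- TN / (TN + FP)        # 1 - omission error, negative class
npv         <- TN / (TN + FN)        # 1 - commission error, negative class
f1          <- 2 * precision * recall / (precision + recall)

round(c(accuracy, recall, precision, specificity, npv, f1), 4)
# 0.9956 0.8876 0.9875 0.9996 0.9959 0.9349
```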
Lastly, it might be of interest to aggregate class-level metrics. There are three general ways to do this:

* macro-averaging: calculate the metric separately for each class and then take the average such that each class is equally weighted in the aggregated metric.
* micro-averaging: aggregate TP, TN, FP, and FN counts and calculate a single metric such that more abundant classes have a larger weight in the final calculation.
* weighted macro-averaging: calculate a macro-average with user-specified class weights such that the classes are not equally weighted.
For a multiclass problem, micro-averaged user’s accuracy (precision) and producer’s accuracy (recall) are equivalent to each other and also equivalent to overall accuracy and the micro-averaged F1-score. So, there is no need to calculate micro-averaged metrics if overall accuracy is reported. Instead, it makes more sense to report overall accuracy, macro-averaged class metrics, and non-aggregated class metrics. This is the method used within geodl.
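The equivalence between micro-averaged metrics and overall accuracy can be verified numerically with the example matrix from earlier in this document:

```r
# Example matrix again: rows = predictions, columns = reference.
cm <- matrix(c(50, 10,  5,
                8, 45,  7,
                2,  5, 60),
             nrow = 3, byrow = TRUE)

# Macro-averaging: per-class metric, then an unweighted mean.
macroPA <- mean(diag(cm) / colSums(cm))
macroUA <- mean(diag(cm) / rowSums(cm))

# Micro-averaging: pool counts over classes. Each off-diagonal cell
# is a FP for its row and a FN for its column, so pooled precision
# and pooled recall both reduce to sum(diag(cm)) / sum(cm), i.e. OA.
microPA <- sum(diag(cm)) / sum(cm)
oa      <- sum(diag(cm)) / sum(cm)
all.equal(microPA, oa)
# TRUE
```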
In this first example, geodl’s assessPnts() function is used to calculate assessment metrics for a multiclass classification from a table or at point locations. The “ref” column represents the reference labels while the “pred” column represents the predictions. The mappings parameter allows for providing more meaningful class names and is especially useful when classes are represented using numeric codes.
For a multiclass assessment, the following are returned:

* $Classes: class names
* $referenceCounts: count of samples per class in the reference data
* $predictionCounts: count of samples per class in the predictions
* $confusionMatrix: confusion matrix
* $aggMetrics: aggregated assessment metrics (OA = overall accuracy, macroF1 = macro-averaged F1-score, macroPA = macro-averaged producer's accuracy or recall, and macroUA = macro-averaged user's accuracy or precision)
* $userAccuracies: class-level user's accuracies or precisions
* $producerAccuracies: class-level producer's accuracies or recalls
* $F1Scores: class-level F1-scores
```r
mcIn <- readr::read_csv("C:/myFiles/data/tables/multiClassExample.csv")

myMetrics <- assessPnts(reference=mcIn$ref,
                        predicted=mcIn$pred,
                        multiclass=TRUE,
                        mappings=c("Barren",
                                   "Forest",
                                   "Impervous",
                                   "Low Vegetation",
                                   "Mixed Developed",
                                   "Water"))
print(myMetrics)
```
```
$Classes
[1] "Barren"          "Forest"          "Impervous"       "Low Vegetation" 
[5] "Mixed Developed" "Water"          

$referenceCounts
         Barren          Forest       Impervous  Low Vegetation Mixed Developed           Water 
            163           20807             426            3182             520             200 

$predictionCounts
         Barren          Forest       Impervous  Low Vegetation Mixed Developed           Water 
            194           21440             281            2733             484             166 

$confusionMatrix
                 Reference
Predicted         Barren Forest Impervous Low Vegetation Mixed Developed Water
  Barren              75      7        59             46               1     6
  Forest              13  20585        62            617             142    21
  Impervous           10      8       196             33              22    12
  Low Vegetation      63    138        34           2413              84     1
  Mixed Developed      1     64        75             72             270     2
  Water                1      5         0              1               1   158

$aggMetrics
      OA macroF1 macroPA macroUA
1 0.9367  0.6991  0.6629  0.7395

$userAccuracies
         Barren          Forest       Impervous  Low Vegetation Mixed Developed           Water 
         0.3866          0.9601          0.6975          0.8829          0.5579          0.9518 

$producerAccuracies
         Barren          Forest       Impervous  Low Vegetation Mixed Developed           Water 
         0.4601          0.9893          0.4601          0.7583          0.5192          0.7900 

$f1Scores
         Barren          Forest       Impervous  Low Vegetation Mixed Developed           Water 
         0.4202          0.9745          0.5545          0.8159          0.5378          0.8634 
```
A binary classification can also be assessed using the assessPnts() function and a table or point locations. For a binary classification, the multiclass parameter should be set to FALSE. In the binary case, the $Classes, $referenceCounts, $predictionCounts, and $confusionMatrix objects are also returned; however, the $aggMetrics object is replaced with $Mets, which stores the following metrics: overall accuracy, recall, precision, specificity, negative predictive value (NPV), and F1-score. For binary cases, the second class is assumed to be the positive case.
```r
bIn <- readr::read_csv("C:/myFiles/data/tables/binaryExample.csv")

myMetrics <- assessPnts(reference=bIn$ref,
                        predicted=bIn$pred,
                        multiclass=FALSE,
                        mappings=c("Not Mine", "Mine"))
print(myMetrics)
```
```
$Classes
[1] "Not Mine" "Mine"    

$referenceCounts
Negative Positive 
    4822      178 

$predictionCounts
Negative Positive 
    4840      160 

$ConfusionMatrix
          Reference
Predicted  Negative Positive
  Negative     4820       20
  Positive        2      158

$Mets
         OA Recall Precision Specificity    NPV F1Score
Mine 0.9956 0.8876    0.9875      0.9996 0.9959  0.9349
```
Before using the assessPnts() function, you may need to extract predictions into a table. This example demonstrates how to extract reference and prediction numeric codes from raster grids at point locations. Note that it is important to make sure all data layers use the same projection or coordinate reference system. The extract() function from the terra package can be used to extract raster cell values at point locations.
Once data are extracted, the assessPnts() tool can be used with the resulting table. It may be useful to recode the class numeric codes to more meaningful names beforehand.
```r
pntsIn <- terra::vect("C:/myFiles/data/topoResult/topoPnts.shp")
refG <- terra::rast("C:/myFiles/data/topoResult/topoRef.tif")
predG <- terra::rast("C:/myFiles/data/topoResult/topoPred.tif")

pntsIn2 <- terra::project(pntsIn, terra::crs(refG))
refIsect <- terra::extract(refG, pntsIn2)
predIsect <- terra::extract(predG, pntsIn2)

resultsIn <- data.frame(ref=as.factor(refIsect$topoRef),
                        pred=as.factor(predIsect$topoPred))
resultsIn$ref <- forcats::fct_recode(resultsIn$ref,
                                     "Not Mine" = "0",
                                     "Mine" = "1")
resultsIn$pred <- forcats::fct_recode(resultsIn$pred,
                                      "Not Mine" = "0",
                                      "Mine" = "1")
```
```r
myMetrics <- assessPnts(reference=resultsIn$ref,
                        predicted=resultsIn$pred,
                        multiclass=FALSE,
                        mappings=c("Not Mine", "Mine"))
print(myMetrics)
```
```
$Classes
[1] "Not Mine" "Mine"    

$referenceCounts
Negative Positive 
    4822      178 

$predictionCounts
Negative Positive 
    4840      160 

$ConfusionMatrix
          Reference
Predicted  Negative Positive
  Negative     4820       20
  Positive        2      158

$Mets
         OA Recall Precision Specificity    NPV F1Score
Mine 0.9956 0.8876    0.9875      0.9996 0.9959  0.9349
```
The assessRaster() function allows for calculating assessment metrics from reference and prediction categorical raster grids as opposed to point locations or tables. Note that the grids being compared should have the same spatial extent, coordinate reference system, and number of rows and columns of cells.
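Before comparing grids, it can be worth verifying their alignment explicitly. A minimal sketch using terra::compareGeom(), shown here with small synthetic rasters in place of real data files:

```r
library(terra)

# Synthetic stand-ins for a reference and a prediction grid with
# matching extent, CRS, and dimensions.
refG2 <- rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10, ymin = 0, ymax = 10)
predG <- rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10, ymin = 0, ymax = 10)

# compareGeom() returns TRUE when the geometries match and, by
# default, raises an error when they do not.
compareGeom(refG2, predG, crs = TRUE, ext = TRUE, rowcol = TRUE)
```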
```r
refG <- terra::rast("C:/myFiles/data/topoResult/topoRef.tif")
predG <- terra::rast("C:/myFiles/data/topoResult/topoPred.tif")

refG2 <- terra::crop(terra::project(refG, predG), predG)

myMetrics <- assessRaster(reference = refG2,
                          predicted = predG,
                          multiclass = FALSE,
                          mappings = c("Not Mine", "Mine"))
print(myMetrics)
```
```
$Classes
[1] "Not Mine" "Mine"    

$referenceCounts
Negative Positive 
36022015  1301194 

$predictionCounts
Negative Positive 
36146932  1176277 

$ConfusionMatrix
          Reference
Predicted   Negative Positive
  Negative 35994704   152228
  Positive    27311  1148966

$Mets
         OA Recall Precision Specificity    NPV F1Score
Mine 0.9952  0.883    0.9768      0.9992 0.9958  0.9275
```