The area under the precision-recall curve (AUPRC) is another performance metric that you can use to evaluate a classification model. If your model achieves a perfect AUPRC, it means your model can find all of the positive samples (perfect recall) without accidentally marking any negative samples as positive (perfect precision). It’s important to consider both recall and precision together, because you could achieve perfect recall (but bad precision) using a naive classifier that marked everything positive, and you could avoid every false positive (but achieve zero recall) using a naive classifier that marked everything negative.
How to interpret AUPRC
Figure: PR Curves, from scikit-learn
The figure above shows some example PR curves. The AUPRC for a given curve is simply the area beneath it.
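AUPRC ranges from 0 (worst) to 1.0 (best). The baseline differs from AUROC, though: a classifier that guesses at random gets an AUROC of about 0.5, whereas its AUPRC is about the fraction of positive examples in the data, so what counts as a “good” AUPRC depends on how rare the positives are. Note also that a high AUROC does not guarantee a high AUPRC: it’s entirely possible to have a high AUROC (e.g. 0.8) and a low AUPRC (e.g. 0.15) for the same classifier on the same data.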
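To make the gap between AUROC and AUPRC concrete, here is a minimal sketch in plain Python. The data set is invented for illustration: 1,000 examples with distinct scores, of which only 5 are positive, ranked 20th, 40th, 60th, 80th, and 100th by score. The helper names `auroc` and `average_precision` are hypothetical, and the area under the PR curve is approximated with average precision (the mean of the precision measured at each positive example).

```python
def average_precision(labels, scores):
    """Mean of the precision at the rank of each positive (assumes distinct scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


def auroc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: the example at rank r gets a score that decreases with r.
positive_ranks = {20, 40, 60, 80, 100}
labels = [1 if rank in positive_ranks else 0 for rank in range(1, 1001)]
scores = [1 - rank / 1001 for rank in range(1, 1001)]

roc_area = auroc(labels, scores)
pr_area = average_precision(labels, scores)
print(f"AUROC = {roc_area:.3f}, AUPRC = {pr_area:.3f}")
```

Every positive outranks at least 90% of the negatives, so the AUROC comes out at roughly 0.94. But each positive is also buried under a handful of higher-scoring negatives, so precision never rises above 0.05, and the AUPRC is 0.05.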
How to calculate AUPRC
The AUPRC is calculated as the area under the PR curve. A PR curve shows the trade-off between precision and recall across different decision thresholds. Note that “recall” is another name for the true positive rate (TPR). Thus, AUPRC and AUROC both make use of the TPR. For a review of TPR, precision, and decision thresholds, see Measuring Performance: The Confusion Matrix. Similar to plotted ROC curves, in a plotted PR curve the decision thresholds are implicit and are not shown as a separate axis.
The x-axis of a PR curve is the recall and the y-axis is the precision. (This is in contrast to ROC curves, where the y-axis is the recall and the x-axis is FPR.)
- A PR curve starts at the upper left corner, i.e. the point (recall = 0, precision = 1), which corresponds to a decision threshold of 1 (where every example is classified as negative, because all predicted probabilities are less than 1). Note that the ground truth label (positive or negative) of the example with the largest output value has a big effect on the appearance of the PR curve, and that estimates of precision for recall near zero tend to have high variance.
- A PR curve ends at the lower right, where recall = 1 and precision is low (it equals the fraction of positive examples in the data). This corresponds to a decision threshold of 0 (where every example is classified as positive, because all predicted probabilities are greater than 0).
- The points in between, which create the PR curve, are obtained by calculating the precision and recall for different decision thresholds between 1 and 0. For a rough “angular” curve you would use only a few decision thresholds. For a smoother curve, you would use many decision thresholds.
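The three bullets above can be sketched in a few lines of plain Python. This is only an illustration: the helper `pr_point`, the six toy labels and scores, and the five thresholds are all made up, and an example is called positive when its score is greater than or equal to the threshold.

```python
def pr_point(labels, scores, threshold):
    """Precision and recall when examples with score >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    recall = tp / (tp + fn)
    # With no predicted positives, precision is 0/0; by convention the
    # starting point of the curve is drawn at precision = 1.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    return precision, recall


labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Only five thresholds, so this would plot as a rough, "angular" curve.
thresholds = [1.0, 0.75, 0.5, 0.25, 0.0]
curve = [pr_point(labels, scores, t) for t in thresholds]

for t, (precision, recall) in zip(thresholds, curve):
    print(f"threshold={t:.2f}  recall={recall:.2f}  precision={precision:.2f}")
```

The first point is (recall = 0, precision = 1) and the last is (recall = 1, precision = 0.5), where 0.5 is simply the fraction of positives in this toy data. Using more thresholds fills in more points between the two ends.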
The steps for calculating test set AUPRC are the same as the steps for calculating AUROC (see this post), except you calculate precision and recall at each threshold (instead of FPR and recall at each threshold). There are multiple methods for subsequent calculation of the area under the PR curve, including the lower trapezoid estimator, the average precision, and the interpolated median estimator. In Python, I like to use average precision:
import sklearn.metrics
auprc = sklearn.metrics.average_precision_score(true_labels, predicted_probs)
For this function you just have to provide a vector of the ground truth labels (true_labels) and a vector of the corresponding predicted probabilities from your model (predicted_probs). scikit-learn will use this information to calculate the average precision for you.
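If you are curious what average precision adds up, here is a rough from-scratch sketch. It follows the definition given in the scikit-learn documentation, a sum of precision values weighted by the increase in recall from one threshold to the next, but it is not scikit-learn’s implementation. It assumes binary 0/1 labels and no tied scores, and the toy inputs are made up.

```python
def average_precision(true_labels, predicted_probs):
    # Visit examples from the highest predicted probability to the lowest,
    # which is the same as lowering the decision threshold one example at a time.
    ranked = sorted(zip(predicted_probs, true_labels),
                    key=lambda pair: pair[0], reverse=True)
    total_positives = sum(true_labels)

    area = 0.0
    true_positives = 0
    previous_recall = 0.0
    for n_predicted_positive, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / n_predicted_positive
        recall = true_positives / total_positives
        # Recall only moves when a positive is reached, so negatives add zero width.
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


true_labels = [1, 0, 1, 1, 0, 0]
predicted_probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auprc = average_precision(true_labels, predicted_probs)
print(round(auprc, 4))
```

For this toy input the three positives are reached at precision 1, 2/3, and 3/4, each contributing a recall step of 1/3, so the result is 29/36, or about 0.81.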
Additional Section Reference: Boyd et al., “Area Under the Precision-Recall Curve: Point Estimates and Confidence Intervals.”
When to use AUPRC
One interesting feature of PR curves is that they do not use true negatives at all:
- Recall = TPR = True Positives / (True Positives + False Negatives). Recall can be thought of as the ability of the classifier to correctly mark all positive examples as positive.
- Precision = True Positives / (True Positives + False Positives). Precision can be thought of as the ability of the classifier not to wrongly label a negative sample as positive (ref).
Because PR curves don’t use true negatives anywhere, AUPRC is a particularly useful metric for classifiers built for data with many true negatives. AUPRC won’t be “swamped” by these true negatives and can give you a clearer perspective of your classifier’s utility. In medicine, it’s quite common to work with data sets where positive examples are rare and the vast majority of examples are negative.
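A quick way to see this is to hold the true positives, false positives, and false negatives fixed and change only the number of true negatives. The counts below are invented for illustration, and `rates` is a hypothetical helper.

```python
def rates(tp, fp, fn, tn):
    """Precision, recall, and false positive rate from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return precision, recall, fpr


# Same classifier behavior on the positives, but 100x more data overall,
# with all of the extra examples being correctly rejected negatives.
small = rates(tp=80, fp=40, fn=20, tn=860)
large = rates(tp=80, fp=40, fn=20, tn=99_860)

for name, (precision, recall, fpr) in [("small", small), ("large", large)]:
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} FPR={fpr:.4f}")
```

Precision and recall are identical in both cases, while the FPR (the x-axis of a ROC curve) shrinks from about 0.044 to about 0.0004 just because more true negatives were added.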
For a visualization of two algorithms compared using AUROC and AUPRC, see the top of page 2 of this paper. You can appreciate that the ROC curves of the two compared algorithms look quite similar, but their PR curves are more separated.
What if my model predicts more than two classes?
You can pretend your task is composed of many different binary classification tasks, and calculate AUPRC for Class A vs. Not Class A, Class B vs. Not Class B, Class C vs. Not Class C, and so on.
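As a minimal sketch of this one-vs-rest idea, the snippet below computes one AUPRC per class, using average precision as the area estimate. The class names, the six toy examples, and the `average_precision` helper are all assumptions made for illustration, and the helper assumes there are no tied scores within a class.

```python
def average_precision(labels, scores):
    """Mean of the precision at the rank of each positive (assumes distinct scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


classes = ["A", "B", "C"]
true_classes = ["A", "B", "C", "A", "B", "C"]
# One row per example, one predicted probability per class (columns A, B, C).
predicted_probs = [
    [0.70, 0.18, 0.12],
    [0.10, 0.60, 0.30],
    [0.20, 0.30, 0.50],
    [0.40, 0.50, 0.10],
    [0.30, 0.45, 0.25],
    [0.05, 0.15, 0.80],
]

per_class_auprc = {}
for k, name in enumerate(classes):
    # "Class k vs. Not Class k": relabel as binary and use column k as the score.
    binary_labels = [1 if c == name else 0 for c in true_classes]
    class_scores = [row[k] for row in predicted_probs]
    per_class_auprc[name] = average_precision(binary_labels, class_scores)

print(per_class_auprc)
```

Here classes A and C are ranked perfectly, while class B scores lower because one Class A example receives a higher Class B probability than one of the true Class B examples.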
The end! Happy AUPRC-ing.