Explaining Models or Modelling Explanations

Counterfactual Explanations and Algorithmic Recourse for Trustworthy AI

Delft University of Technology

Arie van Deursen
Cynthia C. S. Liem

February 13, 2026

Background

Economist by training, now PhD candidate in Computer Science

How can we make opaque AI more trustworthy?

Explainable AI, Adversarial ML, Probabilistic ML

Core developer and maintainer of Taija (Trustworthy AI in Julia)

Scan for slides. Links to www.patalt.org.

Agenda

  • Intro: counterfactual explanations (CE) and algorithmic recourse (AR)
  • Unexpected Challenges: endogenous dynamics of AR
  • Paradigm Shift: explanations should be faithful first, plausible second
  • New Opportunities: teaching models plausible explanations through CE

Intro

A Toy Problem

Cats and dogs in two dimensions.

Traversing the Parameter Space

Model Training

Objective:

\[ \begin{aligned} \min_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \end{aligned} \]

Solution:

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \nabla_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \\ \textcolor{orange}{\theta^*}&=\theta_T \end{aligned} \]

Fitted model. Contour shows predicted probability \(y = 🐶\).
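In code, this is just a handful of lines. A minimal, self-contained Julia sketch of the training loop above; the toy data and hyperparameters are illustrative assumptions, not the settings behind the figures:

```julia
σ(z) = 1 / (1 + exp(-z))

# Toy 2D cats-vs-dogs data: rows are samples; y = 1 means 🐶.
X = [randn(50, 2) .- 1.0; randn(50, 2) .+ 1.0]
y = [zeros(50); ones(50)]

# θ_{t+1} = θ_t - η ∇θ yloss(M_θ(x), y), with cross-entropy yloss.
function train(X, y; η = 0.1, T = 1_000)
    θ = zeros(size(X, 2))
    for _ in 1:T
        p = σ.(X * θ)                            # predicted P(y = 🐶)
        θ -= η .* (X' * (p .- y)) ./ length(y)   # analytic gradient step
    end
    return θ
end

θ_star = train(X, y)   # θ* in the slides
```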

Traversing the Feature Space

Counterfactual Search

Objective:

\[ \begin{aligned} \min_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} }) + \lambda \text{reg} } \} \end{aligned} \]

Solution:

\[ \begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t - \nabla_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} }) + \lambda \text{reg}} \} \\ \textcolor{purple}{\mathbf{x}^*}&=\mathbf{x}_T \end{aligned} \]

Counterfactual explanation for what it takes to be a dog.
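The counterfactual search is the mirror image of training: freeze \(\theta^*\) and differentiate with respect to the features instead. A sketch continuing the toy example above (λ, η, T are again assumed values):

```julia
# x_{t+1} = x_t - η ∇x { yloss(M_θ*(x), y⁺) + λ reg }, with y⁺ = 🐶 and
# reg = ‖x - x₀‖², keeping the counterfactual close to the factual x₀.
function counterfactual(x0, θ; λ = 0.1, η = 0.1, T = 1_000)
    x = copy(x0)
    for _ in 1:T
        p = σ(x' * θ)                            # current P(y = 🐶)
        ∇x = (p - 1) .* θ .+ 2λ .* (x .- x0)     # analytic gradient
        x -= η .* ∇x
    end
    return x
end

x_star = counterfactual(X[1, :], θ_star)   # turn the first 🐱 into a 🐶
```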

Algorithmic Recourse

Provided a CE is valid, plausible, and actionable, it can be used to provide recourse to individuals negatively affected by model decisions.

“If your income had been x, then …”

Figure 1: Counterfactuals for random samples from the Give Me Some Credit dataset (Kaggle 2011). Features ‘age’ and ‘income’ are shown.
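In practice, this search (and many fancier generators) is implemented in CounterfactualExplanations.jl, the core Taija package. The following sketch is along the lines of the package's documented quickstart; function names are quoted from the docs as I recall them and may drift across versions, so treat it as indicative rather than definitive:

```julia
using CounterfactualExplanations
using TaijaData

# Toy data and a simple classifier:
data = CounterfactualData(load_linearly_separable()...)
M = fit_model(data, :Linear)

# Pick a factual currently assigned to class 1; target class 2:
target = 2
factual = 1
chosen = rand(findall(predict_label(M, data) .== factual))
x = select_factual(data, chosen)

# Wachter-style gradient-based counterfactual search:
generator = GenericGenerator()
ce = generate_counterfactual(x, target, data, M, generator)
```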

Unexpected Challenges

Hidden Cost of Implausibility

AR can introduce costly dynamics (Altmeyer, Angela, et al. 2023).

Figure 2: Illustration of external cost of individual recourse.

Insight: Implausible Explanations Are Costly

Mitigation Strategies

  • Incorporate hidden cost in reframed objective.
  • Even simple mitigation strategies can help.
  • Reducing hidden cost is (roughly) equivalent to ensuring plausibility.

Reframed Objective

\[ \begin{aligned} \mathbf{s}^\prime &= \arg \min_{\mathbf{s}^\prime \in \mathcal{S}} \{ {\text{yloss}(M(f(\mathbf{s}^\prime)),y^*)} \\ &+ \lambda_1 {\text{cost}(f(\mathbf{s}^\prime))} + \lambda_2 {\text{extcost}(f(\mathbf{s}^\prime))} \} \end{aligned} \]
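As a conceptual sketch of the reframed search, we can continue the toy example, taking \(f\) to be the identity so that we search directly in feature space. The external-cost term below is a deliberately crude stand-in that pulls the counterfactual towards the target-class mean as a plausibility proxy; \(\lambda_1, \lambda_2\) are assumed hyperparameters:

```julia
using Statistics

function recourse(x0, θ, μplus; λ1 = 0.1, λ2 = 0.1, η = 0.1, T = 1_000)
    x = copy(x0)
    for _ in 1:T
        ∇x = (σ(x' * θ) - 1) .* θ     # ∇x yloss(M(x), y⁺)
        ∇x += 2λ1 .* (x .- x0)        # ∇x cost: stay close to the factual
        ∇x += 2λ2 .* (x .- μplus)     # ∇x extcost: crude plausibility proxy
        x -= η .* ∇x
    end
    return x
end

μplus = vec(mean(X[y .== 1, :], dims = 1))   # target-class prototype
x_rec = recourse(X[1, :], θ_star, μplus)
```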

Mitigation strategies to tackle hidden costs of AR.

Paradigm Shift

Plausibility at all cost?

All of these counterfactuals are valid explanations for the model’s prediction.

Pick your poison …

Figure 3: Turning a 9 into a 7: Counterfactual explanations for an image classifier produced using Wachter (Wachter, Mittelstadt, and Russell 2017), Schut (Schut et al. 2021) and REVISE (Joshi et al. 2019).

Faithful First, Plausible Second

Counterfactuals as plausible as the model permits (Altmeyer, Farmanbar, et al. 2023).

Figure 4: KDE for training data.
Figure 5: KDE for model posterior.
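The distinction can be made operational. Loosely paraphrasing the evaluation metrics in Altmeyer, Farmanbar, et al. (2023), and not quoting their exact definitions: implausibility measures the distance of a counterfactual to observed data from the target class, while unfaithfulness measures its distance to samples drawn from the model posterior (the two KDEs in Figures 4 and 5):

\[ \begin{aligned} \text{implausibility}(\mathbf{x}^*) &\approx \text{dist}(\mathbf{x}^*, \mathcal{X}_{y^+}) \\ \text{unfaithfulness}(\mathbf{x}^*) &\approx \text{dist}(\mathbf{x}^*, \hat{\mathcal{X}}_{\theta^*, y^+}) \end{aligned} \]

where \(\mathcal{X}_{y^+}\) are observed target-class samples and \(\hat{\mathcal{X}}_{\theta^*, y^+}\) are samples generated from the model itself.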

Faithful Counterfactuals

Figure 6: Turning a 9 into a 7. ECCCo applied to MLP (a), Ensemble (b), JEM (c), JEM Ensemble (d).

Insight: Faithfulness Facilitates Plausibility

Figure 7: Results for different generators (from 3 to 5).

New Opportunities

Counterfactual Training: Method

Figure 8: (a) conventional training, all features mutable; (b) CT, all features mutable; (c) conventional training, age immutable; (d) CT, age immutable.
  1. Contrast faithful CE with data.
  2. Enforce actionability constraints.
  3. Bonus: use nascent CE as adversarial examples (AE); see the sketch after this list.
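To make the steps concrete, here is a deliberately simplified, fully assumed sketch of the principle behind CT, reusing the toy helpers defined earlier (σ, counterfactual). It only illustrates the "nascent CE as AE" idea; the actual algorithm, losses, and actionability guarantees are in the paper:

```julia
using LinearAlgebra, Statistics

function ct_train(X, y; η = 0.1, T = 500, τ = 2.0)
    θ = zeros(size(X, 2))
    μplus = vec(mean(X[y .== 1, :], dims = 1))   # target-class prototype
    for _ in 1:T
        # (1) generate a nascent counterfactual for the current model
        x0 = X[findfirst(==(0.0), y), :]
        xce = counterfactual(x0, θ; T = 50)
        # (2) if it is implausible (far from target-class data), treat it
        # as an adversarial example: keep the factual label
        Xt, yt = X, y
        if norm(xce .- μplus) > τ
            Xt, yt = vcat(X, xce'), vcat(y, 0.0)
        end
        # (3) one predictive gradient step on the (possibly) augmented data
        p = σ.(Xt * θ)
        θ -= η .* (Xt' * (p .- yt)) ./ length(yt)
    end
    return θ
end
```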

Insight: We can hold models accountable for plausible explanations (Altmeyer et al. 2026).

Counterfactual Training: Results

  • Models trained with CT learn more plausible and (provably) actionable explanations.
  • Predictive performance does not suffer, robust performance improves.

Plausibility: counterfactuals for a randomly selected factual from class “0” (in blue), generated with ECCCo for the baseline model (BL, top row) vs. the CT-trained model (bottom row). CT produces more plausible counterfactuals than BL.

Actionability: sample visual explanations (integrated gradients) for the MNIST dataset. Mutability constraints are imposed on the top and bottom five rows of pixels. The CT-trained model (bottom) is less sensitive to these protected features.

If we still have time …

Spurious Sparks of AGI

We challenge the idea that the finding of meaningful patterns in the latent spaces of large models is indicative of AGI (Altmeyer et al. 2024).

Figure 9: Inflation of prices or birds? It doesn’t matter!

Taija

  • Work presented @ JuliaCon 2022, 2023, 2024.
  • Google Summer of Code and Julia Season of Contributions 2024.
  • Total of three software projects @ TU Delft.

Trustworthy AI in Julia: github.com/JuliaTrustworthyAI

References

Altmeyer, Patrick, Giovan Angela, Aleksander Buszydlik, Karol Dobiczek, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Endogenous Macrodynamics in Algorithmic Recourse.” In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 418–31. IEEE. https://doi.org/10.1109/satml54575.2023.00036.
Altmeyer, Patrick, Aleksander Buszydlik, Arie van Deursen, and Cynthia C. S. Liem. 2026. “Counterfactual Training: Teaching Models Plausible and Actionable Explanations.” In 2026 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE. Forthcoming.
Altmeyer, Patrick, Andrew M Demetriou, Antony Bartlett, and Cynthia C. S. Liem. 2024. “Position: Stop Making Unscientific AGI Performance Claims.” In International Conference on Machine Learning, 1222–42. PMLR. https://proceedings.mlr.press/v235/altmeyer24a.html.
Altmeyer, Patrick, Mojtaba Farmanbar, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Faithful Model Explanations Through Energy-Constrained Conformal Counterfactuals.” https://arxiv.org/abs/2312.10648.
Joshi, Shalmali, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. 2019. “Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems.” https://arxiv.org/abs/1907.09615.
Kaggle. 2011. “Give Me Some Credit: Improve on the State of the Art in Credit Scoring by Predicting the Probability That Somebody Will Experience Financial Distress in the Next Two Years.” Kaggle. https://www.kaggle.com/c/GiveMeSomeCredit.
Schut, Lisa, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. 2021. “Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties.” In International Conference on Artificial Intelligence and Statistics, 1756–64. PMLR.
Wachter, Sandra, Brent Mittelstadt, and Chris Russell. 2017. “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR.” Harv. JL & Tech. 31: 841. https://doi.org/10.2139/ssrn.3063289.