Explaining Models or Modelling Explanations

Challenging Existing Paradigms in Trustworthy AI

Delft University of Technology

Arie van Deursen
Cynthia C. S. Liem

May 8, 2025

Background

Economist by training, now PhD candidate in Computer Science

How can we make opaque AI more trustworthy?

Explainable AI, Adversarial ML, Probabilistic ML

Maintainer of Taija (trustworthy AI in Julia)

Scan for slides. Links to www.patalt.org.


Agenda

  • What are counterfactual explanations (CE) and algorithmic recourse (AR) and why are they useful?
  • What dynamics are generated when off-the-shelf solutions to CE and AR are implemented in practice?
  • Can we generate plausible counterfactuals relying only on the opaque model itself?
  • How can we leverage counterfactuals during training to build more trustworthy models?

Background

CE in Five Slides

Cats and dogs in two dimensions.

CE in Five Slides

Model Training

Objective:

\[ \begin{aligned} \min_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \end{aligned} \]

Solution:

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \nabla_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \\ \textcolor{orange}{\theta^*}&=\theta_T \end{aligned} \]
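To make the two-step recipe concrete, here is a minimal Julia sketch of the training step on toy two-dimensional cat/dog data. All names and data are illustrative assumptions, not the actual experiment behind the slides:

```julia
# Toy setup: fit a logistic classifier M_θ(x) = σ(w'x + b) to
# two-dimensional cat/dog data by gradient descent.
using Random

σ(z) = 1 / (1 + exp(-z))

function train(X, y; η = 0.1, T = 500)
    w, b = zeros(length(X[1])), 0.0
    for _ in 1:T                        # θ_{t+1} = θ_t − η ∇_θ yloss
        for (xᵢ, yᵢ) in zip(X, y)
            err = σ(w' * xᵢ + b) - yᵢ   # ∇ of cross-entropy w.r.t. the logit
            w -= η * err * xᵢ           # ∇_w yloss = err * xᵢ
            b -= η * err                # ∇_b yloss = err
        end
    end
    return w, b
end

Random.seed!(42)
X = vcat([randn(2) .- 1 for _ in 1:50],  # cats, y = 0
         [randn(2) .+ 1 for _ in 1:50])  # dogs, y = 1
y = vcat(zeros(50), ones(50))
w, b = train(X, y)
M(x) = σ(w' * x + b)                     # fitted model M_{θ*}: P(y = 🐶 | x)
```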

CE in Five Slides

Fitted model. Contour shows predicted probability \(y=🐶\).

CE in Five Slides

Counterfactual Search

Objective:

\[ \begin{aligned} \min_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} })} \} \end{aligned} \]

Solution:

\[ \begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t - \nabla_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} })} \} \\ \textcolor{purple}{\mathbf{x}^*}&=\mathbf{x}_T \end{aligned} \]
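Continuing the toy example above, the counterfactual search step reuses the training loop almost verbatim, but now descends on the input while the fitted parameters stay frozen. This is a sketch of plain Wachter-style gradient search, not any particular library's implementation:

```julia
# Counterfactual search: freeze θ* = (w, b) and descend on x instead.
# For the logistic loss, ∇_x yloss(M_{θ*}(x), y⁺) = (σ(w'x + b) − y⁺) * w.
function counterfactual(x, y⁺, w, b; η = 0.1, T = 100)
    x′ = copy(x)
    for _ in 1:T                    # x_{t+1} = x_t − η ∇_x yloss
        err = σ(w' * x′ + b) - y⁺
        x′ -= η * err * w
    end
    return x′                       # x* after T steps
end

x_cat = [-1.5, -1.0]                    # factual instance, classified 🐱
x_cf = counterfactual(x_cat, 1.0, w, b) # target label y⁺ = 🐶
M(x_cat), M(x_cf)                       # probability of 🐶 before vs. after
```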

CE in Five Slides

\[ \begin{aligned} \min_{\mathbf{Z}^\prime \in \mathcal{Z}^L} \{ {\text{yloss}(M_{\theta}(f(\mathbf{Z}^\prime)),\mathbf{y}^+)} + \lambda {\text{cost}(f(\mathbf{Z}^\prime)) } \} \end{aligned} \]

Counterfactual Explanations explain how inputs into a model need to change for it to produce different outputs (Wachter, Mittelstadt, and Russell 2017).
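In the generalized objective above, the search runs over (possibly latent) states \(\mathbf{Z}^\prime\) with decoder \(f\) and an explicit cost penalty. As a minimal sketch, continuing the toy example, take \(f\) to be the identity and \(\text{cost}(\mathbf{x}^\prime) = \frac{1}{2}\lVert \mathbf{x}^\prime - \mathbf{x} \rVert^2\), which yields a simple distance-penalized variant of the search loop (an illustrative choice, not the only one):

```julia
# Distance-penalized counterfactual search: f = identity, and
# cost(x′) = ½‖x′ − x‖² keeps the counterfactual close to the factual.
function counterfactual_penalized(x, y⁺, w, b; λ = 0.5, η = 0.1, T = 200)
    x′ = copy(x)
    for _ in 1:T
        err = σ(w' * x′ + b) - y⁺           # ∇ of the yloss term
        x′ -= η * (err * w + λ * (x′ - x))  # ∇ of λ·cost is λ(x′ − x)
    end
    return x′
end
```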

Counterfactual explanation for what it takes to be a dog.

Algorithmic Recourse

Provided a CE is valid, plausible, and actionable, it can be used to provide recourse to individuals negatively affected by models.

“If your income had been X, then …”

Figure 1: Counterfactuals for random samples from the Give Me Some Credit dataset (Kaggle 2011). Features ‘age’ and ‘income’ are shown.

Dynamics of CE and AR

Hidden Cost of Implausibility

AR can introduce costly dynamics (Altmeyer, Angela, et al. 2023).

Figure 2: Illustration of external cost of individual recourse.

Insight: individual recourse neglects the bigger picture.

Mitigation Strategies

  • Incorporate hidden cost in reframed objective.
  • Reducing hidden cost is equivalent to ensuring plausibility.

\[ \begin{aligned} \mathbf{s}^\prime &= \arg \min_{\mathbf{s}^\prime \in \mathcal{S}} \{ {\text{yloss}(M(f(\mathbf{s}^\prime)),y^*)} \\ &+ \lambda_1 {\text{cost}(f(\mathbf{s}^\prime))} + \lambda_2 {\text{extcost}(f(\mathbf{s}^\prime))} \} \end{aligned} \]
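One way to see the structure of this reframed objective in the toy setting: keep the individual cost from before and add an external cost term. As a hypothetical stand-in for \(\text{extcost}\), penalizing the distance to the target class's training data pushes counterfactuals toward plausible regions; the paper measures hidden cost via induced domain and model shifts, so this centroid proxy is purely an assumption for illustration:

```julia
using Statistics  # for mean

# Sketch of the reframed objective: yloss + λ₁·cost + λ₂·extcost.
# extcost here is a hypothetical plausibility proxy: ½‖x′ − μ⁺‖², the
# distance to the centroid μ⁺ of target-class data (not the paper's measure).
function recourse(x, y⁺, w, b, X_target; λ₁ = 0.1, λ₂ = 0.5, η = 0.1, T = 200)
    μ⁺ = mean(X_target)                    # centroid of the target class
    x′ = copy(x)
    for _ in 1:T
        err = σ(w' * x′ + b) - y⁺          # ∇ of the yloss term
        x′ -= η * (err * w + λ₁ * (x′ - x) + λ₂ * (x′ - μ⁺))
    end
    return x′
end

X_dogs = X[51:end]                         # target-class samples from the toy data
x_rec = recourse([-1.5, -1.0], 1.0, w, b, X_dogs)
```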

Plausibility at all cost?

Pick your Poison

All of these counterfactuals are valid explanations for the model’s prediction.

Which one would you pick?

Figure 3: Turning a 9 into a 7: Counterfactual explanations for an image classifier produced using Wachter (Wachter, Mittelstadt, and Russell 2017), Schut (Schut et al. 2021) and REVISE (Joshi et al. 2019).

Faithful First, Plausible Second

Counterfactuals as plausible as the model permits (Altmeyer, Farmanbar, et al. 2023).

Figure 4: KDE for training data.
Figure 5: KDE for model posterior.

Faithful Counterfactuals

Figure 6: Turning a 9 into a 7. ECCCo applied to MLP (a), Ensemble (b), JEM (c), JEM Ensemble (d).

Insight: faithfulness facilitates plausibility, provided the model itself has learned plausible explanations of the data.

Figure 7: Results for different generators (turning a 3 into a 5).

Teaching models plausible explanations

Counterfactual Training: Method

Idea

Let the model compare its own explanations to plausible ones.

  1. Contrast faithful counterfactuals with data.
  2. Use nascent CE as adversarial examples.

Example of an adversarial attack. Source: Goodfellow, Shlens, and Szegedy (2015).
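For contrast, here is the classic fast gradient sign method (FGSM) from Goodfellow, Shlens, and Szegedy (2015), applied to the same toy logistic model from earlier. Where counterfactual search descends the loss toward a target label, FGSM ascends it to break the current one:

```julia
# FGSM: perturb each feature by ε in the sign of the input gradient,
# increasing the loss for the true label y (toy logistic model from above).
fgsm(x, y, w, b; ε = 0.25) = x + ε * sign.((σ(w' * x + b) - y) * w)

x_dog = [1.5, 1.0]                 # confidently classified as 🐶
x_adv = fgsm(x_dog, 1.0, w, b)     # small perturbation, may flip the label
M(x_dog), M(x_adv)
```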

Counterfactual Training: Results

Figure 8: (a) conventional training, all mutable; (b) CT, all mutable; (c) conventional, age immutable; (d) CT, age immutable.
  • Models trained with CT learn more plausible and (provably) actionable explanations.
  • Predictive performance does not suffer, robust performance improves.

If we still have time …

Spurious Sparks of AGI

We challenge the idea that finding meaningful patterns in the latent spaces of large models is indicative of AGI (Altmeyer et al. 2024).

Figure 9: Inflation of prices or birds? It doesn’t matter!

Taija

  • Work presented @ JuliaCon 2022, 2023, 2024.
  • Google Summer of Code and Julia Season of Contributions 2024.
  • Total of three software projects @ TU Delft.

Trustworthy AI in Julia: github.com/JuliaTrustworthyAI

References

Altmeyer, Patrick, Giovan Angela, Aleksander Buszydlik, Karol Dobiczek, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Endogenous Macrodynamics in Algorithmic Recourse.” In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 418–31. IEEE.
Altmeyer, Patrick, Andrew M. Demetriou, Antony Bartlett, and Cynthia C. S. Liem. 2024. “Position Paper: Against Spurious Sparks-Dovelating Inflated AI Claims.” https://arxiv.org/abs/2402.03962.
Altmeyer, Patrick, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Explaining Black-Box Models through Counterfactuals.” In Proceedings of the JuliaCon Conferences, 1:130.
Altmeyer, Patrick, Mojtaba Farmanbar, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Faithful Model Explanations Through Energy-Constrained Conformal Counterfactuals.” https://arxiv.org/abs/2312.10648.
Goodfellow, Ian, Jonathon Shlens, and Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572.
Joshi, Shalmali, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. 2019. “Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems.” https://arxiv.org/abs/1907.09615.
Kaggle. 2011. “Give Me Some Credit.” https://www.kaggle.com/c/GiveMeSomeCredit.
Schut, Lisa, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. 2021. “Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties.” In International Conference on Artificial Intelligence and Statistics, 1756–64. PMLR.
Wachter, Sandra, Brent Mittelstadt, and Chris Russell. 2017. “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR.” Harv. JL & Tech. 31: 841. https://doi.org/10.2139/ssrn.3063289.