Explaining Models or Modelling Explanations

Counterfactual Explanations and Algorithmic Recourse for Trustworthy AI

Delft University of Technology

April 16, 2026

Background

Economist, then PhD CS

How can we make opaque AI more trustworthy?

Explainable AI, Adversarial ML, Probabilistic ML

Core developer and maintainer of Taija (Trustworthy AI in Julia)

Scan for slides. Links to www.patalt.org.


Agenda

  • Intro: counterfactual explanations (CE) and algorithmic recourse (AR)
  • Unexpected Challenges: endogenous dynamics of AR
  • Paradigm Shift: explanations should be faithful first, plausible second
  • New Opportunities: teaching models plausible explanations through CE

Intro

Training Opaque Models

Tweaking Parameters

Objective:

\[ \begin{aligned} \min_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \end{aligned} \]


Solution:

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \nabla_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \\ \textcolor{orange}{\theta^*}&=\theta_T \end{aligned} \]
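As a concrete sketch of this parameter-tweaking loop, the update rule can be written out for a toy logistic model; the data, loss, learning rate, and iteration count are all illustrative assumptions, not the slides' actual setup.

```python
import numpy as np

# Toy data: two features, labels given by the sign of their sum.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def yloss(theta, X, y):
    # Binary cross-entropy of the linear model M_theta(x) = sigmoid(x . theta).
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad_yloss(theta, X, y):
    # Analytic gradient of the cross-entropy with respect to theta.
    p = sigmoid(X @ theta)
    return X.T @ (p - y) / len(y)

# theta_{t+1} = theta_t - eta * grad; theta* = theta_T.
theta = np.zeros(2)
for _ in range(500):
    theta -= 0.5 * grad_yloss(theta, X, y)
theta_star = theta
```

Everything downstream (explanations, recourse) treats `theta_star` as frozen.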

Explaining Opaque Models

Tweaking Inputs

Objective:

\[ \begin{aligned} \min_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} }) + \lambda \text{reg}(\mathbf{x};\cdot) } \} \end{aligned} \]


Solution:

\[ \begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t - \nabla_{\textcolor{purple}{\mathbf{x}}} \{ \text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} }) \\&+ \lambda \text{reg}(\mathbf{x};\cdot) \} \\ \textcolor{purple}{\mathbf{x}^*}&=\mathbf{x}_T \end{aligned} \]
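The input-tweaking loop can be sketched the same way as a Wachter-style counterfactual search: freeze the trained parameters, descend on the input toward the target class, and let a penalty keep the counterfactual near the factual. The frozen parameters, the quadratic choice of reg, and lambda are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_star = np.array([2.0, 2.0])   # frozen, pre-trained parameters
x_factual = np.array([-1.0, -1.0])  # factual input, currently classified as 0
y_plus = 1.0                        # target label y+
lam = 0.1                           # weight of reg(x; x_factual)

def objective_grad(x):
    # d/dx [ yloss(M_theta*(x), y+) + lam * ||x - x_factual||^2 ]
    p = sigmoid(x @ theta_star)
    return (p - y_plus) * theta_star + 2 * lam * (x - x_factual)

# x_{t+1} = x_t - eta * grad; x* = x_T.
x = x_factual.copy()
for _ in range(200):
    x -= 0.1 * objective_grad(x)
x_star = x
```

The choice of penalty is exactly where counterfactual generators differ: swapping the simple distance term for a plausibility-oriented one changes which explanations the search finds.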

Algorithmic Recourse

Provided a CE is valid, plausible, and actionable, it can be used to provide recourse to individuals who are negatively affected by a model's decisions.

“If your income had been x, then …”

Figure 1: Counterfactuals for random samples from the Give Me Some Credit dataset (Kaggle 2011). Features ‘age’ and ‘income’ are shown.

Unexpected Challenges

Hidden Cost of Implausibility

AR can introduce costly endogenous dynamics (Altmeyer, Angela, et al. 2023).

Figure 2: Illustration of the external cost of individual recourse.

Insight: Implausible Explanations Are Costly

Mitigation Strategies

Reframed Objective

\[ \begin{aligned} \mathbf{s}^\prime &= \arg \min_{\mathbf{s}^\prime \in \mathcal{S}} \{ {\text{yloss}(M(f(\mathbf{s}^\prime)),y^*)} \\ &+ \lambda_1 {\text{cost}(f(\mathbf{s}^\prime))} + \lambda_2 {\text{extcost}(f(\mathbf{s}^\prime))} \} \end{aligned} \]

  • Even simple mitigation strategies can help.
  • Reducing hidden cost is (roughly) equivalent to ensuring plausibility.
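One way to make the extra term concrete is to proxy extcost by distance to the observed target-class centroid, nudging counterfactuals toward regions where data actually lives. This is a minimal sketch under that assumption; the centroid, both lambda weights, and the quadratic costs are illustrative, not the paper's actual mitigation strategies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([2.0, 2.0])     # fixed classifier M
x_fact = np.array([-1.0, -1.0])  # factual input, class 0
mu_plus = np.array([2.0, 2.0])   # centroid of observed target-class data
l1, l2 = 0.05, 0.2               # lambda_1 (individual cost), lambda_2 (extcost)

def objective_grad(x):
    p = sigmoid(x @ theta)
    g_yloss = (p - 1.0) * theta        # pull toward the target class y*
    g_cost = 2 * l1 * (x - x_fact)     # individual cost: stay close to the factual
    g_ext = 2 * l2 * (x - mu_plus)     # external-cost proxy: stay near the data
    return g_yloss + g_cost + g_ext

x = x_fact.copy()
for _ in range(300):
    x -= 0.1 * objective_grad(x)
```

The extcost term pulls the counterfactual toward the target-class data, which is why reducing hidden cost and ensuring plausibility roughly coincide.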

Mitigation strategies to tackle hidden costs of AR.


Explanation or Adversarial Example?

Plausibility at all cost?

All of these counterfactuals are valid explanations for the model’s prediction.

Pick your poison …

Figure 3: Turning a 9 into a 7: Counterfactual explanations for an image classifier using different approaches (Altmeyer, Farmanbar, et al. 2024).

Faithful First, Plausible Second

Figure 4: Turning a 9 into a 7. ECCCo applied to MLP (a), Ensemble (b), JEM (c), JEM Ensemble (d).

Insight: faithfulness facilitates plausibility (Altmeyer, Farmanbar, et al. 2024).

Figure 5: Results for different generators (from 3 to 5).

Putting it all together

Counterfactual Training

First, Tweaking Inputs

\[ \begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t - \nabla_{\textcolor{purple}{\mathbf{x}}} \{ {\text{ECCCo}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}}})} \} \\ \textcolor{purple}{\mathbf{x}^*}&=\mathbf{x}_T \end{aligned} \]

Then, Tweaking Parameters

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \nabla_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} + \text{div}(\textcolor{purple}{\mathbf{x}^*},\mathbf{x}^+,y^+; \theta) \} \\ \textcolor{orange}{\theta^*}&=\theta_T \end{aligned} \]
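A minimal sketch of this two-stage loop, with heavy simplifications: a linear model, a plain gradient counterfactual generator in place of ECCCo, and the div(·) term crudely replaced by training on the nascent counterfactuals with their target label (the adversarial-training view from the slides). All names and hyperparameters are illustrative.

```python
import numpy as np

# Toy two-cluster data, separable along the first feature.
rng = np.random.default_rng(1)
X = rng.normal(scale=0.5, size=(200, 2))
X[:, 0] += np.where(rng.random(200) < 0.5, -2.0, 2.0)
y = (X[:, 0] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generate_cf(theta, x, y_plus, steps=25, eta=0.3):
    # Inner loop (tweak inputs): x_{t+1} = x_t - eta * d yloss(M_theta(x), y+)/dx.
    x = x.copy()
    for _ in range(steps):
        x -= eta * (sigmoid(x @ theta) - y_plus) * theta
    return x

theta = np.array([0.1, 0.0])
for _ in range(200):                    # outer loop (tweak parameters)
    p = sigmoid(X @ theta)
    g = X.T @ (p - y) / len(y)          # gradient of yloss on the data
    # Generate a counterfactual for one class-0 factual, targeting y+ = 1,
    # and fold it into the gradient as an extra (x*, y+) training pair.
    x_cf = generate_cf(theta, X[y == 0][0], 1.0)
    g += (sigmoid(x_cf @ theta) - 1.0) * x_cf / len(y)
    theta -= 0.5 * g
```

Because each outer step sees counterfactuals generated against the current model, the model is continually pushed to attach its decision boundary to regions where counterfactuals and data agree.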

Counterfactual Training

  1. Contrast faithful CE with data \(\rightarrow\) Explainability \(\uparrow\)
  2. Feature mutability constraints \(\rightarrow\) Actionability \(\uparrow\) (holds provably under certain assumptions)
  3. Bonus: use nascent CEs as adversarial examples (AEs) \(\rightarrow\) Robustness \(\uparrow\)
Figure 6: (a) conventional training, all mutable; (b) CT, all mutable; (c) conventional, age immutable; (d) CT, age immutable.

Counterfactual Training: Results

Plausibility: Visual explanations (counterfactuals) for baseline (top row) vs CT (bottom).


Actionability: Visual explanations (integrated gradients) as before. Five top and bottom rows immutable.


Test accuracies on adversarially perturbed data with varying perturbation sizes.


The Hard Numbers

Extensive experiments and ablation studies on nine datasets (synthetic, tabular, and vision), generating millions of counterfactuals:

  1. Plausibility of CEs increases by up to 90%.
  2. Actionability: cost of reaching valid counterfactuals with protected features decreases by 19% on average.
  3. Models’ adversarial robustness improves consistently.

Check it out!

Preprint


Software


Homepage


Taija

  • Work presented @ JuliaCon 2022, 2023, 2024.
  • Google Summer of Code and Julia Season of Contributions 2024.
  • Total of three software projects @ TU Delft.

Trustworthy AI in Julia: github.com/JuliaTrustworthyAI

If we still have time …

Spurious Sparks of AGI

We challenge the idea that finding meaningful patterns in the latent spaces of large models is indicative of AGI (Altmeyer, Demetriou, et al. 2024).

Figure 7: Inflation of prices or birds? It doesn’t matter!

References

Altmeyer, Patrick, Giovan Angela, Aleksander Buszydlik, Karol Dobiczek, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Endogenous Macrodynamics in Algorithmic Recourse.” In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 418–31. IEEE. https://doi.org/10.1109/satml54575.2023.00036.
Altmeyer, Patrick, Andrew M Demetriou, Antony Bartlett, and Cynthia C. S. Liem. 2024. “Position: Stop Making Unscientific AGI Performance Claims.” In International Conference on Machine Learning, 1222–42. PMLR. https://proceedings.mlr.press/v235/altmeyer24a.html.
Altmeyer, Patrick, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Explaining Black-Box Models through Counterfactuals.” In Proceedings of the JuliaCon Conferences, 1:130. https://doi.org/10.21105/jcon.00130.
Altmeyer, Patrick, Mojtaba Farmanbar, Arie van Deursen, and Cynthia C. S. Liem. 2024. “Faithful Model Explanations through Energy-Constrained Conformal Counterfactuals.” In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 38 (10): 10829–37. https://doi.org/10.1609/aaai.v38i10.28956.
Delft High Performance Computing Centre (DHPC). 2022. “DelftBlue Supercomputer (Phase 1).” https://www.tudelft.nl/dhpc/ark:/44463/DelftBluePhase1.
Kaggle. 2011. “Give Me Some Credit: Improve on the State of the Art in Credit Scoring by Predicting the Probability That Somebody Will Experience Financial Distress in the Next Two Years.” Kaggle. https://www.kaggle.com/c/GiveMeSomeCredit.