Counterfactual Training

Teaching Models Plausible and Actionable Explanations

Delft University of Technology

Aleksander Buszydlik
Arie van Deursen
Cynthia C. S. Liem

March 23, 2026

Training Opaque Models

Tweaking Parameters

Objective:

\[ \begin{aligned} \min_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \end{aligned} \]


Solution:

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \nabla_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \\ \textcolor{orange}{\theta^*}&=\theta_T \end{aligned} \]
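Concretely, the parameter update above can be sketched as plain gradient descent on a toy least-squares model (all names, the model, and the hyperparameters here are illustrative, not from the talk):

```python
import numpy as np

# Sketch of theta_{t+1} = theta_t - grad_theta yloss on a toy linear model
# M_theta(x) = x @ theta with a least-squares yloss. Illustrative only.

def yloss(theta, X, y):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad_yloss(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                       # noiseless labels for the sketch

theta = np.zeros(3)
lr, T = 0.1, 500
for _ in range(T):
    theta = theta - lr * grad_yloss(theta, X, y)   # theta_{t+1} = theta_t - grad
theta_star = theta                       # theta* = theta_T
```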

Explaining Opaque Models

Tweaking Inputs

Objective:

\[ \begin{aligned} \min_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} }) + \lambda \text{reg} } \} \end{aligned} \]


Solution:

\[ \begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t - \nabla_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} }) + \lambda \text{reg}} \} \\ \textcolor{purple}{\mathbf{x}^*}&=\mathbf{x}_T \end{aligned} \]
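The input update can be sketched analogously: the trained parameters stay frozen and gradient descent runs over the input toward the target class, with an L2 penalty standing in for the reg term (the toy model, penalty, and step sizes are illustrative assumptions):

```python
import numpy as np

# Sketch of x_{t+1} = x_t - grad_x { yloss(M_theta*(x), y+) + lam * reg }
# against a frozen toy logistic model. Names (lam, target) are illustrative.

theta_star = np.array([2.0, -1.0])           # frozen parameters theta*

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def yloss_grad(x, target):
    # gradient of binary cross-entropy wrt the *input* x (theta* held fixed)
    p = sigmoid(theta_star @ x)
    return (p - target) * theta_star

x = np.array([-1.0, 1.0])                    # factual input, predicted class 0
x0, lam, lr = x.copy(), 0.1, 0.5
for _ in range(200):
    reg_grad = 2 * (x - x0)                  # L2 reg keeps the CE near the factual
    x = x - lr * (yloss_grad(x, target=1.0) + lam * reg_grad)
x_star = x                                   # counterfactual x*
```

At convergence the counterfactual crosses the decision boundary while the penalty keeps it close to the original input.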

Explanation or Adversarial Example?

Plausibility at all cost?

All of these counterfactuals are valid explanations for the model’s prediction.

Pick your poison …

Figure 1: Turning a 9 into a 7: Counterfactual explanations for an image classifier using different approaches (Altmeyer et al. 2024).

Faithful First, Plausible Second

Figure 2: Turning a 9 into a 7. ECCCo applied to MLP (a), Ensemble (b), JEM (c), JEM Ensemble (d).

Insight: faithfulness facilitates plausibility

Figure 3: Results for different generators (from 3 to 5).

Putting it all together

Counterfactual Training

First, Tweaking Inputs

\[ \begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t - \nabla_{\textcolor{purple}{\mathbf{x}}} \{ {\text{ECCCo}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} })} \} \\ \textcolor{purple}{\mathbf{x}^*}&=\mathbf{x}_T \end{aligned} \]

Then, Tweaking Parameters

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \nabla_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} + \text{div}(\textcolor{purple}{\mathbf{x}^*},\mathbf{x}^+;y^+, \theta) \} \\ \textcolor{orange}{\theta^*}&=\theta_T \end{aligned} \]
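A structural sketch of the two loops, under strong simplifying assumptions: the inner loop tweaks inputs into nascent counterfactuals against the current parameters, and the outer loop tweaks parameters. The \(\text{div}(\mathbf{x}^*,\mathbf{x}^+;y^+,\theta)\) term is replaced here by a plain adversarial-training surrogate that keeps the original labels on the nascent counterfactuals, which matches the spirit of the robustness bonus but is not the paper's exact objective; model and hyperparameters are illustrative.

```python
import numpy as np

# Structural sketch of Counterfactual Training on a toy logistic model.
# Inner loop (inputs): push each point a few gradient steps toward the
# opposite class to obtain nascent counterfactuals x*.
# Outer loop (parameters): standard yloss plus a penalty on x* -- here a
# simple adversarial-training surrogate, NOT the paper's div(...) term.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),   # class 0
               rng.normal(+2.0, 1.0, size=(50, 2))])  # class 1
y = np.r_[np.zeros(50), np.ones(50)]

theta, lr, beta = np.zeros(2), 0.2, 0.3
for _ in range(200):                                  # outer loop (parameters)
    # inner loop: nascent counterfactuals toward the opposite class y+
    Xcf, y_target = X.copy(), 1.0 - y
    for _ in range(10):
        p = sigmoid(Xcf @ theta)
        Xcf -= 0.5 * ((p - y_target)[:, None] * theta)   # grad_x yloss(., y+)
    # outer step: yloss on the data plus the counterfactual penalty
    p, p_cf = sigmoid(X @ theta), sigmoid(Xcf @ theta)
    grad = X.T @ (p - y) / len(y) + beta * Xcf.T @ (p_cf - y) / len(y)
    theta -= lr * grad
```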

Counterfactual Training

  1. Contrast faithful CE with data \(\rightarrow\) Explainability \(\uparrow\)
  2. Feature mutability constraints \(\rightarrow\) Actionability \(\uparrow\) (holds provably under certain assumptions)
  3. Bonus: use nascent CE as AE \(\rightarrow\) Robustness \(\uparrow\)
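The mutability constraint in point 2 can be illustrated by masking immutable features out of the counterfactual update, so the search only moves along actionable directions (the toy model, mask, and feature naming are assumptions):

```python
import numpy as np

# Sketch of a feature-mutability constraint: an immutable feature (say, age)
# is masked out of the counterfactual update, so only actionable features
# move. Frozen model theta* and the mask are illustrative.

theta_star = np.array([1.5, 1.5])            # frozen toy model theta*
mutable = np.array([1.0, 0.0])               # feature 1 ("age") is immutable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, 0.0])                    # factual, predicted class 0
for _ in range(100):
    p = sigmoid(theta_star @ x)
    grad = (p - 1.0) * theta_star            # grad_x yloss toward y+ = 1
    x = x - 0.5 * mutable * grad             # update only mutable features
```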
Figure 4: (a) conventional training, all mutable; (b) CT, all mutable; (c) conventional, age immutable; (d) CT, age immutable.

Counterfactual Training: Results

Plausibility: Visual explanations (counterfactuals) for baseline (top row) vs CT (bottom).


Actionability: Visual explanations (integrated gradients) as before. Five top and bottom rows immutable.


Test accuracies on adversarially perturbed data with varying perturbation sizes.


The Hard Numbers

Extensive experiments and ablation studies on nine datasets (synthetic, tabular, and vision), generating millions of counterfactuals:

  1. Plausibility of CEs increases by up to 90%.
  2. Actionability: cost of reaching valid counterfactuals with protected features decreases by 19% on average.
  3. Models’ adversarial robustness improves consistently.

Check it out!

Preprint

Software

Homepage

References

Altmeyer, Patrick, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Explaining Black-Box Models through Counterfactuals.” In Proceedings of the JuliaCon Conferences, 1:130. https://doi.org/10.21105/jcon.00130.
Altmeyer, Patrick, Mojtaba Farmanbar, Arie van Deursen, and Cynthia C. S. Liem. 2024. “Faithful Model Explanations through Energy-Constrained Conformal Counterfactuals.” In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 38 (10): 10829–37. https://doi.org/10.1609/aaai.v38i10.28956.
Delft High Performance Computing Centre (DHPC). 2022. “DelftBlue Supercomputer (Phase 1).” https://www.tudelft.nl/dhpc/ark:/44463/DelftBluePhase1.