Counterfactual Training

Teaching Models Plausible and Actionable Explanations

Delft University of Technology

Aleksander Buszydlik
Arie van Deursen
Cynthia C. S. Liem

March 23, 2026

Training Opaque Models

Tweaking Parameters

Objective:

\[ \begin{aligned} \min_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \end{aligned} \]


Solution:

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \nabla_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} \} \\ \textcolor{orange}{\theta^*}&=\theta_T \end{aligned} \]
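Concretely, the parameter update above can be sketched as plain gradient descent on a toy least-squares model (all names, the model, and the hyperparameters here are illustrative, not from the talk):

```python
import numpy as np

# Sketch of theta_{t+1} = theta_t - grad_theta yloss on a toy linear model
# M_theta(x) = x @ theta with a least-squares yloss. Illustrative only.

def yloss(theta, X, y):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad_yloss(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                       # noiseless labels for the sketch

theta = np.zeros(3)
lr, T = 0.1, 500
for _ in range(T):
    theta = theta - lr * grad_yloss(theta, X, y)   # theta_{t+1} = theta_t - grad
theta_star = theta                       # theta* = theta_T
```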

Explaining Opaque Models

Tweaking Inputs

Objective:

\[ \begin{aligned} \min_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} }) + \lambda \text{reg} } \} \end{aligned} \]


Solution:

\[ \begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t - \nabla_{\textcolor{purple}{\mathbf{x}}} \{ {\text{yloss}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} }) + \lambda \text{reg}} \} \\ \textcolor{purple}{\mathbf{x}^*}&=\mathbf{x}_T \end{aligned} \]
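The input update can be sketched analogously: the trained parameters stay frozen and gradient descent runs over the input toward the target class, with an L2 penalty standing in for the reg term (the toy model, penalty, and step sizes are illustrative assumptions):

```python
import numpy as np

# Sketch of x_{t+1} = x_t - grad_x { yloss(M_theta*(x), y+) + lam * reg }
# against a frozen toy logistic model. Names (lam, target) are illustrative.

theta_star = np.array([2.0, -1.0])           # frozen parameters theta*

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def yloss_grad(x, target):
    # gradient of binary cross-entropy wrt the *input* x (theta* held fixed)
    p = sigmoid(theta_star @ x)
    return (p - target) * theta_star

x = np.array([-1.0, 1.0])                    # factual input, predicted class 0
x0, lam, lr = x.copy(), 0.1, 0.5
for _ in range(200):
    reg_grad = 2 * (x - x0)                  # L2 reg keeps the CE near the factual
    x = x - lr * (yloss_grad(x, target=1.0) + lam * reg_grad)
x_star = x                                   # counterfactual x*
```

At convergence the counterfactual crosses the decision boundary while the penalty keeps it close to the original input.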

Explanation or Adversarial Example?

Plausibility at all cost?

All of these counterfactuals are valid explanations for the model’s prediction.

Pick your poison …

Figure 1: Turning a 9 into a 7: Counterfactual explanations for an image classifier using different approaches (Altmeyer et al. 2024).

Faithful First, Plausible Second

Figure 2: Turning a 9 into a 7. ECCCo applied to MLP (a), Ensemble (b), JEM (c), JEM Ensemble (d).

Insight: faithfulness facilitates plausibility

Figure 3: Results for different generators (from 3 to 5).

Putting it all together

Counterfactual Training

First, Tweaking Inputs

\[ \begin{aligned} \mathbf{x}_{t+1} &= \mathbf{x}_t - \nabla_{\textcolor{purple}{\mathbf{x}}} \{ {\text{ECCCo}(M_{\textcolor{orange}{\theta^*}}(\mathbf{x}),\mathbf{y^{\textcolor{purple}{+}} })} \} \\ \textcolor{purple}{\mathbf{x}^*}&=\mathbf{x}_T \end{aligned} \]

Then, Tweaking Parameters

\[ \begin{aligned} \theta_{t+1} &= \theta_t - \nabla_{\textcolor{orange}{\theta}} \{ {\text{yloss}(M_{\theta}(\mathbf{x}),\mathbf{y})} + \text{div}(\textcolor{purple}{\mathbf{x}^*},\mathbf{x}^+;y^+, \theta) \} \\ \textcolor{orange}{\theta^*}&=\theta_T \end{aligned} \]
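A structural sketch of the two loops, under strong simplifying assumptions: the inner loop tweaks inputs into nascent counterfactuals against the current parameters, and the outer loop tweaks parameters. The \(\text{div}(\mathbf{x}^*,\mathbf{x}^+;y^+,\theta)\) term is replaced here by a plain adversarial-training surrogate that keeps the original labels on the nascent counterfactuals, which matches the spirit of the robustness bonus but is not the paper's exact objective; model and hyperparameters are illustrative.

```python
import numpy as np

# Structural sketch of Counterfactual Training on a toy logistic model.
# Inner loop (inputs): push each point a few gradient steps toward the
# opposite class to obtain nascent counterfactuals x*.
# Outer loop (parameters): standard yloss plus a penalty on x* -- here a
# simple adversarial-training surrogate, NOT the paper's div(...) term.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),   # class 0
               rng.normal(+2.0, 1.0, size=(50, 2))])  # class 1
y = np.r_[np.zeros(50), np.ones(50)]

theta, lr, beta = np.zeros(2), 0.2, 0.3
for _ in range(200):                                  # outer loop (parameters)
    # inner loop: nascent counterfactuals toward the opposite class y+
    Xcf, y_target = X.copy(), 1.0 - y
    for _ in range(10):
        p = sigmoid(Xcf @ theta)
        Xcf -= 0.5 * ((p - y_target)[:, None] * theta)   # grad_x yloss(., y+)
    # outer step: yloss on the data plus the counterfactual penalty
    p, p_cf = sigmoid(X @ theta), sigmoid(Xcf @ theta)
    grad = X.T @ (p - y) / len(y) + beta * Xcf.T @ (p_cf - y) / len(y)
    theta -= lr * grad
```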

Counterfactual Training

  1. Contrast faithful CE with data \(\rightarrow\) Explainability \(\uparrow\)
  2. Feature mutability constraints \(\rightarrow\) Actionability \(\uparrow\) (holds provably under certain assumptions)
  3. Bonus: use nascent CE as AE \(\rightarrow\) Robustness \(\uparrow\)
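The mutability constraint in point 2 can be illustrated by masking immutable features out of the counterfactual update, so the search only moves along actionable directions (the toy model, mask, and feature naming are assumptions):

```python
import numpy as np

# Sketch of a feature-mutability constraint: an immutable feature (say, age)
# is masked out of the counterfactual update, so only actionable features
# move. Frozen model theta* and the mask are illustrative.

theta_star = np.array([1.5, 1.5])            # frozen toy model theta*
mutable = np.array([1.0, 0.0])               # feature 1 ("age") is immutable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, 0.0])                    # factual, predicted class 0
for _ in range(100):
    p = sigmoid(theta_star @ x)
    grad = (p - 1.0) * theta_star            # grad_x yloss toward y+ = 1
    x = x - 0.5 * mutable * grad             # update only mutable features
```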
Figure 4: (a) conventional training, all mutable; (b) CT, all mutable; (c) conventional, age immutable; (d) CT, age immutable.

Counterfactual Training: Results

Plausibility: Visual explanations (counterfactuals) for baseline (top row) vs CT (bottom).


Actionability: Visual explanations (integrated gradients) as before. Five top and bottom rows immutable.


Test accuracies on adversarially perturbed data with varying perturbation sizes.


The Hard Numbers

Extensive experiments and ablation studies on nine datasets (synthetic, tabular, and vision), generating millions of counterfactuals:

  1. Plausibility of CEs increases by up to 90%.
  2. Actionability: cost of reaching valid counterfactuals with protected features decreases by 19% on average.
  3. Models’ adversarial robustness improves consistently.

Check it out!

Preprint

Software

Homepage

References

Altmeyer, Patrick, Arie van Deursen, and Cynthia C. S. Liem. 2023. “Explaining Black-Box Models through Counterfactuals.” In Proceedings of the JuliaCon Conferences, 1:130. https://doi.org/10.21105/jcon.00130.
Altmeyer, Patrick, Mojtaba Farmanbar, Arie van Deursen, and Cynthia C. S. Liem. 2024. “Faithful Model Explanations through Energy-Constrained Conformal Counterfactuals.” In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 38 (10): 10829–37. https://doi.org/10.1609/aaai.v38i10.28956.
Delft High Performance Computing Centre (DHPC). 2022. “DelftBlue Supercomputer (Phase 1).” https://www.tudelft.nl/dhpc/ark:/44463/DelftBluePhase1.