7 Conclusion
Machine learning and artificial intelligence have developed rapidly in recent decades. Despite their success, state-of-the-art machine learning models are complex and their decision logic is difficult for humans to interpret. This thesis contributes to a growing body of research that aims to tackle these issues and ultimately make opaque models more trustworthy. We have presented several technological innovations, methodological advances, empirical analyses and critical evaluations of existing paradigms and practices. In this final chapter, we draw our overall conclusions.
In Section 7.1, we begin by revisiting the main research questions set out at the beginning of this thesis. Next, we assess the real-world implications of this thesis and present an outlook for the future (Section 7.2). This is followed by a critical reflection on the limitations of our own work and potential threats to the validity of this thesis in Section 7.3. Finally, we present several recommendations for researchers and practitioners in Section 7.4.
7.1 Revisiting Research Questions
In TRQ [trq:jcon], we ask what counterfactual explanations are, why they might be useful for trustworthy AI and whether sufficient open-source implementations exist. Highlighting shortcomings of popular XAI approaches like LIME and SHAP, we argue that CE offer a useful and intuitive alternative. In particular, we explain that, unlike methods relying on local surrogate models, CE have full fidelity by construction as long as they are valid, and validity can be guaranteed (Guidotti 2022). Additionally, while CE can be manipulated much like LIME and SHAP, remedies are simple and cheap (Slack et al. 2020, 2021). In terms of other research contributions, our work in Chapter 2 also highlights a weakness of surrogate-based CE methods like REVISE (Joshi et al. 2019): the quality of the generated CE no longer depends exclusively on the quality of the opaque model, but also on that of the surrogate. To the best of our knowledge, our work is the first to highlight this. This observation has had considerable impact on our work in Chapter 4 and Chapter 5.
With respect to software availability, our study finds that—at the time it was first written—the availability of open-source software to explain opaque AI models through counterfactuals was limited. While researchers have made piecewise implementations of specific methods available1, the only prior attempt at providing a unifying framework is the Python library CARLA (Pawelczyk et al. 2021), which was released before CounterfactualExplanations.jl. It offers support for a number of different CE methods, but no longer appears to be actively maintained2. Within the Julia ecosystem, CounterfactualExplanations.jl is the first and only unifying framework for CE. Our work addresses a growing demand for packages that contribute towards trustworthy AI, which has been recognized and embraced by the Julia community. To the best of our knowledge, our framework is the only one that allows users to easily combine different CE methods and generate multiple counterfactuals in parallel using both multithreading and multiprocessing. This has enabled us to run experiments of a scale that is unprecedented in the field using one of TU Delft’s high-performance computing clusters (DHPC 2022).
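To illustrate the kind of workflow this parallelization enables, the following minimal Julia sketch distributes counterfactual generation across threads. The function generate_one is a hypothetical placeholder rather than the actual API of CounterfactualExplanations.jl, which is documented in the package repository.

```julia
using Base.Threads

# Hypothetical placeholder for a single counterfactual search; the real API of
# CounterfactualExplanations.jl differs and is documented in the package itself.
generate_one(x::AbstractVector, target::Int) = x .+ 0.1  # dummy perturbation towards `target`

# Generate one counterfactual per factual instance, spreading the work across
# all available Julia threads (start Julia with `--threads=auto`).
function generate_many(factuals::Vector{<:AbstractVector}, target::Int)
    results = Vector{Vector{Float64}}(undef, length(factuals))
    @threads for i in eachindex(factuals)
        results[i] = generate_one(factuals[i], target)
    end
    return results
end

generate_many([rand(2) for _ in 1:100], 1)
```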
Overall conclusion: Counterfactual explanations are an effective tool for trustworthy AI and our CounterfactualExplanations.jl package fills an important gap in the open-source software landscape.
In TRQ [trq:satml], we ask about the dynamics of counterfactual explanations and algorithmic recourse when they are implemented in practice. Our study finds that, if not handled carefully, off-the-shelf counterfactual generators induce undesirable endogenous macrodynamics with respect to the underlying model and data. In particular, we show that a narrow focus on minimizing individual costs neglects the downstream effects of recourse itself, which carries real-world risks: a bank that offers recourse to its loan applicants, for example, can be expected to face increased credit risk if minimal-cost recourse is implemented. We find that, independent of the application, classifier performance can be expected to deteriorate if models are retrained on datasets that include minimal-cost counterfactuals.
A key observation is that minimal-cost counterfactuals are fundamentally at odds with plausibility. As explained in the introduction, a counterfactual cannot be close to its factual starting point and close to the target domain at the same time. Consequently, CE methods that target plausibility, such as those of Joshi et al. (2019) and Schut et al. (2021), are less prone to inducing undesirable dynamics than the baseline method of Wachter, Mittelstadt, and Russell (2017). We formalize this trade-off between individual costs and external costs due to implausibility and propose simple and effective mitigation strategies. An important consideration in this context is convergence: if the counterfactual search is discontinued immediately after the decision boundary is crossed, the final counterfactual is unlikely to be plausible. In fact, we find that the simplest mitigation strategy for undesirable endogenous dynamics is to choose a sufficiently high decision threshold for convergence: counterfactuals typically end up deeper inside the target domain if the search is considered converged only once the classifier predicts the target class with high probability.
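To make the role of the convergence threshold concrete, the following toy Julia sketch implements a Wachter-style gradient search against a hard-coded logistic classifier. All names and values are illustrative assumptions rather than the implementation used in Chapter 3; the point is solely that the loop only terminates once the target probability exceeds \(\gamma\), not as soon as the decision boundary is crossed.

```julia
using Zygote

# Toy binary classifier: logistic regression with fixed, made-up weights.
w, b = [2.0, -1.0], 0.5
predict_prob(x) = 1 / (1 + exp(-(w' * x + b)))  # probability of the target class

# Wachter-style search: trade off reaching the target class against staying
# close to the factual. Convergence requires P(target) >= γ, not merely > 0.5.
function search_counterfactual(x_factual; γ=0.9, λ=0.1, η=0.05, max_iter=1_000)
    x = copy(x_factual)
    for _ in 1:max_iter
        loss(z) = (predict_prob(z) - 1.0)^2 + λ * sum(abs2, z - x_factual)
        x -= η * Zygote.gradient(loss, x)[1]
        predict_prob(x) >= γ && break  # high-threshold convergence criterion
    end
    return x
end

x_cf = search_counterfactual([-1.0, 1.0])
```

Lowering γ towards 0.5 recovers the baseline behavior in which the search halts right at the decision boundary and, as discussed above, typically yields less plausible counterfactuals.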
Overall conclusion: In the broader context of this thesis, the most important conclusion of Chapter 3 is that implausible counterfactuals can cause unexpected negative consequences in practice.
TRQ [trq:aaai] asks whether plausible explanations can be attained without relying on surrogate models. We demonstrate that it is indeed possible to achieve plausibility by relying exclusively on properties of the opaque model itself, but only if the model has actually learned plausible explanations for the data. We consider this latter condition a feature of our proposed CE method, not a bug. To demonstrate this, our study begins by revisiting an observation from Chapter 2: using surrogate models to generate counterfactuals can affect the quality of the counterfactuals in unexpected and adverse ways. We provide a simple yet compelling motivating example demonstrating that the surrogate-based REVISE generator (Joshi et al. 2019) can yield highly plausible counterfactuals even if the opaque model has learned demonstrably implausible explanations for the data. Thus, we argue, it is possible to inadvertently “whitewash” an untrustworthy “black-box” model by effectively reallocating the task of learning plausible explanations from the model itself to the surrogate. To avoid such scenarios, we argue that inducing plausibility at all costs is a misguided paradigm. Instead, we should aim to generate counterfactuals that faithfully represent the conditional posterior distribution over inputs learned by the model.
To achieve this goal, we propose a new method for generating energy-constrained conformal counterfactuals—ECCCo. Our approach leverages ideas underlying joint energy-based models (Grathwohl et al. 2020) and conformal prediction. Specifically, we ensure that counterfactuals reach low-energy states with respect to the model and lead to high-certainty predictions of the target class. Through extensive experiments, we demonstrate that ECCCo achieves state-of-the-art levels of plausibility for well-specified models. This allows researchers and practitioners to use ECCCo to assess how trustworthy opaque models are based on the plausibility of the explanations they have learned.
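Schematically, and glossing over the exact formulation, weights and notation of Chapter 4, the counterfactual search underlying ECCCo can be thought of as minimizing an objective of roughly the following form:

\[
\min_{\mathbf{x}^{\prime}} \; \ell\big(M_{\theta}(\mathbf{x}^{\prime}), y^{+}\big) + \lambda_{1}\,\mathrm{cost}(\mathbf{x}^{\prime}, \mathbf{x}) + \lambda_{2}\,\mathcal{E}_{\theta}(\mathbf{x}^{\prime} \mid y^{+}) + \lambda_{3}\,\Omega\big(C_{\theta}(\mathbf{x}^{\prime}; \alpha)\big)
\]

Here \(\mathbf{x}\) denotes the factual, \(\mathbf{x}^{\prime}\) the counterfactual, \(y^{+}\) the target class and \(\ell\) a prediction loss; \(\mathrm{cost}\) penalizes distance from the factual, \(\mathcal{E}_{\theta}\) is the model’s energy for the counterfactual conditional on the target class (low energy corresponds to faithfulness to the learned conditional distribution), \(\Omega\big(C_{\theta}(\cdot;\alpha)\big)\) penalizes the size of the conformal prediction set at coverage level \(\alpha\) to discourage low-certainty predictions, and the \(\lambda_{i}\) are penalty weights.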
Overall conclusion: Counterfactual explanations can be both plausible and faithful to the opaque model. Instead of aiming to develop model-agnostic tools for generating plausible explanations (“modelling explanations”), we should hold models accountable for delivering such explanations.
TRQ [trq:ecml] asks a natural follow-up question: provided we have a way to generate faithful counterfactuals, can we use them to improve the trustworthiness of models? To this end, our work in Chapter 5 introduces a new training regime for differentiable models like artificial neural networks: counterfactual training (CT). It involves generating faithful counterfactual explanations during each training iteration and then backpropagating model gradients with respect to the contrastive divergence between counterfactuals and observed training samples in the target domain. Additionally, we interpret intermediate counterfactuals near the decision boundary as adversarial examples and penalize the model’s adversarial loss. Our work therefore explicitly connects explainable and adversarial ML.
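Stated schematically, and without reproducing the exact terms and weights used in Chapter 5, each training step of CT minimizes a composite loss of roughly the following form:

\[
\mathcal{L}_{\mathrm{CT}}(\theta) \approx \mathcal{L}_{\mathrm{pred}}(\theta) + \lambda_{1}\Big(\mathcal{E}_{\theta}\big(\mathbf{x}_{\mathrm{obs}} \mid y^{+}\big) - \mathcal{E}_{\theta}\big(\mathbf{x}^{\prime} \mid y^{+}\big)\Big) + \lambda_{2}\,\mathcal{L}_{\mathrm{adv}}(\theta)
\]

where \(\mathcal{L}_{\mathrm{pred}}\) is the standard predictive loss, the second term is a contrastive-divergence-style comparison of the energy the model assigns to observed target-class samples \(\mathbf{x}_{\mathrm{obs}}\) and to generated counterfactuals \(\mathbf{x}^{\prime}\), \(\mathcal{L}_{\mathrm{adv}}\) is the adversarial penalty on intermediate counterfactuals near the decision boundary, and \(\lambda_{1}, \lambda_{2}\) are hyperparameters.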
Our empirical findings demonstrate that CT yields more adversarially robust models that learn more plausible explanations for the data. Beyond plausibility and adversarial robustness, counterfactual training can also be used to ensure that models learn actionable explanations. To this end, we prove that CT induces models that are less sensitive to immutable or protected features. Importantly, our empirical results also show that these benefits with respect to trustworthiness do not come at the cost of reduced predictive performance. We find that the predictive performance of models on test data is unaffected by CT, made more robust by it, or both. The work in this chapter therefore provides substantial advances with respect to training more trustworthy AI.
Overall conclusion: Faithful counterfactual explanations can be leveraged during training to improve models with respect to explainability and adversarial robustness.
TRQ [trq:icml] enquires about the role of trustworthy AI in the context of LLMs. In Chapter 6, we critically assess a recent viral work that uses standard tools from mechanistic interpretability to arrive at the conclusion that modern LLMs learn world models. This, in turn, has been characterized as a milestone on the path towards AGI. Our study presents a number of experiments involving models of varying complexity to demonstrate that finding concept-related representations in the latent spaces of models should not surprise us and certainly should not be seen as evidence in favor of AGI. A thorough review of the social-science literature demonstrates why researchers might still fall into that trap, especially in an environment that has made AGI the north star of AI research (Blili-Hamelin et al. 2025). In summary, we caution researchers against misinterpreting results from mechanistic interpretability, or else its role in the pursuit of more trustworthy AI may be tarnished.
Overall conclusion: Tools from mechanistic interpretability should be used carefully to avoid tarnishing their credibility with respect to trustworthy AI. Further work is needed to improve the usefulness of CE for LLMs.
7.2 Implications and Outlook
The findings in this work have shed light on many challenges and questions in the field of trustworthy AI and, in particular, counterfactual explanations. While we have also proposed solutions to some of the more specific challenges that we have encountered, our work highlights broader challenges that remain open.
Chapter 3 and Chapter 4 have demonstrated that the field needs to rethink some of its core objectives. Chapter 3, in particular, showed that algorithmic recourse in practice involves multiple stakeholders typically competing for scarce resources. As more opaque AI is deployed in the real world, we may have to rethink recourse as an economic and societal problem. Thus research on AR will inevitably inform future economic and societal questions and vice versa. Our findings from Chapter 4 require us to rethink what we truly want to get out of XAI methods: practical but possibly misleading answers or enhanced understanding of model behavior? Our findings also invite follow-up questions about the evaluation of counterfactual explanations. While we provide a nuanced definition and metric for faithfulness, we do not pretend to provide final answers in Chapter 4 and believe that this objective for counterfactuals deserves more attention.
Chapter 5 has important implications for the connection between explainability and adversarial robustness in machine learning. Our framework for counterfactual training constitutes a solid starting ground, but there is likely much untapped potential for synergies between these two subfields of trustworthy AI. We therefore believe that researchers from both communities would benefit from collaborating. We further believe that practitioners would benefit from taking a holistic approach to trustworthy AI, explicitly recognizing that various objectives may complement each other but also compete.
In all of this, we hope that the work presented in Chapter 2 can continue to play a role in facilitating research and experimentation. The broader ecosystem of packages that has grown out of this initial work has certainly gained some traction and popularity in the Julia community, but to create a lasting impact it will need to be maintained and developed further. We believe that Taija has great potential for both research and industry.
Finally, results presented in this thesis also have implications for the ongoing discourse around AGI. Chapter 6 has shown that we should insist on adhering to scientific principles when engaging in this discourse as academics, especially those among us who are considered thought leaders by many. As a whole, this thesis has also shown that we are still struggling to truly understand and control the behavior of even the most basic building blocks of AI. This should give serious pause to anyone in our field who currently treats AGI as the north-star goal of AI research.
To end on an optimistic note, we believe that this work also provides hope for trustworthy AI. We have shown that it is possible to use model explanations for good: if carefully constructed, they can help us to not only assess the trustworthiness of opaque models, but also improve it. This requires work, but as economists like to say: “There is no such thing as a free lunch”3.
7.3 Limitations and Threats to Validity
In this section, we highlight limitations and threats to the validity of our work. We focus on points that were not already discussed explicitly in the context of individual chapters.
7.3.1 Construct Validity
Our evaluations of counterfactuals in Chapter 4 and Chapter 5 rely on imperfect metrics for assessing the plausibility and faithfulness of explanations. For plausibility, we extend existing distance-based metrics that measure the dissimilarity between counterfactuals and observed training data in the target domain. This is a valid approach to assessing plausibility to the extent that “broadly consistent with the observed data” is an adequate proxy for “plausible as assessed by humans”. Our own work suggests that this is not always the case: in Chapter 4, for instance, we found that image counterfactuals produced by ECCCo were sometimes more visually appealing and plausible than the distance-based evaluation metrics suggested. To mitigate this, we tested different distance measures and eventually introduced a new divergence measure in Chapter 5, but we recognize that, ideally, the plausibility of explanations would be assessed directly by humans. The scale of our experiments, involving many millions of counterfactuals, made this option infeasible.
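As a concrete if schematic example of such a distance-based metric (not necessarily the exact variant used in our experiments), implausibility can be measured as the average distance of a counterfactual to its \(K\) nearest neighbors among observed samples in the target class:

\[
\mathrm{impl}(\mathbf{x}^{\prime}) = \frac{1}{K} \sum_{\mathbf{x} \in \mathrm{nn}_{K}(\mathbf{x}^{\prime},\, \mathcal{D}_{y^{+}})} \mathrm{dist}(\mathbf{x}^{\prime}, \mathbf{x})
\]

Lower values indicate counterfactuals that are more consistent with the observed data \(\mathcal{D}_{y^{+}}\) in the target class and hence, by this proxy, more plausible.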
Regarding faithfulness, we rely on established methods for estimating the conditional model posterior over inputs (Grathwohl et al. 2020; Murphy 2023). This approach has two potential shortcomings with respect to the validity of our work: firstly, the estimated empirical distributions are subject to estimation error; secondly, our proposed method in Chapter 4 is biased towards this metric by construction, although the same can be said about other methods targeting certain metrics like minimal costs. The important finding in this context is that our proposed counterfactual generator can satisfy its primary target (faithfulness) while also achieving its secondary target (plausibility).
7.3.2 Internal Validity
Further on evaluation, we rely on cross-validation to account for stochasticity: specifically, we always generate and evaluate counterfactuals multiple times, each time drawing a different random subsample of the available data. We then compute averages and standard deviations of our evaluation metrics to get a sense of how substantial and significant the differences in outcomes are for the various methods we test. This is consistent with common practice in the related literature and—we believe—sufficient to arrive at the conclusions we present in individual chapters. Nonetheless, we recognize that in Chapter 4 we fall short of testing our evaluations and rankings as rigorously for statistical significance as we do in Chapter 3.
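The following Julia sketch illustrates this protocol in its simplest form. The function evaluate_run is a hypothetical placeholder for one full generate-and-evaluate cycle; the actual experiments rely on the tooling described in Chapter 2.

```julia
using Random, Statistics

# Hypothetical placeholder for one generate-and-evaluate cycle on a subsample;
# in our experiments this would return, e.g., a plausibility or faithfulness score.
evaluate_run(subsample) = rand()

# Repeat the evaluation on fresh random subsamples and report mean and standard deviation.
function repeated_evaluation(data::AbstractVector; n_runs=5, frac=0.8, seed=42)
    rng = MersenneTwister(seed)
    scores = map(1:n_runs) do _
        n = round(Int, frac * length(data))
        idx = randperm(rng, length(data))[1:n]
        evaluate_run(data[idx])
    end
    return (mean = mean(scores), std = std(scores))
end

repeated_evaluation(collect(1:1_000))
```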
The internal validity of the findings in Chapter 3 is threatened primarily by the simplifying assumptions we make about stakeholder actions. For example, we assume that any individual provided with valid algorithmic recourse implements the exact recommendations. We also assume that model owners retrain models regularly after individuals have implemented recourse and that no entirely new samples are added to the training population. Retraining continues over multiple rounds even in the face of model deterioration and other negative dynamics after the first few rounds. All of these modelling assumptions are deliberately simple so that we can focus on our main narrative: we intentionally abstract from detail to study the worst-case, high-level effects we are interested in.
7.3.3 External Validity
To empirically test our claims and proposed methods in Chapter 3, Chapter 4 and Chapter 5, we employed both synthetic and publicly available real-world datasets that are commonly studied in the related literature. We have also largely relied on studying small and simple neural network architectures, again consistent with the related literature. While we have made an effort to always include a broad range of sources to ensure a certain degree of robustness in our findings, it is possible, and indeed expected, that some of our findings will not always hold in practice. One reason is hyperparameter sensitivity, which affects in particular the results from Chapter 4 and, to a lesser degree, Chapter 5. A related threat to external validity is scalability: the computational cost involved in generating counterfactual explanations increases with the dimensionality of the inputs, which may make certain methods we propose—in particular counterfactual training—computationally prohibitive.
Concerning Chapter 6, it could be argued that the experiments we present involve models that are too simple to warrant any discussion around AGI4. Our response to this would be that the choice of simple models that have not previously been linked to AGI is very much intentional. As our experiments show, properties of LLMs that have been presented as novel and surprising are in fact shared by much simpler models.
7.3.4 Software Limitations
Since Chapter 2 was published in 2023, we have continued to actively develop and maintain CounterfactualExplanations.jl, such that it has now reached maturity with respect to fundamental features. That being said, to the best of our knowledge it has never been tested in a production environment involving larger models and datasets than the ones we have used in our research. Due to our focus on simulations and cross-validations involving many counterfactuals (high \(n\)) of typically moderate dimensionality (small \(p\)), we have prioritized support for parallelization through multithreading and multiprocessing, as opposed to graphics processing units (GPUs). Thankfully, Julia also offers excellent support for the latter, and since we rely on standard routines for automatic differentiation, it should be straightforward to address this limitation. Beyond this, there are numerous smaller outstanding development tasks listed on our repository: JuliaTrustworthyAI/CounterfactualExplanations.jl/issues.
Concerning internal validity, our software is no different from any other software in that it contains bugs and inefficiencies. We have encountered such shortcomings in the past and expect to find more in the future. Relatedly, certain software architecture and prioritization choices we have made may be suboptimal for specific applications, even though they have served us well in the past. Regarding external validity, there is a strong possibility that Patrick will not be able to maintain the package as actively as in the past. To address this risk, we have taken steps to attract external contributions and aim to continue in this fashion.
7.4 Recommendations for Research and Practice
In this section, we provide general recommendations for both researchers and practitioners working with opaque machine learning models. They are derived from our research findings but not in all cases directly tied to specific results.
As proposition (5) of this thesis states: strange things really do happen in high-dimensional spaces. This is an observation that applies to all chapters of this thesis in one way or another. Recommendation 1 can therefore be seen as a general call for caution. Even though we—along with many others working in the field—have contributed through our work towards making opaque models more trustworthy, there is simply no silver bullet. For better or worse, high degrees of freedom in representation learning make models susceptible to learning representations that humans cannot interpret. This is what makes such models so powerful at achieving narrow objectives. But as we have seen throughout this thesis, it also has the potential to make them sensitive to spurious associations in the data they are trained on. Our work contributes several results that can aid researchers in navigating this challenge, but we want to be very clear that we think of our findings as remedies, not a cure.
We consider explainable and trustworthy AI a moving target, just as adversarial robustness is still considered an unsolved challenge even for simple models (Kolter 2023): for every explanation and every attack we have identified, another is likely to follow. Thus, we recommend that researchers and practitioners avoid striving for trustworthy AI as some attainable end goal and instead recognize that it is a continuous process that requires work. Counterfactual explanations provide a particularly useful framework for dealing with models that enjoy large degrees of freedom, precisely because they are similarly unconstrained in terms of the feature space they can occupy.
Multiplicity of counterfactual explanations is a feature, not a bug. Uniqueness has in the past been considered an explicit goal in the context of XAI, possibly because humans are naturally inclined to prefer straight answers over complicated ones. We would argue, however, that the notion of finding unique solutions to our search for model explanations is fundamentally at odds with basic properties of the models we are studying: they are not unique solutions either, not even to the narrow objectives we typically train them for. Any fitted neural network is just one random outcome of a stochastic training process that could have resulted in any one of many different parameterizations that provide compelling explanations for the data (Wilson 2020).
More to the point of explainability and its use cases, our work has also frequently shown that real-world objectives are not in fact narrow: Chapter 3 highlighted the trade-off between individual and external costs in algorithmic recourse; Chapter 4 shed light on the interplay between plausibility and faithfulness of counterfactual explanations; and in Chapter 5 we have made the case for explicitly adjusting training objectives to induce models to learn actionable and plausible explanations. Since objectives are multiple and context-dependent, explanations are also inevitably variable. Therefore, it is our recommendation that researchers and practitioners use tools for trustworthy AI that are flexible enough to accommodate such multiplicity of objectives and explanations.
Speaking of objectives, we believe that the guiding objective for counterfactual explanations—and in fact any XAI method—should be faithfulness to the model. It is very difficult to think of scenarios that call for plausible, robust, diverse, actionable or easily affordable recourse recommendations that do not also faithfully explain the model in question. At best, we would consider unfaithful recommendations a short-term solution to dealing with opaque models in practice. As we have demonstrated in Chapter 4, it is entirely possible to generate plausible explanations for accurate and yet fundamentally untrustworthy models. This should be keenly avoided, since it may instill trust in models that are not worthy of it.
In close relation to Recommendation 3, we recommend that researchers in XAI avoid treating plausible, model-agnostic explanations as the holy grail. Explanations do not make automated decisions that may affect the lives of individuals; models do. Explanations are merely reflections of how models arrive at the decisions they make. Thus, we should use explanations primarily to inform our understanding of models and strive to improve models based on explanations, instead of treating models as fixed and tailoring our explanations around them. Otherwise, we risk still treating models as oracles that cannot be held accountable, much like we have done in the past (O’Neil 2016).
This is more important in the age of LLMs than ever before. As we have argued in Chapter 6, people are prone to anthropomorphize and idolize complex technologies they do not fully understand. There is a real risk today that people are so dumbstruck and overwhelmed by machines that are quite literally optimized to appeal to them that they end up blindly relying on them, even worshiping them. This, coupled with a lack of accountability, provides model owners with unprecedented power to affect individuals. No matter how powerful these models become, we need to avoid thinking that they are inscrutable, let alone infallible.
This last Recommendation 5 is not directly tied to any specific result of this work, but rather to the thesis as a whole and the direct or indirect contributions of the many people who co-shaped it. In times when diversity is once again under threat from narrow-minded people in powerful positions, we find it important to share that, in our own experience, diverse perspectives supercharge innovation. Many of the findings in this thesis have resulted from combining ideas from different subfields of AI or even external domains, including economics and other social sciences. This culminated in Chapter 6, which involved co-authors from a variety of different disciplines and is likely to be one of the more influential contributions of this thesis.
In addition to the examples listed in Chapter 15 of Molnar (2022), we have identified Schut et al. (2021)—a fast method for probabilistic classifiers—and Prado-Romero and Stilo (2022) for graph counterfactual explanations.↩︎
At the time of writing, there has not been an update to the code base in over two years.↩︎
This quote is often attributed to Milton Friedman, but it likely originated earlier. According to the Cambridge Dictionary, the phrase is used to emphasize that you cannot get something for nothing: https://dictionary.cambridge.org/us/dictionary/english/there-s-no-such-thing-as-a-free-lunch↩︎
This is in fact how one of the few dissenting audience members at ICML dismissed our work without any further consideration.↩︎