1  Introduction

Keywords

Artificial Intelligence, Trustworthy AI, Counterfactual Explanations, Algorithmic Recourse

Recent developments in artificial intelligence (AI) have largely centered around representation learning: instead of relying on features and rules that are carefully hand-crafted by humans, modern machine learning (ML) models are tasked with learning representations directly from data to make predictions (Goodfellow, Bengio, and Courville 2016)—this typically involves optimizing these representations to achieve narrow training objectives like predictive accuracy. Modern advances in computing have made it possible to provide such models with ever-growing degrees of freedom to achieve that task, which frequently allows them to outperform traditionally more parsimonious models. While this branch of AI has certainly not been the only active field of research, it is arguably the one that has attracted the highest levels of public attention and investment over the past decade. This trend has been fuelled by increasingly bold promises that “big data leads to better […] decisions” (McAfee et al. 2012), that companies embracing machine learning “will be the big winners of tomorrow” (Tank 2017) and that, ultimately, “[AI] could massively accelerate scientific discovery and innovation well beyond what we are capable of doing on our own” (Altman 2025).

Unfortunately, the models underlying all these developments learn increasingly complex and highly sensitive representations that humans can no longer easily interpret. This trend towards complexity for the sake of performance has come under serious scrutiny in recent years. One important challenge arising from high sensitivity is model robustness: at the very cusp of the deep learning (DL) revolution, Szegedy et al. (2014) showed that artificial neural networks (ANN) are sensitive to adversarial examples (AEs): perturbed versions of data instances that yield vastly different model predictions despite being “imperceptible” in that they are semantically indistinguishable from their factual counterparts. Even though some partially effective mitigation strategies have been proposed—most notably adversarial training (Goodfellow, Shlens, and Szegedy 2015)—truly robust deep learning remains unattainable even for models that are considered “shallow” by today’s standards (Kolter 2023).

Another obvious challenge of increased complexity is our own lack of human understanding with respect to the decision logic underlying these models: as one recent work puts it, “nobody understands deep learning” (Prince 2023). This, too, has attracted much criticism: O’Neil (2016) pointed to the dangers of deploying such opaque models in the real world; Buolamwini and Gebru (2018) uncovered hidden biases of supposedly ‘neutral algorithms’ and Rudin (2019) argued against using opaque models altogether. On the other end of the spectrum, the “black-box” challenge (as it is sometimes called) has attracted an abundance of research on explainable AI (XAI), a paradigm that focuses on the development of tools to derive (post-hoc) explanations from complex model representations. Such explanations should mitigate a scenario in which practitioners deploy opaque models and blindly rely on their predictions. Effective XAI tools hold the promise of not only aiding us in monitoring models, but also providing recourse to individuals subjected to them.

Part of the problem is that the high degrees of freedom provide room for many solutions that are locally optimal with respect to narrow objectives (Wilson 2020).1 Indeed, recent work on the so-called “lottery ticket hypothesis” suggests that modern neural networks can be pruned by up to 90% while preserving their predictive performance (Frankle and Carbin 2019). Similarly, Zhang et al. (2021) showed that state-of-the-art neural networks are expressive enough to fit randomly labeled data. Thus, looking at the predictive performance alone, the solutions may seem to provide compelling explanations for the data, when in fact they are based on purely associative, semantically meaningless patterns.

While we believe that for large enough models, bullet-proof explainability remains as unattainable as robustness, the contributions of this thesis demonstrate that XAI tools can help us to not only shed light on the solution space, but also tame it. We will show that it is important not to simply seek and isolate model explanations that satisfy us, but rather to think of explanations as distributional quantities that depend on both the underlying data and the model. By faithfully presenting the whole spectrum of these distributions and inducing models to be aligned with the subset of explanations that humans consider meaningful, XAI can make fundamental progress towards trustworthy AI.

1.1 Trustworthy Artificial Intelligence

Trustworthy AI is a relatively novel term spanning a broad field of research. It covers a range of subtopics including fairness, ethics, societal impact and explainability. Varshney (2022) represents the first concerted effort towards unifying and defining related concepts in a single self-contained resource. The urgency for this kind of effort and the field as a whole first crystallized in the early 2010s when both industry and regulators began using AI to process the vast amounts of data afforded by the digitalization of society. From recommender systems used by tech giants to tailor consumer advertisements to natural language processing (NLP) used by central banks to monitor economic sentiment—everyone has been eager to innovate in this space. But aside from innovation and progress, novel and disruptive technologies also generally present society with new challenges.

O’Neil (2016) was among the first to point out some of these challenges in her influential book ‘Weapons of Math Destruction’. Backed by numerous real-world examples, O’Neil (2016) makes a striking case for why we should be very careful and even skeptical of using opaque algorithms to organize society. At a time when some governments have chosen to co-operate with the tech industry to monitor and organize sensitive social security data of their constituents and even deploy AI in the context of nuclear security (OpenAI 2025; Field 2025), O’Neil’s warnings seem more relevant than ever. AI is not inherently good or bad, but it also will not hold itself accountable for any real-world consequences—good or bad. That remains our responsibility and ongoing efforts towards trustworthy AI play an important role in fulfilling it.

While that responsibility has mostly been shunned by those intent on moving fast and breaking things2, we remain cautiously optimistic about improving things from the inside, much in line with Varshney (2022). Conscious of the “increasingly sociotechnical nature of machine learning”, he defines trustworthy AI in terms of its interaction with humans and society at large. Since we will refer back to this at times, we restate his definition of trustworthy AI here:

Definition 1.1 (Trustworthy AI) For an AI system to be considered trustworthy, it needs to fulfill the following criteria:

  1. Achieve basic performance at the task it is intended to be used for.
  2. Achieve this performance reliably, i.e. safely, fairly and robustly.
  3. Facilitate human interaction through predictability, understandability and ideally transparency.
  4. Be aligned with our agenda.

Even though modern AI systems generally fail to comply with this definition and “corporations do not trust artificial intelligence and machine learning in critical enterprise workflows” (Varshney 2022), Definition 1.1 provides goal posts that are not out of reach. In fact, we would argue that at least in those cases where we are sufficiently tempted to use AI today, basic performance is usually not an issue. In this thesis, we typically address one or more of the remaining three criteria under the assumption that some complex AI tool is preferable to a simpler solution in terms of performance. We are well aware that this assumption does not always hold, and simple tools are often preferable (Rudin 2019). Still, we believe that complex, opaque models are here to stay and hence we aim to contribute towards making them more trustworthy.

Before we zoom in on the subdomain of trustworthy AI that is most relevant to this thesis, it is worth pointing out that contributions in this space, especially regarding sociotechnical problems, have come from a variety of disciplines and communities (Buszydlik et al. 2025). These communities are not generally focused on providing technical solutions for AI, which stands in contrast to the domain of Explainable AI introduced in the following section (Buszydlik et al. 2024).

1.2 Explainable Artificial Intelligence

Considering Definition 1.1, our work contributes primarily to improving AI’s potential for criterion 3: human interaction. Specifically, most of our methodological contributions are geared towards fostering predictability and understandability of models through the means of Explainable AI (XAI). The field of XAI is concerned with creating methods that improve the explainability of models and thus foster human interaction and trust (Arrieta et al. 2020). This subfield of trustworthy AI is active and large. We will once again refrain from attempting to provide a detailed general introduction to the topic and instead refer readers to Molnar (2022) for a comprehensive overview. For the remainder of this work, it suffices to understand that our methodological contributions largely fall into the category of post-hoc, local explanations for opaque, supervised models. This family of models most notably includes ANNs, but also other popular ML models such as random forests, XGBoost and Support Vector Machines (SVM). We distinguish opaque explainable models from inherently interpretable models (Rudin 2019). The latter category includes models that are interpretable by design, such as linear regression, logistic regression and shallow decision trees (Molnar 2022).

Local explanations are local in that they apply to individual samples, sometimes synonymously referred to as instances or inputs. Specifically, they explain the mapping from individual inputs to predictions for opaque models (Molnar 2022). Among the most popular local explanation methods are LIME (Ribeiro, Singh, and Guestrin 2016), SHAP (Lundberg and Lee 2017) and counterfactual explanations (CE) (Wachter, Mittelstadt, and Russell 2017). LIME and SHAP are closely related in that they both use locally additive, linear and interpretable surrogate models to explain the predictions made by the opaque model. SHAP, in particular, has gained huge popularity among researchers and practitioners, likely because it is solidly rooted in game theory and readily available through multiple open-source software implementations (Molnar 2022). Both LIME and SHAP rely on input perturbations in the local neighborhood of individual instances to construct the surrogate explanation model, which makes them vulnerable to adversaries (Slack et al. 2020). The reliance on surrogate models is one key feature that distinguishes LIME and SHAP from counterfactual explanations, the method we focus on in this work.
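
To make the surrogate idea concrete, the following toy sketch is our own illustration, not the reference LIME implementation (which additionally relies on interpretable feature representations and sparse regression): it fits a proximity-weighted linear model to an opaque model’s predictions around a single instance, and the resulting slope coefficients act as local feature attributions.

```julia
# Toy sketch of the local-surrogate idea behind LIME (not the reference implementation):
# fit a proximity-weighted linear model to the opaque model's predictions around x0.
using LinearAlgebra, Random

opaque(x) = tanh(3 * x[1] - 2 * x[2]^2)            # stand-in for an opaque model
x0 = [0.5, -0.2]                                   # instance to be explained

Random.seed!(42)
n, width = 500, 0.3
Z = x0' .+ width .* randn(n, 2)                    # perturbations around x0
y = [opaque(Z[i, :]) for i in 1:n]                 # opaque-model predictions
w = exp.(-[norm(Z[i, :] - x0)^2 for i in 1:n] ./ width^2)  # proximity weights

# Weighted least squares: the slope coefficients act as local feature attributions.
A = hcat(ones(n), Z)
β = (A' * Diagonal(w) * A) \ (A' * Diagonal(w) * y)
@show β[2:end]
```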

1.3 Counterfactual Explanations

Instead of locally approximating the behavior of opaque models, counterfactual explanations work under the premise of evaluating perturbed inputs directly with respect to the opaque model. Specifically, valid counterfactuals are perturbed inputs that yield some pre-determined change in the prediction of the model. This makes the interpretation of counterfactual explanations intuitive and straightforward: perturbations to individual inputs tell us directly what type of feature changes would have been necessary to yield some desired prediction (Molnar 2022).

Typically, counterfactuals are generated with the objective of minimizing those necessary changes, which is how the counterfactual search objective was originally framed in the seminal work by Wachter, Mittelstadt, and Russell (2017). This makes sense if we think of CE as a means to provide algorithmic recourse (AR) to individuals adversely affected by automated decision-making (ADM) systems (a.k.a. “Weapons of Math Destruction” (O’Neil 2016)). In this context, minimal changes to features can be thought of as minimal costs to individuals who need to implement recourse to change negative into positive predictions and outcomes. But as we will demonstrate in this thesis, minimizing costs to individuals in this way neglects the downstream effects that individual recourse can be expected to have on the broader group of stakeholders. Still, proximity—in terms of minimal distance or cost—is one of the core desiderata for counterfactuals (Verma et al. 2022; Karimi et al. 2021).
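
In its simplest and most widely cited form (a schematic statement of the idea rather than the exact Lagrangian formulation of the original paper), the counterfactual search problem can be written as

\[
x^\prime = \arg\min_{x^\prime} \; \text{yloss}\big(M(x^\prime), y^+\big) + \lambda \, \text{cost}(x^\prime, x),
\]

where \(M\) denotes the opaque model, \(y^+\) the desired target prediction, \(x\) the factual instance and \(\text{cost}(\cdot)\) a distance-based penalty (for example a weighted \(\ell_1\) norm) that encodes proximity. The hyperparameter \(\lambda\) balances validity against the cost of the proposed changes.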

Of course, even minimal feature changes may be infeasible in practice: individuals cannot change their height, age or ethnicity, for example, but if a model is sensitive to these features, then unconstrained counterfactuals will inevitably reflect this and the resulting recourse recommendations will not be actionable. Actionability is therefore another key desideratum for CE and AR that has received attention in the literature (Ustun, Spangher, and Liu 2019). We will see that at least for counterfactual explainers that rely on gradient-based optimization, it is straightforward to respect actionability constraints. But while this is fortunate news with respect to actionable recourse, we will also argue that actionability constraints should really be addressed before the inference stage, during model training. For models with high degrees of freedom, this is of course not trivial.
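
The following minimal sketch (our own illustration under simplifying assumptions, not the implementation used in later chapters) shows how a gradient-based counterfactual search can respect actionability by simply masking the gradients of immutable features; a hypothetical linear classifier stands in for the opaque model.

```julia
# Minimal sketch (not the implementation used in later chapters): gradient-based
# counterfactual search in which an actionability mask freezes immutable features.
using Zygote

# Hypothetical linear probabilistic classifier standing in for the opaque model.
σ(z) = 1 / (1 + exp(-z))
w, b = [1.5, -2.0, 0.8], 0.1
predict(x) = σ(w' * x + b)

# Wachter-style objective: loss towards the target plus a proximity penalty.
loss(z, x, target; λ=0.1) = (predict(z) - target)^2 + λ * sum(abs2, z - x)

# Features flagged `false` in `actionable` (e.g. age) have their gradients zeroed out,
# so the search can never move along those dimensions.
function counterfactual(x, target, actionable; η=0.1, steps=200)
    xcf = copy(x)
    for _ in 1:steps
        g = Zygote.gradient(z -> loss(z, x, target), xcf)[1]
        xcf -= η .* (g .* actionable)
    end
    return xcf
end

x_factual = [0.2, 1.0, -0.5]
x_cf = counterfactual(x_factual, 1.0, [true, false, true])  # second feature held fixed
```

Because the masked dimensions never receive updates, the resulting recommendation only involves features the individual can actually act on.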

By now it may already be obvious that counterfactual explanations are not unique. After all, we can perturb features in many, possibly infinite ways to achieve some desired prediction. Suppose, for example, that we have an opaque model that predicts whether individuals qualify for a loan to purchase a home. In this case, the outcome of interest is binary from the perspective of individuals affected by the model: \(\{0:=\text{empty hands},1:=\text{homeowners}\}\). Assuming the model is any conventional classification model, there exist infinitely many unique counterfactual states on either side of the decision boundary. This inherent multiplicity of explanations has been described as a limitation of CE in some places (Molnar 2022), presumably because it challenges us to form a feasible and desirable subset of explanations. Much of the existing work in this field has indeed been focused on designing methodologies for generating counterfactuals that meet certain desiderata. Some have explicitly embraced the multiplicity of explanations and argued that it is desirable to end up with a diverse set of counterfactuals (Mothilal, Sharma, and Tan 2020). In the context of algorithmic recourse, this corresponds to offering individuals a menu of recourse recommendations to choose from according to their own preferences.

Apart from proximity and diversity, various works have proposed methods aimed at ensuring plausibility of explanations (Joshi et al. 2019; Poyiadzi et al. 2020; Schut et al. 2021). The guiding principle is to generate counterfactuals that are close to the data manifold in the target domain. Since the target domain is generally different from the factual domain—that is, the domain the instance originally belongs to—any improvements with respect to plausibility inevitably decrease proximity: a counterfactual cannot be close to its factual and the target manifold at the same time. These types of trade-offs between different desiderata are not uncommon, although fortunately different desiderata also tend to complement each other. Plausibility, for example, has also been linked to robustness of counterfactuals (Artelt et al. 2021), where explanations are considered robust to the extent that they remain valid if the model or data changes (Pawelczyk et al. 2023). Robustness of counterfactuals has in turn been linked to diversity (Leofante and Potyka 2024).

Navigating the sheer number of desiderata for CE and their interplay can be challenging: depending on the context, domain and even individual users, one may need to optimize for one desideratum at the cost of another. In this thesis, we offer one guiding principle that should help researchers and practitioners in this respect. Specifically, we argue and demonstrate that counterfactual explanations should first and foremost be faithful to the model in question. In other words, counterfactuals should be consistent with what the model has learned about the underlying problem and data. Faithfulness has previously been largely ignored by researchers, but we demonstrate that neglecting this desideratum can lead to undesirable outcomes. It is, for example, generally possible to generate plausible counterfactuals for even the most fragile and untrustworthy models that were optimized solely for accuracy. But if these counterfactuals do not faithfully explain model behavior, they are not only of little use but potentially misleading, instilling a false sense of trust in poorly trained models.

1.4 Trustworthy AI in the Age of LLMs

Existing challenges with respect to the trustworthiness of opaque AI models have become more pressing in recent years as the scale and potential impact of AI systems on society has increased in the age of LLMs. Following the release of ChatGPT, even some of the most influential and respected AI researchers were in such awe that they publicly expressed concern about our ability to control these systems, spurring an active debate and research on AI safety and explainability (Future of Life Institute 2023). An emerging line of research in this context is mechanistic interpretability, which aims to shed light on the inner workings of vast neural networks.

There have been promising advances in this field that aid us in understanding, monitoring and controlling the tools we have so readily deployed in society (Bereska and Gavves 2024). Unfortunately though, there has also been a tendency in some circles to jump from interpretability findings to premature conclusions about AGI. As a final research contribution of this thesis, we critically assess this trend and call for greater caution and modesty in interpreting and presenting such findings.

1.5 Goals and Research Questions

As stated earlier, the principal goal of this work is to contribute methods that help us in making opaque AI models more trustworthy. Since the field of trustworthy AI is still relatively young, it is important that research and any related software are made widely and openly accessible to other researchers and practitioners.

1.5.1 Goals

The principal goals of this thesis are as follows:

  1. Explore and challenge existing technologies and paradigms in trustworthy AI, in particular with respect to explainability.
  2. Improve our ability to hold complex machine learning models accountable through novel methods that facilitate thorough scrutiny.
  3. Leverage the results of such scrutiny to aid us in building models that are inherently more trustworthy.

General principles that have played a role in achieving all of these goals include a strong adherence to best practices for producing reproducible and accessible research, as well as open-source software. The remainder of Section 1.5 dives deeper into more granular research questions that have grown out of these principal goals.

1.5.2 Counterfactual Explanations and Open-Source

Open-source software implementations of LIME and SHAP have contributed to the popularity of these methods (Molnar 2022) and we strive to achieve the same outcome for counterfactual explanations. Specifically, we aim to make existing work in the field readily available and in doing so, we hope to inform our own research about any existing gaps, challenges or open questions. Ultimately, it is our goal to contribute methodological advances accompanied by state-of-the-art open-source software that enable researchers and practitioners to not only better understand the behavior of opaque AI models, but also use that understanding in order to improve their trustworthiness.

To achieve this goal, we begin our research trajectory with the following question:

Thesis Research Question 1.1: Counterfactual Explanations and Open-Source
What are counterfactual explanations, why are they useful for trustworthy AI and what gaps are there in the existing open-source software landscape?

1.5.3 Dynamics of CE and AR

As part of answering this question, in Chapter 2 we introduce a novel, comprehensive and highly performant software implementation for generating counterfactual explanations in the Julia programming language (Bezanson et al. 2017; Altmeyer, Deursen, and Liem 2023). This is a first important step towards facilitating human interaction with opaque AI in the context of this thesis (Definition 1.1). The fast performance of Julia and our package allows us to explore previously untapped challenges that relate to the dynamics of counterfactual explanations (Verma et al. 2022). In particular, we ask ourselves:

Thesis Research Question 1.2: Dynamics of CE and AR
What dynamics are generated when off-the-shelf solutions to CE and AR are implemented in practice?

1.5.4 Plausibility and Faithfulness

In consideration of Definition 1.1, we specifically wonder if, by facilitating human interaction, we risk creating adverse effects on other aspects of trustworthiness, including basic model performance and reliability. Answering these questions requires computationally expensive simulations that involve repeatedly generating CE and AR and (re-)training machine learning models. Findings from such simulations help us to uncover consequences that were difficult to predict when designing initial objectives for individual recourse. Our work on this question makes it clear that a narrow focus on minimizing costs to individuals can create dynamics that are costly to other individuals and stakeholders (Chapter 3). To avoid such endogenous dynamics, CE and AR need to be consistent with the data-generating process, which we have referred to above as ‘plausible’. Since existing work on generating plausible counterfactuals typically involves surrogate models that are not strictly needed to generate valid CE, we wonder:

Thesis Research Question 1.3: Plausibility and Faithfulness
Can we generate plausible counterfactuals relying only on the opaque model itself?

1.5.5 Counterfactual Training

We find that this is not only possible, but also constitutes a cleaner and more principled approach towards explaining models through counterfactuals. It mitigates the risk of entangling the behavior of the opaque model with the surrogate. We demonstrate that only faithful explanations enable us to distinguish trustworthy from untrustworthy models (Chapter 4). We consider this one of the key steps towards truly understanding the behavior of opaque models and thus fostering meaningful human interaction (Definition 1.1). It allows us to ask the following question:

Thesis Research Question 1.4: Counterfactual Training
How can we leverage faithful counterfactual explanations during training to build more trustworthy models?

Suppose we have trained some opaque model that achieves good basic (predictive) performance, but faithful explanations reveal that it is untrustworthy. In other words, the supervised model excels at its narrow discriminative objective by making predictions based on associations in the data that are not meaningful to humans. Knowing that this model is not trustworthy is useful in and of itself, but in the absence of a more principled framework to act on this information it creates a dilemma: should we still go ahead and use the model, or discard it in favor of a more trustworthy but possibly less performant alternative? Ideally, we would like to have the best of both worlds by improving the trustworthiness of the performant model. Since we typically have a pre-defined notion of meaningful explanations for data, we wonder if it is possible to use faithful explanations as feedback for models during training. Our work on this question directly targets the alignment aspect of Definition 1.1 and indirectly improves all other aspects (Chapter 5).

1.5.6 Trustworthy AI and LLMs

Even though our work remains focused on contributions to core research questions in the field of CE and AR, we are not oblivious to the advancements and potential societal impacts of LLMs. It is therefore natural to ask ourselves to what extent existing work on trustworthy AI (including our own) can play a role in better understanding the behavior of these models. In particular, we ask:

Thesis Research Question 1.5: Trustworthy AI and LLMs
Can we explain the predictions of LLMs and do recent findings from mechanistic interpretability really hint at AGI?

The first part of this question is naturally aligned with the broader scope of this work. The second part is a reaction to concerning trends and tendencies of some fellow researchers to make unscientific claims about AGI based on questionable evidence. We find it necessary to distance ourselves from such practices and to caution other researchers against them, because we believe they dampen the credibility of otherwise valid and valiant efforts towards improved trustworthiness through mechanistic interpretability (Chapter 6).

1.6 Research Methodology

This work has predominantly relied on quantitative methods and software development. Development has often informed research and vice versa.

1.6.1 Quantitative Methods

All chapters contain descriptions and mathematical expositions of specific quantitative methods, as well as computational experiments involving synthetic, vision and real-world tabular datasets. Since counterfactual explanations involve a counterfactual search objective, optimization—in particular stochastic gradient-based optimization—has been the main quantitative method that unites Chapter 2 to Chapter 5. Across these chapters we also make use of simulations (Chapter 3), bootstrapping (most notably Chapter 4 and Chapter 5), statistical divergence measures, confidence intervals and hypothesis testing. We also borrow and adapt established methods from contrastive learning (Chapter 4 and Chapter 5), robust (adversarial) learning (Chapter 5), and conformal prediction (Chapter 4). Chapter 5 also involves a formal mathematical proof. In Chapter 6, we employ tools from mechanistic interpretability for LLMs such as linear probes and propose a specific hypothesis test. All of our research works involve deep learning and other machine learning models. Quantitative methods that were employed only indirectly, if at all, in the individual chapters but nonetheless played an important role in our research and development process include: Laplace approximation, Bayesian deep learning, (variational) autoencoders, decision trees and tree-based algorithms. Finally, we have made heavy use of multiprocessing and multithreading to run extensive computational experiments as part of Chapter 4 and Chapter 5.

1.6.2 Interdisciplinary Research

During his previous employment as an economist at the Bank of England (Appendix C), the author of this dissertation realized that despite a growing appetite for AI, monetary policymakers were rightly skeptical of models they could not fully understand or trust—after all, the decisions made by central banks affect the lives of entire populations. This background has helped shape much of the work in this thesis, because it has enabled the author to consider certain problems from a unique interdisciplinary angle. Some chapters of this thesis are indeed interdisciplinary in that they are characterized by a bridging of expertise in finance and economics with expertise in machine learning: Chapter 3 essentially reformulates algorithmic recourse as a scarce resource over which multiple stakeholders compete; Chapter 6 involves elements and data from economics, finance and psychology, driven by the diverse academic and professional backgrounds of the group of authors; and elements of this thesis, including Chapter 2, Chapter 4 and Chapter 6, were presented during invited talks at central banks and other financial institutions such as the Bank of England, De Nederlandsche Bank and the Verbond van Verzekeraars.

A specific example that plays a role in the context of Chapter 6 should help to illustrate how this work has benefited from interdisciplinary perspectives: the concept of “emergence” in complex AI systems, which has been tied to AGI in some places. One can draw a parallel to the “emergence” of asset price bubbles in financial markets (which are complex systems): asset bubbles involve prolonged and often dramatic increases in prices, far beyond the fundamental value of assets. While they may involve rational and predictable behavior of individual economic agents (Brunnermeier 2016), their emergence is notoriously hard to explain, and they typically create substantial economic damage (Mishkin et al. 2008). Economists have proposed no shortage of models and methods to explain and detect bubbles, but to the best of our knowledge none has ever attributed such asset price dynamics to some latent intelligence of markets.

1.6.3 FAIR Data and Software Management

Throughout this project, we have made an effort to comply with FAIR data principles (Wilkinson et al. 2016). All of our research papers and the accompanying code bases are maintained in version-controlled repositories, which are organized and documented according to best practices either as a Julia project or—in most cases—a fully-fledged package (see Table 1.1 below). In both cases, Julia’s package manager Pkg.jl handles all dependencies as specified in the Project.toml files contained in the repositories. Projects can be forked and cloned to local machines, while packages can be installed from running Julia sessions using Pkg.jl. We use Zenodo and 4TU.ResearchData to permanently archive research results on the web and create digital object identifiers (DOI) for individual releases of the various code bases. These releases are generally managed using semantic versioning (SemVer). Relevant DOIs specific to the individual papers are listed in Table 1.1. All of our experiments rely on publicly available datasets, so in terms of new data, besides the software itself, we only release our research results. Consistent with TU Delft’s Open Access policy, all research papers included in this thesis have been made freely available on the pure.tudelft.nl repository.
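
For example, assuming a package from the ecosystem is registered in Julia’s General registry (as is the case for CounterfactualExplanations.jl at the time of writing), it can be installed from any running Julia session:

```julia
# Install a registered package from a running Julia session using the built-in
# package manager.
using Pkg
Pkg.add("CounterfactualExplanations")
```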

1.7 Outline and Contributions

So far we have presented the overarching topics and questions that have shaped this work with occasional references to where they appear in the remainder of this thesis. In this final section of the introduction, we provide an outline of what follows along with detailed descriptions of our contributions. The body of this thesis consists of independent and original research papers that have been peer-reviewed and published (Chapter 2 to Chapter 6). They each individually address different thesis research questions outlined above and contribute to varying aspects of Definition 1.1. Unless explicitly stated otherwise, the papers are included in their original form to ensure their integrity. Only minor modifications have been made, if any at all.

In Chapter 2, we present CounterfactualExplanations.jl: a package for generating Counterfactual Explanations (CE) and Algorithmic Recourse (AR) for opaque machine learning models in Julia. We discuss the usefulness of CE for explainable AI and demonstrate the functionality of the package. The package is straightforward to use and designed with a focus on customization, extensibility and performance. It is the de facto go-to package for counterfactual explanations in Julia and among the most prominent packages for XAI in the language: at the time of writing, the package has received well over 100 stars on GitHub—somewhat higher but broadly in the same range as ExplainableAI.jl and ShapML; the package also counts over ten contributors, was the main target of a successful Julia Seasons of Contributions project and has been presented to the developer community in main talks at JuliaCon 2022 and 2024.
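
To give a flavour of the intended workflow, the following sketch reflects our recollection of the package’s documented high-level interface at the time of writing; exact function names, model symbols and dataset loaders may differ across versions and should be checked against the online documentation.

```julia
# Sketch of the high-level workflow (names based on the package documentation at the
# time of writing; consult the online docs for the current API).
using CounterfactualExplanations
using CounterfactualExplanations.Models
using TaijaData

# Wrap a toy dataset and fit one of the package's built-in models.
data = CounterfactualData(load_linearly_separable()...)
M = fit_model(data, :Linear)

# Pick a factual instance and a target label, then run the counterfactual search
# with one of the off-the-shelf generators.
x = select_factual(data, 1)
ce = generate_counterfactual(x, 2, data, M, GenericGenerator())
```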

We have developed extensive research software in Julia (Bezanson et al. 2017), utilizing other languages including R, Python and Lua in supporting functions. A result of this—and a major contribution of this thesis—is the Taija package ecosystem for trustworthy AI in Julia (67 followers and 24 contributors on GitHub). It includes packages for model explainability (CounterfactualExplanations.jl), predictive uncertainty quantification (ConformalPrediction.jl [142 stars], LaplaceRedux.jl [47 stars]), Bayesian deep learning (LaplaceRedux.jl) and energy-based models (JointEnergyModels.jl). Additionally, there are a number of meta packages that ship supporting functionality for the core packages: visualizations (TaijaPlotting.jl), datasets for testing and benchmarking (TaijaData.jl) and parallelization (TaijaParallel.jl). The ecosystem has attracted contributions through software projects at TU Delft, as well as Google Summer of Code and Julia Seasons of Contributions (in this context, see also Appendix B on supervision engagements).

While Chapter 2 is first and foremost a developer-friendly introduction to our research software package, we include benchmarks of several popular methods for generating CE as part of the exposition of its functionality. The work was presented at JuliaCon Global 2022 and published in proceedings (Altmeyer, Deursen, and Liem 2023). The chapter makes the following main contributions to the thesis and the field of explainable AI as a whole:

  • We fill a gap in the existing open-source software landscape for counterfactual explanations and thus directly address the aspect of human interaction that is needed for trustworthy AI.
  • The choice of Julia as a modern, open-source and highly performant programming language facilitates experimentation with CE methods at an unprecedented scale.
  • The vast online documentation accompanying the package and the paper provides an actively maintained, up-to-date introduction not only to our research software, but also the field of CE more generally.
  • CounterfactualExplanations.jl has not only powered most of the experiments presented in this thesis, but also external research. It has also laid the foundation for a growing ecosystem of packages geared towards trustworthy AI in Julia.

Chapter 3 is the first traditional research contribution of this thesis. It explores what has been identified in Verma et al. (2022) as one of the core research challenges for the field: the dynamics of recourse. Existing work on CE and AR has largely focused on single individuals in a static environment: given some estimated model, the goal is to find valid counterfactuals for an individual instance that fulfill various desiderata. The ability of such counterfactuals to handle dynamics like data and model drift remains a largely unexplored research challenge. There has also been surprisingly little work on the related question of how the actual implementation of recourse by one individual may affect other individuals. Through this work, we aim to close that gap. We first show that many of the existing methodologies can be collectively described by a generalized framework. We then argue that the existing framework does not account for a hidden external cost of recourse that only reveals itself when studying the endogenous dynamics of recourse at the group level. Through simulation experiments involving various state-of-the-art counterfactual generators and several benchmark datasets, we generate large numbers of counterfactuals and study the resulting domain and model shifts. We find that the induced shifts are substantial enough to likely impede the applicability of AR in some situations. Fortunately, we find various strategies to mitigate these concerns. Our simulation framework for studying recourse dynamics is fast and open-sourced. This chapter was originally published at the first IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) in 2023 (Altmeyer et al. 2023). The key contributions of this work are as follows:

  • It demonstrates that long-held beliefs as to what defines optimality in AR may not always be suitable. Specifically, our experiments show that the application of recourse in practice using off-the-shelf CE methods induces substantial domain and model shifts.
  • We argue that these shifts should be considered as a negative externality of individual recourse and call for a paradigm shift from individual to collective recourse in these types of situations.
  • By proposing an adapted counterfactual search objective that incorporates this hidden cost (sketched schematically below), we make that paradigm shift explicit and show that this modified objective lends itself to mitigation strategies.
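
Schematically (a simplified reading for the purposes of this introduction, not the exact formulation developed in Chapter 3), the adapted objective extends the individual search problem from Section 1.3 with an additional penalty that accounts for costs borne by the collective:

\[
x^\prime = \arg\min_{x^\prime} \; \text{yloss}\big(M(x^\prime), y^+\big) + \lambda_1 \, \text{cost}(x^\prime, x) + \lambda_2 \, \text{extcost}(x^\prime),
\]

where \(\text{extcost}(\cdot)\) captures the external cost of individual recourse, for example in terms of the domain and model shifts it induces, and \(\lambda_2\) governs how much weight the collective perspective receives.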

In recognition of the fact that more plausible counterfactuals are less likely to cause undesirable dynamics, Chapter 4 explores this desideratum more closely. To address the need for plausible explanations, existing work has primarily relied on surrogate models to learn how the input data is distributed. This effectively reallocates the task of learning realistic explanations for the data from the model itself to the surrogate. Consequently, the generated explanations may seem plausible to humans but need not necessarily describe the behavior of the opaque model faithfully. We formalize this notion of faithfulness through the introduction of a tailored evaluation metric and propose a novel algorithmic framework for generating Energy-Constrained Conformal Counterfactuals (ECCCo) that are only as plausible as the model permits. Through extensive empirical studies, we demonstrate that ECCCo reconciles the need for faithfulness and plausibility. In particular, we show that for models with gradient access, it is possible to achieve state-of-the-art performance without the need for surrogate models. To do so, our framework relies solely on properties defining the opaque model itself by leveraging recent advances in energy-based modelling and conformal prediction. To our knowledge, this is the first venture in this direction for generating faithful counterfactual explanations. This chapter was originally published at AAAI 2024 (Altmeyer, Farmanbar, et al. 2024) and makes the following key contributions:

  • We show that established measures of model fidelity in XAI are an insufficient evaluation metric for counterfactuals and propose a definition of faithfulness that gives rise to more suitable metrics.
  • We introduce ECCCo: a novel algorithmic approach aimed at generating energy-constrained conformal counterfactuals that faithfully explain model behavior (a schematic form of the search objective is sketched below). We back this claim through extensive empirical evidence demonstrating that ECCCo attains plausibility only when appropriate.
  • The work lays the foundation for future work aimed at leveraging faithful counterfactuals to improve the trustworthiness of models.
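
Only as a reading aid (this is a schematic paraphrase based on the description above, not the exact objective stated in Chapter 4), the ECCCo search can be thought of as augmenting the standard counterfactual objective with two terms that depend only on the opaque model itself:

\[
x^\prime = \arg\min_{x^\prime} \; \text{yloss}\big(M_\theta(x^\prime), y^+\big) + \lambda_1 \, \text{cost}(x^\prime, x) + \lambda_2 \, \mathcal{E}_\theta(x^\prime) + \lambda_3 \, \Omega\big(\mathcal{C}_\theta(x^\prime)\big),
\]

where \(\mathcal{E}_\theta\) denotes an energy score derived from the model and \(\Omega(\mathcal{C}_\theta(\cdot))\) penalizes the size of the model’s conformal prediction set. Because both terms are properties of the model rather than of a surrogate, the resulting counterfactuals can only be as plausible as the model itself permits.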

Chapter 5 applies the methods developed in the previous chapter to teach models plausible and actionable explanations. We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. As discussed above, to be useful in real-world decision-making systems, counterfactuals ought to be (1) plausible with respect to the underlying data and (2) actionable with respect to the user-defined mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. As we demonstrate in Chapter 4, the common objective of developing model-agnostic explainers that deliver plausible explanations for any model is misguided and unnecessary. In Chapter 5, we therefore hold models directly accountable for the desired end goal: counterfactual training employs faithful counterfactuals ad hoc during the training phase to minimize the divergence between learned representations and plausible, actionable explanations (a schematic form of this idea follows the list of contributions below). We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable explanations while promoting robustness and preserving high predictive performance. This work will be published at the 2026 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) and makes the following key contributions:

  • We introduce the methodological framework for counterfactual training (CT) and show theoretically that it can be employed to enforce global actionability constraints.
  • Building on previous related work, we propose a new perspective on the link between CE and adversarial examples: specifically, we show and utilize the fact that gradient-based interim (‘nascent’) CE comply with the standard definition of AE, as samples that have undergone “non-random imperceptible perturbations” (Szegedy et al. 2014).
  • Through extensive experiments, we demonstrate that CT substantially improves explainability and positively contributes to the adversarial robustness of trained models without sacrificing predictive performance.
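
Schematically (again only to fix ideas; the formal objective and theoretical results are developed in Chapter 5), counterfactual training can be read as augmenting the standard empirical risk with a penalty on the divergence between the counterfactuals generated for the current model and a reference set of plausible, actionable instances:

\[
\min_{\theta} \; \mathbb{E}_{(x,y)}\Big[\ell\big(M_\theta(x), y\big)\Big] + \gamma \, \mathcal{D}\Big(\text{CE}(M_\theta), \, \mathcal{X}^{+}\Big),
\]

where \(\ell\) is the usual predictive loss, \(\text{CE}(M_\theta)\) denotes faithful counterfactuals generated for the model during training, \(\mathcal{X}^{+}\) a set of plausible and actionable reference points, \(\mathcal{D}\) some divergence measure and \(\gamma\) a hyperparameter trading off predictive performance against explanatory quality.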

The final research chapter, Chapter 6, explores challenges for trustworthy AI in the age of LLMs. We argue that recent developments in the field of AI, and particularly large language models, have created a ‘perfect storm’ for observing ‘sparks’ of Artificial General Intelligence that are spurious. Like simpler models, LLMs distill meaningful representations in their latent embeddings that have been shown to correlate with external variables. Nonetheless, the correlation of such representations has often been linked to human-like intelligence in the latter but not the former. We probe models of varying complexity including random projections, matrix decompositions, deep autoencoders and transformers: all of them successfully distill information that can be used to predict latent or external variables and yet none of them have previously been linked to AGI. We argue and empirically demonstrate that the finding of meaningful patterns in latent spaces of models cannot be seen as evidence in favor of AGI. Additionally, we review literature from the social sciences that shows that humans are prone to seek such patterns and anthropomorphize. We conclude that both the methodological setup and common public image of AI are ideal for the misinterpretation that correlations between model representations and some variables of interest are ‘caused’ by the model’s understanding of underlying ‘ground truth’ relationships. We, therefore, call for the academic community to exercise extra caution, and to be keenly aware of principles of academic integrity, in interpreting and communicating about AI research outcomes. This work was presented at ECONDAT 2024 and eventually published as a position paper at ICML 2024 (Altmeyer, Demetriou, et al. 2024). We make the following key contributions:

  • We present several experiments that may invite claims that models yield more intelligent outcomes than would have been expected—while at the same time indicating how we feel these claims should not be made. Our findings demonstrate that researchers should exercise caution when interpreting results from mechanistic interpretability.
  • To lend further weight to our argument, we present a review of social science findings that underline how prone humans are to being enticed by patterns that are not really there.
  • We also propose specific structural and cultural changes to improve the current situation by helping researchers avoid common pitfalls.

Finally, we conclude this thesis by discussing the core findings and contributions of this work and proposing directions for future research. To summarize, Table 1.1 provides an overview of the core research chapters along with links to permanent digital object identifiers.

1.8 Origins of Chapters

This final section of the introduction explains the publication history and author contributions in some more detail. All chapters have undergone thorough peer review and have been published in top-tier academic venues.3 For all chapters, Patrick Altmeyer was either lead first author or, in the case of Chapter 6, joint first author. Arie van Deursen and Cynthia C. S. Liem have primarily contributed in an editorial capacity consistent with their roles as Patrick’s supervisor and daily supervisor, respectively.

Chapter 2
This chapter was published in JuliaCon Proceedings by Patrick Altmeyer, Arie van Deursen and Cynthia C. S. Liem (2023). The work was presented by Patrick as a main talk at JuliaCon 2022.
Chapter 3
This chapter was published in 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) by Patrick Altmeyer, Giovan Angela, Aleksander Buszydlik, Karol Dobiczek, Arie van Deursen and Cynthia C. S. Liem (2023). Patrick gave an oral and poster presentation at SaTML 2023. Giovan, Aleksander and Karol were all bachelor’s students at the time, co-supervised by Patrick and Cynthia during their final-year research projects (see Appendix B for details).
Chapter 4
This chapter was published in Proceedings of the AAAI Conference on Artificial Intelligence by Patrick Altmeyer, Mojtaba Farmanbar, Arie van Deursen and Cynthia C. S. Liem (2024). Patrick was joined by Arie to present the work as a poster at AAAI 2024. Mojtaba, who was affiliated with ING Bank at the time, provided expert insights during multiple discussion and editorial meetings.
Chapter 5
This chapter has been accepted for publication at SaTML 2026 and will list Patrick Altmeyer, Aleksander Buszydlik, Arie van Deursen and Cynthia C. S. Liem as authors (2026). Aleksander joined at a later stage of the project, after finishing his master’s degree. He contributed to formal analysis, literature review and writing (both drafting and reviewing), as well as conceptualization, software and visualization for specific evaluation metrics.
Chapter 6
This chapter was published in Proceedings of the 41st International Conference on Machine Learning by Patrick Altmeyer, Andrew M. Demetriou, Antony Bartlett, Cynthia C. S. Liem (2024). Patrick presented the work as a poster at ICML 2024, but he shared the first-author role with Andrew. Patrick contributed to conceptualization, data curation, formal analysis, investigation, literature review, methodology, project administration, software, visualization and writing (both drafting and reviewing).

  1. We follow the standard ML convention, where “degrees of freedom” refer to the number of parameters estimated from data.↩︎

  2. Until 2014, “move fast and break things” was part of Meta’s official motto (then still operating under the name of Facebook). The phrase has been used to characterize the broader tech industry (see, for example, Vardi (2018)).↩︎

  3. At the time of printing, Chapter 5 has not yet been published but accepted for publication.↩︎