A peek inside the ‘Black Box’ - interpreting neural networks

Propelled by advancements in modern computer technology, deep learning has re-emerged as perhaps the most promising artificial intelligence (AI) technology of the last two decades. By treating problems as a nested hierarchy of hidden layers, deep artificial neural networks achieve the power and flexibility necessary for AI systems to navigate complex real-world environments. Unfortunately, their very nature has earned them a reputation as Black Box algorithms, and their lack of interpretability remains a major impediment to their more widespread application.
In science, research questions usually demand not just answers but also explanations, and variable selection is often as important as prediction (Ish-Horowicz et al. 2019). Economists, for example, recognise the undeniable potential of deep learning, but are rightly hesitant to employ novel tools that are not fully transparent and ultimately cannot be trusted. Similarly, real-world applications of AI have come under increasing scrutiny, with regulators demanding that individuals affected by algorithms should have the right to obtain explanations (Fan, Xiong, and Wang 2020). In high-risk decision-making fields, such as the AI systems that drive autonomous vehicles, the need for explanations is self-evident (Ish-Horowicz et al. 2019).
In light of these challenges it is not surprising that research on explainable AI has recently gained considerable momentum (Arrieta et al. 2020). While in this short essay we will focus on deep learning in particular, it should be noted that this growing body of literature is concerned with a broader realm of machine learning models. The rest of this note is structured as follows: the first section provides a brief overview of recent advancements towards interpreting deep neural networks largely drawing on Fan, Xiong, and Wang (2020); the second section considers a novel entropy-based approach towards interpretability proposed by Crawford et al. (2019); finally, in the last section we will see how this approach can be applied to deep neural networks as proposed in Ish-Horowicz et al. (2019).
Interpretable DL - a whistle-stop tour
Before delving further into how the intrinsics of deep neural networks can be disentangled, we should first clarify what interpretability in the context of algorithms actually means. Fan, Xiong, and Wang (2020) describe model interpretability simply as the extent to which humans can “understand and reason” the model. This may concern an understanding of both the ad-hoc workings of the algorithm as well as the post-hoc interpretability of its output. In the context of linear regression, for example, the ad-hoc workings of the model are often described through the intuitive idea of linearly projecting the outcome variable onto the space spanned by the covariates.
Understanding the ad-hoc intrinsic mechanisms of a DNN is inherently difficult. While transparency may generally be preserved in the presence of nonlinearity (e.g. decision trees), multiple hidden layers, each involving nonlinear operations, are usually beyond the realm of human comprehension (Fan, Xiong, and Wang 2020). Training also generally involves the optimization of non-convex functions, which exhibit an increasing number of saddle points as the dimensionality increases (Fan, Xiong, and Wang 2020). Methods to circumvent this problem usually boil down to decreasing the overall complexity, either by regularizing the model or through proxy methods. Regularization – while traditionally used to avoid overfitting – has been found to be useful for creating more interpretable representations. Monotonicity constraints, for example, impose that, as the value of a specified covariate increases, model predictions either monotonically increase or decrease. Proxy methods construct simpler representations of a learned DNN, such as a rule-based decision tree. This essentially involves repeatedly querying the trained network while varying the inputs and then deriving decision rules from the model output.
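To make the proxy idea concrete, the following sketch distils a trained model into a shallow decision tree by querying it on a set of inputs and fitting the tree to its predictions. The `model_predict` function below is merely a stand-in for a trained network, and scikit-learn's decision tree is just one of many possible rule-based surrogates.

```python
# Proxy method: distil a trained model into a shallow, rule-based decision tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                       # inputs used to query the model

# Stand-in for a trained network: any callable that returns predictions works here.
model_predict = lambda X: np.tanh(X[:, 0] * X[:, 1]) + 0.5 * X[:, 2]
y_hat = model_predict(X)                             # query the "network"

# Fit a shallow, human-readable tree to the model's outputs (not the raw labels).
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, y_hat)
print(export_text(surrogate, feature_names=[f"x{j}" for j in range(5)]))
```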
Post-hoc interpretability usually revolves around the understanding of feature importance. A greedy approach to this issue involves simply removing features one by one and checking how model predictions change. A more sophisticated approach along these lines is the Shapley value, which draws on cooperative game theory: it assigns varying payouts to players depending on their contribution to the overall payout. In the context of neural networks, input covariates take on the role of players and the model prediction that of the payout, so a covariate's importance is measured by its average marginal contribution to the prediction across all possible coalitions of covariates.
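As a rough illustration of this idea, the snippet below estimates the Shapley value of a single covariate by averaging its marginal contribution over randomly sampled coalitions, with covariates outside the coalition replaced by a background value (for example, the training-set mean). This is a generic Monte Carlo approximation rather than an exact computation, and all names are illustrative.

```python
import numpy as np

def shapley_value(predict, x, background, j, n_samples=500, seed=0):
    """Monte Carlo estimate of feature j's Shapley value for a single instance x.

    Covariates outside the sampled coalition are replaced by `background`
    (e.g. the training-set mean), a common, if crude, simplification."""
    rng = np.random.default_rng(seed)
    p = x.shape[0]
    contributions = []
    for _ in range(n_samples):
        perm = rng.permutation(p)
        coalition = perm[: int(np.where(perm == j)[0][0])]   # features preceding j
        x_without = background.copy()
        x_without[coalition] = x[coalition]                  # coalition without feature j
        x_with = x_without.copy()
        x_with[j] = x[j]                                     # ...and with feature j added
        contributions.append(
            predict(x_with[None, :])[0] - predict(x_without[None, :])[0]
        )
    return float(np.mean(contributions))

# Toy model: the first two covariates interact, the last one is irrelevant.
toy_predict = lambda X: X[:, 0] * X[:, 1] + 0.1 * X[:, 2]
x = np.array([1.0, 2.0, 3.0, 4.0])
background = np.zeros(4)
print([round(shapley_value(toy_predict, x, background, j), 2) for j in range(4)])
```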
The remainder of this note focuses on a novel approach to variable importance that measures entropy shifts in a learned probabilistic model in response to its inputs.
An entropy-based approach to variable importance
Ish-Horowicz et al. (2019) motivate their methodology for interpreting neural networks through Gaussian Process regression. Consider the following Bayesian regression model with Gaussian priors:

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I}), \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})
$$

This naturally gives rise to a particular example of a Gaussian Process (GP). In particular, since $\boldsymbol{\beta}$ is Gaussian, the vector of function values $\mathbf{f} = \mathbf{X}\boldsymbol{\beta}$ is itself Gaussian, and the model can equivalently be written as

$$
\mathbf{y} = \mathbf{f} + \boldsymbol{\varepsilon}, \qquad \mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}), \tag{1}
$$

where $\mathbf{K} = \tau^2\mathbf{X}\mathbf{X}^\top$ is the covariance (kernel) matrix implied by the linear model. Replacing this linear kernel with an arbitrary positive-definite kernel turns Equation 1 into a fully non-parametric GP regression model.
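A quick numerical sanity check of this equivalence: drawing many coefficient vectors from the Gaussian prior and forming $\mathbf{f} = \mathbf{X}\boldsymbol{\beta}$ reproduces the covariance $\tau^2\mathbf{X}\mathbf{X}^\top$ up to sampling error. The dimensions and the value of $\tau$ below are arbitrary.

```python
# Sanity check: f = X beta with beta ~ N(0, tau^2 I) has covariance tau^2 X X^T.
import numpy as np

rng = np.random.default_rng(0)
n, p, tau = 5, 3, 0.7
X = rng.normal(size=(n, p))

betas = rng.normal(scale=tau, size=(100_000, p))     # draws from the prior on beta
F = betas @ X.T                                      # each row is one draw of f = X beta

print(np.round(np.cov(F, rowvar=False), 2))          # empirical covariance of f
print(np.round(tau**2 * X @ X.T, 2))                 # implied linear kernel K
```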
In a standard linear regression model, the coefficients characterize the projection of the outcome variable $\mathbf{y}$ onto the space spanned by the covariates $\mathbf{X}$. The primary focus here is to learn the mapping from input to output. The key differentiating feature between this approach and the non-parametric model in Equation 1 is the fact that in the case of the latter we are interested in learning not only the mapping from inputs to outputs, but also the representation ($\mathbf{f}$) itself. Since the non-parametric model has no natural regression coefficients, Crawford et al. (2019) define an effect size analogue as the projection of the fitted function values onto the design matrix,

$$
\tilde{\boldsymbol{\beta}} = \text{Proj}(\mathbf{X}, \mathbf{f}) = \mathbf{X}^\dagger\mathbf{f},
$$

where $\mathbf{X}^\dagger$ denotes the Moore-Penrose pseudo-inverse of $\mathbf{X}$. The effect size analogue plays the role that coefficients play in the linear model, while inheriting the nonlinearities and interactions captured by $\mathbf{f}$.
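In code, computing the effect size analogue amounts to a single pseudo-inverse multiplication applied to draws from the posterior over the fitted values. The arrays below are placeholders standing in for the design matrix and posterior samples of $\mathbf{f}$.

```python
# Effect size analogue: project posterior draws of f onto the column space of X.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))                  # placeholder design matrix
F = rng.normal(size=(2000, n))               # placeholder posterior draws of f (one per row)

X_pinv = np.linalg.pinv(X)                   # Moore-Penrose pseudo-inverse, shape (p, n)
B_tilde = F @ X_pinv.T                       # one effect size analogue per posterior draw

mu_tilde = B_tilde.mean(axis=0)              # posterior mean of beta_tilde
Sigma_tilde = np.cov(B_tilde, rowvar=False)  # posterior covariance of beta_tilde
```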
The proposed methodology in Crawford et al. (2019) and Ish-Horowicz et al. (2019) depends on the availability of a posterior distribution over the effect size analogue $\tilde{\boldsymbol{\beta}}$. Variable importance is then measured through the Kullback-Leibler divergence (KLD) between the posterior of the remaining effect sizes conditional on the $j$-th effect being zero and their marginal posterior:

$$
\text{KLD}_j = \text{KL}\left( p\left(\tilde{\boldsymbol{\beta}}_{-j} \mid \tilde{\beta}_j = 0\right) \,\big\|\, p\left(\tilde{\boldsymbol{\beta}}_{-j}\right) \right).
$$

Covariates that contribute significant information to the model will have a large $\text{KLD}_j$: zeroing out their effect induces a substantial shift in the posterior distribution of the remaining effect sizes, whereas removing an uninformative covariate leaves it essentially unchanged. Normalizing across covariates yields the RelATive cEntrality (RATE) criterion,

$$
\text{RATE}_j = \frac{\text{KLD}_j}{\sum_{l=1}^{p} \text{KLD}_l},
$$

which in light of its bounds ($0 \leq \text{RATE}_j \leq 1$ with $\sum_j \text{RATE}_j = 1$) can naturally be interpreted as the relative share of information that the $j$-th covariate contributes to the model: when all RATE values hover around the uniform value $1/p$, no covariate stands out as important.
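The sketch below implements these two definitions directly for a Gaussian posterior over the effect size analogue: it conditions on $\tilde{\beta}_j = 0$, evaluates the Kullback-Leibler divergence between Gaussians in closed form, and normalizes to obtain RATE. It is a naive implementation of the definitions above rather than the optimized routines used in the original papers.

```python
# Naive RATE computation from a Gaussian posterior over the effect size analogue.
import numpy as np

def gaussian_kl(mu1, S1, mu0, S0):
    """KL( N(mu1, S1) || N(mu0, S0) ) between multivariate Gaussians."""
    k = len(mu1)
    S0_inv = np.linalg.inv(S0)
    diff = mu0 - mu1
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S0_inv @ S1) + diff @ S0_inv @ diff - k + logdet0 - logdet1)

def rate(mu, Sigma):
    """RATE for each covariate, given beta_tilde approximately N(mu, Sigma)."""
    p = len(mu)
    klds = np.empty(p)
    for j in range(p):
        idx = np.delete(np.arange(p), j)
        mu0, S0 = mu[idx], Sigma[np.ix_(idx, idx)]   # marginal of beta_tilde_{-j}
        s_j, s_jj = Sigma[idx, j], Sigma[j, j]
        mu1 = mu0 - s_j * mu[j] / s_jj               # condition on beta_tilde_j = 0
        S1 = S0 - np.outer(s_j, s_j) / s_jj
        klds[j] = gaussian_kl(mu1, S1, mu0, S0)
    return klds / klds.sum()

# Toy example: the second effect size is large, so it dominates the RATE values.
mu = np.array([0.0, 2.0, 0.1])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
print(rate(mu, Sigma))
```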
Application to Bayesian neural networks
In order to use the RATE criterion in the context of deep learning, we need to work in the Bayesian setting. Contrary to standard artificial neural networks, which work under the assumption that weights have some true latent value, Bayesian neural networks (BNNs) place a prior distribution over network parameters and hence treat weights as random variables (Goan and Fookes 2020). Not only does it perhaps seem more natural to treat unobserved weights as random, but the Bayesian setting also lets us reason about uncertainty in predictions, which can ultimately help us develop more trustworthy models (Goan and Fookes 2020). A drawback of BNNs is that exact computation of the posterior is computationally challenging and often intractable (a non-trivial issue that we will return to in a moment).
When the prior placed over the parameters is Gaussian, the output of the BNN approaches a Gaussian Process as the width of the network grows, in line with the discussion in the previous section. This is exactly the assumption that Ish-Horowicz et al. (2019) work with. They propose an architecture for a multi-layer perceptron (MLP) composed of (1) an input layer collecting the covariates $\mathbf{X}$, (2) a number of deterministic hidden layers that learn a feature representation $\mathbf{H}$ of the inputs, and (3) a final Bayesian layer whose weights $\mathbf{w}$ are assigned a Gaussian prior. Conditional on the learned representation, the network output is then Gaussian as well,

$$
\mathbf{f} = \mathbf{H}\mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{m}, \mathbf{S}) \quad\Longrightarrow\quad \mathbf{f} \mid \mathbf{H} \sim \mathcal{N}(\mathbf{H}\mathbf{m}, \mathbf{H}\mathbf{S}\mathbf{H}^\top),
$$

where $\mathbf{m}$ and $\mathbf{S}$ denote the mean and covariance of the (approximate) posterior over the final-layer weights. This Gaussian distribution over the network output takes the place of the GP posterior from the previous section, so the effect size analogue and the RATE criterion can be computed just as before.
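To fix ideas, here is a minimal sketch of this kind of architecture with made-up numbers: deterministic hidden weights produce a feature representation $\mathbf{H}$, a Gaussian posterior over the final-layer weights induces a Gaussian over the network output, and the pseudo-inverse projection yields the posterior over the effect size analogue. In practice the hidden weights and the weight posterior would of course be learned from data rather than drawn at random.

```python
# Minimal sketch: deterministic hidden layers plus a Bayesian final layer.
import numpy as np

rng = np.random.default_rng(2)
n, p, h = 100, 5, 16                         # observations, covariates, hidden units

X = rng.normal(size=(n, p))
W1 = rng.normal(size=(p, h))                 # deterministic weights (would be learned)
H = np.tanh(X @ W1)                          # feature representation from the hidden layers

# Gaussian (approximate) posterior over the final-layer weights.
m = rng.normal(size=h)                       # posterior mean
S = 0.1 * np.eye(h)                          # posterior covariance

# Implied Gaussian over the network output: f | H ~ N(Hm, H S H^T).
f_mean = H @ m
f_cov = H @ S @ H.T

# Posterior over the effect size analogue via the pseudo-inverse projection.
X_pinv = np.linalg.pinv(X)
mu_tilde = X_pinv @ f_mean
Sigma_tilde = X_pinv @ f_cov @ X_pinv.T      # these feed into the RATE computation sketched earlier
```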
Having established this basic, flexible set-up, Ish-Horowicz et al. (2019) go on to derive closed-form expressions for RATE in this setting. The details are omitted here, since the logic is largely analogous to what we learned above, but they can be found in Ish-Horowicz et al. (2019).
Conclusion
The RATE criterion originally proposed by Crawford et al. (2019) and shown to be applicable to Bayesian neural networks in Ish-Horowicz et al. (2019) offers an intuitive way to measure variable importance in the context of deep learning. By defining variable importance in terms of the information individual inputs contribute to the posterior distribution of a probabilistic model, it implicitly incorporates the interactions between covariates and the nonlinearities that the model has learned. In other words, it allows researchers to peek directly into the Black Box. This opens up interesting avenues for future research, as the approach can be readily applied in academic disciplines and real-world applications that rely heavily on the explainability of outcomes.
References
Footnotes
Simulatability describes the overall, high-level understandability of the mechanisms underlying the model – put simply, the less complex the model, the higher its simulatability. Decomposability concerns the extent to which the model can be taken apart into smaller pieces – neural networks by their very nature are compositions of multiple layers. Finally, algorithmic transparency refers to the extent to which the training of the algorithm is well-understood and to some extent observable – since DNNs generally deal with the optimization of non-convex functions and often lack a unique solution, they are inherently opaque.↩︎
For simplicity I have omitted the deterministic bias term.↩︎
Citation
@online{altmeyer2021,
author = {Altmeyer, Patrick},
title = {A Peek Inside the “{Black} {Box}” - Interpreting Neural
Networks},
date = {2021-02-07},
url = {https://www.patalt.org/blog/posts/a-peek-inside-the-black-box-interpreting-neural-networks/},
langid = {en}
}