Explainability
Surrogate Explainers
- Ribeiro, Singh, and Guestrin (2016) propose Local Interpretable Model-Agnostic Explanations (LIME): the approach involves generating local perturbations in the input space, deriving predictions from the original classifier and then fitting a white-box model (e.g. linear regression) on this synthetic data set (see the first sketch below).
- Lundberg and Lee (2017) propose SHAP as a provably unified approach to additive feature attribution methods (including LIME) that satisfies certain desiderata. In contrast to LIME, the approach is grounded in Shapley values: it checks how the model prediction changes when a feature is added to different coalitions (subsets) of the remaining features (see the second sketch below).
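A minimal sketch of the LIME idea for tabular data, assuming only a `black_box` callable that returns class-1 probabilities (the names `local_surrogate`, `scale` and `kernel_width` are illustrative; the actual LIME implementation additionally discretises features, samples in an interpretable representation and performs feature selection):

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(black_box, x, n_samples=1000, scale=0.1, kernel_width=0.75):
    """LIME-style sketch: perturb x, query the black box, and fit a
    weighted white-box (linear) model on the synthetic neighbourhood."""
    d = len(x)
    # 1. generate local perturbations in the input space
    Z = x + np.random.normal(0.0, scale, size=(n_samples, d))
    # 2. derive predictions from the original classifier
    y = black_box(Z)
    # 3. weight the synthetic samples by proximity to x
    weights = np.exp(-np.sum((Z - x) ** 2, axis=1) / kernel_width ** 2)
    # 4. fit the interpretable surrogate; its coefficients are the explanation
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_

# e.g. local_surrogate(lambda Z: clf.predict_proba(Z)[:, 1], x_instance)
# where clf and x_instance are hypothetical placeholders
```

And a toy, exact Shapley-value computation in the spirit of SHAP, where a feature is "removed" by replacing it with a baseline value (KernelSHAP instead averages over a background data set and approximates the sum for larger feature sets):

```python
import math
from itertools import combinations
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values for one instance: average marginal contribution
    of each feature over all coalitions of the remaining features."""
    d = len(x)
    phi = np.zeros(d)

    def value(subset):
        # evaluate the model with only the features in `subset` set to x
        z = baseline.copy()
        z[list(subset)] = x[list(subset)]
        return float(f(z.reshape(1, -1))[0])

    for i in range(d):
        rest = [j for j in range(d) if j != i]
        for k in range(len(rest) + 1):
            for S in combinations(rest, k):
                w = math.factorial(k) * math.factorial(d - k - 1) / math.factorial(d)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi  # attributions sum to f(x) - f(baseline)
```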
Criticism (Surrogate Explainers)
“Explanatory models by definition do not produce 100% reliable explanations, because they are approximations. This means explanations can’t be fully trusted, and so neither can the original model.” – causaLens, 2021
- Mittelstadt, Russell, and Wachter (2019) point out that there is a gap between how computer scientists and explanation scientists (social scientists, cognitive scientists, psychologists, …) understand explanations. Current methods produce at best locally reliable explanations; there needs to be a shift towards interactive explanations.
- Rudin (2019) argues that instead of bothering with explanations for black box models we should focus on designing inherently interpretable models. In her view the trade-off between (intrinsic) explainability and performance is not as clear-cut as people claim.
- Lakkaraju and Bastani (2020) show how misleading black box explanations can manipulate users into trusting an untrustworthy model.
- Slack et al. (2020) demonstrate that neither LIME nor SHAP is reliable: their reliance on feature perturbations makes them susceptible to adversarial attacks.
- Can we quantify the robustness of surrogate explainers? (A possible metric is sketched after this list.)
- Comparison of different complexity measures.
- Design surrogate explainer that incorporates causality.
- Surrogate explainers by definition are approximations of black box models, so can never be 100% trusted.
- Design adversarially robust surrogate explainers by minimizing the divergence between observed and perturbed data (Slack et al. 2020).
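On the robustness question above, one possible (hypothetical) starting point is a local Lipschitz-style stability estimate: how much does the explanation change under small perturbations of the input? Here `explainer` is assumed to be any function mapping an instance to an attribution vector (e.g. the `local_surrogate` sketch above); the name `explanation_stability` and the perturbation scheme are illustrative.

```python
import numpy as np

def explanation_stability(explainer, x, eps=0.05, n_trials=20, seed=0):
    """Worst-case ratio between the change in the explanation and the
    change in the input, over random perturbations within an eps-ball."""
    rng = np.random.default_rng(seed)
    e_x = explainer(x)
    worst = 0.0
    for _ in range(n_trials):
        x_p = x + rng.uniform(-eps, eps, size=x.shape)
        e_p = explainer(x_p)
        worst = max(worst, np.linalg.norm(e_p - e_x) / (np.linalg.norm(x_p - x) + 1e-12))
    return worst
```

Smaller values indicate that the surrogate explanation is more stable in the neighbourhood of x.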
Counterfactual Explanations (CE)
- Wachter, Mittelstadt, and Russell (2017) were among the first to propose counterfactual explanations that do not require knowledge about the inner workings of a black box model (see the sketch at the end of this section).
- Ustun, Spangher, and Liu (2019) propose a framework for actionable recourse in the context of linear classifiers.
- Joshi et al. (2019) extend the framework of Ustun, Spangher, and Liu (2019). Their proposed REVISE method is applicable to a broader class of models including black box classifiers and structural causal models.
- Poyiadzi et al. (2020) propose FACE: feasible and actionable counterfactual explanations. The premise is that the shortest distance to the decision boundary may not be a desirable counterfactual.
- Schut et al. (2021) introduce Bayesian modelling to the context of CE: their approach implicitly minimizes aleatoric and epistemic uncertainty to generate a CE that is unambiguous and realistic, respectively.
- Test what counterfactual explanations are most desirable through a user study at ING.
- Design counterfactual explanations that incorporate causality
- There are in fact several very recent papers (including Joshi et al. (2019)) that link causal graphical models (CGMs) to counterfactual explanations (see below).
- Time series: what is the link to counterfactual analysis in multivariate time series? (e.g. chapter 4 in Kilian and Lütkepohl (2017))
- What about continuous outcome variables? (e.g. target inflation rate, … ING cases?)
- How do counterfactual explainers fare where LIME/SHAP fail? (Slack et al. 2020)
- Can counterfactual explainers be fooled much like LIME/SHAP?
- Can we establish a link between counterfactual and surrogate explainers? Important attributes identified by LIME/SHAP should play a prominent role in counterfactuals.
- Can counterfactual explainers be used to detect adversarial examples?
- Limiting behaviour: what happens if all individuals with a negative outcome move across the original decision boundary?
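For reference, a minimal sketch of a Wachter-style counterfactual search (see the first item in the list above): gradient descent on a squared prediction loss plus a distance penalty, with the black-box gradient approximated by finite differences. `predict_proba` is assumed to take a single instance and return a scalar probability, and the function name and hyperparameters are illustrative; the original formulation uses a MAD-weighted L1 distance and optimises the penalty weight lambda rather than fixing it.

```python
import numpy as np

def counterfactual(predict_proba, x, target=1.0, lam=0.1, lr=0.01, steps=500, eps=1e-4):
    """Searches for x_cf close to x (L1 penalty) whose predicted
    probability is pushed towards the target class."""
    x = np.asarray(x, dtype=float)
    x_cf = x.copy()
    for _ in range(steps):
        f0 = predict_proba(x_cf)
        # finite-difference gradient of the squared prediction loss
        grad = np.zeros_like(x_cf)
        for j in range(len(x_cf)):
            x_eps = x_cf.copy()
            x_eps[j] += eps
            grad[j] = 2.0 * (f0 - target) * (predict_proba(x_eps) - f0) / eps
        # subgradient of the L1 distance penalty
        grad += lam * np.sign(x_cf - x)
        x_cf -= lr * grad
    return x_cf
```

Methods such as REVISE and FACE replace or augment the plain distance penalty so that counterfactuals remain realistic (close to the data manifold) and reachable through high-density regions.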
References
Joshi, Shalmali, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. 2019. “Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems.” arXiv Preprint arXiv:1907.09615.
Kilian, Lutz, and Helmut Lütkepohl. 2017. Structural Vector Autoregressive Analysis. Cambridge University Press.
Lakkaraju, Himabindu, and Osbert Bastani. 2020. “‘How Do I Fool You?’ Manipulating User Trust via Misleading Black Box Explanations.” In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 79–85.
Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 4768–77.
Mittelstadt, Brent, Chris Russell, and Sandra Wachter. 2019. “Explaining Explanations in AI.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, 279–88.
Poyiadzi, Rafael, Kacper Sokol, Raul Santos-Rodriguez, Tijl De Bie, and Peter Flach. 2020. “FACE: Feasible and Actionable Counterfactual Explanations.” In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 344–50.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’ Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44.
Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1 (5): 206–15.
Schut, Lisa, Oscar Key, Rory Mc Grath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. 2021. “Generating Interpretable Counterfactual Explanations by Implicit Minimisation of Epistemic and Aleatoric Uncertainties.” In International Conference on Artificial Intelligence and Statistics, 1756–64. PMLR.
Slack, Dylan, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. “Fooling LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods.” In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 180–86.
Ustun, Berk, Alexander Spangher, and Yang Liu. 2019. “Actionable Recourse in Linear Classification.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, 10–19.
Wachter, Sandra, Brent Mittelstadt, and Chris Russell. 2017. “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR.” Harv. JL & Tech. 31: 841.