Explainability in Deep Neural Networks
The wild success of Deep Neural Network (DNN) models in a variety of domains has created considerable excitement in the machine learning community. Despite this success, a deep understanding of why DNNs perform so well, and whether their performance is somehow brittle, has been lacking.
About this blog post series
Two recent developments hold promise in shedding light on the behavior of DNNs, and could point the way to improving deep learning models:

The first is the discovery^{[1]} that several Deep Neural Network (DNN) models are vulnerable to adversarial examples: it is often possible to slightly perturb the input to a DNN classifier (e.g. an image classifier) in such a way that the perturbation is invisible to a human, and yet the classifier's output changes drastically. For example, a classifier that correctly labels an image as a school bus can be fooled into classifying it as an ostrich by adding an imperceptible change to the image. Besides the obvious security implications, the existence of adversarial examples suggests that DNNs may not really be learning the "essence" of a concept (which would presumably make them robust to such attacks). This opens up a variety of research avenues aimed at developing methods to train adversarially robust networks, and at examining the properties of adversarially trained networks.
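To make the idea concrete, here is a minimal NumPy sketch of how a tiny per-feature perturbation can produce a large change in a classifier's score. The linear classifier and its weights are invented for illustration; the cited paper finds such perturbations via an optimization procedure, and the gradient-sign step below is a later, simpler variant of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 784                       # e.g. a flattened 28x28 image
w = rng.normal(size=d)        # weights of a toy linear classifier (illustrative)
x = rng.normal(size=d)        # a toy "image"
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Each feature moves by at most eps (visually imperceptible for an image),
# but the logit moves by eps * sum(|w_i|), which grows with the dimension d.
eps = 0.1
x_adv = x + eps * np.sign(w)  # push the class-1 score up

print(sigmoid(w @ x), sigmoid(w @ x_adv))  # compare scores before and after
```

The key point is dimensionality: no single feature changes by more than `eps`, yet the logit shifts by `eps` times the L1 norm of the weights, which can be enormous for high-dimensional inputs like images.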

The second is the growing body of work on explainability. Although DNNs are increasingly being adopted in real-world contexts, explaining their behavior has often been difficult. Explainability is crucial for a variety of reasons, and researchers have proposed various notions of what constitutes an explanation, as well as methods to produce explanations (see the recent surveys by Ras et al.^{[2]} and Guidotti et al.^{[3]}). One specific type of explanation is referred to as attribution: attributing the output of a DNN to its input features or internal components (such as neurons in a hidden layer).
In this series of blog posts we focus on these and related topics. This post is an introduction to the notion of Explainability, and specifically Attribution, in the context of DNNs.
Importance of Explainability
DNNs are increasingly being used (or being seriously considered) to power real-world systems that produce decisions, predictions, scores or classifications that directly or indirectly impact humans. Examples include:
 deciding whether a consumer's loan application should be approved
 diagnosing a patient based on his/her symptoms and medical history
 detecting disease or malignancy in X-rays or other images
 deciding a consumer's insurance premium
 a real-time bidding (RTB) system deciding how much to bid for an ad opportunity on behalf of an advertiser, by predicting whether a consumer exposed to the ad would click or make a purchase.
 a recommender system that chooses what products to show to a user
 a news-feed ranking algorithm that decides which news articles to show to a user
The recent dramatic success of DNNs may create a temptation to treat them as opaque black boxes and simply trust that they "just work". However, in real-world applications such as the above, there are several reasons why it is crucial to be able to explain their behavior:
 Explaining what aspects of the input caused the DNN to produce a specific output can help identify weaknesses of the model, or diagnose mistakes, and these insights can guide model refinements or feature preprocessing.
 Understanding the behavior of a DNN model can help instill trust in the model, especially in safety-critical applications such as personalized healthcare.
 Understanding the distributional behavior of a DNN model can help uncover biases or spurious correlations the model has learned (due to quirks in the training data distribution), and suggest ways to improve the models by fixing issues in the training data.
 Quantifying the aggregate importance of input features in driving a model's outputs provides an understanding of the relative overall impact of different features, which can in turn help diagnose potential problems in a model, or help eliminate unimportant features from model training.
 Explainability will also become necessary in legal settings where the "blame" for a wrong decision assisted by an ML model needs to be appropriately assigned, or where compliance is required with laws such as the "right to an explanation" in the European Union's GDPR.
Notions of Explainability of DNNs
Given the importance of explainability of Deep Learning models, a natural question is: what constitutes a good explanation? A variety of notions of explanation have been proposed by researchers, but in this blog series we will focus on two types of explanations:
 Quantification of the influence of input features or internal hidden units in determining the DNN's output, either on specific input instances or in aggregate over some dataset. These are called attribution methods.
 DNNs are essentially learning representations, and so another way to explain a DNN's behavior is to interpret the representation learned by its internal hidden units or layers.
Attribution Methods for DNNs
An attribution method aims to answer the following types of questions:
How much does an input feature (or hidden unit) contribute to the output of the DNN on a specific input? And what is the overall aggregate contribution on a given data distribution?
Let us first consider feature attribution, i.e. the attribution question for an input feature. At the most fundamental level, we'd like to know how much "credit" to assign to this input feature in explaining the output of the DNN. To understand why this is not a trivial question, let us look at some simple feature attribution methods that may come to mind.
To make things precise, we denote the function computed by the DNN as $F(x)$, where $x$ is the input feature vector, say of dimension $d$. The individual feature dimensions of $x$ are denoted $x_1, x_2, \ldots, x_d$. Our aim is to compute, for each dimension $i$, an attribution $A^F_i(x)$, i.e. the contribution of feature $x_i$ to $F(x)$. For brevity, let us write $x[x_i=b_i]$ to denote the vector $x$ with $x_i$ replaced by the value $b_i$ (read this as "the vector $x$ where $x_i$ is assigned the value $b_i$").
One simple idea is to take a counterfactual viewpoint:
Compare the DNN output $F(x)$ to what the output would have been if the input feature $x_i$ were not "active", i.e. if it were replaced by some suitable "information-less" baseline value $b_i$. In other words, the attribution to feature $x_i$ is given by:
\begin{equation} A_i^F(x; b) = F(x) - F(x[x_i=b_i]), \label{eqattrchange} \tag{1} \end{equation}where we parametrize the attribution by the baseline vector $b$.
The choice of the information-less baseline very much depends on the application. If $x$ represents the pixels of an image, then the all-black image would be a suitable baseline. In certain settings, if $x$ represents a collection of continuous numerical features, then the all-zeros vector could be a reasonable baseline. If $x$ represents a concatenation of embedding vectors (for example, in NLP applications), then again the all-zeros vector could be a valid baseline.
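As a concrete sketch of equation $\eqref{eqattrchange}$, the following NumPy snippet computes the attribution of each feature by replacing it with its baseline value. The linear model, its weights, and the all-zeros baseline are invented for illustration:

```python
import numpy as np

def counterfactual_attributions(F, x, b):
    """A_i = F(x) - F(x with x_i replaced by the baseline value b_i)."""
    out = F(x)
    attrs = np.empty(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = b[i]              # "switch off" feature i
        attrs[i] = out - F(x_pert)
    return attrs

# Toy linear model F(x) = w . x, with an all-zeros baseline.
w = np.array([2.0, -1.0, 0.5])
F = lambda x: float(w @ x)
x = np.array([1.0, 3.0, 2.0])
b = np.zeros(3)
print(counterfactual_attributions(F, x, b))  # [ 2. -3.  1.], i.e. w_i * (x_i - b_i)
```

Note that each attribution requires one extra forward pass through the model, so the full attribution vector costs $d+1$ evaluations of $F$.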
Let's take a closer look at the above attribution definition. If $F(x)$ is a trivial shallow DNN computing a linear function of $x$ (with no activation function), say $$F(x) = \sum_{i=1}^d w_i x_i$$ where the $w_i$ are the learned model weights, then $A_i^F(x; b)$ is simply $w_i (x_i - b_i)$; in other words, the attribution of feature $i$ is its weight times the change of its value from the baseline. This is a perfectly reasonable definition of attribution for this special case. In fact, it satisfies a nice property:
Additivity: The sum of the attributions of all input features equals the change of $F$ from the baseline $b$ to $x$:
\begin{align*} \sum_{i=1}^d A_i^F(x; b) &= \sum_{i=1}^d w_i(x_i - b_i)\\ &= \sum_{i=1}^d w_i x_i - \sum_{i=1}^d w_i b_i \\ &= F(x) - F(b) \end{align*}
Additivity is a nice property for an attribution method to have, since it allows us to think of each feature's attribution $A_i^F(x; b)$ as the feature's contribution to the change $F(x) - F(b)$. Note that the above proof relied on the fact that $F(x)$ is linear; we wouldn't expect the method $\eqref{eqattrchange}$ to satisfy additivity when $F(x)$ is nonlinear, and DNNs necessarily involve nonlinearities. So in the general case, how can we improve upon the naive method $\eqref{eqattrchange}$? We will consider this in the next post, so stay tuned!
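To see the failure of additivity concretely, here is a small sketch where the attributions from $\eqref{eqattrchange}$ do not sum to $F(x) - F(b)$. The tiny ReLU-style function and the all-zeros baseline are chosen purely for illustration:

```python
import numpy as np

def counterfactual_attributions(F, x, b):
    """A_i = F(x) - F(x with x_i replaced by the baseline value b_i)."""
    return np.array([F(x) - F(np.where(np.arange(len(x)) == i, b, x))
                     for i in range(len(x))])

# A tiny nonlinear function: F(x) = ReLU(x_1 + x_2), with an all-zeros baseline.
F = lambda x: max(0.0, x[0] + x[1])
x = np.array([1.0, -2.0])
b = np.zeros(2)

attrs = counterfactual_attributions(F, x, b)
print(attrs.sum(), F(x) - F(b))  # -1.0 vs 0.0: the attributions do not add up
```

Intuitively, the ReLU clips the effect of each feature differently depending on the values of the *other* features, so per-feature counterfactuals no longer decompose the total change.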
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. “Intriguing Properties of Neural Networks.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1312.6199. ↩︎
Ras, Gabrielle, Marcel van Gerven, and Pim Haselager. 2018. “Explanation Methods in Deep Learning: Users, Values, Concerns and Challenges.” arXiv [cs.AI]. arXiv. http://arxiv.org/abs/1803.07517. ↩︎
Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. “A Survey of Methods for Explaining Black Box Models.” ACM Comput. Surv. 51 (5): 93:1–93:42. ↩︎