Explainability in Deep Neural Networks

About this blog post series

The wild success of Deep Neural Network (DNN) models in a variety of domains has created considerable excitement in the machine learning community. Despite this success, a deep understanding of why DNNs perform so well, and whether their performance is somehow brittle, has been lacking. Two recent developments hold promise for shedding light on the behavior of DNNs, and could point the way to improving deep learning models:

  • The discovery[1] that several DNN models are vulnerable to adversarial examples: it is often possible to slightly perturb the input to a DNN classifier (e.g. an image classifier) in such a way that the perturbation is invisible to a human, and yet the classifier's output changes drastically: for example, a classifier that correctly labels an image as a school bus can be fooled into classifying it as an ostrich by adding an imperceptible change to the image. Besides the obvious security implications, the existence of adversarial examples suggests that perhaps DNNs are not really learning the "essence" of a concept (which would presumably make them robust to such attacks). This opens up a variety of research avenues aimed at developing methods to train adversarially robust networks, and at examining the properties of adversarially trained networks.

  • Although DNNs are being increasingly adopted in real-world contexts, explaining their behavior has often been difficult. Explainability is crucial for a variety of reasons, and researchers have proposed various notions of what constitutes an explanation, as well as methods to produce explanations (see the recent surveys by Ras et al. [2] and Guidotti et al. [3]). One specific type of explanation is referred to as attribution: attributing the output of a DNN to its input features or internal components (such as neurons in a hidden layer).

In this series of blog posts we focus on these and related topics. This post is an introduction to the notion of Explainability, and specifically Attribution, in the context of DNNs.

Importance of Explainability

DNNs are increasingly being used (or being seriously considered) to power real-world systems that produce decisions, predictions, scores or classifications that directly or indirectly impact humans. Examples include:

  • deciding whether a consumer's loan application should be approved
  • diagnosing a patient based on his/her symptoms and medical history
  • detecting disease or malignancy in x-rays or other images
  • deciding a consumer's insurance premium
  • a real-time bidding (RTB) system deciding how much to bid for an ad opportunity on behalf of an advertiser, by predicting whether a consumer exposed to the ad would click or make a purchase
  • a recommender system that chooses what products to show to a user
  • a news-feed ranking algorithm that decides what news articles to show to a user

The recent dramatic success of DNNs may create a temptation to treat them as opaque black boxes and simply trust that they "just work". However, in real-world applications such as those above, there are several reasons why it is crucial to be able to explain their behavior:

  • Explaining what aspects of the input caused the DNN to produce a specific output can help identify weaknesses of the model, or diagnose mistakes, and these insights can guide model refinements or feature pre-processing.
  • Understanding the behavior of a DNN model can help instill trust in the model, especially in safety-critical applications such as personalized health-care.
  • Understanding the distributional behavior of a DNN model can help uncover biases or spurious correlations the model has learned (due to quirks in the training data distribution), and suggest ways to improve the model by fixing issues in the training data.
  • Quantifying the aggregate importance of input features in driving a model's outputs provides an understanding of the relative overall impact of different features, which can in turn help diagnose potential problems in a model, or help eliminate unimportant features from model training.
  • Explainability will also become necessary in legal scenarios where the "blame" for a wrong decision assisted by an ML model needs to be appropriately assigned, or to comply with laws such as the "right to an explanation" that is part of the European GDPR.

Notions of Explainability of DNNs

Given the importance of explainability of Deep Learning models, a natural question is: what constitutes a good explanation? A variety of notions of explanation have been proposed by researchers, but in this blog series we will focus on two types of explanations:

  • Quantification of the influence of input features or internal hidden units in determining the DNN's output, either on specific input instances or in aggregate over some dataset. These are called attribution methods.
  • DNNs are essentially learning representations, and so another way to explain a DNN's behavior is to interpret the representation learned by its internal hidden units or layers.

Attribution Methods for DNNs

An attribution method aims to answer the following types of questions:

How much does an input feature (or hidden unit) contribute to the output of the DNN on a specific input? And what is the overall aggregate contribution on a given data distribution?

Let us first consider feature attribution, which is the attribution question for an input feature. At the most fundamental level, we'd like to know how much "credit" to assign this input feature to explain the output of the DNN. To understand why this is not a trivial question, let us look at some simple feature attribution methods that may come to mind.

To make things precise, we denote the function computed by the DNN as $F(x)$, where $x$ is the input feature-vector, say of dimension $d$. The individual feature-dimensions of $x$ are denoted $x_1, x_2, \ldots, x_d$. Our aim is to compute, for each dimension $i$, an attribution $A^F_i(x)$, i.e. the contribution of feature $x_i$ to $F(x)$. For brevity, let us write $x[x_i=b_i]$ to denote the vector $x$ where $x_i$ is replaced by the value $b_i$ (read this as "the vector $x$ where $x_i$ is assigned the value $b_i$").

One simple idea is to take a counterfactual viewpoint:

Compare the DNN output $F(x)$ to what the output would have been if the input feature $x_i$ were not "active", i.e. if it were replaced by some suitable "information-less" baseline value $b_i$. In other words, the attribution to feature $x_i$ is given by:

\begin{equation} A_i^F(x; b) = F(x) - F(x[x_i=b_i]), \label{eq-attr-change} \tag{1} \end{equation}
where we parametrize the attribution by the baseline vector $b$.

The choice of the information-less baseline very much depends on the application. If $x$ represents the pixels of an image, then the all-black image would be a suitable baseline. In certain settings, if $x$ represents a collection of continuous numerical features, then the all-zeros vector could be a reasonable baseline. If $x$ represents a concatenation of embedding vectors (for example in NLP applications), then again the all-zeros vector could be a valid baseline.
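
To make the definition $\eqref{eq-attr-change}$ concrete, here is a minimal sketch in Python/NumPy. The function name counterfactual_attribution and the toy linear model in the usage example are illustrative choices (not from any particular library); in a real application, $F$ would be the forward pass of a trained DNN mapping a feature vector to a scalar output.

```python
# A minimal sketch of the counterfactual attribution in Eq. (1): for each
# feature i, replace x_i by its baseline value b_i and measure how much the
# model output changes.
import numpy as np

def counterfactual_attribution(F, x, b):
    """Return the vector of attributions A_i^F(x; b) = F(x) - F(x[x_i = b_i])."""
    fx = F(x)
    attributions = np.empty_like(x, dtype=float)
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = b[i]                  # x with feature i set to its baseline value
        attributions[i] = fx - F(x_pert)
    return attributions

# Usage on a toy linear model F(x) = 3*x_1 + 2*x_2 with an all-zeros baseline:
F = lambda x: 3.0 * x[0] + 2.0 * x[1]
x = np.array([1.0, 4.0])
b = np.zeros_like(x)
print(counterfactual_attribution(F, x, b))   # -> [3. 8.], i.e. w_i * (x_i - b_i)
```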

Let's take a closer look at the above attribution definition. If $F$ is a trivial shallow DNN computing a linear function of $x$ (with no activation function), say $$F(x) = \sum_{i=1}^d w_i x_i$$ where the $w_i$ are the learned model weights, then $A_i^F(x; b)$ is simply $w_i (x_i - b_i)$; in other words, the attribution of feature $i$ is its weight times its change from the baseline value. This is a perfectly reasonable definition of attribution for this special case. In fact it satisfies a nice property:

Additivity: The sum of the attributions of all input features equals the change of $F$ from the baseline $b$ to $x$:

\begin{align*} \sum_{i=1}^d A_i^F(x; b) &= \sum_{i=1}^d w_i(x_i - b_i)\\ &= \sum_{i=1}^d w_i x_i - \sum_{i=1}^d w_i b_i \\ &= F(x) - F(b) \end{align*}

Additivity is a nice property for an attribution method to have, since it allows us to think of each feature's attribution $A_i^F(x; b)$ as that feature's contribution to the change $F(x) - F(b)$. Note that in the above proof we relied on the fact that $F$ is linear; we wouldn't expect the method $\eqref{eq-attr-change}$ to satisfy additivity when $F$ is non-linear, and DNNs necessarily involve non-linearities (the quick check below illustrates this with a single ReLU unit). So for the general case, how can we improve upon the naive method $\eqref{eq-attr-change}$? We will consider this in the next post, so stay tuned!
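
As a quick sanity check of this last point, here is a toy illustration (again just a sketch; a single hand-written ReLU unit stands in for a DNN and is not meant to represent a trained model) showing that the per-feature changes from $\eqref{eq-attr-change}$ need not sum to $F(x) - F(b)$ once $F$ is non-linear:

```python
# Additivity check of Eq. (1) on a single ReLU unit F(x) = max(x_1 + x_2 - 1, 0).
# (A hand-picked toy function, not a trained DNN.)
import numpy as np

F = lambda x: max(x[0] + x[1] - 1.0, 0.0)
x = np.array([1.0, 1.0])
b = np.zeros_like(x)

attributions = []
for i in range(len(x)):
    x_pert = x.copy()
    x_pert[i] = b[i]                      # feature i set to its baseline value
    attributions.append(F(x) - F(x_pert))

print(attributions)      # [1.0, 1.0] -> the attributions sum to 2.0 ...
print(F(x) - F(b))       # 1.0        -> ... but the total change is only 1.0
```

Here each feature gets full credit for the output, because removing either one alone is enough to drive the ReLU to zero, so the attributions over-count the change $F(x) - F(b)$.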


  1. Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. “Intriguing Properties of Neural Networks.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1312.6199. ↩︎

  2. Ras, Gabrielle, Marcel van Gerven, and Pim Haselager. 2018. “Explanation Methods in Deep Learning: Users, Values, Concerns and Challenges.” arXiv [cs.AI]. arXiv. http://arxiv.org/abs/1803.07517. ↩︎

  3. Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. “A Survey of Methods for Explaining Black Box Models.” ACM Comput. Surv. 51 (5): 93:1–93:42. ↩︎