]]>In many important applications, AI models are trained on labeled data but when deployed in the wild, labels are not readily available (for example in medical imaging where the model is identifying a cancerous patch, "ground-truth" labels may require expert examination). A critical question is -- in the
In many important applications, AI models are trained on labeled data but when deployed in the wild, labels are not readily available (for example in medical imaging where the model is identifying a cancerous patch, "ground-truth" labels may require expert examination). A critical question is -- in the absence of "ground-truth" labels, how much can we trust the predictions of the model? This blog post describes an intriguing new technique based on training an ensemble of "check" models (on the original labeled training data and the unlabeled test data) in such a way that the correctness of the main model is related to agreement with the check models. This leads to an elegant way to estimate the accuracy of the model on unlabeled data. The full paper^{[1]} will appear in NeurIPS 2021.
When a machine learning model is deployed in the wild, it can encounter test data drawn from distributions different from the training data distribution and suffer a drop in performance. For example, the deep models trained on ImageNet will have large performance drop on the test data from ObjectNet^{[2]}.
For safe deployment, it is essential to estimate the accuracy of the pre-trained model on the test data. If we have the labels for the test data, then we can simply compute the accuracy of the model. However, the labels for the test inputs are usually not immediately available in practice, and obtaining them can be expensive. This observation leads to a challenging task of unsupervised accuracy estimation -- estimating the accuracy of a pre-trained classifier on a set of unlabeled test inputs. Furthermore, it is beneficial to estimate the correctness of the predictions on individual points. This leads to an even more difficult task of error detection, which aims to identify points in the unlabeled test set that are misclassified by the pre-trained model. Such a finer-grained estimation can facilitate a further improvement of the pre-trained model (e.g., manually label those misclassified data points and retrain the model on them).
While there have been previous attempts to address unsupervised accuracy estimation or the broader problem of error detection, their successes usually rely on some assumptions that may not hold in practice. For example, a natural approach is to use metrics based on model confidence (i.e. softmax probability) to measure the performance of the pre-trained model, e.g., as in ^{[3]}. If the model is well-calibrated with respect to test data, then the average confidence on the test data approximates its accuracy quite well. However, it has been observed that many machine learning systems, in particular modern neural networks, are poorly calibrated, especially on test data with distribution shifts ^{[4]}^{[5]}. Another method is to learn a regression function that takes statistics about the model and the test data as input and predicts the performance on the test data ^{[3:1]}^{[6]}. This requires training on labeled data from various data distributions, which is very expensive or even impractical. Furthermore, the performance predictor trained on the labeled data may not generalize to unknown data distributions. Recent work by ^{[7]} proposes to learn a “check” model using domain-invariant representation and use it as a proxy for the unknown true test labels to estimate the performance via error detection. It relies on the success of the domain-invariant representation methods to obtain a highly accurate check model on the test data. Hence, the check model performance suffers when domain-invariant representation is not accurate in circumstances such as test data having outlier feature vectors or different class probabilities than the training data.
Our basic idea is to learn a “check model” $h$ and use the disagreement between $h$ and the pre-trained model $f$ for the tasks of accuracy estimation and error detection: identify a point $x$ as misclassified only if $h$ disagrees with $f$ on $x$.
It is easy to understand that the disagreement approach succeeds if: (C1) $h$ agrees with $f$ on points where $f$ is correct; and (C2) $h$ disagrees with $f$ on points where $f$ is incorrect. Our first key observation is that usually (C1) can be satisfied approximately in practice. Intuitively, if we train $h$ and $f$ using the same training data, then $h$ can be trained to be correct on the subset of the instance space where $f$ is correct. However, (C2) is a more tricky condition: when $f$ is incorrect, we would like $h$ to disagree with $f$, which means $h$ can either be correct or make a different mistake than $f$. Thus the condition (C2) may not be easily satisfied since $h$ can make similar mistakes as $f$, which then leads to an overestimation of the accuracy.
Our focus is to improve the disagreement on misclassified points and we use ensemble and self-training to achieve this. We propose to learn an ensemble of models (instead of one check model) and identify a point $x$ as mis-classified if the ensemble disagrees with $f$ on $x$ (i.e. a large fraction of the models in the ensemble disagree with $f$ on their predictions on $x$). To make the ensemble disagree with $f$ on a mis-classified test input $x$, we have two ways: one is to train accurate models in the ensemble such that they can predict correctly on the test point $x$. The other is to train a diverse ensemble such that they won’t make similar mistakes as $f$. The first way may be achievable when the training data contains information for prediction on $x$. A prototypical example is when the test inputs are corruption of clean data from the training distribution (e.g., the training data are images in sunny days while the test inputs are ones in rainy days), and techniques like unsupervised domain adaptation can be used to improve the prediction on such test inputs. However, correct predictions on $x$ may not be feasible in many interesting scenarios due to insufficient information (e.g., the test image in the open world can contain an object that is never seen in the training data). Fortunately, it has been shown that for such inputs, one can obtain an ensemble of models with diverse predictions (e.g.,^{[8]}). This then gives the second way to achieve disagreement: using diverse ensembles.
Empirically, we observe that the ensemble may only be able to identify a subset of the mis-classified points. Therefore, we propose to iteratively identify more and more mis-classified points by self-training. For each mis-classified data point $x$ identified by the ensemble, we assign it a pseudo-label that is different from $f(x)$ (e.g. use the majority vote of the ensemble or a random label as the pseudo-label). Then we can train (with regularization) a new ensemble to encourage their disagreement with $f$ on the pseudo-labeled data $R$ (e.g., use a supervised loss on $R$ with a small weight as the regularization).
Based on these intuitions, We propose a principled and practically effective framework that makes a novel use of the self-training technique on ensembles for the challenging tasks of accuracy estimation and error detection.
Theoretically, we show that our framework succeeds if in each iteration, the ensemble learned satisfy the following conditions: (A) correct on points where $f$ is correct; (B) mostly disagree with $f$ on the pseudo-labeled data $R_X$; (C) either correct or diverse on $W_X \setminus R_X$, where $W_X$ is the set of misclassified points.
Based on the success conditions of our framework, we propose two concrete ensemble learning methods $\mathcal{T}_{RI}$ and $\mathcal{T}_{RM}$. $\mathcal{T}_{RI}$ trains ensemble models with different random initializations similar to Deep Ensemble^{[8:1]}. It has been shown that deep models trained from different random initialization can be diverse on outlier data points^{[9]} and thus can satisfy our condition (C). $\mathcal{T}_{RM}$ trains ensemble models using the representation matching technique (a common approach to improve target accuracy in the domain adaptation problem). We use the checkpoint models during training as the ensemble since we observe that they can have diversity on misclassified data points empirically. Thus, for $\mathcal{T}_{RM}$, our condition (C) can be satisfied better.
We perform experiments for unsupervised accuracy estimation and error detection tasks on 59 pairs of training-test datasets from five dataset categories, including image classification and sentiment classification datasets. In summary, our findings are: (1) Our method achieves state-of-the-art results on both accuracy estimation and error detection tasks. (2) Both ensemble and self-training techniques have positive effects on the tasks and it is easy to pick suitable hyperparameters for our algorithms. (3) Empirical results show that the assumptions made in our analysis hold approximately.
In this blogpost, we present a generic algorithmic framework for the important and challenging tasks of unsupervised accuracy estimation and error detection. The instantiations of our framework show strong empirical results on various datasets for both tasks. It will be of significant interest to extend our methodology to other data modalities such as tabular data and event sequence data. Also, the current methodology applies to models whose output is a class label, and the metric of interest is accuracy. However, there are important settings where the output is a class probability. One very common example of such a setting is probabilistic binary classification, where the model outputs a probability of the “positive” class, and the metric of interest is usually the ROC-AUC or PR-AUC. We plan to extend some of our ideas to this case.
Chen, J., Liu, F., Avci, B., Wu, X., Liang, Y., & Jha, S. (2021). Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles. Advances in Neural Information Processing Systems, 34. ↩︎
Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems 32, pages 9448–9458. 2019. ↩︎
Elsahar, Hady, and Matthias Gallé. "To annotate or not? predicting performance drop under domain shift." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. ↩︎ ↩︎
Guo, Chuan, et al. "On calibration of modern neural networks." International Conference on Machine Learning. PMLR, 2017. ↩︎
Ovadia, Yaniv, et al. "Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift." Advances in Neural Information Processing Systems 32 (2019): 13991-14002. ↩︎
Schelter, Sebastian, Tammo Rukat, and Felix Bießmann. "Learning to validate the predictions of black box classifiers on unseen data." Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2020. ↩︎
Chuang, Ching-Yao, Antonio Torralba, and Stefanie Jegelka. "Estimating Generalization under Distribution Shifts via Domain-Invariant Representations." International Conference on Machine Learning. PMLR, 2020. ↩︎
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems 30 (2017). ↩︎ ↩︎
Fort, Stanislav, Huiyi Hu, and Balaji Lakshminarayanan. "Deep ensembles: A loss landscape perspective." arXiv preprint arXiv:1912.02757 (2019). ↩︎
When a human explains how they identify a certain object in an image, they may attribute their identification to certain parts of the image. If this attribution changes substantially when a very similar image is presented, it is an indication that their understanding of the concept is somehow not solid, or robust. Deep Learning models often exhibit such brittleness of attribution, and this makes them both difficult to trust and vulnerable to adversarial examples. This raises a natural question: can we train deep learning models to have robust attributions, and hence alleviate these issues? In this post, Jiefeng Chen and Xi Wu describe intriguing new results from their recent NeurIPS 2019 paper^{[1]} (co-authored with Vaibhav Rastogi, Yingyu Liang and Somesh Jha) on robustifying model attribution and the various benefits of doing so.
-- Prasad Chalasani
Fundamental to human perception is the concept of attribution (according to Merriam-Webster, "attribution" means "to explain (something) by indicating a cause"). Even at a very young age, humans can already attribute (and oftentimes surprisingly well). For example, when we show a picture of animals to a 3-year old kid and ask "Do you see a dog?" -- "Yes!" he or she answers; "So what makes you think so?" -- he or she will point to the part of the picture that accounts for a dog -- that's attribution. Perhaps more importantly, robust attribution plays a fundamental role for humans to check whether one has learned the concept correctly. For example, if for a very similar image the same conversation leads the kid to point to a birdie and say "that's the dog!" then this indicates that he or she has not learned what a dog really looks like.
Interestingly, while attribution plays a critical role in human learning, it has (until very recently) received very little attention in machine learning, and in particular deep learning, even though the holy grail of machine learning is to be able to learn to generalize -- or intuitively to learn the "ground truth causality". This is not surprising, however -- if we have learned to be able to predict accurately on unseen instances -- we must have learned the true underlying concept, so it is natural to wonder why one should bother with attribution. After all, deep learning has achieved unprecedented success in domains such as Computer Vision, Natural Language Processing, Game Playing, etc., why should we bother about their attributions?
This lack of attention to attribution has been changing over the past few years, with the discovery of adversarial examples^{[2]}^{[3]}. The Achilles' heel of the above "no need for attribution due to high prediction accuracy" argument is the existence of spurious correlations. Classical machine learning theory assumes i.i.d. samples and a "low complexity" hypothesis class to start with, in order to eventually be able to generalize. Unfortunately, neither of these two assumptions hold for deep learning: (1) Samples may be taken with inherent bias that we are not aware of (for example, pictures of cows may all have a green background), and (2) Deep models have huge “capacity” (i.e. large number of trainable parameters). The end result is that deep learning models can easily absorb spurious correlations and can superficially achieve high prediction accuracy, but can easily break down in slightly different but still very natural contexts. We list two examples here:
We believe that to address brittle predictions and attributions (and other artifacts arising from spurious correlations) in Machine Learning, the key is to enforce robustness of attributions.
In our recent work which will appear in NeurIPS 2019, we introduce Robust Attribution Regularization (RAR)^{[1:1]}. RAR aims to regularize the training so the resulting model will have robust attributions that are not substantially changed under minimal input perturbations. To compute attributions of models, we use Integrated Gradients (IG), which is a principled method that has been covered in detail in several posts on this blog (The Axioms of Attribution and Path Methods for Feature Attribution). Note that attributions computed by IG, much like attributions given by the kid in the above example, offer clues of the model's (or equivalently the kid's) thinking.
At a high level, the RAR formulation is very simple and intuitive. If we let $\Delta(\mathbf{x})$ denote the set of allowed (predefined) perturbations $\mathbf{x’}$ of an input instance $x$, the RAR objective discourages (via an additive regularization term) large differences between the attribution for $\mathbf{x}$ and the attribution for $\mathbf{x’}$ (in the worst case sense over perturbations $\mathbf{x’} \in \Delta(\mathbf{x})$). In our paper, attribution is captured by Integrated Gradients applied to the loss function in the supervised learning setting (other formulations are possible and we leave these for future work). Thus we augment the traditional machine-learning objective function with an additional “attribution-regularizer” term, yielding the objective:
\begin{align*} & \min_\theta \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[\ell(\mathbf{x},y;\theta)+\lambda*RAR] \\ & where\ RAR = \max_{\mathbf{x'}\in \Delta(\mathbf{x})} s(IG(\mathbf{x},\mathbf{x'})) \end{align*}
Here, $s$ is the size function (which we elaborate on below), $\mathcal{D}$ is the data distribution, $\mathbf{x}$ is the input, $y$ is the label, $\theta$ denotes the vector of model parameters, $\ell$ is the loss function, $\lambda$ is the regularization parameter, $\mathbf{x'}$ is the perturbed input, $\Delta(\mathbf{x})$ is the set of allowed perturbations, and $IG(\mathbf{x}, \mathbf{x’})$ denotes the Integrated Gradients vector for $\mathbf{x}$ with $\mathbf{x’}$ as the baseline (see our previous post for the precise definition of IG).
In simple terms, the RAR objective captures our desire to simultaneously achieve two goals:
Interestingly, while at first glance this objective function seems unrelated to adversarial training, our theoretical analysis reveals that RAR offers a rich and principled generalization to it. In fact, if one uses $\sum_{i} a_i$ as the size function $s(a)$, and define the permissible perturbation-set $\Delta(\mathbf{x})$ as those where no component is perturbed by more than some $\varepsilon > 0$ (these are referred to as $l_\infty$-norm-bounded perturbations), then via a straightforward calculation using the Completeness Axiom of IG, one can show that the RAR objective degenerates to the adversarial training objective (for robust predictions) of Madry et al.^{[5]}! Furthermore, our paper shows that not only adversarial training, but several objectives (including ones in different robust optimization models) could be viewed as special weak cases of "robust attributions". This gives rise to our first major point about the RAR theory:
Robust attribution regularization gives principled generalizations of previous objectives designed for robust predictions, in both uncertainty set model and distributional robustness model. Moreover, for 1-layer neural networks, RAR naturally degenerates to max-margin training.
Our empirical evaluation of RAR also gives encouraging results and interesting observations. We perform experiments on four classic datasets: MNIST^{[6]}, Fashion-MNIST^{[7]}, GTSRB^{[8]}, and Flower^{[9]}. To evaluate attribution robustness, we use attribution attacks proposed by Ghorbani et al.^{[4:1]} to perturb the inputs (thus causing changes in attributions) while keeping predictions intact, and use Top-K intersection and Kendall’s Correlation as metrics to measure rank correlations between original and perturbed saliency maps (higher values of these metrics indicate more robust attributions). Our main findings can be summarized as follows:
To make the discussion more concrete, let us consider the following figure as an example:
Here NATURAL denotes the naturally trained model, and IG-SUM-NORM is the model trained using our robust attribution method (here we set the size function $s(a)=\sum_{i} a_i+||a||_1$ ). For all images, the models give correct prediction – Windflower. However, the saliency maps, computed via IG, show that attributions of the naturally trained model are very fragile: this is both visually apparent and also reflected in the low intersection and correlation scores. By contrast, the model trained using our RAR method is much more robust in its attributions. It is intriguing to note that the attributions of the RAR-based model are more human-aligned: they precisely highlight the flower!
RAR only scratches the surface of leveraging attributions in deep learning. Our results are encouraging and they hint strongly that robust attributions can help protect models from spurious correlations: intuitively, one cannot reliably attribute to spurious correlations. We hope this post inspires research efforts to further explore this potential connection.
Jiefeng Chen, Xi Wu, Vaibhav Rastogi, Yingyu Liang, and Somesh Jha. "Robust Attribution Regularization." arXiv preprint arXiv:1905.09957 (2019). https://arxiv.org/abs/1905.09957 ↩︎ ↩︎
Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014). ↩︎
Zico Kolter and Aleksander Madry. "Adversarial robustness – theory and practice." https://adversarial-ml-tutorial.org/. ↩︎
Ghorbani, Amirata, Abubakar Abid, and James Zou. "Interpretation of neural networks is fragile." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019. ↩︎ ↩︎
Madry, Aleksander, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. "Towards deep learning models resistant to adversarial attacks." arXiv preprint arXiv:1706.06083 (2017). ↩︎
Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. "The mnist database of handwritten digits", 1998. http://yann.lecun.com/exdb/mnist/. ↩︎
Han Xiao, Kashif Rasul, and Roland Vollgraf. "Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms." arXiv preprint arXiv:1708.07747, 2017. ↩︎
Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. "Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition." Neural networks, 32:323–332, 2012. ↩︎
M-E Nilsback and Andrew Zisserman. "A visual vocabulary for flower classification." In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1447–1454. IEEE, 2006. ↩︎
In the recent years, deep neural networks have become increasingly powerful at tasks previously only humans had mastered. Deep learning has become widely used, and while it has many practitioners, its inner workings are far from well-understood. As the application of ML has increased, so has the need for algorithmic transparency, the ability to understand why algorithms deployed in the real world make the decisions they do. In recent work [1] with my colleagues at CMU, appearing in ITC 2018, we focus on bringing greater transparency to these mysterious, yet effective, machine learning techniques.
Machine learning has become increasingly powerful and ubiquitous. A 2018 study [2] found that a contemporary deep network could more accurately diagnose retinal disease than retina specialists and optometrists. However, despite their expanding capability and proliferation into sensitive domains, deep networks are still found to discriminate, overfit, and be vulnerable to attacks. For example, recent work out of CMU [3] has shown that specially-crafted 3D-printed eyeglasses can fool state-of-the-art face recognition models.
There has therefore been a growing need to address these problems via a deeper understanding of modern machine learning methods, which are generally concerningly opaque and difficult to understand.
When a deep neural network makes a decision, how do we know we can trust it? We often use measures like generalization error to give us a degree of confidence about our models’ reliability. Of course, this doesn’t give us certainty on individual predictions, and furthermore, when a network is deployed in the real world, it may have unexpected results if the train and test data don’t capture the distribution in the wild. Another approach is increasing trust by explaining key aspects of the predictions made by trained models. For example, we might wonder, does the model make decisions for the right reasons? When we can reveal the inner workings behind a given prediction of the model, we can be more informed on whether the prediction is trustworthy or not.
Similarly, we might want to know, did our model learn a pattern that we overlooked but might find useful? If our model made a mistake, why? How can we teach the model not to make the same kind of mistake in the future?
Answering these high-level questions requires the ability to make a rich set of queries that help us learn about a model’s predictive behavior. Primarily, an explanation framework is meant to give us the means to make such queries. An explanation framework’s utility comes from its ability to express and accurately answer queries. Relative to the growing body of prior work on DNN explanations, our work provides greater utility by generalizing to a larger space of explanations that admit a more diverse set of queries. Particularly, we consider the following axes:
Quantity of interest | The quantity of interest lets us specify what we want to explain. Often, this is the output of the network corresponding to a particular class, addressing, e.g., why did the model classify the image in Figure 1 as a sports car? However, we could also consider various combinations of outputs, allowing us to ask more specific questions, such as, why did the model classify this image as a sports car and not a convertible? As shown in Figure 1, the former may highlight general “car features,” such as tires, while the latter (called a comparative explanation) might focus on the roof of the car, a “car feature” not shared by convertibles. Our work is the first to provide support for comparative explanations.
Distribution of interest | The distribution of interest lets us specify the set of instances over which we want our explanations to be faithful. In some cases, we may want to explain the model’s behavior on a particular instance, whereas other times we may be interested in a more general behavior over a class of instances. Previous explanation frameworks have tended to work only for individual instances.
Internal Slice | The slice, or layer, of the network provides flexibility over the level of abstraction for the explanation. In a low layer, an explanation may highlight the edges that were most important in identifying an object like a face, while in a higher layer, the explanation might highlight high-level features such as a nose or mouth. We found that by raising the level of abstraction, we were able to produce explanations that generalized over larger sets of instances.
One clear use case for explanations is for human consumption. In order to be fully leveraged by humans, explanations need to be interpretable. A large vector of numbers doesn’t in general make us more confident we understand what a network is doing. Previously, there have been two roughly separate lines of work within the subfield of DNN transparency: interpretation techniques, and influence measures.
Interpretation | At a high level, deep learning is simply computing higher and higher level features that make data points increasingly separable by class. Zeiler et al. [4] demonstrated that some of the features learned by CNNs learning object-recognition tasks may appeal to human intuition: edges, shapes, and eventually high-level concepts like eyes or car tires. The method of Zeiler et al. can be considered an interpretation technique; namely, it gives us a better understanding of what a feature learned by a deep network represents. Interpretation techniques give us somewhat of a view of what a network has learned, but they don’t inform us of the causal relationships between the features and the outcome of the model, i.e., they give us an idea of what the high-level features are, but not how they are used.
Influence | On the other hand, influence gives us an idea of how features are used, but doesn’t tell us what the features represent. This has served as a severe limitation for prior work, e.g., [5], [6], [7], as prior work on influence measures has accordingly focused only on readily interpretable features, essentially the inputs.
Our work unifies these two ideas, allowing us to make the influence of internal features meaningful. We view an explanation as comprised of both an influence measure and an interpretation technique: the influence of a (possibly internal) feature is measured, and then the feature is interpreted, providing a meaningful explanation of what the feature represented, and how it was used. This can be thought of as guiding our choice of which features to interpret based on the importance of each feature. Practically, this means that we can take a large explanation, as could be produced by a technique like Integrated Gradients [6] or GradCAM [7], and decompose it into its components, as viewed by the network. This gives us a finer understanding of how the network is operating. Without this decomposition, explanations tend be be overly vague, and are therefore far less useful than the more nuanced explanations our framework is uniquely capable of. An example of this decomposition is shown in Figure 2; the boxed image on the far right of Figure 2 shows a typical explanation produced by Saliency Maps [5], which is notably less informative.
Explanations are a key stepping stone on the path towards algorithmic transparency. Explanations have direct applications towards auditing models that are deployed increasingly in critical or sensitive areas, e.g., medical diagnoses, credit and loan decisions, etc. Furthermore, we anticipate explanations will aid in debugging arbitrary model behavior, and in combining machine and human efforts to solve complicated problems.
Finally, while use via human consumption is perhaps the most natural use for explanations, explanations can be used by algorithms as well. In my upcoming work in ICLR 2019 [8], my colleagues and I demonstrate that influence can be used to create a post-hoc regularizer that reduces undesirable bias in model predictions. My ongoing research also suggests that explanations have applications in areas such as security, data privacy, and fairness.
By now, there have been many publicized instances of machine learning models that appear to discriminate on the basis of a protected attribute, such as race or gender. To mention a few, COMPAS^{[1]}, a model created by Northpointe (now Equivant) to measure the recidivism risk of many criminal defendants across the U.S., came under scrutiny because it categorized black people as high risk disproportionately more often, even after controlling for whether they actually recidivated^{[2]}. More recently, Amazon revealed that a model that they built to screen job applicants tended to favor men over women, leading them to scrap the model^{[3]}.
To be clear, I’m not claiming that this happened because someone at Northpointe or Amazon was a malicious racist or sexist. Rather, the issue is that it’s hard to ensure that a model is free of discriminatory biases; even if we don’t give the model access to the protected attribute directly, the model may learn to discriminate anyway by using a proxy that is correlated with the protected attribute.
There has been a lot of research in the field of fair machine learning, and in the process many different definitions of fairness have come up. One of the simplest and the most commonly used definitions is demographic parity, which states that the positive classification rate must be the same regardless of the protected attribute. For example, in the context of job applications, if 10% of the white applicants are ultimately hired, then 10% of the black applicants should be hired as well.
But just because a hiring process satisfies demographic parity does not mean that the hiring process is free of discrimination. For example, suppose that an employer uses a flawed model that results in the hiring of 10% of the white applicants and only 5% of the black applicants. Realizing the flaw of the model, the employer then attempts to remove the racial imbalance by hiring five more percent of the black applicants. Although the two racial groups now receive the same result on average, there still could be individual applicants who are harmed by the flawed model and are not hired later. This example is not just hypothetical; the state government of Connecticut did something very similar in the 1970s, and the U.S. Supreme Court decided for the above reason that demographic parity is not a complete defense to claims of discrimination^{[4]}.
So instead of considering the statistical effect of the proxy on the model as a whole, we choose to identify the cause of the discriminatory behavior directly. Previously, Datta et al.^{[5]} formalized the definition of a proxy as any component of the model that is:
This is a high-level summary of the definition, and in order to apply this definition in practice, we have to define what a component is, as well as specifying a measure of association and influence.
Our paper at NeurIPS 2018^{[6]} does exactly that for the setting of linear regression. We then improve on the proxy detection procedure proposed by Datta et al.^{[5:1]}, which is basically a brute-force algorithm that enumerates all components of the model. In the setting of linear regression, we found a nice structure that allows for a convex optimization (second-order cone programming) procedure that detects proxies quickly and efficiently. This is useful in practice because it allows a model trainer to learn which part of the model is discriminatory and take steps to fix that part.
One final point of note is that not all proxies are bad. To see why, let’s go back to the example of job applications. If the job requires heavy lifting, a model that is aware of this requirement will use the weightlifting ability of the applicant to help make its predictions. Although the weightlifting ability is likely to be a proxy for gender, this proxy is considered justified in this context because it is relevant for this specific prediction task. So in general, we assume that once a proxy has been identified, a human domain expert would look at the proxy to decide whether it should be allowed to remain in the model. But humans, unlike computers, are limited in their ability to process large amounts of work, so we want to focus their attention on the proxies that we are really unsure about. In particular, in the above example, we are pretty sure that the use of the weightlifting ability is justified, so we want to avoid identifying proxies that are largely based on the weightlifting ability. In the NeurIPS paper^{[6:1]}, we formalize this intuition by designating an input attribute as exempt, and we show how to modify our proxy detection algorithm to avoid proxies that are driven by the exempt attribute.
I hope that this post clarified the motivation for our work on proxy use, and I encourage you to read the paper^{[6:2]} for more technical details.
http://www.equivant.com/assets/img/content/Practitioners_Guide_COMPASCore_121917.pdf ↩︎
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing ↩︎
https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G ↩︎
http://papers.nips.cc/paper/7708-hunting-for-discriminatory-proxies-in-linear-regression-models ↩︎ ↩︎ ↩︎
Training neural networks involves gradients and differentiation, and it turns out that one good way of explaining a neural network's behavior (in terms of its inputs) is by integrating gradients...
In the last post we formalized several desirable properties of feature-attribution methods as Axioms of Attribution, and the post concluded by saying Path Integrated Gradient Methods (or Path Methods in short) satisfy many of these axioms. This post will delve deeper into these methods. We continue to use the formulation from the ICML 2017 paper by Sundararajan et. al.^{[1]} The detailed derivation is our own, and assumes the reader has some familiarity with differential and integral calculus.
To understand the math behind Path Methods, it helps to recall our 2-dimensional example where the function computed by the neural network is $F(x) = \min(1, x_1 + x_2)$ (we will use this example in the rest of this post). Suppose we want to find the feature-attributions for the function value at point $P_3 = (1.2, 0.6)$. In the contour plot below we see two specific paths from the origin to $P_3$. Notice that these paths sequentially change one variable from 0 (the baseline value) to its final value at $P_3$: the red path changes $x_1$ first, then $x_2$. The blue path changes $x_2$ first, then $x_1$. There are of course an infinite number of possible paths from the origin to $P_3$, and paths such as these that sequentially change one variable at a time are called extremal paths (we informally called them "axis-parallel" paths in the last post). If the input $x$ is $d$-dimensional, there are $d!$ possible orderings of the input dimensions, and there would be $d!$ different extremal paths.
What is nice about extremal paths is that we can easily compute the contribution of each variable along a specific path. In the above example, when moving from the origin toward $P_3$ along the blue path, on the first segment only $x_2$ changes (from 0 to its final value 0.6), and causes $F$ to increase by $0.6$, and on the second segment only $x_1$ changes, and causes $F$ to increase by $0.4$, which results in attributions $(0.4, 0.6)$.
The reason it is simple to compute the contributions of each feature on an extremal path is that on each straight-line segment, only one feature changes. So this idea extends naturally to compute the feature-contributions along any step-like portion of a path, and we will see below how this is useful.
Before moving on to discuss attribution along arbitrary paths, we should first consider how a general path can be represented. It is convenient to use a parametric representation of a path from the origin to $x$. Imagine we are moving along this path from the origin to $x$, and we introduce a parameter $\alpha$, which we will call the path position parameter, that represents the fraction of the path covered so far: this $\alpha$ uniquely identifies a point along the path. Then the path can be represented by a function $g$ that maps $\alpha$ to the $d$-dimensional coordinates of the unique point on the path that corresponds to $\alpha$.
Thus $g(\alpha)$ represents the $d$-dimensional coordinates of the point on the path at position $\alpha$. In particular $g(0)$ is the origin (all zero coordinates), and $g(1) = x$. We write $g_i(\alpha)$ to denote the $i$'th coordinate of $g(\alpha)$. We will only consider smooth functions $g$, i.e. those that are differentiable.
So how can we do attribution along an arbitrary path $\Pi$ from the origin to $P_3$? Here's a simple idea:
Since we already know how to do attribution along any "step-like" path, we approximate the path $\Pi$ with a sequence of infinitesimally small steps, as shown in the figure below.
Let's zoom in and look at a specific step $ABC$ in this step-approximation, as shown in figure below:
In the above figure, suppose the point $A$ on the path corresponds to path-position $\alpha$ in the parametric path-representation $g$, and $C$ corresponds to $\alpha + d\alpha$ (remember these are infinitesimally-small steps). So the coordinates at $A$ are $g(\alpha)$ and the corresponding function value is $F(g(\alpha))$. As we move along the path from $A$ to $C$, i.e., as $\alpha$ increases by an infinitesimal amount $d\alpha$, the value of the function $F$ changes by a certain amount, and we want to calculate how much the features $x_1$ and $x_2$ contribute to this change. First let's compute the change in $x_1$ when $\alpha$ increases by $d\alpha$. This is given by
$$
dg_1(\alpha) = \frac{ \partial g_1(\alpha) }{\partial \alpha} d\alpha,
$$
and the portion of the function change attributable to this $x_1$ change is $dg_1(\alpha)$ times the gradient of $F$ with respect to dimension 1 at path-position $\alpha$, which is denoted by $\partial_1 F(g(\alpha))$ (we are using the shorthand $\partial_1$ for the partial derivative with respect to dimension 1), and thus the contribution of $x_1$ along this infinitesimal step $d\alpha$ is:
$$
\partial_1 F(g(\alpha)) dg_1(\alpha) \; = \;
\partial_1 F(g(\alpha)) \frac{ \partial g_1(\alpha) }{\partial \alpha} d\alpha
$$
Summing these contributions over all path-positions $\alpha$ yields an integral for the total contribution of feature $x_1$ along the path $\Pi$, and generalizing to any dimension $i$, we can write this definition:
Path Integrated Gradient attribution of function $F$ at $x$, along path $\Pi$, for dimension $i$:
$$
A^{F,\Pi}_i(x) =
\int_0^1 \partial_i F(g(\alpha))
\frac{ \partial g_i(\alpha) }{\partial \alpha} d\alpha.
$$
The reason for the name "Integrated Gradient" should be clear -- we are integrating an expression involving the gradient of the final output of the neural network with respect to the input features. It's important to realize that these gradients are not the same as the gradients used when training the network: the latter are gradients of the loss (which is, roughly, a measure of the "error" between the network output and the desired output) with respect to the network parameters.
As a special case of the above definition, note that the straight line path from the origin to $x$ is given by the parametric representation $g(\alpha) = \alpha x$: as $\alpha$ changes from 0 to 1, we are uniformly scaling all coordinates from the 0-vector until they reach $x$. For this case, note that $g_i(\alpha) = \alpha x_i$ and $\partial g_i(\alpha)/\partial \alpha = x_i$, which implies the following simplified attribution expression, referred to by Sundararajan et. al.^{[1:1]} as the Integrated Gradient:
Integrated Gradient (IG) Attribution (with baseline = all-zeros vector)
$$
A^{F}_i(x) =
x_i \int_0^1 \partial_i F(\alpha x) d\alpha
$$
Let's apply the IG method to compute the feature attributions of the function value at $P_3=(1.2, 0.6)$, using the straight line from the origin to $P_3$ in our example where $F(x) = \min(1, x_1 + x_2)$, as shown in the figure below:
At any point on the path from the origin until $A = (2/3, 1/3)$, the sum of the coordinates is at most 1, so the value of $F$ is essentially $x_1 + x_2$, and therefore the gradients $\partial_i F(\alpha x)$ are 1 for both dimensions $i = 1,2$. Between point $A$ and $P_3$, the coordinates add up to at least 1 so $F$ is saturated at 1, and both gradients vanish. The value of $\alpha$ corresponding to point $A$ is $1/(3 \times 0.6) = 1/1.8$. Thus from the above expression, the IG-based contribution of dimension 1 to the function value at $P_3$ is
$$
x_1 \int_0^1 \partial_1 F(\alpha x) d\alpha \;=\; 1.2 \int_0^{\frac{1}{1.8}} 1 \, d\alpha \;=\; \frac{1.2}{1.8} \;=\; \frac{2}{3},
$$
and similarly the attribution to dimension 2 is $0.6/1.8 = 1/3$. Notice how this attribution $(2/3, 1/3)$ is more reasonable than both the attributions we saw earlier using the two extremal paths (see the first figure above):
As mentioned in the previous post, the IG method is the only one among path methods that satisfies all the axioms stated in that post. We should also point out that the integrals in the definitions above are instances of path integrals which have a rich history in Mathematics^{[2]} and Physics^{[3]}, and in particular Richard Feynman introduced a version of path integrals in his formulation of Quantum Mechanics^{[4]}.
In a future post we will look at other properties of Path Methods and their variants, how the IG attribution can be computed in general, and some simple models where the IG has a closed form analytic expression.
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2017. “Axiomatic Attribution for Deep Networks.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1703.01365. ↩︎ ↩︎
Ahlfors, Lars V. 1953. “Complex Analysis: An Introduction to the Theory of Analytic Functions of One Complex Variable.” New York, London, 177. ↩︎
Kleinert, Hagen. 1995. Path Integrals in Quantum Mechanics Statistics and Polymer Physics. World Scientific. ↩︎
Feynman, Richard P., Albert R. Hibbs, and Daniel F. Styer. 2010. Quantum Mechanics and Path Integrals. Courier Corporation. ↩︎
Unlike predictive models, for attribution methods there is no universal measure of "accuracy" or "correctness"; we can only say whether a method is intuitively "reasonable" or not, and we saw some examples of this analysis in the previous post. It is therefore common in this area to take an axiomatic approach: formalize certain "desirable" properties of attribution methods as Axioms, to help guide the design and evaluation of attribution methods. This post will describe a set of such axioms, based on the ICML 2017 paper of Sundararajan et al ^{[1]} (which is an excellent read on the subject).
Recall that we denote the function computed by the neural network as $F(x)$, where $x$ is the $d$-dimensional input vector. If $b$ is the baseline input vector, the attribution of the difference $F(x) - F(b)$ to feature $x_i$ is denoted by $A^F_i(x; b)$. To avoid notational clutter in this post we will assume that $b$ is the all-zeros vector, and that $F(b) = 0$, and we will simply write $A^F_i(x)$ to denote the attribution of $F(x)$ to feature $x_i$. As a running example we will continue to use the function $F(x) = \min(1, x_1 + x_2)$.
We mentioned this axiom in the last two posts:
Completeness: For any input $x$, the sum of the feature attributions equals $F(x)$:
\begin{equation*} F(x) = \sum_i A_i^F(x) \end{equation*}
Why is this property called "completeness"? Because any attribution method satisfying this axiom "completely" accounts for the value of $F(x)$ in terms of the feature attributions.
We saw in the last post that the perturbation method fails to satisfy completeness: If $F(x) = \min(1, x_1 + x_2)$, then for $x = (2,2)$, $F(x) = 1$, but $F((0,2)) = F((2,0)) = 1$, meaning that perturbing either input individually to 0 does not alter the function value, and this method would attribute 0 to both features, clearly violating Completeness.
On the other hand we saw in the last post that a certain path method does satisfy Completeness. Recall that the path method consists of picking a path from the origin (zero vector) to $x$, and computing how much each feature contributes along this path. In the figure below, to calculate the attributions at $P_3 = (1.2, 0.6)$, we considered two paths from the origin to $P_3$. Each segment of the path sequentially varies a single feature, keeping the other features (just one in this 2D case) constant. For example if we choose to do attribution using the blue path from the origin to $P_3$, then on the first segment only $x_2$ changes (from 0 to its final value 0.6), and causes $F$ to increase by $0.6$, and on the second segment only $x_1$ changes, and causes $F$ to increase by $0.4$, which results in attributions $(0.4, 0.6)$, which sum to 1. Similarly, the red path yields attributions $(1,0)$, which again sum to 1.
There are two versions of this axiom:
Sensitivity: If $x$ has only one non-zero feature and $F(x) \neq 0$, then the attribution to that feature should be non-zero.
Insensitivity: If $F(x)$ does not mathematically depend on the value of a feature $x_i$ then the attribution to $x_i$ should always be zero.
It is easily verified that: (a) Perturbation methods satisfy Sensitivity, and (b) Completeness implies Sensitivity, so the Path method satisfies this axiom as well. However another method we saw in the last post, the gradient method, does not. In our example, at the point $x = (2,0)$, $F(x)=1$ but the gradient of $F(x)$ with respect to $x_1$ is 0, which would be the gradient-based attribution. The reason the gradient method violates Sensitivity is that gradients are local. All methods we've seen so far satisfy the Insensitivity axiom.
Implementation Invariance: When two neural networks compute the same mathematical function $F(x)$, regardless of how differently they are implemented, the attributions to all features should always be identical.
This one might seem completely un-controversial, but surprisingly enough there are some attribution methods that violate this (e.g. DeepLift^{[2]} and Layer-wise Relevance Propagation or LRP^{[3]}). All the methods we have seen, however, do satisfy this axiom, since none of them refer to the implementation details of $F(x)$.
Linearity: If we linearly compose two neural networks computing $F$ and $G$ to form a new neural network that computes $H = aF + bG$, then for any feature $x_i$, its attribution for $H$ is a weighted sum of the attributions for $F$ and $G$, with weights $a$ and $b$ respectively, i.e.,
\begin{equation*} A_i^{H}(x) = a A_i^F(x) + b A_i^G(x) \end{equation*}In other words, this axiom says that the attribution method should preserve linearity.
Two features are said to be symmetric with respect to a function $F(x)$ if interchanging them does not change the function mathematically. For example $x_1, x_2$ are symmetric in $F(x) = \min(1, x_1 + x_2)$, but they are not symmetric in $F(x) = 3x_1 + x_2$. The following axiom states that an attribution method must preserve symmetry:
Symmetry-Preserving: For any input $x$ where the values of two symmetric features are the same, their attributions should be identical as well.
We saw a specific path method above: the path starts at the origin, and changes each feature sequentially until the specific input $x$ is reached.
For each segment where a feature $x_i$ is changed (and all others are kept constant), we then calculated how much $x_i$ contributed to the change in function value along. The specific paths we considered are axis-parallel paths (since each segment is parallel to some feature-axis), but there is no reason we should restrict ourselves to such paths. On a more general path from the origin to $x$, there is a way to compute the contribution of each variable, using a method called Path Integrated Gradients.
Path Integrated Gradients have an intriguing connection to work by two Nobel-prize winning economists Lloyd Shapley and Robert Aumann on cost-sharing: $F(x)$ represents a project's cost and the "features" are the demands of the participants, and attribution corresponds to cost-shares (or cost-allocations). Specifically,
Attribution based on a Path Integrated Gradient (along any fixed path from the origin to $x$) corresponds to a cost-sharing method referred to as Aumann-Shapley^{[4]}, and it has been proved^{[5]} that this method is the only one that satisfies all of the above axioms except the Symmetry-Preserving Axiom.
And the paper by Sundararajan et. al.^{[1:1]} (on which this post is based) proposes a special case of the Path Integrated Gradient and shows:
Attribution using the Integrated Gradient along the straight line from the origin to $x$, is the unique Path Method that is Symmetry-Preserving.
A future post will take a closer look at Path Integrated Gradients.
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2017. “Axiomatic Attribution for Deep Networks.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1703.01365. ↩︎ ↩︎
Shrikumar, Avanti, Peyton Greenside, and Anshul Kundaje. 2017. “Learning Important Features Through Propagating Activation Differences.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1704.02685. ↩︎
Binder, Alexander, Grégoire Montavon, Sebastian Bach, Klaus-Robert Müller, and Wojciech Samek. 2016. “Layer-Wise Relevance Propagation for Neural Networks with Local Renormalization Layers.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1604.00825. ↩︎
Shapley, L., Robert Aumann. 1974. Values of Non-Atomic Games. Princeton University Press. ↩︎
Friedman, Eric J. 2004. “Paths and Consistency in Additive Cost Sharing.” International Journal of Game Theory 32 (4): 501–18. ↩︎
$
\def\R{{\mathbb{R}}}
$
In the previous post we kicked off our discussion of explainability in DNNs by presenting a simple feature attribution method for evaluating the contribution of an input feature to a neural network's output. Specifically, we considered an intuitive method for feature attribution (some authors refer to it as a perturbation^{[1]} or ablation^{[2]} method): If $ F(x)$ is the (scalar) function computed by the neural network (where $x \in \R^d$ is a $d$-dimensional input feature vector), we posit a certain "information-less" or "neutral" baseline input vector $b \in \R^d$, and measure the attribution of $F(x)$ to feature $x_i$ (relative to the baseline $b$) denoted $A^F_i(x; b)$, as the difference between $F(x)$ and the counterfactual output $F(x[x_i = b_i])$obtained by perturbing $x_i$ to its baseline value $b_i$:
$$
A^F_i(x; b) := F(x) - F(x[x_i = b_i])
$$
We saw that when this method is applied to a simple linear model, it satisfies a nice property called additivity: the feature attributions add up to $F(x) - F(b)$, the change in the DNN output from input $b$ to input $x$. And this property (also referred to as completeness by some authors^{[3]} ) is nice to have since we can think of each feature's attribution as its contribution to the output change $F(x) - F(b)$. We concluded by saying that this property would not hold in general when there are non-linearities, which of course are essential to the operation of a DNN.
At this point it will be helpful to pause and see where this method fits within the broader class of feature-importance measures (we borrow some of this description from recent papers^{[1:1]}^{[4]} ^{[5]}). Attribution methods belong to a more general class of explanation methods known as saliency methods: roughly speaking, these methods seek to provide insights into the output $F(x)$ by ranking the explanatory power (or "salience" on "importance") of input features. Some of the saliency methods proposed recently include:
Note that when applying these methods, $F(x)$ could either represent the final output sore (e.g. probability of a certain class-label), or it could represent the activation value of an internal neuron.
Now let's return to the perturbation-based attribution method above and examine some of its limitations. Consider the following example (related to the one in Shrikumar et al., 2017^{[1:2]}). Say we have a simple non-linear function $F(x)$ of a 2-dimensional input vector $x$ where $x_i \geq 0$ for $i \in \{1,2\}$, and
$$
F(x) = \min(1, x_1 + x_2)
$$
Below is a 3D surface plot of this function. What's significant about this function is that, as the sum of the coordinates $s = x_1 + x_2$ increases from 0, the value of $F(x)$ is initially $s$, but then flattens out after $s=1$. The darker green colors in the plot indicate higher values of $F(x)$.
It will be more convenient for us to look at at a 2D contour plot instead, as shown below. This is essentially a view "from the top" of the above 3D surface plot, where the value of the function $F(x)$ is encoded by the darkness of the green color: Darker bands correspond to points $x$ with higher values of $F(x)$. All the points to the right and above the white line have the same dark green color, indicating that the function is 1 in that entire region. The spacing between bands in the countour plot is also meaningful: the function-value changes by the same amount between bands, so a bigger spacing means that the function is changing more slowly. (Of course in the non-flat region, the function is linear so the spacing is uniform there, but in the logistic function example below, we will see non-uniform spacing.)
Now here is a geometric view of the perturbation-based attribution method:
The attribution $A^F_i(x; b)$ to a feature $x_i$ is the drop in the value of the function $F(.)$ when we move on a straight line path (parallel to the $x_i$-axis) from $x$ to $x[x_i=b_i]$.
In our specific 2D example above, assuming a baseline vector $b=(0,0)$ (where $F(b)=0$), this translates to:
For any point $P$ with coordinates $(x_1, x_2)$, the attribution to dimension $i$ ($i \in \{1,2\}$) is the drop in the value of the function $F(x)$ when we move on a straight line path from $P$ parallel to the $x_i$ axis until $x_i=0$, i.e., to find the attribution to dimension $i$ we move from $P$ along the perpendicular to the other dimension.
To make it easier to read off the attributions, we show the amount by which the function value drops along each dotted line-segment (assuming we are moving toward one of the axes). For example at point $P_1 = (0.2, 0.6)$ the attributions to dimensions 1 and 2 are 0.2 and 0.6 respectively, which is what we would expect. Similarly at $P_2 = (0.4, 0.6)$ the attributions are 0.4 and 0.6.
But the point $P_3 = (1.2, 0.6)$ is interesting: the change in function value is 0 as we move from $P_3$ to $ P_2$ and the total change on the perpendicular path to the $x_2$ axis is 0.4, so the attribution to $x_1$ is 0.4. However the attribution to $x_2$ is zero since the function value doesn't change at all along the perpendicular path from $P_3$ to the $x_1$-axis! Point $P_4 = (1.4, 1.2)$ is an even more extreme example: the attributions to both dimensions are zero. The reason for these oddities is clearly the flattening out of the function $F(x)$ when the sum of its input dimensions exceeds 1. This is an instance of the well-known saturation issue in the attribution methods literature:
Saturation: When the output of a neural network is (completely or nearly) saturated, some attribution methods may significantly underestimate the importance of features.
In our example, when using the perturbation method, the contributions of the coordinates "saturate" when their sum reaches 1, and so for any point $x$ where either coordinate is at least 1 (such as $P_3$ or $P_4$ above), the contribution of the other coordinate will be 0 (since its perturbation to 0 will not affect the function value).
Besides the zero attribution problem, points $P_3$ and $P_4$ raise a subtler issue: how do we allocate "credit" for the function value (which is 1 at both points) among the two coordinates? For example for $P_3 = (1.2, 0.6)$, we know that the perturbation-based allocation of $(0.4, 0)$ is somehow "wrong" (besides violating additivity, it is counterintuitive that the second dimension gets no credit), but then what is the "correct" or fair allocation of credit? The $x_1$ coordinate is twice the $x_2$ coordinate, so does that mean we should allocate twice the credit to the first coordinate?
Before considering these questions, let us look at a different saliency method, mentioned above:
Gradient or Sensitivity based importance: The importance of a feature $x_i$ to the output $F(x)$ of a DNN, relative to a baseline vector $b$, is defined as:
\begin{equation*} G^F_i(x; b) := (x_i - b_i) \partial F(x)/\partial x_i \end{equation*}
This is not that different in spirit from the perturbation method: instead of perturbing the input feature $x_i$ to its baseline value $b_i$ and looking at how the output $F(x)$ changes, here we are computing the gradient $g$ of $F(x)$ with respect to $x_i$ and estimating the function value change, by an approximation $ g(x_i - b_i)$which would be exact if the gradient was constant between the current feature value $x_i$ and the baseline feature value $b_i$. Of course, if the gradient really were constant then there would be no difference between the perturbation method and the gradient method. (This method can also be rationalized by a first-order Taylor expansion of the multi-variate function $F(x)$).
The gradient method is not immune to the saturation issue either, since for any point $(x_1, x_2)$ in the saturated region (i.e. where $x_1 + x_2 > 1$) the gradients w.r.t. both $x_1$ and $x_2$ would both be zero, and hence the method would assign zero importance to both features.
We should note that the above oddities in attribution are exhibited not just by a function $F(x)$ that "flattens out" suddenly; any non-linearity can cause similar issues. For example the logistic function
$$
F(x) = \frac{1}{1 + \exp(-x_1 - x_2 - 1)},
$$
displayed in this 3D surface plot below, causes similar difficulties for the perturbation and gradient-based methods.
In the corresponding 2D contour plot (see below), the "gradual" flattening out is shown by the increasingly-spaced bands toward the upper-right side of the figure. Notice how for the point $x=(1.2, 0.8)$, the difference $F(x) - F((0,0)) = 0.22$, but the attributions to the 2 coordinates sum to just 0.14, violating the additivity/completeness property.
At this point it may seem like a challenging task to define a feature attribution method that satisfies additivity and perhaps other desirable properties. Let us revisit the geometric view of the perturbation-method above, to see if it might offer a possible clue for a better attribution method. We said that when attributing the $F(x) - F(b)$ to a dimension $x_i$ we move from the point $x$ along a straight line path parallel to the $x_i$ axis until the $i$'th coordinate equals $b_i$. In other words we are trying to allocate "credit" for $F(x)-F(b)$ according to the function-value change along $i$ different straight line paths.
For example for point $P_3 = (1.2, 0.6)$ in the saturation example above, the attributions were based on the function-change along a "horizontal" path and a "vertical" path to the axes. Can we try to define a single path and allocate credit according to how the function changes on that path? Let us try a couple of possible paths from point $P_3$ to the origin, as shown in the figure below:
So just picking an arbitrary path is not going to give us a sensible attribution. But it turns out that paths, and gradients along paths, are central to one way of defining attributions that satisfy additivity and other desirable properties. We will look at this in a future post.
Shrikumar, Avanti, Peyton Greenside, and Anshul Kundaje. 2017. “Learning Important Features Through Propagating Activation Differences.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1704.02685. ↩︎ ↩︎ ↩︎
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2016. “Gradients of Counterfactuals.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1611.02639. ↩︎
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2017. “Axiomatic Attribution for Deep Networks.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1703.01365. ↩︎
Kindermans, Pieter-Jan, Kristof T. Schütt, Maximilian Alber, Klaus-Robert Müller, Dumitru Erhan, Been Kim, and Sven Dähne. 2017. “Learning How to Explain Neural Networks: PatternNet and PatternAttribution.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1705.05598. ↩︎
Kindermans, Pieter-Jan, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. 2017. “The (Un)reliability of Saliency Methods.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1711.00867. ↩︎
Zeiler, Matthew D., and Rob Fergus. 2013. “Visualizing and Understanding Convolutional Networks.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1311.2901. ↩︎
The wild success of Deep Neural Network (DNN) models in a variety of domains has created considerable excitement in the machine learning community. Despite this success, a deep understanding of why DNNs perform so well, and whether their performance is somehow brittle, has been lacking. Two recent developments hold promise in shedding light on the behavior of DNNs, and could point the way to improving deep learning models
The discovery^{[1]} that several Deep Neural Network (DNN) models are vulnerable to adversarial examples: it is often possible to slightly perturb the input to a DNN classifier (e.g. an image-classifier) in such a way that the perturbation is invisible to a human, and yet the classifier's output can change drastically: for example a classifier that is correctly labeling an image as a school bus can be fooled into classifying it as an ostritch by adding an imperceptible change to the image. Besides the obvious security implications, the existense of adversarial examples seems to suggest that perhaps DNNs are not really learning the "essense" of a concept (which would presumably make them robust to such attacks). This opens up a variety of research avenues aimed at developing methods to train adversarially robust networks, and examining properties of adversarially trained networks.
Although DNNs are being increasingly adopted in real-world contexts, explaining their behavior has often been difficult. Explainability is crucial for a variety of reasons, and researchers have proposed various notions of what consititutes an explantion, and methods to produce explanations (see recent surveys by Ras et. al. ^{[2]}and Guidotti et al^{[3]}). One specific type of explanation is referred to as attribution: attributing the output of a DNN to its input features or internal components (such as neurons in a hidden layer).
In this series of blog posts we focus on these and related topics. This post is an introduction to the notion of Explainability, and specifically Attribution, in the context of DNNs.
DNNs are increasingly being used (or being seriously considered) to power real-world systems that produce decisions, predictions, scores or classifcations that directly or indirectly impact humans. Examples include:
The recent dramatic success of DNNs may create a temptation to treat these as opaque black-boxes and simply trust that they "just work". However in real-world applications such as the above, there are several reasons why it is crucial to be able to explain their behavior:
Given the importance of explainability of Deep Learning models, a natural question is: what consititutes a good explanation? A variety of notions of explanation have been proposed by researchers but in this blog we will focus on two types of explanations:
An attribution method aims to answer the following types of questions:
How much does an input feature (or hidden unit) contribute to the output of the DNN on a specific input? And what is the overall aggregate contribution on a given data distribution?
Let us first consider feature attribution, which is the attribution question for an input feature. At the most fundamental level, we'd like to know how much "credit" to assign this input feature to explain the output of the DNN. To understand why this is not a trivial question, let us look at some simple feature attribution methods that may come to mind.
To make things precise we denote the function computed by the DNN as $F(x)$ where $x$ is the input feature-vector, say of dimension $d$. The individual feature-dimensions of $x$ are denoted $x_1, x_2, \ldots, x_d$. Our aim is to compute for each dimension $i$, and attribution $A^F_i(x)$, i.e. the contribution of feature $x_i$ to $F(x)$. For brevity let us write $x[x_i=b_i]$ to denote the vector $x$ where $x_i$ is replaced by value $b_i$ (read this as "the vector $x$ where $x_i$ is assigned the value $b_i$").
One simple idea is to take a counterfactual view point:
Compare the DNN output $F(x)$ to what its output would have been if the input feature $x_i$ were not "active", i.e. if it were replaced by some suitable "information-less" baseline value $b_i$. In other words, the attribution to feature feature $x_i$ is given by:
\begin{equation} A_i^F(x; b) = F(x) - F(x[x_i=b_i]), \label{eq-attr-change} \tag{1} \end{equation}where we parametrize the attribution by the baseline-vector $b$.
The choice of the information-less baseline very much depends on the application. If $x$ represents the pixels of an image, then the all-black image would be a suitable baseline. In certain settings, if $x$ represents a collection of continuous numerical features, then the all-zeros vector could be a reasonable baseline. If $x$ represents a concatenation of embedding vectors (for example in NLP applications), then again the all-zeros vector could be a valid baseline.
Let's take a closer look at the above attribution definition. If $F(x)$ is a trivial shallow DNN computing a linear function of $x$ (with no activation function), say $$F(x) = \sum_{i=1}^d w_i x_i$$ where $w_i$ are the learned model weights, then the definition of $A_i^F(x; b)$ is simply $w_i (x_i - b_i)$, or in other words the attribution of feature $i$ is the weight times the value change from its baseline value. This is a perfectly reasonable definition of attribution for this special case. In fact it satisfies a nice property:
Additivity: The sum of the attributions of all input features equals the change of $F$ from the baseline $b$ to $x$:
\begin{align*} \sum_{i=1}^d A_i^F(x; b) &= \sum_{i=1}^d w_i(x_i - b_i)\\ &= \sum_{i=1}^d w_i x_i - \sum_{i=1}^d w_i b_i \\ &= F(x) - F(b) \end{align*}
Additivity is a nice property for an attribution method to have since it allows us to think of each feature's attribution $A_i^F(x; b)$ as the contribution of the feature to the change $F(x) - F(b)$. Note that in the above proof we relied on the fact that $F(x)$ is linear, and we wouldn't expect the method $\eqref{eq-attr-change}$ to satisfy additivity if $F(x)$ is non-linear, and DNNs necessarily involve non-linearities. So for the general case, how can we improve upon the naive method $\eqref{eq-attr-change}$ ? We will consider this in the next post, so stay tuned!
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. “Intriguing Properties of Neural Networks.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1312.6199. ↩︎
Ras, Gabrielle, Marcel van Gerven, and Pim Haselager. 2018. “Explanation Methods in Deep Learning: Users, Values, Concerns and Challenges.” arXiv [cs.AI]. arXiv. http://arxiv.org/abs/1803.07517. ↩︎
Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. “A Survey of Methods for Explaining Black Box Models.” ACM Comput. Surv. 51 (5): 93:1–93:42. ↩︎