
Explaining Vulnerability of Machine Learning to Adversarial Attacks

MELIS, MARCO
2021-03-09

Abstract

Pattern recognition systems based on machine learning techniques are nowadays widely used in many different fields, ranging from biometrics to computer security. Despite the high performance these systems often provide, there is a general consensus that their reliability should be carefully assessed, in particular when they are applied to critical domains like medicine, criminal justice, financial markets, or self-driving cars. Especially in the era of big data, the chance that these systems inadvertently make wrong decisions, misled by artifacts or spurious correlations in the training data, is not negligible. Accordingly, to increase users' trust and identify potential design flaws in the algorithms, many researchers have started to explore the field of explainable machine learning, with the goal of designing systems that are not only able to perform a pattern recognition task accurately, but that are also interpretable, i.e., able to "explain or present their decisions in understandable terms to a human". In parallel, another research field emerged more than ten years ago: adversarial machine learning. In security tasks like spam filtering or malware detection, skilled and adaptive adversaries (human beings) may modify legitimate samples to defeat a system, creating so-called adversarial attacks. Researchers have therefore started to account for such adversarial environments during the engineering process, by evaluating potential vulnerabilities, measuring performance in terms of robustness against these attacks, and designing countermeasures. Despite the vast amount of work in this direction, providing a thorough characterization of the effects of adversarial attacks is still an open issue, especially when systems are not able to provide an explanation alongside their automated decisions. In this thesis, we conduct a systematic investigation of the connections between explainability techniques and adversarial robustness, in order to gain a better understanding of the reasons behind the brittleness of modern machine learning algorithms, and with the goal of designing more robust systems that users can trust to operate safely in an adversarial environment. To this end, we start by proposing a novel optimization framework for crafting different adversarial attacks under the same unified mathematical formulation, which eases the study of their security properties, such as one of the most insidious: transferability. After providing a formal definition of this property and different quantitative metrics for its evaluation, we apply a novel explainability method based on highly interpretable relevance vectors, which allows one to compare different models with respect to their learned behavior and to gain insights into their security properties, including adversarial robustness and resilience to transfer attacks. Finally, to facilitate the practical application of these concepts, we also present secml, an open-source Python library that integrates all the tools required for developing and evaluating secure and explainable machine-learning-based systems, without the need to combine multiple third-party libraries.
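As a rough illustration of two ideas the abstract only touches on in prose, crafting evasion attacks as a constrained optimization problem and measuring transferability across models, here is a minimal, self-contained Python sketch. It is not taken from the thesis or from the secml library: the scikit-learn models, the L2 budget, the projected-gradient loop, and the transfer-rate definition are all illustrative assumptions.

```python
# Minimal sketch: (i) craft evasion attacks on a surrogate model by solving
# max_{||delta||_2 <= eps} L(x + delta, y) with projected gradient ascent, and
# (ii) measure transferability as the fraction of successful attacks that also
# evade a different target model. All concrete choices here are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy two-class problem; surrogate and target are trained on the same data.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
surrogate = LogisticRegression(max_iter=1000).fit(X, y)
target = SVC(kernel="rbf").fit(X, y)

def craft_evasion(w, x, y_true, eps=2.0, steps=50, lr=0.5):
    """Push x toward the wrong side of the surrogate's linear decision
    function, keeping the perturbation inside an L2 ball of radius eps."""
    delta = np.zeros_like(x)
    sign = 1.0 if y_true == 0 else -1.0   # raise/lower the linear score
    for _ in range(steps):
        grad = sign * w                   # gradient of the score w.r.t. x
        delta += lr * grad / (np.linalg.norm(grad) + 1e-12)
        norm = np.linalg.norm(delta)
        if norm > eps:                    # projection back onto the L2 ball
            delta *= eps / norm
    return x + delta

w = surrogate.coef_.ravel()
X_adv = np.array([craft_evasion(w, x, yi) for x, yi in zip(X[:100], y[:100])])

# Transferability: among attacks that evade the surrogate, how many also
# evade the (different) target model?
evades_surrogate = surrogate.predict(X_adv) != y[:100]
evades_target = target.predict(X_adv) != y[:100]
transfer_rate = evades_target[evades_surrogate].mean() if evades_surrogate.any() else 0.0
print(f"evasion rate on surrogate: {evades_surrogate.mean():.2f}")
print(f"transfer rate to target:   {transfer_rate:.2f}")
```

The toy setting only shows the generic shape of the max-loss-under-budget attack formulation and of a transfer-rate metric; the thesis's actual unified framework, attack solvers, and transferability metrics are developed in the document itself.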
Files in this item:

File: tesididottorato_marcomelis.pdf
Description: tesididottorato_marcomelis
Type: Doctoral thesis
Access: open access
Size: 5.57 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/310629
Warning: the displayed data have not been validated by the university.
