On the transferability of adversarial examples in Machine Learning: the case of email spam detection
- Abstract
- Nowadays, machine learning techniques are widely used. As they become more widespread, the question of security arises: are machine learning models robust to attacks from malicious users? It has been shown that they are not. In particular, a number of articles have shown that even state-of-the-art machine learning models are vulnerable to adversarial example attacks. So-called adversarial examples are inputs modified so as to be misclassified by a model. In image classification, adversarial examples have also been shown to transfer across models: adversarial examples crafted by attacking a model A often fool another model B. This phenomenon, called transferability, has been less studied in the case of text classification. In this thesis, we study the transferability of adversarial examples for email spam detection. For this purpose, we choose six models: (i) multinomial naive Bayes classifier, (ii) logistic regression, (iii) linear support vector machine, (iv) random forest, (v) k-nearest neighbors and (vi) convolutional neural networks. We also focus on two attacks: (i) TextFooler, a black-box attack, and (ii) HotFlip, a white-box attack. Our main result is that, for these two attacks, adversarial examples do not transfer once one takes into account natural constraints on the transformations that an email can undergo.
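The transferability experiment described above can be illustrated with a minimal sketch (this is not the thesis pipeline; the toy corpus, the word-removal perturbation, and the model pair are illustrative assumptions): craft an adversarial email against a white-box linear model A by removing its most spam-indicative token, then check whether the same perturbed email also changes the prediction of an independently trained model B.

```python
# Minimal transferability sketch on a toy spam corpus (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

ham = ["meeting at noon tomorrow", "please review the attached report"]
spam = ["win free money now", "free prize click now to win"]
texts, labels = ham + spam, [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(texts)

model_a = LogisticRegression().fit(X, labels)  # white-box model under attack
model_b = MultinomialNB().fit(X, labels)       # independent target model

email = "win free money now"
tokens = email.split()

# White-box step against model A: drop the token whose learned coefficient
# pushes the prediction hardest toward the spam class.
coefs, vocab = model_a.coef_[0], vec.vocabulary_
worst = max(tokens, key=lambda t: coefs[vocab[t]] if t in vocab else -np.inf)
adversarial = " ".join(t for t in tokens if t != worst)

# Transferability check: does the perturbation crafted on A also affect B?
for name, model in [("A", model_a), ("B", model_b)]:
    pred = model.predict(vec.transform([adversarial]))[0]
    print(f"model {name} classifies the perturbed email as: "
          f"{'spam' if pred == 1 else 'ham'}")
```

With a realistic corpus and attacks such as TextFooler or HotFlip, the perturbation would be constrained to keep the email natural and readable, which is exactly the setting in which the thesis finds that adversarial examples fail to transfer.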