Image spam filtering using textual and visual information

Fumera, Giorgio; Pillai, I; Roli, Fabio; Biggio, Battista

In this paper we focus on so-called image spam, in which the spam message is embedded into images attached to e-mails in order to circumvent statistical techniques based on the analysis of the e-mail body text (such as “Bayesian filters”), and content obscuring techniques are applied to these images to make them unreadable by standard OCR systems without compromising human readability. We argue that computer vision techniques, in particular visual pattern recognition and image processing, will play a prominent role against image spam. We then discuss two possible approaches to defeating image spam: exploiting the high-level textual information embedded in images by combining OCR and text categorization techniques, and exploiting low-level image information to detect the content obscuring techniques applied to spam images. We also report results of an experimental investigation on a large data set of spam e-mails, aimed at evaluating the effectiveness of combining standard OCR and text categorization techniques, together with preliminary results on the use of low-level features to detect image defects (such as broken characters, or noise components interfering with characters in a binarized image) that are typical consequences of the content obscuring techniques spammers use.
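
As an illustration of the kind of low-level feature mentioned above, the following minimal sketch counts the connected foreground components of a binarized image and reports the fraction that are very small, which may hint at noise specks or fragments of broken characters. It is not the feature set used in the paper: the function names, the 4-connectivity choice, and the `min_size` threshold are all assumptions made for this example.

```python
from collections import deque


def connected_components(binary):
    """Return the pixel count of each 4-connected foreground component.

    `binary` is a list of rows of 0/1 values, where 1 marks a foreground
    (ink) pixel. This input format is an assumption of the sketch.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def small_component_ratio(binary, min_size=3):
    """Fraction of components smaller than `min_size` pixels.

    `min_size` is a hypothetical threshold; a real system would tune it
    to the image resolution and expected character size.
    """
    sizes = connected_components(binary)
    if not sizes:
        return 0.0
    return sum(1 for s in sizes if s < min_size) / len(sizes)
```

A high ratio on its own would not prove obfuscation. In practice such a value would presumably be one input among several to a classifier trained on labelled spam and legitimate images.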