Similarity and diversity: two sides of the same coin in the evaluation of data streams

Saia, Roberto

The Information Systems represent the primary instrument of growth for the companies that operate in the so-called e-commerce environment. The data streams generated by the users that interact with their websites are the primary source to define the user behavioral models. Some main examples of services integrated in these websites are the Recommender Systems, where these models are exploited in order to generate recommendations of items of potential interest to users, the User Segmentation Systems, where the models are used in order to group the users on the basis of their preferences, and the Fraud Detection Systems, where these models are exploited to determine the legitimacy of a financial transaction. Even though in literature diversity and similarity are considered as two sides of the same coin, almost all the approaches take into account them in a mutually exclusive manner, rather than jointly. The aim of this thesis is to demonstrate how the consideration of both sides of this coin is instead essential to overcome some well-known problems that affict the state-of-the-art approaches used to implement these services, improving their performance. Its contributions are the following: with regard to the recommender systems, the detection of the diversity in a user profile is used to discard incoherent items, improving the accuracy, while the exploitation of the similarity of the predicted items is used to re-rank the recommendations, improving their effectiveness; with regard to the user segmentation systems, the detection of the diversity overcomes the problem of the non-reliability of data source, while the exploitation of the similarity reduces the problems of understandability and triviality of the obtained segments; lastly, concerning the fraud detection systems, the joint use of both diversity and similarity in the evaluation of a new transaction overcomes the problems of the data scarcity, and those of the non-stationary and unbalanced class distribution.