UNICA IRIS Institutional Research Information System

A supervised multi-class multi-label word embeddings approach for toxic comment classification

Carta S.; Corriga A.; Mulas R.; Reforgiato Recupero D.; Saia R.

2019-01-01

Abstract

Internet-based communication has revolutionized the way people exchange information, allowing real-time discussions among very large numbers of people. However, the advantages of such powerful communication tools are sometimes undermined by personal attacks that lead many people to leave a discussion in which they were participating. This problem stems from so-called toxic comments, i.e., personal attacks, verbal bullying and, more generally, the aggressive way in which many people take part in a discussion, which drives some participants to abandon it. By exploiting the Apache Spark big data framework and several word embeddings, this paper presents an approach that performs multi-class, multi-label classification of a discussion over six classes of toxicity. We evaluate the approach on a dataset of comments taken from Wikipedia talk pages, following a Kaggle challenge. The experimental results show that, by adopting different sets of word embeddings, our supervised approach outperforms state-of-the-art approaches based on the canonical bag-of-words model. In addition, adopting word embeddings built in a similar scenario (i.e., discussions related to e-learning videos) shows that it is possible to improve performance over state-of-the-art word embedding solutions.
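To make the multi-label setting concrete, the following is a minimal sketch in plain Python, not the authors' Apache Spark pipeline. It represents each comment as the average of its word vectors and trains one independent logistic-regression classifier per toxicity label, so a single comment can receive several labels at once. The label names are assumed from the Kaggle Toxic Comment Classification Challenge (the abstract only states that there are six classes), and the three-dimensional embeddings, the training comments and all function names are hypothetical, invented for illustration.

```python
import math

# Label names assumed from the Kaggle challenge; the abstract does not list them.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical 3-dimensional word vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "idiot": [0.9, 0.1, 0.0],
    "stupid": [0.8, 0.2, 0.0],
    "kill": [0.1, 0.9, 0.0],
    "thanks": [0.0, 0.0, 0.9],
    "helpful": [0.0, 0.1, 0.8],
    "edit": [0.0, 0.0, 0.5],
}
DIM = 3


def embed(comment):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [TOY_EMBEDDINGS[w] for w in comment.lower().split() if w in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Probability that the label applies to feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_binary(features, targets, epochs=500, lr=0.5):
    """Logistic regression for one label, fitted by stochastic gradient descent."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            error = score(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def train_multilabel(comments, label_rows):
    """Train one independent binary classifier per label."""
    features = [embed(c) for c in comments]
    return {
        label: train_binary(features, [row[j] for row in label_rows])
        for j, label in enumerate(LABELS)
    }


def predict(models, comment, threshold=0.5):
    """Return every label whose classifier scores above the threshold."""
    x = embed(comment)
    return [label for label, (w, b) in models.items() if score(w, b, x) > threshold]


# Toy training set: each row flags the labels in LABELS order.
COMMENTS = ["you idiot", "stupid edit", "i will kill you",
            "thanks helpful edit", "helpful thanks"]
LABEL_ROWS = [
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

models = train_multilabel(COMMENTS, LABEL_ROWS)
print(predict(models, "you idiot"))
print(predict(models, "thanks helpful"))
```

Training one classifier per label is only one of several ways to handle multi-label data. It is used here because it keeps the sketch short, not because the abstract specifies it.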

Year: 2019

Keywords: Apache Spark; Sentiment Analysis; Supervised Approach; Word Embeddings

Type: 4.1 Conference proceedings contribution

Files in this item:

File	Size	Format
KDIR_2019_17_CR.pdf (archive managers only; type: pre-print version)	134.55 kB	Adobe PDF

The metadata in IRIS UNICA are released under the Creative Commons CC0 1.0 Universal license, while the publication files are protected by copyright unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/279698
