UNICA IRIS Institutional Research Information System

Stopwords identification by means of characteristic and discriminant analysis

ARMANO, GIULIANO;FANNI, FRANCESCA;GIULIANI, ALESSANDRO

2015-01-01

Abstract

Stopwords are non-significant terms that occur frequently in a document and carry little meaning. They should be removed, like noise. Traditionally, two approaches to building a stoplist have been used: the first considers the most frequent terms in a language (e.g., an English stoplist), while the second includes the most frequently occurring terms in a document collection. In several tasks, e.g., text classification and clustering, documents are typically grouped into categories. We propose a novel approach aimed at automatically identifying specific stopwords for each category. The proposal relies on two unbiased metrics that allow the informative content of each term to be analyzed: one measures the discriminant capability and the other measures the characteristic capability. For each term, the former is expected to be high in accordance with its ability to distinguish a category from the others, whereas the latter is expected to be high according to how frequent and common the term is across all categories. A preliminary study and experiments have been performed, supporting our insight. Results confirm that, for each domain, the metrics easily identify specific stoplists, which include both classical and category-dependent stopwords.
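The abstract does not give the formulas for the two metrics, so the following is only a minimal sketch of the general idea, not the authors' definitions. It assumes, purely for illustration, that each document is a set of terms with one category label, that the discriminant score of a term is the difference between its document frequency inside and outside a category, and that the characteristic score is the sum of those two frequencies minus one. The function names and the two thresholds are hypothetical.

```python
def term_scores(docs, category):
    """Return {term: (discriminant, characteristic)} for one category.

    docs is a list of (category_label, set_of_terms) pairs.
    Both scores are illustrative assumptions, not the paper's metrics.
    """
    inside = [terms for label, terms in docs if label == category]
    outside = [terms for label, terms in docs if label != category]
    if not inside or not outside:
        raise ValueError("need documents both inside and outside the category")

    vocabulary = set().union(*inside, *outside)
    scores = {}
    for term in vocabulary:
        # Fraction of documents containing the term, inside and outside.
        p_in = sum(term in terms for terms in inside) / len(inside)
        p_out = sum(term in terms for terms in outside) / len(outside)
        discriminant = p_in - p_out          # high: term separates the category
        characteristic = p_in + p_out - 1.0  # high: term is common everywhere
        scores[term] = (discriminant, characteristic)
    return scores


def candidate_stopwords(docs, category,
                        min_characteristic=0.5, max_discriminant=0.25):
    """Terms that are common everywhere and barely discriminate (assumed rule)."""
    scores = term_scores(docs, category)
    return sorted(
        term for term, (disc, char) in scores.items()
        if char >= min_characteristic and abs(disc) <= max_discriminant
    )


# Toy collection, invented for this example only.
docs = [
    ("sports", {"the", "match", "team"}),
    ("sports", {"the", "goal", "team"}),
    ("politics", {"the", "vote", "party"}),
    ("politics", {"the", "law", "party"}),
]

print(candidate_stopwords(docs, "sports"))
```

On this toy collection, "the" appears in every document, so it scores high on the assumed characteristic measure and zero on the discriminant one, while "team" shows the opposite pattern and is kept as an informative term.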

Year: 2015
ISBN: 9789897580741; 9897580743
Keywords: Characteristic capability; Discriminant capability; Stopwords; Text classification; Artificial Intelligence; Software
Type: 4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
2015-ICAART-armano.pdf (publisher's version, VoR; access restricted to archive managers)	341.98 kB	Adobe PDF

The metadata in IRIS UNICA are released under the Creative Commons CC0 1.0 Universal license, while the publication files are protected by copyright unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/197248
