UNICA IRIS Institutional Research Information System

Document exploration in archives is often challenging due to the lack of organization in topic-based categories. Moreover, archival records only provide short text which is often insufficient for capturing the semantic. This paper proposes and explores a dataless categorization approach that utilizes word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance the exploration of data. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.

The challenges of German archival document categorization on insufficient labeled data

Fabian Hoppe;Tabea Tietz;Danilo Dessi';Nils Meyer;Mirjam Sprau;Mehwish Alam;Harald Sack

2020-01-01

Abstract

Document exploration in archives is often challenging due to the lack of organization in topic-based categories. Moreover, archival records only provide short text which is often insufficient for capturing the semantic. This paper proposes and explores a dataless categorization approach that utilizes word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance the exploration of data. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2020
			
	Parole chiave
	
				Cultural Heritage; Dataless Categorization; Document Exploration; Text Categorization
			
	Tipologia:
	
				4.1 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
2020 - The Challenges of German Archival Document Categorization on Insufficient Labeled Data.pdf accesso aperto Tipologia: versione editoriale (VoR) Dimensione 259.07 kB Formato Adobe PDF Visualizza/Apri	259.07 kB	Adobe PDF	Visualizza/Apri

I metadati presenti in IRIS UNICA sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono protetti da diritto d'autore, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11584/321829

Citazioni

ND

5

ND

ND

social impact