Knowledge Graphs for Digital Transformation Monitoring in Social Media
Zavarella V.; Reforgiato Recupero D.; Buscaldi D.; Dessi D.
2024-01-01
Abstract
Several techniques and workflows have recently emerged for automatically extracting knowledge graphs from documents such as scientific articles and patents. However, adapting these approaches to alternative text sources such as micro-blogging posts and news, and to modeling the open-domain entities and relationships commonly found in these sources, remains challenging. This paper introduces an improved information extraction pipeline designed specifically for extracting a knowledge graph of open-domain entities from micro-blogging posts on social media platforms. Our pipeline uses dependency parsing and classifies entity relations in an unsupervised manner through hierarchical clustering over word embeddings. We present a case study on extracting semantic triples from a collection of tweets about digital transformation, and show through two experimental evaluations on the same dataset that our system achieves precision above 95% and surpasses similar pipelines by approximately 5% in precision, while also generating a notably higher number of triples.
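To make the relation-classification idea more concrete, the following is a minimal sketch of hierarchical (agglomerative) clustering over word embeddings, not the authors' implementation. The abstract does not state the linkage criterion, the distance measure, the embedding model, or the stopping rule, so average linkage, cosine distance, a fixed distance threshold, and the toy 3-dimensional vectors below are all assumptions chosen for illustration. The names `agglomerative_clusters` and `cluster_verbs` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(c1, c2, vectors):
    """Mean pairwise cosine distance between members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_clusters(vectors, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold.

    The threshold-based stopping rule is an assumption of this sketch.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy "embeddings" for relation verbs; the values are made up for illustration
# and do not come from any trained model or from the paper's dataset.
VERBS = ["adopt", "embrace", "implement", "cut", "reduce"]
EMBEDDINGS = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.10, 0.90],
    [0.10, 0.00, 0.95],
]


def cluster_verbs(threshold=0.2):
    """Group the toy verbs into candidate relation classes."""
    groups = agglomerative_clusters(EMBEDDINGS, threshold)
    return [[VERBS[i] for i in g] for g in groups]


if __name__ == "__main__":
    for group in cluster_verbs():
        print(group)
```

With these toy vectors, verbs whose embeddings point in similar directions end up in the same group, and each group could then be treated as one candidate relation type. In a real pipeline the vectors would come from a pretrained embedding model and the verbs from dependency parses of the posts.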