Machine Learning in Social Media Sentiment Classification and Trading Strategy Design

CAMBA, GIACOMO
2022-04-20

Abstract

The goal of this thesis is to build a trading strategy that jointly uses quantitative variables and qualitative sentiment variables. In particular, we test whether the equity line of a trading bot improves when it is trained in an environment that includes sentiment variables and attention measures in addition to price and volume variables. Our target market is the US stock market, and in particular the S&P 500. As a proxy for equity investors' attention, we use the S&P 500 Google Search Volume Index downloaded from Google Trends, while the sentiment variable is built from textual data from the four main financial social media platforms. The text corpus includes the messages posted on StockTwits and Twitter and the comments published on the Yahoo Finance and Investing message boards concerning the tickers of the US stock index and its ETF. The downloaded messages number over 5.7 million and cover a period of 15 years, from 2006 to 2021. Users labeled 32% of these messages as bullish or bearish, while the remainder is unlabeled. Because we wanted our sentiment variable to cover all of the collected data, we had to find the best sentiment classifier and use it to label the unlabeled messages. To do this, we applied the two main approaches to financial sentiment analysis to the labeled data, namely the lexicon approach and the machine learning approach. We tested the classification performance of 16 of the main financial and non-financial sentiment lexicons and, having verified that they performed poorly, turned to the machine learning approach.
This meant, first, identifying the best word embedding technique among frequency-based and probabilistic methods; then comparing unsupervised learning algorithms to see whether the dimensionality of the data could be reduced without losing the most valuable information; and finally testing the classification performance of the most advanced machine learning models for text classification. Supervised model training included an exhaustive hyperparameter search with 5-fold cross-validation for simpler models and a random hyperparameter search for more complex models. Ultimately, we find that the best sentiment classifier on our data is the LSTM model, with a test accuracy of 77%. After using it to label the unlabeled data, we built a sentiment variable expressing investors' bullish and bearish moods. The sentiment and attention variables were then combined with the price and volume data of the US stock market ETF to create a reinforcement learning (RL) environment in which to train our agent. Across several tests, we find that our agent achieves a significantly higher return when the sentiment and attention variables are also included in the RL environment.
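
As an illustration of how bullish and bearish labels could be turned into a daily sentiment variable, the following is a minimal sketch. It assumes a normalized bullish-minus-bearish ratio per day; the abstract does not state the aggregation formula used in the thesis, so the function name `daily_sentiment`, the formula, and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def daily_sentiment(messages):
    """Aggregate labeled messages into one sentiment value per day.

    `messages` is an iterable of (day, label) pairs, where label is
    "bullish" or "bearish". The value for each day is
    (bullish - bearish) / (bullish + bearish), which lies in [-1, 1].
    This formula is an assumption, not the one used in the thesis.
    """
    counts = defaultdict(lambda: [0, 0])  # day -> [bullish, bearish]
    for day, label in messages:
        if label == "bullish":
            counts[day][0] += 1
        elif label == "bearish":
            counts[day][1] += 1
        else:
            raise ValueError(f"unknown label: {label!r}")
    # Every day present in `counts` has at least one message,
    # so the denominator is never zero.
    return {
        day: (bull - bear) / (bull + bear)
        for day, (bull, bear) in sorted(counts.items())
    }


# Hypothetical sample: three messages on one day, one on the next.
sample = [
    (date(2021, 1, 4), "bullish"),
    (date(2021, 1, 4), "bullish"),
    (date(2021, 1, 4), "bearish"),
    (date(2021, 1, 5), "bearish"),
]
index = daily_sentiment(sample)
```

A series of this kind could then be joined by date with price, volume, and search-volume data to form the observations of a trading environment.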
Files in this item:
tesi di dottorato_giacomo camba.pdf
Description: tesi di dottorato_giacomo camba
Type: Doctoral thesis
Size: 12.12 MB
Format: Adobe PDF
Embargoed until 19/04/2025

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/333407
Warning: the data displayed here have not been validated by the university.
