A Big Data framework based on Apache Spark for Industry-specific Lexicon Generation for Stock Market Prediction
Reforgiato Recupero D.; Stanciu M. M.
2021-01-01
Abstract
Research on the automatic prediction of stock market prices indicates that published news is an important asset for solving this problem. We further elaborate on an NLP-based approach that generates industry-specific lexicons from news documents, exploiting the distributed technology of Apache Spark, with a focus on identifying, on a day-to-day scale, the correlation between significant stock price variations and the words collected from press releases. We then apply a binary classification algorithm that builds upon our newly generated lexicons to predict the magnitude of stock market price fluctuations. By processing a large collection of news articles from the most prestigious press agencies, we validate our approach with an experiment on the market history of the US companies belonging to the Standard & Poor's 500 index. We also test the performance of the algorithm in a multilingual setting, focusing in particular on the Italian stock market and the Italy 40 (FTSE MIB) index. The final classification results let us assess the mutual dependence between terms and prices, and help us evaluate the predictive power of the lexicons we created.
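The idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the tokenizer, the 2% significance threshold, the minimum day count, the word-scoring rule (mean daily return over the significant days on which a word appears) and the "significant"/"minor" labels are all assumptions made for illustration, and the toy data is invented. In a Spark setting, the per-word aggregation would presumably be distributed across the cluster rather than run in a single loop.

```python
from collections import defaultdict


def tokenize(text):
    # Naive tokenizer (assumption): lowercase, whitespace split, alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def build_lexicon(news_by_day, return_by_day, threshold=0.02, min_days=2):
    """Score each word by the mean daily return over the days with a
    significant price move (|return| >= threshold) on which it appears.
    Words seen on fewer than min_days significant days are dropped."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, articles in news_by_day.items():
        r = return_by_day.get(day)
        if r is None or abs(r) < threshold:
            continue  # keep only days with a significant variation
        words = set()
        for article in articles:
            words.update(tokenize(article))
        for w in words:  # each word counts once per day
            totals[w] += r
            counts[w] += 1
    return {w: totals[w] / counts[w] for w in totals if counts[w] >= min_days}


def classify_day(articles, lexicon, cutoff=0.02):
    """Binary label (assumption): 'significant' if the mean lexicon score of
    the matched words reaches the cutoff in absolute value, else 'minor'."""
    scores = [lexicon[w] for a in articles for w in tokenize(a) if w in lexicon]
    if not scores:
        return "minor"
    day_score = sum(scores) / len(scores)
    return "significant" if abs(day_score) >= cutoff else "minor"


# Invented toy data: one company, five trading days.
news = {
    "d1": ["Profits surge after strong quarter"],
    "d2": ["Shares plunge on weak guidance"],
    "d3": ["Profits surge again"],
    "d4": ["Shares plunge further"],
    "d5": ["Company holds annual meeting"],
}
returns = {"d1": 0.04, "d2": -0.05, "d3": 0.03, "d4": -0.03, "d5": 0.001}

lexicon = build_lexicon(news, returns)
label_a = classify_day(["Profits surge on new contract"], lexicon)
label_b = classify_day(["Company holds annual meeting"], lexicon)
```

With this toy data, only words that recur on at least two significant days enter the lexicon, so words seen once (such as "guidance") and words from quiet days (such as "meeting") are excluded. An industry-specific variant would build one such lexicon per industry from the news and price history of the companies in it.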