Topic Modeling and Sentiment Analysis for Predicting the Direction of Movement of the S&P 500 Index Based on Financial News

Author: M. Azuero
Master's thesis: MT2404 (June 2024)
Supervised by: Assoz. Univ.-Prof. Mag. Dr. Christoph Schütz
Advised by: Simon Staudinger, MSc
Carried out at: Universität Linz, Institut für Wirtschaftsinformatik - Data & Knowledge Engineering
Resources: Copy

Abstract (English)

Predicting stock prices and determining the optimal moments to buy or sell stocks are long-standing challenges for investors. Advances in natural language processing (NLP) allow for extracting valuable insights from unstructured text, and various studies have used news articles to predict the stock market, employing techniques such as lexicon-based sentiment analysis and topic modeling through Latent Dirichlet Allocation (LDA). These traditional approaches, however, do not consider the semantic relationships among words. Language models that use text embedding techniques, such as BERT, have gained popularity in the NLP field for their ability to consider the context of words.

This thesis evaluates the use of BERT-based topic modeling and sentiment analysis of financial news in the context of training a classifier to predict the direction of movement of the S&P 500 index. On the one hand, this thesis evaluates BERT-based models that consider semantic relationships among words, specifically FinBERT and BERTopic, in conjunction with various classification algorithms, including Logistic Regression, Support Vector Machine (SVM), and Random Forest, among others. On the other hand, to provide a benchmark, the method is applied with the same classification algorithms using traditional techniques for sentiment analysis and topic modeling that do not consider word context. The benchmark sentiment analysis relies on a lexicon-based approach utilizing the Loughran and McDonald dictionary, while the topic modeling employs Latent Dirichlet Allocation (LDA).
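
To illustrate why a lexicon-based benchmark ignores word context, the following is a minimal sketch of dictionary-driven sentiment scoring. The word sets are a tiny hypothetical stand-in, not the Loughran and McDonald dictionary itself, and the polarity formula (pos - neg) / (pos + neg) is an assumption chosen for illustration; the abstract does not state how the thesis aggregates word counts.

```python
import re

# Hypothetical mini-lexicon for illustration only; the real
# Loughran and McDonald dictionary is far larger.
POSITIVE = {"gain", "gains", "profit", "strong", "improve", "improved"}
NEGATIVE = {"loss", "losses", "decline", "weak", "lawsuit", "default"}


def lexicon_sentiment(text):
    """Score a text in [-1, 1] by counting lexicon words.

    Assumed formula: (pos - neg) / (pos + neg).
    Returns 0.0 when no lexicon word occurs in the text.
    """
    # Lowercase and split into alphabetic tokens; each token is
    # looked up in isolation, so surrounding context is ignored.
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


if __name__ == "__main__":
    for headline in ("Strong profit gains", "Weak quarter, heavy losses"):
        print(headline, "->", lexicon_sentiment(headline))
```

Because each token is scored independently, a headline such as "no decline expected" would still count "decline" as negative, which is the kind of limitation that context-aware models such as FinBERT are meant to address.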

The comparison between the BERT-based method and the selected benchmark involves evaluating the accuracy, precision, sensitivity, and other classification metrics. Furthermore, the research explores the influence of several factors on prediction outcomes, including the size and frequency of training the topic model and the impact of utilizing only the headline versus the full article. The results indicate that BERT-based methods marginally outperform traditional approaches in predicting stock price direction. However, it has become apparent that relying solely on sentiment information and topic models derived from financial news may not suffice for accurately forecasting the S&P 500 index's direction.
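
The classification metrics named above can be sketched for a binary up/down prediction task as follows. This is a minimal illustration, assuming that label 1 means "index moves up" and 0 means "index moves down"; the function name `direction_metrics` and the label encoding are hypothetical and not taken from the thesis.

```python
def direction_metrics(y_true, y_pred):
    """Compute accuracy, precision, and sensitivity (recall).

    Assumes binary labels where 1 = up (positive class), 0 = down.
    Ratios with a zero denominator are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted up, was up
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted down, was down
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted up, was down
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted down, was up

    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels purely for demonstration, not thesis data.
    actual = [1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 1, 0]
    print(direction_metrics(actual, predicted))
```

Reporting precision and sensitivity alongside accuracy matters here because a classifier that always predicts "up" can reach a seemingly reasonable accuracy on an index that rises on most days.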