Master's Theses in Data Science

Theses in Data Science are assigned twice a year by the Examination Board in a central process. Outside of this process, we can only assign topics in rare exceptional cases.

We supervise external theses only in exceptional cases where the task fits well with the research topics of the professorship. Please contact Prof. Schenkel directly if you have a suggestion for a Master's thesis topic that you would like to work on outside the university.

Examples of recently completed Master's theses

[BT] Comparison of Methods for Extracting, Retrieving and Ranking Coherent Phrases in Long Posts from Debate Portals

- no abstract available -

[BT] AQUAPLANE: Argument Quality Explainer

- no abstract available -

[MT] Development of a search engine for images

- no abstract available -

[MT] Using Latent Dirichlet Allocation to analyze topical changes in computer science conferences

In today's digital world, there are many large-scale text databases. It is therefore important to have ways of retrieving the data a user is interested in. One such way is offered by topic modeling algorithms, which automatically generate topics over a dataset and can then represent documents as mixtures of these topics. This way, a user can filter publications based on the predominant topics they contain. However, this approach works at the document level. The leading question of this thesis is how these models can be used to yield topical compositions of collections consisting of different documents. Specifically, we look at scientific conferences in the field of computer science. This thesis presents a way to model those conferences as topic vectors. We then evaluate whether these topic vectors are similar when the corresponding conferences belong to the same subfield of computer science. To do so, we use clustering techniques to find groups of similar conferences based on our topic modeling and compare the obtained clustering with a gold-standard dataset that groups conferences into subfields. This comparison is done using the Rand index. Our results show a strong similarity between the gold-standard clustering and the one obtained by our approach.
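The two central steps of this abstract, turning a collection of documents into one topic vector and comparing a clustering against a reference grouping, can be sketched as follows. This is a minimal illustration and not the thesis implementation: the function names are hypothetical, and averaging the per-document topic mixtures is only one plausible way to build a conference vector (the abstract does not state which aggregation was used). The Rand index itself is the standard pairwise-agreement measure.

```python
from itertools import combinations


def collection_topic_vector(doc_topic_vectors):
    """Average per-document topic mixtures into one vector for a collection.

    Assumption: simple averaging, chosen for illustration only.
    """
    n_docs = len(doc_topic_vectors)
    n_topics = len(doc_topic_vectors[0])
    return [
        sum(vec[k] for vec in doc_topic_vectors) / n_docs
        for k in range(n_topics)
    ]


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two items
    in the same cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy usage: two documents with two topics each, and two clusterings
# of four conferences that differ only in how the clusters are named.
print(collection_topic_vector([[0.5, 0.5], [1.0, 0.0]]))
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the Rand index only looks at pairs, it is unaffected by how the clusters are numbered, which makes it suitable for comparing an unsupervised clustering with a labelled reference.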

[MT] Predicting DJIA stock market movements using news headlines

Stock movement prediction is a challenging task due to the characteristics of the stock market. However, it is a field where people can gain high returns with patience and a basic understanding of the stock market. Many previous studies have predicted stock price movements using statistical techniques such as ARMA and ARIMA. In the era of social media, recent research on stock price movements has mostly focused on tweets, financial news, and company earnings calls. In this study, we concentrate on stock market prediction using news headlines. We construct models to predict the movement of the Dow Jones Industrial Average index using a single day’s top 25 news headlines. Our target variable is binary, taking the values 0 and 1, and is derived from the Dow Jones Industrial Average Adjusted Close Price. If the Adjusted Close Price increases or remains the same, we label it 1; otherwise, we label it 0. In our baseline model, we first concatenate all 25 news headlines into a single text. Then we preprocess the news text by performing steps such as removing punctuation, lemmatizing, and identifying named entities. After that, we apply conventional vectorizers such as CountVectorizer and TF-IDF vectorizer to extract numerical values from the text. We then apply standard algorithms to the training and testing data. In our next model, instead of CountVectorizer and TF-IDF, we employ word embedding models such as GloVe and Word2Vec. Our third model uses a state-of-the-art BERT embedding layer instead of GloVe and Word2Vec. In our final model, we adopt a novel approach that combines the BERT embedding layer with various text stylistic features and sentiment scores such as positivity, negativity, and compound to predict stock price movements. In this thesis, we achieved an accuracy of approximately 59% in predicting stock price movements.
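The labelling rule and the first preprocessing step described above can be sketched in a few lines. This is an illustrative sketch, not the thesis code: the function names are hypothetical, and it assumes that "increases or remains the same" refers to the comparison with the previous trading day's Adjusted Close Price, which the abstract does not state explicitly.

```python
import string


def movement_labels(adj_close):
    """Label each day 1 if the adjusted close did not fall, else 0.

    Assumption: each day is compared with the previous trading day,
    so the first day in the series receives no label.
    """
    return [
        1 if today >= prev else 0
        for prev, today in zip(adj_close, adj_close[1:])
    ]


def join_headlines(headlines):
    """Concatenate one day's headlines and strip ASCII punctuation."""
    text = " ".join(headlines)
    return text.translate(str.maketrans("", "", string.punctuation))


# Toy usage with made-up prices and headlines.
print(movement_labels([100.0, 101.0, 101.0, 99.5]))
print(join_headlines(["Stocks rally!", "Fed holds rates."]))
```

The resulting text per day would then be passed to a vectorizer or an embedding model, and the labels serve as the binary target.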

[MT] Sales Forecasting for Company Turnover: A Comparative Analysis of Existing Methods and Development of an Explainable, Corrective, Adaptive, Perfective, and Preventive Model using Time Series and Machine Learning Methods

Integrating statistical, machine learning (ML), and deep learning (DL) techniques is quickly gaining popularity in different business sectors and in inventory management, marketing, and financial planning, so that businesses can gain a competitive edge in the market by effectively directing their resources and identifying the opportunities and challenges in their processes. This study examines the application of various statistical, ML, and DL models to predict both a company’s monthly turnover and individual project turnovers. Using all years of non-truncated data, models such as SARIMAX, Prophet, SimpleFeedForward, and DeepAR were trained, evaluated, and backtested. Results showcased SARIMAX’s higher predictive accuracy, with the SimpleFeedForward model training. For project-level forecasting, the data was transformed into lagged datasets and aggregated with unique project features. Experiments with models such as Decision Tree, Random Forest, Gradient Boosting Regressor, and XGBoost showed a clear effect of the number of lags. While initial trials with 5 lags were underwhelming, extending to 10 and 15 lags progressively improved performance, culminating in an average MAPE of approximately 5% at 20 lags. Furthermore, the adaptive, perfective, and explainable aspects of the developed forecasting tool make it simple for anyone to replicate the results or to repeat the process with a different dataset.
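Two notions from this abstract, the lagged dataset and the MAPE metric, can be illustrated with a short sketch. The function names and the toy numbers are hypothetical and not taken from the thesis; the sketch only shows the general idea of using the previous values of a series as features for the next value, and the standard definition of the mean absolute percentage error.

```python
def make_lagged(series, n_lags):
    """Turn a series into (features, target) rows.

    Each row uses the n_lags previous values as features and the
    current value as the target.
    """
    return [
        (series[t - n_lags:t], series[t])
        for t in range(n_lags, len(series))
    ]


def mape(actual, predicted):
    """Mean absolute percentage error in percent.

    Assumes that no actual value is zero.
    """
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


# Toy usage with made-up turnover values.
print(make_lagged([1, 2, 3, 4, 5], n_lags=2))
print(mape([100, 200], [110, 190]))
```

Increasing `n_lags` gives each row more history to learn from but leaves fewer rows for training, which is the trade-off behind comparing 5, 10, 15, and 20 lags.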

[MT] Estimation of structural health on foundations of offshore wind farms using machine learning techniques

This thesis was conducted in collaboration with Ramboll, a renowned global firm specializing in architecture, engineering, and consultancy services. The Department of Energy in Hamburg has actively engaged in various projects related to offshore wind energy. Aligned with these endeavors, the present research topic emerged to address the ongoing need for estimating structural health by predicting fatigue using Supervisory Control and Data Acquisition (SCADA) data. The primary focus is to leverage data collected from structures equipped with both SCADA sensors and strain gauges and to employ models that estimate fatigue on other structures without strain gauges.

The subsequent chapters delve into in-depth discussions on data, preprocessing, feature selection, and machine learning, shedding light on their operational mechanisms. The rationale behind utilizing specific machine learning models, such as Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Extreme Gradient Boosting (XGBoost), and AutoRegressive Integrated Moving Average (ARIMA), is explored. The evaluation of these models provides an assessment of their performance, efficiency, and accuracy, offering an understanding of why certain models are better suited for fatigue prediction.