Master's Theses in Data Science

Theses in Data Science are assigned twice a year by the Examination Board in a central process. Outside of this process, we can only assign topics in rare exceptional cases.

We only supervise external theses in exceptional cases if the task fits in well with the research topics of the professorship. Please ask Prof. Schenkel specifically if you have a suggestion for a Master's thesis topic that you would like to work on outside the university.

Examples of recently completed Master's theses

[BT] Development of a benchmark system for RDF and Web API alignment systems

- no abstract available -

[BT] The Use of Linguistic Cues for Assigning Statements to Political Parties

Politics and Linguistics have an inextricable affinity. A wide array of evidence suggests that latent ideological nuances are ingrained within the language of political discourse. Over the last decade, uncovering and leveraging patterns in language data has become one of the most outstanding achievements of modern Data Science, which raises some noteworthy questions regarding its prospects within the political landscape.

This paper will examine how the relationship between Politics and Linguistics can be approached in Data Science. I will explore the abilities and limitations of contemporary concepts and state-of-the-art instruments in Natural Language Processing, Machine Learning, and Information Retrieval to address questions inspired by political linguistics, and, more specifically, to classify political claims in terms of their ideology with the help of political party programs in the context of an election process. The connections between Linguistics, Ideology and Data Science are interesting in their own right, but may also be of paramount importance for practical applications. Leveraging political linguistics could have profound implications for research on political behavior, and enable a more accessible way of understanding political agendas, revealing antagonistic lexical structures that arise from a set of political parties competing for attention and support in the context of an election.

[MT] Implementation of a load forecast for the heat output of the next five days using various machine learning algorithms

- no abstract available -

[MT] Fair ranking of search results

- no abstract available -

[MT] Explainable AI for Environmental Gas Sensors

The release of harmful pollutants such as ozone and nitrogen dioxide into the atmosphere is a serious concern today. Such gases harm the health of humans and other species and damage the environment. It has therefore become necessary to find ways to monitor the air quality around us. With technological advancement, chemical gas sensors equipped with machine learning or deep learning algorithms can be employed to detect these gases and their concentrations. However, the use of such algorithms brings the challenge of understanding, in human terms, why they made certain predictions. This thesis aims to address this concern by adopting different explainable artificial intelligence (XAI) approaches for gas sensors that help explain the reasons behind the models' predictions, improving the understanding not only of the models but also of sensor behaviour.

[MT] A generic translation concept for the SchenQL query language

- no abstract available -

[MT] Comparison of Extrapolation Models for Hard-Drive Failures

Hard disk drive failures are rare events; nonetheless, such failures, especially in modern data storage centres, can result in catastrophic data loss and large monetary costs. To tackle this problem, companies rely on SMART (Self-Monitoring, Analysis and Reporting Technology) attributes, which monitor the state of drives and report upcoming failures. This thesis uses the Backblaze public dataset and aims at forecasting hard drive failures that will happen in 1, 10, 20 and 30 days. SMART features were modeled as lag windows to include past values, with one of three window sizes: current day, last five days or last ten days. For forecasting, the machine learning algorithms Random Forest, Linear Support Vector Machine and Multi-layer Perceptron were used. These models were also compared and evaluated using raw, normalized and standardized SMART features in order to observe their forecasting abilities. Training and testing of the models were carried out in Azure Machine Learning, using various Jupyter Notebooks and blob storage for storing the data. The results showed that the models forecast failures further in the future better than failures in the near future. On the other hand, including past feature values by creating lag windows had no significant impact on forecasting performance. The best results were obtained by the Linear Support Vector Machine when looking 30 days into the future with a lag window including SMART features of the current day only, with an F1-score of 51%. The other two models, namely Random Forest and Multi-layer Perceptron, also reported performance increases when looking 10 and 20 days into the future. All models performed worst when looking 1 day into the future. Including past SMART feature values had no visible positive impact on the overall performance of the models.
Moreover, with data normalization and standardization, the Linear Support Vector Machine showed only a slight increase in performance, whereas Random Forest showed no visible increase at all. For the Multi-layer Perceptron, training on raw SMART features resulted in an F1-score of 0%, while using standardized data improved forecasting performance. Overall, the Linear Support Vector Machine reported the best hard drive failure forecasting results in comparison to Random Forest and Multi-layer Perceptron and is therefore considered the best forecasting model in this thesis.
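
The lag-window features and the F1-score mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the function names, the padding of early days with the first observation, and the reading of "failure in N days" as "failure within the next N days" are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def make_lag_features(daily_values: Sequence[float], window: int) -> List[List[float]]:
    """Build one feature row per day: [x_t, x_(t-1), ..., x_(t-window+1)].

    Assumption: days without enough history repeat the first observation.
    window=1 corresponds to "current day only".
    """
    rows = []
    for day in range(len(daily_values)):
        row = []
        for lag in range(window):
            idx = max(day - lag, 0)  # clamp to the first recorded day
            row.append(daily_values[idx])
        rows.append(row)
    return rows


def make_labels(failure_day: Optional[int], n_days: int, horizon: int) -> List[int]:
    """Label day t with 1 if the drive fails within `horizon` days of t (assumed reading)."""
    labels = []
    for day in range(n_days):
        fails_soon = failure_day is not None and 0 <= failure_day - day <= horizon
        labels.append(1 if fails_soon else 0)
    return labels


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical values of a single SMART attribute over four days.
smart_values = [100, 98, 95, 90]
features = make_lag_features(smart_values, window=3)
labels = make_labels(failure_day=3, n_days=4, horizon=1)
```

Each row of `features` could then be passed to a classifier together with the matching entry of `labels`; the F1-score is the metric the abstract uses to compare the models.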

[MT] Reconstruction of Argumentation Graphs

Argumentation can be understood as the activity of using arguments to convince people of a point of view, or to agree or disagree with them about it. In our daily lives, argumentation is one of the most common uses of natural language. For example, social media users respond to controversial topics with their stances and opinions. The collection and analysis of user ideas are critical to studying social phenomena and trends. However, it is hard to analyze all collected arguments, since processing such enormous amounts of data requires much time and human effort, which is undesirable. More efficient methods are therefore required. A possible solution is research in computational argumentation, because computers can handle large amounts of data efficiently. Besides the analysis of social phenomena, other areas such as business and linguistics also benefit from studying argumentation.

Computational argumentation is a growing research field that yields many new methods. This work is inspired by a study on transforming natural language texts into argument graphs. In this thesis, we build on previous studies and explore each step in depth, including classifying major claims, inferring relations between statements, and constructing argument graphs, and we investigate approaches for improvement. We propose a new method for major claim classification, that is, finding the statement that describes the core idea of the discussion, and obtain a substantial improvement. Moreover, we introduce state-of-the-art methods to estimate the relations between arguments. We suggest six methods for the step of argument graph construction, which also give satisfactory results. Our research has some limitations; we discuss them and explore possible improvements for achieving better results in future studies.
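
The last pipeline step, constructing an argument graph from a major claim and scored relations between statements, can be sketched as follows. This is a hypothetical greedy strategy for illustration only, not one of the six methods from the thesis; the function name, the score format, and the relation labels "support" and "attack" are assumptions.

```python
from typing import Dict, List, Tuple

# (source statement, target statement) -> (relation label, confidence score)
RelationScores = Dict[Tuple[str, str], Tuple[str, float]]


def build_argument_tree(
    major_claim: str,
    statements: List[str],
    scores: RelationScores,
) -> Dict[str, Tuple[str, str]]:
    """Greedily attach statements to a tree rooted at the major claim.

    In each round, the highest-scoring link from an unattached statement to
    an already attached one is added, so the result contains no cycles.
    Returns a mapping: statement -> (parent statement, relation label).
    """
    attached = {major_claim}
    remaining = [s for s in statements if s != major_claim]
    edges: Dict[str, Tuple[str, str]] = {}
    while remaining:
        best = None
        for src in remaining:
            for tgt in attached:
                if (src, tgt) in scores:
                    relation, score = scores[(src, tgt)]
                    if best is None or score > best[3]:
                        best = (src, tgt, relation, score)
        if best is None:
            break  # no scored link into the tree; the rest stays unattached
        src, tgt, relation, _ = best
        edges[src] = (tgt, relation)
        attached.add(src)
        remaining.remove(src)
    return edges


# Hypothetical statements and relation scores.
example_scores: RelationScores = {
    ("A", "M"): ("support", 0.9),
    ("B", "M"): ("attack", 0.4),
    ("B", "A"): ("attack", 0.7),
    ("A", "B"): ("support", 0.95),
}
tree = build_argument_tree("M", ["M", "A", "B"], example_scores)
```

Restricting targets to already attached statements trades some flexibility for a guaranteed tree shape: the link from "A" to "B" has the highest score in the example but is never used, because "B" is not yet part of the tree when "A" is attached.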

[MT] Fine-tuning a Transformer model for Multilingual document semantic similarity

An information retrieval system’s purpose is to return results that are relevant to the user’s query. In some instances, information relevant to the user’s request may not exist in the user’s native language. It is also possible that the user can read papers in languages other than his or her native tongue but has trouble formulating queries in them. The primary goal of Multilingual Information Extraction is to locate the most relevant information accessible, regardless of the query language.

Artificial intelligence (AI) has become an increasingly popular research field in recent years. Similarly, Natural Language Processing (NLP) has become an important point of discussion. Neural networks do exceptionally well in this field. The speed and performance of neural networks dealing with diverse NLP tasks have been greatly enhanced by a variety of effective learning methods and technologies.

The recent advances in NLP transfer learning have resulted in powerful models, mostly from tech giants such as Google, Facebook and Microsoft, which perform well on NLP tasks in the general domain. In this thesis, we fine-tune multilingual transformer models for the domain of engineering data in both English and German. Hence, we need a language-independent model, that is, one that can learn its parameters (weights and biases) without depending on the features of any specific language. First, we describe how multilingual transfer is implemented, with a focus on state-of-the-art transformer models. Then, in the methodology part, we leverage our English-German engineering domain data to fine-tune multilingual transformer models.

[MT] Towards (semi) automated literature-based complete transformer-based MCQ generation model for data base related field deployment

- no abstract available -