Theses in Data Science

Theses in Data Science are assigned twice a year by the examination board (Prüfungsausschuss) in a central procedure. Outside of this process, we can assign topics only in rare exceptional cases.

We supervise external theses only in exceptional cases, provided the task fits well with the research topics of the professorship. If you have a proposal for a Master's thesis topic that you would like to work on outside the university, please ask Prof. Schenkel directly.

Examples of Master's theses in Data Science completed in our research group

[BA] Development of a Benchmark System for RDF and Web API Alignment Systems

Retrieving and integrating data from a Web API is an essential process for maintaining knowledge bases in the form of RDF databases. Before the data can be integrated, a mapping between the data in the RDF database and the responses of a Web API must first be established, a so-called alignment. The automated generation of such alignments is handled by alignment systems. Developing these systems is time-consuming and requires constantly creating generated alignments and comparing them against an ideal alignment, the gold standard. Creating this gold standard is a laborious process that is carried out only by experts and usually manually. To support the developers of such systems, this thesis presents the new component Goldstandard-Builder for the benchmark system ETARA. It automates individual steps of the process of creating a gold standard and thereby reduces the time required. In addition, the ETARA system was extended with a user interface to simplify access to the system.

[BA] The Use of Linguistic Cues for Assigning Statements to Political Parties

Politics and Linguistics have an inextricable affinity. A wide array of evidence suggests that latent ideological nuances are ingrained within the language of political discourse. Over the last decade, uncovering and leveraging patterns in language data has become one of the most outstanding achievements of modern Data Science, which raises some noteworthy questions regarding its prospects within the political landscape.

This paper will examine how the relationship between Politics and Linguistics can be approached in Data Science. I will explore the abilities and limitations of contemporary concepts and state-of-the-art instruments in Natural Language Processing, Machine Learning, and Information Retrieval to address questions inspired by political linguistics, and, more specifically, to classify political claims in terms of their ideology with the help of political party programs in the context of an election process. The connections between Linguistics, Ideology and Data Science are interesting in their own right, but may also be of paramount importance for practical applications. Leveraging political linguistics could have profound implications for research on political behavior, and enable a more accessible way of understanding political agendas, revealing antagonistic lexical structures that arise from a set of political parties competing for attention and support in the context of an election.

[MA] Implementation of a Load Forecast for [the] Thermal Output of the Next Five Days Using Various Machine Learning Algorithms

- no abstract available -

[MA] Fair Ranking of Search Results

Rankings and ranking systems are now ubiquitous and are used in every conceivable situation. After decades in which ranking algorithms considered the utility of documents as the sole factor, more and more recent approaches take other factors into account in order to produce fairer and ultimately potentially better rankings. This thesis presents two fairness-aware ranking algorithms for publications from dblp that consider the gender and origin of the authors in order to produce rankings that are as useful and fair as possible with respect to these factors. The first algorithm is an implementation of an existing algorithm from the TREC 2020 Fairness Ranking Track; the second is a further development based on it that improves its results over several iterations.

[MA] Explainable AI for Environmental Gas Sensors

The release of harmful pollutants such as ozone and nitrogen dioxide into the atmosphere is a serious concern today. Such gases are harmful to the health of humans and other species and damage the environment. It has therefore become necessary to find ways to monitor the air quality around us. With technological advancement, chemical gas sensors equipped with machine learning or deep learning algorithms can be employed to detect these gases and their concentrations. However, the use of such algorithms brings the challenge of understanding, in human terms, why they made certain predictions. This thesis addresses this concern by adopting different explainable artificial intelligence (XAI) approaches for gas sensors that help explain the reasons behind the models' predictions, which improves understanding not only of the models but also of sensor behaviour.

[MA] Generic Translation Concept for the SchenQL Query Language

- no abstract available -

[MA] Comparison of Extrapolation Models for Hard-Drive Failures

Hard disk drive failures are rare events; nonetheless, such failures, especially in modern data storage centres, can result in catastrophic data loss and large monetary costs. To tackle such problems, companies rely on SMART (Self-Monitoring, Analysis and Reporting Technology) attributes, which monitor the state of drives and report upcoming failures. This thesis uses the Backblaze public dataset and aims at forecasting hard drive failures that would happen in 1, 10, 20 and 30 days. SMART features were also modeled as lag windows to include past values, using one of three window sizes: current day, last five days or last ten days. For forecasting, the machine learning algorithms Random Forest, Linear Support Vector Machine and Multi-layer Perceptron were used. These models were also compared and evaluated using raw, normalized and standardized SMART features in order to observe their forecasting abilities. The models were trained and tested in Azure Machine Learning, using various Jupyter Notebooks and blob storage for storing the data. The results showed that the models forecast failures further in the future better than failures in the near future. On the other hand, including past feature values by creating lag windows had no significant impact on forecasting performance. The best results were obtained by Linear Support Vector Machine when looking 30 days into the future with a lag window including SMART features of the current day only, with an F1-score of 51%. The other two models, namely Random Forest and Multi-layer Perceptron, also reported performance increases when looking 10 and 20 days into the future. All models performed worst when looking 1 day into the future. Including past SMART feature values had no visible positive impact on the overall performance of the models.
Moreover, with data normalization and standardization, Linear Support Vector Machine reported only a slight increase in performance, whereas Random Forest showed no visible increase at all. For Multi-layer Perceptron, training on raw SMART features resulted in an F1-score of 0%, while using standardized data improved forecasting performance. Overall, Linear Support Vector Machine reported the best hard drive failure forecasting results in comparison to Random Forest and Multi-layer Perceptron and is therefore considered the best forecasting model in this thesis.
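
The lag-window and forecast-horizon setup described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `build_lag_samples`, the single-attribute time series, and the labeling rule (a sample is positive if the drive fails within the next `horizon` days) are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

# One training sample: (lagged feature values, failure label).
Sample = Tuple[List[float], int]


def build_lag_samples(
    values: List[float],
    failure_day: Optional[int],
    window: int,
    horizon: int,
) -> List[Sample]:
    """Turn one drive's daily readings of a single attribute into samples.

    values      -- one reading per day, index 0 is the first observed day
    failure_day -- day index on which the drive failed, or None if it never did
    window      -- number of days in the lag window (1 = current day only)
    horizon     -- how many days ahead a failure should be forecast

    Labeling rule (an assumption of this sketch): a sample for day d is
    positive if the failure happens after d and at most `horizon` days later.
    """
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be at least 1")

    samples: List[Sample] = []
    # The first full window ends on day index window - 1.
    for day in range(window - 1, len(values)):
        features = values[day - window + 1 : day + 1]
        fails_soon = failure_day is not None and day < failure_day <= day + horizon
        samples.append((features, int(fails_soon)))
    return samples


# Hypothetical readings for days 0..5; the drive fails on day 6.
readings = [0.0, 0.0, 1.0, 3.0, 8.0, 20.0]
for features, label in build_lag_samples(readings, failure_day=6, window=3, horizon=2):
    print(features, label)
```

With `window=1` each sample contains only the current day's value, which corresponds to the "current day" setting mentioned in the abstract; larger windows simply append earlier days to the feature vector.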

[MA] Reconstruction of Argumentation Graphs

Argumentation can be understood as the activity of using arguments to convince people of a point of view, or to agree or disagree with them about it. In our daily lives, argumentation is one of the most common uses of natural language. For example, social media users respond to controversial topics with their stances and opinions. Collecting and analyzing user ideas is critical to studying social phenomena and trends. However, it is hard to analyze all collected arguments, since processing enormous amounts of data requires much time and human effort, which is undesirable. This calls for more efficient methods. Research in computational argumentation is a possible solution, because computers can handle large amounts of data efficiently. Besides the analysis of social phenomena, other areas such as business and linguistics also benefit from studying argumentation.

Computational argumentation is a growing research field that yields many new methods. This work is inspired by a study that investigated transforming natural language texts into argument graphs. In this thesis, we build on the previous studies and explore each step in depth, including classifying major claims, inferring relations between statements, and constructing argument graphs, and we investigate approaches for improvement. We propose a new method for major claim classification, that is, for finding the statement that describes the core idea of the discussion, and obtain a substantial improvement. Moreover, we introduce state-of-the-art methods to estimate the relations between arguments. We suggest six methods for the argument graph construction step, which also give satisfactory results. Our research has some limitations. We discuss them and explore possible further improvements for achieving better results in future studies.

[MA] Fine-tuning a Transformer model for Multilingual document semantic similarity

An information retrieval system's purpose is to return results that are relevant to the user's query. In some instances, information relevant to the user's request may not exist in the user's native language. It is also possible that users can read papers in languages other than their native language but have trouble forming queries in them. The primary goal of Multilingual Information Extraction is to locate the most relevant information accessible, regardless of the query language.

Artificial intelligence (AI) has become an increasingly popular research field in recent years. Similarly, Natural Language Processing (NLP) has become an important point of discussion. Neural networks do exceptionally well in this field. The speed and performance of neural networks on diverse NLP tasks have been greatly enhanced by a variety of effective learning methods and technologies.

Recent advances in NLP transfer learning have resulted in powerful models, mostly from tech giants such as Google, Facebook and Microsoft, which perform well on NLP tasks in the general domain. In this thesis, we fine-tune multilingual transformer models for the domain of engineering data in both English and German. Hence, we need a language-independent model, one that is able to learn its parameters (weights and biases) for any language-specific features. First, we describe how multilingual transfer is implemented, with a focus on state-of-the-art transformer models. Then, in the methodology part, we leverage our English-German engineering domain data to fine-tune multilingual transformer models.

[MA] Towards (semi) automated literature-based complete transformer-based MCQ generation model for data base related field deployment

- no abstract available -